Ask DT: Postgres or MongoDB for Data Science?
5 points by eisforinnovate 3556 days ago | 6 comments
Explain if possible


4 points by jcbozonier 3556 days ago | link

I analyze hundreds of GB of JSON log files from a custom analytics solution, and I've found Redshift (essentially Postgres) wonderful. I've reified the consistent portions of the schema into columns, and the one miscellaneous data element on my objects goes into a large varchar column that I query with Redshift's JSON functions. It wouldn't work if I had to do intense JSON queries, but I don't find myself needing to.
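
A rough sketch of that pattern in Python (table and field names here are invented; JSON_EXTRACT_PATH_TEXT is a real Redshift function, and Redshift speaks the Postgres wire protocol, so psycopg2 works):

    import psycopg2

    # Redshift is wire-compatible with Postgres, so psycopg2 connects fine.
    # Host and credentials below are placeholders.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="analyst", password="...",
    )

    # Consistent fields live in real columns; everything else sits in a
    # large varchar ("payload") that the JSON functions can reach into.
    with conn.cursor() as cur:
        cur.execute("""
            SELECT event_type,
                   JSON_EXTRACT_PATH_TEXT(payload, 'campaign') AS campaign,
                   COUNT(*) AS events
            FROM event_log
            GROUP BY 1, 2
            ORDER BY events DESC
        """)
        for row in cur.fetchall():
            print(row)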

SQL is a very rich and mature query language as well. I can't imagine using MongoDB in anything like the capacity I use Redshift.

EDIT: If you're currently using MongoDB and it works, that's great. It just wouldn't be the direction I'd personally head in if I were starting from scratch.

-----

3 points by binalpatel 3554 days ago | link

I have to second Redshift; it's been very easy to use, and, just as importantly, it's dirt cheap to start out with. I used it to create a customer data warehouse for my current company: a mix of traditional flat data and JSON data that's either stored in large varchar columns or parsed into columns.
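
If it's useful to anyone, the "parsed into columns" part can happen at load time: Redshift's COPY can take a JSONPaths file that maps JSON fields onto table columns. A sketch, with placeholder bucket names and IAM role (the COPY ... JSON syntax itself is real Redshift):

    import psycopg2

    # Placeholder Redshift connection details.
    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="dw", user="etl", password="...")

    # The JSONPaths file maps each JSON field to a table column, so
    # semi-structured logs land as ordinary columns at load time.
    with conn.cursor() as cur:
        cur.execute("""
            COPY customer_events
            FROM 's3://my-bucket/events/'
            CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
            JSON 's3://my-bucket/jsonpaths/events.jsonpaths'
            GZIP
        """)
    conn.commit()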

As for the data science part of it: it's allowed me to link together and analyze gigabytes of customer data. Easy plug-and-play Tableau integration was also a big selling point; it let me give access to all the data to anyone who wanted it, without being bogged down by daily requests.

-----

2 points by adamlaiacano 3555 days ago | link

Neither. Or both. Or, depending on what you're doing, one will prove to be obviously better than the other.

-----

2 points by kisamoto 3556 days ago | link

I suppose it really depends on what you're looking for.

I'm actually using a combination of MongoDB and PostgreSQL for my data storage and analytics. Data is uploaded and instantly stored in MongoDB. Because it's schemaless, it's really easy for me to add a new attribute or dimension to my data without worrying about the API failing because a column isn't present in PostgreSQL.
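
A minimal pymongo sketch of what I mean (collection and field names are invented):

    from pymongo import MongoClient

    events = MongoClient("mongodb://localhost:27017").analytics.raw_events

    # Early documents only had these fields...
    events.insert_one({"sensor_id": 42, "temp_c": 19.3})

    # ...and when a new dimension shows up upstream, it just gets stored.
    # No ALTER TABLE, no API failures over a missing column.
    events.insert_one({"sensor_id": 42, "temp_c": 19.1, "humidity": 0.54})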

MongoDB has reasonable horizontal scalability as your data grows and you want to store as much as possible, but it does have its limitations (particularly in the geo field, where I perform a lot of my analytics, but also in the date/time area[1]).

The next step is data extraction and preparation for data science, and this is where PostgreSQL comes in. I read my data in chunks from MongoDB into PostgreSQL schemas and perform complex geo queries and analysis on them, often storing or adding the results back into a new Mongo collection.
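
Roughly, the pipeline looks like this in Python (collection, table, and query details are stand-ins, and the geo query assumes the PostGIS extension is installed):

    import psycopg2
    from pymongo import MongoClient

    mongo = MongoClient("mongodb://localhost:27017").analytics
    pg = psycopg2.connect(dbname="geo", user="etl", password="...")

    INSERT = ("INSERT INTO events (sensor_id, geom) "
              "VALUES (%s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))")

    # 1. Stream documents out of Mongo and load them into Postgres in chunks.
    with pg.cursor() as cur:
        batch = []
        for doc in mongo.raw_events.find().batch_size(1000):
            batch.append((doc["sensor_id"], doc["lon"], doc["lat"]))
            if len(batch) == 1000:
                cur.executemany(INSERT, batch)
                batch = []
        if batch:
            cur.executemany(INSERT, batch)
    pg.commit()

    # 2. Do the heavy geo work in Postgres/PostGIS, then write the results
    #    back to a fresh Mongo collection.
    with pg.cursor() as cur:
        cur.execute("""
            SELECT sensor_id, COUNT(*)
            FROM events
            WHERE ST_DWithin(geom::geography,
                             ST_MakePoint(-0.12, 51.5)::geography,
                             5000)  -- everything within 5 km
            GROUP BY sensor_id
        """)
        summaries = [{"sensor_id": s, "events_within_5km": n} for s, n in cur]

    if summaries:
        mongo.geo_summaries.insert_many(summaries)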

A really useful comparison is a blog post on aggregating NBA data in both datastores[2]. Thanks to PostgreSQL's age and maturity, its powerful query language gives you clear syntax and database-level power.

TL;DR - MongoDB = schemaless, scalable datastore for evolving data. PostgreSQL = powerful analytics through the SQL query language and mature features.

[1] - http://stackoverflow.com/questions/17834596/mongodb-querying...

[2] - http://tapoueh.org/blog/2014/02/17-aggregating-nba-data-Post...

-----

2 points by kisamoto 3555 days ago | link

There's also a good answer on the Data Science Stack Exchange about the use of NoSQL in data science generally:

http://datascience.stackexchange.com/questions/793/uses-of-n...

-----

1 point by usr_bin 3554 days ago | link

Postgres or Vertica CE, depending on the use case. The Postgres column-store extensions get better every day, making Vertica CE less attractive. Redshift is great, but it's still missing some things, like COPY FROM STDIN, which makes I/O difficult and restricts interoperability with Postgres. I don't always love Vertica, and it's too expensive if you're not running CE, but being able to start a project in a small Postgres instance and then migrate seamlessly to Vertica when you need to is very useful sometimes.
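
For anyone who hasn't hit this: with plain Postgres you can stream a local file straight over the wire, which Redshift won't accept (you have to stage through S3 instead). A psycopg2 sketch, with made-up file and table names:

    import psycopg2

    conn = psycopg2.connect(dbname="scratch", user="analyst", password="...")

    # Plain Postgres happily accepts COPY FROM STDIN over the connection;
    # on Redshift this statement is rejected, forcing a detour through S3.
    with conn.cursor() as cur, open("events.csv") as f:
        cur.copy_expert(
            "COPY events (sensor_id, lon, lat) "
            "FROM STDIN WITH (FORMAT csv, HEADER)",
            f,
        )
    conn.commit()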

There is no reason to use Mongo unless your data is JSON or really fits the JSON flexible schema use case. Loading tabular or well-dimensioned data into Mongo is a huge mistake.

-----



