Data Science Wars: R vs Python (datascience.community)
13 points by ryanswanstrom 3242 days ago | 12 comments


5 points by Tomrod 3237 days ago | link

What are the limitations of just using RPy2?
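
(For reference, the sort of thing RPy2 enables - a minimal sketch, assuming rpy2 and R are installed; the vector and the call to R's mean() are just illustrative:)

    import rpy2.robjects as robjects
    from rpy2.robjects.packages import importr

    stats = importr('stats')             # load an R package from Python
    xs = stats.rnorm(100)                # call R's rnorm() through the package
    print(robjects.r['mean'](xs)[0])     # call R's mean(); R returns a length-1 vector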

-----

2 points by gipp 3238 days ago | link

As someone who's never spent much time using R, I'm always curious when the strength of CRAN over PyPI is cited as one of R's main advantages. I don't recall ever wanting to try an approach and not finding something relevant on PyPI (99% of the time some combination of statsmodels, pandas, pymc, and/or sklearn gets it done easily).

Can someone give me some examples of where there are "no module replacements for the 100s of essential R packages"? The idea of Python's massive ecosystem somehow being a negative is strange to me.
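
To make that concrete, here is the kind of one-stop workflow I mean - a quick sketch with made-up data, assuming pandas, statsmodels, and scikit-learn are installed:

    import pandas as pd
    import statsmodels.formula.api as smf
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({"y":  [1.1, 2.0, 2.9, 4.2, 5.1],
                       "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
                       "x2": [0.5, 0.4, 0.6, 0.5, 0.7]})

    # statsmodels gives an R-style formula interface with full inference output
    print(smf.ols("y ~ x1 + x2", data=df).fit().summary())

    # sklearn covers the prediction-oriented side of the same problem
    model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
    print(model.predict(df[["x1", "x2"]]))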

-----

2 points by TheCartographer 3235 days ago | link

I'm not sure about "100s of essential R packages." If there were 100 essential R packages, that would suggest to me that R isn't doing what it's supposed to do and users are writing functionality to work around it. I think there are probably 5, maybe 10 essential R packages (devtools, ggplot2, reshape2, pick-your-poison-ODBC-driver package, a few others).

What R brings to the table, just like any other FOSS ecosystem, is its community. And R's community is academics and other high-level statisticians and researchers. And it's a big community: 6666 packages in the CRAN repository as of today, plus stuff on other repo systems like GitHub and spinoffs like Bioconductor.

The majority of those packages are of limited use to the general user. Their strength lies in specialist implementations of specific algorithms, analytics, tools, etc.

So whether it's a standard analytic technique for a specific field of study, or a cutting edge technique that is just becoming a topic of research in the literature, someone has probably implemented it in R already.

Python will always be more effective for general use, data manipulation, I/O, etc. It's a great Swiss Army knife.

R is a poor Swiss Army knife, but a great scalpel. If you would rather use someone's N-space vector decomposition or cutting-edge classification algorithm than go to the trouble of implementing your own, R is awesome. There's a package that implements Author et al 2010 already.

For more general work, Python is king.

-----

2 points by isms 3241 days ago | link

Why choose?

-----

3 points by SixSigma 3241 days ago | link

Choosing is an option.

Declaring war on those who made different choices is asinine and childish.

-----

2 points by chishaku 3241 days ago | link

Time is potentially a constraint.

-----

2 points by hailekofi 3241 days ago | link

Clearly one doesn't have to, but it does seem like people live in one or the other, even though they happily play in both.

-----

1 point by jrminter 3239 days ago | link

It is not either/or. I use both. Python/Jython is a great scripting language that can be incorporated into a program such as ImageJ or DTSA-II (for processing X-ray microanalysis data) to let a user automate tasks and make them more reproducible. R provides a package for pretty much anything one wants to do, plus Sweave - fast being superseded by knitr - for generating reproducible reports. GNU make works with both... What is not to like? The biggest problem is remembering which name to use for a function, i.e. len() or length()...
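
(For the curious, the mix-up is between Python's built-in and R's function for the same job:)

    xs = [1, 2, 3]
    print(len(xs))   # Python: len() is a built-in
    # in R the same query is length(xs) - hence the constant mix-up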

-----

1 point by TheCartographer 3239 days ago | link

Oh god, are you me? The len() / length() issue is non-trivial for me; for some reason, even now, I cannot remember which goes with which and have to pound the keyboard before I can figure it out.

My own knowledge of R predates my knowledge of Python by a few years, so I've been loath to switch to pure Python for that reason, and because ggplot2 produces such pretty charts and is very easy to use.

Generally though, I think most pre-processing and raw data handling is best done in Python, and I will usually use it to do something like troll through a directory of raw sensor data, strip out the metadata and values, and import them into a PostgreSQL database. Python's syntactic sugar - list and dictionary comprehensions mostly - lets you batch-process raw text tables in a minimum amount of code.
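
Something like this, roughly - a sketch with a hypothetical file layout, table name, and connection settings, assuming psycopg2 is installed:

    import os
    import psycopg2

    conn = psycopg2.connect("dbname=sensors user=postgres")
    cur = conn.cursor()

    for fname in os.listdir("raw_data"):
        with open(os.path.join("raw_data", fname)) as f:
            # one comprehension strips blank lines and '#' metadata headers
            rows = [line.strip().split(",") for line in f
                    if line.strip() and not line.startswith("#")]
        for ts, value in rows:
            cur.execute("INSERT INTO readings (ts, value) VALUES (%s, %s)",
                        (ts, float(value)))

    conn.commit()
    conn.close()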

Using constraints in Postgres is the fastest and easiest way to ensure proper QA/QC of the data. If Postgres starts barfing errors back at you, it's pretty trivial to either adjust your Python code to identify and handle specific problem cases, or to catch the exception and insert the problem row or values into a text table.
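
The catch-and-divert pattern looks roughly like this (hypothetical tables; assumes "readings" carries, say, a CHECK (value >= 0) constraint and "rejects" is a plain text table):

    import psycopg2

    conn = psycopg2.connect("dbname=sensors user=postgres")
    cur = conn.cursor()

    for ts, value in [("2015-01-01 00:00", "3.2"), ("2015-01-01 00:05", "-999")]:
        try:
            cur.execute("INSERT INTO readings (ts, value) VALUES (%s, %s)",
                        (ts, value))
            conn.commit()
        except psycopg2.IntegrityError:
            conn.rollback()   # the failed INSERT aborts the transaction
            cur.execute("INSERT INTO rejects (raw_row) VALUES (%s)",
                        (ts + "," + value,))
            conn.commit()

    conn.close()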

To my mind, though, R is the only way to go for visualization and/or statistical analyses. ggplot2 is just too easy to use and too powerful - I have yet to find anything that can compare, particularly in the quality of charts it produces and the ease of handling multivariate data. Any sort of pre-processing or formatting of data in R is an absolute bear, though - something about using tapply() and sapply() ties my poor brain in knots. I find any sort of complex, functional, or iterative programming in R to be a nightmare.

The other thing I wish R did better is handling and plotting spatial data. To date, I have yet to find a good package for making maps. ggplot2 is fine for simple point maps, but complex polygons are a friggin nightmare.

-----

1 point by lamlink 3239 days ago | link

Have you tried ggmap? What about Bokeh in Python?

-----

1 point by TheCartographer 3239 days ago | link

ggmap I have tried. I don't remember what the exact issue was, but I will update the package and take another look. :-P

Bokeh I haven't heard of, but I will definitely check it out.

Thank you for both the recommendations!

-----

1 point by Lofkin 3239 days ago | link

For exploratory plotting in Python, there is also seaborn... though Bokeh is likely poised to play this role as well in the future.

Sure!
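
A minimal seaborn example, for flavor (uses the small demo dataset bundled with seaborn; assumes matplotlib is installed):

    import seaborn as sns
    import matplotlib.pyplot as plt

    tips = sns.load_dataset("tips")   # demo dataset shipped with seaborn
    sns.lmplot(x="total_bill", y="tip", hue="day", data=tips)
    plt.show()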

-----



