DataTau | The Grammar of Data Science

	The Grammar of Data Science (stitchfix.com)
	23 points by astrobiased 3327 days ago | 10 comments

5 points by ubercode5 3327 days ago | link

As someone who knows python and hasn't taken the jump into R, from my perspective the R syntax in the first example seemed harder to read at a glance.

The lack of correct scaling for the python plots seemed like an annoyance at the default behavior of lmplot.

The dplyr example was impressive.

It's an interesting read, but as far as syntax goes the readability really is a personal preference.

-----

3 points by kiyoto 3326 days ago | link

Disclaimer: I am an active R user who once wrote a lifetime worth of Python in finance.

First, I am with you: R's syntax is idiosyncratic and rather horrible in places. Python is much easier on the eyes and more consistent syntactically. Also, dealing with strings is confusing at best in R and a piece of cake in Python.

That said, R's semantics are pretty powerful, and ggplot2/dplyr (or any R library by Hadley Wickham) takes full advantage of R's expressiveness. The "%>%" operator is surely ugly, but as far as I know (happy to be proven wrong), that kind of operator is not even implementable in Python.

-----

1 point by rlayton 3326 days ago | link

(Never used R) What does %>% do? My best guess would be that it performs the modulo operation, but that doesn't fit with your comment.

-----

4 points by kiyoto 3326 days ago | link

It's the "pipe" operation. So, if you do

data %>% group_by(column)

That's the same as

group_by(data, column)

Essentially, this allows computations to be written with little nesting, like

data %>% group_by(column) %>% summarise(f = length(another_column)) %>% filter(f > 20)

A similar idea in other languages is method-chaining, which is what pandas does to implement something similar.

I personally like "%>%" better than method-chaining, probably because I think more functionally than OOP. But I now feel like I am opening a different can of worms.
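
For readers coming from Python, here is a minimal sketch of how the left-to-right reading of a pipe can be imitated with operator overloading. Everything in it is an assumption made for illustration: the `Pipe` wrapper, the helper functions, and the choice of `>>` are hypothetical and not part of pandas, dplyr, or the linked article. It only imitates the syntax; it does not reproduce how R's operator works, and it breaks if the left-hand object defines its own `>>` that accepts arbitrary operands.

```python
from collections import defaultdict


class Pipe:
    """Hypothetical wrapper: `data >> Pipe(f, *args)` calls f(data, *args)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, data):
        # Python calls this reflected hook when the left operand
        # does not handle `>>` itself (lists and dicts do not).
        return self.func(data, *self.args, **self.kwargs)


def group_by(rows, key):
    """Group a list of dict rows by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def count_rows(groups):
    """Replace each group with its row count."""
    return {name: len(members) for name, members in groups.items()}


def keep_larger_than(counts, threshold):
    """Keep only the groups whose count exceeds `threshold`."""
    return {name: n for name, n in counts.items() if n > threshold}


rows = [{"column": "a"}, {"column": "a"}, {"column": "b"}]

# Piped form: reads left to right, like the R pipeline above.
piped = rows >> Pipe(group_by, "column") >> Pipe(count_rows) >> Pipe(keep_larger_than, 1)

# Nested form: the same computation, read inside out.
nested = keep_larger_than(count_rows(group_by(rows, "column")), 1)

print(piped)   # {'a': 2}
print(nested)  # {'a': 2}
```

Both forms compute the same result; the piped one keeps each step in the order it happens.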

-----

1 point by ubercode5 3326 days ago | link

I am with you there, piping is a very powerful operation and makes more sense from a purely functional perspective.

Method chaining isn't too terrible, but it also means those functions need to be attached to the object, which makes them hard to extend in a reusable way if you aren't the author. Maybe we should petition the python community for piping :).

The even uglier option would be function nesting a(b(c(data))), which feels like reading reverse Polish notation.
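
To make the comparison concrete, here is a small sketch of a `pipe` helper that applies free functions left to right, so nothing has to be attached to the object. The helper and the three step functions are hypothetical names chosen for illustration, not an existing library API.

```python
from functools import reduce


def pipe(data, *funcs):
    """Apply funcs to data left to right: pipe(x, c, b, a) == a(b(c(x)))."""
    return reduce(lambda value, func: func(value), funcs, data)


def drop_missing(values):
    """Remove None entries."""
    return [v for v in values if v is not None]


def sort_ascending(values):
    """Return the values in ascending order."""
    return sorted(values)


def take_first_two(values):
    """Keep the first two values."""
    return values[:2]


data = [3, None, 1, 2]

# Nested calls: the first step to run is the innermost one.
nested = take_first_two(sort_ascending(drop_missing(data)))

# Piped calls: steps are listed in the order they run.
piped = pipe(data, drop_missing, sort_ascending, take_first_two)

print(nested)  # [1, 2]
print(piped)   # [1, 2]
```

pandas offers something in the same spirit with `DataFrame.pipe(func, *args, **kwargs)`, which lets a function you wrote yourself take part in a method chain.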

-----

1 point by tfturing 3326 days ago | link

I use Python whenever I can. But honestly, R is still better when it comes to exploratory data analysis.

Some advice: R is digestible (and sometimes elegant) if you only read it line by line. Just try to look at an entire R file as little as possible or your eyes will melt.

-----

1 point by ubercode5 3326 days ago | link

Haha, good to hear. I definitely want to learn more R, but it's worth tempering my expectations that it's the be-all and end-all of data analysis.

-----

1 point by kiyoto 3326 days ago | link

Interesting. What's your take on pandas regarding EDA?

-----

1 point by tfturing 3324 days ago | link

Pandas is great if you have a background in object-oriented programming! If you don't, it has been known to make R-users who don't have a background in computer science very angry.

-----

1 point by isms 3325 days ago | link

Why choose just one when you can use both?

The %R line magics in IPython/Jupyter notebooks are awesome! You can %Rpush and %Rpull data back and forth, and do whatever you want in between:

http://nbviewer.ipython.org/github/ipython/ipython/blob/3607...

-----
