How to set up a data science environment using Docker and Jupyter (dataquest.io)
20 points by vikp 3070 days ago | 8 comments


2 points by bkd9 3070 days ago | link

Hmm, I just got into virtual environments, but it hasn't been an entirely smooth ride. I tried virtualenv, but then discovered that matplotlib doesn't work in these venvs? Well, that was obviously not going to fly. Using the virtual environments in conda fixes this problem, but conda doesn't support many of the libraries I need. I can use pip for those instead, but I think I'm running into issues from having two package managers. Not to mention that any package with a messy install, like xgboost or vispy, can be even more difficult in a venv. These types of libraries also make it impossible for me to package my code in a distributable way. Even if I list them as requirements, the user would still need to spend potentially hours working through a manual install.

It sounds like Docker may be the tool to fix these problems, but I am worried that I'll run into more poorly documented incompatibilities. Can anyone speak to this? If you have adopted Docker, what has your experience been?

-----

1 point by serkov 3069 days ago | link

There is also Vagrant (vagrantup.com), a kind of Docker-like virtual environment that runs in VirtualBox. You could try checking the Vagrant package at http://datasciencetoolbox.org/ to see how it's built.

-----

1 point by grahama 3070 days ago | link

xgboost is a messy install?

  git clone https://github.com/dmlc/xgboost
  cd xgboost
  bash build.sh
  cd python-package
  pip3 install -e .

And fwiw, if you are having trouble with virtualenvs, I don't think Docker is going to be any easier. Both are annoying and frustrating starting out. Also, reading this post, it seems like there are a lot of not-entirely-true statements that either make containers sound better than they are or reflect a less than thorough understanding of what Docker actually is.

-----

1 point by bkd9 3061 days ago | link

Huh, I just tried this and it worked. This was not my experience in August-- I spent hours installing xgb and one member of my team never got it to work. It looks like xgb has matured a lot since I first tried it, and has become much more accessible, which is great!

I stand by my point that some libraries are more difficult to install than a quick pip. I just got through installing vispy, which requires a backend that was tricky to get working.

-----

1 point by vikp 3069 days ago | link

On the package installation front, the "best case" install is always great, but there are strange platform and other inconsistencies that can cause hard-to-debug issues. Just google "xgboost installation error" to see.

In our experience helping people new to data science get started, package installation is a non-trivial hurdle. Docker has helped reduce the number of error cases substantially. At the very least, it reduces the number of installation issues to debug to 1 -- Docker itself. Docker is also evolving rapidly, and you may have used a previous version.

As for making containers sound better than they are / not having a thorough understanding: this article is targeted at those new to Docker and makes some simplifications. If you want to highlight specific inaccuracies, I would love to discuss them, but otherwise this comes across as FUD.

-----

1 point by grahama 3068 days ago | link

Oh sorry, I can definitely see how it came across like that in retrospect. As for Docker being easier or better than VMs, I think http://stackoverflow.com/a/21314566/4696622 is the best explanation of how people often misunderstand what Docker and Vagrant are (although it may be outdated, even though it's only a year old). I've used both and I generally stick with Vagrant: unless you are already on a Linux machine, you will need a VM anyway, at which point the 'startup time' of the two is pretty much equivalent, and an actual VM is a lot more useful for dev/toy stuff than a mishmash of Docker containers (well, from MY experience of course).

I think both are cool, and I'm sure someone much smarter than me can explain when to use one over the other, but there really are a lot of people who discuss the pros and cons with little in-depth experience of either (not saying that about this post; it's aimed more at HN comments I've seen). I also think knowing how to configure virtualenv/conda environments is important, because that is becoming a really fundamental part of understanding Python (venv, virtualenv, or whatever other options there currently are).
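
For anyone following along, here is a minimal sketch of the venv basics using only the standard library. It assumes Python 3 with the stdlib `venv` module available; `in_virtual_env`, `create_env` and the `demo-env` directory are hypothetical names chosen for illustration. The `sys.prefix` check holds for venv and recent virtualenv releases; older virtualenv versions may behave differently.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_env():
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def create_env(target):
    """Create a bare environment at `target` and return the path to its pyvenv.cfg."""
    # with_pip=False skips bootstrapping pip, which keeps the example fast.
    venv.create(target, with_pip=False)
    return Path(target) / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_env(Path(tmp) / "demo-env")  # hypothetical env name
        print("created:", cfg.exists())
    print("currently inside a virtual env:", in_virtual_env())
```

In day-to-day use you would normally run `python3 -m venv <dir>` from the shell and activate it; the function form above is only there to show what that step produces.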

-----

1 point by vikp 3070 days ago | link

Docker does have inconsistencies and issues, but those inconsistencies usually affect the person who builds the image vs the person who runs it. Someone who has to run a docker image will generally have a much easier time than someone who gets a requirements.txt file and has to install everything in it.
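
To make the requirements.txt side of that comparison concrete, here is a rough sketch that only reports which listed packages are absent from the current environment; it installs nothing. It assumes Python 3.8+ (for `importlib.metadata`); `missing_packages` is a hypothetical helper, and the package list is simply the libraries mentioned in this thread.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises PackageNotFoundError if absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical requirements list, using packages named in this thread.
    wanted = ["xgboost", "vispy", "matplotlib"]
    print("not installed:", missing_packages(wanted))
```

Everything this reports as missing is something the recipient of a requirements.txt has to install and possibly debug themselves, whereas an image built by someone else already carries those packages.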

I also haven't had any issues with matplotlib in virtualenvs; it might be platform-specific, though.

-----

2 points by bkd9 3070 days ago | link

Thanks for the reply. This is very helpful. What snags have you found?

Yes, the matplotlib issue is specific to OSX: http://matplotlib.org/1.5.0/faq/virtualenv_faq.html

-----



