DataTau

It's great.

A tutorial that explains the reasoning behind word embeddings and demonstrates how to use these techniques to create clusters of similar words using data from 500,000 Amazon reviews of food.

That's a bogus title for a sentiment classifier (happy, sad, angry, others)
1 point by carlosgg 6 days ago | link | parent | on: The Python Graph Gallery

The R Graph Gallery is on the same site.

But the question is whether it will stay stable for many years, the way R and Python have as their communities have grown.

This is a course I've started teaching. Topics covered include k-means, hierarchical clustering, t-SNE, PCA and NMF. Included are lots of exercises (as Jupyter notebooks) where these techniques can be practised, almost always on real-world data. I hope you like it!

A blog post by my colleague Christoph Schock.

npy is not the winner. It's equal to many others according to this post.

Ha, probably because you can open it in Excel? You can also read it into R etc, while I don't think you can do that for .npy

This process can be expensive time-wise:

PyTorch is much faster because NumPy arrays and PyTorch tensors share the same memory locations, and PyTorch tensors have a ‘numpy’ method.
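
A minimal sketch of the memory sharing mentioned above, assuming NumPy and PyTorch are both installed and the tensor lives on the CPU. The variable names are illustrative only and do not come from the linked post.

```python
import numpy as np
import torch

# Illustrative array; the names here are hypothetical.
arr = np.zeros(3, dtype=np.float32)

# torch.from_numpy wraps the existing buffer instead of copying it (CPU only).
tensor = torch.from_numpy(arr)
tensor[0] = 1.0                          # write through the tensor...
shared_forward = bool(arr[0] == 1.0)     # ...and the NumPy array sees the change

# Tensor.numpy() goes the other way, again without a copy for CPU tensors.
back = tensor.numpy()
back[1] = 2.0                            # write through the NumPy view...
shared_backward = bool(tensor[1].item() == 2.0)  # ...and the tensor sees it

print(shared_forward, shared_backward)
```

Because nothing is copied, the conversion itself is cheap, but it also means a write on one side silently changes the other; call `.copy()` or `.clone()` if independent data is needed.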

Why does csv perform worse when compared to the others? Why do we prefer csv as the standard format?

1 point by larrydag 12 days ago | link | parent | on: Federal Spending Transparency

I've often thought of creating a site like this one. I'm looking forward to seeing what it holds.

Great work, Perth!

This was great. Thanks a bunch for sharing

We just won the presentation award at Quantify Datathon 2017. Feel free to ask any question about the project!

chicago > NYC

just sayin

I'm glad he noted the dirty data. All the while I was reading I was wondering how it would cope with:

Organic mushroom crepe in sauce

Crepe champignon - your favourite crepe drizzled with a creamy wild mushroom sauce

French pankcake with a light sauce made from fresh mushrooms

1 point by SilverSurfer 25 days ago | link | parent | on: NumPy Cheat Sheet

thanks for sharing this

nice read
2 points by rounakbanik 27 days ago | link | parent | on: TED Talks Dataset

There is a kernel associated with this dataset. You can see how the data is used here:
1 point by larsyencken 31 days ago | link | parent | on: Franchise: a sql notebook

This has had some attention on HN; I thought this was the right crowd, though.

Here's the Github link:

"This data enables us to learn about individuals, and not just population averages."

Good luck selling that sort of unconsented* analysis as privacy respecting!

*Marketing is about influencing without express consent of the target (though rarely against the explicit will of a person), influencing with consent is mostly the realm of self-help books, doctors, bank clerks and others.


Thanks for your feedback :)

Regarding the MSc, it's more than I could ask for. There are introductory courses in statistics, programming, databases, machine learning, etc., and there are specific elective courses like Big Data Systems, Natural Language Processing, etc.

TBH I think that the math/stats background especially is something invaluable that is not really learned by reading data-science-specific books, and it's usually neglected. It might not be a strict prerequisite for working with machine learning algorithms, but it's a huge help for delving deeper into them.

You can see the whole curriculum of the course here:

Awesome post. I've been wanting to play with Terraform. I'll check it out.

I'm curious: How do you like the MSc Data Science program? What do the courses look like?

Hi all,

I wrote this post as a reference point for having a system to quickly set up high-end VMs on AWS.

The problem I usually faced as an MSc student in data science was that I would try to develop/run machine learning algorithms on my laptop, but it would take too much time.

The two alternatives I had were either to buy a high-end PC or to learn how to use cloud VMs.

Since I didn't have the budget to buy a high-end PC, I was left with the option of creating VMs on AWS, though this had problems of its own, mainly that it takes a bit of time to create and configure the machine.

That's why I tried to automate this procedure and ended up with this guide.

Yet another series by the author of the Modern Pandas series of posts. Well worth a read if you haven't:

Pseudo labeling is an interesting technique and the code you provided is interesting too.

However, I strongly disagree with your conclusion. The competition had a huge leaderboard shake-up, so saying you had a gain on the leaderboard is simply false. You finished 2551/3835 and lost 759 places on the private leaderboard. Therefore, saying that pseudo labeling improved your score is, in my opinion, not right.

Moreover, I haven't yet seen any Kaggle master use this technique to improve their model.

This is a blog post from a colleague that discusses the role of the choice of tree in hierarchical softmax in e.g. word2vec. It reproduces some experiments of Mnih and Hinton, but measures performance on the word analogy task (instead of language modelling).
