DataTau
Fuzzy Matching with Yhat (yhathq.com)
6 points by rohit 3563 days ago | 2 comments


3 points by tbjohns 3561 days ago | link

Beyond basic pandas and sklearn usage, the post doesn't show much thought about the real challenges of deduplication. For example, since most records are not duplicates, how do you efficiently collect a useful training set? There are also O(n^2) record pairs, so how do you link records efficiently and consistently, especially if even the supervised learning step won't run on a laptop? And do we really want random forests, or should we learn a model that enforces additional structure (for example, a distance metric)?

Here are two papers that begin to look at these questions:

[1] Distance metric learning: http://papers.nips.cc/paper/2164-distance-metric-learning-wi...

[2] Active learning for deduplication: http://cvs.cs.umd.edu/class/spring2012/cmsc828L/Papers/Saraw...
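
To make the O(n^2) and consistency points concrete, here is a rough sketch. It is not from the post or from either paper. It assumes two standard techniques: blocking, which compares only records that share a cheap key, and union-find, which merges pairwise matches into consistent clusters. The `name` field, the three-letter key, and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Yield index pairs that share a blocking key, not all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def link(n, matched_pairs):
    """Merge pairwise matches into clusters so that linkage is transitive."""
    parent = list(range(n))

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in range(n):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


records = [
    {"name": "Jonathan Smith"},
    {"name": "jonathan smyth"},
    {"name": "Jon Smith"},
    {"name": "Mary Jones"},
]
pairs = sorted(candidate_pairs(records))
```

Only the pairs inside the shared "jon" block would go to a classifier here. A key this crude will miss duplicates that differ in their first letters, so in practice you would probably want several keys, or something learned.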

-----

1 point by elliott34 3561 days ago | link

Ah yes, I wish all blog posts were published research papers. Then I wouldn't have to waste my time looking at thoughtless blog posts... or inane DT comments.

-----



