Any good implementation of greedy set cover for large datasets?
There is a well-known greedy approximation algorithm for set cover that is also easy to implement in whatever language you choose. The algorithm itself is described here:
http://en.wikipedia.org/wiki/Set_cover_problem#Greedy_algorithm
It is so simple that the easiest thing is just to write it from scratch.
Notably, it is also the best polynomial-time approximation algorithm known for set cover (it achieves a factor of roughly ln n). That means that to get better worst-case performance (a more compact result set) you would need non-polynomial running times (= slow algorithms for large sets).
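For reference, a minimal sketch of that greedy loop (names are hypothetical, and the bookkeeping is deliberately naive, rescanning every set each round; a serious implementation would keep sets indexed by their current gain):

```cpp
#include <unordered_set>
#include <vector>

// Greedy set cover sketch: elements are ints, each set is a vector of elements.
// Repeatedly pick the set covering the most still-uncovered elements.
// Returns indices of the chosen sets.
std::vector<int> greedySetCover(const std::vector<std::vector<int>>& sets,
                                const std::unordered_set<int>& universe) {
    std::unordered_set<int> uncovered(universe);
    std::vector<int> chosen;
    while (!uncovered.empty()) {
        int best = -1;
        std::size_t bestGain = 0;
        for (std::size_t i = 0; i < sets.size(); ++i) {
            std::size_t gain = 0;
            for (int e : sets[i])
                if (uncovered.count(e)) ++gain;
            if (gain > bestGain) { bestGain = gain; best = static_cast<int>(i); }
        }
        if (best < 0) break;  // remaining elements appear in no set
        for (int e : sets[best]) uncovered.erase(e);
        chosen.push_back(best);
    }
    return chosen;
}
```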
Unfortunately, the Wikipedia entry doesn't actually cover weighted set cover, which is the case here. The extension is simple and is described, e.g., here:
http://pages.cs.wisc.edu/~shuchi/courses/880-S07/scribe-notes/lecture03.pdf
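The only change for the weighted case is the selection rule: at each step, pick the set that minimizes cost per newly covered element. A sketch of just that selection step, assuming a weights vector parallel to the sets (names are hypothetical):

```cpp
#include <limits>
#include <unordered_set>
#include <vector>

// Weighted greedy selection sketch: rank sets by cost per newly covered
// element instead of raw coverage; weights[i] is the cost of sets[i].
// Returns the index of the cheapest useful set, or -1 if none helps.
int pickCheapestSet(const std::vector<std::vector<int>>& sets,
                    const std::vector<double>& weights,
                    const std::unordered_set<int>& uncovered) {
    int best = -1;
    double bestRatio = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < sets.size(); ++i) {
        std::size_t gain = 0;
        for (int e : sets[i])
            if (uncovered.count(e)) ++gain;
        if (gain == 0) continue;  // covers nothing new
        double ratio = weights[i] / static_cast<double>(gain);
        if (ratio < bestRatio) { bestRatio = ratio; best = static_cast<int>(i); }
    }
    return best;
}
```

The surrounding loop is the same as in the unweighted case: call this until everything is covered (or it returns -1), erasing the chosen set's elements from `uncovered` each round.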
Some more useful notes:
http://www.cs.ucr.edu/~neal/non_arxiv/Young08SetCover.pdf
http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec20.pdf
My linear-time, linear-space implementation of greedy set cover in C++ is available on GitHub:
https://github.com/martin-steinegger/setcover
A computation on 40,000,000 sets with an average of 10 elements per set takes around 4 minutes on an Amazon AWS m2.2xlarge instance.
I am still working on some tricks to improve the performance:
- remove subsets that are covered by a bigger set, using MinHash
- remove all sets that contain just one element that occurs in no other set (see the sketch after this list)
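As a sketch of that second idea (under my reading of it): a singleton set whose lone element appears in no other set must be picked anyway, since nothing else can cover that element, so it can be moved into the result up front and dropped from the greedy search. The names here are hypothetical, not from the linked repository:

```cpp
#include <unordered_map>
#include <vector>

// Preprocessing sketch: singleton sets {e} where e occurs in no other set
// are forced choices (nothing else covers e), so select them before the
// main greedy loop. Returns the indices of those forced sets.
std::vector<int> forcedSingletons(const std::vector<std::vector<int>>& sets) {
    std::unordered_map<int, int> occurrences;  // element -> #sets containing it
    for (const auto& s : sets)
        for (int e : s) ++occurrences[e];
    std::vector<int> forced;
    for (std::size_t i = 0; i < sets.size(); ++i)
        if (sets[i].size() == 1 && occurrences[sets[i][0]] == 1)
            forced.push_back(static_cast<int>(i));
    return forced;
}
```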