Fast hamming distance queries in postgres
Well, I spent a while looking at writing a custom postgres C extension, and wound up just writting a Cython database wrapper that maintains a BK-tree structure in memory.
Basically, it maintains a in-memory copy of the phash values from the database, and all updates to the database are replayed into the BK-tree.
It's all up on github here. It also has a LOT of unit-tests.
Querying across a dataset of 10 million hash values for items with a distance of 4 results in touching ~0.25%-0.5% of the values in the tree, and takes ~100 ms.
MOAR ANSWERS!
Ok, I've finally taken the time to write a custom PostgreSQL indexing extension. I used the SP-GiST interface.
This was fairly challenging, mostly because Posgres is big.
Anyways, as usual, it's up on github here.
Performance-wise, it's currently ~2-3 times slower then the pure-in-memory implementation in my other answer to this question, but it's so much more convenient to use I'll happily eat that performance hit (realistically, it's ~50 ms/query - 150 ms/query, which is still pretty small).