How to optimize "text search" for inverted index and relational database?

I would strongly recommend Sphinx Search Server, wchich is best optimized in full-text searching. Visit http://sphinxsearch.com/.

It's designed to work with MySQL, so it's an addition to Your current workspace.


I do not pretend to have THE solution but here is my ideas. First, I though like you for time-consuming queries LIKE%% : I would execute a query limited to a few answers in MySQL, like a dozen, return that to user, and wait to see if user wants more matching records, or launch in background the full-query, depending on you indexation needs for future searches.

More generally, I think that storing everything in memory could lead, one day, to too-much memory consumption. And althrough the search-engine becomes faster and faster when it keeps everything in memory, you'll have to keep all these caches up-to-date when data is added or updated and it will certainly take more and more time.

That's why I think the solution I saw a day in an "open-source forum software" (I couldn't remember its name) is not too bad for text searching in posts : each time a data is inserted, a table named "Words" keeps tracks of every existing word, and another table (let's say "WordsLinks") the link between each word and posts it appears in. This kind of solution has some drawbacks:

  • Each Insert, Delete, Update in database is a lot slower
  • Data selection for search engine must be anticipated : if you choose to keep two letter words you never kept, it is too late for already recorded data, unless you launch a complete data re-processing.
  • You must take care of DELETE as well as UPDATE and INSERT

But I think there are some big advantages:

  • Computing time is probably the same than the "memory solution" (eventually), but it is divided in each database Create/Update/Delete, rather than at query time.
  • Looking for a whole word, or words "starting with" is instantaneous : when indexed, searching in "Words" table is dichotomic. And "WordLinks" table query is very fast either with an index.
  • Looking for multiple words at the same time could be simple : gather a group of "WordLinks" for each found Word, and execute an intersection on them to keep only "Database Ids" common to all these groups. For example with the words "tree" and "leaf", the first one could give Table records {1, 4, 6}, and the second one could give {1, 3, 6, 9}. So with an intersection it is simple to keep only common parts : {1, 6}.
  • A "Like %%" in a single-column table is probably faster than a lot of "Like %%" in different fields of different tables. And each database engine handles some cache : "Words" table could be little enough to be kept in memory
  • I think there is a small risk of performance and memory problems if data becomes huge.
  • As every search is fast, you can even look for synonyms. For example search "network" if user didn't find anything with "ethernet".
  • You can apply rules, like splitting camel case words to generate for example the 3 words "wood", "X", "woodX" from "woodX". Each "word" is very lightweight to store and find, so you can do a lot of things.

I think the solution you need could be a blend of methods : for example you can keep lightweight UPDATE, INSERT, DELETE, and launch "Words" and "WordsLinks" feeding from a TRIGGER.

Just for anecdote, I saw a software developped by my company in which it was decided to keep "everything" (!) in memory. It leads us to recommend to our customers to buy servers with 64GB RAM. A little bit expensive. It explains why I am very prudent when I see solutions that could lead, eventually, to memory filling.