As mentioned in my previous post, I'll be writing about my journey through SoK in the upcoming posts, so here I am with my first status report.
So far what I've learnt/done:
- Skimmed through Xapian's 'getting started' documentation to get familiar with the basics of Xapian.
- Analyzed Baloo's code to understand how and why we are using Xapian.
- Implemented a small program that uses Xapian to index documents and search them. I did this to get familiar with the basic APIs. The code can be found in my scratch repository.
Note: It is necessary to explain what a Xapian document is before proceeding further. I'll try my best to keep it brief. A Xapian document is the basic unit returned by Xapian after a search. In Baloo's implementation, every indexed file has an associated document that contains the terms relevant to the file. The document is stored in Xapian's database and mapped to the file using an SQLite database.
- Got familiar with xapian-inspect, a tool for inspecting the contents of a Xapian database table. I used it on the tables of the database created by my small utility, to understand how a Xapian database is organized internally. The main tables in the database are:
- Posting list table: This maps the indexed terms to the documents in which they occur. This is the main table that is used in resolving queries.
- Record table: This stores the data associated with a document.
- Term list table: This maps each document to the terms that occur in it.
- Position table: This maps a document + term pair to the positions of the term in the document. This information is required for phrase queries.
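To make the roles of these four tables a bit more concrete, here is a minimal in-memory sketch in Python. It is only a toy model I am using for illustration: the class `ToyIndex` and its method names are hypothetical, it does not use Xapian at all, and it says nothing about how Xapian actually lays out its tables on disk.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model of the four tables; NOT Xapian's real storage format."""

    def __init__(self):
        self.postlist = defaultdict(set)   # term -> ids of documents containing it
        self.record = {}                   # doc id -> data stored with the document
        self.termlist = defaultdict(set)   # doc id -> terms occurring in the document
        self.position = defaultdict(list)  # (doc id, term) -> positions of the term

    def add_document(self, doc_id, text, data=""):
        """Index whitespace-separated, lower-cased words of `text`."""
        self.record[doc_id] = data
        for pos, term in enumerate(text.lower().split(), start=1):
            self.postlist[term].add(doc_id)
            self.termlist[doc_id].add(term)
            self.position[(doc_id, term)].append(pos)

    def search(self, term):
        """Single-term query: answered from the posting list alone."""
        return sorted(self.postlist.get(term.lower(), ()))

    def phrase_search(self, first, second):
        """Documents in which `second` immediately follows `first`."""
        first, second = first.lower(), second.lower()
        candidates = self.postlist.get(first, set()) & self.postlist.get(second, set())
        hits = []
        for doc_id in candidates:
            following = set(self.position[(doc_id, second)])
            # The position table is what lets us check adjacency.
            if any(p + 1 in following for p in self.position[(doc_id, first)]):
                hits.append(doc_id)
        return sorted(hits)
```

A plain term lookup only touches the posting list, while the phrase query first narrows the candidates with the posting list and then needs the position table to check that the two terms are adjacent.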
I then started looking for problems with Xapian with respect to our use case. I've made a wiki page to keep the list. The problems I've figured out so far are:
- It relies heavily on exceptions. Exceptions are not well supported in Qt and might make the application crash, as mentioned here. For example, while locking a database, Xapian expects the program to catch certain exceptions and retry when they occur.
- If we want to read from and write to a Xapian database simultaneously, we need to keep separate copies for reading and writing, which wastes memory.
- It does not handle frequently changing data well. If the data in a document changes too frequently, it can lead to a situation in which locking the database for writing becomes impossible, which makes Baloo fail.
- Baloo needs to normalize text, i.e. remove all diacritic marks, and it also needs to split words on '_' to generate terms. Xapian's term generator doesn't support either, so Baloo uses its own term generator.
- While searching, the user may not type complete words, so we need to look for every possible expansion of the words in a query. Xapian doesn't provide this feature, so we're using our own query parser.
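The last two points can be illustrated with a small Python sketch. This is not Baloo's actual term generator or query parser; the function names `normalize`, `generate_terms` and `expand_prefix` are hypothetical and only show the kind of behaviour described above, using nothing but the Python standard library.

```python
import re
import unicodedata


def normalize(text):
    """Lower-case the text and strip diacritic marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def generate_terms(text):
    """Split on anything that is not a letter or digit, so '_' separates words too."""
    return [term for term in re.split(r"[\W_]+", normalize(text)) if term]


def expand_prefix(prefix, indexed_terms):
    """Return every indexed term that starts with the normalized prefix."""
    prefix = normalize(prefix)
    return sorted(term for term in indexed_terms if term.startswith(prefix))
```

With these helpers, a (made-up) file name such as `Résumé_final draft.txt` would be broken into the terms `resume`, `final`, `draft` and `txt`, and a partially typed query word such as `Rés` would expand to every indexed term beginning with `res`.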
This list is not exhaustive, and I'll be adding more problems as I figure them out, so keep an eye on the wiki page. Once again, a big thanks to my mentor Vishesh Handa, who guided me through all this.