As mentioned in my previous post, I’ll be writing about my journey through SoK in the upcoming posts, and here I am with my first status report.

What I’ve learnt and done so far:

  • Skimmed through Xapian’s ‘Getting Started’ documentation to get familiar with the basics of Xapian.
  • Analyzed Baloo’s code to understand how and why we are using Xapian.
  • Then I implemented a small program that uses Xapian to index documents and search them. I did this to get familiar with the basic APIs. The code can be found in my scratch repository.

Note: It is necessary to explain what a Xapian document is before proceeding further. I’ll try my best to keep it brief. A Xapian document is the basic unit returned by Xapian after a search. The implementation in Baloo is: every indexed file has an associated document which contains the terms relevant to the file. This is stored in Xapian’s database and mapped to the file using an SQLite database.
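
To make the note above concrete, here is a minimal sketch in plain Python of the "one document per file" idea. It does not use Xapian or Baloo at all: the `documents` dictionary stands in for the Xapian side, and the `files` table, `index_file` and `search` are hypothetical names made up for this illustration, not Baloo's actual schema or API.

```python
import sqlite3

# Hypothetical stand-in for the Xapian side: document id -> set of terms.
documents = {}

# Hypothetical stand-in for the SQLite side: document id <-> file path.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, path TEXT UNIQUE)")


def index_file(path, text):
    """Register the file in SQLite and store its terms under the same id."""
    cur = db.execute("INSERT INTO files (path) VALUES (?)", (path,))
    doc_id = cur.lastrowid
    documents[doc_id] = set(text.lower().split())
    return doc_id


def search(term):
    """Return the paths of all files whose document contains the term."""
    hits = sorted(d for d, terms in documents.items() if term.lower() in terms)
    paths = []
    for doc_id in hits:
        row = db.execute("SELECT path FROM files WHERE id = ?", (doc_id,)).fetchone()
        paths.append(row[0])
    return paths


index_file("/home/user/notes.txt", "Xapian status report")
index_file("/home/user/todo.txt", "write status update")
print(search("status"))
print(search("xapian"))
```

The point of the split is that the search side only ever deals in document ids and terms, while the id-to-path lookup lives in a separate store.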

  • Got familiar with xapian-inspect, which is a tool used to inspect the contents of a Xapian database table. I used it to inspect the tables of the database created by the small program I wrote, in order to understand how a Xapian database is organized internally. The main tables in the database are:
    • Posting list table: This maps the indexed terms to the documents in which they occur. This is the main table that is used in resolving queries.
    • Record table: This stores the data associated with a document.
    • Term list table: This table maps the documents to the terms that occur in them.
    • Position table: This table maps document + term to the positions of the term in the document. This information is required for phrase queries.
  • Started looking for problems with Xapian with respect to our use case. I’ve made a wiki page to keep a list of them. The problems that I’ve figured out so far are:

    • It relies heavily on exceptions. Exceptions are not well supported in Qt and might make the application crash, as mentioned here. For example, while locking a database, Xapian expects the program to catch certain exceptions and retry if they are caught.

    • If we want to read from and write to a Xapian database simultaneously, we need to keep separate copies for reading and writing, thus wasting memory.

    • It does not handle frequently changing data well. If the data in a document changes too frequently, it can lead to a condition in which locking the database for writing becomes impossible, thus making Baloo fail.

    • Baloo needs support for normalizing text, i.e. removing all diacritic marks, and also needs to split words on ‘_’ to generate terms. Xapian’s term generator doesn’t provide support for either, so Baloo uses its own term generator.

    • While searching for something, the user may not type complete words, so we need to look for every possible expansion of the words in a query. Xapian doesn’t provide this feature, so we’re using our own query parser.
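
The term generation and query expansion behaviour described in the last two problems can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming NFKD decomposition is an acceptable way to strip diacritics; `normalize`, `generate_terms` and `expand` are hypothetical helpers, not Baloo's term generator or query parser.

```python
import re
import unicodedata


def normalize(text):
    """Lower-case the text and strip diacritic marks (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def generate_terms(text):
    """Split on anything that is not a letter or digit, including '_'."""
    return [t for t in re.split(r"[\W_]+", normalize(text)) if t]


def expand(partial, known_terms):
    """Every indexed term that starts with the (normalized) partial word."""
    prefix = normalize(partial)
    return sorted(t for t in known_terms if t.startswith(prefix))


terms = set(generate_terms("Résumé_final draft_v2.txt"))
print(sorted(terms))
print(expand("res", terms))
```

Normalizing both the indexed text and the typed query with the same function is what lets a partial, accent-free word such as "res" still find "Résumé".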
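
The four database tables listed earlier (posting list, record, term list and position) can be modelled with plain dictionaries to see how they work together. This is a rough mental model only: the real tables are on-disk structures with their own encoding, and `add_document` and `phrase_search` are hypothetical names used just for this sketch.

```python
from collections import defaultdict

posting_list = defaultdict(set)   # term -> ids of the documents containing it
term_list = defaultdict(set)      # document id -> terms that occur in it
positions = defaultdict(list)     # (document id, term) -> word positions
records = {}                      # document id -> data stored with the document


def add_document(doc_id, text):
    """Fill all four tables for one document."""
    records[doc_id] = text
    for pos, term in enumerate(text.lower().split()):
        posting_list[term].add(doc_id)
        term_list[doc_id].add(term)
        positions[(doc_id, term)].append(pos)


def phrase_search(first, second):
    """Ids of documents in which `second` directly follows `first`."""
    # The posting lists narrow the search down to documents with both terms.
    candidates = posting_list.get(first, set()) & posting_list.get(second, set())
    matches = []
    for doc_id in sorted(candidates):
        # The position data decides whether the terms are actually adjacent.
        following = set(positions[(doc_id, second)])
        if any(p + 1 in following for p in positions[(doc_id, first)]):
            matches.append(doc_id)
    return matches


add_document(1, "status report for baloo")
add_document(2, "report on baloo status")
print(phrase_search("status", "report"))
```

Both documents contain "status" and "report", so the posting lists alone cannot tell them apart; only the position data shows that the phrase "status report" occurs in the first one.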

This list is not exhaustive and I’ll be adding more problems as I figure them out, so keep an eye on the wiki page. Once again, a big thanks to my mentor Vishesh Handa, who guided me through all this.

Pinak Ahuja

