In the past few years, we've worked on a few projects with data in languages other than English. For some of these collections, we've applied transliteration and stemming to improve search effectiveness for particular languages.
In this diary post, we briefly explain what these processes are, and also record the technical challenges we encountered when implementing these two language transformation techniques in Veridian.
Transliteration is the conversion of text from one script to another, involving swapping letters (thus trans- + liter-) in predictable ways, such as Greek ⟨α⟩ → ⟨a⟩, Cyrillic ⟨д⟩ → ⟨d⟩, Greek ⟨χ⟩ → the digraph ⟨ch⟩, Armenian ⟨ն⟩ → ⟨n⟩ or Latin ⟨æ⟩ → ⟨ae⟩ (Wikipedia). This conversion is sometimes applied as part of the indexing process to broaden search matches: for example, when a user searches for ‘cafe’ or ‘café’, documents that contain either of these words are returned as matches.
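A minimal sketch of the simplest case, folding accented Latin characters to their base letters using Unicode decomposition (full script conversion such as Greek → Latin needs a transliteration library like ICU; this is purely illustrative):

```python
import unicodedata

def fold_diacritics(text):
    # Decompose each character (e.g. 'é' -> 'e' + combining acute accent),
    # then drop the combining marks, leaving only the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_diacritics("café"))  # -> cafe
```

With this folding applied at both index and query time, ‘cafe’ and ‘café’ end up in the same form and match each other.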
This is important for some of our collections because of the language differences between different countries and regions, even if the base language is the same. For example, Swiss Standard German is virtually identical to Standard German as used in Germany, with differences coming into play around pronunciation, vocabulary and orthography. Looking at the alphabet for each, there are small differences such as the fact that Swiss Standard German always uses a double s (ss) instead of the eszett (ß).
Stemming is another form of text conversion where a word is reduced to its root or base form; for example, ‘fishes’ and ‘fishing’ would both be reduced to ‘fish’. As with transliteration, stemming can be applied as part of the indexing process to broaden search matches. As an example, when users search for ‘fish’, documents that contain ‘fishes’ or ‘fishing’ are also returned as matches. The process is particularly useful because when users search for the singular form of a word, the plural form matches automatically, and vice versa.
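To illustrate the idea (and only the idea), here is a toy suffix stripper; real stemmers such as Porter or Snowball apply ordered, language-specific rules rather than naive suffix removal:

```python
def toy_stem(word):
    # Illustrative only: strip a few common English suffixes, keeping
    # at least a three-letter base so short words survive intact.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("fishes"), toy_stem("fishing"), toy_stem("fish"))  # -> fish fish fish
```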
However, in standard Veridian, stemming is not enabled. While it helps to broaden search matches by matching both singular and plural forms, it can also create issues. For example, ‘fishing’ is stemmed to its base form ‘fish’, so users who search for ‘fishing competition’ will also retrieve results for a ‘fish competition’.
This section is rather technical, but since this is a diary post and we found this implementation an interesting experience, we've decided to record it here.
To implement transliteration and stemming into Veridian, there are two aspects we needed to consider. The first aspect is the indexer, which is relatively straightforward as Solr already has support for both transliteration and stemming. This meant that we just needed to configure Solr carefully and rebuild the index.
One important point to note here is the order of the Solr filters. During both indexing and querying, Solr applies the filters in the order they are configured. Because the stemming algorithm is language specific, the transliteration filter has to come after the stemming filter; otherwise, once a word has been transliterated into a different form, the stemming algorithm can no longer determine its root.
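As an illustration (the field type name is made up, the filter choice depends on the collection, and the ICU filter assumes Solr's analysis-extras module is installed), an analyzer chain with the stemmer ahead of the transliteration step might look like:

```xml
<!-- Illustrative only: the stemmer runs before the ICU folding step,
     so it still sees the original, language-specific word forms. -->
<fieldType name="text_search" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same chain is applied at both index and query time, reversing the two filters would quietly break stemming rather than produce an error.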
The second aspect is query term highlighting and search result snippets, which is slightly more complex. Once Solr returns the matched document IDs, Veridian determines the highlighting and generates the snippets based on the search query. This means that both the query and the text in the database need to go through the same conversion process as Solr.
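Veridian's internal highlighter isn't shown here, but the principle can be sketched as a single normalize function applied to both the query and the stored text (a toy version that only lowercases and folds diacritics):

```python
import unicodedata

def normalize(token):
    # The same pipeline must be applied to the query and the stored text;
    # this toy version just lowercases and strips diacritics.
    decomposed = unicodedata.normalize("NFD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def highlight(text, query):
    # Mark every token whose normalized form matches a normalized query term.
    wanted = {normalize(q) for q in query.split()}
    return " ".join(f"<b>{t}</b>" if normalize(t) in wanted else t
                    for t in text.split())

print(highlight("Besuch im Café Zürich", "cafe zurich"))
# -> Besuch im <b>Café</b> <b>Zürich</b>
```

If the query and the database text were normalized differently, the terms Solr matched would not line up with the terms the snippet generator highlights.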
The query side is relatively straightforward, as there are standard C++ libraries that support transliteration and stemming. But the text in the database is handled by our ingestion process, which is written in Perl, and unfortunately there is no existing Perl support for these two libraries.
Our first thought was to compile both the transliteration and stemming code into standalone binaries and call them each time we processed a word. However, a newspaper batch can easily contain several million words, and launching the binaries several million times would massively increase the ingestion time.
As we didn't want to re-implement both algorithms in Perl, we decided to use SWIG. SWIG is a software development tool that generates the glue code needed to call C and C++ libraries from other languages (in our case Perl). This way, instead of launching a separate binary several million times, the library is loaded into memory once when the ingestion process starts.
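As a sketch of how this works (the module, header, and function names here are hypothetical): given a C function `char *stem_word(const char *word)` declared in `stem.h`, a SWIG interface file declares what should be exposed to Perl:

```
/* stem.i -- hypothetical SWIG interface file */
%module Stem
%{
#include "stem.h"   /* pulled verbatim into the generated wrapper */
%}
char *stem_word(const char *word);   /* exposed to Perl as Stem::stem_word */
```

Running `swig -perl5 stem.i` generates the wrapper code; once that is compiled into a shared library together with the stemming code, the ingestion script can call `Stem::stem_word($word)` for each token, with the library staying resident in memory for the whole run.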