I’m involved with, the Groen Kennisnet (Green Knowledge Net) Search portal, where we use the Apache Solr server to support this functionality.
A full-text search server indexes human-readable documents, like documents marked up with XML, HTML web pages, Word documents and PDF files, by importing them to its internal database. It typically does this by converting these documents into its own internal format, extracting metadata, analysing textual properties, and a variety of other information retrieval related functions. The server also functions as a search engine: it can be queried with a search term, via a web page or an API, which retrieves the documents that are matched by the search term, and returns the search results.
Another common feature of a search server is support for faceting: search results can be grouped (aggregated) on certain common characteristics and navigated by drilling down. These groupings are usually shown as links with counts. An advanced feature of such a server is the particular ordering or ranking of the retrieved documents. There are potentially many ways to influence this ordering, some of which are built into the search server (by default), and others which can be externally configured. The process of determining the relevancy of documents using certain criteria is called boosting or scoring.
Apache Solr (pronounced 'solar') is such a full-text search server. It is written in Java and developed as an open-source project by the Apache community. Solr is based on the Lucene search library (another Apache project), which provides Solr with its basic full-text indexing, querying and boosting capabilities, and other advanced features. Solr turns this Lucene library into a server, and adds enterprise search features like extensive XML-based configuration options, an open RESTful HTTP-centric API with support for both XML and JSON, a web-based administration interface, and out-of-the-box support for scalability via SolrCloud. Solr can be freely downloaded from http://lucene.apache.org/solr . Solr’s index consists of separate documents that can have any number of fields, each of which conforms to a data type defined in a schema; in other words, Solr is not schemaless (though it does support dynamic fields, which are matched by a naming convention).
The Groen Kennisnet Search portal (Dutch: Groen Kennisnet Zoeken, from here on GKZ) is a service of all the ‘Green’ colleges in the Netherlands, supported by the Dutch Ministry of Economic Affairs, which makes all the educational source material on the topics of Food, Nature and Environment searchable through a web portal, with an accompanying web API. The (barebones) web portlet can be visited at http://zoeken.groenkennisnet.nl . GKZ is usually embedded in a fully fleshed-out web portal via its search interface. For a sample run, enter “nl” in the search box, which will return most of the indexed documents. The primary participant in and most significant provider of content for GKZ is the Wageningen University & Research centre (WUR).
GKZ indexes 5 primary textual resources, which are harvested on a daily basis:
The learning objects often reference full-text content stored in web pages, PDF files and such, which are separately processed with a web crawler. Each textual resource (known as a content type) is shown in its own tab with 10 search results per page, and links for scrolling through the search results.
All the search results are shown in the order determined by Lucene’s default relevancy algorithm and Solr’s search handler configuration (more on that below). Because each content type has its own tab, the relevancy of these documents (which determines their ordering) could in theory be differentiated per content type. In actual practice this is not yet the case, but it is currently under investigation in the Search Boosting project. For instance, it could be interesting to show the search results for the News tab (the RSS feeds) ordered by publishing date, and the results for the Contacts tab ordered alphabetically by the name of the contact.
Lucene determines the relevancy of each document using a specific scoring algorithm, which takes into consideration 4 primary factors:
1. Term Frequency: the more often a search term appears in the document, the higher the document’s score.
2. Inverse Document Frequency: the less often a search term appears across all documents in the index, the more it contributes to the score.
3. Coordination Factor: a document will have a higher score if it contains more of the search terms in the query.
4. Field Norm: if a field matches a search term, the score is lowered as the field’s length (the number of words it contains) increases, so a match in a short field counts for more than a match in a long one.
These are all hard-coded settings, although it is possible to override the Lucene scoring algorithm by writing a custom Java subclass of the DefaultSimilarity class.
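As an illustration, the four factors can be sketched in a few lines of Python. This is a simplified model based on the formulas documented for Lucene's classic TF-IDF similarity; it leaves out query normalisation and boosts, and the function names, example tokens and document frequencies are made up for the example.

```python
import math


def tf(freq: int) -> float:
    """Term Frequency: grows with the square root of the term's occurrences."""
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse Document Frequency: rarer terms contribute more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def coord(overlap: int, max_overlap: int) -> float:
    """Coordination Factor: fraction of the query terms that matched."""
    return overlap / max_overlap


def length_norm(num_terms: int) -> float:
    """Field Norm: longer fields get a lower weight."""
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, field_tokens, doc_freqs, num_docs):
    """Simplified score of a single field for a multi-term query."""
    matched = [t for t in query_terms if t in field_tokens]
    if not matched:
        return 0.0
    norm = length_norm(len(field_tokens))
    total = sum(
        tf(field_tokens.count(t)) * idf(doc_freqs[t], num_docs) ** 2 * norm
        for t in matched
    )
    return coord(len(matched), len(query_terms)) * total


# Hypothetical index statistics: "soil" is rare, "water" is common.
doc_freqs = {"soil": 9, "water": 99}
short_field = ["soil", "water"]
long_field = ["soil"] + ["filler"] * 15

short_score = score(["soil"], short_field, doc_freqs, 1000)
long_score = score(["soil"], long_field, doc_freqs, 1000)
```

In this model the same single match on "soil" scores higher in the two-word field than in the sixteen-word field, purely because of the Field Norm.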
Another option is to override some aspects in Solr’s configuration files. For instance, one of the client’s requests in the Search Boosting project is to remove the influence that the number of keywords in the “keyword” field has on the relevancy (when the field matches a search term). We are able to accomplish this by adding a new field “keyword.s” to the Solr index with its own field type (solr.TextField) and disabling length normalisation for this specific field with the extra omitNorms="true" option on its field type. This disables the Field Norm part of the scoring algorithm (factor 4 above), so the number of keywords a document has no longer affects the score of a keyword match.
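The effect of omitNorms can be illustrated with a small Python sketch. It assumes the classic length normalisation of 1/sqrt(number of terms) and ignores how Lucene stores norms internally; the function name and the keyword counts are hypothetical.

```python
import math


def field_norm(num_terms: int, omit_norms: bool = False) -> float:
    """Length normalisation, or a constant 1.0 when norms are omitted."""
    if omit_norms:
        return 1.0
    return 1.0 / math.sqrt(num_terms)


few_keywords = 4    # a document tagged with 4 keywords
many_keywords = 25  # a document tagged with 25 keywords

# With norms, the heavily tagged document is penalised for the same match.
with_norms = (field_norm(few_keywords), field_norm(many_keywords))

# With omitNorms="true", both documents get the same weight.
without_norms = (
    field_norm(few_keywords, omit_norms=True),
    field_norm(many_keywords, omit_norms=True),
)
```

With norms enabled, the document with 25 keywords receives less than half the weight of the document with 4 keywords; with norms omitted, the two are treated equally.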
In the configuration for a search handler of type edismax (Extended DisMax Query Parser) Solr offers a plethora of options to boost documents based on certain criteria. One of these options is that each field can have a custom boost. This is a number that is attached to the field name with a caret (^); the default boost of each field is 1. In an earlier phase of the project the client requested certain changes to the boosting of specific fields, so that more relevant documents would show first, and this we could easily configure in Solr. For instance, we boosted the keyword field, where matches with a search term are very relevant, with a high factor: keyword.s^15. As a result, documents that have (multiple) matches with search terms in their keywords field are shown much higher up in the search results, which improves the relevancy of GKZ.
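To make the caret notation concrete, here is a minimal Python sketch that builds the query string for an edismax request. In GKZ the boosts live in the search handler configuration; passing them as the qf request parameter, as done here, is an alternative way to supply them. Only keyword.s^15 comes from the text above: the other field names, their weights and the function name are assumptions for illustration.

```python
from urllib.parse import urlencode

# keyword.s^15 is taken from the article; "title" and "description"
# are hypothetical fields with hypothetical weights.
field_boosts = {"keyword.s": 15, "title": 5, "description": 1}


def build_query_string(user_query, boosts):
    """Return URL-encoded edismax parameters with per-field boosts."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join(f"{field}^{weight}" for field, weight in boosts.items())
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        "rows": 10,  # 10 results per page, as in the GKZ tabs
        "wt": "json",
    }
    return urlencode(params)


print(build_query_string("bodem", field_boosts))
```

The resulting string can be appended to the URL of a Solr select handler; the caret is percent-encoded as %5E and the spaces in qf as plus signs.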
Another advanced option that Solr has is to not just boost certain fields, but to boost them based on the content of that particular field. For that Solr supports boosting functions, which calculate a value from the content of a field. We have used this to improve the relevancy of GKZ by boosting on the content of the publishyear field: the more recent the year, the higher the resulting score. This will be improved upon in the context of the Search Boosting project. The client requests that we give the publishyear even more weight; we accomplish this by tweaking a factor in the boosting function, so that publishyear counts for more than it does now.
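The exact boosting function used in GKZ is not shown here, so the following Python sketch is only one plausible shape for such a recency boost. It mirrors the reciprocal function that Solr offers for function queries, recip(x,m,a,b) = a / (m*x + b); the parameter values and the recency_boost helper are assumptions.

```python
def recip(x: float, m: float, a: float, b: float) -> float:
    """Reciprocal function, as in Solr's recip(x,m,a,b): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(publish_year, current_year, m=1.0, a=1.0, b=1.0):
    """Hypothetical boost that decays with the age of the document in years."""
    age = max(current_year - publish_year, 0)
    return recip(age, m, a, b)


# Raising m makes the boost drop off faster for older documents,
# which gives the publish year more weight in the ranking.
gentle = [recency_boost(year, 2015) for year in (2015, 2014, 2010)]
steep = [recency_boost(year, 2015, m=2.0) for year in (2015, 2014, 2010)]
```

A document from the current year gets the full boost in both variants, while older documents fall behind more quickly in the steeper one.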
Solr’s boost functions are a very powerful and versatile tool to tune the relevancy of a query. They support a variant of the if .. then .. else statement in procedural languages, where if a condition is true a certain action is taken, and otherwise another action is taken. We use this function in the context of the current project to replace a missing publishyear field in a document with a custom entryyear field, so that such documents will still get a relevant score and therefore be shown in a meaningful order in the search results. Another use which we have made of boosting functions in this context is to multiply the publishyear boost by a pre-set number to smooth out the distribution (bell curve) of the publishyear score, so that there is not such a big gap between documents dated to this year and those from previous years.
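The fallback from publishyear to entryyear can be sketched as follows. The Python function shows the logic; the string builder shows how that logic might be written with Solr's if and exists functions. The helper names are hypothetical, and the expression actually used in GKZ may differ.

```python
from typing import Optional


def effective_year(publish_year: Optional[int], entry_year: int) -> int:
    """Use the publish year when present, otherwise fall back to the entry year."""
    return publish_year if publish_year is not None else entry_year


def fallback_function(primary: str, fallback: str) -> str:
    """Build a Solr function expression: if(test, value_if_true, value_if_false)."""
    return f"if(exists({primary}),{primary},{fallback})"


print(effective_year(None, 2012))
print(fallback_function("publishyear", "entryyear"))
```

The generated expression can then be used wherever the boosting function would otherwise reference publishyear directly, so that documents without a publish year are still ranked by a meaningful date.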
To recapitulate, Solr offers a compelling number of features, some out-of-the-box and some via its configuration mechanism, to improve the relevancy of matching documents. The default features are part of Lucene, and improve the relevancy of matching documents without any setup. Custom features are part of Solr’s configuration, and offer detailed options to override certain aspects of Lucene’s default behaviour, and to boost the relevancy of certain fields, either relative to one another, or even within a field relative to its content using boosting functions.