Elasticlunr.js

Lightweight full-text search engine in JavaScript for browser search and offline search.

Elasticlunr.js is a lightweight full-text search engine written in JavaScript for in-browser and offline search. It is based on lunr.js but is more flexible, adding query-time boosting and field search. Elasticlunr.js is a bit like Solr, but much smaller and not as bright; it still provides flexible configuration.

Getting Started

Open your browser's developer tools on this page to follow along, or use Node.js to try the examples there instead.

A very simple search index can be created with the following script:

var index = elasticlunr(function () {
    this.addField('title');
    this.addField('body');
    this.setRef('id');
});

Adding documents to the index is as simple as:

var doc1 = {
    "id": 1,
    "title": "Oracle released its latest database Oracle 12g",
    "body": "Yesterday Oracle has released its new database Oracle 12g, this would make more money for this company and lead to a nice profit report of annual year."
};

var doc2 = {
    "id": 2,
    "title": "Oracle released its profit report of 2015",
    "body": "As expected, Oracle released its profit report of 2015, during the good sales of database and hardware, Oracle's profit of 2015 reached 12.5 Billion."
};

index.addDoc(doc1);
index.addDoc(doc2);

Searching is just as simple:

index.search("Oracle database");

You can also apply query-time boosting by passing in a configuration:

index.search("Oracle database profit", {
    fields: {
        title: {boost: 2},
        body: {boost: 1}
    }
});

Features

Elasticlunr.js is based on lunr.js but is more flexible. Its main features are as follows:

  • 1. Query-time boosting: you don't need to set boosting weights while building the index, so you can try different boosting schemes at search time.
  • 2. A more rational scoring mechanism: Elasticlunr.js uses nearly the same scoring mechanism as Elasticsearch, which is also the one used by Lucene.
  • 3. Field search: you can choose which fields to index and which fields to search.
  • 4. Boolean model: you can set which fields to search and the boolean model ("OR" or "AND") applied to the query tokens.
  • 5. A combination of the Boolean model, the TF/IDF model, and the vector space model, which makes result ranking more reliable.
  • 6. Fast: Elasticlunr.js removes TokenCorpus and Vector from lunr.js. With the combined model there is no need to compute a document vector in order to score a document, which improves search speed significantly.
  • 7. Small index file: Elasticlunr.js does not store TokenCorpus, because there is no need to compute query and document vectors. The resulting index file is very small, which is especially helpful when elasticlunr.js is used for offline search.
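
To make the combined scoring idea more concrete, here is a minimal sketch of a Lucene-style TF/IDF score for one query token in one field. The formulas (square-root term frequency, logarithmic inverse document frequency, inverse-square-root field-length norm) are the classic Lucene ones and are assumed here for illustration; the function names `tf`, `idf`, `fieldNorm`, and `scoreToken` are hypothetical and are not part of the elasticlunr.js API.

```javascript
// Conceptual sketch only: classic Lucene-style TF/IDF, not elasticlunr.js internals.

// Term frequency: repeated occurrences help, with diminishing returns.
function tf(termCount) {
    return Math.sqrt(termCount);
}

// Inverse document frequency: rarer tokens weigh more.
function idf(totalDocs, docsWithToken) {
    return 1 + Math.log(totalDocs / (docsWithToken + 1));
}

// Field-length norm: a match in a short field counts more than in a long one.
function fieldNorm(fieldLength) {
    return 1 / Math.sqrt(fieldLength);
}

// Score of a single token in a single field, scaled by a query-time boost.
function scoreToken(stats, boost) {
    return tf(stats.termCount) *
        idf(stats.totalDocs, stats.docsWithToken) *
        fieldNorm(stats.fieldLength) *
        boost;
}

var rare = scoreToken({termCount: 1, totalDocs: 100, docsWithToken: 1, fieldLength: 4}, 1);
var common = scoreToken({termCount: 1, totalDocs: 100, docsWithToken: 90, fieldLength: 4}, 1);
console.log(rare > common); // true: the rarer token scores higher
```

Because the boost is just a multiplier applied when the score is computed, nothing about it has to be stored in the index, which is what makes changing it at query time cheap.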

Query-time Boosting

Elasticlunr.js has a well-designed scoring mechanism, so for most requirements a simple search is enough:

index.search("Oracle database profit");

You can choose which fields to search, and set a boost for each of them, by passing in a JSON configuration. With this configuration, elasticlunr.js searches for the query string only in the specified fields, applying the given boosting weights. If no fields are specified, elasticlunr.js searches all the fields you configured when you created the index.

The scoring mechanism used in elasticlunr.js is fairly complex; please see the detailed documentation for more information.

index.search("Oracle database", {
    fields: {
        title: {boost: 2},
        body: {boost: 1}
    }
});
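
The effect of the `fields` configuration can be illustrated with a small standalone sketch. It assumes that a document's final score is the sum of its per-field scores, each multiplied by that field's boost; this is an assumption made for illustration, and `combineFieldScores` and `rawScores` are hypothetical names, not elasticlunr.js functions.

```javascript
// Conceptual sketch only: how per-field boosts could be applied at query time.
// `fieldScores` stands in for the raw scores an index would compute per field.
function combineFieldScores(fieldScores, config) {
    // Without a `fields` setting, fall back to every field at boost 1.
    var fields = (config && config.fields) || null;
    var names = fields ? Object.keys(fields) : Object.keys(fieldScores);
    var total = 0;
    names.forEach(function (name) {
        var boost = fields && fields[name].boost !== undefined ? fields[name].boost : 1;
        total += (fieldScores[name] || 0) * boost;
    });
    return total;
}

var rawScores = {title: 0.5, body: 0.25};

// Both fields searched, title counted twice as much as body.
console.log(combineFieldScores(rawScores, {fields: {title: {boost: 2}, body: {boost: 1}}})); // 1.25
// Only the title field is searched.
console.log(combineFieldScores(rawScores, {fields: {title: {boost: 2}}})); // 1
// No configuration: every field is searched with equal weight.
console.log(combineFieldScores(rawScores)); // 0.75
```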

Boolean Model

Elasticlunr.js also supports a boolean logic setting. If no boolean logic is set, elasticlunr.js uses "OR" logic by default, which gives high recall.

index.search("Oracle database profit", {
    fields: {
        title: {boost: 2},
        body: {boost: 1}
    },
    bool: "OR"
});
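
The difference between the two boolean settings can be sketched without the library. The example below assumes a hypothetical lookup table, `matchesByToken`, that maps each query token to the refs of the documents containing it; `combineMatches` is an illustrative helper, not an elasticlunr.js function.

```javascript
// Conceptual sketch only: combining per-token matches with "OR" or "AND".
// `matchesByToken` maps each query token to the document refs that contain it.
function combineMatches(matchesByToken, bool) {
    var tokens = Object.keys(matchesByToken);
    if (tokens.length === 0) {
        return [];
    }
    if (bool === "AND") {
        // Keep only refs present in every token's match list.
        return matchesByToken[tokens[0]].filter(function (ref) {
            return tokens.every(function (token) {
                return matchesByToken[token].indexOf(ref) !== -1;
            });
        });
    }
    // Default "OR": union of all match lists, without duplicates.
    var seen = {};
    var union = [];
    tokens.forEach(function (token) {
        matchesByToken[token].forEach(function (ref) {
            if (!seen[ref]) {
                seen[ref] = true;
                union.push(ref);
            }
        });
    });
    return union;
}

// Hypothetical match lists for a three-document index.
var matches = {oracle: [1, 2, 3], database: [1, 3], profit: [2, 3]};

console.log(combineMatches(matches, "OR"));  // [ 1, 2, 3 ]
console.log(combineMatches(matches, "AND")); // [ 3 ]
```

"OR" returns every document that matches at least one token, which favors recall; "AND" returns only documents that match all tokens, which favors precision.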

Token Expand

To increase recall, you can configure the search to expand query tokens. For example, if the user queries "micro" and both "microwave" and "microscope" are in the index, then documents containing "microwave" or "microscope" are also returned. Results for each expanded token are penalized, because the expanded token is not what the user actually typed.

index.search("micro", {
    fields: {
        title: {boost: 2, bool: "AND"},
        body: {boost: 1}
    },
    bool: "OR",
    expand: true
});
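
As a rough illustration of token expansion, the sketch below treats the query token as a prefix and gives expanded tokens a lower weight than an exact match. The penalty value of 0.5 and the helper name `expandToken` are assumptions chosen for the example; they do not reflect the actual weighting inside elasticlunr.js.

```javascript
// Conceptual sketch only: prefix expansion with a penalty for expanded tokens.
// The 0.5 penalty is an arbitrary illustrative value, not elasticlunr.js's.
function expandToken(queryToken, indexedTokens, penalty) {
    return indexedTokens
        .filter(function (token) {
            return token.indexOf(queryToken) === 0;
        })
        .map(function (token) {
            return {
                token: token,
                // An exact match keeps full weight; expansions are penalized.
                weight: token === queryToken ? 1 : penalty
            };
        });
}

var expanded = expandToken("micro", ["micro", "microwave", "microscope", "macro"], 0.5);
console.log(expanded);
// [ { token: 'micro', weight: 1 },
//   { token: 'microwave', weight: 0.5 },
//   { token: 'microscope', weight: 0.5 } ]
```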

Pipeline

Every document and search query that enters elasticlunr.js is passed through a text processing pipeline. The pipeline is simply a stack of functions that perform some processing on the text. Pipeline functions act on the text one token at a time, and what they return is passed to the next function in the pipeline.

By default elasticlunr.js adds a stop word filter and a stemmer to the pipeline. You can also add your own processors or remove the default ones, depending on your requirements. The stemmer currently used is an English language stemmer, which could be replaced with a non-English language stemmer if required, or a Metaphoning processor could be added.

  var index = elasticlunr(function () {
    this.pipeline.add(function (token, tokenIndex, tokens) {
      // text processing in here
      return token; // pass the (possibly modified) token on to the next function
    });

    this.pipeline.after(elasticlunr.stopWordFilter, function (token, tokenIndex, tokens) {
      // text processing in here
      return token;
    });
  });
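
The pipeline idea itself can be shown with a few lines of plain JavaScript. This is a simplified sketch, not the library's implementation: `runPipeline`, `lowercase`, and `dropShort` are hypothetical names, and the rule that returning `undefined` or `null` drops a token is an assumption of the sketch.

```javascript
// Conceptual sketch only: a pipeline as an ordered list of token functions.
function runPipeline(functions, tokens) {
    var output = [];
    tokens.forEach(function (token, tokenIndex) {
        var current = token;
        for (var i = 0; i < functions.length; i++) {
            // Each function receives what the previous one returned.
            current = functions[i](current, tokenIndex, tokens);
            // In this sketch, returning undefined or null drops the token.
            if (current === undefined || current === null) {
                return;
            }
        }
        output.push(current);
    });
    return output;
}

function lowercase(token) {
    return token.toLowerCase();
}

function dropShort(token) {
    return token.length > 2 ? token : undefined;
}

console.log(runPipeline([lowercase, dropShort], ["Oracle", "DB", "Released"]));
// [ 'oracle', 'released' ]
```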

Tokenization

Tokenization is how elasticlunr.js converts documents and searches into individual tokens, ready to be run through the text processing pipeline and entered or looked up in the index.

The default tokenizer is designed to handle general English text well, although application-specific or language-specific tokenizers can be used instead.
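
For general English text, a tokenizer can be as simple as lowercasing the input and splitting it on whitespace and hyphens. The sketch below assumes exactly that behavior; `tokenize` is a hypothetical standalone function and may differ from the tokenizer that ships with the library.

```javascript
// Conceptual sketch only: lowercase the text, then split on whitespace and hyphens.
function tokenize(text) {
    if (text === undefined || text === null) {
        return [];
    }
    return String(text)
        .trim()
        .toLowerCase()
        .split(/[\s\-]+/)
        .filter(function (token) {
            // Discard empty strings left over from splitting.
            return token.length > 0;
        });
}

console.log(tokenize("Oracle released its latest database Oracle 12g"));
// [ 'oracle', 'released', 'its', 'latest', 'database', 'oracle', '12g' ]
console.log(tokenize("full-text  search"));
// [ 'full', 'text', 'search' ]
```

A language without spaces between words, or an application that must keep hyphenated product codes intact, would need a different splitting rule, which is where a custom tokenizer comes in.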

Stemming

Stemming increases the recall of the search index by reducing related words down to their stem, so that non-exact search terms still match relevant documents. For example, 'search', 'searching' and 'searched' all get reduced to the stem 'search'.

Elasticlunr.js automatically includes a stemmer based on Martin Porter's algorithms.
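
The following sketch shows the general idea of stemming with naive suffix stripping. It is deliberately much simpler than the Porter algorithm and will mis-stem many English words; `naiveStem` and the `SUFFIXES` list are hypothetical and only serve to reproduce the 'search' example above.

```javascript
// Conceptual sketch only: naive suffix stripping, NOT the Porter algorithm.
var SUFFIXES = ["ing", "ed", "es", "s"];

function naiveStem(token) {
    for (var i = 0; i < SUFFIXES.length; i++) {
        var suffix = SUFFIXES[i];
        // Keep at least three characters so short words are left alone.
        if (token.length - suffix.length >= 3 &&
                token.slice(-suffix.length) === suffix) {
            return token.slice(0, -suffix.length);
        }
    }
    return token;
}

console.log(["search", "searching", "searched", "searches"].map(naiveStem));
// [ 'search', 'search', 'search', 'search' ]
```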

Stop words filtering

Stop words are words that are very common and are not useful in differentiating between documents. These are automatically removed by elasticlunr.js. This helps to reduce the size of the index and improve search speed and accuracy.

The default stop word filter contains a large list of very common words in English. For best results, a corpus-specific stop word filter can also be added to the pipeline. The search algorithm already penalises more common words, but preventing them from entering the index at all can be very beneficial for both space and speed performance.
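
A stop word filter, including a corpus-specific one, can be sketched as a function that returns the token to keep it and `undefined` to drop it. The word lists below are tiny, hypothetical examples, and `makeStopWordFilter` and `applyFilter` are illustrative helpers rather than elasticlunr.js functions.

```javascript
// Conceptual sketch only: building stop word filters from word lists.
function makeStopWordFilter(words) {
    // Object.create(null) avoids accidental hits on inherited property names.
    var lookup = Object.create(null);
    words.forEach(function (word) {
        lookup[word] = true;
    });
    // Pipeline-style function: return the token to keep it, undefined to drop it.
    return function (token) {
        return lookup[token] ? undefined : token;
    };
}

function applyFilter(tokens, filter) {
    return tokens.filter(function (token) {
        return filter(token) !== undefined;
    });
}

// A tiny general-purpose list (real lists are much longer).
var englishFilter = makeStopWordFilter(["a", "an", "and", "its", "of", "the", "to"]);
// Hypothetical corpus-specific list: every document in this corpus mentions Oracle.
var corpusFilter = makeStopWordFilter(["oracle"]);

var tokens = ["oracle", "released", "its", "profit", "report", "of", "2015"];
var withoutEnglish = applyFilter(tokens, englishFilter);
var withoutBoth = applyFilter(withoutEnglish, corpusFilter);

console.log(withoutEnglish); // [ 'oracle', 'released', 'profit', 'report', '2015' ]
console.log(withoutBoth);    // [ 'released', 'profit', 'report', '2015' ]
```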