More Like/Related

Introduction

Using the MoreLike method, it is possible to find documents whose text content is "like" a given string. This functionality is typically used for, but not limited to, finding related documents or objects.

Examples

A simple example can look like this:

var searchResult = client.Search<BlogPost>()
    .MoreLike("guitar")
    .GetResult();

After invoking the MoreLike method we can customize the search query with a number of methods. For instance, if the index does not contain many documents with similar content, we will probably want to lower the minimum document frequency. This is the number of documents a word must occur in to be considered; words that occur in fewer documents are ignored. It defaults to five.

var searchResult = client.Search<BlogPost>()
    .MoreLike("guitar")
        .MinimumDocumentFrequency(1)
    .GetResult();

A full list of extension methods for customizing the query follows below. Before we look at those, let us look at an example of finding documents "related" to a given document. Assuming we have indexed two BlogPosts with similar content, we can search for documents similar to the first and expect to get the second back, using a query such as this:

// firstBlogPost and secondBlogPost are two indexed blog posts about guitars.
// The query below is expected to return secondBlogPost.
var searchResult = client.Search<BlogPost>()
    .MoreLike(firstBlogPost.Content)
        .MinimumDocumentFrequency(1)
    // Exclude the post we are comparing against from the results.
    .Filter(x => !x.Id.Match(firstBlogPost.Id))
    .GetResult();

Note that when issuing these types of queries it is usually a good idea to cache the result, as it is not likely to change very often, and even if it does, a delay of a few minutes might not matter.
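As a rough illustration of the caching advice, the following is a minimal sketch of a time-limited cache. ExpiringCache and its members are hypothetical names made up for this example and are not part of the client API; in a real application an existing caching facility would probably be a better fit.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper, not part of the client API: stores one value per key
// and recomputes it once the stored entry is older than the given lifetime.
public class ExpiringCache<TValue>
{
    private readonly ConcurrentDictionary<string, Tuple<DateTime, TValue>> entries =
        new ConcurrentDictionary<string, Tuple<DateTime, TValue>>();
    private readonly TimeSpan lifetime;
    private readonly Func<DateTime> clock;

    // The clock parameter exists only to make the class easy to test.
    public ExpiringCache(TimeSpan lifetime, Func<DateTime> clock = null)
    {
        this.lifetime = lifetime;
        this.clock = clock ?? (() => DateTime.UtcNow);
    }

    public TValue GetOrAdd(string key, Func<TValue> factory)
    {
        var now = clock();
        Tuple<DateTime, TValue> entry;
        if (entries.TryGetValue(key, out entry) && now - entry.Item1 < lifetime)
        {
            return entry.Item2; // Still fresh: the factory is not invoked.
        }

        var value = factory(); // Missing or expired: compute and store again.
        entries[key] = Tuple.Create(now, value);
        return value;
    }
}
```

A caller could then wrap the related-documents query in a call to GetOrAdd, using a key derived from the id of the source document and a lifetime of a few minutes, so that the search is only executed when no fresh entry exists.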

Customization methods

As the nature of the content can differ greatly between indexes and types, it is often a good idea to experiment with the settings available after invoking the MoreLike method. Below is a list of all methods that can be called to customize the query (with explanations from the Elasticsearch guide).

MinimumDocumentFrequency

The minimum number of documents a word must occur in. Words that do not occur in at least this many documents are ignored. Defaults to 5.

MaximumDocumentFrequency

The maximum number of documents a word may occur in. Words that occur in more than this many documents are ignored. Defaults to unbounded.

PercentTermsToMatch

The percentage of terms to match on. Defaults to 30 (percent).

MinimumTermFrequency

The minimum number of times a term must occur in the source text. Terms that occur less often are ignored. Defaults to 2.

MinimumWordLength

The minimum word length below which words will be ignored. Defaults to 0.

MaximumWordLength

The maximum word length above which words will be ignored. Defaults to 0, which means unbounded.

MaximumQueryTerms

The maximum number of query terms that will be included in any generated query. Defaults to 25.

StopWords

A list of words considered “uninteresting” and which will be ignored.
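To make the effect of these settings more concrete, here is a simplified sketch of how candidate terms might be picked from a source text. It is an illustration only and not the search engine's actual implementation: TermSelectionSketch, SelectTerms and the documentFrequencies dictionary are hypothetical names, terms are ranked by plain term frequency rather than by a real relevance score, and PercentTermsToMatch is not modeled. The default values mirror the ones listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified illustration only; not the search engine's actual algorithm.
public static class TermSelectionSketch
{
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    public static List<string> SelectTerms(
        string sourceText,
        IDictionary<string, int> documentFrequencies, // hypothetical: number of documents containing each word
        int minimumTermFrequency = 2,
        int minimumDocumentFrequency = 5,
        int maximumDocumentFrequency = int.MaxValue,  // unbounded
        int minimumWordLength = 0,
        int maximumWordLength = 0,                    // 0 means unbounded
        int maximumQueryTerms = 25,
        ISet<string> stopWords = null)
    {
        stopWords = stopWords ?? new HashSet<string>();

        return sourceText.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(word => word)
            // Drop words that are too rare in the source text.
            .Where(g => g.Count() >= minimumTermFrequency)
            // Drop words that are too short or too long.
            .Where(g => g.Key.Length >= minimumWordLength)
            .Where(g => maximumWordLength == 0 || g.Key.Length <= maximumWordLength)
            // Drop "uninteresting" words.
            .Where(g => !stopWords.Contains(g.Key))
            // Drop words that occur in too few or too many documents.
            .Where(g => IsWithinDocumentFrequency(
                g.Key, documentFrequencies, minimumDocumentFrequency, maximumDocumentFrequency))
            // Assumption: rank by term frequency, then alphabetically for stable output.
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Take(maximumQueryTerms)
            .Select(g => g.Key)
            .ToList();
    }

    private static bool IsWithinDocumentFrequency(
        string word, IDictionary<string, int> documentFrequencies, int minimum, int maximum)
    {
        int frequency;
        documentFrequencies.TryGetValue(word, out frequency); // unknown words count as 0
        return frequency >= minimum && frequency <= maximum;
    }
}
```

In this sketch, a word such as "guitar" that occurs in only three documents is dropped under the default minimum document frequency of 5, but becomes a query term once that setting is lowered to 1. This is the same adjustment made in the examples earlier on this page.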