Introduction
Using the MoreLike method, it is possible to find documents whose text content is "like" a given string. This functionality is typically used for, but not limited to, finding related documents or objects.
Examples
A simple example can look like this:
C#
searchResult = client.Search<BlogPost>()
    .MoreLike("guitar")
    .GetResult();
After invoking the MoreLike method, we can customize the search query with a number of methods. For instance, if the index does not contain many documents with similar content, we will probably want to lower the minimum document frequency requirement. This setting makes the query ignore words that do not occur in at least that many documents, and it defaults to five.
C#
searchResult = client.Search<BlogPost>()
    .MoreLike("guitar")
    .MinimumDocumentFrequency(1)
    .GetResult();
A full list of extension methods for customizing the query follows below. But before we look at those, let us look at an example of finding documents "related" to a given document. Assuming we have indexed two BlogPosts with similar content, we can search for documents similar to the first and expect to find the second, using a query such as the one below. The filter excludes the first blog post itself from the result.
C#
var firstBlogPost =
var secondBlogPost =
searchResult = client.Search<BlogPost>()
    .MoreLike(firstBlogPost.Content)
    .MinimumDocumentFrequency(1)
    .Filter(x => !x.Id.Match(firstBlogPost.Id))
    .GetResult();
Note that when issuing these types of queries, it is usually a good idea to cache the result, as it is not likely to change very often, and even if it does, a delay of a few minutes might not matter.
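The caching advice above can be sketched with a small time-based cache. This is only an illustration under stated assumptions: TimedCache is a hypothetical helper that is not part of the Find API, and the findRelated delegate stands in for a real MoreLike query.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the Find API): keeps the result of an
// expensive lookup, such as a MoreLike query, for a fixed period of time.
public sealed class TimedCache<TKey, TValue>
{
    private readonly Dictionary<TKey, (TValue Value, DateTime Expires)> _entries =
        new Dictionary<TKey, (TValue Value, DateTime Expires)>();
    private readonly TimeSpan _lifetime;
    private readonly Func<DateTime> _now;
    private readonly object _gate = new object();

    // The clock can be injected, which makes expiry easy to test.
    public TimedCache(TimeSpan lifetime, Func<DateTime> now = null)
    {
        _lifetime = lifetime;
        _now = now ?? (() => DateTime.UtcNow);
    }

    // Returns the cached value if it has not expired; otherwise calls
    // load, stores the fresh value and returns it.
    public TValue GetOrAdd(TKey key, Func<TKey, TValue> load)
    {
        lock (_gate)
        {
            var now = _now();
            if (_entries.TryGetValue(key, out var entry) && entry.Expires > now)
            {
                return entry.Value;
            }

            var value = load(key);
            _entries[key] = (value, now + _lifetime);
            return value;
        }
    }
}

public static class CacheDemo
{
    public static void Main()
    {
        var calls = 0;
        // Stand-in for a MoreLike query keyed by blog post id (assumption).
        Func<int, string> findRelated = id =>
        {
            calls++;
            return "related to post " + id;
        };

        var cache = new TimedCache<int, string>(TimeSpan.FromMinutes(5));
        cache.GetOrAdd(42, findRelated);
        cache.GetOrAdd(42, findRelated); // served from the cache

        Console.WriteLine(calls); // prints 1
    }
}
```

A lifetime of a few minutes matches the reasoning above: related-content results rarely change, so a short delay before new matches appear is usually acceptable.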
Customization methods
As the nature of the content can differ greatly between indexes and types, it is often a good idea to experiment with the many settings available after invoking the MoreLike method. Below is a list of all methods that can be called to customize the query (with explanations from the Elasticsearch guide).
MinimumDocumentFrequency
Words that do not occur in at least this many documents will be ignored. Defaults to 5.
MaximumDocumentFrequency
The maximum number of documents in which a word may appear. Words that appear in more than this many documents will be ignored. Defaults to unbounded.
PercentTermsToMatch
The percentage of terms to match on. Defaults to 30 (percent).
MinimumTermFrequency
Terms that occur fewer times than this in the source document will be ignored. Defaults to 2.
MinimumWordLength
The minimum word length below which words will be ignored. Defaults to 0.
MaximumWordLength
The maximum word length above which words will be ignored. Defaults to unbounded (0).
MaximumQueryTerms
The maximum number of query terms that will be included in any generated query. Defaults to 25.
StopWords
A list of words considered “uninteresting” and which will be ignored.
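To make the interplay between these settings more concrete, here is a simplified model of how they might narrow down the words taken from the source text. It is a sketch based only on the descriptions above, not the actual Find or Elasticsearch implementation: MoreLikeSettings, TermSelection and the docFrequency lookup are hypothetical names, ranking candidates by term frequency is an assumption, and PercentTermsToMatch is left out because it applies when matching documents rather than when selecting terms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified, hypothetical model of the settings listed above.
public sealed class MoreLikeSettings
{
    public int MinimumDocumentFrequency = 5;
    public int MaximumDocumentFrequency = int.MaxValue; // unbounded
    public int MinimumTermFrequency = 2;
    public int MinimumWordLength = 0;
    public int MaximumWordLength = 0;                   // 0 means unbounded
    public int MaximumQueryTerms = 25;
    public HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);
}

public static class TermSelection
{
    // docFrequency: how many indexed documents contain each word
    // (a stand-in for statistics that the index would provide).
    public static List<string> SelectTerms(
        string text, IDictionary<string, int> docFrequency, MoreLikeSettings s)
    {
        var words = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        return words
            .GroupBy(w => w)
            .Where(g => g.Count() >= s.MinimumTermFrequency)
            .Where(g => g.Key.Length >= s.MinimumWordLength)
            .Where(g => s.MaximumWordLength == 0 || g.Key.Length <= s.MaximumWordLength)
            .Where(g => !s.StopWords.Contains(g.Key))
            .Where(g =>
            {
                // Words missing from the lookup count as occurring in 0 documents.
                docFrequency.TryGetValue(g.Key, out var df);
                return df >= s.MinimumDocumentFrequency
                    && df <= s.MaximumDocumentFrequency;
            })
            .OrderByDescending(g => g.Count())          // assumption: rank by term frequency
            .ThenBy(g => g.Key, StringComparer.Ordinal)
            .Take(s.MaximumQueryTerms)
            .Select(g => g.Key)
            .ToList();
    }
}

public static class TermSelectionDemo
{
    public static void Main()
    {
        var text = "guitar strings and guitar amps and guitar picks and strings";
        var docFrequency = new Dictionary<string, int>
        {
            { "guitar", 3 }, { "strings", 1 }, { "and", 10 }, { "amps", 2 }, { "picks", 2 }
        };

        var defaults = new MoreLikeSettings();
        defaults.StopWords.Add("and");
        // With the default minimum document frequency of 5, nothing is left.
        Console.WriteLine(TermSelection.SelectTerms(text, docFrequency, defaults).Count); // prints 0

        var relaxed = new MoreLikeSettings { MinimumDocumentFrequency = 1 };
        relaxed.StopWords.Add("and");
        Console.WriteLine(string.Join(", ",
            TermSelection.SelectTerms(text, docFrequency, relaxed))); // prints guitar, strings
    }
}
```

In this small model, the default minimum document frequency removes every candidate word, which illustrates why lowering it is useful for small indexes.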