Understanding Analyzers and Sitecore 7

When you first start using Sitecore 7, you may be impressed with how easy the new content search API is to use. If you look at some of the configuration files, you may also notice how many options there are, and how well the out-of-the-box settings work.

But sooner or later you will run into situations that require you to change those settings. What if you have a field on an item that you want to be treated as a single value, even if the value consists of multiple words?

This is especially important when you are searching specific fields for a specific value. When someone searches for documents that are tagged with the location "New York", you want that treated as a single word. You don't want the search to find every document that is tagged with a location that contains either "New" or "York".

In order to configure Sitecore 7 to work this way, you need to understand analyzers. The purpose of this post is to introduce analyzers by explaining how they are used at index-time and search-time.

What is an analyzer?

In order to index content, a search engine needs to be told what to index. This involves breaking up text into words (or tokens).

This process of breaking up text into tokens is called tokenization. A search engine will often provide an administrator with the ability to specify how text should be tokenized. With Lucene (and Solr), tokenization is controlled by components called analyzers.

In addition to tokenizing, an analyzer may perform other tasks:

  • Convert the characters to lower-case in order to support content that is not case-sensitive.
  • Remove common words that are irrelevant to the search. These are called stop words. In English, this includes words like "a", "not", "on" and "the".
  • Transform a word into its root form in order to support a wider range of matching. In English, this would allow a search for "playing" to match "play", "player" and "played". This is a process called stemming.
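
The first two tasks in the list above can be sketched in plain C# without Lucene. This is a simplified illustration, not Lucene.Net's implementation: the SimpleAnalysis class, its Analyze method and the four-word stop list are hypothetical names and values chosen for this example. Stemming is left out because a real stemmer is considerably more involved.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical, simplified analyzer. NOT the Lucene.Net implementation.
public static class SimpleAnalysis
{
    // A tiny stop-word list, assumed for illustration only.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "not", "on", "the" };

    public static List<string> Analyze(string text)
    {
        return Regex.Matches(text, @"\p{L}+")            // tokenize: maximal runs of letters
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())     // lower-case each token
            .Where(token => !StopWords.Contains(token))  // drop stop words
            .ToList();
    }
}
```

With this sketch, SimpleAnalysis.Analyze("The Cat sat on a mat") returns the tokens "cat", "sat" and "mat".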

What are the components that make up an analyzer?

An analyzer depends on other components to do its work.

The analyzer's tokenizer is responsible for breaking up a value into tokens. The following are some of the common tokenizers that are used:

  • Lucene.Net.Analysis.KeywordTokenizer - Emits the entire input as a single token.
  • Lucene.Net.Analysis.LetterTokenizer - Divides text at non-letters. That is to say, it defines tokens as maximal strings of adjacent letters, as defined by the System.Char.IsLetter() predicate.
  • Lucene.Net.Analysis.Standard.StandardTokenizer - Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token. Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split. Recognizes email addresses and internet hostnames as one token.
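
To make the difference between the first two tokenizers concrete, here is a rough imitation in plain C#. The TokenizerSketch class and its two methods are hypothetical helpers written for this post; they only mimic the behaviour described above and are not the Lucene.Net classes.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helpers that imitate two tokenization strategies.
public static class TokenizerSketch
{
    // Keyword-style: the entire input becomes a single token.
    public static List<string> KeywordTokens(string text)
    {
        return new List<string> { text };
    }

    // Letter-style: a token is a maximal run of adjacent letters.
    public static List<string> LetterTokens(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A non-letter ends the current token.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }

        // Emit the final token if the text ended on a letter.
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }

        return tokens;
    }
}
```

For the input "New York", KeywordTokens produces one token ("New York") while LetterTokens produces two ("New" and "York"). That is the distinction that matters for the location example at the start of this post.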

The analyzer's filters transform the tokens. Common filters include LowerCaseFilter, StopFilter, StandardFilter and SynonymFilter.

Custom tokenizers and filters can be created using the Lucene API.

What analyzers are available for Sitecore 7?

Sitecore 7 includes a number of analyzers. The following are some of the common analyzers that are used:

Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer

  • Tokenizer: KeywordTokenizer
  • Filters: LowerCaseFilter

Lucene.Net.Analysis.SimpleAnalyzer

  • Tokenizer: LetterTokenizer
  • Filters: LowerCaseFilter

Lucene.Net.Analysis.Standard.StandardAnalyzer

  • Tokenizer: StandardTokenizer
  • Filters: LowerCaseFilter, StopFilter

Sitecore.ContentSearch.LuceneProvider.Analyzers.SynonymAnalyzer

  • Tokenizer: StandardTokenizer
  • Filters: LowerCaseFilter, StopFilter, StandardFilter, SynonymFilter

Custom analyzers can be created using the Lucene API.

How does the search provider determine which analyzer to use?

Analyzers are used at both index-time and query-time. At index-time the analyzer is used to determine what will be indexed. At query-time, the analyzer does the same thing. The search text is transformed into the same format it would be if it were being indexed. This way, matches can be made with content that has been indexed.
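
The idea that the same analysis runs at index-time and at query-time can be sketched with a toy in-memory index. Everything here is hypothetical: TinyIndex is a stand-in for a real search index, the analyzer is just a delegate, and requiring every query token to match is an assumption of this sketch rather than a statement about how Lucene scores results.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory index that applies one analyzer on both sides.
public class TinyIndex
{
    private readonly Func<string, IEnumerable<string>> analyzer;
    private readonly Dictionary<string, HashSet<int>> postings =
        new Dictionary<string, HashSet<int>>();

    public TinyIndex(Func<string, IEnumerable<string>> analyzer)
    {
        this.analyzer = analyzer;
    }

    // Index-time: analyze the value and record which document holds each token.
    public void Add(int docId, string value)
    {
        foreach (var token in analyzer(value))
        {
            HashSet<int> docs;
            if (!postings.TryGetValue(token, out docs))
            {
                docs = new HashSet<int>();
                postings[token] = docs;
            }
            docs.Add(docId);
        }
    }

    // Query-time: analyze the search text the same way, then look the tokens up.
    public HashSet<int> Search(string text)
    {
        HashSet<int> result = null;
        foreach (var token in analyzer(text))
        {
            HashSet<int> docs;
            if (!postings.TryGetValue(token, out docs))
            {
                return new HashSet<int>(); // a token with no postings means no match
            }
            if (result == null)
            {
                result = new HashSet<int>(docs);
            }
            else
            {
                result.IntersectWith(docs); // assumption: every token must match
            }
        }
        return result ?? new HashSet<int>();
    }
}
```

If the index is created with a lower-casing keyword-style delegate, such as new TinyIndex(text => new[] { text.ToLowerInvariant() }), and documents 1 ("New York") and 2 ("New Jersey") are added, then Search("NEW YORK") finds only document 1. The search text was reduced to "new york", which is exactly the token that was stored at index-time.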

Consider the following code:

var index = ContentSearchManager.GetIndex("sitecore_master_index");
using (var context = index.CreateSearchContext())
{
  var query = context.GetQueryable<MyItem>()
               .Where(item => item.Language == "en")
               .Where(item => item.TemplateName == "City")
               .Where(item => item.State == "New York");
}

The search provider must convert the LINQ expression into the following Lucene search expression:

+state:new york +(+_templatename:city +_language:en)

When the query is executed, analyzers are responsible for treating "new york" as a single value. So when the query is built, the search provider must be able to determine which analyzer to use for each field.

The fieldmap (described in a previous post) is used to define which fields should be indexed and how they should be indexed. The analyzer is an important part of this. Each field can have its own analyzer specified.

You can find examples of how an analyzer can be explicitly specified in Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config, at the following path:

/configuration/sitecore/contentSearch/configuration/DefaultIndexConfiguration/fieldMap/fieldNames

If an analyzer has not been explicitly set for a field, the default analyzer is used. Each index has its own default analyzer. You can find the default analyzer in Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config, at the following path:

/configuration/sitecore/contentSearch/configuration/DefaultIndexConfiguration/Analyzer
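
The lookup rule described above (use the field's analyzer if one is set, otherwise fall back to the index default) can be sketched as follows. FieldAnalyzerMap is a hypothetical class written for this post; it illustrates the rule only and is not how the Sitecore search provider is implemented.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup that mirrors the idea of a field map with a default analyzer.
public class FieldAnalyzerMap
{
    private readonly Func<string, string[]> defaultAnalyzer;
    private readonly Dictionary<string, Func<string, string[]>> perField =
        new Dictionary<string, Func<string, string[]>>();

    public FieldAnalyzerMap(Func<string, string[]> defaultAnalyzer)
    {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Explicitly assign an analyzer to one field.
    public void Set(string fieldName, Func<string, string[]> analyzer)
    {
        perField[fieldName] = analyzer;
    }

    // Use the field's own analyzer when one is set, otherwise the default.
    public string[] Analyze(string fieldName, string value)
    {
        Func<string, string[]> analyzer;
        if (!perField.TryGetValue(fieldName, out analyzer))
        {
            analyzer = defaultAnalyzer;
        }
        return analyzer(value);
    }
}
```

For example, with a default of v => v.ToLowerInvariant().Split(' ') and the "state" field set to v => new[] { v.ToLowerInvariant() }, analyzing "New York" for "state" gives the single token "new york", while analyzing it for any other field gives "new" and "york".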

What if I am not using Lucene or Solr? What analyzers are available to me?

The answer depends on your search engine. Most search engines will have a component that corresponds to Lucene's analyzer.

  • Thanks Adam.  What if we DO want to find any items that contain either "New" or "York"? You say that the default Analyzer is set to StandardAnalyzer, however, a search for .Where(item => item.Content == "New York") does not create an implicit "OR" between the terms as you are suggesting... at least not for me.  -Derek  

  • Like any LINQ statement, you can expand on what it is you want, e.g., .Where(i => i.Content == "New" || i.Content == "York"), or you can use an IList, such as an array or List<T>, e.g., .Where(i => (new [] {"New","York"}).Contains(i.Content)).  =)

  • I am using the configuration below, but the value is not being lower-cased:

    <field fieldName="kitchen position" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">
      <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />
    </field>

  • If I use the default analyzer for a field with spaces in it, searching for "New York" will match "New xxx" or "neW.xxx" or "nEw-xxx" because it's set to TOKENIZED. My input is lower-cased for the lucene query (+state:new york) but it doesn't matter because the search is case insensitive.  If I want to match the exact string "New York" and nothing else, I need to set it to UNTOKENIZED. However, I've found that when I do this my query is always lower-cased as before (+state:new york) and the query becomes case sensitive. I.e. "new york" != "New York" so it never matches. I've tried every kind of analyzer and it doesn't make any difference.  Does anyone know how I can get around this? Surely it's a bug? I can't lowercase the value going into the index and I can't search for an uppercase value because my query gets auto lower-cased.

  • I have a field name called "Text" that doesn't get read in the Lucene Search. Are there any such special keywords that would not work with the search?