
How to handle Accented Characters

Now that I am a local/official Dane (I went through an initiation phase of eating an entire raw herring and liquorice ice cream, and my sarcasm level increased by 3) and have finished my first stint of Danish language lessons, it is my duty - nay - honour to help solve the issue of search with languages that use accented characters, e.g. ø and æ.

German, Swedish, Danish, Norwegian, Icelandic, Turkish and Finnish are just some of the languages that need extra care in search when it comes to accented or diacritic characters.

There are many generic methods to solve this requirement. One is to simply replace each character with a single other character (æ to a), another is to embed character codes (e.g. &#230; for æ), and a third is character folding.

The first solution is not enough to solve the problem and would result in some queries bringing back items that are not expected. The second would work but is not as efficient on storage as character folding is.

Character folding is the idea of taking a character like æ and converting it to a plain ASCII equivalent such as ae. Luckily, Lucene.net gives us one in the form of the ASCIIFoldingFilter, which folds æ to ae and ø to o. Below is some code for an Analyzer that will help you with accented characters.

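// Namespaces assume Lucene.NET 3.0.x.
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Analyzer that folds accented characters to ASCII equivalents (e.g. æ -> ae).
// Note: this chain does not lower-case tokens or remove stop words.
public class AccentedAnalyzer : StandardAnalyzer
{
    private readonly Version MatchVersion;

    public AccentedAnalyzer(Version matchVersion) : base(matchVersion)
    {
        MatchVersion = matchVersion;
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Declared as TokenStream (not var) so each filter can wrap the previous stage.
        TokenStream result = new StandardTokenizer(MatchVersion, reader);
        result = new StandardFilter(result);
        result = new ASCIIFoldingFilter(result);
        return result;
    }
}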

You may also find it useful to have this for the LowercaseKeywordAnalyzer as well i.e. just inherit from LowercaseKeywordAnalyzer instead.

For Solr users, you get this for free as well! In schema.xml you can run your fields through the following filter factory: solr.ASCIIFoldingFilterFactory.
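As a rough sketch of what that might look like in schema.xml, the fragment below defines a field type whose analyzer chain ends in solr.ASCIIFoldingFilterFactory. The names text_folded and title_folded are made up for illustration, and the tokenizer and lower-case filter are just one plausible chain; adjust them to whatever your schema already uses.

```xml
<!-- Hypothetical field type: the name "text_folded" is an assumption, not from the post. -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element applies to both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds accented characters to ASCII equivalents, e.g. æ to ae. -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field using the type above. -->
<field name="title_folded" type="text_folded" indexed="true" stored="true"/>
```

Because the same analyzer runs at index time and query time, both the stored terms and the user's query are folded the same way, so a search with or without the accented characters should hit the same documents. Existing content needs to be reindexed after a change like this.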

It is worth noting that this is a simple solution. Languages are extremely complex to handle when it comes to search. There are also design and architectural decisions to be made for sites that cover languages needing special support, such as accented characters and morphological variation.

Mange Tak,

Tim