Sitecore 7 Performance Tuning Part 2

Tuning Performance of Sitecore 7 with Lucene

Key Takeaways:

  • Performance Tuning with Sitecore 7 is highly customizable.
  • Do yourself a favour and buy good hardware, as it is the cheapest performance gain. Buy an SSD and lots of RAM.
  • Different configurations favour either indexing or searching; you can't have both operating at optimal performance. Based on your requirements, you can have a system that is good at both or fantastic at one.
  • If your index is small enough, consider running the entire thing in RAM using an InMemoryLuceneIndex.
  • Understanding how Lucene works will make things easier, but it isn't required.
  • You can use the FillDB.aspx page to test whether a settings change makes things better or worse.

Sitecore 7 provides a lot of flexibility for tuning performance towards what you need to achieve. Although many of these recommendations apply to other providers as well, this post focuses specifically on Lucene.net.

Before tweaking Sitecore's implementation, there are a few provider-agnostic tweaks that we would suggest.

• Be sure you really need to speed things up! Sitecore 7 already ships with a framework that is quite fast at querying and indexing mid-to-large amounts of content. If you have an unusually small or extremely large content repository and you are experiencing non-optimal indexing or search speeds, then read on.

• For Lucene, use a local filesystem. Remote filesystems are prone to network contention, locking issues and are nowhere near as performant as a local disk.

• Get faster hardware, especially a faster IO system. If possible, use a solid-state disk (SSD). In production this may be a stretch, but an SSD will give you much better performance.

• Use as much RAM as you can afford. More RAM before flushing means Lucene writes larger segments to begin with, which means less merging later. It also opens up the possibility of running your entire index in memory with an InMemoryLuceneIndex by simply switching the index implementation in the Sitecore.ContentSearch.Lucene.Indexes.***.config files, as sketched after this list.

• Instead of indexing many small text fields, aggregate the text into the "content" field that is in the index by default.
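
A hedged sketch of what that in-memory switch might look like in one of those index config files (the assembly-qualified type name is an assumption; check the types that ship in your version's LuceneProvider assembly):

  <!-- Illustrative only: swap the standard Lucene index implementation for the in-memory one. -->
  <index id="sitecore_web_index"
         type="Sitecore.ContentSearch.LuceneProvider.InMemoryLuceneIndex, Sitecore.ContentSearch.LuceneProvider">
    <!-- the rest of the index definition (configuration, locations, strategies) stays unchanged -->
  </index>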


The guidance above is generic to all providers, and it may sound obvious to say "get better hardware", but for search provider performance, upgrading your hardware is the simplest win you can get.

If you have chosen Lucene.net as your provider, then the following section will help you tune your environment. If you chose SOLR instead, the following will still be of interest, but you would make these tunings in the solrconfig.xml file instead.

For Quicker Indexing

• Increase ContentSearch.IndexMergeFactor until you find the sweet spot. A larger merge factor defers the merging of segments until later, thus speeding up indexing. However, searching will be slower while merges are deferred, and the symptom of a merge factor that is too high is running out of file descriptors. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives. This is a classic case of "do you take the hit often but in small chunks, or less often but in bigger chunks?" Both options suit different requirements.

• Turn off any features you are not using. If you are storing fields but not using them at query time, don't store them. This is why Sitecore 7 does not store much by default, only a couple of fields. Likewise for term vectors: if you aren't using MoreLikeThis queries and similar features, do not store them. If you are indexing many fields, turning off norms for those fields may also help performance.

• Use a faster Analyzer. Sometimes analysis of a document takes a lot of time. For example, the StandardAnalyzer is quite time consuming, as it has a lot more to do than other Analyzers. Changing to the SimpleAnalyzer would speed up indexing, but at search time you might find it confusing as to why some results are not showing up. You can use Luke to see exactly how different Analyzers tokenize content (see the tokenization sketch after this list). You can use a different Analyzer for every single field within Sitecore, so this is where you get to have fun finding out which one is best for the job.

• When measuring performance, disregard the first query. The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and will therefore skew your results. To get around this, look at the Sitecore.Buckets.WarmupQueries.config file and place commonly run queries in it. Don't put too many in there, as it will increase the time Sitecore takes to start up. Likewise, you can hook into the QueryWarmup pipeline and run queries programmatically (see the processor sketch after this list). This way, when the warmed queries are run by a user, they are already cached.

• Don't iterate over more hits than needed. Iterating over all hits is slow and memory hungry. In Sitecore 7 we have implemented a TopDocCollector, which makes this problem disappear. If your search matches 1 million documents, we make sure that when you use LINQ extensions like Page(), or built-in IQueryable methods like Take() and Skip(), we only iterate the results for a single page. Projections (Select()) in your LINQ statements will make things faster if you don't need the complete documents but only some fields or properties; those fields use the FieldCache for fast access. A combined sketch follows this list.

• Consider using filters, i.e. Filter(), in your LINQ statements. It can be much more efficient to restrict results to a part of the index using a cached bit set filter than using a Where() clause. This is especially true for restrictions that match a small number of documents in a large index. Filters are typically used to restrict the results to a category, a user, or a role, but could in many cases replace any Where() clause. One difference between Where() and Filter() is that Where() has an impact on the score while Filter() does not. Filters do, however, not scale well with big document sets: if your filter would match millions of documents, we would recommend Where() rather than Filter(). Why? Because Filter() will load those million documents into a bit array and keep it in memory; although each document is only represented by a 0 or 1, this can still be quite large in memory. The combined sketch after this list shows Filter() and Where() side by side.

• Flush the index by RAM usage instead of document count (see the configuration below for this setting).
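
To make the Analyzer comparison concrete, here is a minimal, standalone sketch against plain Lucene.Net 3.x (the version Sitecore 7 ships with) that prints how two Analyzers tokenize the same text; it is the command-line equivalent of what Luke shows you:

  using System;
  using System.IO;
  using Lucene.Net.Analysis;
  using Lucene.Net.Analysis.Standard;
  using Lucene.Net.Analysis.Tokenattributes;

  class AnalyzerDemo
  {
      // Print each token the given Analyzer produces for the text.
      static void PrintTokens(Analyzer analyzer, string text)
      {
          TokenStream stream = analyzer.TokenStream("content", new StringReader(text));
          ITermAttribute term = stream.GetAttribute<ITermAttribute>();
          stream.Reset();
          while (stream.IncrementToken())
          {
              Console.Write(term.Term + " | ");
          }
          Console.WriteLine();
      }

      static void Main()
      {
          const string text = "Tuning the performance of Sitecore 7";
          // StandardAnalyzer: grammar-based tokenizing, lowercasing, stop-word removal.
          PrintTokens(new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), text);
          // SimpleAnalyzer: splits on non-letters and lowercases; faster but cruder.
          PrintTokens(new SimpleAnalyzer(), text);
      }
  }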
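
For the QueryWarmup pipeline mentioned above, a processor could look roughly like the following. This is a hedged sketch: the exact args type the pipeline passes is an assumption, and the index name and query are placeholders you would swap for your own hot queries.

  using System.Linq;
  using Sitecore.ContentSearch;
  using Sitecore.ContentSearch.SearchTypes;
  using Sitecore.Pipelines;

  public class WarmCommonQueries
  {
      // Assumed signature; check the pipeline definition in your configuration
      // for the args type it actually passes.
      public void Process(PipelineArgs args)
      {
          ISearchIndex index = ContentSearchManager.GetIndex("sitecore_web_index");
          using (var context = index.CreateSearchContext())
          {
              // Run a commonly executed query once so its caches are primed
              // before a real user pays the initialization cost.
              context.GetQueryable<SearchResultItem>()
                     .Where(item => item.TemplateName == "Sample Item")
                     .Take(10)
                     .ToList();
          }
      }
  }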
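
Finally, for the paging, projection and filtering points, a combined sketch (again hedged: the index name, the "Article" template restriction and the search term are illustrative assumptions):

  using System.Linq;
  using Sitecore.ContentSearch;
  using Sitecore.ContentSearch.Linq;          // Page() and Filter() extensions
  using Sitecore.ContentSearch.SearchTypes;

  public static class SearchExamples
  {
      public static void PagedFilteredSearch()
      {
          ISearchIndex index = ContentSearchManager.GetIndex("sitecore_web_index");
          using (var context = index.CreateSearchContext())
          {
              var page = context.GetQueryable<SearchResultItem>()
                                .Filter(i => i.TemplateName == "Article")  // cached bit set, no effect on score
                                .Where(i => i.Content.Contains("lucene"))  // the scored part of the query
                                .Page(0, 20)                               // first page of 20; only these hits are iterated
                                .Select(i => new { i.Name, i.ItemId })     // projection served from the FieldCache
                                .ToList();
          }
      }
  }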

For Quicker Searching

• Decrease ContentSearch.IndexMergeFactor. A smaller merge factor means fewer segments, so searching will be faster. Also, only retrieve documents for the current Page() the user will see, not all documents in the full result set: for each document retrieved, Lucene must seek to a different location in various files.

Now that you have had a small introduction to performance tuning with Sitecore 7, have a look at some of the dials you can tune in your config files below.

If you want to change the settings, we would suggest not changing them in the Sitecore config files themselves, but rather patching your changes in with another config file, e.g. one called Sitecore.Lucene.HighIndexingPerformance.config, as sketched below.
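
A minimal sketch of such a patch file, placed in App_Config/Include (the values shown are illustrative assumptions, not recommendations):

  <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
      <settings>
        <!-- Defer segment merging to favour indexing speed; tune to taste. -->
        <setting name="ContentSearch.IndexMergeFactor">
          <patch:attribute name="value">20</patch:attribute>
        </setting>
        <!-- Buffer more documents in RAM before flushing to disk. -->
        <setting name="ContentSearch.RamBufferSize">
          <patch:attribute name="value">1024</patch:attribute>
        </setting>
      </settings>
    </sitecore>
  </configuration>

Patching this way keeps your changes isolated from the stock config files, which makes upgrades and rollbacks much easier.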


  <!--  CALIBRATE MERGE SIZE BY DELETES
        Determines if the segment size of the index files should be calibrated by the number of deletes when choosing segments for merge.
  -->
  <setting name="ContentSearch.CalibrateSizeByDeletes" value="true" />

  <!--  CONCURRENT MERGE SCHEDULER MAX THREADS
        Determines the number of threads used for the merging of index segments.
  -->
  <setting name="ContentSearch.ConcurrentMergeSchedulerThreads" value="25" />

  <!--  INDEX MERGE FACTOR
        Determines when an index will merge its different segments.

        Increasing the merge factor increases the indexing speed, but only to a point. Higher values also use more RAM and if they’re set
        too high, they may cause your indexing process to run out of memory. Using larger merge factors defers the merging of segments until
        later, thereby speeding up indexing because merging is a large part of indexing. 

        However, this will slow down searching and you will run out of file descriptors if you make it too large. Values that are too large 
        may even slow down indexing because merging more segments at once means much more seeking on the hard drives.

        Do not make this lower than 2.

        Default value: 10
  -->
  <setting name="ContentSearch.IndexMergeFactor" value="10" />

  <!--  MAX LUCENE QUERY CLAUSE COUNT
        Boolean Max Clause Count. Increasing this value increases memory consumption. Only increase it if you need to run very large queries. 
        This setting allows you to increase or decrease the clause count for Lucene depending upon how big you think the queries could grow.

        Default value: 1024
  -->
  <setting name="ContentSearch.LuceneQueryClauseCount" value="1024" />

  <!--  MAX DOCUMENT BUFFER SIZE
        This setting determines the Lucene document buffer size. The buffer size indicates how many documents Lucene stores in RAM before
        writing to disk. 

        If you use a dedicated indexing server, set this to a high value. This will store more Lucene index documents in RAM before writing 
        to disk. This will decrease the rebuild time of your indexes.

        Default value: 10000 
        This value is not tuned for a dedicated indexing server.

        If ContentSearch.RamBufferSize is also set, whichever limit is reached first will trigger the flush (first come, first served).
  -->
  <setting name="ContentSearch.MaxDocumentBufferSize" value="10000" />

  <!-- MAX DOCUMENTS ADDED BEFORE MERGE
       Maximum number of documents before the provider will flush the documents to disk.

       Default is 10000 -> This is not tuned for a dedicated indexing server.
  -->
  <setting name="ContentSearch.MaxMergeDocs" value="10000" />

  <!-- MAX RAM USED BEFORE MERGE
       Maximum ram used before the provider will flush the documents to disk.

       Default is 512 -> This is not tuned for a dedicated indexing server.
  -->
  <setting name="ContentSearch.MaxMergeMB" value="512" />

  <!-- MIN RAM USED BEFORE MERGE
       Minimum ram used before the provider will flush the documents to disk.

       Default is 10 -> This is not tuned for a dedicated indexing server.
  -->
  <setting name="ContentSearch.MinMergeMB" value="10" />
  <!--  RAM BUFFER SIZE
        Specifies the RAM buffer size for your indexes (in MB).

        If you use a dedicated indexing server, set this to a high value. This will store more Lucene index documents in RAM before writing 
        to disk. This will decrease the rebuild time of your indexes.

        Default value: 512 (MB)
        This value is not tuned for a dedicated indexing server.
  -->

  <setting name="ContentSearch.RamBufferSize" value="512" />

  <!--  TERM INDEX INTERVAL
        Sets the interval between indexed terms. Large values cause less memory to be used by IndexReader, 
        but slows down random-access to terms. Small values make IndexReader use more memory and speed up random-access to terms.
        This parameter determines the amount of computation required per query term, regardless of the number of documents that contain 
        that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency 
        and position information can be processed. In a large index that contains query terms entered by the user, query processing time 
        is likely to be dominated not by term lookup but rather by processing the frequency and positional data. In a small index or when 
        many uncommon query terms are generated (for example, by wildcard queries), term lookup may become the dominant cost.

       In particular, numUniqueTerms/interval terms are read into memory by an IndexReader, and, on average, interval/2 terms must be scanned for each random term access.

        Default value: 256
                       This value is not tuned for a dedicated indexing server.
  -->
  <setting name="ContentSearch.TermIndexInterval" value="256" />

  <!--  COMPOUND FILES
        This setting lets you use a compound file. In a compound file, multiple files for each segment are merged into a single file when 
        segment creation is finished. This is done regardless of which directory is in use.

        Building the compound file format takes time during indexing (7-33%). However, disabling the compound file format greatly increases
        the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.

        Default value: false 
  -->
  <setting name="ContentSearch.UseCompoundFile" value="false" />

  <!--  WAIT FOR MERGES
        This setting will control how the index operates when the Merge Scheduler starts to merge all segment files. If true, it will block all 
        operations on the indexWriter until the segments have been merged.

        Default value: true 
  -->
  <setting name="ContentSearch.WaitForMerges" value="true" />

It is important to note that tuning these settings is an incremental process. We would suggest making an educated change to a setting, then running a full rebuild to see whether the change makes things faster or slower.

NOTE: Don't test your changes on small numbers of items; use the FillDB.aspx page to test with at least 100,000 items. On good hardware it will take you 8 seconds to build the items and only 1 minute to rebuild.

If you think this is highly configurable, just wait until the team starts talking about SOLR.

Dev Team