Sitecore 7 Search Provider Part 1 - Manually Triggered Indexing

This is the first post in a series that explains how to build a search provider for Sitecore 7.

A search provider is used to drive search inside the Sitecore client and on published sites. By default, Sitecore 7 uses a Lucene search provider. A Solr search provider is also available. But you can integrate any search engine you want. Explaining how to do this is the goal of this series.

I have to start somewhere, so I'm going to start with manually triggered indexing.

XML Search Provider

I'm not going to use a real search engine in this series. Instead, I'm going to use .NET's XML API and XML files to represent one: the search provider will write to and read from XML files instead of an actual search engine.

Why am I going to describe how to build something you'll never actually use? So you can learn how to build your own search provider instead of just using one. I'm using XML in hopes that this example will:

  • Keep setup as simple as possible. No additional software needs to be installed in order to follow along at home.
  • Minimize the distraction of another system. There should be no confusion over what the XML API is and what the Sitecore API is.

This seems like a good idea at the moment. I think the real test will be when we get to performing searches. But that will come later. First we need to get content into the XML files. This is what the indexing process is all about.

Re-Index Tree Button

There are many reasons why items get re-indexed. I'm going to start with the most obvious: items are re-indexed when a user tells Sitecore to re-index the items.

In Sitecore 7, an individual item (and its descendants) can be re-indexed by using the Re-Index Tree button. This button is located in Content Editor, under the Developer strip.

(Screenshot: Indexing Tools)

Clicking this button starts a series of events that culminates in the Refresh method being called on each index:

  1. The command indexing:refreshtree is triggered.
  2. This command calls the RefreshTree method on the IndexCustodian class.
  3. This method returns a collection of jobs. Each job represents a call to the Refresh method on a specific index.

Create Visual Studio Project

This is where my work begins. I need to create a class that implements the Refresh method.

  1. Create a new Visual Studio project for .NET Framework 4.5.
  2. Add a reference to Sitecore.ContentSearch.dll.
  3. Add a reference to Sitecore.ContentSearch.Linq.dll.
  4. Add a reference to Sitecore.Kernel.dll.

Search Index

The Refresh method is defined in the ISearchIndex interface. This interface represents the search index. It provides a way for Sitecore to manage and use an index, regardless of the search engine. Before the Refresh method can be implemented, some supporting code is required.

  1. Add a class named XmlProviderIndex that implements the Sitecore.ContentSearch.ISearchIndex interface.
  2. Implement the members required by the interface. The methods that return void are empty. Methods that return values return null. Over the course of this and subsequent posts, each of these methods will be explained and implemented.
  3. Add a property FolderName to store the folder name. A file-based search provider needs to know the location of the files it interacts with.
    public string FolderName { get; private set; }
  4. Add a constructor.
    public XmlProviderIndex(string name, string folderName)
    {
      this.Name = name;
      this.FolderName = folderName;
    }
  5. Add a property IndexFilePath. This property will store the full path to the XML file that is used as the index.
    public string IndexFilePath { get; private set; }
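  6. Implement the Initialize method. This method is called when the index is added, and it sets the value of the IndexFilePath property. The code uses Path from System.IO, FileUtil from Sitecore.IO, and Settings from Sitecore.Configuration, so add the matching using directives.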
    public void Initialize()
    {
      string path = null;
      if (Path.IsPathRooted(this.FolderName))
      {
        path = this.FolderName;
      }
      else
      {
        path = FileUtil.MapPath(FileUtil.MakePath(Settings.IndexFolder, this.FolderName));
      }
      this.IndexFilePath = FileUtil.MakePath(path, "index.xml");
    }
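
The path logic in Initialize can be tried outside Sitecore. The following is a minimal sketch, not the code from the provider: it swaps Sitecore's FileUtil and Settings.IndexFolder for System.IO.Path and a hypothetical indexFolder argument, and the class and method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class IndexPathSketch
{
    // indexFolder stands in for Settings.IndexFolder after it has been mapped
    // to a physical path. It is an assumption made for this sketch.
    public static string ResolveIndexFilePath(string indexFolder, string folderName)
    {
        // A rooted folder name is used as-is; a relative one is placed
        // under the index folder.
        string folder = Path.IsPathRooted(folderName)
            ? folderName
            : Path.Combine(indexFolder, folderName);

        return Path.Combine(folder, "index.xml");
    }

    public static void Main()
    {
        string indexFolder = Path.Combine(Path.GetTempPath(), "indexes");
        Console.WriteLine(ResolveIndexFilePath(indexFolder, "xml_master_index"));
    }
}
```

Supporting rooted paths means the folderName parameter can point anywhere on disk, while a plain name such as xml_master_index keeps the index next to the other indexes.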

Crawlers

The Refresh method from the XmlProviderIndex class iterates through the crawlers, and for each crawler, calls the RefreshFromRoot method. Sitecore provides a base crawler which does everything I need. But I will create a custom class so I can include my own logging messages. I will cover crawlers in more detail in a later post. For now, what you need to know about crawlers is that they locate content in Sitecore that should be indexed.

  1. In the XmlProviderIndex class, add a property Crawlers. This property will store the crawlers for the index.
    public List<IProviderCrawler> Crawlers { get; private set; }
  2. Implement the method AddCrawler. This method allows crawlers to be added to the index, and it ensures the crawler is properly initialized.
    public void AddCrawler(IProviderCrawler crawler)
    {
      crawler.Initialize(this);
      this.Crawlers.Add(crawler);
    }
  3. Add the following line to the constructor. This ensures the Crawlers collection is available when the AddCrawler method accesses it.
    this.Crawlers = new List<IProviderCrawler>();
  4. Add a class named XmlDatabaseCrawler that inherits from Sitecore.ContentSearch.AbstractProviderCrawler.
  5. Override the Initialize method.
    public override void Initialize(ISearchIndex index)
    {
      base.Initialize(index);
      var msg = string.Format("[Index={0}] Initializing XmlDatabaseCrawler. DB:{1} / Root:{2}", index.Name, base.Database, base.Root);
      CrawlingLog.Log.Info(msg, null);
    }

Index Operations

The RefreshFromRoot method from the AbstractProviderCrawler uses an object that implements the interface IIndexOperations in order to update the crawler's index. This is where the logic for creating an XML representation of the content is implemented.

  1. Add a class named XmlIndexOperations that implements the Sitecore.ContentSearch.IIndexOperations interface.
  2. Implement the members required by the interface. The methods that return void are empty. Methods that return values return null. Over the course of this and subsequent posts, each of these methods will be explained and implemented.
  3. Add the following method. It creates the XDocument object that represents a Sitecore item. XDocument is in the System.Xml.Linq namespace and Item is in Sitecore.Data.Items.
    protected virtual XDocument GetDocument(IIndexable indexable)
    {
      var item = (Item)(indexable as SitecoreIndexableItem);
      var doc = new XDocument(
        new XElement("item",
          new XAttribute("id", item.ID.ToString()),
          new XAttribute("name", item.Name),
          new XAttribute("path", item.Paths.Path)
        )
      );
      return doc;
    }
  4. Implement the Update method. It gets an XDocument object and then passes that object to the update context, which is explained next.
    public void Update(IIndexable indexable, IProviderUpdateContext context, ProviderIndexConfiguration indexConfiguration)
    {
      var doc = GetDocument(indexable);
      context.UpdateDocument(doc, null, null);
    }
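
To see what GetDocument produces without a running Sitecore instance, here is a small sketch that builds the same element from plain strings. The BuildDocument helper and the sample ID, name, and path are hypothetical stand-ins for a real Sitecore item.

```csharp
using System;
using System.Xml.Linq;

public static class ItemDocumentSketch
{
    // Builds the same <item> element as GetDocument, but from plain strings
    // instead of a Sitecore Item (the values passed in are made up).
    public static XDocument BuildDocument(string id, string name, string path)
    {
        return new XDocument(
            new XElement("item",
                new XAttribute("id", id),
                new XAttribute("name", name),
                new XAttribute("path", path)));
    }

    public static void Main()
    {
        var doc = BuildDocument(
            "{11111111-2222-3333-4444-555555555555}",
            "Home",
            "/sitecore/content/Home");

        // Writes a single <item> element with id, name and path attributes.
        Console.WriteLine(doc);
    }
}
```

Each indexed item becomes one small document like this. The update context later merges these documents into the index file.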

Update Context

The IIndexOperations object is used to prepare content to be indexed. The IProviderUpdateContext object takes the prepared content and actually indexes it. In this example, instead of indexing the XML, the XML is saved in a file.

  1. Add a class named XmlProviderUpdateContext that implements the Sitecore.ContentSearch.IProviderUpdateContext interface.
  2. Implement the members required by the interface. The methods that return void are empty. Methods that return values return null. Over the course of this and subsequent posts, each of these methods will be explained and implemented.
  3. Add a field to store the index object.
    private readonly XmlProviderIndex _index;
  4. Implement the Index property.
    public ISearchIndex Index
    {
      get { return _index; }
    }
  5. Add a field to store the XML before it is saved.
    private List<XDocument> _updateDocs;
  6. Add a constructor.
    public XmlProviderUpdateContext(XmlProviderIndex index)
    {
      _index = index;
      _updateDocs = new List<XDocument>();
    }
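  7. Implement the UpdateDocument method. This holds the XML in memory until it is saved to disk. Anything that is not an XDocument is ignored.
    public void UpdateDocument(object itemToUpdate, object criteriaForUpdate, IExecutionContext executionContext)
    {
      var doc = itemToUpdate as XDocument;
      // Skip anything that is not an XDocument so that Optimize and Commit never see a null entry.
      if (doc != null)
      {
        _updateDocs.Add(doc);
      }
    }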
  8. Add the AddXmlToDocument method. This method either updates the existing XML or adds new XML to the document.
    protected virtual void AddXmlToDocument(XDocument doc1, XDocument doc2)
    {
      var itemIdValue = doc2.Root.Attribute("id").Value;
      var existingNode = doc1.Descendants("item").FirstOrDefault(i => i.Attribute("id").Value == itemIdValue);
      if (existingNode != null)
      {
        existingNode.ReplaceWith(doc2.Root);
      }
      else
      {
        doc1.Root.Add(doc2.Root);
      }
    }
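  9. Add the GetOrCreateIndexFile method. This method loads the index file if it already exists. Otherwise it makes sure the folder exists and returns a new, empty document with an items root element.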
    protected virtual XDocument GetOrCreateIndexFile()
    {
      XDocument doc = null;
      if (File.Exists(_index.IndexFilePath))
      {
        doc = XDocument.Load(_index.IndexFilePath);
      }
      else
      {
        var dirPath = Path.GetDirectoryName(_index.IndexFilePath);
        if (! Directory.Exists(dirPath))
        {
          Directory.CreateDirectory(dirPath);
        }
        doc = new XDocument(new XElement("items"));
      }
      return doc;
    }
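  10. Implement the Optimize method. This removes any duplicate items from the document collection: when several documents share an id, only the first one is kept. In such a simple example this step is not strictly needed, because Commit already replaces an existing entry that has the same id.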
    public void Optimize()
    {
      _updateDocs = _updateDocs.GroupBy(x => x.Root.Attribute("id").Value).Select(g => g.First()).ToList();
    }
  11. Implement the Commit method. This saves the XML in memory to disk.
    public void Commit()
    {
      var doc1 = GetOrCreateIndexFile();
      foreach (var doc2 in _updateDocs)
      {
        AddXmlToDocument(doc1, doc2);
      }
      doc1.Save(_index.IndexFilePath);
      _updateDocs.Clear();
    }
  12. In the XmlProviderIndex class, implement the CreateUpdateContext method.
    public IProviderUpdateContext CreateUpdateContext()
    {
      return new XmlProviderUpdateContext(this);
    }
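
The buffer, optimize, and commit steps above can be sketched without Sitecore or the file system. This is an illustration under assumptions, not the provider itself: the class name is made up, the index document lives in memory instead of index.xml, and UpdateDocument takes an XDocument directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public class InMemoryUpdateContextSketch
{
    private List<XDocument> _pending = new List<XDocument>();

    // Plays the role of index.xml; kept in memory so nothing touches disk.
    public XDocument IndexDocument { get; private set; }

    public InMemoryUpdateContextSketch()
    {
        IndexDocument = new XDocument(new XElement("items"));
    }

    // Buffers a document until Commit is called.
    public void UpdateDocument(XDocument doc)
    {
        _pending.Add(doc);
    }

    // Keeps one buffered document per id (the first one seen).
    public void Optimize()
    {
        _pending = _pending
            .GroupBy(d => (string)d.Root.Attribute("id"))
            .Select(g => g.First())
            .ToList();
    }

    // Replaces an existing <item> with the same id, or appends a new one.
    public void Commit()
    {
        foreach (var doc in _pending)
        {
            var id = (string)doc.Root.Attribute("id");
            var existing = IndexDocument.Root.Elements("item")
                .FirstOrDefault(e => (string)e.Attribute("id") == id);
            if (existing != null)
            {
                existing.ReplaceWith(doc.Root);
            }
            else
            {
                IndexDocument.Root.Add(doc.Root);
            }
        }
        _pending.Clear();
    }

    // Hypothetical helper that builds a one-item document for the demo.
    public static XDocument MakeItem(string id, string name)
    {
        return new XDocument(new XElement("item",
            new XAttribute("id", id),
            new XAttribute("name", name)));
    }

    public static void Main()
    {
        var context = new InMemoryUpdateContextSketch();
        context.UpdateDocument(MakeItem("a", "Home"));
        context.UpdateDocument(MakeItem("b", "About"));
        context.Commit();

        // Re-indexing item "a" replaces its entry instead of adding a second one.
        context.UpdateDocument(MakeItem("a", "Home renamed"));
        context.Optimize();
        context.Commit();

        Console.WriteLine(context.IndexDocument);
    }
}
```

Matching on the id attribute is what turns a re-index into an update rather than an append, so the file does not grow each time the button is clicked.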

Search Index Refresh

Now that all of the supporting classes have been created, the Refresh method on the search index needs to be implemented. Remember, this is the method that is triggered when the Re-Index Tree button is clicked.

  1. In the XmlProviderIndex class, implement the Refresh method.
    public virtual void Refresh(IIndexable indexableStartingPoint)
    {
      using (var context = this.CreateUpdateContext())
      {
        foreach (var crawler in this.Crawlers)
        {
          crawler.RefreshFromRoot(context, indexableStartingPoint);
        }
        context.Optimize();
        context.Commit();
      }
    }
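
The order of calls inside Refresh matters: every crawler runs first, then the buffered documents are de-duplicated and written once, and finally the context is disposed. The sketch below records that order using hypothetical stand-in types (ISketchContext, ISketchCrawler, and the recording classes are not Sitecore types).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for IProviderUpdateContext.
public interface ISketchContext : IDisposable
{
    void UpdateDocument(string doc);
    void Optimize();
    void Commit();
}

// Hypothetical stand-in for IProviderCrawler.
public interface ISketchCrawler
{
    void RefreshFromRoot(ISketchContext context, string startPath);
}

// Records every call so the sequence can be inspected afterwards.
public class RecordingContext : ISketchContext
{
    public readonly List<string> Calls = new List<string>();

    public void UpdateDocument(string doc) { Calls.Add("Update " + doc); }
    public void Optimize() { Calls.Add("Optimize"); }
    public void Commit() { Calls.Add("Commit"); }
    public void Dispose() { Calls.Add("Dispose"); }
}

// Pretends to crawl by sending a single document to the context.
public class RecordingCrawler : ISketchCrawler
{
    private readonly string _name;

    public RecordingCrawler(string name) { _name = name; }

    public void RefreshFromRoot(ISketchContext context, string startPath)
    {
        context.UpdateDocument(_name + ":" + startPath);
    }
}

public static class RefreshSketch
{
    // Same shape as the Refresh method: crawl, optimize, commit, dispose.
    public static void Refresh(IEnumerable<ISketchCrawler> crawlers, ISketchContext context, string startPath)
    {
        using (context)
        {
            foreach (var crawler in crawlers)
            {
                crawler.RefreshFromRoot(context, startPath);
            }
            context.Optimize();
            context.Commit();
        }
    }

    public static void Main()
    {
        var context = new RecordingContext();
        var crawlers = new List<ISketchCrawler> { new RecordingCrawler("master") };
        Refresh(crawlers, context, "/sitecore/content");
        Console.WriteLine(string.Join(", ", context.Calls));
    }
}
```

Committing once after all crawlers have finished means the index file is written a single time per refresh, not once per item.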

Configuration

I want my search provider to be configurable using config files. Sitecore 7 includes a number of classes that facilitate this.

  1. Add a class named XmlSearchConfiguration that inherits from ProviderIndexSearchConfiguration.
  2. Add a method named AddIndex. This method allows search indexes to be specified in a config file.
    public virtual void AddIndex(ISearchIndex index)
    {
      this.Indexes[index.Name] = index;
      index.Configuration = this.DefaultIndexConfiguration;
      index.Initialize();
    }
  3. Create a config file named Blog.Search.config. The purpose of the config file is to specify what content in Sitecore should be indexed, and which components should be used to do the indexing.
    <configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
      <sitecore>
        <contentSearch>
          <configuration type="Blog.Search.XmlSearchConfiguration, Blog.Search">
            <DefaultIndexConfiguration type="Sitecore.ContentSearch.ProviderIndexConfiguration, Sitecore.ContentSearch">
              <DocumentOptions type="Sitecore.ContentSearch.DocumentBuilderOptions, Sitecore.ContentSearch" />
              <IndexAllFields>true</IndexAllFields>
            </DefaultIndexConfiguration>
            <indexes hint="list:AddIndex">
              <index id="xml_master_index" type="Blog.Search.XmlProviderIndex, Blog.Search">
                <param desc="name">$(id)</param>
                <param desc="folderName">$(id)</param>
                <locations hint="list:AddCrawler">
                  <crawler type="Blog.Search.XmlDatabaseCrawler, Blog.Search">
                    <Database>master</Database>
                    <Root>/sitecore</Root>
                    <Operations type="Blog.Search.XmlIndexOperations, Blog.Search" />
                  </crawler>
                </locations>
              </index>
            </indexes>
          </configuration>
        </contentSearch>
      </sitecore>
    </configuration>
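
It may help to picture what Sitecore does with the index element of this file. As I understand it, the param elements become constructor arguments, $(id) expands to the value of the id attribute, child elements such as Database and Root are assigned to properties of the same name, and hint="list:AddCrawler" means each child object is passed to the AddCrawler method. The sketch below hard-codes that mapping for two hypothetical stand-in classes. Sitecore's own configuration factory does this generically through reflection, so treat this as an illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical stand-in for the crawler class.
public class SketchCrawler
{
    public string Database { get; set; }
    public string Root { get; set; }
}

// Hypothetical stand-in for the index class.
public class SketchIndex
{
    public string Name { get; private set; }
    public string FolderName { get; private set; }
    public List<SketchCrawler> Crawlers { get; private set; }

    public SketchIndex(string name, string folderName)
    {
        Name = name;
        FolderName = folderName;
        Crawlers = new List<SketchCrawler>();
    }

    public void AddCrawler(SketchCrawler crawler)
    {
        Crawlers.Add(crawler);
    }
}

public static class ConfigSketch
{
    public static SketchIndex BuildIndex(XElement indexNode)
    {
        string id = (string)indexNode.Attribute("id");

        // <param> values become constructor arguments, in document order.
        var args = indexNode.Elements("param")
            .Select(p => p.Value.Replace("$(id)", id))
            .ToArray();
        var index = new SketchIndex(args[0], args[1]);

        // hint="list:AddCrawler": build each child, then hand it to AddCrawler.
        var lists = indexNode.Elements()
            .Where(e => (string)e.Attribute("hint") == "list:AddCrawler");
        foreach (var list in lists)
        {
            foreach (var node in list.Elements())
            {
                index.AddCrawler(new SketchCrawler
                {
                    Database = (string)node.Element("Database"),
                    Root = (string)node.Element("Root")
                });
            }
        }
        return index;
    }

    public static void Main()
    {
        var node = XElement.Parse(
            @"<index id=""xml_master_index"">
                <param desc=""name"">$(id)</param>
                <param desc=""folderName"">$(id)</param>
                <locations hint=""list:AddCrawler"">
                  <crawler>
                    <Database>master</Database>
                    <Root>/sitecore</Root>
                  </crawler>
                </locations>
              </index>");

        var index = BuildIndex(node);
        Console.WriteLine("{0}: {1} crawler(s), root {2}",
            index.Name, index.Crawlers.Count, index.Crawlers[0].Root);
    }
}
```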

Deploy and Test

Assuming you have followed these instructions carefully, you can deploy and test your new - admittedly limited - search provider.

  1. Compile the project.
  2. Deploy Blog.Search.dll to your Sitecore server's bin folder.
  3. Deploy the Blog.Search.config file to your Sitecore server's Include folder.
  4. Open Content Editor.
  5. Select an item.
  6. Click the Re-Index Tree button.
  7. You should see a folder named xml_master_index in your Sitecore server's Data/indexes folder.
  8. You should see an XML file in the xml_master_index folder.
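
If you would rather check the result from code than open the file by hand, a small console sketch like the following counts the indexed items. The class name is made up, and the path to index.xml has to be supplied for your own installation.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class IndexFileCheck
{
    // Returns the number of <item> elements in the index file,
    // or -1 when the file does not exist.
    public static int CountItems(string indexFilePath)
    {
        if (!File.Exists(indexFilePath))
        {
            return -1;
        }
        return XDocument.Load(indexFilePath).Root.Elements("item").Count();
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Pass the full path to index.xml as the first argument.");
            return;
        }
        Console.WriteLine(CountItems(args[0]));
    }
}
```

Re-indexing the same item a second time should leave the count unchanged, because existing entries are replaced rather than appended.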

Conclusion

It may seem like there are a lot of moving parts involved with building a search provider. There are. But hopefully this introduction has left you feeling like it is manageable. Please let me know what you think of the new API.

My next post will cover indexing that is triggered by item lifecycle events, such as changing an item.

  • Thanks Adam, nice post. Note: to make this sample work on Sitecore 7.0 rev. 130424, the constructor should accept one more parameter, e.g.:
    public XmlProviderIndex(string name, string folderName, IIndexPropertyStore propertyStore)
    {
      this.Name = name;
      this.FolderName = folderName;
      this.PropertyStore = propertyStore;
      this.Crawlers = new List<IProviderCrawler>();
    }

  • Don't we need to reset (delete) the index before a refresh (Re-Index Tree)? I don't see a reset in Lucene's implementation either. I am just wondering: if some items are deleted from the database and the index is not synced, a manual Re-Index Tree will only add or update entries, not delete them. Won't that leave orphaned entries in the index?