Sitecore 7: Indexing Media with IFilters

This blog post describes the new features to support search indexing using IFilters in version 7 of the Sitecore ASP.NET web Content Management System (CMS). IFilters act as plugins for full-text search engines to allow indexing of text in binary file formats such as PDF. Before you read this blog post, please read the Sitecore 7: Introduction blog post linked in the list of resources at the end of this page.

Sitecore 7 ships configured to use IFilters to index text in the binary content of media items. To use this feature, you must install IFilters for the types of media items that you want your solution to index. You can use software such as the free IFilter Explorer from Citeknet to investigate the IFilters installed on your system.

If the system hosting a Sitecore solution does not have an IFilter for a given media type, Sitecore can only index the metadata stored in that media item, not its binary content. Additionally, whether search results include a media item can depend on how that media item encodes its data. For example, an IFilter may not be able to convert images of text within a media item into text that the index can parse. Finally, you must install IFilters on the relevant hosts in your production environments (both content management and content delivery); having an IFilter installed in a development environment will not allow indexing of that data type in your production environments.

Because of the following element within the /configuration/sitecore/contentSearch/DefaultIndexConfiguration element in the Web.config file (technically, the /App_Config/Include/Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config Web.config include file), Sitecore uses the field named _content in each search index to manage text retrieved using available IFilters for supported types of media items:

<field fieldName="_content" storageType="no" indexType="tokenized">Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor,Sitecore.ContentSearch</field>

As you probably guessed, this uses a computed index field as described in a previous blog post linked in the Resources section at the end of this page.

To change the types of media items that Sitecore uses IFilters to index, you can uninstall the relevant IFilter(s) for those types of files. Alternatively, for example to control more precisely which media items Sitecore uses IFilters to index, you can override the class specified by this element in the Web.config file. The constructor of this class configures which specific MIME types and file name extensions to index with IFilters. Such approaches can improve indexing performance by preventing Sitecore from reading the binary streams for media types you do not want to index. For example, by default, Sitecore 7 does not try to use IFilters to process video and other types of data that are unlikely to contain text.
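As a rough sketch of the override approach, the element shown earlier could point at your own class instead of the default one. The class name MyCompany.Search.CustomMediaContentExtractor and the assembly name MyCompany.Search are hypothetical placeholders, not Sitecore types; everything else repeats the default element unchanged.

```xml
<!-- Assumption: MyCompany.Search.CustomMediaContentExtractor is a hypothetical class
     in a hypothetical MyCompany.Search assembly that replaces the default
     MediaItemContentExtractor and limits which MIME types and extensions
     are passed to IFilters. -->
<field fieldName="_content" storageType="no" indexType="tokenized">MyCompany.Search.CustomMediaContentExtractor,MyCompany.Search</field>
```

Only the type reference changes, so the index still exposes the extracted text through the same _content field.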

Some IFilters may require a value of true for the impersonate attribute of the /configuration/system.web/identity element in the /Web.config file. Because this element is outside of the /configuration/sitecore element, you cannot make this change with a Web.config include file.
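For illustration, the change described above would look roughly like the following fragment in the actual /Web.config file. It assumes that your file has no other attributes on the identity element that you need to preserve; merge it into the existing system.web section rather than adding a second one.

```xml
<configuration>
  <system.web>
    <!-- Enables ASP.NET impersonation; only needed if an installed IFilter requires it. -->
    <identity impersonate="true" />
  </system.web>
</configuration>
```

Impersonation changes the identity under which requests run, so it is worth testing this in a non-production environment first.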

Resources

  • Hi, we're currently trying to implement IFilters with Sitecore 6.6 on Azure (PaaS), but it doesn't seem to be easy. Will your implementation support Azure? Any idea of how to achieve it? Thanks!

  • Is there any follow-up on this one? Apparently IFilter 11 does not work with Sitecore 7.5. The only recommendation that I found is to revert to IFilter 9, but unfortunately I cannot find it.

    ManagedPoolThread #4 17:17:35 ERROR Could not compute value for ComputedIndexField: _content for indexable: sitecore://master/{7E5F66DF-2A4E-448F-B8DF-656BE6D4DA19}?lang=en&ver=1
    Exception: System.Runtime.InteropServices.COMException
    Message: Error HRESULT E_FAIL has been returned from a call to a COM component.
    Source: Sitecore.ContentSearch
       at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.IClassFactory.CreateInstance(Object pUnkOuter, Guid& refiid, Object& ppunk)
       at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadFilterFromDll(String dllName, String filterPersistClass)
       at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterLoader.LoadAndInitIFilter(String fileName, String extension)
       at Sitecore.ContentSearch.Extracters.IFilterTextExtraction.FilterReader..ctor(String fileName)
       at Sitecore.ContentSearch.ComputedFields.MediaItemIFilterTextExtractor.ComputeFieldValue(IIndexable indexable)
       at Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor.ComputeFieldValue(IIndexable indexable)
       at Sitecore.ContentSearch.LuceneProvider.LuceneDocumentBuilder.AddComputedIndexFields()

  • Sorry, I have not worked with iFilters since I wrote this post. If you file a support case, please comment here with its ID.

  • Instead of installing an IFilter on the server, we can also write custom code (i.e., a crawler) to extract the content for a MIME type by defining it:

    <mimeType type="......">text/html</mimeType>

    where type points to a .NET type implementing IComputedIndexField.

    Refer to community.sitecore.net/.../media-content-indexing-updates-in-sitecore-7-2

    For PDFs, a custom crawler can be written using iTextSharp for PDF extraction.

    Refer to sitecorecommerce.wordpress.com/.../