Media Content Indexing Updates in Sitecore 7.2

In Sitecore 7.2 we've changed the way in which media content is extracted to give you more control over what gets indexed and how.

In Sitecore 7+, media content is extracted using a computed field. This is necessary because Sitecore stores the media content itself as a blob (binary large object), which would not be ideal to push into the search provider. In any case, the binary data is in the format of the file type and must be decoded to extract the text we're interested in, hence the computed field.

Computed fields were introduced in Sitecore 7, with the Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor class (in the Sitecore.ContentSearch assembly) handling the extraction of the media item's content. The original version of the class looked first for an installed IFilter that handles the MIME type of the media item, then fell back to one that handles the file name extension.

The upshot was that every media type was extracted (wherever an IFilter was found). This approach was chosen because it required the least configuration and should "just work". However, customer feedback since the release of Sitecore 7 has indicated that this isn't ideal and is much heavier than some would like. They'd prefer to extract content for only a small selection of media types.

To address these concerns, in one of the Sitecore 7.0 updates we changed how the MediaItemContentExtractor was configured so that it processed only the top 20 media types. However, customers still wanted more control over which media types got extracted and indexed. Some of the feedback we received even stated that the customer wanted to change which media types were extracted at different times of the year. This made it clear that a static list of media types was never going to cut it, as everyone's requirements were going to be different.

In Sitecore 7.2 the MediaItemContentExtractor lets you configure which MIME types and file name extensions to include in or exclude from content extraction. For example, you may want to index the content of Microsoft PowerPoint (ppt and pptx) files and PDF files, but exclude Microsoft Excel (xls and xlsx) files.

The MediaItemContentExtractor class now reads configuration from its definition element in the configuration XML. Let's take a look at the default configuration. This is from a Lucene index, but the same can be used with Solr:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
    <sitecore>
        <contentSearch>
            <indexConfigurations>
                <defaultLuceneIndexConfiguration>
                    ...
                    <fields hint="raw:AddComputedIndexField">
                        ...
                        <field fieldName="_content" type="Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor, Sitecore.ContentSearch">
                            <mediaIndexing ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration/mediaIndexing"/>
                        </field>
                        ...
                    </fields>
                    ...
                </defaultLuceneIndexConfiguration>
            </indexConfigurations>
        </contentSearch>
    </sitecore>
</configuration>
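
For Solr, the same field definition should carry over. The fragment below is only a sketch: it assumes the Solr provider's configuration element is named defaultSolrIndexConfiguration and that a matching mediaIndexing element sits beneath it. Neither name appears in this post, so check both against your own Solr configuration file.

<defaultSolrIndexConfiguration>
    ...
    <fields hint="raw:AddComputedIndexField">
        ...
        <!-- Assumed Solr counterpart: only the configuration element name in the ref path differs -->
        <field fieldName="_content" type="Sitecore.ContentSearch.ComputedFields.MediaItemContentExtractor, Sitecore.ContentSearch">
            <mediaIndexing ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration/mediaIndexing"/>
        </field>
        ...
    </fields>
    ...
</defaultSolrIndexConfiguration>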

As you can see, the MediaItemContentExtractor computed field now contains a child element referencing another section of the configuration. Let's take a look at the new configuration element:

<mediaIndexing hint="skip">
    <mimeTypes>
        <excludes>
            <mimeType>*</mimeType>
        </excludes>
        <includes>
            <mimeType>application/pdf</mimeType>
            <mimeType type="Sitecore.ContentSearch.ComputedFields.MediaItemHtmlTextExtractor, Sitecore.ContentSearch">text/html</mimeType>
            <mimeType>text/plain</mimeType>
        </includes>
    </mimeTypes>
    <extensions>
        <excludes>
            <extension>*</extension>
        </excludes>
        <includes>
            <extension>rtf</extension>
            <extension>odt</extension>
            <extension>doc</extension>
            <extension>dot</extension>
            <extension>docx</extension>
            <extension>dotx</extension>
            <extension>docm</extension>
            <extension>dotm</extension>
            <extension>xls</extension>
            <extension>xlt</extension>
            <extension>xla</extension>
            <extension>xlsx</extension>
            <extension>xlsm</extension>
            <extension>xltm</extension>
            <extension>xlam</extension>
            <extension>xlsb</extension>
            <extension>ppt</extension>
            <extension>pot</extension>
            <extension>pps</extension>
            <extension>ppa</extension>
            <extension>pptx</extension>
            <extension>potx</extension>
            <extension>ppsx</extension>
            <extension>ppam</extension>
            <extension>pptm</extension>
            <extension>potm</extension>
            <extension>ppsm</extension>
        </includes>
    </extensions>
</mediaIndexing>

As you can see from the above configuration, the defaults cover all the popular formats. We chose to implement the configuration of the MediaItemContentExtractor in configuration files rather than in the CMS (through items) to allow for easier deployment.

The mimeTypes node controls which MIME types are included or excluded from content extraction while the extensions node handles which file name extensions are included or excluded from content extraction.

Note the text/html MIME type inclusion node in the above configuration. Unlike the other entries, this one includes a type attribute. This attribute can reference any .NET type implementing the Sitecore.ContentSearch.ComputedFields.IComputedIndexField interface, just like any other entry in the <fields hint="raw:AddComputedIndexField"> section. In other words, it is itself a computed field. This allows you to implement your own custom logic to extract content from a specific media file type. If an entry doesn't contain a type attribute, then the IFilters will be used. For more information on configuring your IFilters, check our previous post "Why does my IFilter not work". The type attribute can be supplied on entries in either the mimeTypes node or the extensions node.
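
To illustrate the type attribute on an extension entry rather than a MIME type entry, here is a minimal sketch. The type MyCompany.Search.CsvContentExtractor and its assembly are hypothetical names made up for this example; you would substitute your own IComputedIndexField implementation.

<extensions>
    <excludes>
        <extension>*</extension>
    </excludes>
    <includes>
        <!-- Hypothetical custom extractor: csv files are handled by this type instead of an IFilter -->
        <extension type="MyCompany.Search.CsvContentExtractor, MyCompany.Search">csv</extension>
        <!-- No type attribute, so an installed IFilter is used -->
        <extension>docx</extension>
    </includes>
</extensions>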

The wildcard (*) character determines whether the configuration uses whitelisting or blacklisting. For whitelisting, we add the wildcard to the excludes section, which excludes everything by default and only processes entries in the includes section. For blacklisting, we add the wildcard to the includes section, which includes everything by default and only skips entries in the excludes section.

But what if you'd like your Sitecore 7.2 solution to handle media the same way as previous Sitecore 7+ solutions without the need for additional configuration when new file types are added? After all, content authors may not tell you when they upload a new file type to the media library (I just hope you have an IFilter already installed to handle it). The following mediaIndexing configuration will include all file types.

<mediaIndexing hint="skip">
    <mimeTypes>
        <includes>
            <mimeType>*</mimeType>
        </includes>
    </mimeTypes>
</mediaIndexing>

And now, if we'd like to exclude a single file type, we can add an entry to the excludes section. The following sample will exclude all PDFs from having their content extracted.

<mediaIndexing hint="skip">
    <mimeTypes>
        <includes>
            <mimeType>*</mimeType>
        </includes>
        <excludes>
            <mimeType>application/pdf</mimeType>
        </excludes>
    </mimeTypes>
</mediaIndexing>
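
The same pattern should extend to more than one type. As a sketch of the earlier scenario (index PowerPoint and PDF content but skip Excel), the configuration below excludes the two standard Excel MIME types. It assumes your media library assigns these MIME types to xls and xlsx files, which is worth confirming on a few media items before relying on it.

<mediaIndexing hint="skip">
    <mimeTypes>
        <includes>
            <mimeType>*</mimeType>
        </includes>
        <excludes>
            <!-- Standard MIME type for legacy Excel workbooks (xls) -->
            <mimeType>application/vnd.ms-excel</mimeType>
            <!-- Standard MIME type for Open XML Excel workbooks (xlsx) -->
            <mimeType>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mimeType>
        </excludes>
    </mimeTypes>
</mediaIndexing>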