Introducing the Sitecore Analytics Index

To me, one of the most interesting parts of Sitecore 7.5 is that the data in xDB is indexed and available through the Sitecore content search API. The beautiful end result is that you can write LINQ expressions against xDB (using the LINQPad Driver for Sitecore, for example).

But before you start using the analytics index, there are a few things that, if you understand them, will make your life a little easier. Covering them is the main purpose of this post, which introduces the analytics index and explains how it is populated.

One of the main design goals for the Sitecore content search API was to allow any search engine to be plugged into Sitecore. Everything I cover in this post applies to all search engines (unless I explain otherwise). But for the sake of simplicity I am going to assume you are using the default Sitecore configuration, which means you are using Lucene. If you are using a different search engine, file names and configuration may be different.

Super-simplified explanation

When the Sitecore session ends, interaction data is written to MongoDB. This is probably the most important thing to understand about the analytics index. Until the session ends, you will see no data in xDB and no data in the analytics index.

But the session ending isn't what causes the analytics index to be updated. That is handled by the aggregation process. All of the official product documentation I've seen so far only discusses the aggregation process in terms of its role in populating the reporting database. Aggregation also runs crawlers that update the analytics index.

Keep on reading if you're interested in some details on how it all works...

How does data get written to the analytics database?

A database in MongoDB is a set of collections. A collection is like a table in a relational database. It's a collection of documents. The analytics database has a collection for contacts, a collection for interactions, a collection for devices, and so on.

As a visitor interacts with Sitecore, Sitecore collects information about the visitor. When the visitor's session ends, that information is written to the collections in the analytics database.

(This is handled by the commitSession pipeline, and specifically, the SubmitSession processor.)

It is possible to update the analytics database without ending the visitor's session. You need to use the ContactManager. The following code demonstrates how:

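// Get the tracker for the current request.
var tracker = Sitecore.Analytics.Tracker.Current;

// Create the contact manager from the Sitecore configuration.
var manager = Sitecore.Configuration.Factory.CreateObject("tracking/contactManager", true) as Sitecore.Analytics.Tracking.ContactManager;
// Write the current contact to xDB, then save and release it.
manager.FlushContactToXdb(tracker.Contact);
manager.SaveAndReleaseContact(tracker.Contact);

// Create the session context manager from the configuration and submit the current session.
var ctxManager = Sitecore.Configuration.Factory.CreateObject("tracking/sessionContextManager", true) as Sitecore.Analytics.Data.SessionContextManagerBase;
ctxManager.Submit(tracker.Session);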

The ContactManager is created based on the settings in the Sitecore config files. In the configuration a ContactRepository is assigned to the ContactManager. The ContactRepository is used to interact with the data layer (meaning MongoDB). The ContactManager uses the ContactRepository to save the contact to xDB.

The ContactRepository, in turn, has a DataAdapterProvider assigned to it. The DataAdapterProvider is the first component that is actually aware of MongoDB. This provider is responsible for writing to the analytics database.
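To make the layering easier to picture, here is a minimal sketch of the same delegation chain. Every type in it (IDataAdapterProvider, InMemoryDataAdapterProvider, SketchContactRepository, SketchContactManager) is a hypothetical stand-in I made up for illustration. None of them is the Sitecore class or its real signature, and an in-memory dictionary plays the role of MongoDB.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; these are NOT the Sitecore types.
public interface IDataAdapterProvider
{
    void Save(string collection, string id, string document);
}

// The only layer that knows about the store (a dictionary here, MongoDB in xDB).
public class InMemoryDataAdapterProvider : IDataAdapterProvider
{
    public readonly Dictionary<string, Dictionary<string, string>> Collections =
        new Dictionary<string, Dictionary<string, string>>();

    public void Save(string collection, string id, string document)
    {
        if (!Collections.ContainsKey(collection))
        {
            Collections[collection] = new Dictionary<string, string>();
        }
        Collections[collection][id] = document;
    }
}

// The repository talks to the data layer only through the provider.
public class SketchContactRepository
{
    private readonly IDataAdapterProvider provider;

    public SketchContactRepository(IDataAdapterProvider provider)
    {
        this.provider = provider;
    }

    public void SaveContact(string contactId, string data)
    {
        provider.Save("Contacts", contactId, data);
    }
}

// The manager only knows about the repository, not about the store behind it.
public class SketchContactManager
{
    private readonly SketchContactRepository repository;

    public SketchContactManager(SketchContactRepository repository)
    {
        this.repository = repository;
    }

    public void FlushContact(string contactId, string data)
    {
        repository.SaveContact(contactId, data);
    }
}

public static class LayeringDemo
{
    public static InMemoryDataAdapterProvider Run()
    {
        var provider = new InMemoryDataAdapterProvider();
        var manager = new SketchContactManager(new SketchContactRepository(provider));
        manager.FlushContact("contact-1", "{ \"name\": \"example\" }");
        return provider;
    }

    public static void Main()
    {
        var provider = Run();
        Console.WriteLine(provider.Collections["Contacts"].Count);
    }
}
```

The point of the sketch is that swapping the provider changes where the data goes without touching the manager or the repository.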

How does data in the analytics database get indexed?

In a word: crawlers. Crawlers are the way all data gets into indexes using the Sitecore content search API. In the Sitecore config files there are a number of crawlers that are added to the index named sitecore_analytics_index.

Each crawler that indexes the analytics database inherits from the ObserverCrawler<T> type. The generic type identifies the kind of IIndexable object the crawler handles. For example, one crawler handles visit objects and another crawler handles contact objects.

When the crawler is initialized, it registers itself as an observer for its generic type. So the contact object crawler registers itself as an observer for ContactIndexable objects.

(The objects that are observed - such as ContactIndexable - are created during the aggregation process, which I'll cover in the next section.)

Being an observer means the crawler is notified when it needs to take action. Not only is the crawler notified, but the notification process also passes data (an IIndexable object) to the crawler. The crawler, then, uses this data to update the analytics index.
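The register-then-notify idea can be sketched in a few lines. This is only an illustration of the pattern, not Sitecore's implementation: SketchObserverRegistry, SketchObserverCrawler<T> and the two indexable classes are names I invented, and "updating the index" is reduced to adding an ID to a list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; not the Sitecore classes.
public interface ISketchIndexable
{
    string Id { get; }
}

public class SketchContactIndexable : ISketchIndexable
{
    public string Id { get; set; }
}

public class SketchVisitIndexable : ISketchIndexable
{
    public string Id { get; set; }
}

// Maps an indexable type to the observers that asked to be notified about it.
public class SketchObserverRegistry
{
    private readonly Dictionary<Type, List<Action<ISketchIndexable>>> observers =
        new Dictionary<Type, List<Action<ISketchIndexable>>>();

    public void Register<T>(Action<T> onNext) where T : ISketchIndexable
    {
        List<Action<ISketchIndexable>> list;
        if (!observers.TryGetValue(typeof(T), out list))
        {
            list = new List<Action<ISketchIndexable>>();
            observers[typeof(T)] = list;
        }
        list.Add(item => onNext((T)item));
    }

    public void Notify<T>(T item) where T : ISketchIndexable
    {
        List<Action<ISketchIndexable>> list;
        if (observers.TryGetValue(typeof(T), out list))
        {
            foreach (var observer in list)
            {
                observer(item);
            }
        }
    }
}

// On initialization the crawler registers itself for its generic type only.
public class SketchObserverCrawler<T> where T : ISketchIndexable
{
    public readonly List<string> IndexedIds = new List<string>();

    public void Initialize(SketchObserverRegistry registry)
    {
        registry.Register<T>(Update);
    }

    // Called with the indexable that was passed along with the notification.
    private void Update(T indexable)
    {
        IndexedIds.Add(indexable.Id);
    }
}

public static class CrawlerDemo
{
    public static int[] Run()
    {
        var registry = new SketchObserverRegistry();
        var contacts = new SketchObserverCrawler<SketchContactIndexable>();
        var visits = new SketchObserverCrawler<SketchVisitIndexable>();
        contacts.Initialize(registry);
        visits.Initialize(registry);

        registry.Notify(new SketchContactIndexable { Id = "c1" });
        registry.Notify(new SketchVisitIndexable { Id = "v1" });
        registry.Notify(new SketchVisitIndexable { Id = "v2" });

        return new[] { contacts.IndexedIds.Count, visits.IndexedIds.Count };
    }

    public static void Main()
    {
        var counts = Run();
        Console.WriteLine(counts[0] + " contact, " + counts[1] + " visits");
    }
}
```

Each crawler only ever sees objects of the type it registered for, which is the behavior described above.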

What is the aggregation process?

Aggregation involves handling the raw data that has been collected in xDB. It is a fairly generic process: it takes the data in xDB and does something with it, such as populating the reporting database or updating the analytics index.

Aggregation covers more than just interactions, but for this post I am going to limit myself to describing how interaction aggregation works.

When Sitecore starts, a hook runs that starts the AggregationLoader. The loader instantiates the AggregationModule, which is basically a container for agents. The agents are specified in the Sitecore config files. These agents handle the various processes that ensure the aggregation process runs properly.

One of these agents is the aggregation agent. This agent has a property named Dispatcher. The dispatcher provides InteractionWorkItem objects to the agent.

When the session ends, an entry is made in the tracking database (which is another MongoDB database) that identifies the interaction representing the session activity. The aggregation agent's Dispatcher reads the entries from the tracking database. Those entries are exposed as InteractionWorkItem objects.

The aggregation agent uses an InteractionAggregator object to process each InteractionWorkItem object. The InteractionAggregator object reads the interaction ID from the InteractionWorkItem. It passes that information to the interactions pipeline. The pipeline is run for each interaction.
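Since there are no illustrations, here is a rough sketch of that control flow in code. It assumes a lot: the Sketch* classes are hypothetical names of mine, a queue stands in for the entries in the tracking database, and the pipeline is just a list of delegates. The real Sitecore types and signatures will differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; not the Sitecore classes.
public class SketchWorkItem
{
    public Guid InteractionId;
}

// Hands out work items; the queue stands in for entries in the tracking database.
public class SketchDispatcher
{
    private readonly Queue<Guid> pending = new Queue<Guid>();

    // Called when a session ends: record which interaction needs processing.
    public void RecordSessionEnd(Guid interactionId)
    {
        pending.Enqueue(interactionId);
    }

    public bool TryGetNext(out SketchWorkItem item)
    {
        if (pending.Count == 0)
        {
            item = null;
            return false;
        }
        item = new SketchWorkItem { InteractionId = pending.Dequeue() };
        return true;
    }
}

// Reads the interaction ID from the work item and runs the pipeline for it.
public class SketchInteractionAggregator
{
    private readonly List<Action<Guid>> pipeline;

    public SketchInteractionAggregator(List<Action<Guid>> pipeline)
    {
        this.pipeline = pipeline;
    }

    public void Process(SketchWorkItem item)
    {
        foreach (var processor in pipeline)
        {
            processor(item.InteractionId);
        }
    }
}

// Pulls work items from its dispatcher until none are left.
public class SketchAggregationAgent
{
    private readonly SketchDispatcher dispatcher;
    private readonly SketchInteractionAggregator aggregator;

    public SketchAggregationAgent(SketchDispatcher dispatcher, SketchInteractionAggregator aggregator)
    {
        this.dispatcher = dispatcher;
        this.aggregator = aggregator;
    }

    public int Run()
    {
        int processed = 0;
        SketchWorkItem item;
        while (dispatcher.TryGetNext(out item))
        {
            aggregator.Process(item);
            processed++;
        }
        return processed;
    }
}

public static class AggregationDemo
{
    public static List<string> Run()
    {
        var log = new List<string>();
        var dispatcher = new SketchDispatcher();
        dispatcher.RecordSessionEnd(Guid.NewGuid());
        dispatcher.RecordSessionEnd(Guid.NewGuid());

        // Two made-up processors: one for reporting, one for the index.
        var pipeline = new List<Action<Guid>>
        {
            id => log.Add("reporting:" + id),
            id => log.Add("index:" + id)
        };

        var agent = new SketchAggregationAgent(dispatcher, new SketchInteractionAggregator(pipeline));
        agent.Run();
        return log;
    }

    public static void Main()
    {
        foreach (var line in Run())
        {
            Console.WriteLine(line);
        }
    }
}
```

The whole pipeline runs once per interaction, so two ended sessions produce two passes through every processor.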

The interactions pipeline runs a number of processors that are involved in updating the analytics index. These processors inherit from the type ObservableAggregator<T>. Each of these processors finds (or creates) the IIndexable objects that are passed to the crawler whose generic type matches the generic type specified on the processor.
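Here is one way the processor side could look, sketched with the standard .NET IObservable<T> and IObserver<T> interfaces. I am not claiming this is how ObservableAggregator<T> is written internally. DemoObservableAggregator<T>, DemoContactProcessor, DemoContactCrawler, DemoInteraction and DemoContactIndexable are all hypothetical names used only to show how the generic type ties a processor to its crawler.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; not the Sitecore classes.
public class DemoInteraction
{
    public Guid Id;
    public string ContactName;
}

public class DemoContactIndexable
{
    public string Name;
}

// A processor that turns an interaction into indexables of type T
// and pushes each one to the observers subscribed for T.
public abstract class DemoObservableAggregator<T> : IObservable<T>
{
    private readonly List<IObserver<T>> observers = new List<IObserver<T>>();

    public IDisposable Subscribe(IObserver<T> observer)
    {
        observers.Add(observer);
        return new Unsubscriber(observers, observer);
    }

    // Finds (or creates) the indexable objects for one interaction.
    protected abstract IEnumerable<T> ResolveIndexables(DemoInteraction interaction);

    public void Process(DemoInteraction interaction)
    {
        foreach (var indexable in ResolveIndexables(interaction))
        {
            foreach (var observer in observers)
            {
                observer.OnNext(indexable);
            }
        }
    }

    private sealed class Unsubscriber : IDisposable
    {
        private readonly List<IObserver<T>> list;
        private readonly IObserver<T> observer;

        public Unsubscriber(List<IObserver<T>> list, IObserver<T> observer)
        {
            this.list = list;
            this.observer = observer;
        }

        public void Dispose()
        {
            list.Remove(observer);
        }
    }
}

public class DemoContactProcessor : DemoObservableAggregator<DemoContactIndexable>
{
    protected override IEnumerable<DemoContactIndexable> ResolveIndexables(DemoInteraction interaction)
    {
        yield return new DemoContactIndexable { Name = interaction.ContactName };
    }
}

// The crawler can only subscribe to a processor with the same generic type.
public class DemoContactCrawler : IObserver<DemoContactIndexable>
{
    public readonly List<string> Indexed = new List<string>();

    public void OnNext(DemoContactIndexable value) { Indexed.Add(value.Name); }
    public void OnError(Exception error) { }
    public void OnCompleted() { }
}

public static class ProcessorDemo
{
    public static List<string> Run()
    {
        var processor = new DemoContactProcessor();
        var crawler = new DemoContactCrawler();
        processor.Subscribe(crawler);

        processor.Process(new DemoInteraction { Id = Guid.NewGuid(), ContactName = "alice" });
        processor.Process(new DemoInteraction { Id = Guid.NewGuid(), ContactName = "bob" });
        return crawler.Indexed;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

In this sketch the compiler enforces the type match: a crawler for one indexable type cannot be subscribed to a processor that produces another.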

Conclusion

If you've made it this far, bravo! This was a pretty dense post without any illustrations to explain the control flow.

But I feel like I've left one important topic unaddressed: how do you rebuild the analytics index? That is a topic for another post!

Still, I hope the information helps you better understand how the analytics index is updated.

  • Hi Team, I am new to Sitecore development. Currently I am working on a task called "Agent Server". Agent Server: it has all the Sitecore processes related to Sitecore.Analytics.dll. Details: currently we have a website, for example www.abc.com. It contains Sitecore processes related to Sitecore.Analytics.dll, enabled through Sitecore.Analytics.Config. The processes in Sitecore.Analytics.Config are: 1. Invoking the lookup manager class, which calls the MaxMindprovider class to get IP info. 2. DataAdapter manager. 3. Initialize pipelines of Sitecore.Analytics.Pipeline.Loader. Because of all these processes, the site sometimes spins or its performance suffers. For this reason we are planning to move these Sitecore processes to a new box called "Agent Server". The Agent Server is built with Sitecore and enabled through configuration. Queries: 1) Please explain how the IP functionality works. 2) How can we trigger the Agent Server to run the Sitecore processes as we did earlier? 3) How is the Agent Server able to read the IPs of a user logged in to the site www.abc.com? 4) How can we resolve IP addresses using the Agent Server? 5) How is IP address info (city, state, etc.) written to the database? Please explain.

  • Hi Sitecore Team, I have a query regarding the entries in the Sitecore Analytics database. Please see my description below. We have two sites: 1) abc-www.example.com 2) sas-medicine.com. These two sites share the same Sitecore Analytics database. Whenever we browse these two sites, we can see entries for only one site in the Visits table. The Sitecore.Analytics.Config file and the code are the same for both sites. Could you please tell us whether we need to make any site-specific settings or changes to see entries in the Visits table? Please assist.

  • Hey there, I'm trying to upgrade an SC instance from 7.2 to 7.5, and after the initial release 7.5 upgrade I'm running into this error: Could not find configuration node: contactRepository. I'm not quite sure what I might be missing. Any help would be appreciated.

  • Hi Adam. Is there any way to tell whether an interaction has been indexed? All page visits on our site end up in the Interactions collection correctly, but I can't find them in the "Experience Analytics" dashboard inside Sitecore, so I suspect that they are not indexed. I can make it work locally but I can't make it work in our QA environment. I found that the ProcessingPool collection in the tracking_live MongoDB database holds a lot of documents on the QA server but not on my local machine, so I am guessing that the contents of the MongoDB are not processed correctly.