Sitecore media library and Azure blob storage

I like blobs. Specifically, I like Sitecore and how it uses blobs for storing media library assets. What I don't like, however, is that blobs can quickly inflate the size of your content database - especially when you start dealing with blobs numbering in the tens of thousands (or more). Wouldn't it be nice if you could retain all the nice features of storing blobs in your database, but not store blobs in your database? Well my friend, you can!

Background

Before we get into the good stuff, let's walk through a primer on Sitecore blob storage.

Open the default Sitecore Master database and you'll see a table named Blobs. This table will contain all blob data used in the Sitecore media library. There are two columns in the Blobs table worth noting: BlobId and Data. The BlobId column contains a GUID used to uniquely identify a blob. The Data column contains the binary data for the blob.

Blobs table

When you upload a file to the Sitecore media library, a Sitecore media item is created. That media item contains a field named Media, which is used to represent the blob associated with the media item. Behind the scenes, the aforementioned BlobId value is stored as the value of the Media field in a media item, thereby providing a way to reference a blob from a media item without directly storing the blob within one of the media item fields. This is an important concept, as it provides separation between blob storage and content storage.

By distinctly separating blob storage from content storage and allowing us to override the default SQL Server data provider, Sitecore gives us the opportunity to roll our own blob storage container - without impacting standard media library functionality. For the purpose of this post, I will be demonstrating the use of Azure blob storage, but in theory you could use any storage container in which you have the ability to read/write data and uniquely identify a blob (e.g. file system, separate SQL Server, NoSQL, Azure Table storage, etc.).
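To make the "any storage container" point concrete, here is a minimal sketch of what such a container has to offer: read, write, delete, and an existence check, all keyed by the blob's GUID. The IBlobStore interface and FileSystemBlobStore class are hypothetical names invented for illustration; they are not part of Sitecore or of the provider described in this post.

```csharp
using System;
using System.IO;

// Hypothetical abstraction: the minimum a storage container must support.
public interface IBlobStore
{
    bool Exists(Guid blobId);
    Stream Read(Guid blobId);              // returns null when the blob is missing
    void Write(Guid blobId, Stream data);
    bool Delete(Guid blobId);              // returns false when there was nothing to delete
}

// Illustrative file-system implementation: one file per blob, named by its GUID.
public sealed class FileSystemBlobStore : IBlobStore
{
    private readonly string _root;

    public FileSystemBlobStore(string rootDirectory)
    {
        _root = rootDirectory;
        Directory.CreateDirectory(_root);  // no-op if the directory already exists
    }

    private string PathFor(Guid blobId)
    {
        return Path.Combine(_root, blobId.ToString("N"));
    }

    public bool Exists(Guid blobId)
    {
        return File.Exists(PathFor(blobId));
    }

    public Stream Read(Guid blobId)
    {
        string path = PathFor(blobId);
        return File.Exists(path) ? File.OpenRead(path) : null;
    }

    public void Write(Guid blobId, Stream data)
    {
        // FileMode.Create overwrites any existing blob with the same ID.
        using (var file = new FileStream(PathFor(blobId), FileMode.Create, FileAccess.Write))
        {
            data.CopyTo(file);
        }
    }

    public bool Delete(Guid blobId)
    {
        string path = PathFor(blobId);
        if (!File.Exists(path)) return false;
        File.Delete(path);
        return true;
    }
}
```

These four operations line up roughly with the BlobStreamExists, GetBlobStream, SetBlobStream and RemoveBlobStream methods discussed later in the post.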

Deconstructing the Data Provider

Any time I need to override or extend existing Sitecore functionality, the first place I visit is the web.config file to look for potential integration points, then on to .NET Reflector to determine what needs to be done. In this case, I want to get as close to the data as possible so I can be sure all touch points related to blob storage end up filtering through my code - which naturally leads me to the main Sitecore SQL Server data provider (Sitecore.Data.SqlServer.SqlServerDataProvider).

Using Reflector to take a look under the hood, I can see that Sitecore.Data.SqlServer.SqlServerDataProvider contains 3 methods related to blob handling that override base class methods: BlobStreamExists, GetBlobStream, SetBlobStream.

Sitecore.Data.SqlServer.SqlServerDataProvider

Walking up the inheritance chain I see that the base class, Sitecore.Data.DataProviders.Sql.SqlDataProvider contains 1 virtual method related to blob handling (CleanupBlobs) and 1 overridden method related to blob handling (RemoveBlobStream).

Sitecore.Data.DataProviders.Sql.SqlDataProvider

Walking one step further up the inheritance chain, to the Sitecore.Data.DataProviders.DataProvider class, I don't see any other methods related to blob handling that need to be overridden. Therefore, I now have a list of methods to override in a custom data provider class - 5 methods, not too bad!

The Code

First, I created a new class that extends the Sitecore.Data.SqlServer.SqlServerDataProvider class.

public class AzureBlobStorageProvider : Sitecore.Data.SqlServer.SqlServerDataProvider
{
    private readonly LockSet _blobSetLocks;
     
    public AzureBlobStorageProvider(string connectionString) : base(connectionString)
    {
        _blobSetLocks = new LockSet();
    }
}

Next, I added some properties to provide convenient (and efficient) access to the Azure Storage account and blob container.

private Microsoft.WindowsAzure.CloudStorageAccount _storageAccount;
private Microsoft.WindowsAzure.StorageClient.CloudBlobClient _blobClient;
private Microsoft.WindowsAzure.StorageClient.CloudBlobContainer _blobContainer;
 
protected Microsoft.WindowsAzure.CloudStorageAccount StorageAccount
{
    get { return _storageAccount ?? (_storageAccount = CloudStorageAccount.Parse(Configuration.Settings.Media.AzureBlobStorage.StorageConnectionString)); }
}
 
protected Microsoft.WindowsAzure.StorageClient.CloudBlobClient BlobClient
{
    get { return _blobClient ?? (_blobClient = StorageAccount.CreateCloudBlobClient()); }
}
 
protected Microsoft.WindowsAzure.StorageClient.CloudBlobContainer BlobContainer
{
    get
    {
        if (_blobContainer == null)
        {
            _blobContainer = BlobClient.GetContainerReference(Configuration.Settings.Media.AzureBlobStorage.StorageContainerName);
            _blobContainer.CreateIfNotExist();
        }
        return _blobContainer;
    }
}

I also created a convenience class for retrieving settings that are specific to the Azure blob storage provider.

public class Settings
{
    public static class Media
    {
        public static class AzureBlobStorage
        {
            public static string StorageContainerName
            {
                get { return GetSetting("Media.AzureBlobStorage.ContainerName"); }
            }
 
            public static string StorageConnectionString
            {
                get { return GetSetting("Media.AzureBlobStorage.ConnectionString"); }
            }
        }
    }
 
    public static string GetSetting(string name)
    {
        return Sitecore.Configuration.Settings.GetSetting(name);
    }
}

I also created an extension method class for extending the Microsoft.WindowsAzure.StorageClient.CloudBlob object. Currently, only one extension method is implemented; it determines whether or not a blob object exists in the Azure storage account container. Note: the code below has an unpleasant smell due to its exception-based logic, but it's functional. Taken from this blog post: http://blog.smarx.com/posts/testing-existence-of-a-windows-azure-blob

public static class BlobExtensions
{
    public static bool Exists(this CloudBlob blob)
    {
        try
        {
            blob.FetchAttributes();
            return true;
        }
        catch (StorageClientException e)
        {
            if (e.ErrorCode == StorageErrorCode.ResourceNotFound)
            {
                return false;
            }
            throw;
        }
    }
}

And now onto the data provider methods...

CleanupBlobs

I won't go into the code for this method in detail, as it reuses much of the code from the Sitecore.Data.DataProviders.Sql.SqlDataProvider.CleanupBlobs method. However, the general algorithm is as follows:

  • Directly query the Sitecore database in use by the provider for all Sitecore item fields containing references to blobIds in the Blobs table. This provides a list of blobs that are in use by Sitecore fields.
  • Using SQL, generate a list of blobs that are NOT in use by Sitecore fields.
  • Delete unused blob references from the Blobs table.
  • Delete unused blobs from the Azure blob storage container.
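The steps above boil down to a set difference followed by two deletes. The sketch below shows only that logic, in memory. The real method does the comparison in SQL, and the BlobCleanupSketch class, its parameters and the two delete callbacks are hypothetical stand-ins for the SQL DELETE and the Azure delete call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BlobCleanupSketch
{
    // storedBlobIds:     IDs that have a row in the Blobs table.
    // referencedBlobIds: IDs found in Sitecore item field values.
    // deleteBlob/deleteRow: hypothetical callbacks standing in for the
    //                       Azure delete and the SQL DELETE respectively.
    public static IList<Guid> RemoveOrphans(
        IEnumerable<Guid> storedBlobIds,
        IEnumerable<Guid> referencedBlobIds,
        Action<Guid> deleteBlob,
        Action<Guid> deleteRow)
    {
        var inUse = new HashSet<Guid>(referencedBlobIds);

        // Orphans are stored blobs that no item field refers to.
        var orphans = storedBlobIds
            .Distinct()
            .Where(id => !inUse.Contains(id))
            .ToList();

        foreach (var id in orphans)
        {
            deleteBlob(id); // remove the external blob first...
            deleteRow(id);  // ...then the placeholder row that tracks it
        }
        return orphans;
    }
}
```

Deleting the external blob before its row is a choice made in this sketch, not necessarily the order the provider uses: if the external delete throws, the row is still there, so a later cleanup run can find the orphan again.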

SetBlobStream

This was actually a fairly easy method to implement until I started working on the CleanupBlobs operation. From an Azure standpoint, it's pretty simple: we get a blob reference for the blobId passed in to the SetBlobStream method, then upload the blob stream argument to Azure using that reference.

From a Sitecore standpoint, however, we also want to create an "empty" reference to the blob in the Blobs table, just without the blob data. During the CleanupBlobs operation, all Sitecore item field values are examined for references to blobIds stored in the Blobs table. If a blobId is orphaned (i.e. not in use by any Sitecore item fields), then the related blob should be deleted. If we didn't use SQL to generate a list of unused blobs, the alternative would be to retrieve an entire list of blobs from the Azure blob storage container, then iterate through that list to determine which blobs aren't in use within Sitecore and should be "cleaned up" (i.e. removed). That would be an expensive operation, especially as the number of blob items in your Azure storage container increases.

public override bool SetBlobStream(Stream stream, Guid blobId, CallContext context)
{
    lock (_blobSetLocks.GetLock(blobId))
    {
        var blob = BlobContainer.GetBlobReference(blobId.ToString());
         
        blob.UploadFromStream(stream);
         
        // Insert an empty reference to the blobId into the SQL Blobs table to assist the cleanup process.
        // During cleanup, it's faster to query the database for the blobs to remove than to retrieve and parse a list from Azure.
        const string cmdText = "INSERT INTO [Blobs]( [Id], [BlobId], [Index], [Created], [Data] ) VALUES( NewId(), @blobId, @index, @created, @data)";
        using (var connection = new SqlConnection(Api.ConnectionString))
        {
            connection.Open();
            // Dispose the command deterministically, just like the connection.
            using (var command = new SqlCommand(cmdText, connection))
            {
                command.CommandTimeout = (int)CommandTimeout.TotalSeconds;
                command.Parameters.AddWithValue("@blobId", blobId);
                command.Parameters.AddWithValue("@index", 0);
                command.Parameters.AddWithValue("@created", DateTime.UtcNow);
                command.Parameters.Add("@data", SqlDbType.Image, 0).Value = new byte[0];
                command.ExecuteNonQuery();
            }
        }
    }
    return true;
}
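The lock (_blobSetLocks.GetLock(blobId)) call serializes writes per blob ID, so two uploads of the same blob can't interleave while uploads of different blobs proceed in parallel. Sitecore's LockSet internals aren't shown in this post, so the following is only a sketch of how a per-key lock of that kind could work; the KeyedLocks class is a hypothetical stand-in, not Sitecore's implementation.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for a per-key lock set.
public sealed class KeyedLocks
{
    private readonly ConcurrentDictionary<Guid, object> _locks =
        new ConcurrentDictionary<Guid, object>();

    // Returns the same lock object for every caller that passes the same key.
    // The value factory may run more than once under contention, but only one
    // object is ever stored and returned for a given key.
    public object GetLock(Guid key)
    {
        return _locks.GetOrAdd(key, _ => new object());
    }
}
```

Usage would mirror the method above: lock (locks.GetLock(blobId)) { ... }. Note that this sketch never removes entries, so the dictionary grows by one small object per distinct blob ID.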

BlobStreamExists

In this method, retrieve a reference to the blobId in question, then use the extension method mentioned earlier to return whether or not the blob exists.

public override bool BlobStreamExists(Guid blobId, CallContext context)
{
    var blob = BlobContainer.GetBlobReference(blobId.ToString());
    return blob.Exists();
}

GetBlobStream

In this method, retrieve a reference to the blobId in question. If the referenced blob doesn't exist in Azure storage, then return null. If it does exist, download the blob to a System.IO.MemoryStream object and return that stream.

public override Stream GetBlobStream(Guid blobId, CallContext context)
{
    Assert.ArgumentNotNull(context, "context");
     
    var blob = BlobContainer.GetBlobReference(blobId.ToString());
    if (!blob.Exists())
        return null;
 
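    var memStream = new MemoryStream();
    blob.DownloadToStream(memStream);
    // Rewind so callers read from the start; the download leaves the position at the end of the stream.
    memStream.Position = 0;
    return memStream;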
}

RemoveBlobStream

In this method, first retrieve a reference to the blobId in question. Then use the Microsoft.WindowsAzure.StorageClient.CloudBlob.DeleteIfExists method to delete the blob if it exists in the Azure storage container. Lastly, call the base class Sitecore.Data.DataProviders.Sql.SqlDataProvider.RemoveBlobStream method. This ensures that any record in the Blobs database table, which references the blobId in question, is deleted from the Blobs table.

public override bool RemoveBlobStream(Guid blobId, CallContext context)
{
    var blob = BlobContainer.GetBlobReference(blobId.ToString());
    blob.DeleteIfExists();
    return base.RemoveBlobStream(blobId, context);
}

Usage

  • Clone or fork the git repo found here: https://bitbucket.org/aweber1/sitecore.azureextensions
  • Compile the project
  • Modify the /App_Config/Include/Sitecore.AzureExtensions.config file, adjusting the Media.AzureBlobStorage.ConnectionString and Media.AzureBlobStorage.ContainerName settings as needed
  • Copy/deploy/reference the compiled /bin/Sitecore.AzureExtensions.dll file and /App_Config/Include/Sitecore.AzureExtensions.config file in your Sitecore solution
  • Verify the new provider is operating by uploading a file to the Sitecore media library. You should see this same file mirrored in your Azure blob storage container.
  • Enjoy!

Conclusion

I'll echo the words of developers everywhere - "It works in my environment". As such, your experience may vary and you would be wise to exercise caution if you choose to implement some version of the provider demonstrated in this article - especially if you're considering it for production use.

A few other considerations to keep in mind if you choose to use Azure/"the cloud" as a storage provider:

  • Deleting blobs. When you delete a media item from the Sitecore media library, by default this item is archived in the Sitecore recycle bin. Because you can restore items from the recycle bin, the blob for a recycled media item is not removed from your blob storage container - whether it be SQL Server, Azure, etc. Therefore, if you delete a media item from the Sitecore media library and recycle bin functionality is enabled in your Sitecore implementation, don't expect the associated blob to be removed from your storage container. When the recycle bin is enabled, blobs are only removed from blob storage when their referencing item is permanently removed from the recycle bin AND the database cleanup operation is performed.
  • Performance. Uploading/downloading media assets to/from the cloud will likely be slower than doing the same with local database storage. Along the same lines, if multiple Sitecore users are uploading/downloading media assets at the same time, external bandwidth will be divided across each ongoing network operation. From a content delivery perspective, the standard Sitecore media caching will occur as usual.
  • Multiple databases. Each default Sitecore database - core, master, web - contains a Blobs table. The Azure blob storage data provider does not differentiate between the databases; it acts as the sole repository for all Sitecore blobs. It would be possible to create separate repositories for each database, but then you'd have to extend the publishing process to move blobs from the master repo to the web repo on publish - probably more work than it's worth and not really necessary. Also, the core database doesn't really make use of blobs.


  • Hi Adam,  It is a nice solution. However, when I tried deleting a media item and then permanently deleting it from the recycle bin, I didn't see the debugger hitting the RemoveBlobStream override method. Since you say that doing the above will remove the media from the storage container, I would expect the debugger to hit this method. Any ideas?

  • Hi Adrian,  As noted in the last section of the article:  "When the recycle bin is enabled, blobs are only removed from blob storage when their referencing item is permanently removed from the recycle bin AND the database cleanup operation is performed."  In other words, when you "permanently" delete media items from the recycle bin, only the item and blob reference are deleted. The actual blob will still remain in storage (either database or your custom storage) even after a permanent delete from the recycle bin. In order to remove the blob (and subsequently execute the RemoveBlobStream method), you need to run the database cleanup operation (via the Sitecore control panel). This is standard Sitecore behavior when the recycle bin is enabled and not specific to the provider example.   If you want to make things more seamless and delete blobs when you permanently delete a media item in the recycle bin, then you'd likely need to explore extending the Sitecore.Data.Archiving.SqlArchive class - specifically the various "RemoveEntries" methods. The challenge will be in determining whether or not an item being permanently removed from the recycle bin contains any blob fields and then obtaining a reference to the blob to be deleted.  Cheers, adam  

  • Very useful. I'm looking to use something different to Azure, but still very valuable information. Thanks.

  • Hi Adam,  I want to figure out the total size of all media items stored as blobs in my Sitecore master DB. How can I generate this report?