How to query Cloud Blobs on Windows Azure Storage
What I've realized about Windows Azure blob storage is that it is bare-bones. As in extremely bare-bones. You should use it only to store documents and associated metadata and then retrieve individual blobs by ID.
I recently migrated an application from MongoDB to Windows Azure blob storage. Coming from MongoDB, I was expecting a bunch of different efficient ways to retrieve documents. After migrating, I now rely on a traditional RDBMS and Elasticsearch to store blob information in a more searchable way.
It's really too bad that Windows Azure blob storage is so limiting. I hope to see much-enhanced searching capabilities in the future (e.g., search by metadata, property, blob name regex, etc.). Additionally, indexes based on map/reduce would be awesome. Microsoft has the chance to convert a lot of folks over from other document storage systems if they do these things.
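The ListBlobs method on a container retrieves its blobs lazily, so you can write LINQ queries against it that are not executed until you loop over the result (or materialize it with ToList or a similar method). Keep in mind that these are LINQ to Objects queries: the filtering, ordering and paging happen on the client as the listing is enumerated, not on the service, so queries that order or skip still have to list every blob in the container.
Things will get clearer with a few examples. For those who don't know how to obtain a reference to a container in their Azure Storage Account, I recommend this tutorial.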
Order by last modified date and take page number 2 (10 blobs per page):
blobContainer.ListBlobs().OfType<CloudBlob>()
.OrderByDescending(b=>b.Properties.LastModified).Skip(10).Take(10);
Get files of a specific type. This will work if you set ContentType at upload time (which I strongly recommend you do):
blobContainer.ListBlobs().OfType<CloudBlob>()
.Where(b=>b.Properties.ContentType.StartsWith("image"));
Get .jpg files and order them by file size, assuming you set file names with their extensions:
blobContainer.ListBlobs().OfType<CloudBlob>()
.Where(b=>b.Name.EndsWith(".jpg")).OrderByDescending(b=>b.Properties.Length);
Finally, remember that the query is not executed until you enumerate it:
var blobs = blobContainer.ListBlobs().OfType<CloudBlob>()
.Where(b=>b.Properties.ContentType.StartsWith("image"));
foreach(var b in blobs) //This line starts calling the service; blobs are
//listed lazily and the Where filter is applied
//on the client as the results arrive
{
// do something with each file. Variable b is of type CloudBlob
}
Edit
Blob index for Azure Storage is now in preview. It is a managed index over metadata (tags) that you can add to your blobs, new or existing. This removes the need to use creative container names for pseudo-indexing or to maintain a secondary index yourself.
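As a rough sketch of what querying by blob index tags could look like: this assumes the newer Azure.Storage.Blobs (v12) client library rather than the library used in the examples above, and the tag key "user" and the class and method names are hypothetical, chosen only for illustration.

```csharp
using System.Collections.Generic;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class BlobIndexSketch
{
    // Attach an index tag to an existing blob.
    // The key "user" is a made-up example; pick keys that match your use case.
    public static void TagBlob(BlobClient blob, string user)
    {
        blob.SetTags(new Dictionary<string, string> { { "user", user } });
    }

    // Find blobs across the account whose "user" tag equals the given value.
    // Assumes 'user' contains only characters that are valid in a tag value
    // (in particular, no single quotes).
    public static List<string> FindByUser(BlobServiceClient service, string user)
    {
        string filter = "\"user\" = '" + user + "'";
        var names = new List<string>();
        foreach (TaggedBlobItem item in service.FindBlobsByTags(filter))
        {
            names.Add(item.BlobContainerName + "/" + item.BlobName);
        }
        return names;
    }
}
```

Unlike the LINQ examples, the filter here is evaluated by the service, so only matching blobs come back. Newly written tags may take a moment to show up in search results.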
Original answer
For returning specific results, one option is to use the blob and/or container prefix to effectively index what you're storing. For example, you could prefix a date and time as you add blobs, or you could prefix a user; how you'd want to "index" your blobs depends on your use case. You can then pass this prefix, or a leading part of it, to the ListBlobs[Segmented] call to return specific results. Because matching is by prefix, you need to put the most general elements first, followed by the more specific ones, e.g.:
2016_03_15_10_15_blobname
This would allow you to get all 2016 blobs, or March 2016 blobs, etc. but not March blobs in any year without multiple calls.
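A minimal sketch of this prefix approach, assuming the .NET Framework WindowsAzure.Storage client library (where the synchronous ListBlobs overload with a prefix argument is available) and assuming all blobs are block blobs. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Blob;

public static class PrefixIndexSketch
{
    // Build a name such as "2016_03_15_10_15_blobname":
    // most general element (year) first, most specific last.
    public static string BuildBlobName(DateTime timestamp, string name)
    {
        return timestamp.ToString("yyyy_MM_dd_HH_mm", CultureInfo.InvariantCulture)
               + "_" + name;
    }

    // List only the blobs for a given month, e.g. prefix "2016_03_".
    // The prefix is sent to the service, so non-matching blobs are never returned.
    public static List<CloudBlockBlob> ListForMonth(
        CloudBlobContainer container, int year, int month)
    {
        string prefix = string.Format(
            CultureInfo.InvariantCulture, "{0:D4}_{1:D2}_", year, month);

        return container
            .ListBlobs(prefix, useFlatBlobListing: true)
            .OfType<CloudBlockBlob>() // assumption: only block blobs are stored
            .ToList();
    }
}
```

The difference from the LINQ Where examples is where the filtering happens: a prefix is applied by the service, whereas Where runs on the client after the blobs have been listed.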
The downside is that if you need to re-index blobs, you have to delete and recreate them under a new name.
For paging in general, you can use the ListBlobsSegmented method, which gives you a continuation token that you can use to implement paging. That said, it's not much use if you need to skip pages, as it only works by starting from where the last set of actual results left off. One option is to calculate the number of pages you need to skip, fetch and discard them, then fetch the page you actually want. If you have a lot of blobs in each container, this could get inefficient pretty quickly.
You could also keep this as the fallback method: go page by page and store the continuation token while the user clicks from one page to the next sequentially. Alternatively, you could cache the blob names and do your own paging from that.
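The fetch-and-discard idea could be sketched roughly as follows, again assuming the .NET Framework WindowsAzure.Storage client library. The class name, method name and zero-based pageIndex parameter are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Storage.Blob;

public static class SegmentedPagingSketch
{
    // Return the requested page (zero-based) by walking segments from the start.
    // Pass null as the prefix to page over the whole container.
    public static List<IListBlobItem> GetPage(
        CloudBlobContainer container, string prefix, int pageIndex, int pageSize)
    {
        BlobContinuationToken token = null;

        for (int i = 0; ; i++)
        {
            // Each call returns up to pageSize items plus a token for the next segment.
            // Note: the service may return fewer items than requested in a segment.
            BlobResultSegment segment = container.ListBlobsSegmented(
                prefix, true, BlobListingDetails.None, pageSize, token, null, null);

            if (i == pageIndex)
            {
                return segment.Results.ToList();
            }

            token = segment.ContinuationToken;
            if (token == null)
            {
                // Ran out of blobs before reaching the requested page.
                return new List<IListBlobItem>();
            }
        }
    }
}
```

Every skipped page still costs a round trip to the service, which is why this gets expensive for deep pages. Passing a non-null prefix restricts the paging to blobs whose names start with that prefix.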
You can also combine these two approaches, e.g. filtering by your "index" then paging on the results.