Reading line by line from Blob Storage in Windows Azure
To directly answer your question: you will have to write code to download the blob locally first and then read its content. This is mainly because you cannot just peek into a blob and read its content from the middle. If you were using Windows Azure Table Storage, you could of course read specific content from the table.
Since your text file is a blob located in Azure Blob storage, what you really need is to download the blob locally (to a local file or a memory stream) and then read its content. Whether you download the blob in full or in part depends on the type of blob you uploaded: with page blobs you can download a specific range of content and process it, so it is worth knowing the difference between block and page blobs in this regard. A minimal sketch of the full-download approach follows.
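A rough sketch of that idea with the classic .NET storage client library might look like the following (the connection string, container name, and blob name are placeholders, not taken from the question):
// Download the entire blob as text, then read it line by line.
CloudStorageAccount account = CloudStorageAccount.Parse("YourConnectionString");
CloudBlobClient client = account.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference("mycontainer");   // placeholder
CloudBlob blob = container.GetBlobReference("myfile.txt");                    // placeholder

string text = blob.DownloadText();   // pulls the whole blob into memory as a string
using (StringReader reader = new StringReader(text))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        Console.WriteLine(line);
    }
}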
Yes, you can do this with streams, and it doesn't necessarily require that you pull the entire file, though please read to the end (of the answer... not the file in question) because you may want to pull the whole file anyway.
Here is the code:
StorageCredentialsAccountAndKey credentials = new StorageCredentialsAccountAndKey(
    "YourStorageAccountName",
    "YourStorageAccountKey"
);
CloudStorageAccount account = new CloudStorageAccount(credentials, true);
CloudBlobClient client = new CloudBlobClient(account.BlobEndpoint.AbsoluteUri, account.Credentials);
CloudBlobContainer container = client.GetContainerReference("test");
CloudBlob blob = container.GetBlobReference("CloudBlob.txt");

using (var stream = blob.OpenRead())
{
    using (StreamReader reader = new StreamReader(stream))
    {
        while (!reader.EndOfStream)
        {
            Console.WriteLine(reader.ReadLine());
        }
    }
}
I uploaded a text file called CloudBlob.txt to a container called test. The file was about 1.37 MB in size (I actually used the CloudBlob.cs file from GitHub copied into the same file six or seven times). I tried this out with a block blob, which is likely what you'll be dealing with since you are talking about a text file.
This gets a reference to the blob as usual, then I call the OpenRead() method on the CloudBlob object, which returns a BlobStream that you can wrap in a StreamReader to get the ReadLine method. I ran Fiddler with this and noticed that it ended up making three additional calls to fetch blocks to complete the file. It looks like the BlobStream has a few properties you can use to tweak how much read-ahead it does, but I didn't try adjusting them. According to one reference I found, the retry policy also works at the last-read level, so it won't attempt to re-read the whole thing, just the last request that failed. Quoted here:
Lastly, the DownloadToFile/ByteArray/Stream/Text() methods perform their entire download in a single streaming get. If you use the CloudBlob.OpenRead() method it will utilize the BlobReadStream abstraction which will download the blob one block at a time as it is consumed. If a connection error occurs, then only that one block will need to be re-downloaded (according to the configured RetryPolicy). Also, this will potentially help improve performance as the client may not need to cache a large amount of data locally. For large blobs this can help significantly, however be aware that you will be performing a higher number of overall transactions against the service. -- Joe Giardino
I think it is important to note the caution Joe points out: this will lead to a larger overall number of transactions against your storage account. However, depending on your requirements, this may still be the option you are looking for.
If these are massive files and you are doing a lot of this, it could mean many, many transactions (though you could see whether you can tweak the properties on the BlobStream to increase the number of blocks retrieved at a time, etc.). It may still make sense to do a DownloadToStream on the CloudBlob (which pulls the entire contents down) and then read from that stream the same way I did above.
The only real difference is that one pulls smaller chunks at a time and the other pulls the full file immediately. There are pros and cons for each, and it will depend heavily on how large these files are and whether you plan on stopping at some point in the middle of reading the file (such as "yeah, I found the string I was searching for!") or reading the entire file anyway. If you plan on pulling the whole file no matter what (because you are processing the entire file, for example), then just use DownloadToStream and wrap that in a StreamReader, as sketched below.
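For illustration, a rough sketch of that full-download variant, reusing the blob reference from the code above and assuming the file fits comfortably in memory:
// Single streaming GET: pull the whole blob down, then read it line by line.
using (MemoryStream memoryStream = new MemoryStream())
{
    blob.DownloadToStream(memoryStream);   // downloads the entire blob in one request
    memoryStream.Position = 0;             // rewind before reading

    using (StreamReader reader = new StreamReader(memoryStream))
    {
        while (!reader.EndOfStream)
        {
            Console.WriteLine(reader.ReadLine());
        }
    }
}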
Note: I tried this with the 1.7 SDK; I'm not sure in which SDK version these options were first introduced.
This is the code I used to fetch a file line by line. The file was stored in Azure Storage, using the File service rather than the Blob service.
//https://docs.microsoft.com/en-us/azure/storage/storage-dotnet-how-to-use-files
//https://<storage account>.file.core.windows.net/<share>/<directory/directories>/<file>
public void ReadAzureFile() {
    CloudStorageAccount account = CloudStorageAccount.Parse(
        CloudConfigurationManager.GetSetting("StorageConnectionString"));
    CloudFileClient fileClient = account.CreateCloudFileClient();
    CloudFileShare share = fileClient.GetShareReference("jiosongdetails");
    if (share.Exists()) {
        CloudFileDirectory rootDir = share.GetRootDirectoryReference();
        CloudFile file = rootDir.GetFileReference("songdetails(1).csv");
        if (file.Exists()) {
            using (var stream = file.OpenRead()) {
                using (StreamReader reader = new StreamReader(stream)) {
                    while (!reader.EndOfStream) {
                        Console.WriteLine(reader.ReadLine());
                    }
                }
            }
        }
    }
}
In case anyone finds themselves here, the Python SDK for Azure Blob Storage (v12) now has the simple download_blob() method, which accepts two parameters: offset and length.
Using Python, my goal was to extract the header row from (many) files in blob storage. I knew the locations of all of the files, so I created a list of the blob clients - one for each file. Then, I iterated through the list and ran the download_blob method.
Once you have created a BlobClient (either directly via connection string or using the BlobServiceClient.get_blob_client() method), just download the first (say) 4 KB to cover any long header rows, then split the text on the end-of-line character ('\n'). The first element of the resulting list will be the header row. My working code (just for a single file) looked like:
from azure.storage.blob import BlobServiceClient

MAX_LINE_SIZE = 4096  # You can change this..

my_blob_service_client = BlobServiceClient(account_url=my_url, credential=my_shared_access_key)
my_blob_client = my_blob_service_client.get_blob_client('my-container', 'my_file.csv')

# BlobClient does not expose the size directly; read it from the blob properties
file_size = my_blob_client.get_blob_properties().size
offset = 0
You can then write a loop that downloads the text line by line: take the text up to the first end-of-line, advance the byte offset past it, and fetch the next MAX_LINE_SIZE bytes. For optimum efficiency it would be nice to know the maximum length of a line; if you don't, guess a sufficiently large value.
while offset < file_size - 1:
    # download_blob() returns a StorageStreamDownloader; readall() gives the raw bytes
    next_text_block = my_blob_client.download_blob(offset=offset, length=MAX_LINE_SIZE).readall().decode('utf-8')
    line = next_text_block.split('\n')[0]
    offset += len(line) + 1  # skip past this line and its newline (assumes one byte per character)
    # Do something with your line..
Hope that helps. The obvious trade-off here is network overhead: each call for a line of text is not fast, but it achieves your requirement of reading line by line.