Azure Blob Storage - ListBlobsSegmented page fragmentation

  • Question

  • Hello -

    We've been experiencing some odd behavior lately with the ListBlobsSegmented command in the Azure Blob Storage API.  For background, our application uses Azure Blob Storage as a file drop-off location between two applications.  Once the first application finishes processing, it drops an output file into Blob Storage, which a second application picks up for file delivery.  This delivery application runs ListBlobsSegmented against a blob directory reference to see whether there are files to process.  Once files are delivered, they are deleted from Blob Storage.

    Lately we've seen an increase in volume and have noticed issues with what ListBlobsSegmented reports back.  We first started seeing 0 results returned along with a non-null continuation token.  Interestingly, when we passed that continuation token back into the command, results were then returned.  What's odd (and problematic) is that even though I always request 1000 files, the number of results returned varies.  In most cases I seem to receive only a fraction of what was requested, even though there are over 10K files in the blob container waiting to be processed.

    I read some articles that attributed this behavior to garbage collection: when files are deleted from Blob Storage, they still exist until garbage collection takes place.  The issue tends to become more apparent whenever we need to process a large batch of files.  Unfortunately, this GC process is not something we have control over, and we seem to be at the mercy of Azure's scheduler.

    Below is an example of the code I run to search for files to process.  Does anyone have ideas for ways to work around this problem?

    // Requires: using System; using System.Linq;
    //           using Microsoft.WindowsAzure.Storage;
    //           using Microsoft.WindowsAzure.Storage.Blob;

    private CloudStorageAccount mStorageAccount;

    void Main() {
        mStorageAccount = CloudStorageAccount.Parse("...");
        var blobClient = mStorageAccount.CreateCloudBlobClient();
        var container = blobClient.GetContainerReference("...");
        var dir = container.GetDirectoryReference("...");

        int tokenDepth = 0;
        BlobContinuationToken token = null;

        while (true) {
            Console.WriteLine(String.Format("Current Token Depth: {0}", tokenDepth));
            // Request up to 1000 blobs per segment; a segment can come back
            // empty and still carry a non-null continuation token.
            var blobs = dir.ListBlobsSegmented(false, BlobListingDetails.None, 1000, token, null, null);

            if (blobs.Results.Count() > 0) {
                // Stop at the first non-empty segment.
                Console.WriteLine(String.Format("Blob File Count: {0}", blobs.Results.Count()));
                break;
            } else if (blobs.ContinuationToken == null) {
                // A null token means the listing is exhausted.
                break;
            }
            token = blobs.ContinuationToken;
            tokenDepth++;
        }
    }

    // Unused helper exposing a blob client.
    private CloudBlobClient AzureBlobClient {
        get {
            return mStorageAccount.CreateCloudBlobClient();
        }
    }
    
    

    Monday, July 10, 2017 7:14 PM

All replies

  • To clarify: which version of the Windows Azure Storage client library are you using?

    Tuesday, July 11, 2017 10:06 AM
  • I'm using version 8.1.4
    Tuesday, July 11, 2017 1:04 PM
  • Try following the continuation tokens until there are none left, as the List Blobs specification describes.  See the sketch below.
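    A minimal sketch of that pattern, assuming the same CloudBlobDirectory reference (dir) and the 8.x client library from the original post: feed the returned token back in until it comes back null, and treat empty segments as normal.

    // Drain every segment; requires using System.Collections.Generic;
    BlobContinuationToken token = null;
    var allBlobs = new List<IListBlobItem>();
    do {
        var segment = dir.ListBlobsSegmented(false, BlobListingDetails.None, 1000, token, null, null);
        allBlobs.AddRange(segment.Results);   // a segment may legitimately be empty
        token = segment.ContinuationToken;    // null only once the listing is complete
    } while (token != null);
    Console.WriteLine(String.Format("Total blobs listed: {0}", allBlobs.Count));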

    Tuesday, July 11, 2017 6:24 PM
  • I tried doing this, but the results returned were extremely odd.  For example, if I executed the ListBlobs command, it took a minute or two to return the full directory listing of several thousand files.  When I changed the code to keep iterating through the container using continuation tokens, it seemed to iterate endlessly, and the number of files (added up) was more than what was in the container.
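    One way to sanity-check the count despite possible duplicate entries across segments (a sketch, assuming the same dir reference as above, not something from this thread) is to de-duplicate by blob URI while draining the tokens:

    // Sketch: count distinct blobs even if segments overlap while
    // deletes are in flight.  Requires using System.Collections.Generic;
    var seen = new HashSet<Uri>();
    BlobContinuationToken token = null;
    do {
        var segment = dir.ListBlobsSegmented(false, BlobListingDetails.None, 1000, token, null, null);
        foreach (var item in segment.Results) {
            seen.Add(item.Uri);   // HashSet silently ignores duplicates
        }
        token = segment.ContinuationToken;
    } while (token != null);
    Console.WriteLine(String.Format("Distinct blobs: {0}", seen.Count));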

    Since my original posting, the GC process must have run, and the container is now back to normal.  My concern, though, is: are we going to experience this same behavior whenever we have a large number of files to process?

    My other question is: is Blob Storage the right solution for storing many small files for a short period of time?  We've been on this technology for a while and really had no problems like this until recently.  Since we first developed our application, new services such as Azure File Service have come out.  If that service, or another, is better suited for temporary storage of small files, then perhaps we would consider migrating.  It's odd, though, that we've not had an issue like this until recently, and I'm wondering whether any tweaks have been made that may have impacted how our service works.

    Wednesday, July 12, 2017 1:37 PM
  • Hi,

    ListBlobs merely follows the continuation tokens under the hood, so the behavior of the two should not differ.

    Yes, you should expect this to occur occasionally, depending on how the data is represented on the backend.  It need not be related to GC, either; it has to do with how the data is partitioned.  File storage could be a good solution for you, depending on your throughput needs, but Blob Storage is a fine solution otherwise.
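    For comparison, a sketch of the lazy enumeration (again assuming the dir reference from the original post): ListBlobs returns an IEnumerable that issues the same segmented requests as it is enumerated, so its timing and results should match a manual token loop.

    // Enumerating pages through segments under the hood.
    foreach (IListBlobItem item in dir.ListBlobs(false, BlobListingDetails.None, null, null)) {
        Console.WriteLine(item.Uri);
    }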

    Thanks, Peter

    • Proposed as answer by Md Shihab Sunday, July 16, 2017 12:14 PM
    Wednesday, July 12, 2017 4:12 PM