Writing from a blob to a file

  • Question

I've been trying to find a solution to this for quite some time and have searched all over the MSDN forums, so forgive me if this has already been answered.

     

I've currently got a program written that takes a .csv file and adds each row as a new queue item. For the sake of simplicity, let's just say it's just 100 rows of numbers. The worker then takes a row off the queue, calculates the average of all the numbers in that row, and returns the value (as a MemoryStream) to be saved in my blob (using blob.UploadStream). So after the program runs, I have 100 .txt files in my blob container.
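For reference, the enqueue side of that pipeline might look roughly like the sketch below; the queue name and method names are made up, and it assumes the classic Microsoft.WindowsAzure.StorageClient library:

```csharp
// Sketch only: split the input .csv into rows and enqueue one message per row.
// Queue name and class/method names here are illustrative, not from the original code.
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class CsvEnqueuer
{
    public static void EnqueueRows(string csvPath, string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudQueue rows = account.CreateCloudQueueClient().GetQueueReference("rows");
        rows.CreateIfNotExist();

        // One queue message per CSV row; each worker instance processes rows independently.
        foreach (string row in File.ReadAllLines(csvPath))
        {
            rows.AddMessage(new CloudQueueMessage(row));
        }
    }
}
```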

     

    What I'd like to be able to do is somehow have all 100 .txt files saved as one large .csv file in the blob. Ideally, I want to be able to take the same file and repeatedly write to it. I've tried various other methods but nothing seems to quite work right. Efficiency is kind of a big deal because my actual .csv file has about 4500 rows.

     

    Thanks!

    Monday, August 16, 2010 4:12 PM

Answers

  • Why don't you:

    1) save each csv file in blob storage and insert an entry in the queue for each file.

    2) have workers processing the queue one message (and blob) at a time

    3) have a single worker create the merged blob once all the math processing is completed
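A minimal sketch of step 3, the merge pass, assuming the Microsoft.WindowsAzure.StorageClient library and illustrative container names:

```csharp
// Sketch only: a single worker downloads every per-row result blob and
// concatenates them into one .csv blob. Container/blob names are hypothetical.
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class ResultMerger
{
    public static void MergeResults(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer results = client.GetContainerReference("results");

        var merged = new StringBuilder();
        foreach (IListBlobItem item in results.ListBlobs())
        {
            // Assumes the container holds only the per-row .txt blobs (no virtual directories).
            CloudBlob blob = (CloudBlob)item;
            merged.AppendLine(blob.DownloadText().TrimEnd());
        }

        CloudBlobContainer output = client.GetContainerReference("output");
        output.CreateIfNotExist();
        output.GetBlobReference("merged.csv").UploadText(merged.ToString());
    }
}
```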

    • Proposed as answer by Joe Giardino Friday, July 8, 2011 6:18 PM
    • Marked as answer by Brad Calder Monday, March 12, 2012 9:21 AM
    Monday, August 16, 2010 8:04 PM
    Answerer

All replies

Do you need support for multiple instances of your worker role? If so, is it possible to have each instance monitor its own queue with a single input csv file going into the queue?

Another alternative (because I know nothing about the solution you're working on) is uploading the .csv to blob storage and then simply dropping a single message into the queue that tells the worker to retrieve and process that file.
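For that second alternative, the hand-off could be as simple as this sketch (illustrative names, again assuming the Microsoft.WindowsAzure.StorageClient library):

```csharp
// Sketch only: upload the whole .csv to blob storage, then drop a single queue
// message whose body is just the blob name, so a worker can fetch and process it.
using System.IO;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class InputHandOff
{
    public static void Submit(string csvPath, string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);

        CloudBlobContainer input = account.CreateCloudBlobClient().GetContainerReference("input");
        input.CreateIfNotExist();
        string blobName = Path.GetFileName(csvPath);
        input.GetBlobReference(blobName).UploadFile(csvPath);

        CloudQueue work = account.CreateCloudQueueClient().GetQueueReference("work");
        work.CreateIfNotExist();
        work.AddMessage(new CloudQueueMessage(blobName)); // worker downloads and processes the file
    }
}
```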

    Monday, August 16, 2010 5:58 PM
  • I do have multiple instances of my worker, but I guess I don't understand either possible solution you are suggesting.


    For the first one, do you mean that I would make one queue for each of my workers? I guess I'm not sure how doing this would help me, but I'm probably just missing the point.

     

    And I don't think the 2nd idea would work because I need each row to act independently, and having just one worker process a whole .csv file would take a long time.

     

I appreciate your response, but I'm not sure it's exactly what I'm looking for. All I need is a way to get all my processed data from the worker roles to a single .csv file.

    Monday, August 16, 2010 6:28 PM
The problem will be aggregating the data into a single file with multiple writers and no shared disk instance (which is how we usually manage the file locks for this type of work). You've also got a potential issue in that there's no guarantee that your rows will be processed in FIFO order (I've run tests, and it doesn't always work that way).

So IMHO, your best solution (provided that the FIFO hiccup doesn't affect you) is to have worker roles process each row independently and then send the results to another queue. This output queue is then read by a single worker role process, which appends them all to a single file and saves that file to Azure storage.

The saving of the file to Azure Storage can be done automagically using Azure Diagnostics to save you a few steps. Additionally, the aggregation process could actually be done by one of your worker roles as a background process that runs automatically when the primary input queue is empty, thus allowing you to use the sleep cycles to process the aggregation of the file out of band.
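A sketch of that output-queue aggregation loop, with illustrative names and the same assumed Microsoft.WindowsAzure.StorageClient library:

```csharp
// Sketch only: a single aggregator drains the output queue, appends each processed
// row to a local file, and uploads the finished file to blob storage once done.
using System;
using System.IO;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public static class Aggregator
{
    public static void Run(string connectionString, int expectedRows)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudQueue output = account.CreateCloudQueueClient().GetQueueReference("output");
        output.CreateIfNotExist();

        string localPath = Path.Combine(Path.GetTempPath(), "merged.csv");
        int written = 0;

        while (written < expectedRows)
        {
            CloudQueueMessage msg = output.GetMessage();
            if (msg == null) { Thread.Sleep(1000); continue; } // queue momentarily empty

            File.AppendAllText(localPath, msg.AsString + Environment.NewLine);
            output.DeleteMessage(msg);
            written++;
        }

        // One final upload once every row has arrived.
        CloudBlobContainer final = account.CreateCloudBlobClient().GetContainerReference("final");
        final.CreateIfNotExist();
        final.GetBlobReference("merged.csv").UploadFile(localPath);
    }
}
```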

    Still not sure if I get exactly what you're trying to do, but hopefully these suggestions are at least more helpful. :) 

    Monday, August 16, 2010 6:39 PM
This is much more helpful. FIFO doesn't matter, so I like this idea better. My only concern is the speed and time this will take. My actual .csv file consists of about 4500 rows. I was hoping to be able to implement something right into my current code so it can be done by each worker as it finishes processing its row.

     

    Basically, what my code does is this:

1. Gets .csv input from the user. The file consists of one column that labels the row and 60 columns of numbers.

2. Splits the .csv into rows and adds each row to the queue as a message.

3. A worker role grabs a message and performs some math on it.

4. The row label and math results are returned as a memory stream.

5. The returned value is uploaded to blob storage as a .txt file.

Repeat 3-5 for all 4500 rows (roughly sketched below).
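Steps 3-5 on the worker side could look something like this sketch (illustrative names, assuming the Microsoft.WindowsAzure.StorageClient library):

```csharp
// Sketch only: take one row message off the queue, average the 60 numbers,
// and upload "<label>,<average>" as a per-row .txt blob. Names are hypothetical.
using System.Linq;
using Microsoft.WindowsAzure.StorageClient;

public static class RowWorker
{
    public static void ProcessOne(CloudQueue rows, CloudBlobContainer results)
    {
        CloudQueueMessage msg = rows.GetMessage();
        if (msg == null) return; // nothing to do right now

        string[] fields = msg.AsString.Split(',');
        string label = fields[0];
        double average = fields.Skip(1).Select(double.Parse).Average();

        results.GetBlobReference(label + ".txt").UploadText(label + "," + average);
        rows.DeleteMessage(msg);
    }
}
```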

     

The goal would be to write to a file sometime during steps 3-5. I know there are limitations on what I can do with multiple workers, but I also need this thing to be able to run somewhat fast. Does that make more sense?

    Monday, August 16, 2010 7:17 PM
  • The issue is the simultaneous write of the results to a single blob. For this, you've got to have some type of locking or collision detection in place. And this is going to create a bottleneck however you do it. So the only option I see is finding a way to create that final blob/file as efficiently as possible.
    Monday, August 16, 2010 8:04 PM