Possible to bulk insert many records into Azure tables?

Answered Possible to bulk insert many records into Azure tables?

  • 2010年1月8日 21:41
     
     
    I need to import millions of records into an Azure table.   What is the quickest way to do this?  

    The PDC video for Tables show large number of records in their examples. 

    I assume it can only be done 1000 at a time but all the third party tools still do it 1 by 1.  Is there any way to bulk insert?

    Thanks

    Chad

全部回复

  • 2010年1月8日 23:07
    答复者
     
     
    Rob Gillen did a couple of posts describing his experience of adding seriously large numbers of entities. They are here and here.
  • 2010年1月9日 3:45
     
     
    Thanks for your reply Neil.   Rob doesn't really give a lot of info on how he did it, and definitely doesn't provide any code. 

    I think it is unfortunate that the PDC videos purport to show tables with millions of records without giving us any idea how insert/select in a reasonable amount of time.


    I am sure this is the question of a lot of people new to Azure, but what is the point of Azure tables if it is going to be relatively slow and inefficient as compared to SQL db?  I don't see the cost benefit. 

  • 2010年1月9日 6:59
    答复者
     
     
    I think the way to upload large datasets is to upload the data as blobs containing many rows each and use multiple worker role instances to process these blobs inserting the data in batches. This way you at least avoid the Atom feed overhead inserting entities over the internet.

    SQL Azure and Azure Tables support different features. SQL Azure is a relational database currently limited to databases of under 10GB. Azure Tables is a scalable storage service providing cost-effective access to terabytes of data.
  • 2010年1月9日 12:05
     
     已答复
    I need to import millions of records into an Azure table.   What is the quickest way to do this?  

    The PDC video for Tables show large number of records in their examples. 

    I assume it can only be done 1000 at a time but all the third party tools still do it 1 by 1.  Is there any way to bulk insert?

    Thanks

    Chad
    It can be done with batches of 100. As for 3rd party tools: TableXplorer will insert your data in the batch-mode and fall back on 1-by-1 every time a batch fails. You can import your data from a CSV-file (the only limitation being that the type of all columns will be set to string by default) or from a SQL Server database/table (with type information preserved). Make sure though that your data is sorted by the PartitionKey (e.g. by specifying ORDER BY in the SQL import query), because that's how the bulk import is possible in the first place - otherwise the table storage will return an error and the fallback to the 1-by-1-mode will happen.
  • 2010年1月11日 21:20
     
     
    Thank you for your answers.

    I believe I have found my answer here.

    http://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/653c13b1-32aa-45b4-a867-24e0b0eb0572

    I will be testing this out tonight and will let you know how it goes.
  • 2010年1月12日 5:37
     
      包含代码
    I have gotten the bulk import to work using:

    ctx.SaveChanges(SaveChangesOptions.Batch)
    The issue I am having is looping through my local db table with multiple partition keys (of course this is just a normal column, it is a partition key in the Azure table).  

    My logic is as follows:

    1.  Read all rows into a datareader ordered by the columns i use as partition key, row key.
    2.  Loop through datareader using while.
    3.  Based on global/local int variables, I determine if the ctx entity count is at 100
    4.  If it is at 100, i save the batch to the cloud
    5.  Based on global/local string variables, I determine if the partition key has changed
    6.  If it is a change in partition key, i save the batch to the cloud 

    The problem is that I don't know how to delete the records from the context (TableServiceEntity) that have been saved to the cloud. 

    Is there a method to clear the context (TableServiceEntity) ?  Am I approaching this correctly?  

    I really wanted to avoid running #1 above using a where clause for each partition key and/or keeping track of what is written to the cloud with a bit field in my local db tables.

    Thanks for your time.

     

    • 已编辑 Chad.Hensley 2010年1月12日 5:38 edited it to be more clear
    •  
  • 2010年1月12日 5:46
     
     
    I have figured it out. After every save to the cloud, just create a new entity. ctx = New TableContext(tableStorage)
  • 2010年1月21日 17:40
     
     
    Chad,

    I apologize for not replying earlier or having the specific code samples on my site... a task I had intended to get to but simply hadn't yet.

    The way you figured it out is exactly what I'm doing... for each batch of 100  (or partial batch) I'm creating a new table context..

    Thanks - rob
    Rob Gillen
    http://rob.gillenfamily.net
    http://www.csm.ornl.gov/
    http://www.go-planet.com
  • 2010年1月22日 17:09
     
     
    No problem Rob.   Do you have any of your code where you used JSON?
  • 2010年2月9日 19:30
     
     
    so, the JSON interaction was limited to the downloads, and only as a data proxy. If you are interested in this code, it is actually available as a project on codeplex located here: http://ogdi.codeplex.com/ .

    This uses a 3rd-party JSON library also located on Codeplex: http://json.codeplex.com/

    If you were looking for some other part of my test code, please let me know and I'll get that posted for you soon.
    Rob Gillen
    http://rob.gillenfamily.net
    http://www.csm.ornl.gov/
    http://www.go-planet.com