none
4 billion rows on Azure Table Storage

    Question

  • Hello, I've trying to create a database of 4,294,967,296 rows (65536*65536) that are made up of [ID1], [ID2], and [NEWID].

    Both [ID1] and [ID2] can be any number from 0 to 65535.

    [NEWID] is a calculation of [ID1] and [ID2]. The goal is to make a searchable database, search term being [NEWID], returning all the possible combinations of [ID1] and [ID2] that create it. [NEWID] can be any number from 0 to 999999, and is repeated.

    Currently, I'm running an application on my Azure VM that writes to Azure SQL but at its speed it'd be about 3 months before I finish populating the database, bottleneck seems to be on the SQL side. At 39,509,324 rows, there is already starting to be visible lag.

    I was advised to use Azure Table Storage due to the number of rows. Tweaking the same application to push to Azure Table Storage instead, I have the following code:

                for (int id1 = 0; id1 < 65535; id1++)
                {
                    for (int id2 = 0; id2 < 65535; id2++)
                    {
                        int newid = // Calculation happens
                        NewIDEntity NewIDEntry = new NewIDEntity(newid, id1);
                        NewIDEntry.id2 = id2.ToString();
    
                        TableOperation insertOperation = TableOperation.Insert(NewIDEntry);
                        table.Execute(insertOperation);
                    }
                }
    
    ...
    
        public class NewIDEntity : TableEntity
        {
            public NewIDEntity(int newid, int id1)
            {
                this.PartitionKey = newid.ToString();
                this.RowKey = id1.ToString();
            }
    
            public NewIDEntity() { }
    
            public string id2 { get; set; }
        }

    1. Would like to know if this is the correct way to store the entities in Table Storage for my scenario, with NewID being the search term.

    2. How do I get the number of entities? I can't find an equivalent of SELECT COUNT(*).

    3. Is there a better way to generate my data?

    4. When I ran this as a test, after a few minutes I hit the following issue:

    Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (409) Conflict. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
       at System.Net.HttpWebRequest.GetResponse()
       at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteSync[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 677
       --- End of inner exception stack trace ---
       at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteSync[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 604
       at Microsoft.WindowsAzure.Storage.Table.TableOperation.Execute(CloudTableClient client, CloudTable table, TableRequestOptions requestOptions, OperationContext operationContext) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\TableOperation.cs:line 44
       at Microsoft.WindowsAzure.Storage.Table.CloudTable.Execute(TableOperation operation, TableRequestOptions requestOptions, OperationContext operationContext) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\CloudTable.cs:line 52
       at Generator.Program.Main(String[] args) in D:\Generator\Generator\Generator\Program.cs:line 37
    Any idea why this might have happened, and how to prevent it?


    Friday, February 3, 2017 2:18 AM

Answers

  • Hello,

    If you partition the data and stripe it across multiple storage accounts, you can get as much performance as you need from it. You just need another layer to aggregate the data afterwards.

    You could potentially have a slower process that is creating one large master table in the background. You may have found this already, but there is an excellent article about importing large datasets into Azure Tables.

    The problem may be due to code that row key is just ID1, so  probably going to have duplicates.  We can’t have duplicates.  Try something like:

    this.RowKey = id1.ToString() + “|” + id2.ToString();

    Inserting 4 billion entries into table storage will still take a while this way (about 50 days assuming and maintain 1,000 inserts per second, which is high).

    SQL would have a similar problem.  Either way needs to look at some kind of bulk insert/updates (and be careful with table organization), or completely change his approach. 

    For example, could probably build this using a simple bucket scheme (hash table).  Each entry would be 8 bytes of data, assume another 8 bytes of overhead for indexing and the whole thing fits in 64 GB of space.  With a little work, might be able to optimize that and cut it in half.  Either way, try to build this in memory in one of our larger VMs and then you will be able to do millions of inserts per second (assuming he can generate the data that quickly).  Once generated, you could write the whole thing to disk if he wants to persist it.  Properly optimized, lookups could also be blazingly fast.

    You might also look at tools like Azure Data Lake, which are much better designed for handling large amounts of data.  For example, he could just write all the data out as a set of CSV files and then do searches in ADL (or many other big data solution packages).

    Disclaimer: This response contains a reference to a third party World Wide Web site. Microsoft is providing this information as a convenience to you. Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there.

    There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.

    Regards,
    Sumanth BM
    ----------------------------------------------------------------------------------------------
    Kindly click "Mark as Answer" on the post that helps you, this can be beneficial to other community members reading the thread. And Vote as Helpful.


    Friday, February 3, 2017 4:45 PM
    Moderator