none
Why is this social network query so slow in Azure Table Storage?

    Question

  • I am developing a simple social networking application and want to use Azure Table Storage because it is cheap and is supposed to scale very well. I’ve tried to follow the guidance in this article to design the tables and queries (https://msdn.microsoft.com/en-us/library/azure/hh508997.aspx). However, when I load the table with millions of microblog records, the query takes more than 5 minutes. Here are my tables and the query:

    Users Table
    PartitionKey - GUID
    RowKey – Email address
    FirstName
    LastName
    ThumbnailImageUrl

    Friends Table 
    PartitionKey – Email address of user1
    RowKey – Email address of user2
    Timestamp

    Microblogs Table
    PartitionKey - DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks
    RowKey – Email address of user
    Timestamp
    Text – Text of message
    FirstName – First Name of user
    LastName – Last Name of user
    ThumbnailImageUrl – Thumbnail of user

    Code snippet with the queries for the microblog:

    public Social GetMicroblogs(string userEmail)
    {
    List<string> lstFriends = null;
    lstFriends = context.CreateQuery<Friends>(Friends)
                                                        .Where(Friends => (Friends.PartitionKey.Equals(userEmail)))
                                                        .AsTableServiceQuery(context).Execute()
                                                        .Select(Friends => Friends.RowKey)
                                                        .ToList();
    lstFriends.Add(userEmail);

    if (lstFriends!= null)
    {
                  var microblogTableQuery = new TableQuery();
    var microblogCondition = string.Empty;
                  foreach (var Friends in lstFriends)
    microblogCondition += "RowKey eq '" + Friends + "' or ";
    microblogCondition = microblogCondition.Remove(microblogCondition.Length - 3);

    var microblogDynamicTableEntities = microblogTable.ExecuteQuery(microblogTableQuery.Where(microblogCondition))
                                                                            .Skip((pageIndex - 1)*pageSize).Take(pageSize)
                                                                            .ToList();
    }
    }

    Logic
    Each time two users become friends, the software will add a record for each user in a Friends table. For example, if user1 connects with user2, then two records are added to the table, as follows: Record 1 would have PK = user1 and RK = user2, Record 2 would have PK = user2 and RK = user1.

    I chose “DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks” as the partition key in the Microblogs table so that the query results are automatically sorted in chronological order with the newest microblogs first. I do not need to use entity group transactions, and since I will have a large number of microblogs I assumed that having small partitions will enable the queries to be load balanced across multiple servers.

    In my query above, I first get the list of friends and then use that as the filter for the query on the microblog table.

    Question
    My question is, why is this query so slow? Based on the article referenced above, I expected it to achieve the fastest results?




    • Edited by Xethnic Friday, May 22, 2015 10:33 PM
    Friday, May 22, 2015 10:27 PM

Answers

  • Hi Xethnic,

    Yes, if you only specify the filters for the PartitionKey but don't specify a RowKey, the query will return 1,000 entities for the first user and none of the entities for the other users. The query result is sorted firstly in the PartitionKey order, then in RowKey order.

    Besides the requirement to query all microblogs of one user, do you have another requirement to query a particular number of "all users' latest microblogs"? If so, please consider creating two tables, one with user ID PK and timestamp RK, and one with timestamp RK and user ID RK. When a microblog is posted, please add one entity to both tables respectively. The advantage of this design is you can support both query effectively, but the disadvantage is the cost of removing an user will be pretty high.

    If you are concerning about the consistency issue while inserting entities to two tables, please refer to "Eventually Consistent Transaction Pattern" in the following article.

    https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#eventually-consistent-transactions-pattern

    Best Regards,

    Zhaoxing Lu

    Thursday, May 28, 2015 7:02 AM

All replies

  • Hi Xethnic,

    Thanks for your post!

    The tables and the query you have pasted seems to be right in place. Not sure, why the query is taking a long time.

    Since I'm not very good with this topic, I shall involve someone from the team who is an expert in this matter.

    Meanwhile you may refer the link below which might be helpful to you.

    http://stackoverflow.com/questions/28129884/model-simple-social-network-in-azure-table-service

    Regards,

    Sadiqh

    Saturday, May 23, 2015 3:10 PM
    Moderator
  • Thanks, Sadiqh. I really appreciate your help. That post on Stackoverflow is also mine. Before I did the design, I sought advice on this forum and Stackoverflow. The Stackoverflow response gave me some good ideas that led to the current design, so I posted some comments hoping the original contributor might have some good feedback. -- Tim
    Sunday, May 24, 2015 11:16 AM
  • Hi Xethnic,

    The query you're using for microblog table will actually trigger a whole table scan since you only provide a fixed RowKey but no PartitionKey to the query. Please consider changing your microblog table design, use UserID as PartitionKey and use "DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks" as RowKey so that this query will only trigger a partition scan.

    For further information of Azure Table Storage and how to well design tables, please refer to:

    http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx

    http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/

    If you have any further questions, please feel free to ask me here.

    Best Regards,

    Zhaoxing Lu

    Tuesday, May 26, 2015 2:03 AM
  • Thank you, Zhaoxing Lu. I expect each user to have more than 1,000 entities in the microblogs table. In the case of this design, if I only specify the filters for the PartitionKey but don't specify a RowKey, will the query return 1,000 entities for the first user and none of the entities for the other users? I hadn't tried it yet, but that was what I expected based on the documentation. -- Tim
    Tuesday, May 26, 2015 4:58 PM
  • Hi Xethnic,

    Yes, if you only specify the filters for the PartitionKey but don't specify a RowKey, the query will return 1,000 entities for the first user and none of the entities for the other users. The query result is sorted firstly in the PartitionKey order, then in RowKey order.

    Besides the requirement to query all microblogs of one user, do you have another requirement to query a particular number of "all users' latest microblogs"? If so, please consider creating two tables, one with user ID PK and timestamp RK, and one with timestamp RK and user ID RK. When a microblog is posted, please add one entity to both tables respectively. The advantage of this design is you can support both query effectively, but the disadvantage is the cost of removing an user will be pretty high.

    If you are concerning about the consistency issue while inserting entities to two tables, please refer to "Eventually Consistent Transaction Pattern" in the following article.

    https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#eventually-consistent-transactions-pattern

    Best Regards,

    Zhaoxing Lu

    Thursday, May 28, 2015 7:02 AM
  • Hi Zhaoxing Lu,

    I do not need to show all microblogs for all users. My requirement is to show a particular number of microblogs for the logged in user and his friends. If I use the user ID for PK and the timestamp for RK, how do I query to get only a particular number of the microblogs for the user and his friends?

    I can build up a filter of the user IDs for the user and his friends and pass that in to filter on PK. I think if each user has more than 1,000 entities then I will only get the results for the first user. What I can't figure out is how to get the latest 20 microblogs across all of the partitions, in chronological order with the latest microblogs first. Then I need to query again to get the next 20 microblogs in chronological order and so forth (i.e. implement progressive rendering).

    Can you think of a way to accomplish that?

    Best Regards,

    Tim

    Friday, May 29, 2015 2:35 PM
  • Hi Xethnic,

    As far as I am concerned, you problem is more likely what a relationship database can solve. If you still would like to choose Azure Storage Table (which is actually a non-SQL database), a solution I can think of is shown below:

    Since you would like to implement segmented query inside a friend circle efficiently, the microblogs of the user and his friend should have the same partition key, and only those microblogs are in this partition. Therefore, you need to maintain a MicroblogInFriendCircle table, whose PartitionKey is a user Id, and the RowKey is timestamp of the corresponding microblog. While a microblog is posted, you should add the microblog into the poster and his friends' partitions. In addition, if you'd like to avoid microblog data duplication, please consider maintain a dedicated Microblog table whose PartitionKey is poster user Id and RowKey is microblog Id, so that you can just add microblog Id into MicroblogInFriendCircle as a reference.

    When you want to query the microblogs of an user and his friends chronologically, just specify a filter "PK=UserId" and ".Take(20)" to MicroblogInFriendCircle table, and leverage the TableContinuationToken to fetch table entities segment by segment. And if what you stored in MicroblogInFriendCircle table is just microblogs references, you need to get the real microblog from Microblog table by issuing individual 20 requests (which can be done in parallel of course).

    A summary: this solution can achieve best performance for your requirement to query the microblogs of an user and his friends chronologically, because it's a partition scan (rather than whole table scan) which leverages the RowKey order perfectly. However, it's not a good solution honestly. Besides complexity, the solution has another disadvantage you should note: if the friend list of an user is changed, you have to maintain the MicroblogInFriendCircle table with lots of adding/deleting operations, which is pretty costly. And again, you have to leverage "Eventually Consistent Transaction Pattern" to achieve data consistency.

    Maybe other guys can think of a better table design for your question, the solution mentioned above is just my personal idea. :)

    Best Regards,

    Zhaoxing Lu

    • Proposed as answer by Zhaoxing Lu Monday, June 1, 2015 8:09 AM
    • Unproposed as answer by Zhaoxing Lu Monday, June 1, 2015 8:09 AM
    • Proposed as answer by Zhaoxing Lu Wednesday, June 17, 2015 1:57 AM
    Sunday, May 31, 2015 4:34 AM