Selecting from many partition keys in Azure Table Storage

    Question

  • In Azure Table Storage, I store news updates, and the partition key is the keyword assigned to a news category, e.g. "politics", "sports", etc.

    When the user logs in, I want to be able to select records based on the user's interests -- which are persisted in another database. So the user may be interested in "politics" and "sports". Clearly, we may have a large number of categories that the user is interested in. It could be 20+ categories.

    How can I query my table so that I can get any news updates in those categories?

    Typically, in a NoSQL database, the solution is to de-normalize, but in this case that would be a terrible alternative. Say I use the user's Id as my partition key. If I have 1 million users who are interested in sports updates, it would make no sense to create 1 million copies of the same record under different partition keys just so that each user can easily get their own updates.

    How do I handle this scenario using Azure Table Storage?


    Thanks, Sam


    • Edited by imsam67 Monday, June 20, 2016 5:27 PM
    Monday, June 20, 2016 4:34 PM

Answers

  • Hi Sam - It really depends on what your requirements are for storing and accessing your data. Across the spectrum of BLOB, NoSQL and relational data stores there are tradeoffs in performance and scale, complexity, query capability, and cost.

    As a basic (not comprehensive) comparison: NoSQL stores offer Internet scale, flexible data models and simplicity, and are more cost effective. Relational stores offer a relational data model, rich and complex query capabilities, ACID transactions and server-side procedures, and are traditionally more expensive.

    Your solution could be implemented effectively using either a NoSQL or a relational store; it just depends on the tradeoffs you are willing to make for the capabilities each store offers.

    • Marked as answer by imsam67 Wednesday, July 6, 2016 3:18 PM
    Wednesday, July 6, 2016 12:35 AM

All replies

  • Hi Sam,

    What if your User table contained a favorite-categories column (comprising an array of partition keys, for example)? That way you would only need to get one User record and then select each category (partition key) linked to that user. If you need to do this for many users frequently, Redis Cache might be a good candidate to prevent unnecessary trips to Table Storage, as it is capable of holding new news items as well.
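The Redis suggestion above amounts to a cache-aside lookup per category. A minimal Python sketch (names are illustrative: `cache` is assumed to expose redis-py-style `get`/`set`, and `fetch_from_storage` stands in for whatever Table Storage query you actually run):

```python
import json

def get_news(cache, fetch_from_storage, category, ttl=300):
    # Cache-aside: try the cache first, fall back to Table Storage,
    # then populate the cache so subsequent users for the same
    # category skip the storage trip until the entry expires.
    key = f"news:{category}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    items = fetch_from_storage(category)
    cache.set(key, json.dumps(items), ex=ttl)
    return items
```

With many users sharing a few hundred categories, the cache hit rate should be high, which is what makes the per-user fan-out affordable.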

    Of course, whether this makes a usable solution would depend on many variables.

    From a cost perspective, storing redundant data could be a sensible solution as well, since putting data in other places (like a relational DB) sometimes just doesn't make sense money-wise. I guess simplicity should be the main factor in determining a good choice.

    Hope that was a little bit useful. Let us know which factors prevent this from being a realistic approach.


    Please mark the question as answered to let others know about it.

    Monday, June 20, 2016 10:15 PM
  • Hi Valery,

    Thanks for your response.

    The challenging part is the limit on the number of conditions I can specify in a SELECT statement against Azure Table Storage. A typical user could follow as many as 30, 40 or even more categories.

    If I were to use the category as my PartitionKey, I couldn't possibly specify that many conditions in a single query to read news updates across all the categories the user is interested in.
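To make the limit concrete: such a query would have to be built as a chain of OR'd PartitionKey comparisons in the OData $filter, and the Table service documents a cap on discrete comparisons per filter (15 at the time). A minimal sketch (the helper name is made up for illustration):

```python
def build_category_filter(categories):
    # ORs together one PartitionKey comparison per category.
    # Table Storage caps the number of discrete comparisons per
    # $filter, and a cross-partition OR degrades to a table scan
    # anyway, so this approach breaks down for 20+ categories.
    return " or ".join(f"(PartitionKey eq '{c}')" for c in categories)

print(build_category_filter(["politics", "sports"]))
# (PartitionKey eq 'politics') or (PartitionKey eq 'sports')
```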

    Another option I considered is to run parallel queries, but I really don't feel that's a simple and robust solution for such a simple scenario.

    Achieving what I want to do with Azure SQL Database would be very easy so I'm considering that now.

    I really would love to use Azure Table Storage in this scenario -- due to performance and scalability -- but if I can't figure out a way to run a single query that returns results across 30+ partition keys, I may have to go to Azure SQL Database.


    Thanks, Sam


    • Edited by imsam67 Monday, June 20, 2016 11:08 PM
    Monday, June 20, 2016 11:07 PM
  • Just a quick update. I have not been able to come up with a nice way to handle this in Azure Table Storage and I'll be implementing my solution in an Azure SQL Database.

    Thanks, Sam

    Thursday, June 23, 2016 5:08 AM
  • Sam - To clarify: you have three different Azure Tables: Users, Interests and NewsUpdates. The NewsUpdates table has a PK which is the category. Is there a Row Key for this table? What are the PKs/RKs for the other two tables? Please confirm that I have your data model correct.

    Thank you!

    Thursday, June 30, 2016 11:04 PM
  • Gus,

    Thanks for your response.

    News categories are stored in SQL Server and cached in Redis because the number of categories is in the hundreds, i.e. not too many.

    Users are stored in DocumentDb, and their interests are stored in the User object as well, which looks something like this:

    {
       "id": "6b923507-76c2-4d20-a577-3d88e5438602",
       "firstName": "John",
       "lastName": "Smith",
       "interests": [
          "politics", "rockclimbing", "nodejs"
       ]
    }

    In Azure Table Storage (ATS), I have a news updates table with the category as the PK. The RK in this table is a human-readable date/time value followed by a unique GUID, like this: 20160630225733:e3b69f61-1c06-4346-8ad0-b542d6be7514

    So, a typical news update entity in ATS looks like this:

    PK: politics
    RK: 20160630225733:e3b69f61-1c06-4346-8ad0-b542d6be7514
    NewsUpdate: "What Brexit means for the United States"

    The reason for the additional GUID in the RK is to make sure the RK will always be unique. The RK starts with a "time stamp" so that if I want to read only political news updates that came in today, I can run a query with PK: politics and RK starting with 20160630.
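That RowKey scheme and the "starts with" query can be sketched as follows (Python, names illustrative; note Table Storage has no startswith operator, so the prefix query is expressed as the usual half-open range over RowKey):

```python
import uuid
from datetime import datetime, timezone

def make_row_key(now=None):
    # yyyyMMddHHmmss timestamp plus a GUID, matching the format
    # described above: sortable by time, made unique by the GUID.
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d%H%M%S}:{uuid.uuid4()}"

def day_filter(category, day):
    # No "starts with" in Table Storage filters; instead use a
    # half-open RowKey range covering the prefix. '~' sorts after
    # every character these keys actually use (digits, ':', hex, '-').
    return (f"PartitionKey eq '{category}' "
            f"and RowKey ge '{day}' and RowKey lt '{day}~'")

print(day_filter("politics", "20160630"))
```

Because both the PartitionKey and a RowKey range are specified, this executes as an efficient range scan within a single partition rather than a table scan.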

    The challenging part is to pull news updates from those categories that the user is interested in.

    When the user logs in, I get his/her interests from DocumentDb along with other information about the user which is why I use DocumentDb for that. A single read function gives me all I need to know about the user.

    The part where I struggled is getting news updates from ATS in those categories the user is interested in. The user may be interested in 20 or 25 categories, and I cannot run a query against ATS with 20+ conditions -- especially one reading from different partitions.

    My ATS design can be modified, for example, I can put categories in RK but I still couldn't come up with an efficient way to create a query for this scenario.

    I'm currently implementing this solution using Azure SQL Database but I really would love to be able to use ATS.


    Thanks, Sam



    • Edited by imsam67 Friday, July 1, 2016 5:24 AM
    Friday, July 1, 2016 5:11 AM
  • Thanks for the additional context; that helps. Another question: why do you want to keep all of the news updates in one table? Have you thought about having a table per category?
    Friday, July 1, 2016 3:58 PM
  • I could create multiple tables -- which means 100+ tables -- but what would be the benefit of doing that?

    Thanks, Sam

    Friday, July 1, 2016 5:41 PM
  • You definitely should not do a single query that results in a scan skipping a lot of rows -- which is what would happen with one query crossing categories and using a bunch of conditions. De-normalization per user wouldn't make any sense either; it won't scale well with the number of users (you could make it scale, but it'll be quite costly).

    Given the number of categories, you could issue one query per category using the time range of interest. If the user has 10 categories of interest, that's 10 queries, and you can run them in parallel.

    You could also improve performance by adding caching above the Storage tier. You could populate the cache every N minutes with a query, and then have user queries only hit the cache. You could also still query Storage directly, but caching will make it more efficient if you are worried about performance or scale.

    Compared to de-normalization, this will require fewer transactions and a lot less Storage, so the cost should be lower. Caching can solve any latency concerns if you end up querying Storage too much.
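The per-category fan-out can be sketched like this (Python; `fetch_one` is a placeholder for whatever per-category client call you issue, e.g. a query on PartitionKey eq category plus a RowKey time range):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_interests(fetch_one, categories, day):
    # One query per category, issued in parallel; each query is an
    # efficient single-partition read rather than a table scan.
    with ThreadPoolExecutor(max_workers=min(len(categories), 16)) as pool:
        batches = pool.map(lambda c: fetch_one(c, day), categories)
        merged = [item for batch in batches for item in batch]
    # RowKeys begin with a yyyyMMddHHmmss stamp, so sorting on RowKey
    # descending interleaves all categories newest-first.
    return sorted(merged, key=lambda e: e["RowKey"], reverse=True)
```

For 20-30 categories the queries complete in roughly the time of the slowest single-partition read, which is what makes this practical despite the fan-out.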

    Hopefully this helps!

    Friday, July 1, 2016 7:09 PM
  • Gus,

    Again, thank you for your response.

    All of this makes sense, BUT don't you think an Azure SQL Database solution is much more straightforward? I feel that though the multi-table approach in ATS would work nicely, it's a way to get around some of the limitations of Azure Table Storage -- and this is NOT a criticism of ATS. I happen to LOVE ATS for many reasons. It does what it does very, very well.

    I feel that in this scenario Azure SQL Database is much more straightforward -- though potentially more costly.

    I value your opinion which is why I'm asking you. My top priorities are responsiveness and scalability. Cost is the third concern.


    Thanks, Sam

    Friday, July 1, 2016 9:38 PM
  • Hi Sam - It really depends on what your requirements are for storing and accessing your data. Across the spectrum of BLOB, NoSQL and relational data stores there are tradeoffs in performance and scale, complexity, query capability, and cost.

    As a basic (not comprehensive) comparison: NoSQL stores offer Internet scale, flexible data models and simplicity, and are more cost effective. Relational stores offer a relational data model, rich and complex query capabilities, ACID transactions and server-side procedures, and are traditionally more expensive.

    Your solution could be implemented effectively using either a NoSQL or a relational store; it just depends on the tradeoffs you are willing to make for the capabilities each store offers.

    • Marked as answer by imsam67 Wednesday, July 6, 2016 3:18 PM
    Wednesday, July 6, 2016 12:35 AM
  • Thank you Gus!

    Thanks, Sam

    Wednesday, July 6, 2016 3:18 PM