Answered NoSql Design Question

  • Thursday, August 30, 2012 4:33 PM
     
     

    Hello,

    I have traditional relational database and i want to change the design to suit the table storage of windows azure. after reading a book and some articles i still didn't come across a real life table storage scenario. so i thought of my own way in the design and i want experts here help me in it.

    the scenario is social network where users post on their friends walls ( i call this post "Status") and others comment on each status. so the traditional database had these tables:

    User(ID, Name ... ), Status(FromID, ToID, Subject, ...), Comment(StatusID, Subject ...).

    now to store that in the azure table storage i thought to merge all tables together in one table by using one entity for all like this:

    Entity(ID, Name, StatusTopic, CommentTopic, RowType, ...)

    this entity represents any of the traditional tables. depending on the RowType enum i can recognize if a row is a User, Status or a Comment.

    why i make this design is because i want everything related to a user to be set together in the same partition. so now i partition the table by the ID. when i open a user's wall i will get the related Statuses and Comments from the same partition. and when i want to insert a user for example, i will supply properties for the User (his name, age ... etc) and set the RowType to "User" and omit properties which are not related to identifying the User (StatusTopic, CommentTopic ..etc). when i insert a Status i set the RowType to "Status" and supply the StatusTopic and omit properties which are not related to Status. inserting a Comment is same with an extra property StatusID to link Comments with a specific Status. 

    for searching and varied queries, i want to make copies of rows. for instance, suppose i want to search for a Status by StatusSubject -regardless on which wall this status reside-. if we want to search in such a design our search have to scan all partitioning servers then come back with the result, which would be slow. so i thought to copy each Status to another table which has its own partitioning for such search purposes.

    any problems with such design?

    Thank you.


    Alan-SY





    • Edited by Alan-Sy Thursday, August 30, 2012 8:16 PM
    •  

All Replies

  • Thursday, August 30, 2012 7:56 PM
     
     
    I am talking to myself here :/ !!?

    Alan-SY


    • Edited by Alan-Sy Saturday, September 01, 2012 8:49 AM
    •  
  • Tuesday, September 04, 2012 2:34 AM
     
     Answered

    Some of my thoughts:  I’m not in favor of combining three types of entities together for several reasons: 1) it introduces the problem that you had to use a “type” field to distinguish different record types. 2) Combining the entities seemly to have some benefits to one particular query – to list all status and comments of a user. However in your application I assume there would be other types of queries that would be quite common – such as searching for a user by a user name (like search for friends features in many social sites), find statues that a user has commented on (but not owning them), etc. I think the combined table makes such queries less intuitive.  3) When you display statues along with their comments, you’ll need to group all comments by statuses. Having user ID taking up the partition key, it’s hard to come up with a row key that makes such query/grouping effective.

    If it was up to me, I’d like to keep things simple and straightforward. I’ll have three tables, one for user, one for status, and one of status. Now let’s discuss how to choose partition keys and row keys to make common queries more efficient. For user table, I’d choose name as partition key and id as row key. This may seem backwards but it makes search by name faster – names will be in a clustered index in alphabet order, making scanning partitions faster than the case when ids are used. For status table, I’d actually use a combined key = {user id} +{time stamp} as partition key. You rarely want to load all statuses posted by a user in ALL time. Instead, probably you want to load recent updates first and have the options to load older statuses. With this partition key, you can quickly get statuses for a particular user by the time stamp, whose granularity is of your choice. As for comments table, probably {user id} + {status id} (or {status id} only if it’s globally unique) is a fine choice.

    I was literally typing as I was thinking, so if I misunderstood something, or the thoughts seem less profound, I apologize.

    Hope this helps.

    • Marked As Answer by Alan-Sy Tuesday, September 04, 2012 5:42 PM
    •  
  • Tuesday, September 04, 2012 4:43 AM
     
     

    Hello Friendly dog,

    thank you so much for the answer. hope you bear with me a little.

    let me first talk about the reasons you said for rejecting my design:

    for 1, i totally agree.

    for 2, agree also, but you can always repeat to meet different queries. and in social networks, browsing walls (or threads) is the main operation rather than searching.

    for 3, i don't really understand what you mean. the design is that a user, statuses on his wall, and comments on those statuses are all living together in one partition, partitioned by this user id (or name), and a Guid is there row key. a comment has to supply the property (StatusID). so i group them by this property.

    I agree that your design is better to handle different queries. the problem in it is: when i open a user's wall, to see the most recent 10 statuses for example, each status with the most recent 2 comments, i need to wait these data to be collected from these partitioning servers:

    1. one server holding the user data.

    2. one holding the statuses.

    3. 10 holding the comments.

    so 12 server, plus the transactions will be many. am i right? if i am right is this number of servers acceptable enough to adopt your design? (at the end i want my design to get everything from one server and lessen network communications).

    thank you so much again :).


    Alan-SY


    • Edited by Alan-Sy Tuesday, September 04, 2012 5:06 AM
    •  
  • Tuesday, September 04, 2012 5:16 PM
     
     

    Hi Alan:

    I understand your concern about data spreading to multiple servers – this is why smaller partitions are more favorable than larger partitions. And at a second thought, probably having statuses and comments within one table isn’t that bad, given, as you said, listing statues + comments is most common operation.  As for point 3 – to have time stamp in partition key, let me explain with an example: Let’s say user A signs on to the system on 9/4/2012, to get latest status updates, I’ll look for A-9-4-2012 partition only to start with, and as user requests more I’ll trace back to A-9-3-2012, A-9-2-1021, and so on and so forth. In this case (after we merge status and comment table) all data is from one partition for each query. Of course you can choose other time granularity such as by week instead of by day.

    Hope this helps.

    Regards,

    Friendly Dog

  • Tuesday, September 04, 2012 5:42 PM
     
     

    Hello again,

    I understand your example :). and for the design i will consider your notes and then think further to decide :).

    thank you so much.

    Regards,

    Alan.


    Alan-SY