Data changing while reading multiple tables to a disconnected DataSet

  • Question

  • I am a newbie when it comes to concurrency issues in data access.  There is a lot of documentation on handling concurrency when updating disconnected data, but what about when you are filling a DataSet with several tables?


    Let's take a simple example.  Suppose we have a database with 2 tables: TblFamily which stores info about each family (e.g. family_id, family_name, etc), and TblMember which holds info about each of the members in each of the families (e.g. person_id, person_name, family_id, etc).  No family can have zero members.


    We want to retrieve info on a particular family and its members to a disconnected DataSet. From what I've read, most books/articles advocate filling the DataSet table by table. So,


    Step 1: read one record from TblFamily to a DataTable in my DataSet. This info includes the family_id.


    Step 2: use the family_id to retrieve a subset from TblMember of the members for just that family, filling a second DataTable in the same DataSet.


    But now suppose the family and its members are deleted from the database between Step 1 and 2.  After step 2, my DataSet will be inconsistent.  I have a family with zero members.


    Granted, in this simple example, I could test for such a case. Or perhaps use a join query, which, from what I understand, would be an atomic operation and so preclude the problem from arising (*Is this true?*).
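    For illustration, the join approach can be sketched in Python with sqlite3 standing in for the real database (the schema here just mirrors the hypothetical tables above). Whether a single statement sees a fully consistent view depends on the database's isolation level, but there is at least no client-side gap between the two reads for a delete to slip into:

    ```python
    import sqlite3

    # In-memory database standing in for the real server; table and column
    # names follow the example in the question.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE TblFamily (family_id INTEGER PRIMARY KEY, family_name TEXT);
        CREATE TABLE TblMember (person_id INTEGER PRIMARY KEY, person_name TEXT,
                                family_id INTEGER REFERENCES TblFamily);
        INSERT INTO TblFamily VALUES (1, 'Smith');
        INSERT INTO TblMember VALUES (10, 'Alice', 1), (11, 'Bob', 1);
    """)

    # One JOIN statement reads the family and its members in a single pass,
    # so there is no gap between "step 1" and "step 2" on the client side.
    rows = con.execute("""
        SELECT f.family_id, f.family_name, m.person_id, m.person_name
        FROM TblFamily f
        JOIN TblMember m ON m.family_id = f.family_id
        WHERE f.family_id = ?
        ORDER BY m.person_id
    """, (1,)).fetchall()

    print(rows)
    ```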


    But what about more complicated situations?  In general, what are the common approaches to handle/prevent data changing in related tables, while you fill a DataSet?


    Can anyone point me to any books/papers/info on this?


    Thanks,


    Dominic

    Wednesday, April 2, 2008 4:21 PM

Answers

  • You can't prevent data from changing in a database that you're not connected to unless you implement some kind of checkout scheme, so that the database knows that a given user owns a piece of data.  And you also need to implement a way to enforce it, so that the code in your application always respects the scheme.

     

    For a simple example, when you first go to fill your family table, you can ensure you retrieve only rows that have members with a JOIN, and ensure that you retrieve only rows that aren't checked out to another user with a WHERE clause, something like:

     

       SELECT * FROM family f

       JOIN (SELECT family_id FROM member GROUP BY family_id) m ON f.family_id = m.family_id

       WHERE f.user_id IS NULL

     

    and then you update the rows you're retrieving to make sure no other user gets them:

     

       UPDATE f SET f.user_id = @user_id FROM family f

       JOIN (SELECT family_id FROM member GROUP BY family_id) m ON f.family_id = m.family_id

       WHERE f.user_id IS NULL

     

    and then you can retrieve the members of families that you have checked out:

     

       SELECT * FROM member m

       JOIN family f ON f.family_id = m.family_id

       WHERE f.user_id = @user_id

     

    Whenever any process wants to delete (or update) a row from the member table, you restrict it from deleting or updating a row checked out to another user with a WHERE clause, e.g.:

     

       DELETE FROM member WHERE member_id = @member_id AND user_id = @user_id

     

    Generally, when you implement a scheme like this, you enforce it by hiding the tables from the data-access layer, and forcing the DAL to go through stored procedures to talk to the database.  This is very important:  if you provide developers with any other way of accessing the tables, they'll use it, and your checkout scheme will break.

     

    Also, typically, the design of the checkout mechanism needs to be more sophisticated than just adding a user ID column to your table.  For instance, what happens when a user checks a row out and never checks it back in?  You don't want the row to be eternally inaccessible.

     

    In an application I'm building, I have a checkout table that I add rows to, and join to from any table that I can check things out from.  Each checkout has an expiration time.  To check out rows:

     

       BEGIN TRANSACTION

     

       INSERT INTO checkout (user_id, expiration_time) VALUES (@user_id, dateadd(mi, 10, getdate()))

       SET @checkout_id = SCOPE_IDENTITY()

     

       UPDATE f SET f.checkout_id = @checkout_id

       FROM foo f

       LEFT JOIN checkout c ON f.checkout_id = c.checkout_id

       WHERE c.checkout_id IS NULL OR c.user_id = @user_id OR c.expiration_time < getdate()

     

       IF @@ROWCOUNT = 0

       BEGIN

          ROLLBACK

          RETURN

       END

     

       SELECT * FROM foo WHERE checkout_id = @checkout_id

     

       COMMIT

     

    (I use the transaction to prevent a row from being added to checkout if no rows are actually checked out.)

     

    When updating rows, you only let the user update checked out rows whose checkout hasn't expired, and you refresh the checkout's expiration time if they're successful.  And again, you wrap it in a transaction, to handle the edge case where the checkout expires between the two commands and another user is checking out rows:

     

       BEGIN TRANSACTION

     

       UPDATE f SET ...

       FROM foo f JOIN checkout c ON f.checkout_id = c.checkout_id

       WHERE f.foo_id = @foo_id AND c.user_id = @user_id AND c.expiration_time > getdate()

     

       IF @@ROWCOUNT > 0

          UPDATE c SET c.expiration_time = dateadd(mi, 10, getdate())

          FROM foo f JOIN checkout c ON f.checkout_id = c.checkout_id

       WHERE f.foo_id = @foo_id AND c.user_id = @user_id

     

       COMMIT

     

    This may seem like a lot to go through in order to accomplish something relatively simple.  The thing is, disconnected data access isn't simple at all.  The optimistic concurrency scheme that ADO.NET implements is as simple as it gets, and there are many scenarios in which it's not good enough.
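    (For readers unfamiliar with that scheme: the generated UPDATE repeats the originally-read values in its WHERE clause and treats zero affected rows as a collision. A minimal single-column sketch of the idea, using Python's sqlite3 as a stand-in; a real ADO.NET data adapter generates this WHERE clause for you across all columns:)

    ```python
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE family (family_id INTEGER PRIMARY KEY, family_name TEXT)")
    con.execute("INSERT INTO family VALUES (1, 'Smith')")

    def optimistic_update(con, family_id, old_name, new_name):
        # The WHERE clause repeats the value originally read, so the UPDATE
        # matches zero rows if another user changed the row in the meantime.
        cur = con.execute(
            "UPDATE family SET family_name = ? "
            "WHERE family_id = ? AND family_name = ?",
            (new_name, family_id, old_name))
        return cur.rowcount == 1  # False signals a concurrency violation

    ok = optimistic_update(con, 1, "Smith", "Smyth")     # row is unchanged since read
    stale = optimistic_update(con, 1, "Smith", "Smythe") # "Smith" is now a stale value
    print(ok, stale)
    ```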

    Wednesday, April 2, 2008 8:07 PM

All replies

  • Thanks for the detailed reply Robert.  It is much appreciated.

    I just find it strange that the literature concentrates on collisions when updating disconnected DataSets, but has not a word on such problems when you first come to fill the DataSet.

    It also seems odd to me that, in an age of design patterns and other models that set out general solutions to common software engineering problems (letting others 'stand on the shoulders of giants' rather than reinvent the wheel), there is scant advanced material describing real-world solutions for ADO.NET and concurrency: just plenty of toy examples.

    Dominic


    Friday, April 4, 2008 2:25 PM
  • For the general problem of dealing with concurrency issues when filling a dataset, the answer's pretty simple:  fill your dataset in a transaction.  It may be that as soon as you commit your transaction, the pending deletes that were stacking up while your transaction was running will delete rows that you now have in your dataset, but your dataset will be an accurate picture of what was in your database the moment you started filling it.
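    In ADO.NET terms that means opening a transaction on the connection (with an isolation level strong enough for your needs, e.g. SERIALIZABLE or SNAPSHOT) and running both Fill calls against it. A minimal sketch of the shape of it, using Python's sqlite3 as a stand-in database and the hypothetical tables from the question:

    ```python
    import sqlite3

    # Autocommit connection so BEGIN/COMMIT can be issued explicitly.
    con = sqlite3.connect(":memory:", isolation_level=None)
    con.executescript("""
        CREATE TABLE TblFamily (family_id INTEGER PRIMARY KEY, family_name TEXT);
        CREATE TABLE TblMember (person_id INTEGER PRIMARY KEY, person_name TEXT,
                                family_id INTEGER);
        INSERT INTO TblFamily VALUES (1, 'Smith');
        INSERT INTO TblMember VALUES (10, 'Alice', 1);
    """)

    # Both reads run inside one transaction, so together they describe the
    # database as of a single point in time; a concurrent delete cannot
    # leave the result with a family but no members.
    con.execute("BEGIN")
    family  = con.execute("SELECT * FROM TblFamily WHERE family_id = 1").fetchall()
    members = con.execute("SELECT * FROM TblMember WHERE family_id = 1").fetchall()
    con.execute("COMMIT")

    print(family, members)
    ```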

    Friday, April 4, 2008 6:27 PM