Updating SQL Database using DataSets - best practices

  • Question

  • Greetings,
     
    I have a real-time, high-throughput application that uses in-memory DataSets, based on SQL Server database tables, to perform record keeping, logging, etc.  The database tables must be updated in as close to real time as possible, but my application cannot simply wait for the updates to occur inline, so the method I've devised is as follows:

    On the main application thread, I make changes to the DataSet(s) (I add rows, delete rows, and/or modify values in rows).  Whenever one of these changes occurs, I notify my "SaveToDatabase" thread that a change has occurred (a rough sketch of this main-thread side follows the outline below).

    MainThread:
    1. lock database mutex
      1. perform modification to DataSet
      2. notify SaveToDatabase thread that changes have occurred
    2. unlock database mutex
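
    A minimal C# sketch of that main-thread side, assuming a shared lock object (databaseMutex) and an AutoResetEvent (changesPending) as the notification mechanism -- both names are illustrative, not from my actual code:

    // (requires System.Data and System.Threading)
    // Shared between the main thread and the SaveToDatabase thread -- names are illustrative.
    static readonly object databaseMutex = new object();
    static readonly AutoResetEvent changesPending = new AutoResetEvent(false);

    static void ApplyChange(DataRow row, string column, object newValue)
    {
        // 1. lock database mutex
        lock (databaseMutex)
        {
            // 1.1 perform modification to DataSet
            row[column] = newValue;

            // 1.2 notify SaveToDatabase thread that changes have occurred
            changesPending.Set();
        }
        // 2. mutex is released when the lock block exits
    }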

     

    My SaveToDatabase thread then essentially does the following:

    1. listen for notifications that changes have occurred
      1. lock database mutex
        1. for each table within my DataSet
          1. DataTable changes = nextDataTable.GetChanges();
        2. end for
      2. unlock database mutex
      3. for each DataTable of changes that I created above
        1. if changes != null
          1. TableAdapter.Update(changes);
        2. end if
      4. end for
    2. repeat

     So in the SaveToDatabase thread I only ever lock long enough to get the changes from the DataTables, which allows the MainThread to keep processing additional changes to the DataSet efficiently.

    My question is: is this the ideal way to have changes in my in-memory DataSet reflected to the underlying SQL Server database, or is there a built-in Microsoft technology that would make this easier and/or more efficient?

    Thanks,
    Rick





    Wednesday, February 18, 2009 5:23 PM


All replies

  • Rick,

    I don't know that I can help you with best practices, but I am following a similar path for a medical application and I am finding typed dataset 'best practice' resources to be lacking.

    I am having an issue with the Update method and was wondering if you were encountering this as well:

    Update() does not seem to write changes to the server until the NEXT time it is called. If I change a row and call Update(), nothing is immediately written. If I change another row and call Update(), the previous changes are written, but not the most recent ones. For a real-time application this is not useful.

    I am also curious whether you are using BackgroundWorker threads for your Update processing. I have used them to process queued SQL commands that I generate based on incoming asynchronous data from RFID badges. I generally start the thread at timed intervals to process any SQL command objects that may have been generated between timer events. I store the SQL command objects in a hash table and then use a SyncLock block to prevent the hash table from being modified while the SQL commands are being processed.
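
    A simplified sketch of that pattern, translated to C# with made-up names (my actual code uses VB's SyncLock and a Hashtable of queued commands):

    // (requires System.Collections.Generic and System.Data.SqlClient)
    // Commands generated by the asynchronous RFID handlers accumulate here.
    private readonly Dictionary<Guid, SqlCommand> pendingCommands = new Dictionary<Guid, SqlCommand>();
    private readonly object commandLock = new object();

    // Timer callback: drain and execute whatever accumulated since the last tick.
    private void ProcessPendingCommands(object state)
    {
        List<SqlCommand> batch;
        lock (commandLock)                      // the SyncLock equivalent
        {
            batch = new List<SqlCommand>(pendingCommands.Values);
            pendingCommands.Clear();
        }

        using (SqlConnection connection = new SqlConnection(connectionString))  // connectionString defined elsewhere
        {
            connection.Open();
            foreach (SqlCommand command in batch)
            {
                command.Connection = connection;
                command.ExecuteNonQuery();
            }
        }
    }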

    I don't know if this is 'best practice', but it has been working for me for about 2 years in an app that runs 24-7.
    Wednesday, February 18, 2009 8:54 PM
  • BGV,

    I do not see the same behavior as you with the Update() call.  I am also never calling Update() on the table itself; rather, I first call GetChanges() (which returns a new DataTable containing just the changed rows), and then I call Update() on that copy -- however, I believe in the past I *was* calling Update() on the table with no issues.  Make sure after you call Update() you also call AcceptChanges() -- this may be necessary (though I'm only guessing here).  Something sounds off, though, because Update() should write to the server immediately (or as soon as it can).

    I have a dedicated thread for my updating (which loops for the duration of the application).  It's essentially a while loop:

    while ( WaitHandle.WaitAny( ... ) == 0 )   // index 0: the "changes pending" event
    {
        DataTable table1Changes = null;
        DataTable table2Changes = null;
        DataTable table3Changes = null;
        DataTable table4Changes = null;

        lock ( mutex )
        {
            // Copy out the pending changes, then mark the live tables as clean.
            table1Changes = dataSet.table1.GetChanges();
            table2Changes = dataSet.table2.GetChanges();
            table3Changes = dataSet.table3.GetChanges();
            table4Changes = dataSet.table4.GetChanges();
            dataSet.AcceptChanges();
        }

        // Write the copies to SQL Server outside of the lock.
        if ( table1Changes != null ) table1Adapter.Update( table1Changes );
        if ( table2Changes != null ) table2Adapter.Update( table2Changes );
        if ( table3Changes != null ) table3Adapter.Update( table3Changes );
        if ( table4Changes != null ) table4Adapter.Update( table4Changes );
    }

    So the WaitAny() call blocks until I notify it that changes have been made.  The nice thing about this method is that I only lock the mutex long enough to get the changes -- the calls that update the underlying SQL Server database happen outside of the lock, so it doesn't matter how long they take.  It sounds like you might not need quite the efficiency I need (this is a futures trading application), so I'm sure your method (if it's been solid for 2 years) is equally appropriate.

    Rick
    Thursday, February 19, 2009 1:22 PM
  • Hey Rick,

    Can I get some more information on your application? Is there a web tier layer? Are you attempting to scale this out to more than one machine? Are there too many write operations being made to the database at one time?

    There are a number of different technologies that can be used here but I want to understand a little more about your specific scenario.

    For example, there is a new technology called Velocity, which is about scaling mid-tier applications:
    http://msdn.microsoft.com/en-us/data/cc655792.aspx

    There is also an API for sending bulk copy operations to SQL Server:
    http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx
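
    As a rough illustration of that API (the connection string, destination table, and DataTable names are assumed, not taken from your app), SqlBulkCopy can push a whole DataTable of new rows to the server in one shot:

    // System.Data.SqlClient
    using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = "dbo.SomeLogTable";   // assumed destination table
        bulkCopy.BatchSize = 5000;
        bulkCopy.WriteToServer(pendingRows);                  // a DataTable of rows to insert
    }

    Keep in mind that bulk copy only inserts rows, so it fits append-only data better than general updates or deletes.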

    Basically, what I am trying to understand is: when did you hit a performance wall? Does your particular implementation improve performance, and by how much? How big is the schema we are talking about? How many tables and relationships?

    If I get a better idea about this I can talk to other people on my larger team to come up with a good recommendation here.

    Thanks
    Chris Robinson
    Program Manager - DataSet
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Tuesday, February 24, 2009 10:52 PM
  • Chris,

    Thanks for the reply.  I'll try to give you some more information if I can.

    I'm not certain I've hit a performance wall -- because my application only locks long enough to call GetChanges() on each table, and because it only uses the in-memory DataSet (which has roughly 20 tables, with no relationships), performance with my application seems great.

    This is a trading application, so data persistence is of the utmost importance.  What I worry about is the lag between when a change is made to the in-memory DataSet and when it's written to the underlying database.  That lag doesn't impact system performance per se, but if the application exits ungracefully (power outage, system crash, whatever the reason) and that lag is too large, I run the risk of "losing" data forever.

    I communicate with an exchange, and I acknowledge receipt of every message the exchange sends me -- if my system crashes and the exchange has sent messages that I did not acknowledge, it will re-send them the next time I initiate my connection.  However, if I acknowledge a message, update my DataSet, and then my system crashes before the change has been written to the database, there's no way for me to recover that message.

    For example, I have a table, OpenOrders, in my DataSet that contains rows that represent currently open, working orders the trader has.  Let's say the trader wishes to change the price of an order from 105.0 to 106.0.  I send the request to the exchange, and it then confirms the price change.  When I get the confirmation, using my strongly typed DataSet (and DataTables), I make the following call:

    MyStronglyTypedDataSet.OpenOrdersRow orderRow = m_DataSet.OpenOrders.FindByOrderID(orderIDthatJustChanged);
    orderRow.Price = newPrice;

    I then notify the other thread that a table has changed, and it updates the database.  The speed with which it updates the database doesn't directly affect the performance of my application.  I may get 1,500 price-change notifications in a 2-second span; the only real bottleneck is the speed with which the FindByOrderID() method searches the in-memory DataSet.  (These changes to the in-memory DataSet are surrounded by a lock, even though my example above doesn't include that.)  Even if the write operation were to take 2 minutes, updates to the in-memory DataSet would continue during that time.  Once the SQL Server update operation completed, the writer thread would just start over again: lock the DataSet, get ALL the changes that occurred since the last time, and then write those changes.

    I realize there is no perfect solution here; I was just hoping someone with more expertise than myself could look at my methodology and determine whether it is sound (namely: make changes to a DataSet, notify a different thread that changes have been made, have that other thread lock only long enough to call GetChanges(), then unlock, then write those changes using a TableAdapter).  I was also wondering whether there's some way to have changes in the DataSet automatically reflected in the database that might be more efficient than doing it manually.

    I'm not sure what the definition of a "mid tier" application is -- this app runs on a server that is accessed via TCP at the moment.  There are some self-hosted web services in place, yes (we have a small web site where someone can modify orders, etc.), but none of these interface with the DataSet (or the database) directly.  All of the "brains" of the operation exist in the application running on the server.

    Thanks again for your time.

    Rick

    Wednesday, February 25, 2009 4:45 PM
  • Hey Rick,

    So there are a couple of things I'm still confused about related to this application. How is data sent in to this application? You said that it's sent in over TCP, but I'm curious how specifically -- is it through .NET Remoting?

    A mid tier is something that takes information, processes it with business logic, accepts or rejects it, sends it to the database if it's accepted, and sends some information back about what happened.

    For caching scenarios, Velocity is the best bet. Things would have to be rearchitected, however; it takes a much different approach than DataSet. The problem with DataSet is that there are read operations that can change data internally, but it appears that for your patterns you are not hitting this, and the pattern you have described seems pretty solid for the most part. Still, I'm not sure that DataSet is the best answer here. For instance, if Velocity is used, you could put each object itself into the cache, and Velocity will ensure consistency even across many machines. I would look into it more.

    Another possibility is SQL Server itself. It has something called Service Broker. An introduction is here:
    http://msdn.microsoft.com/en-us/library/ms345108.aspx

    More information is here:
    http://msdn.microsoft.com/en-us/library/ms166043.aspx

    Simple samples are here:
    http://msdn.microsoft.com/en-us/library/ms160932(SQL.90).aspx

    Ideally, I think SQL Server Service Broker is closer to what you want in this scenario. It seems to me that you need to take lots and lots of incoming data, process it, and send it to SQL Server right away. If you are storing it anywhere in the meantime and the computer goes down, that data is lost -- I don't think there is much you can do if the computer is simply turned off. I think the best thing is to keep updates as close to the database as possible.

    How much data is that DataSet currently caching for you? Do you ever run into OutOfMemoryExceptions with DataSet? What other problems have you encountered?

    Thanks
    Chris Robinson
    Program Manager - DataSet
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Thursday, February 26, 2009 4:23 AM
  • Chris,

    Here is a crude representation of our system:

                               ,--------.
                               | Client |
                               '--------'
                                   /
                                (tcp)
                                 /
     ,------.           ,=====. /            ,--------.
     | Exch |<--(tcp)-->| APP |-----(tcp)--->| Client |
     '------'           '=====' \            '--------'
                                 \
                               (tcp)
                                   \
                               ,--------.
                               | Client |
                               '--------'


    The application ("APP") communicates with the exchange via TCP (it also gets market data via UDP, but that's not important here).

    A connected client will say "I wish to buy X contracts of commodity Y" via TCP using a proprietary messaging format that the Application will receive, parse, make sure that the order doesn't exceed certain risk parameters (is this trader allowed to trade that large of an order?  does this trader already have an open position that is too large?  would this order put him over that maximum open position size?).  If the criteria are met, the Application will send a "new order" request to the exchange.  If the Exchange approves, it will send an acknowledgement.

    So a lot of stuff is going on here behind the scenes.  We are mandated to keep a very detailed "audit trail" throughout each trading day, which is basically just a log of everything that occurs.

    1) when the trader requests the order, we add that request to the audit trail
    2) if we reject the order, we add that to the audit trail
    3) if we accept it, and then pass it on to the exchange, we log the exchange's response to the audit trail.
    4) later, as trades occur (say we wish to buy 100 of something; we may get "partial" trades throughout the day -- a 2-lot here, a 10-lot there, etc.), we get notifications from the exchange, each of which is added to the audit trail.

    So one of the tables in our DataSet is AuditTrail.  However, I treat this table differently from all the others because it can easily contain 1,000,000 rows in one trading day.  In my write-to-SQL thread, after I call GetChanges() on AuditTrail, I then call .Clear().  So the in-memory table will at most hold the rows that have accumulated since the last SQL Server update.  Before I did this, the machine would basically be brought to its knees by the end of the day, as you might imagine.
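
    In code, that part of the write-to-SQL thread looks roughly like this (the adapter name is assumed):

    DataTable auditChanges;
    lock ( mutex )
    {
        // Copy out the rows added since the last save, then empty the in-memory table
        // so it never grows beyond one save interval's worth of rows.
        auditChanges = m_DataSet.AuditTrail.GetChanges();
        m_DataSet.AuditTrail.Clear();
    }
    if ( auditChanges != null )
        auditTrailAdapter.Update( auditChanges );   // adapter name assumed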

    All of the other tables, however, don't really accumulate throughout the day like this.  Orders are added, cancelled, completed, etc. so that table remains relatively consistent in size.  However, I *must* keep these and other tables in memory for the "risk" management I mentioned above:

    When a trader wishes to enter an order, I first need to evaluate what other orders he/she has already entered, the accumulated trade quantities for each, what his or her open position is (another table), etc.  So you see why I must keep this in memory?  Having to query the database would be too slow.  We run automated trading strategies that can literally submit/cancel orders thousands of times per second, and each of those orders must be evaluated against risk parameters (we wouldn't want an infinite loop causing us to sell a million S&P futures.... it's happened).

    One of the other major reasons we can't just access the database is for when clients connect to our server.  Whenever a client connects, the following occurs:

    1) TCP listener accepts a new connection, gets the authentication information, etc.
    2) Application LOCKS access to the DataSet from other threads, so if incoming updates from the exchange occur, they get queued up.
        2a) Application pulls from in-memory DataSet all information relevant to this particular user
        2b) Application sends this specific DataSet to user via TCP
    3) Application unlocks access -- any subsequent updates from the exchange that pertain to this user (even those that were queued up while locked) will be sent to him from this point forward

    The reason I do this locking is that we need to ensure the client has 100% awareness of current working orders and positions, so I have to make sure that no messages are lost.  All the messages that we send to clients are incremental updates.  Meaning, I know the client got a "snapshot" of his working orders, which was 100% awareness at that time, so whenever things change I need only send small incremental messages that he then uses to modify *his* copy of the DataSet -- I don't need to resend the DataSet each time.  So I need to make sure that (1) no messages are missed and, just as important, (2) he doesn't get any redundant data (i.e., I get an update XYZ from the exchange, modify the DataSet with this message, get a client connect request, and send him this modified DataSet but also send him the message XYZ; the DataSet already reflects message XYZ, so if I send it to him and he applies it to his copy, he will have bad data).
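
    A rough sketch of that connect sequence, with hypothetical helper names, just to make the ordering concrete:

    lock ( mutex )
    {
        // 2a) Build this user's snapshot while exchange updates are held off.
        MyStronglyTypedDataSet snapshot = BuildSnapshotForUser( userId );       // hypothetical helper
        // 2b) Send it before releasing the lock, so no incremental update can slip in between.
        SendToClient( client, snapshot );                                       // hypothetical helper
        // 3) From here on, this client receives only the incremental updates.
        RegisterForIncrementalUpdates( client, userId );                        // hypothetical helper
    }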

    So I hope it's clear why I need to have an in-memory copy of the data (except the AuditTrail -- I don't send that to the clients).  I suppose I could keep all the relevant risk information in in-memory objects to be used whenever new orders are submitted, but I don't know how I could ensure the client had 100% correct data if, when he connects, I got the snapshot from the database -- what if there are messages that have been processed but haven't yet been applied to the database?  Do either of the technologies you suggested above ensure that when I apply a change in memory, it will be reflected in the DB?  If so, when a client connects, I could lock access to the database, get the snapshot from the database, and then unlock.  That would remove the need to keep DataSets in memory...

    I will look into both suggestions.  Thanks again for your time.

    Rick
    Thursday, February 26, 2009 2:19 PM
  • Hey Rick,

    Thanks for taking the time to explain the system. It seems like there are two discrete pieces here that are important.

    1) Caching certain pieces of data for the exchange so that information can be retrieved quickly, and saving this information

    2) Sending the many, many updates of the audit log to the database

    For the audit-log database save, I would look into various ways to simply send each write to a queue to be processed. SQL Server Service Broker is one way; there are a couple of other ways as well.
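
    Even if you don't use Service Broker, a simple in-process queue drained by a dedicated writer thread gets the audit write off the hot path. A minimal sketch, assuming a hypothetical AuditRecord type and write helper:

    // (requires System.Collections.Generic and System.Threading)
    // Shared between the producing threads and the single writer thread.
    private readonly Queue<AuditRecord> auditQueue = new Queue<AuditRecord>();
    private readonly object auditGate = new object();

    public void Enqueue(AuditRecord record)
    {
        lock (auditGate)
        {
            auditQueue.Enqueue(record);
            Monitor.Pulse(auditGate);           // wake the writer thread
        }
    }

    private void WriterLoop()
    {
        while (true)
        {
            AuditRecord[] batch;
            lock (auditGate)
            {
                while (auditQueue.Count == 0)
                    Monitor.Wait(auditGate);    // sleep until Enqueue pulses
                batch = auditQueue.ToArray();
                auditQueue.Clear();
            }
            WriteBatchToSqlServer(batch);       // hypothetical: e.g. SqlBulkCopy or a stored proc per row
        }
    }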

    For the caching piece: if you wanted to distribute the cache and have processing occur on multiple machines, I would definitely look into Velocity, as previously discussed. Having a distributed cache might help so that if one machine goes down you don't lose all the data. If it's on just one machine, DataSet could be your best bet in these scenarios because of its indexing logic, which enables you to retrieve objects quickly. The downside, though, is that even particular read operations will alter internal state at times.

    One thing to think about would be to look at your overall schema and see how all the pieces are connected. Perhaps particular DataTables do not have relationships to others, and you might be able to separate one DataSet into multiple DataSets instead. Following the locking pattern you already have, you might get less lock contention this way (see the sketch below). Alternatively, you could break up the one DataSet into multiple DataSets in a completely different way -- perhaps lots of the data comes in regionally, or could be grouped by user; I'm not exactly sure.

    I was discussing this forum question with other people at work, and we talked about problems that internal applications have faced. For instance, one app had about 50 concurrent users on SQL Server; most did reads, some did writes. The problem was that the users writing slowed everything down for the users just querying data. The solution was to use one server for reads and another for writes (since writes happened much less often) and replicate the data between them. The read-only database didn't have up-to-date information, but it was replicated every x hours or minutes.
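
    To make the splitting idea concrete, a sketch with assumed names, where each independent DataSet gets its own lock so work on one never blocks work on the other:

    // Orders and audit data have no relationships, so they can live in separate
    // typed DataSets, each guarded by its own lock (all names here are illustrative).
    private readonly OrdersDataSet ordersData = new OrdersDataSet();
    private readonly object ordersLock = new object();

    private readonly AuditDataSet auditData = new AuditDataSet();
    private readonly object auditLock = new object();

    // A thread adding audit rows no longer contends with a thread touching orders.
    private void LogAuditMessage(string message)
    {
        lock (auditLock)
        {
            DataRow row = auditData.Tables["AuditTrail"].NewRow();
            row["Message"] = message;            // column name assumed
            auditData.Tables["AuditTrail"].Rows.Add(row);
        }
    }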

    So in terms of best practices for DataSet, I believe you are doing the right things.

    Thanks,

    Chris Robinson

    Program Manager - DataSet


    This posting is provided "AS IS" with no warranties, and confers no rights.
    Friday, February 27, 2009 6:12 PM
  • Chris,

    Thanks for taking the time -- I think you brought up a good point: there really is no reason to have my DataTables all within the same DataSet (there are no relationships).  I think at design-time it was easier for me to simply (1) create a DataSet and (2) drag all the tables from my Server Explorer onto the one designer. 

    Separating them into multiple DataSets, as you said, should result in less lock contention, which will help.

    And yes, this very likely will never be a distributed application, so I think my current practices will suit me well.  Thanks again.

    Rick
    Monday, March 2, 2009 9:26 PM