One large dataset or several small ones

  • Question

  • I'm fairly new to Visual Basic and I'm creating strongly typed datasets using the DataSet Designer.  My question is simple...I think.  I've done many searches and read all that I can find, but I cannot find a consensus on whether it is more efficient to create one large dataset that includes all the tables (or at least all those related to each other) or to create several small datasets.

    My problem with creating many smaller datasets is that one table will almost certainly be used in multiple datasets, with a TableAdapter created for each.  This doesn't seem very efficient, since whenever I change a TableAdapter query in one dataset, I'll most likely need to make the same change in all the others.  I'd rather make the change once.

    My problem with creating one large dataset is that I don't know whether I'll be wasting memory by including a table in a dataset that I might not need to fill.

    Does anyone have any best practices that they can share?
    Wednesday, April 15, 2009 9:40 PM

Answers

  • Well, typically I think developers create one dataset per database, so all the tables from a single database normally reside in the same dataset.  Keep in mind that you do not necessarily have to instantiate the entire dataset; you can instantiate just the relevant table(s) if you only plan to use one or two on a particular form or control.
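
    For example (using hypothetical Northwind-style names in place of whatever your designer generated), filling a single typed table without ever constructing the full DataSet might look like this:

        ' Minimal sketch - the NorthwindDataSet names are placeholders.
        ' A typed table can stand alone; no need to new-up the whole dataset.
        Dim customers As New NorthwindDataSet.CustomersDataTable()
        Dim adapter As New NorthwindDataSetTableAdapters.CustomersTableAdapter()
        adapter.Fill(customers)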

    But at the same time, I don't think there would be much overhead in creating an instance of the dataset and then only filling a small subset of the available tables.

    It would seem to me that the convenience of having all tables and relations in a single place would outweigh the allocation of a few blocks of memory for unused tables...

    You might also compromise and create a DataSet with all the most commonly used tables and another with only the rarely used tables.  But again, I'm not sure it's really worth it - I mean, I've never heard or seen much to indicate that developers often split up a single database into multiple datasets... (doesn't mean it doesn't happen, just means I haven't seen it much).

    Perhaps SJ, JW, JO, RD, TD, TH, or one of the top contributors who frequent this forum can add their $0.02 as well...  :)


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"
    • Marked as answer by Yichun Feng Wednesday, April 22, 2009 12:47 AM
    Thursday, April 16, 2009 2:59 PM
  • I suggest you completely avoid the path of creating one single dataset for your whole database, though I have nothing against it if you are working with very few tables.

    I always design my data access layer from scratch with an individual class per table, but simply recollecting what I have studied, I suggest you keep these things in mind.

    The disadvantage, as you said, is higher memory use.  Not only that: you will also have data relations, which become nasty when you are working with inserts and updates, since those must be applied in the opposite order from deletes and can create deadlocks if locks are not implemented properly.
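
    For example, with a hypothetical Orders/OrderDetails pair (the dataset ds and the two adapters are assumed to already exist), a safe ordering looks roughly like this:

        ' Illustrative sketch only - apply changes in an order that respects
        ' the relation: child deletes first, then the parent table, then the
        ' remaining child inserts/updates.
        Dim deletedDetails As DataRow() = ds.OrderDetails.Select(Nothing, Nothing, DataViewRowState.Deleted)
        If deletedDetails.Length > 0 Then
            detailsAdapter.Update(deletedDetails)     ' child deletes first
        End If
        ordersAdapter.Update(ds.Orders)               ' parent inserts/updates/deletes
        detailsAdapter.Update(ds.OrderDetails)        ' remaining child changes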


    There is more: methods like Merge and GetChanges have to do a lot of work even if your tables are empty, and they perform worse as the number of references between tables in your design grows.
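
    To illustrate the kind of work I mean (the dataset name here is made up):

        ' GetChanges copies every changed row into a brand-new dataset,
        ' and Merge walks the tables again to fold results back in -
        ' work that scales with the number of tables and relations.
        Dim changes As DataSet = wholeDbDataSet.GetChanges()
        If changes IsNot Nothing Then
            ' ... push 'changes' to the server through your adapters ...
            wholeDbDataSet.Merge(changes)
            wholeDbDataSet.AcceptChanges()
        End If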

    Even if you take that route, I ask you to be very careful and to think about these things as your design grows.  Remember that working with fewer tables is always easier and less complex than designing one complex structure.

    Arjun Paudel
    • Marked as answer by Yichun Feng Wednesday, April 22, 2009 12:48 AM
    Thursday, April 16, 2009 3:53 PM
  • Hmmm... I typically agree with your posts, Arjun, but in this case I am not so sure...

    If you use the designer to create a new database, or connect to an existing one, the default action is to build a DataSet with all selected database tables contained within.  This would indicate that the designer generally intends for you to have one dataset per database.  Given that the DataSet is an in-memory cache of a database schema, this would also point to one dataset per database.

    I'm also not sure how relations would interfere with updates...  Typically the first Fill() method defined on a Table is the standard "SELECT * FROM [TableName]" query, which will allow the wizard to build all the insert, update, and delete commands for you.  Then you add additional Fill() methods if you need to query a subset of rows or columns.
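
    Something like this, with made-up Northwind-style names (FillByCategory standing in for whatever extra query you add through the designer):

        Dim products As New NorthwindDataSet.ProductsDataTable()
        Dim adapter As New NorthwindDataSetTableAdapters.ProductsTableAdapter()

        adapter.Fill(products)               ' the wizard's default SELECT * query
        adapter.FillByCategory(products, 3)  ' an added query, e.g. WHERE CategoryID = @CategoryID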

    I know that my website uses one dataset for the majority of the tables in the primary database.  In this case, the primary database is for our catalog items, but it also contains some tables for other website data such as our distributor information.  So I have one dataset with all the item-related tables and another with just the distributor information.  The distributor dataset is only used by the distributor page on the site, but the catalog dataset is used all over the place in many different ways.  The catalog dataset contains 8 tables, 6 of which have relationships to one another (the other 2 are stand-alone tables).  Depending on what you do on the site, 1 or more of these tables may be used.  For instance, browsing our catalog from the "Full Catalog" page will fill all 6 interconnected tables, whereas performing an item-number search will only fill one table.
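
    In code, the pattern is just to fill what the page needs (the names below are invented to match my description):

        Dim catalog As New CatalogDataSet()   ' one large typed dataset

        If showFullCatalog Then
            ' the "Full Catalog" page fills the related tables...
            level1Adapter.Fill(catalog.Level1)
            level2Adapter.Fill(catalog.Level2)
            ' ...and so on down the hierarchy
        Else
            ' ...while an item-number search fills a single table
            itemsAdapter.FillByItemNumber(catalog.Items, itemNumber)
        End If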

    The only speed issues I have are related to the cross-platform access the site must do when a user wants pricing on an item... the catalog usage itself is very fast.  And some of these tables get quite large... at the top of the hierarchy the tables are generally small.  Level 1 has 12 rows.  Level 2 has 340 rows which relate in chunks to the rows in level 1.  Level 3 contains 9,612 rows relating in chunks to level 2.  Level 4 contains 42,288 rows relating in chunks to level 3.  Then there are two more tables with additional info that relates to specific items...

    Now, these tables are fairly well normalized, so they contain only 4 to 10 fields with very few NULL values.  This improves the performance of the database and is seen when filling the dataset.  A database which is not well normalized can hurt the performance of your dataset, since just calling Fill() will likely cause the tables to contain extraneous information (unless you take a lot of care to fill them with only relevant information).

    This is one of the things (along with a number of other projects I've worked on) that leads me to believe that, so long as the database is well designed in the first place, there is little penalty to having a fairly large dataset containing tables that are not always used at the same time.

    So for now, I'm going to stick to my guns and back the first of the OP's options:  "...create one large dataset that includes all the tables (or at least all those related to each other) ..."

    Of course, "large" is kind of a loose term...  I suppose there is a point where performance would begin to be affected if the dataset were TOO large - but then again, having a huge number of tables may be a sign that the database requires further normalization into separate databases... (now you get into primary-lookup vs. primary-change databases, and may need to split up tables so that each DB can be performance-tweaked for its purpose...)

    This one has me curious now as to what a consensus might be...  hopefully we'll get a few more viewpoints...

    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"
    • Marked as answer by Yichun Feng Wednesday, April 22, 2009 12:48 AM
    Thursday, April 16, 2009 4:26 PM
