Data Generation Plan Sequential Data Bound Generator Row Limit

    Question

  • Hello,

    I have hit a bit of a problem with the amount of data that can be generated by a Sequential Data Bound Generator.  Let me describe the data generation process I am currently experimenting with:

    1. Deploy the database; post-deployment scripts populate all lookup/reference/type/state tables with static data (for example, the userstates table is populated with 1 = Logged Out, 2 = Logged In, 3 = Password Expired, etc.).  A sketch of such a script, together with the data-source query, follows this list.
    2. Configure a data generation plan that uses the Sequential Data Bound Generator for these static data tables, with a connection defined to the database that has just been deployed and a query like select userstateid from userstates as the data source.
    3. All other tables are left with the default random data generation options; in other words, the operational data will be truly random.
    4. Execute the data generation plan.
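
    For illustration, this is roughly what the seed and the bound query look like for the userstates example (a minimal sketch only; the description column and its values are assumptions, not the real script):

    -- Post-deployment seed for the reference table (illustrative sketch).
    INSERT INTO dbo.userstates (userstateid, description) VALUES (1, 'Logged Out');
    INSERT INTO dbo.userstates (userstateid, description) VALUES (2, 'Logged In');
    INSERT INTO dbo.userstates (userstateid, description) VALUES (3, 'Password Expired');

    -- Data-source query bound to the Sequential Data Bound Generator for userstateid.
    SELECT userstateid FROM dbo.userstates;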

    Now this all works very nicely when the static data set per table is small.  However, I can consistently get the plan to fail with a "Cannot insert DBNull value into a non-nullable field" error when the table that is the data source for the Sequential Data Bound Generator contains roughly 3,190 rows.

    I have replicated this problem on two tables: I can reliably get 3,187 rows into one of them (3,188 causes the plan to fail), and on the other I can get 3,191 (3,192 causes the plan to fail).

    Do not be fooled into thinking this is a data issue.  Remember, the data is coming from the very database that the data generation plan is targeting.  Essentially the generator reads in all the data, deletes it, then puts it back, and it is the putting-it-back step that fails.

    I cannot exclude this table from the generation plan.  If I do, I cannot generate any data for the tables that have FKs to the reference table.
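
    As a concrete (and purely hypothetical) example of that dependency, an operational table such as a users table cannot be populated by the plan unless the reference table is populated too:

    -- Hypothetical operational table, shown only to illustrate the FK dependency.
    CREATE TABLE dbo.users
    (
        userid      INT NOT NULL PRIMARY KEY,
        userstateid INT NOT NULL
            CONSTRAINT FK_users_userstates REFERENCES dbo.userstates (userstateid)
    );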

    Has anyone else seen anything like this?  I searched on the net and in these forums but did not see anything similar to this issue.

    Any help or further information that anyone can provide would be greatly appreciated.

    Thanks,

    Gareth.

    Wednesday, October 27, 2010 10:48 AM

Answers

  • Thanks for the reply Gert, we too use a lightweight method for deploying reference data: we use the post-deployment scripts.  However, I think our wires are becoming crossed.  I want to use the dgen for its true purpose, to deploy test data!  However, to deploy the test data I need reference data.  The dgen will delete all the data from the target database, but my target database already has all the reference data inside it thanks to my post-deployment scripts.

    As such, I am now using my Static Data Generator to grab this reference data so that it can be used in the dgen plan to create all the other random data.  When the dgen runs it populates both the reference data and the random data into the target database, and the upside is that I do not need to manage some other data source where the reference data is independently maintained.

    For those who are interested, I have kept my promise and put an installer and the source code for this generator up on CodePlex.  Have a look and see what you think.  I am not sure what level of interest there will be, as this does seem to be a corner case where I have a very large volume of reference data.

    Anyway, the link is here: Static Data Generator

    I hope someone finds this useful. We certainly are. Thanks again for the help and information. Cheers, Gareth
    • Marked as answer by Gareth CPW Friday, November 05, 2010 1:13 PM
    Friday, November 05, 2010 1:12 PM

All replies

  • I just tried with what I think represents your scenario and it does not show; I was able to move 100K rows without a problem. Would it be possible to send me a repro project? Anything special about the data type, maybe? Based on the error it looks like something is shifting, which is not good.


    GertD @ www.DBProj.com
    Wednesday, October 27, 2010 7:50 PM
  • Gert,

    Firstly, thanks for getting back to me.  I am sorry you have not been able to reproduce my problem.  I do have a small test project here that demonstrates the issue perfectly.  I am more than happy to send this across if you can let me know where to send it.

    Just to be sure that what I am doing has come across correctly:

    I have a database with a post-deployment script that populates a table with 10,000 rows.

    I have a dgen plan that uses the Sequential Data Bound Generator for all the columns in the target table.

    The connection information for each column points to the *same* database that the data generation plan will populate.

    The select query for each column selects only that column from the target database containing the static data (select [column_1] from [baretable]).

    Setting the dgen plan to 5,000 rows for that table == failure.

    Setting the dgen plan to 3,000 rows == success.

    When executing the dgen plan I always choose to delete the existing data.
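
    For context, the post-deployment seed in the test project is along these lines (a sketch only, assuming column_1 is an int and is the only column; the real table and script may differ):

    -- Illustrative seed of 10,000 rows for the repro table.
    SET NOCOUNT ON;
    DECLARE @i INT;
    SET @i = 1;
    WHILE @i <= 10000
    BEGIN
        INSERT INTO dbo.baretable (column_1) VALUES (@i);
        SET @i = @i + 1;
    END;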

    Does that make anything clearer?

    Anyway, let me know where I can send the project to and I will pop it over.

    Thanks again,

    Gareth.

    Thursday, October 28, 2010 8:55 AM
  • If you could send a zipped project to gertd at dbproj dot com I can look at it.


    GertD @ www.DBProj.com
    Friday, October 29, 2010 12:04 AM
  • Gert,

    I have sent the example solution over to you; hopefully you have received it.  I have replied here in case anyone else is watching this thread, so that they know where we are up to.

    Thanks again,

    Gareth.

    Friday, October 29, 2010 10:50 AM
  • Thank you, I received the project and I can repro the problem; however, I believe this is expected behavior, since you are pulling from the same table, inside the Sequential Data Bound Generator, as you are inserting into.

    When you execute the DGEN plan, by default it first deletes all rows in the target table; since the source is the target in your case, the input value is now NULL.

    When I change it to pull from a copy, BareTable2, which I created using

    select * into BareTable2 from BareTable

    it works as expected.

    Also, another note: when you use an identical query for both generator configurations, like this:

    select column_1, column_2 from baretable2

    you can bind column_1 to column_1 and column_2 to column_2, the difference being that you execute only one query:

    select column_1, column_2 from baretable

    instead of two queries:

    select column_1 from baretable
    select column_2 from baretable

    Hope this helps,


    GertD @ www.DBProj.com
    Saturday, October 30, 2010 5:53 PM
  • Gert,

    Thanks for the reply; however, I think I must respectfully disagree.  The way the generators appear to work is this:

    1. Select data from the source database.
    2. Delete data from the source database.
    3. Generate a complete and relationally consistent data set for all tables.
    4. Bulk insert the result back into the target database.

    You are right in asserting that my source and target database are the same, but then how else do you recommend I manage the static data in the database without having two databases for every live database (one being the source of static data and one for the insertion of generated data plus the static data selected from the source)?

    For me this causes a particular issue, as I have 130 database servers with ~650 databases.  Using this logic I would now need a development environment capable of supporting ~1,300 databases.  I am not sure I will ever get traction with our purchasing department to put an environment of this scale together.

    Your suggestion of using another "source table" has a worse effect in my opinion, as now a production database will be polluted with tables that are only used for development purposes.  My databases will have twice the number of tables holding static data.  Further, taking a purely relational view, would we agree this does not even sound like a normalised solution?

    Even more damning is the fact that I can get the above process to work with my personally coded data generator, which uses a DataTable rather than a DataReader as the Sequential Data Bound Generator does.

    So I guess, if possible, I am after a bit more of an explanation, as some of what you have said does not add up for me.  However, I do not know everything, and if you could tell me where I have gone wrong here I would be most appreciative.

    Finally, thanks for taking the time to look into this.  Your help so far is appreciated and I hope we can get to the bottom of it.

    Thanks,

    Gareth.

    Monday, November 01, 2010 9:12 AM
  • You are right in asserting that my source and target database are the same, but then how else do you recommend I manage the static data in the database without having two databases for every live database (one being the source of static data and one for the insertion of generated data plus the static data selected from the source)?

    You can disagree, but I do not know how this can ever work when the source and target connection are the same and you delete the content. If I run with identical source and target and do not delete the rows in the target, it adds everything, but that is not what you want. If you delete the rows in the target, you effectively delete the source. Since the Sequential Data Bound Generator is a stream, it pulls rows one by one; it does not pull the source before the delete occurs.
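
    In other words, against your userstates example the plan effectively behaves as if it ran the following sequence (an illustrative sketch, not the tool's literal implementation):

    -- 1. "Delete existing data" clears the target table first.
    DELETE FROM dbo.userstates;

    -- 2. Only then does the streaming generator pull rows, one by one, from its
    --    data-source query, which points at the very same (now empty) table.
    SELECT userstateid FROM dbo.userstates;

    -- 3. With nothing left to stream, the bound column is fed NULLs and the
    --    bulk insert into the non-nullable column fails.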

    Maybe the Data Bound Generator, which uses a DataSet underneath, does this, but then you might have other problems.

    I understand what you want to do, but I would expect the reference data to come from a source-controlled location to begin with, so I fail to understand how the source can be the target for reference data. I have done this by having a single server/database which serves as the master for the other databases, so you have a single place from which you pull the truth.

    Alternatively, we use a private tool which uploads the reference data to a temp table, and we then perform three joins to determine which rows need to be inserted, deleted or updated based on the PK. We prefer this method since we do not have to unbind the FK relationships that point to the reference data.
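
    A rough sketch of that pattern (purely illustrative, with a hypothetical staging table and an assumed description column; this is not the private tool itself), again using the userstates example:

    -- The tool uploads the desired reference rows into a staging/temp table.
    CREATE TABLE #userstates_staging
    (
        userstateid INT NOT NULL PRIMARY KEY,
        description VARCHAR(50) NOT NULL
    );

    -- Insert rows that exist in staging but not yet in the target.
    INSERT INTO dbo.userstates (userstateid, description)
    SELECT s.userstateid, s.description
    FROM #userstates_staging AS s
    LEFT JOIN dbo.userstates AS t ON t.userstateid = s.userstateid
    WHERE t.userstateid IS NULL;

    -- Delete rows that exist in the target but not in staging.
    DELETE t
    FROM dbo.userstates AS t
    LEFT JOIN #userstates_staging AS s ON s.userstateid = t.userstateid
    WHERE s.userstateid IS NULL;

    -- Update rows whose non-key values differ.
    UPDATE t
    SET t.description = s.description
    FROM dbo.userstates AS t
    JOIN #userstates_staging AS s ON s.userstateid = t.userstateid
    WHERE t.description <> s.description;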

     


    GertD @ www.DBProj.com
    Monday, November 01, 2010 5:17 PM
  • Gert,

    I take your points on board so I will part by saying two things:

    1. From reading your response, it seems clear to me that MS expects a different source for static data and, as such, expects users of VSDB to expand their development environments to deal with this.
    2. I stand by the fact that the problems I am experiencing are directly related to the implementation of the Sequential Data Bound Generator.  The choice of a DataReader means that the select-data step and the delete-data step of the data generation process are in conflict within *this* generator but not in others (the Data Bound Generator).

    For anyone else who is reading this thread and possibly suffering the same woes (though I imagine this group is small, as the problem only manifests with large data sets), I will be posting my generator on CodePlex in the next few days.

    Using the new generator you can truly have a database project that deploys to one database and generates data for that database, without needing any other external sources that would then require management and maintenance.

    Gert, thanks for the help and explanation.  I am afraid in this area we must agree to disagree, as I have nothing else to say that I think will convince you that the current implementation of the Sequential Data Bound Generator overly restricts its use.  However, I also remain unconvinced that adding another system to the development world purely for reference data is a reasonable solution for people with large installations of SQL Server.

    Anyway, thanks for all the help, and I will reply to this thread soon with a link to CodePlex.

    Thanks,

    Gareth.

    Tuesday, November 02, 2010 10:29 AM
  • The reason we use a different mechanism in VS2010 is that I am using a deployment contributor to deploy the reference data. This is interesting because VSDBCMD.EXE is what I use to deploy the schema, and it can now also deploy data. Since DGEN plans are not executed by VSDBCMD.EXE, only by MSBuild or the VS shell, this was not helping us.

    Besides that, DGEN's main goal was to generate test data; for deploying reference data I can use a much lighter-weight system.


    GertD @ www.DBProj.com
    Tuesday, November 02, 2010 3:36 PM