Need help designing a flexible database to collect different kinds of data values

  • Question

  • Sorry, I don't know how to better express this in a short subject line.  Let me explain a little bit about what I'm trying to achieve.

    I'm looking for some help on wrapping my mind around this database design:  I need to build a database to house some ongoing performance metrics, things like how many products sold per year, per department, number of this-and-that (again, per year, per department), amount of money spent (per year, per dept), and so on.  So, there are different data metrics (Num of Products, Num of People hosted at a conference, Num of this-and-that, Money Spent, etc.) which would store various kinds of data values (integers, decimals, currency, etc.) depending on which metric we're talking about.

    So, I'm trying to figure out how to structure the tables so that the various data categories can be captured.  This data would ultimately need to be REPORTED in summary reports, something that would look like this:

                                            2009      2010      2011
    Subgrouping ABC
        Products sold                        530       645       700
        Money Spent                       $8,000    $9,000   $11,000
        Number of Student Participants        45        80        63
    Subgrouping XYZ
        People hosted at Conference XYZ      200       150       340
        Number of Conference Volunteers       10        15         6
    Subgrouping etc.
        Metric etc.                          etc.      etc.      etc.

     

    Tables that are obvious to me: tblDEPARTMENTS , tblSUBGROUPINGS , tblFISCALYEARS .

    What's not obvious is how to store the METRICS themselves.  It seems things are pointing towards something like this:

    tblMETRICS and its columns:
    - MetricID (PK)
    - MetricTitle (e.g., MoneySpent, PeopleHostedAtConferenceXYZ, ProductsSold)
    - SubgroupingID (FK)

    ...plus

    tblMETRICDATA and its columns:
    - MetricDataID (PK)
    - DeptID (FK)
    - FiscalYear (FK)
    - MetricID (FK)
    - MetricValue (e.g., "530", "$8,000")

     

    So, the tblMETRICS table sort of acts like a "FIELDS" table, whereas the tblMETRICDATA table acts as a "FIELD VALUES" table... if that makes sense...  But if I do it that way, all values would have to be stored as varchar and then CAST back to the appropriate type in queries and procedures.  This seems a bit scary, but perhaps databases like that are OK?
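To make the "fields"/"field values" idea concrete, here is a minimal runnable sketch of it (using SQLite via Python purely as a self-contained illustration; the T-SQL equivalent would be analogous), with every value stored as text and CAST back at query time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblMETRICS (
    MetricID      INTEGER PRIMARY KEY,
    MetricTitle   TEXT NOT NULL,
    SubgroupingID INTEGER
);
CREATE TABLE tblMETRICDATA (
    MetricDataID INTEGER PRIMARY KEY,
    DeptID       INTEGER NOT NULL,
    FiscalYear   INTEGER NOT NULL,
    MetricID     INTEGER NOT NULL REFERENCES tblMETRICS (MetricID),
    MetricValue  TEXT NOT NULL      -- every kind of value lands here as text
);
""")
conn.execute("INSERT INTO tblMETRICS VALUES (1, 'ProductsSold', NULL)")
conn.executemany(
    "INSERT INTO tblMETRICDATA (DeptID, FiscalYear, MetricID, MetricValue)"
    " VALUES (?, ?, ?, ?)",
    [(1, 2009, 1, "530"), (1, 2010, 1, "645"), (1, 2011, 1, "700")],
)
# every query has to CAST the text back to the type the metric "really" is
total = conn.execute(
    "SELECT SUM(CAST(MetricValue AS INTEGER)) FROM tblMETRICDATA WHERE MetricID = 1"
).fetchone()[0]
print(total)  # 1875
```

The CAST-on-every-read is exactly the part that feels scary: the database itself no longer knows or enforces what type each value is.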

     

    Soooo... I might be completely on the wrong track here, so feel free to yell at me for missing a simple solution, if there is one.  Hopefully, the generic examples above are good enough to give you an idea of what I'm trying to accomplish.  Please let me know what your observations and thoughts are on all this.

    Many thanks, and I apologize for the long post!

    Monday, April 19, 2010 4:45 PM

All replies

  • 1. It doesn't make sense to add a tbl prefix to table names.

    2. You need to look at SSAS for the reporting purposes and different metrics. Design your database normally, as a collection of the necessary information, using normalization models. Then, after this part is done, you would need to create a data warehouse and build SSAS models.

     


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Monday, April 19, 2010 5:10 PM
    Answerer
  • The idea you are suggesting is a standard Master/Master_Detail setup, commonly used as Order/Line_Item.

    If a set of metrics work together, it makes sense to have a Metric TABLE holding the "batch" id (as in, this batch of metrics).

    Why would they have to be stored as varchar? Aren't they all numbers in the end?

    If the metrics are truly different from each other, each should have a separate TABLE.

    Metric(Id, Date)
    Metric_Money(Metric FK, Dept, Year, Value)
    Metric_Attendance(Metric FK, Dept, Year, Value)
    ...
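A small runnable sketch of that layout (SQLite via Python for illustration; in SQL Server the money column would be DECIMAL or MONEY, and the names are just the ones suggested above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Metric (
    Id   INTEGER PRIMARY KEY,
    Date TEXT                        -- the "batch" of metrics entered together
);
-- one child TABLE per kind of metric, each with its natural value type
CREATE TABLE Metric_Money (
    Metric INTEGER REFERENCES Metric (Id),
    Dept   INTEGER,
    Year   INTEGER,
    Value  NUMERIC                   -- DECIMAL/MONEY in SQL Server
);
CREATE TABLE Metric_Attendance (
    Metric INTEGER REFERENCES Metric (Id),
    Dept   INTEGER,
    Year   INTEGER,
    Value  INTEGER
);
""")
conn.execute("INSERT INTO Metric VALUES (1, '2010-04-19')")
conn.execute("INSERT INTO Metric_Money VALUES (1, 1, 2009, 8000)")
conn.execute("INSERT INTO Metric_Attendance VALUES (1, 1, 2009, 200)")
spent = conn.execute("SELECT Value FROM Metric_Money WHERE Year = 2009").fetchone()[0]
print(spent)  # 8000
```

Each child table keeps its own proper value type, so no CASTing is needed anywhere.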

    Monday, April 19, 2010 5:47 PM
    Answerer
  • Ah, I see.  I could separate into different "value" tables based on data types, like you are suggesting.  Interesting.

    To answer your questions about "numbers"...  Most values are going to be numbers, yes, but there is a chance that there could be some qualitative values in there as well for certain metrics.  I'm not sure what those will be yet (this is all still sort of in the proof-of-concept stage), but I would guess that such non-standard metrics might need to be varchar values.

    But keeping the number-based values in one common table should make sense.

    Re: the idea of "batch"...  That makes sense.  Thanks.  I'll go forward and experiment further with these ideas.
    Monday, April 19, 2010 7:33 PM
  • Sure.

    BTW, another approach is to have both COLUMNs, but only allow one to be populated:

    Metric
     Metric        (FK)
     Department    (FK)
     FiscalYear    (FK)
     Value_INT    INT
     Value_VARCHAR    VARCHAR()
     CHECK
     (    (Value_INT IS NOT NULL AND Value_VARCHAR IS NULL)
      OR    (Value_INT IS NULL AND Value_VARCHAR IS NOT NULL)
     )

     It does seem wrong, because two COLUMNs are used for one attribute. But it is an option if all else fails.
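That CHECK does enforce the exactly-one-column rule. A quick runnable demonstration (SQLite via Python just for a self-contained example; the constraint itself carries over to T-SQL essentially unchanged):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE MetricData (
    Metric        INTEGER,
    Department    INTEGER,
    FiscalYear    INTEGER,
    Value_INT     INTEGER,
    Value_VARCHAR TEXT,
    -- exactly one of the two value COLUMNs may be populated
    CHECK ( (Value_INT IS NOT NULL AND Value_VARCHAR IS NULL)
         OR (Value_INT IS NULL     AND Value_VARCHAR IS NOT NULL) )
)
""")
conn.execute("INSERT INTO MetricData VALUES (1, 1, 2009, 530, NULL)")  # accepted
rejected = False
try:
    # both value columns populated -- violates the CHECK
    conn.execute("INSERT INTO MetricData VALUES (2, 1, 2009, 530, 'five hundred')")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

So the awkwardness is only in having two columns for one attribute; the integrity of "one value per row" is still fully protected by the engine.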

    Monday, April 19, 2010 7:39 PM
    Answerer
  • "I could separate into different "value" tables based on data types"

    Not based on data types. Based on Metric types. But that's probably what you meant anyway.

    Monday, April 19, 2010 7:41 PM
    Answerer
    Well, yes and no.  I suppose Metric Types could be based on actual data types.  So, the table Metric_Money could store all metrics that deal with currency, while table Metric_Count (I can probably come up with a better name) stores all metrics that have "number of <something>" values (which would really be an INT data type).  But that might be stretching the idea of "metric type."

    And there's a likelihood that those table definitions would be almost identical to each other, except for the VALUE column (money vs. int vs. double vs. varchar, etc.), and that seems like unnecessary duplication of tables, perhaps.  But I wonder if that is still cleaner/better than going with one table with multiple columns for one attribute, like you were saying above.  <scratch head />

    I'm also thinking about not using a money data type at all, because the money values would likely all be rounded to thousands, millions, or billions, so I wonder if a simple INT or BIGINT can cover the needs of almost all number-based metric values.  But in that case, I might need an additional field in the METRICS table specifying what the "intended" data type should be, so that it can be used in a CAST in reports and the data is displayed accurately...

    METRICS (MetricID PK, MetricTitle, SubGrouping FK, IntendedDataType )

    ... but something about that approach doesn't feel right.

    Oh, and another thing is that I will likely have a couple of metrics that are valued in decimals, such as AverageNetworkUsage (Mb/s), where a value might be 5.6, for example.  So, that would have to be stored in a separate table, or another column with a CHECK constraint.

    Monday, April 19, 2010 10:16 PM
  • Duplicating TABLEs is not an issue. The main question is, is this Metric different?

    IOW, if you were to keep attributes about buildings, the TABLE might be:

    Building(Id, Name, Address, Height, Floors, Capacity, License, Builder, Founder, Owner)

    Having one TABLE is a good thing, because they are all attributes of the one entity, that is, the building.

    If however, we recorded usage of the building, like painters and renters:

    Building_Painter(Building, Painter)

    Building_Renter(Building, Customer)

    These are child TABLEs for two reasons:

    1) They may have multiple entries.

    2) They are not attributes of the building itself. They are entities associated with it.

    If it just had multiple entries, but they were attributes, it goes in the same TABLE, for example, Address_Line_1 and Address_Line_2. If they weren't attributes, but could not be multiplied (or only one, such as the latest, is kept) it would also be in the same TABLE, for example: Owner.

    Both together mean a new child TABLE.

    Applying this reasoning to the Metric TABLE, there are two main options:

    1) Metric(Id, Student, Money, Attendance)

    2) The suggestion made above.

    If #2 is the correct option (the path you've taken so far), the question is: does each Metric get its own child TABLE, or should they all be in one? The answer depends on whether these Metrics are different. This is usually subjective, so the designer needs to go with a feeling, and with the needs of the project. So, if there are ten thousand different sub-entities, or if the list is arbitrary, manageability comes into play and the one big TABLE (or many separated by data type) becomes a viable option. If, however, the sub-entities are truly different, and management isn't too hard, the multi-TABLE approach seems better.

    Ultimately, option 1 is still a good choice, even for many attributes: 1000 COLUMNs are supported, and 2008's WIDE TABLE supports 30,000 COLUMNs.

    OK, I'm rambling here. I wanted to get these thoughts out to help.

    I guess the next questions are, how many attributes are there, are they always all filled in, and for each metric can there be more than one of each.

    Tuesday, April 20, 2010 1:06 PM
    Answerer
  • If I do option #1 you described above, I would end up with more attributes than rows. There will likely be 40 - 50 metrics at first, with a possibility of adding more later.  The rows in that case would have to be departments, I guess, and there are only 13 of them, not likely to change any time soon.  The only other candidate for rows would be the YEARs, meaning that each YEAR gets a new row, then all the appropriate attributes get filled in.  Still, very few rows.

    And the metrics are numerous enough and fluid enough (new ones might be added over time), that I don't think it'd be manageable to keep them as 40+ Tables.

    So, I think it makes sense to keep the metrics themselves as rows in one or a few tables.  So for 40 metrics, across 13 departments, I'd be looking at 40 x 13 = 520 rows (per year).  We likely wouldn't fill in more than several years of back-data, and would only add new information as it comes in.  And if I set up the table as METRICSDATA (MetricID FK, DeptID FK, Year, Value), then the only issue is that I can't use one common data type for the VALUE column, because not all metrics can be represented with it.  So, then, I suppose it's either a FEW different "Value" columns to cover the few possible data types (and ensuring that only one can be filled in for each row), OR a FEW different tables dictated by the data type of the VALUE column.

    I think I'm leaning toward few tables separated by data type.  But either way, I think you've given me enough to think about and play with.  Thanks for the good discussion and all your time!  I'm going to start mocking up something in the next few days and see where it takes me.

    Tuesday, April 20, 2010 3:28 PM
  • OK, let's hear how it goes then.

    If the separation is by data type, I'd use one TABLE. But I never got myself comfortable with either option; I just settled on one.

    On my current project, we're storing millions of records like this each month. The equivalent of say 5000 Divisions and 3000 Metrics stored for 3-4 years. Luckily, we were able to talk them out of the two non-numeric types, and went with DECIMAL(14, 4).

    Tuesday, April 20, 2010 4:00 PM
    Answerer
  • >>If I do option #1 you described above, I would end up with more attributes than rows. There will likely be 40 - 50 metrics at first, with a possibility of adding more later.  The rows in that case would have to be departments, I guess, and there are only 13 of them, not likely to change any time soon.  The only other candidate for rows would be the YEARs, meaning that each YEAR gets a new row, then all the appropriate attributes get filled in.  Still, very few rows.<<

    I might be missing the point, but having a wide table is not a bad thing. Keeping like data in its own column is way better than mixing it as rows in a table, because that just isn't the way SQL works well. If you want to do a query on WidgetMass, you have to filter it out.  If WidgetMass is a column, it is a very natural process.  Creating data in one row is a simple insert.  Creating it in rows is one insert per column (possible to do in one statement/transaction, but really cumbersome).

    Adding a column to a table is just as easy as adding a new rowtype to your table, and far more natural/understandable to the users.  Plus, instead of having this single column that contains > 1 type of thing (leaving it unprotectable without a menagerie of code) you get each column with an independent meaning and an independent type/constraint.
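A tiny sketch of that point (SQLite via Python for a self-contained example; the metric columns are hypothetical names): each column carries its own type and its own constraint, and querying a metric is just naming a column rather than filtering rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE MetricData (
    DeptID       INTEGER,
    FiscalYear   INTEGER,
    ProductsSold INTEGER CHECK (ProductsSold >= 0),  -- each column gets its own type...
    MoneySpent   NUMERIC,                            -- ...and its own constraint
    AvgBandwidth REAL
)
""")
# one row per department/year is a single, simple INSERT
conn.execute("INSERT INTO MetricData VALUES (1, 2009, 530, 8000, 5.6)")
# querying a metric is just naming the column -- no filtering on a metric id
row = conn.execute(
    "SELECT ProductsSold FROM MetricData WHERE FiscalYear = 2009"
).fetchone()
print(row[0])  # 530
```

Compare this with the EAV layout, where the same question requires a WHERE clause on the metric id plus a CAST of the value.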

    If you had said that the metrics are completely unknown and have to be dealt with as they arrive, then the EAV pattern like this would make sense. But if you are going to do any serious data work with this data and you have at least 15 minutes of lead time before needing the new attribute, a table with lots of columns is better than having to decipher the data every time you need it, and then producing columns when people want to do reports.

    Your plan to have > 1 table separated by datatype just makes it a lot more difficult to work with, and frankly, even though there are a lot of issues with sql_variant in terms of usage, I would probably use it and keep one column.

    It all comes down to using SQL Server like SQL Server was created.  Each step you take away from using columns and rows to access data, implementing your own catalog like this, the harder the relational engine will be to use (and the more dissatisfied you will become).

    But I will be honest and say that you can make what you are doing work, and possibly work well.  Much like you can use a hammer to drive a screw... not as easy, but it gets the job done. Good luck.


    Louis

    Tuesday, April 20, 2010 11:55 PM
  • >Keeping like data in its own column is way better than mixing it as rows in a table
    >because that just isn't the way SQL works well.

    That depends, doesn't it? If the metrics are INDEXed, a two-COLUMN INDEX can be much more effective than a new INDEX for each COLUMN. Also, when comparing Metrics, PARTITIONs can be used to group data by Metric (Id).

    The speed depends on its use. Though, normally, the COLUMN approach is much better.

    >Creating data in one row is a simple insert.  Creating it in rows is one insert per column

    If the data is sparse, and an SP INSERTs the data, the opposite is true. Adding another row is easy, and does not use dynamic SQL. Adding another COLUMN to a query requires dynamic SQL, or SETting each and every COLUMN no matter what.

    Personally, I don't like the EAV approach because of the logical problems. The storage and retrieval problems, however, can go either way.

    Wednesday, April 21, 2010 12:07 PM
    Answerer
  • Thanks for the good discussion, everybody!  This is all helping me a great deal, just thinking through all this from different perspectives.

    >>If you had said that the metrics are completely unknown and have to be dealt with as they arrive, then the EAV pattern like this would make sense. But if you are going to do any serious data work with this data and you have at least 15 minutes of lead time before needing the new attribute, a table with lots of columns is better than having to decipher the data every time you need it, and then producing columns when people want to do reports.<<

    The metrics are mostly going to be known upfront, and I could also argue that a "15-minute" lead time would always be there for adding any new ones.  So, a wide-table could work fine, I suppose.  However, I can also see us adding a web front-end, eventually, and allowing users to add their own metrics as needed.  Would a wide-table approach be able to accommodate dynamic insertion of attributes?  Can that be relatively easily scripted and executed from a UI (sorry, I'm still very much new to SQL Server)?

    For example, a user goes to the UI, adds a new metric titled "ABC," specifies that it will be storing CURRENCY, and will be grouped under SUBGROUPING "XYZ"...  This, behind the scenes, adds a new column in the MetricData table.

    This is not a major requirement of the project, but I was wondering how it would affect the db architecture, particularly when it comes to EAV vs. wide-table.

     

    EDIT: Also, what about grouping several attributes together in reports?  If all attributes are stored as many columns in one table, how do I specify which ones go together?  Do I just create separate views, representing these subgroupings, where each view selects only certain columns?

    EDIT #2: To illustrate, here's what I'm thinking of for the wide-table approach:

    TABLE MetricData (DeptID FK, FiscalYear, Metric1, Metric2, Metric3,... Metric50...)

    So, let's say that Metric1, Metric2, and Metric3 are all part of some arbitrary business subgrouping "ABC," and Metric4, Metric5, and Metric6 are subgrouped under "XYZ," and so on; how would that be specified in the database?  Or would that be left to the business tier to retrieve whichever attributes are needed to report on "ABC" and "XYZ"?

    I *could* take these business subgroupings and make tables ABC, XYZ, etc., where they would be:

    - table ABC (DeptID FK, FiscalYear, Metric1, Metric2, Metric3)
    - table XYZ (DeptID FK, FiscalYear, Metric4, Metric5, Metric6)

    If I were to do that, I'd probably end up with 10-15 tables, and that probably wouldn't change much over time, because these subgroupings are based on certain business Programs/Initiatives that are represented across all or most Departments.

    That DOES provide a "natural" distinction among tables. But it seems more tempting to want to keep this in one table, and just use some metadata, either in the database or in the business tier, which somehow decides how to group certain attributes together.

    Thoughts?  Sorry for making this more complicated. :-)

    Wednesday, April 21, 2010 4:40 PM
  • Groups can be specified in another TABLE.

    Metric_Group(Id, Name)
    Metric_Group_Metric(Metric_Group, Metric)

    Or just do that in a VIEW. It depends on hardcoded or not.

    If the attributes are dynamic, EAV is usually the way to go. But sparsely populated WIDE TABLEs do indeed work. Issuing DDL from a website to ALTER a TABLE is usually not a good idea.
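Here is roughly what that grouping TABLE looks like joined against an EAV data TABLE (SQLite via Python for a self-contained example; the table and column names are just illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Metric_Group        (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Metric_Group_Metric (Metric_Group INTEGER, Metric INTEGER);
CREATE TABLE MetricData (Metric INTEGER, DeptID INTEGER,
                         FiscalYear INTEGER, Value NUMERIC);
""")
conn.execute("INSERT INTO Metric_Group VALUES (1, 'ABC')")
conn.executemany("INSERT INTO Metric_Group_Metric VALUES (?, ?)", [(1, 10), (1, 11)])
conn.executemany("INSERT INTO MetricData VALUES (?, ?, ?, ?)",
                 [(10, 1, 2009, 530), (11, 1, 2009, 8000), (12, 1, 2009, 45)])
# with EAV, the grouping TABLE joins straight onto the data
rows = conn.execute("""
    SELECT d.Metric, d.Value
    FROM MetricData d
    JOIN Metric_Group_Metric g ON g.Metric = d.Metric
    JOIN Metric_Group mg ON mg.Id = g.Metric_Group
    WHERE mg.Name = 'ABC'
""").fetchall()
print(rows)  # metric 12 is not in group 'ABC', so only metrics 10 and 11 come back
```

With the WIDE TABLE layout this join is impossible, because "Metric" is a column name rather than data, which is why the grouping ends up hardcoded in a VIEW instead.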

    Wednesday, April 21, 2010 7:45 PM
    Answerer
  • Groups can be specified in another TABLE.

    Metric_Group(Id, Name)
    Metric_Group_Metric(Metric_Group, Metric)

    Is that with the EAV approach, or the wide-table approach, or either?  If I'm doing wide-table approach, would the "Metric" column in Metric_Group_Metric table contain values which match the column names in the MetricData table (Metric1, Metric2, Metric3, etc.)?
    Wednesday, April 21, 2010 9:30 PM
  • Either. If the grouping is a completely separate piece of information, it should be in a separate TABLE. It works better with the EAV approach, because it can be joined. For the WIDE TABLE approach, the TABLE is informational, and might be better implemented in a VIEW (where it is hardcoded). So, I guess the actual answer is EAV.

    Thursday, April 22, 2010 1:32 PM
    Answerer
  • I thought so.

    The wide-table approach seems appealing right now because of the ease of manipulating data and producing aggregate results in views and SPs.  For example, one thing I'm finding is that some of these metrics need to be treated differently in aggregate queries than other metrics, and that might be an extra difficulty to overcome in the EAV approach.

    Most metrics are stored as "number of <something>," where that number can (and should) be summed in a per/year total, for example.  However, some metrics are different, like "Average Bandwidth (Mbps)" which would also be recorded per department, per year.  However, in an aggregate report, this number would NOT be summed, but rather averaged across all departments, for a given year.  And because with EAV, all metric VALUES would be in one (or very few) columns, it seems that doing aggregate queries and reports would be difficult.
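One possible way to handle the SUM-vs-AVG split under EAV is to record the intended aggregate alongside each metric and branch on it in the report query. A sketch under that assumption (SQLite via Python for illustration; the AggType column is a made-up name, not something from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Metric (Id INTEGER PRIMARY KEY, Title TEXT,
                     AggType TEXT);   -- 'SUM' or 'AVG', per metric
CREATE TABLE MetricData (Metric INTEGER, DeptID INTEGER,
                         FiscalYear INTEGER, Value REAL);
""")
conn.executemany("INSERT INTO Metric VALUES (?, ?, ?)",
                 [(1, 'ProductsSold', 'SUM'), (2, 'AvgBandwidthMbps', 'AVG')])
conn.executemany("INSERT INTO MetricData VALUES (?, ?, ?, ?)",
                 [(1, 1, 2009, 300), (1, 2, 2009, 230),    # summed across depts
                  (2, 1, 2009, 5.5), (2, 2, 2009, 4.5)])   # averaged across depts
rows = conn.execute("""
    SELECT m.Title,
           CASE m.AggType WHEN 'SUM' THEN SUM(d.Value) ELSE AVG(d.Value) END
    FROM MetricData d
    JOIN Metric m ON m.Id = d.Metric
    WHERE d.FiscalYear = 2009
    GROUP BY m.Id
""").fetchall()
print(rows)  # ProductsSold -> 530.0 (summed), AvgBandwidthMbps -> 5.0 (averaged)
```

It works, but it is exactly the kind of per-metric special-casing that the wide-table layout avoids, since there each column is simply wrapped in its own SUM() or AVG().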

    So, I'm still a little torn between the two approaches, but I think I'm liking the straightforwardness of the wide-table approach a little bit more, with a possibility of separating the metrics into different tables, perhaps, if it provides any sort of a meaningful separation.  And perhaps the best and most natural way would be to separate tables by these groupings that I'm talking about, even though for every table, the metrics would always be per department, per year.

    Thursday, April 22, 2010 9:25 PM
  • I definitely think you understand the issues with EAV. It is really great for storing data and expanding your schema, but using the data requires some manipulation to turn it into a more relationally friendly set; if you can by any means just start out in that format, things will be a lot easier for you.
    Louis

    Thursday, April 22, 2010 9:30 PM