The Ultimate Data Duplication Question

  • Question

  • Hi guys,

    We have a 35 million row table that looks like this:

    ----------------------------
    [ID] [int] NOT NULL DEFAULT (''),
    [PARTNUMBER] [nvarchar](30) NOT NULL DEFAULT (''),
    [LINESTARTDATE] [smalldatetime] NOT NULL DEFAULT ('1900-01-01 00:00:00.000'),
    [LINEENDDATE] [smalldatetime] NOT NULL DEFAULT ('1900-01-01 00:00:00.000'),
    [LINEDELETEDATE] [smalldatetime] NOT NULL DEFAULT ('1900-01-01 00:00:00.000'),
    [REDUCEDPRICE] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [LISTPRICE] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [QUANTITYMAX] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [QUANTITYCONSUMED] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [VENDPARTLINE] [nvarchar](20) NOT NULL DEFAULT (''),
    [LINEPRICETYPE] [nvarchar](20) NOT NULL DEFAULT (''),
    [QUANTITY] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [LINENUMBER] [int] NOT NULL DEFAULT ((0)),
    [CUSTID] [tinyint] NOT NULL DEFAULT (''),
    [LINESTATUS] [tinyint] NOT NULL DEFAULT ((0)),
    [QUANTITYMIN] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [QUANTITYREMAINING] [numeric](28, 12) NOT NULL DEFAULT ((0)),
    [PRICEBASE] [numeric](28, 12) NOT NULL DEFAULT ((0))
    ----------------------------

    One of the major problems I'm seeing is that a lot of this data is heavily duplicated. For example, the four quantity fields of '[numeric](28,12)' have only about 2,000 unique values each, and LINEPRICETYPE of '[nvarchar](20)' has only about 40 different values, yet all of this duplicated data is spread across 35 million rows. Keep one thing in mind: if a user modifies the QUANTITYMAX of a single row from '100.2' to '100.5', that doesn't mean that all rows with a '100.2' quantity should be changed. It just means that I might need another unique QUANTITYMAX value now.
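
    For example, if those distinct values lived in their own table, a single edit like that would look roughly like this (a minimal sketch with illustrative names like MainTable and QUANTITYMAXID, not our actual schema):

    ----------------------------
    -- Hypothetical lookup table holding the ~2,000 distinct quantity values
    CREATE TABLE QuantityValue (
        QuantityValueID int IDENTITY(1,1) PRIMARY KEY,
        QuantityValue numeric(28, 12) NOT NULL UNIQUE
    );

    -- User changes one row's QUANTITYMAX from 100.2 to 100.5:
    DECLARE @EditedRowID int;
    SET @EditedRowID = 12345;  -- hypothetical row being edited

    -- 1) make sure 100.5 exists as a unique value
    IF NOT EXISTS (SELECT 1 FROM QuantityValue WHERE QuantityValue = 100.5)
        INSERT INTO QuantityValue (QuantityValue) VALUES (100.5);

    -- 2) repoint only that one row; rows still referencing 100.2 are untouched
    UPDATE MainTable
    SET QUANTITYMAXID = (SELECT QuantityValueID
                         FROM QuantityValue
                         WHERE QuantityValue = 100.5)
    WHERE ID = @EditedRowID;
    ----------------------------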

    My questions: would it make sense to put all this duplicated data into its own tables, i.e. normalize the data? I have read that normalization can be a bad thing for performance.

    Please enlighten me :)




    EDIT: SQL Server 2005 - 32 Bit
    Sunday, November 11, 2012 11:22 PM

All replies

  • I think moving the repeated information into its own tables and keeping only a key in the main table is the correct approach in your situation.
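
    For example, a minimal sketch for LINEPRICETYPE, which has only about 40 distinct values, so a tinyint key is enough (illustrative names):

    ----------------------------
    -- Lookup table for the ~40 distinct price type strings
    CREATE TABLE LinePriceType (
        LinePriceTypeID tinyint NOT NULL PRIMARY KEY,
        LinePriceType nvarchar(20) NOT NULL UNIQUE
    );

    -- In the main table, replace the nvarchar(20) column with a 1-byte key:
    -- [LINEPRICETYPEID] [tinyint] NOT NULL
    --     REFERENCES LinePriceType (LinePriceTypeID)
    ----------------------------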

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Monday, November 12, 2012 12:51 AM
  • Thank you for the reply.

    Trying to stay on point here: aside from making queries quite a bit more complicated, what other pros/cons would this change have for my queries?

    Other thoughts: I have been considering converting this data into an OLAP database at some point. Will this send the complexity of that through the roof?


    Monday, November 12, 2012 1:50 AM
  • Your database will take less space, and I don't think adding a few lookup tables will complicate your queries too much. So I am for normalization.
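
    For example, a filter that used to compare the string directly just becomes one extra join against a 40-row table (assuming the LinePriceType lookup sketched above; N'Contract' is a made-up value):

    ----------------------------
    -- Before: WHERE m.LINEPRICETYPE = N'Contract'
    SELECT m.ID, m.QUANTITY, m.LISTPRICE
    FROM MainTable AS m
    JOIN LinePriceType AS lpt
        ON lpt.LinePriceTypeID = m.LINEPRICETYPEID
    WHERE lpt.LinePriceType = N'Contract';
    ----------------------------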

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog

    Monday, November 12, 2012 2:14 AM
  • Recently my good friend Doron Farber wrote a great article about how to efficiently remove duplicates from huge tables, based on a real-world example:
    http://www.dfarber.com/computer-consulting-blog/2011/12/26/remove-duplicate-records-in-sql.aspx

    Best Regards, Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Blog: Large scale of database and data cleansing
    MS SQL Consultants: Improves MS SQL Database Performance

    Monday, November 12, 2012 6:42 AM
    Answerer
  • >i.e. normalize the data?

    Yes definitely normalize it.

    > I have read that normalization could be a bad thing for performance

    Performance tuning comes after good database design:

    http://www.sqlusa.com/articles/query-optimization/

    For an OLTP (transactional) database we do a 3NF normalized design. For a star schema data warehouse we design denormalized, flat dimension tables. The difference: OLTP holds dynamic data; a DW holds static, second-hand data.

    >LINEPRICETYPE of '[nvarchar](20)' has only about 40 different values.

    That is up to 40 bytes per row (nvarchar stores 2 bytes per character); the average is probably less, but it definitely bloats the table.
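
    Rough math, assuming fully used 20-character values: 35,000,000 rows × 40 bytes ≈ 1.4 GB for that one column, versus 35,000,000 × 1 byte = 35 MB for a tinyint key pointing into a 40-row lookup table.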

    >[ID] [int] NOT NULL DEFAULT (''),

    Use 0 or another numeric value for defaults on numeric columns; an empty-string default on an int column only works through an implicit conversion.
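
    For example:

    ----------------------------
    -- numeric columns should default to a number, not an empty string
    [ID] [int] NOT NULL DEFAULT ((0)),
    [CUSTID] [tinyint] NOT NULL DEFAULT ((0)),
    ----------------------------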

    >[PARTNUMBER] [nvarchar](30) NOT NULL DEFAULT (''),

    You should move this to its own table and just reference it from this table with a FOREIGN KEY.

    If you move all the nvarchar columns out of the table, the rows become fixed-size and narrower, which means better performance.
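
    A minimal migration sketch for PARTNUMBER (illustrative names; on 35 million rows, test on a copy first and expect the UPDATE to take a while):

    ----------------------------
    -- 1) build the lookup table from the distinct values
    CREATE TABLE PartNumber (
        PartNumberID int IDENTITY(1,1) PRIMARY KEY,
        PartNumber nvarchar(30) NOT NULL UNIQUE
    );

    INSERT INTO PartNumber (PartNumber)
    SELECT DISTINCT PARTNUMBER FROM MainTable;

    -- 2) add the key column and populate it from the lookup
    ALTER TABLE MainTable ADD PARTNUMBERID int NULL;

    UPDATE m
    SET m.PARTNUMBERID = p.PartNumberID
    FROM MainTable AS m
    JOIN PartNumber AS p
        ON p.PartNumber = m.PARTNUMBER;

    -- 3) enforce the relationship and drop the old nvarchar column
    ALTER TABLE MainTable ALTER COLUMN PARTNUMBERID int NOT NULL;
    ALTER TABLE MainTable ADD CONSTRAINT FK_MainTable_PartNumber
        FOREIGN KEY (PARTNUMBERID) REFERENCES PartNumber (PartNumberID);
    ALTER TABLE MainTable DROP COLUMN PARTNUMBER;
    ----------------------------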


    Kalman Toth SQL 2008 GRAND SLAM
    New Book: Beginner Database Design & SQL Programming Using Microsoft SQL Server 2012

    Wednesday, November 21, 2012 9:27 AM