I have a design question. I have not dealt with this much data before, so some of my concerns may not be valid.
We have 40 or so tables with 3,300+ partitions and between 14 and 400 columns (average 150 columns), each with a char(5) + int + int clustered primary key. The table with the most rows, the main one, has over 750 million rows (probably closer to a billion by now). We are in the middle of a migration project from mainframe to SQL Server, so only a portion of the data is loaded onto our server. Currently the data is in a single database. Our business is data: we gather data from US counties, clean it up, make it relational, and at the end create products from it.
The requirement is that when new data comes in, it usually involves 200K+ records (the original source has 5,000 columns). Each record is split into multiple tables, processed multiple times with multiple updates, and cleaned up through roughly 30 processes it has to pass. When everything is OK, it replaces the existing data and becomes available in our products.
One suggestion is to duplicate the data (over 3 TB) in a different database, do all processing there, and when processing is done, update the final database. In words it sounds good, but synchronizing large tables like that concerns me: backup/restore times on these databases are huge, index maintenance is a challenge, and coming up with an efficient synchronization mechanism can be difficult. We used CDC before, but in some cases the change volume becomes huge, CDC cannot catch up, and that is a tough issue to fix.
I thought about having a duplicate table for each table, say StagA and ProdA: ProdA has the full data and StagA is empty at first. Then create a view that does a full outer join between the two tables, preferring the row from StagA if it exists and otherwise still showing the ProdA row:
create view FullTable as
select isnull(b.ID, a.ID) as ID, isnull(b.Val, a.Val) as Val, ....
from ProdA a
full outer join StagA b on a.ID = b.ID
where b.flag <> 'DELETED' or b.flag is null
It's a little bit more than that. The idea is that programs that need access to staging data use this view, while programs that need production data use the prod table directly. Then I created INSTEAD OF triggers, one each for DELETE, INSERT, and UPDATE. On an update, if StagA does not have the row yet, I copy the data from ProdA, modify it, and insert it into StagA; if the data is already in StagA, I modify it there. For deletes I insert a record into StagA with a DELETED flag, and for inserts I insert into StagA as well. When the users want to commit data to production, StagA is merged into ProdA and StagA is cleaned out. The code works, and there is only a single batch within each of the three triggers. Of course there is overhead: updates, deletes, and inserts are slower, and retrieving data through the view is more expensive than hitting the table directly.
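For reference, a minimal sketch of the update path and the commit step described above, using hypothetical names (StagA/ProdA with an int ID key, a single Val column, and a flag column); the real tables have up to 400 columns and a composite key, so this only illustrates the shape of the logic:

```sql
-- INSTEAD OF UPDATE trigger on the full outer join view (single batch).
-- "inserted" already carries the post-update values from the view,
-- so we either update the staged row or seed StagA with the new values.
CREATE TRIGGER trg_FullTable_Upd ON dbo.FullTable
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Row already staged: modify it in place
    UPDATE s
       SET s.Val = i.Val
      FROM dbo.StagA s
      JOIN inserted i ON s.ID = i.ID;

    -- Row not staged yet: insert the modified values into StagA
    INSERT INTO dbo.StagA (ID, Val, flag)
    SELECT i.ID, i.Val, ''
      FROM inserted i
     WHERE NOT EXISTS (SELECT 1 FROM dbo.StagA s WHERE s.ID = i.ID);
END;
GO

-- Commit step: fold staged changes into production, then clear staging
MERGE dbo.ProdA AS p
USING dbo.StagA AS s ON p.ID = s.ID
WHEN MATCHED AND s.flag = 'DELETED' THEN DELETE
WHEN MATCHED THEN UPDATE SET p.Val = s.Val
WHEN NOT MATCHED THEN INSERT (ID, Val) VALUES (s.ID, s.Val);

TRUNCATE TABLE dbo.StagA;
```

The DELETE and INSERT triggers follow the same pattern: the delete trigger writes a DELETED-flagged row into StagA, and the insert trigger writes the new row straight into StagA.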
I am concerned about duplicating that much data. The server has 48 CPUs, 256 GB of memory, and more than enough storage space, but duplicating that much active data seems like it would cut the server's resources (memory, CPU, etc.) in half. On the other hand, triggers are not something I like to begin with. Also, the servers are clustered, so we have to work with the SAN for all storage needs; we are not able to add, say, a Fusion-io card for tempdb to gain more performance (this is what I am told).
Would you have any other suggestions or recommendations for this dilemma?
- Moved by Tom Phillips, 13 Feb 2012 13:42: Database Design question (From: SQL Server Database Engine)
You should be very careful with systems like that (amount of data, complexity, etc.). You need to understand the system in detail, know the use cases, the data flow, the physical setup, and quite a few other factors before you can even think about a design and solution appropriate for that particular case. A general design question leads to a general answer that will not be optimal. The devil is in the details we don't know about.
We simply don't have enough information to help you, and you would not be able to provide the information we need without asking us to sign an NDA :)
Back to your question: yes, the view will allow access to data from both tables, and yes, you're right about the overhead. Would the overhead be acceptable? We don't know. Are there better ways? It depends, but most likely yes. What are those ways? We don't know. At the bare minimum, we would need to know in detail how your products use the staging and production data.
I would personally suggest hiring consultants who specialize in architecture and performance. SQLskills would be a very good choice, and I'm pretty sure it would be cheaper for your company in the long run than refactoring after refactoring.
My blog: http://aboutsqlserver.com
Cool, I am going to training with SQLskills on March 5th in Tampa ;-) I will ask them there, I guess; I'll have the database structure on my laptop with some sample data.
I just want to second what Dmitri said: the questions are going to be about your data flow processes, and those have to be addressed before anyone can start on the schema, physical database, or other technical issues.
Do you have any baseline numbers? What happens if you just try to update everything with standard SQL processes?
PS: I'm having major forum access problems from my regular workstation since they updated the software last week, so I may not be able to respond quickly to any posts.
- Edited by JRStern, 13 Feb 2012 4:20
To simplify my earlier question: would you rather take the route of making some process modifications and using INSTEAD OF triggers to mimic duplicating a 3 TB database, or would you rather keep the processes simpler and actually duplicate 3 TB of data?
I understand this question has to be analyzed in depth, but from your past experience, what would your concerns be with either approach?
I just do not like duplicating data. To me it cuts the server resources in half: more pages competing for memory, multiple copies of the same data in memory, synchronization difficulties, backup/restore times, and double the maintenance (index rebuilds, etc.). I would rather write more code than duplicate data. If it were a couple hundred GB of data I wouldn't worry as much, but backing up 3 TB takes 4-6 hours by itself.
From your experience, what are your concerns with either of these approaches?
I would never duplicate 3 TB of data as long as I had a choice. At the same time, I seriously doubt there are no other approaches besides that view and the INSTEAD OF triggers. For example, why not add a "processed" flag to the table and use filtered indexes to give the processing framework access to the unprocessed records?
I know, I know: it would not work because of A, B, and C. But the point is that we need to know those A, B, and C in order to suggest you something.
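To illustrate the flag-plus-filtered-index alternative Dmitri is describing, a minimal sketch (the table, column, and index names here are hypothetical, not from the actual schema):

```sql
-- Add a status flag to the existing production table instead of
-- duplicating it; existing rows default to "already processed".
ALTER TABLE dbo.ProdA
    ADD is_processed bit NOT NULL CONSTRAINT DF_ProdA_processed DEFAULT 1;

-- Filtered index: only the small set of in-flight rows is indexed,
-- so the 30-odd cleansing processes can seek straight to their work.
CREATE NONCLUSTERED INDEX IX_ProdA_Unprocessed
    ON dbo.ProdA (ID)
    INCLUDE (Val)
    WHERE is_processed = 0;

-- The processing framework then reads and updates only
-- rows WHERE is_processed = 0, and flips the flag when done.
```

The index stays tiny regardless of the 750M+ row base table, since it covers only the unprocessed slice, which is the point of the suggestion.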
PS Are you planning to attend South Fl Code Camp this Saturday? Stop by, if so. Perhaps we would be able to figure something out while drawing a few diagrams on the napkins..
My blog: http://aboutsqlserver.com
- Marked as answer by Ed Price - MSFT (Microsoft Employee, Owner), 6 Dec 2012 1:54
Well, my issue is that I am not able to convince my colleagues and manager that our database is large and that duplicating it on top of that is not a good idea. To them it's a simpler design: duplicate the data, keep an unedited copy they can work on and update as much as they want, and only when they are happy send the modified data to the real database. I am trying to change their minds...
Sure, I'll be there on Saturday, at least from 8:30 to 11 :-) Would you have extra time before or after?
- Marked as answer by Ed Price - MSFT (Microsoft Employee, Owner), 6 Dec 2012 1:55