Technical challenge with database design

  • Question

  • I have a technical challenge on my hands. The situation is that I need to load data from about 200 XML data sources from various XML feeds. The problem is that, out of the 200 XML feeds, the client doesn't know which fields or columns will be required to load into the database. The 200+ XML feeds are already loaded onto various Oracle databases in a normalised form, but the client wants direct access to the XML fields and has therefore decided to feed the data straight from XML into the SQL Server database, which I have represented using the red line.

    The question is: how can I load 200+ XML feeds into the database, run validation checks on the staging database to pick out holes in the data, and also compare it against expected values, etc.? After the data is loaded onto the staging database, I am proposing that all columns from all feeds be loaded onto it, since the business requirements are unknown, i.e. we don't know which columns are required. If all the flat data is loaded onto the staging database, then the columns that are required now can be loaded from staging into 3NF on a relational database, after which they will end up in a data warehouse.

    Considerations are as follows:

    • SSIS to be used
    • Some form of automated scheduling system to be used where business rules can be defined, e.g. don't load currency trades unless the exchange-rate file has been loaded.
    • Performance is very important
    • Checks to be performed on the staging database to catch business-rule violations

    • Moved by Tom Phillips Wednesday, August 10, 2011 2:00 PM Database Design question (From: SQL Server Database Engine)
    Tuesday, August 9, 2011 8:50 PM

Answers

  • IIUC, a partial XML file is not valid because closing tags are missing; therefore, either you have the XML or you do not. If you have the XML, it can be stored; if you do not, it cannot.

    If the data were to be streamed anyway, a batch process could be opened that stores the data as it comes in, with metadata identifying whether it is complete:

    XML_File
    ----------
    Id
    Source
    DTD
    Started
    Finished

    XML_Line_Item
    -----------------
    XML_File (FK)(PK)
    Record (incremented id per XML_File)(PK)
    Data (type XML)

    In this case, the XML is stored record by record as XML, tied together by the FK. A surrogate key would likely be required to make each record unique and provide a PK.
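    As a sketch, the two tables above might be declared like this in T-SQL (the column types and defaults are assumptions; adjust to your feeds):

    ```sql
    -- Header row per incoming XML document/batch
    CREATE TABLE XML_File (
        Id        int IDENTITY(1,1) PRIMARY KEY,
        Source    nvarchar(200) NOT NULL,   -- feed identifier
        DTD       nvarchar(200) NULL,       -- schema/DTD reference, if any
        Started   datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
        Finished  datetime2 NULL            -- NULL = batch still incomplete
    );

    -- One row per record, stored as an XML fragment
    CREATE TABLE XML_Line_Item (
        XML_File  int NOT NULL REFERENCES XML_File(Id),
        Record    int NOT NULL,             -- incremented per XML_File
        Data      xml NOT NULL,
        PRIMARY KEY (XML_File, Record)
    );
    ```

    A NULL Finished value is what lets you tell an interrupted load apart from a complete one.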

    A standard EAV (entity-attribute-value) design can be used as well; it is the most flexible, but it can lead to poor performance.

    XML_File
    ----------
    Id
    Source
    DTD
    Started
    Finished

    XML_Data
    -----------------
    XML_File (FK)(PK)
    Record (incremented id per XML_File)(PK)
    Field (PK)
    Data

    Just a couple ideas.
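    A sketch of the EAV variant in T-SQL (table, column, and field names are assumptions; the pivot query uses made-up field names for illustration):

    ```sql
    -- EAV: one row per (record, field) pair -- flexible, but queries multiply
    CREATE TABLE XML_Data (
        XML_File  int NOT NULL REFERENCES XML_File(Id),
        Record    int NOT NULL,
        Field     nvarchar(128) NOT NULL,
        Data      nvarchar(max) NULL,
        PRIMARY KEY (XML_File, Record, Field)
    );

    -- Reassembling a record means pivoting the rows back into columns:
    SELECT Record,
           MAX(CASE WHEN Field = 'TradeDate' THEN Data END) AS TradeDate,
           MAX(CASE WHEN Field = 'Currency'  THEN Data END) AS Currency
    FROM XML_Data
    WHERE XML_File = 1
    GROUP BY Record;
    ```

    The pivot-per-query pattern is where the performance cost of EAV tends to show up.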


    Thursday, August 11, 2011 11:23 AM
    Answerer

All replies

  • I have a technical challenge on my hands. The situation is that I need to load data from about 200 XML data sources from various XML feeds. The problem is that, out of the 200 XML feeds, the client doesn't know which fields or columns will be required to load into the database.

    Well then there's really nothing anyone here can tell you, is there?

    You get to choose what to do.

    Pretty typical of a lot of IT projects these days - no requirements, no process, just "here's a couple of gigabytes of data every day, hold onto it for us!"

    ... "and oh by the way, check it, clean it, and make it super-fast".

    You probably want to shred the XML in your staging and ETL process so you can apply the business rules, and yet since you have no requirements, the users are probably imagining you will just forward all the XML to the ODS or DW where they can run XQuery on it.
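    For what it's worth, "shredding" here usually means using the xml type's nodes() and value() methods to turn elements into rows and columns. A minimal sketch against a made-up trade feed (element names are assumptions):

    ```sql
    DECLARE @feed xml = N'
    <trades>
      <trade id="1"><ccy>USD</ccy><amount>100.50</amount></trade>
      <trade id="2"><ccy>EUR</ccy><amount>75.25</amount></trade>
    </trades>';

    -- nodes() produces one row per <trade>; value() extracts typed columns
    SELECT t.n.value('@id',            'int')            AS TradeId,
           t.n.value('(ccy)[1]',       'char(3)')        AS Currency,
           t.n.value('(amount)[1]',    'decimal(12,2)')  AS Amount
    FROM @feed.nodes('/trades/trade') AS t(n);
    ```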

    This is only every project I'm on these days.

    Just take your best whack at it, and be ready for new requirements to come in at any moment, probably contradicting what they told you yesterday, or else demanding you do now what you offered to do a month ago but they turned down at that time.

    And of course they want it by 6PM tonight.

    Josh

     

    • Proposed as answer by Naomi N Tuesday, August 9, 2011 10:26 PM
    Tuesday, August 9, 2011 9:18 PM
  • Thanks for the reply, you are just too right. I do have a couple of questions though: how can one import data from an XML feed into a SQL Server database? I have not really played much with this from the SQL Server side; it has always been the case that there is an API already in place which serves as the loader/interface.

    My challenge is how to get all those feeds onto the database. I have not delved much into XML, and from my understanding of, say, flat files, you have the file locally on disk and use SSIS or BULK INSERT to load it onto the database. With XML streaming off a website, I just don't get it, but I'm guessing there must be a simple way to do this.
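    For the on-disk case, one common approach is OPENROWSET with the BULK provider and SINGLE_BLOB, which reads a whole file into a single value in one statement (the path and the staging table here are placeholders):

    ```sql
    -- Load an entire XML file from disk into an xml column in one statement.
    -- 'StagingXml' and the file path are hypothetical.
    INSERT INTO StagingXml (Source, Data)
    SELECT 'feed-01',
           CAST(x.BulkColumn AS xml)
    FROM OPENROWSET(BULK 'C:\feeds\feed-01.xml', SINGLE_BLOB) AS x;
    ```

    For feeds streamed from a website, the usual pattern is a small downloader (an SSIS Script Task, PowerShell, or a service) that fetches the document first, then hands it to the same load step.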

     

    Wednesday, August 10, 2011 8:49 AM
  • Anyone?
    Wednesday, August 10, 2011 12:58 PM
  • Is this feed coming from something like a web service? If so, an SSIS package would probably be the way to go, plus you can do most of your cutting and sorting in it. I'm not totally sure what you mean by a feed, though. Is it a feed that you connect to via TCP through a socket connection? A feed like a web service? Etc.

    -Sean

    Wednesday, August 10, 2011 1:06 PM
  • For the data coming from Oracle, that is easy.

    For the data coming in as XML, perhaps you can store the XML as-is. SQL Server supports an XML data type, which allows storage with no modification and access to the individual columns via an XML query.

    Whether this causes a performance issue or not will only come out in testing.
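    To illustrate querying XML in place (the table and element names below are hypothetical), individual fields can be pulled out with XQuery via the value() and exist() methods:

    ```sql
    -- Hypothetical: XML stored unmodified in an xml column named Data
    SELECT s.Id,
           s.Data.value('(/trade/ccy)[1]',    'char(3)')        AS Currency,
           s.Data.value('(/trade/amount)[1]', 'decimal(12,2)')  AS Amount
    FROM StagedXml AS s
    WHERE s.Data.exist('/trade[ccy="USD"]') = 1;  -- filter inside the XML
    ```

    An XML index on the column can help if these queries turn out to be hot paths, which is the kind of thing the testing mentioned above would surface.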

    Wednesday, August 10, 2011 3:24 PM
    Answerer
  • Let's say the XML comes from a website which is streaming it. In such cases, how do you then get the data?

    In another example, let's say the XML is already stored on disk.

    Wednesday, August 10, 2011 6:31 PM