locked
Working on huge data sets in sql server RRS feed

  • Question

  • Hi All

    We are working on a project where we want to identify the discrepancies of Mutual Fund brokerage.

    In this project we are going to do below tasks:

    1. We have two tables in database

    -          “Mutual Fund Brokerage” data which includes 3,00,00,0000 records (30 Millions)

    -          “Rate Master” where approx. 10,00,000 records (1 Million). This is nothing but the brokerage plan prepared for broker by Mutual Fund organization.

    1. We want to execute some sql queries on both the tables and identify if the rate which is calculated (actual provided to broker) in “Mutual Fund Brokerage” table is correct or not.
    2. There is no direct connection between both the tables. We have to join tables on the basis of transaction date, broker type, location, transaction value etc.

    We are doing this task from last few months which take almost 5 to 6 days to prepare proper queries and execute them and generate discrepancy reports from result set.

    Now we want to develop some kind of desktop based tool which can do all below tasks

    1. Tool should connect with provided SQL Server instance
    2. It should create required queries in that database.
    3. Execute the queries and generate proper reports in excel format.

    Concerns:

    1. We want to make this tool generic so other organisations can also use it to validate the brokerage amount. So challenge is to formalise provided data in tools required format.
    2. What approach we should take while working on such a large data sets?
    3. How much we should concern about hardware capabilities of desktop where this tool is going to be installed?

    We are planning to use WPF technology to develop the tool.

    Please suggest your views on this.

    Thanks in advance.

     



    Best Regards, Somnath


    • Edited by Somnath Tingare Friday, July 29, 2016 6:51 AM
    • Moved by Bob Beauchemin Friday, July 29, 2016 3:22 PM Moved to a more appropriate forum for best results
    Friday, July 29, 2016 6:51 AM

Answers

  • Hi All

    We are working on a project where we want to identify the discrepancies of Mutual Fund brokerage.

    In this project we are going to do below tasks:

    1. We have two tables in database

    -          “Mutual Fund Brokerage” data which includes 3,00,00,0000 records (30 Millions)

    -          “Rate Master” where approx. 10,00,000 records (1 Million). This is nothing but the brokerage plan prepared for broker by Mutual Fund organization.

    1. We want to execute some sql queries on both the tables and identify if the rate which is calculated (actual provided to broker) in “Mutual Fund Brokerage” table is correct or not.
    2. There is no direct connection between both the tables. We have to join tables on the basis of transaction date, broker type, location, transaction value etc.

    We are doing this task from last few months which take almost 5 to 6 days to prepare proper queries and execute them and generate discrepancy reports from result set.

    Now we want to develop some kind of desktop based tool which can do all below tasks

    1. Tool should connect with provided SQL Server instance
    2. It should create required queries in that database.
    3. Execute the queries and generate proper reports in excel format.

    Concerns:

    1. We want to make this tool generic so other organisations can also use it to validate the brokerage amount. So challenge is to formalise provided data in tools required format.
    2. What approach we should take while working on such a large data sets?
    3. How much we should concern about hardware capabilities of desktop where this tool is going to be installed?

    We are planning to use WPF technology to develop the tool.

    Please suggest your views on this.

    Thanks in advance.


    Best Regards, Somnath

    Good day,

    My 2 sent...

    1. I don't think that a system this level is something we can architect in the forum. In order to design such a system we must have a full understanding on your business logic, you goals, aims, and means. We must undersstand the exact queries you want to to have in the future and you current exact situation... We might made more bad than good by convince you to go to in specific direction...

    The information that we need to start design such a system cannot be fill in several forum lines of message.

    I do not agree that this is more fit for develop issue, but ths does not fit for DBA as well. This question fit to architects, who have understanding laterally in range of technologies related to solving the problem (including developing, DBA...

    2. One of the first steps should be to break down the design into small blocks of services starting from input data, storing the data, analyze the data... For each block you will need to to think about the best option for the server side and for the client side

    There a nice architecture that might fit your case in these first steps named Event-driven Service Oriented Architecture (SOA 2). check this name in Google

    3. Your question regarding WPF might suggest wrong approach. It is not important if you are going to use WPF or Windows Forms or Web application or any other tech in order to implement the client side. In fact your best option is to think ahead and not to use any of these as base client that reach the database layer directly. Your best default option is to use API service, which later on all these technologies and others can use. You should think about breaking the system design into layers (database, API service, Client...)

    * I am assuming that you are not going to use any built-in option of BI, but your application as you mentioned...

    4. We have two tables in database

    I don't have idea how you already have a built-in database while you still should be in design step according to your question description. If you already have a database DDL which we have to use, this limit the design a lot! This is usually a result of DBA that try to design a system without the architect, and therefore he start with the Database, which in lot of cases should be re-design from scratch once we more to the next step, and we see that the database structure does not fit. Moreover, maybe the database type does not fit (I am not sure at this time that relational database is your best tool... I will need to much more information before I make this choice).... for the sake of this discussion let's assume that we must use the current database as it is (which is a VERY bad assumption!)

    5. We want to execute some sql queries on both the tables and identify if the rate which is calculated (actual provided to broker) in “Mutual Fund Brokerage” table is correct or not.

    Not clear to me... sound to me like story instead of information with sample. Please provide DDL+sample DML of your current database structure, instead of stories about the table structure :-)

    Dealing with DBA this is best option usually as you already have two tables in database . It will help a lot if in addition to your description you can "show use the numbers" - I mean show us a examples (for example a sample of actual provided to broker.. correct).

    6. There is no direct connection between both the tables. We have to join tables on the basis of transaction date, broker type, location, transaction value etc.

    If I understand you correctly there is no relation between the tables (which are in relational database... point to think about*). In first glance it seems like you should think about using cubes (Again I cannot advice anything and I am not sure what fit your need as I have no info... this is just something that if you did not thought about then you should have)

    take almost 5 to 6 days to prepare proper queries and execute them and generate discrepancy reports from result set.

    What take 5 to 6 days? The prepare proper queries  or execute them or generate discrepancy reports ?!?

    How do you prepare proper queries? can you elaborate the procedure or better give example step by step what you are doing today

    8. 

    Now we want to develop some kind of desktop based tool which can do all below tasks

    1. Tool should connect with provided SQL Server instance
    2. It should create required queries in that database.
    3. Execute the queries and generate proper reports in excel format.

    >> It should create required queries in that database.

    Do you want that the tool will create the queries? Do you know how long Microsoft are working on their internal Execution plan Engine in order to get best performance (This is not the same, but same idea). If your queries are not totally "random" and you can create them in advance it will probably give you better performance. It will gives you the option to use partially processed data in some cases instead of query the original data each execution.

    Again more info and specific example could help to understand and advice

    >> So challenge is to formalise provided data in tools required format.

    Or maybe, if it can be done, process the data partially in advance and store it ready for the final step which is done dynamically... 

    I would think about the option to use 2 tools (two modules in the tool): one to create the query and maybe store data partially processed, and another one to execute the data in order to create the report. I am not saying that this is what I will do, but at least I would think about this option.

    >> We are planning to use WPF technology to develop the tool.

    Sound to me like you are doing mistake which you will pay on it later on. As I mentioned in first glance seems like you need to work with service layer (API) and not design you solution in specific client tech! This is probably the only point I feel almost OK to advice with the current info!

    * By the way, personally I would probably prefer to use web app as my first client (The app that connect to the API layer), for lot of reasons.

    ...

    This was long..
    I hope it was useful a bit,
    and at least gave you some points to think about

    If you can provide more details and examples we might have more specific points.


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]


    • Edited by pituachMVP Sunday, October 30, 2016 7:40 PM
    • Marked as answer by Somnath Tingare Wednesday, November 2, 2016 5:09 AM
    Sunday, October 30, 2016 7:37 PM

All replies

  • Hi Somnath Tingare,

    Based on my understanding, it’s more like a develop issue to me. In this case,  I would suggest you post a new thread in WPF forum for better support.

    If you have any other questions, please let me know.

    Regards,
    Lin
    Monday, August 1, 2016 8:59 AM
  • How many rows do you expect to get returned by your queries?

    This sounds like something that you can do directly in Excel via PowerQuery.  That way you don't have to develop any new tool

    If you have Excel 2013 or better then you can download the PowerQuery add in and write your connections and query and then deliver it to your client.  When they need to process the data they just run the query and the results come into Excel.


    SOTATS, Inc.

    Sunday, October 30, 2016 2:20 PM
  • Having worked in the financial industry for a while I have a few observations.

    1) data from feeds will be identical in structure everywhere. Ie Bloomberg, Bridge, Nasdaq.

    2) internal tables vary widely and wildly between financial institutions.

    It will be difficult IMHO to generate a tool which can be used to find such discrepancies.

    Working with large data sets will be difficult, but SQL Server is up to handling millions of rows like this. Note that there seems to be something incorrect in your currency formatting.

    3,00,00,0000 records (30 Millions) - should be displayed as 30,000,000 or 300,000,000. You say 30 million, but the number is 300 million.

    You will need do do everything in a batch. I have written batch queries for teclo data that handle this volume and look for discrepancies and it can be done in hours not days. You may want to consider streaminsight or hdinsight to deal with the data near real time.

    Your processing should be done on the server and merely displayed on the desktop. You should focus on the hardware requirements of the server not the desktop.

    Sunday, October 30, 2016 2:30 PM
  • Hi All

    We are working on a project where we want to identify the discrepancies of Mutual Fund brokerage.

    In this project we are going to do below tasks:

    1. We have two tables in database

    -          “Mutual Fund Brokerage” data which includes 3,00,00,0000 records (30 Millions)

    -          “Rate Master” where approx. 10,00,000 records (1 Million). This is nothing but the brokerage plan prepared for broker by Mutual Fund organization.

    1. We want to execute some sql queries on both the tables and identify if the rate which is calculated (actual provided to broker) in “Mutual Fund Brokerage” table is correct or not.
    2. There is no direct connection between both the tables. We have to join tables on the basis of transaction date, broker type, location, transaction value etc.

    We are doing this task from last few months which take almost 5 to 6 days to prepare proper queries and execute them and generate discrepancy reports from result set.

    Now we want to develop some kind of desktop based tool which can do all below tasks

    1. Tool should connect with provided SQL Server instance
    2. It should create required queries in that database.
    3. Execute the queries and generate proper reports in excel format.

    Concerns:

    1. We want to make this tool generic so other organisations can also use it to validate the brokerage amount. So challenge is to formalise provided data in tools required format.
    2. What approach we should take while working on such a large data sets?
    3. How much we should concern about hardware capabilities of desktop where this tool is going to be installed?

    We are planning to use WPF technology to develop the tool.

    Please suggest your views on this.

    Thanks in advance.


    Best Regards, Somnath

    Good day,

    My 2 sent...

    1. I don't think that a system this level is something we can architect in the forum. In order to design such a system we must have a full understanding on your business logic, you goals, aims, and means. We must undersstand the exact queries you want to to have in the future and you current exact situation... We might made more bad than good by convince you to go to in specific direction...

    The information that we need to start design such a system cannot be fill in several forum lines of message.

    I do not agree that this is more fit for develop issue, but ths does not fit for DBA as well. This question fit to architects, who have understanding laterally in range of technologies related to solving the problem (including developing, DBA...

    2. One of the first steps should be to break down the design into small blocks of services starting from input data, storing the data, analyze the data... For each block you will need to to think about the best option for the server side and for the client side

    There a nice architecture that might fit your case in these first steps named Event-driven Service Oriented Architecture (SOA 2). check this name in Google

    3. Your question regarding WPF might suggest wrong approach. It is not important if you are going to use WPF or Windows Forms or Web application or any other tech in order to implement the client side. In fact your best option is to think ahead and not to use any of these as base client that reach the database layer directly. Your best default option is to use API service, which later on all these technologies and others can use. You should think about breaking the system design into layers (database, API service, Client...)

    * I am assuming that you are not going to use any built-in option of BI, but your application as you mentioned...

    4. We have two tables in database

    I don't have idea how you already have a built-in database while you still should be in design step according to your question description. If you already have a database DDL which we have to use, this limit the design a lot! This is usually a result of DBA that try to design a system without the architect, and therefore he start with the Database, which in lot of cases should be re-design from scratch once we more to the next step, and we see that the database structure does not fit. Moreover, maybe the database type does not fit (I am not sure at this time that relational database is your best tool... I will need to much more information before I make this choice).... for the sake of this discussion let's assume that we must use the current database as it is (which is a VERY bad assumption!)

    5. We want to execute some sql queries on both the tables and identify if the rate which is calculated (actual provided to broker) in “Mutual Fund Brokerage” table is correct or not.

    Not clear to me... sound to me like story instead of information with sample. Please provide DDL+sample DML of your current database structure, instead of stories about the table structure :-)

    Dealing with DBA this is best option usually as you already have two tables in database . It will help a lot if in addition to your description you can "show use the numbers" - I mean show us a examples (for example a sample of actual provided to broker.. correct).

    6. There is no direct connection between both the tables. We have to join tables on the basis of transaction date, broker type, location, transaction value etc.

    If I understand you correctly there is no relation between the tables (which are in relational database... point to think about*). In first glance it seems like you should think about using cubes (Again I cannot advice anything and I am not sure what fit your need as I have no info... this is just something that if you did not thought about then you should have)

    take almost 5 to 6 days to prepare proper queries and execute them and generate discrepancy reports from result set.

    What take 5 to 6 days? The prepare proper queries  or execute them or generate discrepancy reports ?!?

    How do you prepare proper queries? can you elaborate the procedure or better give example step by step what you are doing today

    8. 

    Now we want to develop some kind of desktop based tool which can do all below tasks

    1. Tool should connect with provided SQL Server instance
    2. It should create required queries in that database.
    3. Execute the queries and generate proper reports in excel format.

    >> It should create required queries in that database.

    Do you want that the tool will create the queries? Do you know how long Microsoft are working on their internal Execution plan Engine in order to get best performance (This is not the same, but same idea). If your queries are not totally "random" and you can create them in advance it will probably give you better performance. It will gives you the option to use partially processed data in some cases instead of query the original data each execution.

    Again more info and specific example could help to understand and advice

    >> So challenge is to formalise provided data in tools required format.

    Or maybe, if it can be done, process the data partially in advance and store it ready for the final step which is done dynamically... 

    I would think about the option to use 2 tools (two modules in the tool): one to create the query and maybe store data partially processed, and another one to execute the data in order to create the report. I am not saying that this is what I will do, but at least I would think about this option.

    >> We are planning to use WPF technology to develop the tool.

    Sound to me like you are doing mistake which you will pay on it later on. As I mentioned in first glance seems like you need to work with service layer (API) and not design you solution in specific client tech! This is probably the only point I feel almost OK to advice with the current info!

    * By the way, personally I would probably prefer to use web app as my first client (The app that connect to the API layer), for lot of reasons.

    ...

    This was long..
    I hope it was useful a bit,
    and at least gave you some points to think about

    If you can provide more details and examples we might have more specific points.


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]


    • Edited by pituachMVP Sunday, October 30, 2016 7:40 PM
    • Marked as answer by Somnath Tingare Wednesday, November 2, 2016 5:09 AM
    Sunday, October 30, 2016 7:37 PM
  • Ronen, thanks.  You took the extra time to really step through and break down the initial post to reveal the complexities and challenges of such an endeavor while raising important questions about what the poster is trying to accomplish and instructing on what is missing from such a post in order to provide meaning feedback.

    SOTATS, Inc.

    Sunday, October 30, 2016 8:06 PM
  • Ronen, thanks.  You took the extra time to really step through and break down the initial post to reveal the complexities and challenges of such an endeavor while raising important questions about what the poster is trying to accomplish and instructing on what is missing from such a post in order to provide meaning feedback.

    SOTATS, Inc.

    Thanks Ulysses, I appreciate the nice feedback :-)
    I hope the OP will read my entire message as well, and find it useful...

    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]

    Monday, October 31, 2016 7:43 AM
  • Ronen and Ulysses thank you so much for your responses.

    As per your comments/statements I will revisit all the requirements and possibilities again and will come back to you with more appropriate question draft.

    Your comments are really very helpful for me.


    Best Regards, Somnath

    Wednesday, November 2, 2016 5:17 AM
  • You are most welcome

    I glad if I could help a bit :-)

    * Once you are ready with new and organized information and question, please open a new thread, so it will be "clear new page" for these who see it first, but try to add link to the previous relevant threads like this one, if it can give anything like background


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]

    Wednesday, November 2, 2016 7:17 AM