F# how to improve performance for inserting records into SQL Server database

    Question

  • Hello:

    I have a tab-delimited (“\t”) text file; let’s say it has 3 columns: Symbol, Description and Exchange.  I saved this file as C:\Temp\symbols.txt.

    The following is the SQL Server 2016 table design; let’s say I have a database named “myDB”:

    CREATE TABLE [dbo].[symbolTable]
    (
        [ID]            [int]            NOT NULL,
        [Symbol]        [nvarchar](30)   NOT NULL,
        [DESCRIPTION]   [nvarchar](250)  NOT NULL,
        [Exchange]      [nvarchar](20)   NOT NULL,
        CONSTRAINT [PK_symbolTable] PRIMARY KEY (ID)
    )
    GO

    Because of some issue with SQL Server 2016 data import/export, I can't import the text file directly into SQL Server 2016; I get all kinds of weird errors.  Therefore, I decided to write F# code to import the text file into the data table.

    I am using the following F# code (Visual Studio 2017, Visual F# 4.1).

    I have a rather new PC with a Kaby Lake i7-7700T 4-core CPU, 32 GB RAM, two 1 TB SSDs, and Windows 10 Enterprise (64-bit), so it seems rather powerful.  The text file has about 3.6 million lines, each with 3 columns.

    But my program has been running for nearly 8 hours and has not finished half the job yet; it has barely inserted 1.8 million records.

    Please advise how I can improve the speed of inserting records into the SQL Server 2016 database.

    #light
    open FSharp.Data.TypeProviders
    open FSharp.Linq
    open System
    open System.Collections.Generic
    open System.Data
    open System.Data.Linq
    open System.Data.SqlClient
    open System.Data.SqlTypes
    open System.IO
    open System.Linq
    open System.Text
    open System.Text.RegularExpressions

    type dbSchema = SqlDataConnection<"Data Source=.;Initial Catalog=myDB;Integrated Security=True">
    let db = dbSchema.GetDataContext()
    let connDB = new SqlConnection("server=.; Integrated Security=True; Database=myDB")
    let tempSymbolFile = @"C:\Temp\symbols.txt"

    let add1Symbol(id1: int, tick1: string, desc1: string, exchange1: string) =
        connDB.Open()
        use sbc = new SqlBulkCopy(connDB, SqlBulkCopyOptions.TableLock, null, BatchSize=10000,
                                  BulkCopyTimeout=1200, DestinationTableName="symbolTable")
        try
            sbc.WriteToServer(let dt = new DataTable()
                              [ "ID", typeof<int>
                                "Symbol", typeof<string>
                                "DESCRIPTION", typeof<string>
                                "Exchange", typeof<string>
                              ] |> List.iter (dt.Columns.Add >> ignore)
                              let record = dt.NewRow()
                              record.["ID"] <- id1
                              record.["Symbol"] <- tick1
                              record.["DESCRIPTION"] <- desc1
                              record.["Exchange"] <- exchange1
                              dt.Rows.Add(record)
                              dt)
            if (id1 % 10000) = 0 then printfn "Inserting record # %d" id1
        with error -> printfn "Can not insert End of Day account record!"
        connDB.Close()
        ()

    type tempSymbol = {
        Symbol      : string
        Description : string
        Exchange    : string
    } with
        static member fromFile file =
            file
            |> File.ReadLines
            |> Seq.skip 1
            |> Seq.map (fun s -> Regex.Split(s, "\t"))
            |> Seq.map (fun a -> { Symbol = string a.[0]
                                   Description = string a.[1]
                                   Exchange = string a.[2] })
            |> Seq.iteri (fun i a -> add1Symbol(i + 1, a.Symbol, a.Description, a.Exchange))

    let time0 = DateTime.Now.ToLongTimeString()
    printfn "Begin inserting IQFeed symbols at: %s" time0
    tempSymbol.fromFile tempSymbolFile
    let time1 = DateTime.Now.ToLongTimeString()
    printfn "Done inserting IQFeed symbols at: %s" time1

    Thanks,

    Sunday, April 30, 2017 5:46 PM


All replies

The problem is that for every single record you open a connection, create a data table, fill it with that one record, bulk insert it, and close the connection. That is extremely slow. Split all the "symbols" into chunks of reasonable size, like 100K; for each chunk, create a data table, insert all 100K records into it, and bulk insert it. When all the chunks are inserted, close the connection.
    Monday, May 01, 2017 6:57 AM
  • Hello:

    I think your idea seems to be a good one, but I can't figure out how to write the code.

    I think that using |> Seq.iteri makes my code look nice.

    But how can you split all the "symbols" into chunks of 100K?

    Please show me your code.

    Thanks,

    John

    Monday, May 01, 2017 12:39 PM
Chunking an input sequence into batches to process like this is a common enough operation that it is surprising there isn't a library function for it, and I've found no good way to assemble one purely functionally from what we have in the Seq module.

    This is one of those places where it is simplest to write some imperative code with explicit loops at some point in the process or, alternatively, get deep into the details of how sequences are implemented and write generic enumerator/enumerable wrapper classes to control the evaluation of the sequence; or make some combination of the two, as in this C# example.
    Monday, May 01, 2017 4:55 PM
  • What he means is to insert 100,000 lines from the source file, then commit to the database.

    The most expensive part of a database connection is opening the connection. Doing that once for each line is going to be painfully slow.

    By way of an example, here's what I did for a similar project several years ago (not exact code - but this will give you an idea):

    1. Connect to the database and create a data table (outside the loop)
    2. Loop through the rows in the file
    3. For each line, add it to the datatable
    4. When the counter reaches 100,000, write the datatable to the database, then clear the datatable
    5. Repeat steps 3 and 4 until all the rows are written
    6. Close the database connection

    Don't allocate any variables in the loop unless you absolutely have to, and make sure to null them out before you go to the next section of the loop (otherwise you'll have a memory leak).

    Your exact implementation of something like the above is going to be specific to the F# way of doing things, but I hope that helps; a rough sketch follows.
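
    A rough F# sketch of those steps (illustration only, not the poster's original code; it borrows the table name and connection string from the question and assumes well-formed tab-separated lines):

    open System.Data
    open System.Data.SqlClient
    open System.IO

    let bulkInsertInChunks (file : string) (chunkSize : int) =
        use conn = new SqlConnection("server=.; Integrated Security=True; Database=myDB")
        conn.Open()                                   // step 1: connect once
        let dt = new DataTable()                      // step 1: one DataTable, reused
        [ "ID", typeof<int>
          "Symbol", typeof<string>
          "DESCRIPTION", typeof<string>
          "Exchange", typeof<string> ]
        |> List.iter (fun (name, ty) -> dt.Columns.Add(name, ty) |> ignore)
        use sbc = new SqlBulkCopy(conn, SqlBulkCopyOptions.TableLock, null,
                                  DestinationTableName = "symbolTable")
        let flush () =                                // step 4: write, then clear
            if dt.Rows.Count > 0 then
                sbc.WriteToServer(dt)
                dt.Clear()
        File.ReadLines file                           // step 2: loop through the rows
        |> Seq.skip 1
        |> Seq.iteri (fun i line ->
            let a = line.Split('\t')
            dt.Rows.Add(i + 1, a.[0], a.[1], a.[2]) |> ignore   // step 3
            if dt.Rows.Count >= chunkSize then flush ())
        flush ()                                      // write the final partial chunk

    Calling bulkInsertInChunks @"C:\Temp\symbols.txt" 100000 would then make one round trip to the server per 100K rows instead of one per row.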

    Monday, May 01, 2017 6:01 PM
  • Hello:

    Your suggestion seems nice.  My concern is also the memory leak.  From my testing, my program finally finished its job after nearly 12 hours of running.  The good thing is that the memory usage was very small, less than 12 MB.  But I am afraid that if I keep a big chunk of data in memory, it could lead to a memory leak.

    However, I am not sure how to write the code yet.  If you can share your code, even if it is C#, it may also help me figure out how to write it in F#.

    Thanks,

    John

    Tuesday, May 02, 2017 3:10 PM
This is the C# from the link I gave, translated into F#:

    open System.Collections.Generic

    let batch (source : IEnumerable<'a>) n =
      // bury the imperative code in an inner function
      let batchElements (source : IEnumerator<'a>) n =
        seq {
          yield source.Current
          let mutable i = 0
          while (i < n) && source.MoveNext() do
            yield source.Current
            i <- i + 1
        }
      seq {
        use enumerator = source.GetEnumerator()
        while enumerator.MoveNext() do
          yield batchElements enumerator (n-1)
      };;

    // simple test
    let test = { 0..12 };;

    batch test 5;;

    It evaluates lazily, so there's no build-up of records that's not under your control.

    You can even do this

    > batch (Seq.initInfinite id) 5;;
    val it : seq<seq<int>> =
      seq
        [seq [0; 1; 2; 3; ...]; seq [5; 6; 7; 8; ...]; seq [10; 11; 12; 13; ...];
         seq [15; 16; 17; 18; ...]; ...]


    • Edited by Mr. Tines Tuesday, May 02, 2017 7:22 PM Infinite sequence example
    Tuesday, May 02, 2017 7:19 PM
  • This is an obvious first refactoring of your code using the batching.

    let batchsize = 10000

    let addSymbols (id1: int) (chunk : seq<string>) =
        let dt = new DataTable()
        [ "ID", typeof<int>
          "Symbol", typeof<string>
          "DESCRIPTION", typeof<string>
          "Exchange", typeof<string> ] |> List.iter (dt.Columns.Add >> ignore)

        chunk
        |> Seq.iteri (fun i s -> let a = Regex.Split(s, "\t")
                                 let record = dt.NewRow()
                                 record.["ID"] <- (batchsize * id1 + i)
                                 record.["Symbol"] <- string a.[0]
                                 record.["DESCRIPTION"] <- string a.[1]
                                 record.["Exchange"] <- string a.[2]
                                 dt.Rows.Add(record))
        printfn "Inserting record # %d" (id1 * batchsize)
        connDB.Open()
        try
            use sbc = new SqlBulkCopy(connDB, SqlBulkCopyOptions.TableLock, null, BatchSize=10000,
                                      BulkCopyTimeout=1200, DestinationTableName="symbolTable")
            sbc.WriteToServer(dt)
        with error -> printfn "Can not insert End of Day account record!"
        connDB.Close()

    // this replaces the fromFile member of the tempSymbol type from the question
    static member fromFile file =
        file
        |> File.ReadLines
        |> Seq.skip 1
        |> batch batchsize
        |> Seq.iteri (fun i a -> addSymbols i a)

    Adding a dt.Clear() when closing the connection may possibly keep the peak memory usage lower, but that's the sort of thing that can only be determined by experiment and otherwise counts as premature optimisation.

    Since garbage collection is non-deterministic and lazy, you may observe processes accumulating memory -- especially in server GC mode -- rather than purging it, if they are not under memory pressure.
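
    As a hypothetical illustration (my addition, not code from this thread), the flush step could clear the DataTable after each bulk copy and report the managed heap directly, instead of relying on Task Manager's working-set figure:

    // GC.GetTotalMemory(true) forces a collection first, so it reports
    // live objects only; dt.Clear() lets the previous chunk's rows die.
    let flushAndMeasure (sbc : SqlBulkCopy) (dt : DataTable) =
        sbc.WriteToServer(dt)
        dt.Clear()
        printfn "Managed heap after flush: %d bytes" (System.GC.GetTotalMemory(true))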



    • Edited by Mr. Tines Wednesday, May 03, 2017 9:51 AM Memory considerations
    Wednesday, May 03, 2017 9:40 AM
  • Hi, Dear Mr. Tines:

    Thanks for your code.  However, I am confused.  Your previous post showed the function batch, but your last post showed something different and did NOT show the batch function; the two posts are quite different.  I can't figure out how to write the batch in your last post, and the compiler complains that it is not able to find the function batch.

    Please advice.

    Thanks,

    John

    Wednesday, May 03, 2017 8:09 PM
Use both pieces of code -- the one that defines the batch function (which is general-purpose code and really ought to be part of the Seq module), and the one that uses it (which applies it to your problem at hand).  I had expected that to be too obvious to actually need to be explained like this.

    Thursday, May 04, 2017 6:23 AM
  • Dear Mr. Tines:

    I have changed my code according to your advice:

    #light
    open FSharp.Data.TypeProviders
    open FSharp.Linq
    open System
    open System.Collections.Generic
    open System.Data
    open System.Data.Linq
    open System.Data.SqlClient
    open System.Data.SqlTypes
    open System.Diagnostics
    open System.Globalization
    open System.IO
    open System.Linq
    open System.Net
    open System.Net.Sockets
    open System.Text
    open System.Text.RegularExpressions
    open System.Threading

    type dbCFDSchema = SqlDataConnection<"Data Source=.;Initial Catalog=myDB;Integrated Security=True">
    let db = dbCFDSchema.GetDataContext()
    let connDB = new SqlConnection("server=.; Integrated Security=True; Database=myDB")
    let tempSymbolFile = @"C:\Temp\symbols.txt"
    let batchsize = 10000

    let batch (source : IEnumerable<'a>) n =
        let batchElements (source : IEnumerator<'a>) n =
            seq {
                yield source.Current
                let mutable i = 0
                while (i < n) && source.MoveNext() do
                    yield source.Current
                    i <- i + 1
            }
        seq {
            use enumerator = source.GetEnumerator()
            while enumerator.MoveNext() do
                yield batchElements enumerator (n-1)
        }

    let addSymbols (id1: int) (chunk : seq<string>) =
        let dt = new DataTable()
        [ "ID", typeof<int>
          "Symbol", typeof<string>
          "DESCRIPTION", typeof<string>
          "Exchange", typeof<string> ] |> List.iter (dt.Columns.Add >> ignore)
        chunk
        |> Seq.iteri (fun i s -> let a = Regex.Split(s, "\t")
                                 let record = dt.NewRow()
                                 record.["ID"] <- (batchsize * id1 + i)
                                 record.["Symbol"] <- string a.[0]
                                 record.["DESCRIPTION"] <- string a.[1]
                                 record.["Exchange"] <- string a.[2]
                                 dt.Rows.Add(record))
        printfn "Inserting record # %d" (id1 * batchsize)
        connDB.Open()
        try
            use sbc = new SqlBulkCopy(connDB, SqlBulkCopyOptions.TableLock, null, BatchSize=10000,
                                      BulkCopyTimeout=1200, DestinationTableName="symbolTable")
            sbc.WriteToServer(dt)
        with error -> printfn "Can not insert symbol record!"
        connDB.Close()

    type tempSymbol = {
        Symbol      : string
        Description : string
        Exchange    : string
    } with
        static member fromFile file =
            file
            |> File.ReadLines
            |> Seq.skip 1
            |> batch batchsize
            |> Seq.iteri (fun i a -> addSymbols i a)

    let time0 = DateTime.Now.ToLongTimeString()
    printfn "Begin inserting symbols at: %s" time0
    tempSymbol.fromFile tempSymbolFile
    let time1 = DateTime.Now.ToLongTimeString()
    printfn "Done inserting symbols at: %s" time1
    printfn "Done"

    I got compiler errors:

    Type mismatch. Expecting a  'seq<string> -> 'a'    but given a  'int -> seq<seq<'b>>'    The type 'seq<string>' does not match the type 'int'

    The type 'int' is not compatible with the type 'IEnumerable<'a>'

    For this statement:

                |> batch batchsize

    It is not easy to know which type is expected.

    Another minor issue: the original file @"C:\Temp\symbols.txt" actually has 8 columns, but I need only 3 columns: Symbol, Description and Exchange.  I want to discard the other 5 columns, as they are not useful.  I don't know whether making the function that processes each chunk (seq<string>) ignore the other 5 columns will have any effect when it processes the original file.

    Thanks,

    John

    Thursday, May 04, 2017 9:33 AM
The first one is a silly and obvious thing -- the arguments to `batch`, as I wrote it the first time around, are the wrong way around for chaining.  Instead of


    let batch (source : IEnumerable<'a>) n =

    write


    let batch n (source : IEnumerable<'a>) =

    And as for the second -- all columns after the third are being ignored.

    I discarded the original splitting and packing of the columns of interest into a temporary record before calling the batch function because -- sequences being lazy -- each line is only read from the file as it is passed to Regex.Split, and packing it into a temporary value, as you had it, would actually involve more allocations per line than just splitting the line at the point of use does.
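
    For concreteness, a sketch combining the two fixes with the code already posted (it reuses batchsize and addSymbols from the earlier posts; fromFile is shown as a plain function rather than a static member of tempSymbol, purely for brevity):

    // Batch size first, sequence last, so the function chains with |>.
    let batch n (source : IEnumerable<'a>) =
        let batchElements (source : IEnumerator<'a>) n =
            seq {
                yield source.Current
                let mutable i = 0
                while (i < n) && source.MoveNext() do
                    yield source.Current
                    i <- i + 1
            }
        seq {
            use enumerator = source.GetEnumerator()
            while enumerator.MoveNext() do
                yield batchElements enumerator (n - 1)
        }

    // The pipeline now type-checks: File.ReadLines yields seq<string> and
    // batch batchsize turns it into seq<seq<string>>.  Columns beyond the
    // third are ignored simply because addSymbols only reads a.[0], a.[1]
    // and a.[2] after splitting each line.
    let fromFile file =
        file
        |> File.ReadLines
        |> Seq.skip 1
        |> batch batchsize
        |> Seq.iteri (fun i chunk -> addSymbols i chunk)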

    • Marked as answer by zydjohn Thursday, May 04, 2017 5:25 PM
    Thursday, May 04, 2017 4:24 PM
  • Dear Mr. Tines:

    You know what?  After I changed my code according to your advice, I ran a test to find out how fast my program can go.  Here are the results from the screen:

    +++++++++++++++++++++++++++++++++++

    Begin inserting symbols at: 7:22:52 PM
    Inserting record # 0
    Inserting record # 1000000
    Inserting record # 2000000
    Inserting record # 3000000
    Done inserting symbols at: 7:23:25 PM

    ++++++++++++++++++++++++++++++++

    I changed batchsize to 1 million, and it took only 33 seconds to finish inserting the records into the SQL Server 2016 database; even the SQL command SELECT * FROM SymbolTable took 14 seconds to retrieve more than 3 million records.

    This time, the performance increased about 1300-fold.

    Thank you very much for your kind help.

    Wish you all the best.

    John

    Thursday, May 04, 2017 5:30 PM