Count read lines for bulk insert

  • Question

  • Hello,

    Is it possible to count the lines that were read from the import file?

    INSERT INTO table1(col1, col2)  
    SELECT col1, col2 
    FROM  OPENROWSET(BULK  'F:\test.csv',  FORMATFILE='f:\test.XML',  ERRORFILE='F:\test.csv.log') AS a
    WHERE col1%3=0;

    Monday, December 24, 2012 6:58 PM

All replies

  • Monday, December 24, 2012 7:14 PM
  • It's not an answer.

    I can count inserted rows using @@ROWCOUNT, but I would like to count the lines read from the source CSV file.

    Tuesday, December 25, 2012 12:42 PM
  • Yura

    I do not think that is possible. What is the purpose? Take a look at creating an SSIS package for such requirements.


    Best Regards, Uri Dimant, SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    Tuesday, December 25, 2012 12:59 PM
  • I have a solution, but its performance is bad for files > 20 GB.

    I use an INSTEAD OF trigger on the destination table and put the WHERE clause inside it.

    Then @@ROWCOUNT shows the number of lines READ from the file.
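
    For reference, a minimal sketch of that trigger approach, reusing the table and filter from the original question (the trigger name is made up; per the report above, @@ROWCOUNT after the outer INSERT reflects every line read, while the trigger inserts only the filtered subset):

    CREATE TRIGGER trg_table1_filter ON table1
    INSTEAD OF INSERT
    AS
        -- Re-apply the filter here; the outer INSERT sees every line of the file.
        INSERT INTO table1 (col1, col2)
        SELECT col1, col2
        FROM   inserted
        WHERE  col1 % 3 = 0;
    GO

    DECLARE @lines_read int;

    INSERT INTO table1 (col1, col2)
    SELECT col1, col2
    FROM   OPENROWSET(BULK 'F:\test.csv', FORMATFILE = 'f:\test.XML') AS a;

    SET @lines_read = @@ROWCOUNT;   -- lines read from the file, per the report above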

    Tuesday, December 25, 2012 1:23 PM
  • I have another solution, but I don't know how to get the value of the cnt column in query1 into a @variable:

    WITH query(col1,col2)
    AS
    (
    SELECT col1, col2 
    FROM  OPENROWSET(BULK  'F:\test.csv',  FORMATFILE='f:\test.XML',  ERRORFILE='F:\test.csv.log') AS a
    ),
    query1(cnt)
    AS
    (
     select count(*) from query
    )
    INSERT INTO table1(col1, col2)  
    SELECT col1, col2 
    FROM query
    WHERE col1%3=0;

    Tuesday, December 25, 2012 2:01 PM
  • Performance will never be any good, because there will always be two scans of the data, one way or another. Here is a solution that is a bit crazy:

    DECLARE @count TABLE (cnt int NOT NULL PRIMARY KEY WITH (IGNORE_DUP_KEY = ON))

    CREATE TABLE test (id int NOT NULL, name nvarchar(149) NOT NULL)

    ; WITH query(col1,col2, cnt) AS (
       SELECT ProductID, ProductName, COUNT(*) OVER()
       FROM  OPENROWSET(BULK  'C:\temp\slask.bcp', FORMATFILE = 'C:\temp\slask.xml') AS a
    )
    MERGE test t
    USING query q ON 1 = 0
    WHEN NOT MATCHED BY TARGET AND col1 % 3 = 0 THEN
        INSERT (id, name)
           VALUES(col1, col2)
    OUTPUT q.cnt INTO @count(cnt)
    ;
    SELECT * FROM test
    SELECT * FROM @count
    go
    DROP TABLE test

    It is drawn from the solution with the CTE that you posted. To capture the count I use MERGE instead of INSERT, because this permits me to use the cnt column in the OUTPUT clause. SQL Server will attempt to insert the count once for every row in the source, but since the table variable has IGNORE_DUP_KEY, only one row is inserted.
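
    A minimal standalone illustration of the IGNORE_DUP_KEY behavior the trick relies on (not part of the solution above):

    DECLARE @t TABLE (c int NOT NULL PRIMARY KEY WITH (IGNORE_DUP_KEY = ON))
    INSERT @t (c) VALUES (42), (42), (42)   -- duplicates are silently discarded
    SELECT c FROM @t                        -- returns a single row: 42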

    An alternate solution is this:

    DECLARE @count TABLE (dummy bit NOT NULL)

    CREATE TABLE test (id int NOT NULL, name nvarchar(149) NOT NULL)

    ; WITH query(col1,col2) AS (
       SELECT ProductID, ProductName
       FROM  OPENROWSET(BULK  'C:\temp\slask.bcp', FORMATFILE = 'C:\temp\slask.xml') AS a
    )
    MERGE test t
    USING query q ON 1 = 0
    WHEN NOT MATCHED BY TARGET AND col1 % 3 = 0 THEN
        INSERT (id, name)
           VALUES(col1, col2)
    OUTPUT 1 INTO @count(dummy)
    ;
    SELECT * FROM test
    SELECT COUNT(*) FROM @count
    go
    DROP TABLE test

    But this will of course take up quite some space for the table variable, and you will need to do the count. However, it will not be 20 GB, since it's only a bit column.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by Yura Developer Tuesday, December 25, 2012 5:04 PM
    Tuesday, December 25, 2012 2:49 PM
  • Thank you for your solution !!!

    If it is possible for you, can you explain how it works at a low level?

    Does SQL Server load all the data from the CSV into a temp table before it counts and inserts the records?

    ; WITH query(col1,col2, cnt) AS (
       SELECT ProductID, ProductName, COUNT(*) OVER()
       FROM  OPENROWSET(BULK  'C:\temp\slask.bcp', FORMATFILE = 'C:\temp\slask.xml') AS a
    )
    MERGE test t
    USING query q ON 1 = 0
    WHEN NOT MATCHED BY TARGET AND col1 % 3 = 0 THEN 
        INSERT (id, name)
           VALUES(col1, col2) 
    OUTPUT q.cnt INTO @count(cnt)
    ;

    Tuesday, December 25, 2012 5:09 PM
  • The second solution doesn't work correctly, because it counts the filtered rows (not the source rows):

    DECLARE @count TABLE (dummy bit NOT NULL)
    
    CREATE TABLE test (id int NOT NULL, name nvarchar(149) NOT NULL)
    
    ; WITH query(col1,col2) AS (
       SELECT ProductID, ProductName
       FROM  OPENROWSET(BULK  'C:\temp\slask.bcp', FORMATFILE = 'C:\temp\slask.xml') AS a
    )
    MERGE test t
    USING query q ON 1 = 0
    WHEN NOT MATCHED BY TARGET AND col1 % 3 = 0 THEN 
        INSERT (id, name)
           VALUES(col1, col2) 
    OUTPUT 1 INTO @count(dummy)
    ;
    SELECT * FROM test
    SELECT COUNT(*) FROM @count
    go
    DROP TABLE test

    Tuesday, December 25, 2012 5:52 PM
  • If it is possible for you, can you explain how it works at a low level?

    Does SQL Server load all the data from the CSV into a temp table before it counts and inserts the records?

    If you look at the query plan, you will see that there is a spool operator, and, yes, a spool is some sort of work table. So it reads all rows into this work table, and then it runs the count over that work table.

    Sorry about the other solution I posted; I was not observant when I checked the output. Here is a different query on the same idea, and this one is even crazier than the previous query:

    CREATE TABLE #count (dummy bit NOT NULL)

    CREATE TABLE test (id int NOT NULL, name nvarchar(149) NOT NULL)

    SET STATISTICS IO ON

    INSERT test (id, name)
       SELECT ProductID, ProductName
       FROM (MERGE #count t
             USING  (SELECT ProductID, ProductName
                     FROM   OPENROWSET(BULK  'C:\temp\slask.bcp',
                              FORMATFILE = 'C:\temp\slask.xml') AS q) AS q ON 1 = 0
             WHEN NOT MATCHED BY TARGET THEN
               INSERT (dummy) VALUES(1)
             OUTPUT q.ProductID, q.ProductName) AS m
       WHERE  ProductID % 3 = 0

    SET STATISTICS IO OFF

    SELECT * FROM test
    SELECT rows
    FROM   tempdb.sys.partitions
    WHERE  object_id = object_id('tempdb..#count')
    go
    DROP TABLE test, #count
    go

    This solution makes use of composable DML. That is, the inner part runs the MERGE where we only insert into a temp table with a single bit column. Then we use the OUTPUT clause to transport the data to the embracing INSERT, and we can filter the OUTPUT from MERGE.

    From a logical point of view, we would have to scan #count to count the number of rows, but I sneak the value from a catalog view. Please beware that Books Online describes this as an approximate value. I'm told that it is usually correct, but if you want to be sure you need to use SELECT COUNT(*).
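
    If the approximate figure is a concern, that exact variant is simply:

    -- This does scan #count, but the table holds only a bit column per line
    -- read, so it is tiny compared to rescanning the 20 GB source file.
    SELECT COUNT(*) AS lines_read FROM #count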

    Looking at the query plan, the main query does not have any spools; this is a streaming plan. We are filling up the temp table, but we are keeping it to a minimum.

    I suggest that you test both and see which one performs best.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    • Marked as answer by Yura Developer Wednesday, December 26, 2012 6:58 PM
    Tuesday, December 25, 2012 11:04 PM
  • THANK YOU VERY MUCH!!!

    Very good solution!

    Wednesday, December 26, 2012 7:00 PM
  • Dear Erland,

    How do I add a subquery to the last example? productid IN (SELECT productid FROM soldproducts)

    INSERT test (id, name)
       SELECT ProductID, ProductName
       FROM (MERGE #count t
             USING  (SELECT ProductID, ProductName
                     FROM   OPENROWSET(BULK  'C:\temp\slask.bcp', 
                              FORMATFILE = 'C:\temp\slask.xml') AS q) AS q ON 1 = 0
             WHEN NOT MATCHED BY TARGET THEN 
               INSERT (dummy) VALUES(1) 
             OUTPUT q.ProductID, q.ProductName) AS m
       WHERE  ProductID % 3 = 0 
    AND productid IN (SELECT productid FROM soldproducts)

    Saturday, March 9, 2013 4:55 PM
  • Hm, subqueries are not allowed here. Nor are joins.

    Either you would have to stick with that INSERT and then delete unwanted products in the next step. Or you will have to revert to the other query that I suggested:

    DECLARE @count TABLE (cnt int NOT NULL PRIMARY KEY WITH (IGNORE_DUP_KEY = ON))

    ; WITH query(ProductID, ProductName, cnt) AS (
       SELECT ProductID, ProductName, COUNT(*) OVER()
       FROM  OPENROWSET(BULK  'C:\temp\slask.bcp', FORMATFILE = 'C:\temp\slask.xml') AS a
    )
    MERGE test t
    USING query q ON 1 = 0
    WHEN NOT MATCHED BY TARGET
      AND ProductID % 3 = 0
      AND  ProductID IN (SELECT productid FROM soldproducts)  THEN
        INSERT (id, name)
           VALUES(ProductID, ProductName)
    OUTPUT q.cnt INTO @count(cnt)
    ;

    As I don't have the sample data, I cannot compare performance. But INSERT + DELETE may be better if you only shave off a minor part of the products; a sketch of that route follows.
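
    The INSERT + DELETE route would keep the composable-DML INSERT from the earlier post unchanged and follow it with something like this (assuming soldproducts.productid is not nullable and matches test.id):

    -- After the composable-DML INSERT, weed out products that were never sold:
    DELETE test
    WHERE  id NOT IN (SELECT productid FROM soldproducts)

    -- #count still reflects every line read from the file.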


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Saturday, March 9, 2013 5:30 PM
  • I think COUNT(*) OVER() will load all the data into a temp table, and that is not acceptable for performance reasons. The import file size is 20 GB and the average row size is 500 bytes, so roughly 40 million lines. Only about 10% of the rows are acceptable for import; all other rows must be filtered out.

    I have tried using a function, but I get this error:

    Function 'dbo.fnc_exist_product' is not allowed in the WHERE clause when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. This is because the function performs user or system data access, or is assumed to perform this access. By default, a function is assumed to perform data access if it is not schema-bound.

    ALTER FUNCTION dbo.fnc_exist_product 
    (
    	@productid int
    )
    RETURNS INT
    WITH SCHEMABINDING
    AS
    BEGIN
    			
    	RETURN CASE
    				WHEN EXISTS(SELECT 1 FROM dbo.soldproduct WHERE productid = @productid) THEN 1
    				ELSE 0
    				END 
    END
    GO



    Saturday, March 9, 2013 6:09 PM
  • Here is an idea. First do:

    DECLARE @productlist varchar(MAX)

    SELECT @productlist =
       (SELECT str(productid)
        FROM   soldproducts
        ORDER  BY productid
        FOR XML PATH(''))

    Now you have a sorted string where all product IDs take up 10 characters.

    Then you write a user-defined function (don't forget WITH SCHEMABINDING) that takes two parameters: the productid and this @productlist. This function performs a binary search in the string: you first look at the value in the middle; if the value you are looking for is smaller, you next look at the entry in the middle of the lower half, and so on, until you are down to an interval of one entry (or you find your product id). The function returns 1 if the product is found, else 0. Then you use the function in the query; a sketch of such a function follows.
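
    A minimal sketch of such a function, assuming the 10-character fixed width that str() produces by default (the function name is made up for the example):

    CREATE FUNCTION dbo.fnc_product_in_list
    (
        @productid   int,
        @productlist varchar(MAX)
    )
    RETURNS int
    WITH SCHEMABINDING
    AS
    BEGIN
        -- Entry n starts at character position (n - 1) * 10 + 1.
        DECLARE @lo  int = 1,
                @hi  int = len(@productlist) / 10,
                @mid int,
                @val int

        WHILE @lo <= @hi
        BEGIN
            SET @mid = (@lo + @hi) / 2
            SET @val = convert(int, substring(@productlist, (@mid - 1) * 10 + 1, 10))
            IF @val = @productid
                RETURN 1
            ELSE IF @val < @productid
                SET @lo = @mid + 1
            ELSE
                SET @hi = @mid - 1
        END

        RETURN 0
    END

    Since the function is schema-bound and performs no data access, it should be accepted in the WHERE clause that rejected dbo.fnc_exist_product above.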

    An alternative is to write the function in the CLR for better performance. In this case, your assembly would be UNSAFE and have two entry points. The first would be a stored procedure that loads the table into a static array, which your function would then search. (This solution is certainly not suitable in all environments, but you are probably fairly alone when you run your file load, so it should be OK.)


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Saturday, March 9, 2013 10:37 PM
  • I found a solution which is acceptable for me.

    I generate a delimited string from the list of sold products and use a CLR class I created.

    I have added the condition WHERE dbo.check_sold(product) = 1.

    Thank you for your attention!

    DECLARE @products VARCHAR(MAX) = '';
    
    SELECT @products = @products + '|' + product 
    FROM soldproducts 
    

    using System;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Server;

    static public class class_products_load
    {
        // Sold products, kept in static state between calls
        // (which is why the assembly must be loaded as UNSAFE).
        static private HashSet<String> products;

        // Loads the '|'-delimited product list into the static set.
        [SqlFunction]
        static public Int32 init(String products_list)
        {
            products = new HashSet<String>();

            foreach (String el in products_list.Split(new char[] { '|' }))
            {
                products.Add(el);
            }

            return 0;
        }

        // Returns 1 if the product is in the loaded set, else 0.
        [SqlFunction]
        static public Int32 contains(String product)
        {
            return products.Contains(product) ? 1 : 0;
        }

        // Releases the static set after the import.
        [SqlFunction]
        static public Int32 destroy()
        {
            products = null;
            return 0;
        }
    }
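
    For completeness, a hedged sketch of how this class might be wired up on the SQL Server side; the assembly name, DLL path, exposed function names, and the filtered column are assumptions, since the thread only mentions dbo.check_sold:

    CREATE ASSEMBLY products_clr
    FROM 'C:\clr\products_clr.dll'
    WITH PERMISSION_SET = UNSAFE   -- mutable static state requires UNSAFE
    GO
    CREATE FUNCTION dbo.init_sold (@products_list nvarchar(MAX)) RETURNS int
    AS EXTERNAL NAME products_clr.class_products_load.init
    GO
    CREATE FUNCTION dbo.check_sold (@product nvarchar(4000)) RETURNS int
    AS EXTERNAL NAME products_clr.class_products_load.contains
    GO

    -- Load the list once, run the filtered import, then release the set:
    DECLARE @products VARCHAR(MAX) = '';
    SELECT @products = @products + '|' + product FROM soldproducts;
    SELECT dbo.init_sold(@products);

    INSERT INTO table1 (col1, col2)
    SELECT col1, col2
    FROM   OPENROWSET(BULK 'F:\test.csv', FORMATFILE = 'f:\test.XML') AS a
    WHERE  dbo.check_sold(col2) = 1;   -- col2 assumed to be the "product" column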
    

    Sunday, March 10, 2013 11:37 AM
  • What can I say? Totally crazy things to do in SQL Server?

    Maybe a more "normal" solution would be to write a CLR procedure that reads the file and, for each row, checks whether the row is interesting. If it is, the row is passed to the table through the SqlBulkCopy interface. Then it is trivial to count the rows.

    But as you have it working now, I guess you prefer to keep it as it is.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Sunday, March 10, 2013 1:16 PM
  • I tried writing my own bulk insert in C# with multithreading. The MSSQL bulk insert works faster.
    Monday, March 11, 2013 4:31 PM
  • What about SSIS? Can I import and count the selected and inserted rows? What about performance?

    Monday, March 11, 2013 5:38 PM
  • You may want to ask those questions in an SSIS forum. I don't know SSIS myself.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
    Monday, March 11, 2013 10:59 PM