XML Performance Comparison - Why is OPENXML faster?

  • Question

  • I've got a table that looks like this: CONTACTS(ID UNIQUEIDENTIFIER, CONTACT_INFO XML)

    The problem is that I have to insert 305,000 records into this table as fast as possible, and inside a transaction. Here is solution 1:

    DECLARE @contacts xml

    SET @contacts =
    '<contacts>
        <contact id="fce91f99-69d7-4a55-9ecf-000072f65eb2">
           
    <name>Foo Bar</name><phone>123-432-1234</phone>
           
    <gender>mail@email.com</gender>
        </contact>
        <contact id="0982747d-9089-49d8-b870-000072fe965a">
            <name>Foo Bar</name>
            <phone>123-432-1234</phone>
            <gender>mail@email.com</gender>
        </contact>
        <contact id="a5b4874e-538e-4349-94d3-00009cf28e07">
            <name>Testing Name</name>
            <phone>432-123-4213</phone>
            <gender>e@mail.com</gender>
        </contact>
    </contacts>' 

    INSERT INTO CONTACTS(ID, CONTACT_INFO)
    SELECT node.value('@id', 'uniqueidentifier'), node.query('.')
    FROM @contacts.nodes('contacts/contact') DATA(node)

    That works fine, but when <contacts /> contained over 300K <contact /> nodes, I aborted the query after 10 minutes.
    Here is solution 2 (the way I used to do it in SQL Server 2000):

    DECLARE @idoc INT

    EXEC sp_xml_preparedocument @idoc OUTPUT, @contacts

    INSERT INTO CONTACTS (ID, CONTACT_INFO)
    SELECT Id, DATA
    FROM OPENXML (@idoc, '/contacts/contact')
    WITH (Id uniqueidentifier '@id', DATA XML '.') R

    EXEC sp_xml_removedocument @idoc -- free the parsed document when done

    This query ran in about 3.5 minutes.

    I read in this article that the first solution should have been faster than the second. Does anyone have enough experience with this to know what's going on here? If there's another approach I should be taking, I'd love to hear it as well.

    Monday, January 22, 2007 2:55 PM

Answers

  • The reason why the first solution is slow is that the query processor has very bad cardinality estimates for so-called table-valued functions (which the nodes() method uses if no index is specified). So the query optimizer may choose a really bad plan.

    Some tricks are to use multiple nodes() methods to trick the optimizer into a good plan (this does not always work). Another workaround is to put the data into a temp table and define an index on the XML column. That should also speed up the nodes() method call.

    In general, very large amounts of XML data are best shredded using the SQLXML Bulkload object instead of either nodes() or OpenXML, since it offloads the parsing of the XML to the mid-tier.

    Also, OpenXML stores the parsed document in main memory and does not use SQL Server's memory management as effectively, so nodes() will scale better than OpenXML under stress and use less memory.

    Best regards

    Michael

    Thursday, January 25, 2007 2:24 AM
  • The first solution is for small amounts of data. If you shred a big XML document into relational tables, OPENXML is much faster.

    You are doing the right thing, i.e. always test different solutions.

    Wednesday, January 24, 2007 1:24 PM
    Answerer

All replies

  • The first solution is for small amounts of data. If you shred a big XML document into relational tables, OPENXML is much faster.

    You are doing the right thing, i.e. always test different solutions.

    Wednesday, January 24, 2007 1:24 PM
    Answerer
  • Thanks. It's also interesting to note that SQL Server's memory usage in the 1st solution was not very high, but skyrocketed in the 2nd solution.

    Do you know why OPENXML is faster? What is the 1st solution doing that makes it take so long? Is it executing the XPath queries more times than it has to? I'd like to have some idea so I can decide for myself which method to use in different circumstances.

    Wednesday, January 24, 2007 2:16 PM
  • The reason why the first solution is slow is that the query processor has very bad cardinality estimates for so-called table-valued functions (which the nodes() method uses if no index is specified). So the query optimizer may choose a really bad plan.

    Some tricks are to use multiple nodes() methods to trick the optimizer into a good plan (this does not always work). Another workaround is to put the data into a temp table and define an index on the XML column. That should also speed up the nodes() method call.
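
    For example, something along these lines (an untested sketch using the CONTACTS table from the original post; the temp table and index names are illustrative):

    CREATE TABLE #shred (id INT IDENTITY PRIMARY KEY, doc XML NOT NULL)
    INSERT INTO #shred (doc) VALUES (@contacts)

    -- The primary XML index gives the optimizer real cardinality
    -- information for the nodes() call below.
    CREATE PRIMARY XML INDEX ix_shred_doc ON #shred (doc)

    INSERT INTO CONTACTS (ID, CONTACT_INFO)
    SELECT node.value('@id', 'uniqueidentifier'), node.query('.')
    FROM #shred
    CROSS APPLY doc.nodes('contacts/contact') DATA(node)

    DROP TABLE #shred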

    In general, very large amounts of XML data are best shredded using the SQLXML Bulkload object instead of either nodes() or OpenXML, since it offloads the parsing of the XML to the mid-tier.

    Also, OpenXML stores the parsed document in main memory and does not use SQL Server's memory management as effectively, so nodes() will scale better than OpenXML under stress and use less memory.

    Best regards

    Michael

    Thursday, January 25, 2007 2:24 AM
  • This is exactly the problem that I am having.  OPENXML is often much faster.  And I am not seeing any of the memory problems discussed in this thread.  I tried the exact scenario that the original poster described, and I have to go to 5000 rows of input before the OPENXML method takes longer than 1 second to process.  And memory usage did not skyrocket.  The same 5000 rows take 118 seconds when using the nodes() method!

    I have written this stand-alone script that anyone can run to see the performance comparison.  You can tweak the number of rows and the number of test repetitions (for a more accurate timing measurement).  The performance discrepancy is more pronounced for larger numbers of rows and columns, but is still present for these small datasets: 10 rows by 1 column, 1 row by 5 columns.  I have not tried this on SQL 2008.  If anyone has access to it, please run this and post back the results.  And I would be very interested in any other comments as well.  Thanks.

    ----------------------------------------


    -- Set the number of rows
    declare @N int;
    set @N=10;

    -- Set the number of test repetitions
    declare @Reps int;
    set @Reps = 100;

    -- create the test table
    declare @TempTable table (
        a int,
        b uniqueidentifier,
        c nvarchar(100),
        d datetime,
        e bit
    );

    -- Create some data
    while (@N>0) begin
        insert into @TempTable (a, b, c, d, e) values (rand(1000), newid(), newid(), getdate(), 0);
        set @N = @N - 1;
    end

    declare @Xml xml;
    set @Xml = (select * from @TempTable for xml raw('ITEM'), type, elements, root('INSERTFROMXML'), binary base64);

    declare @StartTime datetime;
    declare @Counter int;
    declare @Elapsed1 int;
    declare @Elapsed2 int;

    -- Test Method 1:

    delete from @TempTable;
    set @Counter = 0;
    set @StartTime = getdate();

    while (@Counter < @Reps) begin
        -- The slow part:
        insert into @TempTable
        select
            T.c.value('(a)[1]','int'),
            T.c.value('(b)[1]','uniqueidentifier'),
            T.c.value('(c)[1]','nvarchar(100)'),
            T.c.value('(d)[1]','datetime'),
            T.c.value('(e)[1]','bit')
        from @Xml.nodes('/INSERTFROMXML/ITEM') T(c)

        set @Counter = @Counter + 1;
    end

    set @Elapsed1 = datediff(ms, @StartTime, getdate());

    -- Test Method 2:

    delete from @TempTable;
    set @Counter = 0;
    set @StartTime = getdate();

    while (@Counter < @Reps) begin
        -- Not so slow any more:
        declare @idoc int;
        exec sp_xml_preparedocument @idoc output, @Xml;
        insert into @TempTable
        select *
        from openxml(@idoc, '/INSERTFROMXML/ITEM', 2)
        with (
            a int,
            b uniqueidentifier,
            c nvarchar(100),
            d datetime,
            e bit
        );
        exec sp_xml_removedocument @idoc;

        set @Counter = @Counter + 1;
    end

    set @Elapsed2 = datediff(ms, @StartTime, getdate());

    -- Results:

    select 'Method 2 speedup: ' + convert(varchar, round(convert(float, @Elapsed1) / @Elapsed2, 1)) + 'x';

    Thursday, June 12, 2008 3:09 PM
  • Also, the estimated execution plan for the nodes() method is MUCH more complicated than the one for OPENXML.  Perhaps there is something that can be done to the way in which I am using the nodes() method to have it generate a better, more efficient plan?  In this thread, MRys suggests that perhaps something can be done to "trick" SQL into generating a better plan, but I'm not sure what needs to be done.  I would love some more information on that topic.
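
    For what it's worth, one rewrite that often helps on untyped XML (my own suggestion, not something spelled out in this thread) is to address the text() node directly in the value() calls, so the engine doesn't have to assemble each element's full content. A sketch against the script above:

    -- Untested sketch: Method 1 with the text() accessor, which usually
    -- produces a simpler plan on untyped XML.
    insert into @TempTable
    select
        T.c.value('(a/text())[1]','int'),
        T.c.value('(b/text())[1]','uniqueidentifier'),
        T.c.value('(c/text())[1]','nvarchar(100)'),
        T.c.value('(d/text())[1]','datetime'),
        T.c.value('(e/text())[1]','bit')
    from @Xml.nodes('/INSERTFROMXML/ITEM') T(c);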

    Thursday, June 12, 2008 3:21 PM
  • For what it's worth, I just tried the method 1/method 2 comparison on a SQL 2008 server...

    "Method 2 speedup: 2.6x"

    That's on a server with a pretty good amount of memory (10 GB), so using sp_xml_preparedocument even on a large-ish XML set shouldn't be too bad.

    I'm seeing a problem related to this where our developers recently changed the way XML data is being processed.  It had been using the OPENXML format and it generally ran pretty fast when inserting around 70,000 rows.  Now that they're using the nodes() method, it's been taking over 2 hours and we had to patch the code to set a commandTimeout of 3 hours to ensure it would finish.

    The nodes() method was also using a LOT of CPU time, and it may be a CPU bottleneck that makes it run so slowly.

    I think I might suggest going back to the old method unless there was some other reason they switched.  Our production SQL servers have 12 GB of RAM and lots of CPU power, so I'm guessing OPENXML may be just fine for our needs.  If our XML data grew by a significant amount, it would probably be time to look at bulk imports.
    Thursday, October 8, 2009 11:54 PM
  • I think "OPENXML is faster" as a statement on its own doesn't really mean that much.  There is a known feature of OPENXML where it automatically claims 1/8th of the server's memory.  In your example, one call of sp_xml_preparedocument will claim 1.25GB memory.  Subsequent calls do the same.  Please read the following articles.


    Stop Using OPENXML (Please...)
    https://www.sqlservercentral.com/blogs/michael_coles/archive/2008/01/20/stop-using-openxml-please.aspx

    Performance Optimizations for the XML Data Type in SQL Server 2005
    http://msdn.microsoft.com/en-us/library/ms345118.aspx

    XML Indexes in SQL Server 2005
    http://msdn.microsoft.com/en-us/library/ms345121(SQL.90).aspx


    A lot of people have had success making the XML typed.
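
    For example (a hedged sketch; the schema collection and element names are illustrative, not taken from your data):

    -- Typing the XML against a schema collection gives the engine the
    -- declared types up front instead of forcing it to infer them.
    CREATE XML SCHEMA COLLECTION dbo.ItemSchema AS
    N'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="ITEM">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="a" type="xs:int" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:schema>';

    DECLARE @typed xml(dbo.ItemSchema);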

    Also bear in mind, it's possible to write bad XQuery just as it's possible to write bad SQL.  Could you post a small sample of your 70,000-row query and we'll have a look at it.

    Friday, October 9, 2009 11:27 AM
    Answerer
  • I'd like to clarify the 1/8th server memory myth. 1/8th of server memory is the maximum memory that the MSXML parser can take from the server. It's not that one sp_xml_preparedocument call will always take that amount of memory. The following is from SQL Server help for sp_xml_preparedocument:

    "A parsed document is stored in the internal cache of SQL Server 2005. The MSXML parser uses one-eighth the total memory available for SQL Server. To avoid running out of memory, run sp_xml_removedocument to free up the memory."


    As for the performance, it's not hard to compare them. I did it a long time ago and found that in most cases OPENXML won. I am using both OPENXML and XQuery in my code, based on my test results.

    Friday, October 9, 2009 12:36 PM
    Answerer
  • Quote from Michael Coles' book, 'Pro SQL Server 2008 XML', Apress, p. 79:

    "The OPENXML rowset provider is based on the old COM model and it relies on the Microsoft XML Core Services
    library (MSXML) to perform XML manipulation and shredding. When SQL Server invokes MSXML it automatically
    reserves 1/8th of your SQL Server’s memory to processing the XML document
    . That means if you have
    a 2 GB SQL Server, 250 MB is assigned to process that XML. It doesn’t matter how big your XML document
    is—1 KB or 100 MB—the server automatically assigns 1/8th of your memory to the XML cache. The nodes
    method is not COM-based and is much better at dynamically allocating memory than OPENXML. This makes
    the score OPENXML 0, nodes() method +1.
    I’ve performed several ad hoc tests of OPENXML performance vs. the xml data type nodes() method,
    and I’ve found the performance of nodes() is roughly the same as OPENXML on nonindexed XML data. As
    far as speed goes, both methods gain a point. OPENXML is now at +1, nodes() is at +2.
    Although the speed difference appears to be negligible on nonindexed data, the nodes() method
    absolutely flies on a column with a primary XML index. In my simple ad hoc tests, I’ve recorded substantial
    speed increases—by as much as 50 percent. By my reckoning the nodes() method clearly beats OPENXML,
    3 to 1. And the moral of the story is, get in the habit of using the nodes() method instead of OPENXML. It will
    help you avoid OPENXML’s memory issues and, in some situations, increase performance."

    Friday, October 9, 2009 1:43 PM
    Answerer
  • Bob,
    My response is based on your saying:
    Quote:
    "In your example, one call of sp_xml_preparedocument will claim 1.25GB memory.  Subsequent calls do the same.  Please read the following articles. ..."


    As for the performance of OPENXML and .nodes(), it depends on the hardware and software environment, test design, etc. Other people's performance test results can only be a reference for you. You have to do your own tests in your environment, based on your data and use cases. E.g., my tests, and those of lots of other people in this thread, found OPENXML faster in most cases.

    Friday, October 9, 2009 5:30 PM
    Answerer
  • Quite right Peter, sp_xml_preparedocument does not take 1/8th, then 1/8th again, but repeat calls will always take 1/8th, which was my meaning.  It is also possible to run out of this allocated memory with repeated calls of sp_xml_preparedocument without calling sp_xml_removedocument.
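
    A sketch of pairing the calls so the handle is always released (variable names are illustrative):

    DECLARE @idoc int;
    EXEC sp_xml_preparedocument @idoc OUTPUT, @xmlDoc;  -- @xmlDoc holds your document
    BEGIN TRY
        SELECT name
        FROM OPENXML(@idoc, '/contacts/contact', 2)
        WITH (name nvarchar(100));
    END TRY
    BEGIN CATCH
        PRINT ERROR_MESSAGE();  -- fall through so the handle is still freed
    END CATCH
    EXEC sp_xml_removedocument @idoc;  -- releases the MSXML cache entry
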
    Sunday, October 11, 2009 11:12 AM
    Answerer
  • I would use SqlXmlBulkLoad for loading large XML docs.  The entire XML doc is not loaded into memory, and you can still have the import be transactional.

    http://msdn.microsoft.com/en-us/library/ms171721.aspx

    Thursday, February 11, 2010 6:20 AM
  • In case XML parsing is the bottleneck, I suggest that you look at vtd-xml (http://vtd-xml.sf.net), which claims to be the fastest and most memory-efficient option available; it also supports incremental updates and built-in XPath.

    Tuesday, February 23, 2010 7:52 AM