locked
CDATA in XML is not supported to the specifications! RRS feed

  • Question

  • Why does SQL Server not support the technical specification of the CDATA or PI's?

    This Query below results in a the XML converted being escaped.  CDATA contents should not be escaped.  According to W3.org 2.4, and 2.7.

    INSERT dbo.DDLEvents
    (
        EventType
        ,EventDDL
        ,DatabaseName
        ,SchemaName
        ,ObjectName
        ,LoginName
        ,EventXML
    )
    SELECT
        'CREATE_PROCEDURE'
        ,OBJECT_DEFINITION([object_id])
        ,DB_NAME()
        ,OBJECT_SCHEMA_NAME([object_id])
        ,OBJECT_NAME([object_id])
        ,SUSER_SNAME() + ' Base Line'
        ,(select convert(xml,'<?xml version="1.0"?> <StoredProc><![CDATA[' + char(13) + OBJECT_DEFINITION([object_id]) + ']]></StoredProc>'))
    FROM
        sys.procedures;

    2.6 Processing Instructions

    [Definition: Processing instructions (PIs) allow documents to contain instructions for applications.]

    Processing Instructions

    [16]   PI   ::=   '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
    [17]   PITarget   ::=   Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
    PIs are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names " XML ", " xml ", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references must not be recognized within processing instructions.

    2.7 CDATA Sections

    [Definition: CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string " <![CDATA[ " and end with the string " ]]> ":]

    CDATA Sections

    [18]   CDSect   ::=   CDStart CData CDEnd
    [19]   CDStart   ::=   '<![CDATA['
    [20]   CData   ::=   (Char* - (Char* ']]>' Char*))
    [21]   CDEnd   ::=   ']]>'
    Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using " &lt; " and " &amp; ". CDATA sections cannot nest.

    An example of a CDATA section, in which " <greeting> " and " </greeting> " are recognized as character data, not markup:

    <![CDATA[<greeting>Hello, world!</greeting>]]> 

    Thursday, February 21, 2013 7:57 PM

Answers

  • I found the links which discuss CDATA.  Michael Rys, a Principal Program Manager at Microsoft questions the use case here, here and here.  This has been discussed at length in an interesting thread here.

    I should state clearly that I do not work for Microsoft nor represent them.  This is a user group where people contribute in their free time.

    As regards developers making decisions, you have all the information to hand to make an informed decision.  So at the risk of repetition, if you wish to preserve your xml exactly, use nvarchar(max), if you wish to take advantage of the additional methods of the SQL Server XML datatype, then use it, but in the full knowledge it can for example strip whitespace and remove CDATA markup.  Even where it does remove the CDATA markup, when querying the xml with the .value method, it will be rendered as was originally (excluding the markup).  Try this simple example where I demonstrate the technique I mentioned, using a computed column:

    CREATE TABLE #tmp ( originalXML NVARCHAR(MAX), computedXML AS CAST( originalXML AS XML ) )
    
    INSERT INTO #tmp VALUES ( '<x><![CDATA[<really> &all the worst characters you can think </of>]]></x>' )
    
    -- I need to see my original unaffected CDATA section
    SELECT originalXML yesSir FROM #tmp
    
    -- I want to use the XML datatype
    SELECT computedXML.value('(x/text())[1]', 'NVARCHAR(MAX)') sureCan FROM #tmp
    
    -- Cleanup
    DROP TABLE #tmp



    • Marked as answer by Iric Wen Friday, March 1, 2013 8:28 AM
    • Edited by wBobEditor Friday, March 1, 2013 9:31 AM typos
    Friday, February 22, 2013 10:17 AM
    Answerer

All replies

  • You can create CDATA sections, eg using FOR XML EXPLICIT:

    SELECT
         'CREATE_PROCEDURE' EventType
         ,OBJECT_DEFINITION([object_id]) EventDDL
         ,DB_NAME() DatabaseName
         ,OBJECT_SCHEMA_NAME([object_id]) SchemaName
         ,OBJECT_NAME([object_id]) ObjectName
         ,SUSER_SNAME() + ' Base Line' LoginName
          ,( SELECT 1 Tag, 0 Parent, OBJECT_DEFINITION([object_id]) AS [StoredProc!1!!CDATA] FOR XML EXPLICIT ) EventXML
    FROM sys.procedures

    However as soon as you cast it to the XML data-type then it will remove the CDATA.  This is a "feature" of the SQL Server XML datatype - some things get removed, eg whitespace.  If you're only doing this just so you can click on the proc definition and it open up in another window, well you can't really do it without the text being entitized. If you genuinely want CDATA, then simply avoid the cast to xml.
    Thursday, February 21, 2013 8:32 PM
    Answerer
  • The feature is fails to follow the spec of the W3 XML Support for both CDATA.  PI data is not supported in any form best can tell.

    How would Microsoft recommend retrieving the content of a column from with SSMS from either an XML column or Varchar(Max) column?

    CDATA should not be escape, I am not sure how the 'Feature' is a benefit when the specification says that the data in the CDATA Tag is not part of the XML markup and should be treated as it is in the CDATA tag.

    Is there a method to turn the 'Feature' off?  Most software products have the ability to turn Features on and off.

    I have tried your suggested solution but the CDATA tag is removed and the content of the CDATA tag is escaped again.

    There are very valid reasons why this feature causes substantial problems when storing content in XML.  The whole point of a CDATA tag is to keep the parsers hands off of the content. 

    I know when I want the parser to manipulate my XML and I know when I do not want the parser to.

    Your feature is a failure.

    • Edited by Ron A Wilson Thursday, February 21, 2013 9:07 PM clarity
    Thursday, February 21, 2013 9:04 PM
  • As suggested, by avoiding the cast to the XML datatype, eg using NVARCHAR(MAX).
    Thursday, February 21, 2013 9:07 PM
    Answerer
  • In this example attempting to copy and past the content of the NVARCHAR(MAX) column truncates the data.

    I want to use the benefits of an XML column with the data that I am storing.

    By finding this feature today, it concerns me about moving forward with a New Product we are working on and the ability to use XML as a viable data type.

    I currently have 100 Clients with the Enterprise version of  SQL Server 2005, 2008, 2008 r2 or 2012.  With over 15,000 user seats.  I guess we may need to review SQL Server as a viable option for a Data Base Engine for a modern world of data exchange, dynamic schema and Health Care interoperability.

    Thanks

    Thursday, February 21, 2013 9:12 PM
  • The truncation is a feature of Management Studio. You can extend it a bit in Tools > Options, up to 2mb I think. The xml will be preserved exactly when stored as nvarchar(max) and can be retrieved in full via an application, bcp ,SSIS etc Presumably your 15,000 users won't be copying and pasting from SSMS. I don't have a link to hand but it is documented that if you wish to preserve your XML "as is" or for auditing purposes you have to use nvarchar(max). Whitespace is another example which can be removed when using the xml datatype. You could consider storing as nvarchar and casting to xml for certain queries, storing both, or even a computed column. In short, you have options. Good luck!
    Thursday, February 21, 2013 9:53 PM
    Answerer
  • There are many reasons why one would want to store an element of XML in CDATA and take advantage of the other features of XML and maintain the integrity of the data in the CDATA element.

    All other XML parsers honor the standard for both CDATA and PI, why can't Microsoft or at least allow a developer to make the decision in his individual case.

    And you are correct my users will not be copying and pasting but I do not think it should be necessary to separate an element from an XML Data item because Microsoft is not honoring the standard.

    Do you folks up there have any clue what the ACA is attempting do with Data and Standards?


    Thursday, February 21, 2013 11:53 PM
  • I found the links which discuss CDATA.  Michael Rys, a Principal Program Manager at Microsoft questions the use case here, here and here.  This has been discussed at length in an interesting thread here.

    I should state clearly that I do not work for Microsoft nor represent them.  This is a user group where people contribute in their free time.

    As regards developers making decisions, you have all the information to hand to make an informed decision.  So at the risk of repetition, if you wish to preserve your xml exactly, use nvarchar(max), if you wish to take advantage of the additional methods of the SQL Server XML datatype, then use it, but in the full knowledge it can for example strip whitespace and remove CDATA markup.  Even where it does remove the CDATA markup, when querying the xml with the .value method, it will be rendered as was originally (excluding the markup).  Try this simple example where I demonstrate the technique I mentioned, using a computed column:

    CREATE TABLE #tmp ( originalXML NVARCHAR(MAX), computedXML AS CAST( originalXML AS XML ) )
    
    INSERT INTO #tmp VALUES ( '<x><![CDATA[<really> &all the worst characters you can think </of>]]></x>' )
    
    -- I need to see my original unaffected CDATA section
    SELECT originalXML yesSir FROM #tmp
    
    -- I want to use the XML datatype
    SELECT computedXML.value('(x/text())[1]', 'NVARCHAR(MAX)') sureCan FROM #tmp
    
    -- Cleanup
    DROP TABLE #tmp



    • Marked as answer by Iric Wen Friday, March 1, 2013 8:28 AM
    • Edited by wBobEditor Friday, March 1, 2013 9:31 AM typos
    Friday, February 22, 2013 10:17 AM
    Answerer