How to remove special characters (&, <, >, ...) from an XML file before it is processed by BizTalk Server?

  • Question

  • How can I remove special characters (&, <, >, ...) from an XML file before it is processed by BizTalk Server?

    I think we can use a pipeline component. Please help me with the code for the pipeline component. Thanks.


    ravindra

    Monday, April 16, 2012 6:53 PM

Answers

  • In my experience, pushing back to the source and "demanding" that they deliver valid XML is usually not an option. The source could be a customer or some legacy system no one is going to tamper with. Besides, this is no big deal to fix in a pipeline component.

    Here is a code snippet you could implement to make this work:

    // Needs: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
    //        using Microsoft.BizTalk.Component.Interop; using Microsoft.BizTalk.Message.Interop;
    // messageEncoding, messageEncodingName, SearchString, ReplaceString and STRING_SEPARATOR
    // are members of the pipeline component class (not shown here).
    public IBaseMessage Execute(IPipelineContext pc, IBaseMessage inmsg)
    {
        IBaseMessagePart body = inmsg.BodyPart;

        if (body != null)
        {
            StreamReader strReader = null;
            try
            {
                // Look up the encoding. If the specified encoding is not found, we use UTF-8.
                messageEncoding = GetEncoding(messageEncodingName);

                strReader = new StreamReader(body.Data, messageEncoding);

                // Read the entire message into a string. Not a good idea if you receive
                // large messages; then you need a buffered (streaming) approach.
                string msgBodyContentString = strReader.ReadToEnd();

                string[] searchStrings = SearchString.Split(STRING_SEPARATOR.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                string[] replaceStrings = ReplaceString.Split(STRING_SEPARATOR.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

                for (int index = 0; index < searchStrings.Length; index++)
                {
                    string oldValue = searchStrings[index];
                    if (!string.IsNullOrEmpty(oldValue))
                    {
                        // Default to an empty string if there is no replace value at this position.
                        string newValue = string.Empty;
                        if (index < replaceStrings.Length)
                            newValue = replaceStrings[index];

                        // Regex.Unescape lets the configured values use escapes such as \t or \n.
                        msgBodyContentString = msgBodyContentString.Replace(Regex.Unescape(oldValue), Regex.Unescape(newValue));
                    }
                }

                // Encode the result directly in the message encoding.
                byte[] msgBodyContentBytes = messageEncoding.GetBytes(msgBodyContentString);

                // Write the new stream back and let BizTalk dispose of it together with the message.
                MemoryStream newBody = new MemoryStream(msgBodyContentBytes);
                pc.ResourceTracker.AddResource(newBody);
                body.Data = newBody;
            }
            catch (Exception)
            {
                // You should probably log the failure here.
                throw; // rethrow without resetting the stack trace
            }
            finally
            {
                if (strReader != null)
                {
                    strReader.Close();
                }
            }
        }

        return inmsg;
    }

      Edit: Sorry about that last pipeline component. It has a nice buffered approach, but its Regex was written to detect escaped XML. This pipeline component is a more generic "replacer". I believe you can find other (more polished) versions on CodePlex or on Eliasen's blog.
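
    The snippet above calls a GetEncoding helper that is not shown in the post. A minimal sketch of what it might look like follows, assuming it simply resolves the configured name and falls back to UTF-8 as the code comment says. The class name EncodingLookup is hypothetical, and the poster's actual helper may differ.

```csharp
using System;
using System.Text;

public static class EncodingLookup
{
    // Hypothetical stand-in for the GetEncoding helper used in the snippet.
    // Resolves an encoding by name and falls back to UTF-8 when the name
    // is missing or unknown.
    public static Encoding GetEncoding(string encodingName)
    {
        if (string.IsNullOrEmpty(encodingName))
        {
            return Encoding.UTF8;
        }

        try
        {
            return Encoding.GetEncoding(encodingName);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return Encoding.UTF8;
        }
    }
}
```

    Falling back silently keeps the pipeline running, but a wrong guess can corrupt non-ASCII characters, so logging the fallback is probably worthwhile.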



    Tuesday, April 17, 2012 6:40 AM

All replies

  • Hi 

    "Please help me out with coding for pipeline component." --> Check the blog posts "How to Develop BizTalk Custom Pipeline Components - Part1" and "Creating a custom pipeline component for Biztalk 2010 server (Part 1)". You can use a regular expression to remove all the XML tags from the message; please refer to this for help on a regular expression to remove XML tags.

     


    HTH,
    Naushad Alam

    When you see answers and helpful posts, please click Vote As Helpful, Propose As Answer, and/or Mark As Answer
    alamnaushad.wordpress.com
    My new TechNet Wiki "BizTalk Server: Performance Tuning & Optimization"

    Monday, April 16, 2012 7:01 PM
    Moderator
  • Ravindra, you can get an idea of the steps from this post and modify the code as per your requirement. The easiest way to create a pipeline component is to use the Pipeline Component Wizard.

    Please mark the post answered your question as answer, and mark other helpful posts as helpful, it'll help other users who are visiting your thread for the similar problem, Regards -Rohit Sharma (http://rohitt-sharma.blogspot.com/)

    Monday, April 16, 2012 7:04 PM
    Moderator
  • +1 to Naushad!

    Naushad has shared good links; go through those too.


    Regards, Rohit Sharma

    Monday, April 16, 2012 7:10 PM
    Moderator
  • Hi Naushad,

    Thank you so much for your reply. I tried to implement a pipeline component. I have special characters in the XML data, like <Name1>test & environment</Name1>

    so when I try this code it is not accepted, because the XML is invalid. I get an error while parsing, i.e. in the line

    xmlDoc.LoadXml("");


    ravindra

    Monday, April 16, 2012 7:39 PM
  • Hi ,

    You will get the error when you try to load the invalid XML using xmlDoc.LoadXml(). Before that, you need to remove the "&" character from the XML string.

    • Make sure the pipeline removes those characters.
    • Try a simple console application first that removes those characters from the XML.
    • Then put the code in the pipeline and try again.


    HTH,
    Naushad Alam


    Monday, April 16, 2012 7:46 PM
    Moderator
  • Where is your XML document coming from?  For an XML document to be valid, certain text content must be escaped.

    &  - &amp;

    '   - &apos;

    "   - &quot;

    <  - &lt;

    >  - &gt;

    In your sample it should be:  <Name1>test &amp; environment</Name1>

    IMHO, I would concentrate on receiving valid XML rather than trying to fix it. To fix it you will need to completely understand the XML state rather than simply replacing strings.
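
    To see the difference between the two forms, a small check like the one below can be run in a console application. It is only a sketch: the class and method names (WellFormedCheck, IsWellFormed) are made up for illustration, and it relies on XmlDocument.LoadXml throwing XmlException for input that is not well-formed.

```csharp
using System;
using System.Xml;

public static class WellFormedCheck
{
    // Returns true when the string parses as XML, false when the parser rejects it.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

    With this helper, IsWellFormed("<Name1>test & environment</Name1>") returns false, while the escaped form with &amp; returns true. In element text the parser only insists on escaping & and <; a bare > or a quote in text content is accepted, although escaping all five characters is harmless.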


    David Downing... If this answers your question, please Mark as the Answer. If this post is helpful, please vote as helpful.

    Monday, April 16, 2012 7:50 PM
  • Hi Naushad,

    Thanks for your reply.

    I can replace &. But if < is present in the data, what should I do?


    ravindra

    Monday, April 16, 2012 7:51 PM
  • Hi 

    +1 to David

    He has suggested a list of characters that need to be escaped. I agree with David; please follow his list and make sure your XML is free of those unescaped characters.


    HTH,
    Naushad Alam



    Monday, April 16, 2012 7:54 PM
    Moderator
  • Hi David,

    Thanks for your reply.

    Here the XML file is very big and it comes from a third party. They send invalid XML whenever the data contains special characters. It would be easy for them to correct, but they are not willing to. So please advise what the better way would be.


    ravindra

    Monday, April 16, 2012 7:57 PM
  • Hi Naushad,

    Thanks for your reply.

    Here the XML file is very big and it comes from a third party. They send invalid XML whenever the data contains special characters. It would be easy for them to correct, but they are not willing to. So I have to do it on my side.


    ravindra

    Monday, April 16, 2012 7:58 PM
  • Hi Ravindra, 

    Here is what I would do:

    • Create a custom receive pipeline.
    • Write a method that replaces each of the characters suggested by David in the message and returns the modified stream to BizTalk.
    • My method to replace the characters would be something like this:
    • Read the message body into a stream.
    • Read the stream byte by byte.
    • If a byte is &, write &amp; to the output stream instead.
    • Repeat the above step for each character that needs escaping.
    • The links I provided in my first post will help you write the custom pipeline component.

    However, I suggest you first write a console application that performs the above steps and then add that method to the pipeline class. I hope this helps.

    The document "Developing a Streaming Pipeline Component for BizTalk Server" will help you while developing the pipeline component.
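
    For the ampersand step, a console-application sketch could look like the following. It uses a regular expression instead of a byte-by-byte loop, which is a swapped-in technique rather than the approach described above, and the names AmpersandFixer and EscapeBareAmpersands are hypothetical. It assumes the message fits in memory as a string and that the data uses no custom entities and no CDATA sections; it does nothing about a stray < in the data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandFixer
{
    // Matches an '&' that does not already start one of the five predefined
    // entities or a numeric character reference such as &#38; or &#x26;.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)",
        RegexOptions.Compiled);

    // Replaces every bare '&' with "&amp;" and leaves existing entities alone.
    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }
}
```

    Because already-escaped entities are skipped, running the method twice on the same message does not double-escape it, which a plain Replace("&", "&amp;") would.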


    HTH,
    Naushad Alam




    Monday, April 16, 2012 8:11 PM
    Moderator
  • I hate to say it, but brute-forcing XML without using an XML reader of some sort is probably a disaster waiting to happen. I would push back on the third party and possibly escalate internally to the people who negotiated with the third party.

    Just to elaborate on what you're facing, the following is valid XML. Notice that element <test> has a text node, an element node, and a comment node as children... keeping track of state is not trivial. Then there's CDATA, ...

    <test myAttrib="attribute data">
        data for test
        <child1>data for child1</child1>
        <!-- Comment -->
    </test>
    How would you substitute the "<" or ">" characters?


    David Downing



    Monday, April 16, 2012 8:13 PM
  • Hi Naushad,

    Thanks for your reply.

    I can replace &. But if < is present in the data, what should I do?


    ravindra

    You should replace '<' as well. I am not sure you are left with any other choice.

    Regards,
    Bali
    MCTS: BizTalk Server 2010,BizTalk Server 2006 and WCF
    My Blog:dpsbali-biztalkweblog
    -----------------------------------------------------
    Mark As Answer or Vote As Helpful if this helps.

    Tuesday, April 17, 2012 4:59 AM
  • You don't need to manually replace all those characters or even list them down.

    Use: SecurityElement.Escape

    tagText = SecurityElement.Escape(tagText);

    http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape(v=vs.80).aspx
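
    A short sketch of how this call behaves, assuming you already hold the raw text of a single element (the EscapeDemo class and BuildName1Element method are made-up names for illustration). The point is that only the text content is passed to SecurityElement.Escape, never the surrounding tags.

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    // Wraps raw text in a <Name1> element, escaping only the text content.
    public static string BuildName1Element(string rawText)
    {
        return "<Name1>" + SecurityElement.Escape(rawText) + "</Name1>";
    }
}
```

    For example, BuildName1Element("test & environment") yields <Name1>test &amp; environment</Name1>. Passing a whole document through the method would also escape its tags.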


    Randy Aldrich Paulo

    MCTS(BizTalk 2010/2006,WCF NET4.0), MCPD | My Blog


    BizTalk Message Archiving - SQL and File
    Automating/Silent Installation of BizTalk Deployment Framework using Powershell >
    Sending IDOCs using SSIS

    Tuesday, April 17, 2012 7:06 AM
  • Pal,

    Do you have an implementation of the UnescapeXml(buffer) method?


    David Downing

    Tuesday, April 17, 2012 1:33 PM
  • The SecurityElement.Escape(text) escapes an entire string and makes it valid for XML Text.

    <?xml version="1.0"?>
    <test myAttrib="attribute & data">
        data for test & invalid &test; character
        <child1>data for child1 < invalid data ></child1>
        <!-- Comment & < invalid data > ' and " -->
    </test>

    becomes:

    &lt;?xml version=&quot;1.0&quot;?&gt;
    &lt;test myAttrib=&quot;attribute &amp; data&quot;&gt;
        data for test &amp; invalid &amp;test; character
        &lt;child1&gt;data for child1 &lt; invalid data &gt;&lt;/child1&gt;
        &lt;!-- Comment &amp; &lt; invalid data &gt; &apos; and &quot; --&gt;
    &lt;/test&gt;

    Ravindra only needs to replace unescaped text nodes.


    David Downing


    Tuesday, April 17, 2012 2:05 PM
  • Pal,

    Can you include the values for SearchString, ReplaceString and STRING_SEPARATOR?


    David Downing

    Tuesday, April 17, 2012 3:18 PM
  • SearchString, ReplaceString and STRING_SEPARATOR are values you could expose as pipeline component properties.

    SearchString contains the characters you want to replace, separated by STRING_SEPARATOR, which is the character used to delimit the list. I often use the pipe | as STRING_SEPARATOR because it is rarely a character you would want to replace.

    ReplaceString should contain an equally long list of values, separated by STRING_SEPARATOR, that replace the corresponding entries from SearchString. As commented in the code, you can leave this empty if all you want to do is remove the characters listed in SearchString.
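
    As an illustration of how the two lists pair up, here is a stand-alone sketch of the same split-and-replace logic outside BizTalk. The class name PairedReplacer and the sample values are assumptions chosen for the example, not the poster's configuration.

```csharp
using System;

public static class PairedReplacer
{
    // Replaces the i-th entry of searchList with the i-th entry of replaceList.
    // Entries without a matching replacement are removed from the input.
    public static string Apply(string input, string searchList, string replaceList, char separator)
    {
        char[] separators = new[] { separator };
        string[] search = searchList.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] replace = replaceList.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < search.Length; i++)
        {
            string newValue = i < replace.Length ? replace[i] : string.Empty;
            input = input.Replace(search[i], newValue);
        }

        return input;
    }
}
```

    With SearchString "&|#", ReplaceString " and " and separator '|', the input "R&D dept #1" becomes "R and D dept 1": the & is replaced and the # is removed. Note that because empty entries are dropped when splitting, an empty value in the middle of ReplaceString shifts the later replacements by one position, so values to be removed are best placed at the end of SearchString.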

    Wednesday, April 18, 2012 6:03 AM