
General Discussion: Parsing poorly formed XML

  • Wednesday, November 04, 2009 4:59 PM, MichaelJHuman
     
    Hello,

    I am trying to parse a file with poorly formed XML.  I have no control over this file, and I have to read a number of files with the same issue.  There's a closing </font> tag with no matching start tag.

    This file was generated by an online game and has an XLS extension.  It opens in both Excel and Internet Explorer.

    If Excel can read it, it seems reasonable that there's a way to use the .NET API to read it, but Excel could have its own parser.

    I would say it looks more like HTML than XML, but I could not find an HTML parser in .NET.



All Replies

  • Wednesday, November 04, 2009 5:28 PM, ScottyDoesKnow
     
    My suggestion would be to read it line by line and insert a starting <font> tag or remove the </font> tag, then use XML parsing.
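
    The clean-up step suggested here can be sketched roughly as follows. This is an illustration in Python rather than .NET, and it rests on assumptions: the sample markup is hypothetical (the real file was never posted), and the file is assumed to contain no legitimate opening <font> tags, since every </font> is removed.

```python
import re
import xml.etree.ElementTree as ET

# Matches a closing </font> tag, tolerating case and stray whitespace.
STRAY_FONT_CLOSE = re.compile(r"</\s*font\s*>", re.IGNORECASE)


def parse_without_stray_font(text):
    """Remove every </font> closing tag, then parse the result as XML.

    Assumption: the input has no matching <font> opening tags, so
    dropping all the closers leaves well-formed markup.
    """
    cleaned = STRAY_FONT_CLOSE.sub("", text)
    return ET.fromstring(cleaned)


# Hypothetical sample; the real file's contents are unknown.
sample = "<table><tr><td>Alice</td><td>42</font></td></tr></table>"
root = parse_without_stray_font(sample)
cells = [td.text for td in root.iter("td")]
```

    Doing the replacement on the text in memory avoids writing a temp file. If the files did contain real <font>...</font> pairs, a blanket removal like this would break them, and the opening tags would have to be stripped as well.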
  • Wednesday, November 04, 2009 11:45 PM, MichaelJHuman
     
    Unfortunately, that was the only idea I could come up with.  I ended up re-writing to a temp file and removing some of the bad tags.  It was a pain.

    I am still curious as to how Excel and IE could both read the doc.  I wonder what parser they use.

  • Thursday, November 05, 2009 12:02 AM, Yort
     

    Hi,

    There are at least two MS XML parsers: a COM one and the .NET one. It wouldn't surprise me if there were others, or modified versions of those, embedded in their own apps. I would expect the COM one to fail on badly formed XML as well, however, so it won't solve your problem.

  • Thursday, November 05, 2009 6:50 PM, Yort
     
    Hi,

    It occurred to me this morning that this might be possible using an XmlReader of some kind (XmlTextReader?) instead of an XmlDocument or XPathDocument.

    Since the readers are designed to process XML from streams, a bit at a time, they won't find the error until you've parsed at least some of the document. It might also be possible to ignore the error returned by the reader and continue, or it may be that, because of the way the reader works, it doesn't even notice the error unless you specifically ask it to validate the XML. I haven't used readers much, but since IE and Excel are both designed to deal with large XML files, it seems like they might well use one, and that might be why they cope with the badly formed stuff.
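
    The streaming idea can be tried out with a small experiment. The sketch below uses Python's expat-based pull parser, not .NET's XmlReader, so it only illustrates the general behaviour of a streaming parser; the sample markup and the helper name read_until_error are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_until_error(text):
    """Return (cell texts seen before any error, the error or None)."""
    parser = ET.XMLPullParser(events=("end",))
    parser.feed(text)
    seen = []
    try:
        # Events queued before the malformed tag are delivered first;
        # the parse error is raised only when the reader reaches it.
        for _event, elem in parser.read_events():
            if elem.tag == "td":
                seen.append(elem.text)
    except ET.ParseError as exc:
        return seen, exc
    return seen, None


# Hypothetical sample with a stray </font> in the second row.
sample = "<table><tr><td>1</td></tr><tr><td>2</td></font></tr></table>"
seen, error = read_until_error(sample)
```

    In this sketch the content ahead of the stray tag is still readable, which matches the first half of the idea. The error itself is a well-formedness error rather than a validation error, though, and this particular parser does not continue past it, so anything after the bad tag would still need the clean-up approach.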
  • Wednesday, November 11, 2009 2:13 AM, Harry Zhu (MSFT, Moderator)
     
    Hi,

    Could you please post the content of the XML file and the code you are working with?

    Harry
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.
    Welcome to the All-In-One Code Framework! If you have any feedback, please tell us.
  • Friday, November 13, 2009 4:30 AM, Harry Zhu (MSFT, Moderator)
     

    We are changing the issue type to “General Discussion” because you have not followed up with the necessary information. If you have more time to look at the issue and provide more information, please feel free to change the issue type back to “Question” by opening the Options list at the top of the post window and changing the type. If the issue is resolved, we would appreciate it if you could share the solution so that the answer can be found and used by other community members with similar questions.

