Parsing Word file with strange hyperlinks

  • Question

    Hi,

     

    I'm parsing a Word (docx) file from .NET with the System.IO.Packaging framework.

     

    The file contains hyperlinks, and one of them looks strange: http://www.microsoft.com()/ (note the parentheses!)

     

    If I call the GetRelationships() method on the document part, or even the GetRelationship() method with a specific ID (one that is not the ID of the strange hyperlink), I get the exception below.

     

    This exception stops me from processing any other relationships, which is quite inconvenient...

     

    Can anyone help?

     

    Thx, Gaspar

     

     

    System.UriFormatException was unhandled
      Message="Invalid URI: The hostname could not be parsed."
      Source="System"
      StackTrace:
           at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
           at System.Uri..ctor(String uriString, UriKind uriKind)
           at MS.Internal.IO.Packaging.InternalRelationshipCollection.ProcessRelationshipAttributes(XmlCompatibilityReader reader)
           at MS.Internal.IO.Packaging.InternalRelationshipCollection.ParseRelationshipPart(PackagePart part)
           at MS.Internal.IO.Packaging.InternalRelationshipCollection..ctor(Package package, PackagePart part)
           at System.IO.Packaging.PackagePart.GetRelationships()

    ...

     

    Friday, September 28, 2007 2:48 PM

Answers

  • I haven't tested it, but I understand now.

    Not pretty.

    Have you tried loading document.xml.rels directly into an XmlDocument?

    • Marked as answer by Gaspar Nagy Tuesday, December 8, 2009 8:15 AM
    Tuesday, October 2, 2007 8:04 PM

All replies

  • 1) strURL = strURL.Replace("()", "") will handle the issue.

    2) Are there any more like this?

    A test I ran produced this: <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://www.microsoft.com" TargetMode="External" />

    GOOD.

    More info?

     

    Saturday, September 29, 2007 2:52 PM
  • Yes, I am aware that the "()" causes the problem, but I'm writing a program that should process Word files automatically. Right now, if someone uploads a file with such a link, my program can no longer access any of the references in that Word file.

     

    I cannot even tell the users which link to fix, because in the MS Word application this link does not seem to be invalid (it is broken, though), and you can enter such links without problems.

     

    What I would expect is that the framework handles this problem internally, reports an error only for the invalid link, and does not block access to the other references.

     

    Sunday, September 30, 2007 11:26 AM
  • Just confirm whether I understand you: your problem is that Word doesn't validate the hyperlink, so today it is "()", tomorrow it is something else, and your process will crash.

    Would validating the URL help you?

    Then, if the hyperlink is not valid, you can catch the problem and fix it.

    The .NET Framework doesn't have an automatic way to do this, but it can be done, and the Framework opens a door:

    Regular expressions are one way to do this. I searched and found many expressions you can use, like this one from http://www.wwwcoder.com/main/parentid/526/site/5886/68/default.aspx

     

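    ' Returns True only if the whole string looks like an http, https, or ftp URL.
    ' The ^ and $ anchors stop a valid-looking prefix from passing on its own.
    Function IsValidUrl(ByVal url As String) As Boolean
        Return System.Text.RegularExpressions.Regex.IsMatch(url, _
        "^(http|ftp|https)://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?$")
    End Function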
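
One caveat with this kind of check: a pattern only rejects the URL from the question if it has to match the whole string, because a plain search can succeed on just the well-formed prefix. The sketch below is in Python rather than VB, uses a pattern in the same style as the one above, and the function names are made up for illustration.

```python
import re

# Illustrative pattern in the same style as the one in the reply. The hyphen
# is placed last in each character class so it is unambiguously a literal.
URL_PATTERN = re.compile(r"(http|ftp|https)://([\w-]+\.)+[\w-]+(/[\w ./?%&=-]*)?")


def is_valid_url(url):
    """Return True only if the entire string matches the pattern."""
    return URL_PATTERN.fullmatch(url) is not None


def matches_somewhere(url):
    """Return True if any substring matches (an unanchored check)."""
    return URL_PATTERN.search(url) is not None
```

With the link from the question, `matches_somewhere("http://www.microsoft.com()/")` is true because the text up to `com` matches, while `is_valid_url` rejects it.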

     

    Is this useful for you?

    Monday, October 1, 2007 1:03 PM
  • Not exactly. My problem is that the System.IO.Packaging framework does not allow access to the relationships (_rels) part of the Word document if there is an invalid URL in the file.

     

    I can check the validity of a URL, but the problem is that I cannot read the URL out of the Word file from code at all (if there is an invalid hyperlink anywhere in the file)! :-(

     

    Tuesday, October 2, 2007 11:30 AM
  • I haven't tested it, but I understand now.

    Not pretty.

    Have you tried loading document.xml.rels directly into an XmlDocument?

    • Marked as answer by Gaspar Nagy Tuesday, December 8, 2009 8:15 AM
    Tuesday, October 2, 2007 8:04 PM
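
The accepted suggestion, reading document.xml.rels as plain XML so that no Target is ever parsed as a URI, might look roughly like the sketch below. It is written in Python rather than .NET purely to illustrate the idea. The function name is hypothetical, and the part path assumes the main document part is word/document.xml, which is where Word normally writes it.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the <Relationships>/<Relationship> elements in .rels parts.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
)
# Assumption: the main document part is word/document.xml.
RELS_PART = "word/_rels/document.xml.rels"


def read_hyperlink_targets(docx):
    """Return (Id, Target) pairs for hyperlink relationships as raw strings.

    `docx` is a path or a binary file-like object. Targets are not validated
    here, so a malformed URL cannot prevent the other entries from being read.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read(RELS_PART))
    return [
        (rel.get("Id"), rel.get("Target"))
        for rel in root.iter("{%s}Relationship" % RELS_NS)
        if rel.get("Type") == HYPERLINK_TYPE
    ]
```

Each returned Target can then be validated on its own, so only the malformed link is reported and the rest of the relationships stay accessible.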