mshtml.HTMLDocument and System.Windows.Forms.HtmlDocument

  • Question

  • Is there any way to convert a loaded mshtml.HTMLDocument to a Forms.HtmlDocument?

    Everything I've searched for seems to go from Forms.HtmlDocument to mshtml.HTMLDocument; I want to go the other way. I simply want to load the data via the mshtml interface (because it doesn't require the WebBrowser component to load) but then use Forms.HtmlDocument to parse the document.

    If it isn't possible that's okay.
    Jaeden "Sifo Dyas" al'Raec Ruiner
    "Never Trust a computer. Your brain is smarter than any micro-chip."
    PS - Don't mark answers on other people's questions. There are such things as Vacations and Holidays which may reduce timely activity, and until the person asking the question can test your answer, it is not correct just because you think it is. Marking it correct for them often stops other people from even reading the question and possibly providing the real "correct" answer.
    Wednesday, March 3, 2010 10:14 AM

Answers

  • There is no easy way of doing this as far as I know. The other way works because Forms.HtmlDocument has the DomDocument property.

    Forms.HtmlDocument also has no public constructor, so you can't new up an instance yourself either.

    If you were able to obtain a Forms.HtmlDocument instance, you could try reflection along these lines (not tested):

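    // Needs: using System.Reflection;
    mshtml.HTMLDocument mshtmlDoc = ...;
    HtmlDocument formDoc = ...;
    Type formHtmlDocType = formDoc.GetType();
    // htmlDocument2 looks like a private field rather than a property (an assumption from the
    // reference source), and non-public instance lookups need both flags or they return null.
    formHtmlDocType.GetField("htmlDocument2", BindingFlags.NonPublic | BindingFlags.Instance).SetValue(formDoc, mshtmlDoc);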
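
    The binding-flags part of that lookup can be tried in isolation. Below is a minimal sketch that does not touch Forms.HtmlDocument at all: Wrapper, inner and TrySetInner are hypothetical names made up for illustration. It shows that a lookup with BindingFlags.NonPublic alone finds nothing, while NonPublic combined with Instance finds a private instance field. The same rule applies to GetProperty.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-in for a class that keeps its state in a private field.
public sealed class Wrapper
{
    private object inner;
    public object Inner { get { return inner; } }
}

public static class ReflectionDemo
{
    // Tries to set the private "inner" field on a Wrapper.
    // Returns false when the field cannot be found with the given flags.
    public static bool TrySetInner(Wrapper target, object value, BindingFlags flags)
    {
        FieldInfo field = typeof(Wrapper).GetField("inner", flags);
        if (field == null)
        {
            return false;
        }
        field.SetValue(target, value);
        return true;
    }

    public static void Main()
    {
        var w = new Wrapper();

        // NonPublic alone: neither Instance nor Static is given, so nothing is found.
        Console.WriteLine(TrySetInner(w, "first", BindingFlags.NonPublic));

        // NonPublic | Instance: the private instance field is found and set.
        Console.WriteLine(TrySetInner(w, "second", BindingFlags.NonPublic | BindingFlags.Instance));

        Console.WriteLine(w.Inner);
    }
}
```

    Whether the real Forms.HtmlDocument accepts an mshtml object assigned this way is a separate question that this sketch does not answer.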

    Or you could check out something like AutoMapper to map the objects from one to the other.

    That said, why do you need to convert between the two formats?

    Another good parser is Html Agility Pack, which I personally prefer over the two DOM objects for HTML parsing.

    Regards,
    Mikael Svenson

    Search Enthusiast - MOSS MCTS
    http://techmikael.blogspot.com/ - http://www.comperio.no/
    Wednesday, March 3, 2010 8:06 PM

All replies

  • As I suspected.

    I've done a full assessment of Forms.HtmlDocument via Reflector, and there are quite a few internal classes (again, see my previous gripes about Microsoft hiding useful classes and controls). Basically there is an abstract HtmlShim, from which HtmlElementShim is derived, and then the HtmlShimManager, all three of which are as internal as the HtmlDocument constructor. The whole class seems to be a wrapper for the "native" mshtml.IHTMLDocument2 COM interface, mapping data into the 'shim' classes, which are managed by the shim manager. When called upon, the same calls one would make through the mshtml interfaces are made behind the scenes through "unsafe native calls" and the like.

    The only real reason I wanted a converter is that .NET's handling of COM leaves a lot to be desired. One can create an IHTMLDocument via the HTMLDocumentClass and write to it via the IHTMLDocument2 interface, but when accessing document.all, document.getElementsByTagName(), etc., they all return System.__ComObject, which derives from Object but is not very helpful when developing. It took me a while to work through it, but eventually I simply used the interfaces directly. It would be easier to use these things if Microsoft's help documentation said something like: "the all property retrieves an IDispatch interface that can be cast to an IHTMLElementCollection." None of the current documentation tells us what type is actually returned; the help docs only say "returns an object". Well, no duh. We need the "type" of the document.
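
    For anyone hitting the same wall, here is a minimal sketch of "using the interfaces directly". It assumes a project reference to the Microsoft.mshtml interop assembly, a Windows machine, and an STA thread; MshtmlDemo and ChildTexts are hypothetical names. It is untested here, and whether the document is fully parsed right after close() may depend on the environment.

```csharp
using System;
using System.Collections.Generic;
using mshtml; // assumes a reference to the Microsoft.mshtml interop assembly

public static class MshtmlDemo
{
    // Loads an HTML string without a WebBrowser control and returns
    // the inner text of each direct child of <body>.
    public static List<string> ChildTexts(string html)
    {
        // "new HTMLDocument()" goes through the coclass interface, which also
        // compiles when interop types are embedded (HTMLDocumentClass does not).
        IHTMLDocument2 doc2 = (IHTMLDocument2)new HTMLDocument();
        doc2.write(html);
        doc2.close();

        // IHTMLElement.children is declared as object (an IDispatch at the COM level),
        // so it has to be cast to the collection interface by hand.
        IHTMLElementCollection children = (IHTMLElementCollection)doc2.body.children;

        var texts = new List<string>();
        // The explicit loop variable type casts each item to IHTMLElement.
        foreach (IHTMLElement child in children)
        {
            texts.Add(child.innerText);
        }
        return texts;
    }

    [STAThread]
    public static void Main()
    {
        string html = "<html><body><p>one</p><p>two</p></body></html>";
        foreach (string text in ChildTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

    The runtime type of every one of these objects is still System.__ComObject; the cast only selects which COM interface the calls go through.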

    After much pulling of hair and gnashing of teeth I figured it out.
    Thanks
    Jaeden "Sifo Dyas" al'Raec Ruiner
    Thursday, March 4, 2010 5:38 AM
  • And that's why I use Html Agility Pack instead :) It's clean and you have the code. Of course it won't work if you use the DOM inside a WebBrowser control to modify some layout for a user.

    -m

    Thursday, March 4, 2010 7:24 AM
  • Hi, I'm facing the same problem, but I'm quite new to HtmlShim and mapping data.

    Would you please show us the code?

    Thanks

    Youngfe Lou

    Monday, June 6, 2011 8:33 PM
  • [Yes I know this is an old thread but my post is intended for anyone in the future that is looking for answers.]

    I assume that what you are really asking is how to load HTML into a WebBrowser. I assume there is no reason you can't use alternatives for that requirement. However, note that Forms.HtmlDocument has the DomDocument property, which is extremely useful. (See my Introduction to Web Site Scraping.) Also, the HtmlDocument.Body property is an HtmlElement that has a DomElement property, which is useful if we only want to work with the body.

    One problem that can be a barrier is that the HTML document does not exist yet when the form loads, so you cannot access the DomDocument in the form's Load event. Instead, double-click the WebBrowser control to create a DocumentCompleted event handler; DomDocument can be used in there.
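
    A minimal sketch of that handler, wired up in code rather than through the designer. BrowserForm and the sample HTML are made up for illustration, and the null checks are a precaution rather than documented behavior.

```csharp
using System;
using System.Windows.Forms;

public sealed class BrowserForm : Form
{
    private readonly WebBrowser browser = new WebBrowser { Dock = DockStyle.Fill };

    public BrowserForm()
    {
        Controls.Add(browser);

        // Same effect as double-clicking the control in the designer.
        browser.DocumentCompleted += Browser_DocumentCompleted;

        // Hypothetical sample content; Navigate(url) would work the same way.
        browser.DocumentText = "<html><body><p>Hello</p></body></html>";
    }

    private void Browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        HtmlDocument doc = browser.Document; // managed wrapper
        if (doc == null || doc.Body == null)
        {
            return;
        }

        // Underlying COM objects. With a reference to Microsoft.mshtml these can be
        // cast to mshtml.IHTMLDocument2 and mshtml.IHTMLElement respectively.
        object dom = doc.DomDocument;
        object bodyDom = doc.Body.DomElement;

        // Show something visible so the handler can be seen to have run.
        Text = string.Format("{0} / DOM available: {1}", doc.Body.InnerText, dom != null && bodyDom != null);
    }

    [STAThread]
    public static void Main()
    {
        Application.Run(new BrowserForm());
    }
}
```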

    You say you want to use "forms.htmldocument to parse the document". If you don't need the UI, there are easier ways (for you and the computer) to do it. Those who do need the UI and are loading HTML into a WebBrowser in order to edit it might encounter a problem; I forget what it is, and the solution is difficult to find.

    Sam Hobbs
    SimpleSamples.Info
    Thursday, October 13, 2016 8:11 PM