Using MSHTML to use / crawl a site RSS feed

  • Question

  • User-1584200859 posted

    Hello...

    I need help with a problem. I'm using MSHTML on my website to access a page on a site that requires login. When I request a page on that site, I'm redirected to the login page. What I need to do is load this login page, fill in the credentials, and submit it to reach the page I need. I found the following piece of code on the internet. It works fine for putting text in the text box, but it does not fire the click event of the button, and I don't know why. I even tried IHTMLElementClick(), but in vain :(

    Can anyone please help with this?

    P.S.: I'm not using the WebBrowser control because this is a web application (a Web Site project, to be precise) and this has to be invisible to the user. The user will enter a URL and press Go.

        public enum HRESULT : uint
        {
            S_OK = 0,
            S_FALSE = 1,
            E_NOTIMPL = 0x80004001,
            E_INVALIDARG = 0x80070057,
            E_NOINTERFACE = 0x80004002,
            E_FAIL = 0x80004005,
            E_UNEXPECTED = 0x8000ffff
        }
    
        [ComVisible(true), ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)]
        public interface IPersistStreamInit : IPersist
        {
            new void GetClassID(ref Guid pClassID);
            [PreserveSig()]
            int IsDirty();
            [PreserveSig()]
            HRESULT Load(UCOMIStream pstm);
            [PreserveSig()]
            HRESULT Save(UCOMIStream pstm, [MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
            [PreserveSig()]
            HRESULT GetSizeMax([InAttribute(), Out(), MarshalAs(UnmanagedType.U8)] ref long pcbSize);
            [PreserveSig()]
            HRESULT InitNew();
        }
    
        [ComVisible(true), ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)]
        public interface IPersist
        {
            void GetClassID(ref Guid pClassID);
        }
    
    
        protected void Page_Load(object sender, EventArgs e)
        {
            string url = "http://images.google.com/";
    
            mshtml.HTMLDocument objMSHTML = new mshtml.HTMLDocument();
            mshtml.IHTMLDocument2 objMSHTML2;
            mshtml.IHTMLDocument3 objMSHTML3;
            int x = 10;
            //a dummy variable
    
            IPersistStreamInit objIPS;
            //here is the whole trick
            objIPS = (IPersistStreamInit)objMSHTML;
            objIPS.InitNew();
            //you have to do it, if not you will always have readyState as "loading"
            objMSHTML2 = objMSHTML.createDocumentFromUrl(url, null);
            while (!(objMSHTML2.readyState == "complete"))
            {
                x = x + 1;
            }
            objMSHTML3 = (mshtml.IHTMLDocument3)objMSHTML2;
    
            IHTMLElementCollection d = objMSHTML3.getElementsByName("q");
            IHTMLElementCollection de = objMSHTML3.getElementsByName("btnG");
            HTMLInputElementClass cn = (HTMLInputElementClass)d.item("q", 0);
            cn.value = "biso";
            HTMLInputElementClass bt =  (HTMLInputElementClass)de.item("btnG", 0);
    
            bt.click();
            // this does not fire
           
            string s = objMSHTML3.documentElement.innerHTML;

            // If you check s in the HTML visualizer, you get the Google page with "biso" in the text box.
        }
    

     Much appreciated..

    Sunday, February 28, 2010 11:53 AM

All replies

  • User-1584200859 posted

    Anyone?

    Thursday, March 4, 2010 11:55 AM
  • User437720957 posted

    Do you really need to interact with the DOM? Can't you just send requests using WebClient or HttpWebRequest?

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Thursday, March 4, 2010 5:35 PM
  • User-1584200859 posted

    My application accepts a URL. To get to that specific page, you need to log on to the system, so if I use WebClient, it creates a new session and the URL gets redirected to the login page. I can't make the user log in again because this process will take place about every 3 minutes for 8 hours. If I could copy the session of an open browser, I think that would work too.

    Friday, March 5, 2010 6:45 AM
  • User437720957 posted

    As long as you use the CookieContainer and keep track of the used session cookie values, you don't need to copy anything from the browser.
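
    As a rough sketch of that idea (not code from this thread): the class below attaches one CookieContainer to every HttpWebRequest, so whatever session cookie the server sets on one response is sent back automatically on later requests. The class name SessionClient and its members are made up for illustration.

    ```csharp
    using System;
    using System.IO;
    using System.Net;

    // Hypothetical helper: one instance corresponds to one "browser session".
    public class SessionClient
    {
        // The shared cookie jar; it plays the role of the browser's cookie store.
        public CookieContainer Cookies { get; private set; }

        public SessionClient()
        {
            Cookies = new CookieContainer();
        }

        // Creates a request bound to the shared jar. No network traffic happens yet.
        public HttpWebRequest CreateRequest(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = Cookies;
            return request;
        }

        // Sends a GET and returns the response body. Cookies set by the server
        // are stored in the jar and replayed on the next request to that site.
        public string Get(string url)
        {
            HttpWebRequest request = CreateRequest(url);
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
    ```

    Every request made through the same SessionClient instance shares the jar, so once a login request has succeeded, later calls to Get carry the session cookie without anything being copied from a real browser.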

    Friday, March 5, 2010 3:09 PM
  • User-1584200859 posted

    I don't understand how a cookie container is supposed to help me. I read a bit about it, but I didn't see how it would be of help. Anyway, here's my question put a different way, because I'm really excited about this approach. Suppose I want to create an application that logs in to a website (Facebook, Hotmail, Gmail, whatever it is). Given that there is no API or web service I can use, the solution would be to request the login page, fill in my details, press the login button, and then get redirected to the inbox or home page or whatever. What is the way to do something like that? A bot, basically.

    Sunday, March 7, 2010 1:06 AM
  • User-967169866 posted

    It really depends on how serious your crawler is going to be. A lot of sites use JavaScript to fill in the data on the page, so using an MSHTML object might be necessary. If it's just to crawl data that's literally dumped in the initial HTML stream, using MSHTML may be overkill. I'd suggest using System.Net.WebRequest / System.Net.WebResponse (HttpWebRequest / HttpWebResponse) if possible. It's a lot leaner than MSHTML, but you'll have to parse the HTML tags and attributes yourself to pull out the data you want to keep.
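
    To give an idea of the "parse it yourself" part, here is a small sketch that downloads a page and pulls the href values out of its anchor tags. The LinkScraper name is hypothetical, and the regular expression is deliberately naive: it is an assumption that the markup is simple enough for it, and a real HTML parser would be more robust.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration only.
    public static class LinkScraper
    {
        // Naive pattern: an <a ...> tag with a quoted href attribute.
        private static readonly Regex Href = new Regex(
            @"<a\b[^>]*?\bhref\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase);

        // Returns the href values found in the given HTML, in document order.
        public static List<string> ExtractLinks(string html)
        {
            List<string> links = new List<string>();
            foreach (Match match in Href.Matches(html))
            {
                // Attribute values are HTML-encoded (&amp; and so on), so decode them.
                links.Add(WebUtility.HtmlDecode(match.Groups[1].Value));
            }
            return links;
        }

        // Downloads the raw HTML of a page. No scripts are executed.
        public static string Download(string url)
        {
            using (WebClient client = new WebClient())
            {
                return client.DownloadString(url);
            }
        }
    }
    ```

    Something like LinkScraper.ExtractLinks(LinkScraper.Download(url)) would then list the links on a page, but only those present in the initial HTML, not anything added later by script.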

    Sunday, March 7, 2010 3:44 AM
  • User-1584200859 posted

    Thank you for your response.

    Yes, I've used System.Net to view and parse pages before, and it was easy and handy as far as I remember. But if my crawler needs to log in or perform some action, like pressing a button, the System.Net namespace wouldn't be very useful then, would it? Or if it can do that, how?

    Sunday, March 7, 2010 6:59 AM
  • User-967169866 posted

    Logging in, pressing a button, and things like that always come down to one of two things: a plain request or a form post.

    I've handled login and password, session ID, and cookies pretty easily with WebRequest. Button clicks are a little more complicated because of the form values you need to manage. I'm not sure MSHTML can supply that sort of functionality, though. At least, it's not easy no matter how you do it.
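
    For what it's worth, a form post done by hand could look roughly like the sketch below. FormLogin, the field names, and the URLs in the usage note are all assumptions for illustration; the real names come from the login page's form markup, including any hidden fields it contains. WebUtility.UrlEncode needs .NET 4.5 or later; on older frameworks HttpUtility.UrlEncode from System.Web does the same job.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;
    using System.Text;

    // Hypothetical helper for illustration only.
    public static class FormLogin
    {
        // Encodes name/value pairs as application/x-www-form-urlencoded,
        // which is what a browser sends when a form's submit button is clicked.
        public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
        {
            StringBuilder body = new StringBuilder();
            foreach (KeyValuePair<string, string> field in fields)
            {
                if (body.Length > 0) body.Append('&');
                body.Append(WebUtility.UrlEncode(field.Key));
                body.Append('=');
                body.Append(WebUtility.UrlEncode(field.Value));
            }
            return body.ToString();
        }

        // Posts the form body to the form's action URL and returns the response HTML.
        // Cookies the server sets (such as the session cookie) end up in the jar.
        public static string PostForm(string actionUrl, string formBody, CookieContainer jar)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(actionUrl);
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = jar;

            byte[] payload = Encoding.UTF8.GetBytes(formBody);
            request.ContentLength = payload.Length;
            using (Stream stream = request.GetRequestStream())
            {
                stream.Write(payload, 0, payload.Length);
            }

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
    ```

    The "button click" is just one more name/value pair in the body (the button's name and caption). After PostForm succeeds, passing the same CookieContainer to later requests keeps the logged-in session.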

    Sunday, March 7, 2010 7:28 AM