Extracting text from HTML page using IHTMLDocument2

  • Friday, October 16, 2009 10:41 AM, Naveen HS

    Hello everyone,

    I am trying to extract some text from HTML pages, specifically from the TABLE tag.

    I can extract the links from a page through the href attribute of the A tags and display them in the dialog, but I am not able to process the TABLE tag and extract its text.

    Can anyone please tell me what changes I need to make to display the table contents in the dialog?

    void CTestDlg::OnBgo() 
    {
    	UpdateData();
    	CWaitCursor wait;
    	if(m_csFilename.IsEmpty()){
    		AfxMessageBox(_T("Please specify the file to parse"));
    		return;
    	}
    	CFile f;
    
    	if (f.Open(m_csFilename, CFile::modeRead|CFile::shareDenyNone)) {
    		m_wndLinksList.ResetContent();
    		CString csWholeFile;
    		f.Read(csWholeFile.GetBuffer(f.GetLength()), f.GetLength());
    		csWholeFile.ReleaseBuffer(f.GetLength());
    		f.Close();
    
    		MSHTML::IHTMLDocument2Ptr pDoc;
    		MSHTML::IHTMLDocument3Ptr pDoc3;
    		MSHTML::IHTMLElementCollectionPtr pCollection;
    		MSHTML::IHTMLElementPtr pElement;
    
    		MSHTML::IHTMLDocument2 *pDoc1 = NULL;
    		MSHTML::IHTMLElementCollection *pColl = NULL;
    
    
    		HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, 
    			IID_IHTMLDocument2, (void**)&pDoc);
    		
    		SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    		VARIANT *param;
    		bstr_t bsData = (LPCTSTR)csWholeFile;
    		hr = SafeArrayAccessData(psa, (LPVOID*)&param);
    		param->vt = VT_BSTR;
    		param->bstrVal = (BSTR)bsData;
    		
    		hr = pDoc->write(psa);
    		hr = pDoc->close();
    		
    		SafeArrayDestroy(psa);
    
    		pDoc3 = pDoc;
    		
    		//display HREF parameter of every link (A tag) in ListBox
    		pCollection = pDoc3->getElementsByTagName(L"A");
    		for(long i=0; i<pCollection->length; i++){
    			pElement = pCollection->item(i, (long)0);
    			if(pElement != NULL){
    				//second parameter says that you want to get text inside attribute as is
    				m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute("href", 2)));
    			}
    		}
    		 
    		pColl  = pDoc1->get_all(L"table");
    
    
    		// Loop through the Element Collection.
    		int y = pColl->length;
    		for(int x = 0; x < y ; x++)
    		{
    			//m_wndLinksList.AddString("Test Table");
    		}
    
    
    
    	}
    }
    I am getting the following compiler error:
    error C2664: 'MSHTML::IHTMLDocument2::get_all' : cannot convert parameter 1 from 'const wchar_t [6]' to 'MSHTML::IHTMLElementCollection **'

     

    • Edited by Naveen HS, Friday, October 16, 2009 11:24 AM

All Replies

  • Friday, October 16, 2009 11:07 AM, Codu

    What does pDoc1 point to?

    get_all does not take a tag name; it returns an IHTMLElementCollection interface through an out-parameter.

    Try this (untested):

    pDoc1->get_all(&pColl); // also check the HRESULT of the call

    "herf" should be "href"
    Dev s r'us
  • Friday, October 16, 2009 11:50 AM, Naveen HS

    Hello Sir,

    Thank you very much for the response.

    I am getting a run-time error on this line:

    pDoc1->get_all(&pColl);

    Unhandled exception at 0x00417669 in test.exe: 0xC0000005: Access violation reading location 0x00000000.

    Can you please tell me how to parse the <table> tag using this?



  • Friday, October 16, 2009 12:01 PM, Codu

    What does pDoc1 point to? Its value is NULL: "MSHTML::IHTMLDocument2 *pDoc1 = NULL".


    Dev s r'us
  • Friday, October 16, 2009 12:23 PM, Naveen HS

    Sir, I just wanted to use the IHTMLDocument2::get_all method to get an IHTMLElementCollection interface so that I can access the table tag.
  • Friday, October 16, 2009 12:32 PM, Codu

    pDoc1 must be initialized. Try:

    pDoc->QueryInterface( IID_IHTMLDocument2, reinterpret_cast<void**>(&pDoc1));

    pDoc1->get_ ....

    ....

    pDoc1->Release();

    To get a collection of only the "table" elements, query for the IHTMLDocument3 interface instead: it has the getElementsByTagName method.

    Hope it helps.


    Dev s r'us
  • Monday, October 19, 2009 10:51 AM, Naveen HS

    Hello Sir,

    Thanks a lot for the reply.

    I made the changes you suggested, but I am still not able to extract the data from the table. Can you please tell me what I need to change?


    MSHTML::IHTMLDocument2Ptr pDoc;
    		MSHTML::IHTMLDocument3Ptr pDoc3;
    		MSHTML::IHTMLElementCollectionPtr pCollection;
    		MSHTML::IHTMLElementPtr pElement;
    
    
    
    		HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, 
    			IID_IHTMLDocument2, (void**)&pDoc);
    
    
    pDoc->get_all(&pCollection);
    
    		pCollection = pDoc3->getElementsByTagName("table");
    
    		
    		for(long i=0; i<pCollection->length; i++){
    			pElement = pCollection->item(i, (long)0);
    			if(pElement != NULL){
    				m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute("table"),10));
    			}
    		}
    error C2660: 'MSHTML::IHTMLElement::getAttribute' : function does not take 1 arguments
  • Monday, October 19, 2009 11:56 AM, Codu

    Hi again,

    Yes, the iteration looks pretty good now.
    First, IHTMLElement::getAttribute needs more than one argument (the raw method takes three), and in your line the 10 sits outside the getAttribute parentheses. Also, I'm afraid it is rather rare for an element to have a "table" attribute; are you looking for "href"?
    Maybe you should use get_tagName instead to find the "a" tags that are inside this table, and then read their "href" attribute.

    Regards,


    Dev s r'us
  • Monday, October 19, 2009 12:17 PM, Naveen HS

    Hello Sir,

    Thank you once again for the response.

    I am very new to HTML parsing and have never used IHTMLElement. Can you please help me solve this problem of extracting the text from the table?

    href is not required; I only have to process tables and extract their data.

    Can you please give me some ideas to complete this assignment?
  • Monday, October 19, 2009 1:06 PM, Codu (marked as answer)

    IHTMLElement is the interface that every element of an HTML page implements.
    If you are trying to get all the text in an HTML page, the simplest way is to walk the element collections with a recursive algorithm.

    You should read the innerText of the simple text tags (e.g. "div", "a", ...).

    Regards,

    PS: Sample code related to your issue under this link.


    Dev s r'us
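
The recursive approach described above can be sketched without COM. The example below is a minimal, portable illustration and not MSHTML code: `Node`, `collectText`, `allText`, and `sampleTree` are hypothetical stand-ins for an element, its children, and its text. Mapping it onto IHTMLElement and its child collection is left as an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an HTML element: a tag name, the text that
// sits directly inside the element, and the child elements in order.
struct Node {
    std::string tag;
    std::string text;
    std::vector<Node> children;
};

// Depth-first walk: append this element's own text, then recurse into
// every child. Non-empty pieces are separated by a single space.
void collectText(const Node& node, std::string& out) {
    if (!node.text.empty()) {
        if (!out.empty()) out += ' ';
        out += node.text;
    }
    for (const Node& child : node.children) {
        collectText(child, out);
    }
}

// Convenience wrapper that returns the collected text.
std::string allText(const Node& root) {
    std::string out;
    collectText(root, out);
    return out;
}

// Sample tree: <body><div>Hello<a>link</a></div><div>World</div></body>
Node sampleTree() {
    Node a{"A", "link", {}};
    Node div1{"DIV", "Hello", {a}};
    Node div2{"DIV", "World", {}};
    return Node{"BODY", "", {div1, div2}};
}
```

Calling `allText(sampleTree())` visits the elements in document order, so nested text is picked up wherever it sits in the tree.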
  • Tuesday, October 20, 2009 12:12 PM, Naveen HS

    Hello Sir,

    I am not able to use this method:

    HRESULT IHTMLElement::get_innerText(BSTR *p);

    Can you please help me with this? I did not find any useful information about it.
  • Tuesday, October 20, 2009 1:07 PM, Codu

    Hi,

    You need to have a valid IHTMLElement object:
    IHTMLElement *pElement;

    Assuming that pElement is the IHTMLElement obtained from the collection, you can do it like this:

    BSTR itxt;

    pElement->get_innerText(&itxt);

    ...
    Dev s r'us
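
The call above follows the usual COM convention: the return value is only a status code (an HRESULT, where S_OK is 0), and the text arrives through the out-parameter. The portable sketch below models that convention with hypothetical stand-ins (`Status`, `kOk`, `kFail`, `FakeElement`), not the real MSHTML types. It shows why converting the return value to a string gives "0" instead of the element's text. With the real API, the BSTR received through the out-parameter must also be released with SysFreeString when you are done with it.

```cpp
#include <string>

// Hypothetical stand-ins, not the real COM types: Status plays the role
// of HRESULT and kOk the role of S_OK (both are 0 on success).
typedef long Status;
const Status kOk = 0;
const Status kFail = -1;

// Stand-in for an element whose getter follows the COM convention:
// the return value is only a status, the text goes to the out-parameter.
struct FakeElement {
    std::wstring text;
    Status get_innerText(std::wstring* out) const {
        if (out == nullptr) return kFail;
        *out = text;
        return kOk;
    }
};

// Correct use: check the status, then read the out-parameter.
std::wstring readText(const FakeElement& e) {
    std::wstring value;
    if (e.get_innerText(&value) != kOk) return L"";
    return value;
}

// The mistake: converting the return value itself to text yields the
// status code, not the element's text.
std::wstring readTextWrong(const FakeElement& e) {
    std::wstring value;
    return std::to_wstring(e.get_innerText(&value));
}
```

Only `readText` returns the stored text; `readTextWrong` returns the stringified status code.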
  • Wednesday, October 21, 2009 11:04 AM, Naveen HS

    Hello Sir,

    Thank you very much. I added the get_innerText call as shown below, and it now compiles without errors.



    pDoc->get_all(&pCollection);
    
            pCollection = pDoc3->getElementsByTagName("table");
    
            BSTR itxt;
    
            for(long i=0; i<pCollection->length; i++){
                pElement = pCollection->item(i, (long)0);
                if(pElement != NULL){
                  m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->get_innerText(&itxt)));
    }



    I created an HTML page with one table, as shown below:

    <html>
    <body>

    <table border="1">
    <tr>
    <th>Team Name</th>
    <th>Place</th>
    </tr>
    <tr>
    <td>BRC</td>
    <td>Bangalore</td>
    </tr>
    <tr>
    <td>DD</td>
    <td>Delhi</td>
    </tr>
    </table>


    </body>
    </html>


    When I run the program, the output is 0 and the content of the table is not displayed.
    Is there any other method that I have to call to retrieve the table data?



  • Wednesday, October 21, 2009 1:29 PM, Codu

    Hi again. pCollection is a collection of tables, and a table is only a container: the text lives in its cells. To get the data, get the collection of row elements (the tr tags) for each table and then, for each row, the collection of its th / td elements.
    Is it clearer now how it's done?
    Dev s r'us
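
The path from a table to its rows and then to the th or td cells of each row can be sketched portably. This is a minimal illustration under stated assumptions, not MSHTML code: `Elem`, `descendantsByTag`, `tableCells`, and `sampleTable` are hypothetical names. `descendantsByTag` stands in for a tag lookup scoped to one element rather than to the whole document.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an HTML element (not an MSHTML type).
struct Elem {
    std::string tag;
    std::string text;
    std::vector<Elem> children;
};

// Collects pointers to every descendant of `root` with the given tag,
// in document order. Plays the role of an element-scoped tag lookup.
void descendantsByTag(const Elem& root, const std::string& tag,
                      std::vector<const Elem*>& out) {
    for (const Elem& child : root.children) {
        if (child.tag == tag) out.push_back(&child);
        descendantsByTag(child, tag, out);
    }
}

// For one table: find its rows, then the header/data cells of each row.
// Returns one vector of cell texts per row.
std::vector<std::vector<std::string>> tableCells(const Elem& table) {
    std::vector<std::vector<std::string>> rows;
    std::vector<const Elem*> trs;
    descendantsByTag(table, "TR", trs);
    for (const Elem* tr : trs) {
        std::vector<std::string> cells;
        for (const Elem& cell : tr->children) {
            if (cell.tag == "TH" || cell.tag == "TD") cells.push_back(cell.text);
        }
        rows.push_back(cells);
    }
    return rows;
}

// Sample data mirroring the table posted earlier in this thread.
Elem sampleTable() {
    Elem header{"TR", "", {Elem{"TH", "Team Name", {}}, Elem{"TH", "Place", {}}}};
    Elem row1{"TR", "", {Elem{"TD", "BRC", {}}, Elem{"TD", "Bangalore", {}}}};
    Elem row2{"TR", "", {Elem{"TD", "DD", {}}, Elem{"TD", "Delhi", {}}}};
    return Elem{"TABLE", "", {header, row1, row2}};
}
```

The key design point is that the row lookup starts from the table element itself, so each table only yields its own rows.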
  • Thursday, October 22, 2009 9:17 AM, Naveen HS

    Hello Sir,

    I made the changes shown below.

    After getting the "table" elements by tag name, I call getElementsByTagName again inside the loop with "tr", so that each table is processed separately.

    It compiles without errors, but it gives a run-time error when I execute it.



    MSHTML::IHTMLDocument2Ptr pDoc;
    MSHTML::IHTMLDocument3Ptr pDoc3;
    MSHTML::IHTMLElementCollectionPtr pCollection;
    MSHTML::IHTMLElementPtr pElement;
    
    
    //Added Pari
    MSHTML::IHTMLElementCollectionPtr pCol;
    MSHTML::IHTMLElementPtr pEle;
    
    
    HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, 
    			IID_IHTMLDocument2, (void**)&pDoc);
    		
    	hr = pDoc->write(psa);
    	hr = pDoc->close();
    
    	SafeArrayDestroy(psa);
    	pDoc3 = pDoc;
    		
    
    	pDoc->get_all(&pCollection);
    	pCollection = pDoc3->getElementsByTagName("table");
    
    	BSTR itxt;
    
    	for(long i=0; i<pCollection->length; i++){
    		pElement = pCollection->item(i, (long)0);
    		if(pElement != NULL){
    
    		pCol = pDoc3->getElementsByTagName(L"tr");
    				
    		for(long j=0; j < pCol->length; j++){
    		pEle = pCol->item(j,(long)0);
    		if(pEle != NULL){
    						m_wndLinksList.AddString((LPCTSTR)bstr_t(pEle->getAttribute("tr", 10)));
    						}
    					}
    	
    			}
    		}	
    
  • Thursday, October 22, 2009 9:40 AM, Codu

    Debug it by stepping with F10. On which line do you encounter the problem?

    Maybe this one: m_wndLinksList.AddString((LPCTSTR)bstr_t(pEle->getAttribute("tr", 10))); //?

    Also, the path to the text is table -> tr -> th (or td).


    Dev s r'us
  • Thursday, October 22, 2009 9:57 AM, Naveen HS

    Hello Sir,

    I just debugged it, and the same line is causing the run-time error.

    What change do I have to make here, sir?

    m_wndLinksList.AddString((LPCTSTR)bstr_t(pEle->getAttribute("tr", 5)));
  • Thursday, October 22, 2009 2:28 PM, Codu

    Here is sample code that extracts text from a very simple HTML page. It should be enough to show you how it works:

    	IHTMLDocument2* pDoc;
    	BSTR tagName, tdtag, itext;
    	tdtag = SysAllocString( L"TD");
    	SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    	if (CoInitializeEx( 0L, COINIT_MULTITHREADED) == S_OK)
    	__try{
    		if (CoCreateInstance( CLSID_HTMLDocument, 0L, CLSCTX_INPROC, IID_IHTMLDocument2, (void**)&pDoc) == S_OK)
    		__try{
    			VARIANT *param;
    			SafeArrayAccessData(psa, (LPVOID*)&param);
    			param->vt = VT_BSTR;
    			param->bstrVal = SysAllocString(
    				L"<!doctype html><html><head><title>None</title></head>\
    					<body>\
    						<table>\
    							<tr>\
    								<th>Header</th>\
    								<td>Text1</td>\
    								<td>Text2</td>\
    							</tr>\
    						</table>\
    					</body>\
    				</html>");
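    			SafeArrayUnaccessData(psa);//unlock the array before passing it on
    			//fail if either call fails; with && close() was skipped whenever write() succeeded
    			if ((pDoc->write( psa) != S_OK)||(pDoc->close() != S_OK))
    				return 1;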
    
    			IDispatch *all, *disp;
    			IHTMLElement *body, *item, *td;
    			IHTMLElementCollection *alls, *tds;
    			IHTMLElement2 *tbl2;
    			long alen, tdlen;
    			pDoc->get_body( &body);
    			body->get_all( &all);
    			body->Release();
    			body = 0L;
    			all->QueryInterface( IID_IHTMLElementCollection, (void**)&alls);
    			all->Release();
    			all = 0L;
    			alls->get_length( &alen);
    			VARIANT dummy;
    			dummy.vt = VT_I4;
    			for( int ai = 0; ai < alen; ai++)
    			{
    				dummy.intVal = ai;
    				alls->item( dummy, dummy, (IDispatch**)&disp);
    				if (disp)
    				{
    					disp->QueryInterface( IID_IHTMLElement, (void**)&item);
    					if (item){
    						disp->Release();
    						disp = 0L;
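    						item->get_tagName( &tagName);
    						bool isTable = tagName && !lstrcmpW( tagName, L"TABLE");
    						SysFreeString( tagName);//get_tagName allocates the BSTR; the caller frees it
    						if (isTable){//if the element is a table at the root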
    							item->QueryInterface( IID_IHTMLElement2, (void**)&tbl2);
    							item->Release();
    							item = 0L;
    							if(tbl2){
    								tbl2->getElementsByTagName( tdtag, &tds);//get the td cells of this table
    								if (tds){
    									tds->get_length( &tdlen);
    									for (int tri = 0; tri < tdlen; tri++){
    										dummy.intVal = tri;
    										tds->item( dummy, dummy, &disp);
    										if( disp){
    											disp->QueryInterface( IID_IHTMLElement, (void**)&td);
    											if(td){
    												td->get_innerText( &itext);
    												
    												wprintf( L"%s\r\n", (LPWSTR)itext);
    												
    												SysFreeString( itext);
    												td->Release();
    												td = 0L;
    											}
    											disp->Release();
    											disp = 0L;
    										}
    									}
    									tds->Release();
    									tds = 0L;
    								}
    								tbl2->Release();
    								tbl2 = 0L;
    							}
    						}else{
    							item->Release();
    							item = 0L;
    						}
    					}else{
    						disp->Release();
    						disp = 0L;
    					}
    				}
    			}
    			if (alls){
    				alls->Release();
    				alls = 0L;
    			}
    		}__finally
    		{
    			pDoc->Release();
    			pDoc = 0L;
    		}
    	}__finally
    	{
    		SafeArrayDestroy( psa);
    		SysFreeString( tdtag);
    		CoUninitialize();
    	}
    	return 0;
    

    I used a console application to print the cell text (Text1, Text2, ...).


    Dev s r'us
  • Friday, October 23, 2009 8:55 AM, Naveen HS

    Wow, thanks a lot, sir. I will try to implement this.

    Once again, thanks a lot for helping me solve this.
  • Wednesday, November 04, 2009 4:49 AM, Naveen HS

    Thank you very much, sir. I implemented the same approach and it is working fine.