Extracting text from an HTML page using IHTMLDocument2
Hello Everyone,
I am trying to extract some text from HTML pages, specifically from the TABLE tag.
I am able to extract the links from the pages using the href attribute and display them in the dialog, but I am not able to process the TABLE tag and extract its text.
Can anyone please tell me what changes I have to make to display the table contents in the dialog?
void CTestDlg::OnBgo()
{
    UpdateData();
    CWaitCursor wait;
    if(m_csFilename.IsEmpty()){
        AfxMessageBox(_T("Please specify the file to parse"));
        return;
    }
    CFile f;
    if (f.Open(m_csFilename, CFile::modeRead|CFile::shareDenyNone)) {
        m_wndLinksList.ResetContent();
        CString csWholeFile;
        f.Read(csWholeFile.GetBuffer(f.GetLength()), f.GetLength());
        csWholeFile.ReleaseBuffer(f.GetLength());
        f.Close();

        MSHTML::IHTMLDocument2Ptr pDoc;
        MSHTML::IHTMLDocument3Ptr pDoc3;
        MSHTML::IHTMLElementCollectionPtr pCollection;
        MSHTML::IHTMLElementPtr pElement;
        MSHTML::IHTMLDocument2 *pDoc1 = NULL;
        MSHTML::IHTMLElementCollection *pColl = NULL;

        HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void**)&pDoc);

        SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
        VARIANT *param;
        bstr_t bsData = (LPCTSTR)csWholeFile;
        hr = SafeArrayAccessData(psa, (LPVOID*)&param);
        param->vt = VT_BSTR;
        param->bstrVal = (BSTR)bsData;
        hr = pDoc->write(psa);
        hr = pDoc->close();
        SafeArrayDestroy(psa);

        pDoc3 = pDoc;

        //display HREF parameter of every link (A tag) in ListBox
        pCollection = pDoc3->getElementsByTagName(L"A");
        for(long i=0; i<pCollection->length; i++){
            pElement = pCollection->item(i, (long)0);
            if(pElement != NULL){
                //second parameter says that you want to get text inside attribute as is
                m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute("href", 2)));
            }
        }

        pColl = pDoc1->get_all(L"table");
        // Loop through the Element Collection.
        int y = pColl->length;
        for(int x = 0; x < y ; x++)
        {
            //m_wndLinksList.AddString("Test Table");
        }
    }
}
Getting the below error:
error C2664: 'MSHTML::IHTMLDocument2::get_all' : cannot convert parameter 1 from 'const wchar_t [6]' to 'MSHTML::IHTMLElementCollection **'
- Edited by Naveen HS, Friday, October 16, 2009 11:24 AM
All Replies
- What is pDoc1?
get_all returns an IHTMLElementCollection interface through an output parameter;
try this (untested):
pDoc1->get_all(&pColl); // also check the HRESULT of the call
"herf" should be "href".
Dev s r'us
- Hello Sir,
Thank you very much for the response.
Sir, I am getting a run-time error
on this line:
pDoc1->get_all(&pColl);
Unhandled exception at 0x00417669 in test.exe: 0xC0000005: Access violation reading location 0x00000000.
Sir, can you please tell me how to parse the <table> using this?
- What does pDoc1 point to? Its value is NULL: "MSHTML::IHTMLDocument2 *pDoc1 = NULL".
Dev s r'us
- Sir, I just wanted to use the IHTMLDocument2::get_all method to get an IHTMLElementCollection interface to access the table tag.
- pDoc1 must be initialized before it is used.
Try:
pDoc->QueryInterface( IID_IHTMLDocument2, reinterpret_cast<void**>(&pDoc1));
pDoc1->get_ ....
....
pDoc1->Release();
As for getting only the "table" collection, you should query for the IHTMLDocument3 interface: it has the method getElementsByTagName.
Hope it helps.
Dev s r'us
- Hello Sir,
Thanks a lot for the reply.
I made the changes you mentioned, but I am still not able to extract the data from the table. Can you please tell me what change I have to make?
MSHTML::IHTMLDocument2Ptr pDoc;
MSHTML::IHTMLDocument3Ptr pDoc3;
MSHTML::IHTMLElementCollectionPtr pCollection;
MSHTML::IHTMLElementPtr pElement;

HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void**)&pDoc);

pDoc->get_all(&pCollection);
pCollection = pDoc3->getElementsByTagName("table");
for(long i=0; i<pCollection->length; i++){
    pElement = pCollection->item(i, (long)0);
    if(pElement != NULL){
        m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->getAttribute("table"),10));
    }
}
error C2660: 'MSHTML::IHTMLElement::getAttribute' : function does not take 1 arguments
- Hi again,
Yep, the iterator looks pretty good ... now.
First, the raw IHTMLElement::getAttribute takes three parameters (the attribute name, a flags value, and an output VARIANT); the wrapper you are calling expects the name and the flags. Also, it is rare for an element to have a "table" attribute. Are you looking for "href"?
Maybe you should use getElementsByTagName instead to get the collection of "a" tags inside this table, and then read the "href" attribute of each.
Regards,
Dev s r'us
- Hello Sir,
Thank you once again for the response.
I am very new to HTML parsing and have never used IHTMLElement. Can you please help me solve this problem of extracting the text from the table?
href is not required; I only have to process tables and extract their data.
Can you please give me some ideas to complete this assignment?
- IHTMLElement is the interface that every element of an HTML page implements.
PS: Sample code related to your issue under this link.
If you are trying to acquire all the text that exists in an HTML page, the simplest way is to walk the element collections with a recursive algorithm.
You should read the innerText of the simple text tags (e.g. "div", "a", ...).
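As a rough illustration of the recursive approach described above, here is a portable C++ sketch. It does not use MSHTML at all: `Node` is a hypothetical stand-in for an element and `collectText` is an illustrative name, so treat it as an outline of the traversal and not as working COM code.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an HTML element; this is NOT an MSHTML type.
struct Node {
    std::string tag;             // e.g. "DIV", "A", "TD"
    std::string text;            // text held directly by this element
    std::vector<Node> children;  // nested elements
};

// Depth-first walk: append the text of every element that has no children.
// Text stored on elements that also have children is ignored in this sketch.
void collectText(const Node& node, std::vector<std::string>& out) {
    if (node.children.empty()) {
        if (!node.text.empty()) {
            out.push_back(node.text);
        }
        return;
    }
    for (const Node& child : node.children) {
        collectText(child, out);
    }
}
```

With MSHTML the same shape would presumably apply, with an element's children collection in the role of `children` and its innerText in the role of `text`.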
Regards,
Dev s r'us
- Marked As Answer by Nancy Shao MSFT, Moderator, Friday, October 23, 2009 6:48 AM
- Hello Sir,
I am not able to use this method:
HRESULT IHTMLElement::get_innerText(BSTR *p);
Can you please help me with this? I did not find any useful information about this method.
- Hi,
You need to have a valid IHTMLElement object;
IHTMLElement *pElement;
Assuming that pElement is the IHTMLElement obtained from the collection, you should do it like this:
BSTR itxt;
pElement->get_innerText(&itxt);
...
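The call above reports success or failure through its return value and hands the text back through the pointer argument. A portable sketch of that calling convention follows; `FakeElement`, `readText` and `readTextWrong` are hypothetical names invented for illustration, and nothing here is the MSHTML interface itself.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an element; NOT the MSHTML interface.
// It mirrors only the convention: status code returned, text via out-parameter.
struct FakeElement {
    std::wstring text;

    // Returns 0 on success (in the spirit of S_OK) and writes the text to *out.
    long get_innerText(std::wstring* out) const {
        assert(out != nullptr);  // the caller must pass a valid destination
        *out = text;
        return 0;
    }
};

// Intended use: check the status, then read the out-parameter.
std::wstring readText(const FakeElement& element) {
    std::wstring value;
    long status = element.get_innerText(&value);
    return status == 0 ? value : std::wstring();
}

// Easy slip: treating the return value as if it were the text.
std::wstring readTextWrong(const FakeElement& element) {
    std::wstring value;
    return std::to_wstring(element.get_innerText(&value));  // the status code as text
}
```

With the real interface, the caller also owns the returned BSTR and should release it with SysFreeString once finished with it.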
Dev s r'us
- Hello Sir,
Thank you very much. I added the get_innerText call as shown below; it now compiles without any error.
pDoc->get_all(&pCollection);
pCollection = pDoc3->getElementsByTagName("table");
BSTR itxt;
for(long i=0; i<pCollection->length; i++){
    pElement = pCollection->item(i, (long)0);
    if(pElement != NULL){
        m_wndLinksList.AddString((LPCTSTR)bstr_t(pElement->get_innerText(&itxt)));
    }
}
I created an HTML page with one Table as below
<html>
<body>
<table border="1">
<tr>
<th>Team Name</th>
<th>Place</th>
</tr>
<tr>
<td>BRC</td>
<td>Bangalore</td>
</tr>
<tr>
<td>DD</td>
<td>Delhi</td>
</tr>
</table>
</body>
</html>
When I run the program, the output is 0; the content of the table is not displayed.
Is there any other method I have to call to retrieve the table data?
- Hi again, this is normal: pCollection is a collection of tables, and get_innerText returns an HRESULT while delivering the text through itxt, so wrapping the call in bstr_t displays the return code (0 for S_OK) and not the text. To get the data cell by cell, get the collection of tr elements for each table and, within that collection, the collection of th / td elements.
Does that make clear how it's done?
Dev s r'us
- Hello Sir,
I made the changes shown below.
Once I have the "table" elements, inside the loop I call getElementsByTagName again with "tr" to process each table separately.
It compiles without errors, but it gives a run-time error when I execute it.
MSHTML::IHTMLDocument2Ptr pDoc;
MSHTML::IHTMLDocument3Ptr pDoc3;
MSHTML::IHTMLElementCollectionPtr pCollection;
MSHTML::IHTMLElementPtr pElement;
//Added Pari
MSHTML::IHTMLElementCollectionPtr pCol;
MSHTML::IHTMLElementPtr pEle;

HRESULT hr = CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER, IID_IHTMLDocument2, (void**)&pDoc);

hr = pDoc->write(psa);
hr = pDoc->close();
SafeArrayDestroy(psa);

pDoc3 = pDoc;
pDoc->get_all(&pCollection);
pCollection = pDoc3->getElementsByTagName("table");
BSTR itxt;
for(long i=0; i<pCollection->length; i++){
    pElement = pCollection->item(i, (long)0);
    if(pElement != NULL){
        pCol = pDoc3->getElementsByTagName(L"tr");
        for(long j=0; j < pCol->length; j++){
            pEle = pCol->item(j,(long)0);
            if(pEle != NULL){
                m_wndLinksList.AddString((LPCTSTR)bstr_t(pEle->getAttribute("tr", 10)));
            }
        }
    }
}
- Debug it using F10: on which line do you encounter the problem?
Maybe this one: m_wndLinksList.AddString((LPCTSTR)bstr_t(pEle->getAttribute("tr", 10))); //?
Also the path to text is table -> tr -> th (or td) ...
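To make the table -> tr -> th/td path concrete, here is a portable C++ sketch of a tag lookup that is scoped to one table element and not to the whole document. `Node`, `findByTag` and `tableCells` are hypothetical names for illustration only; this is not MSHTML code, just the shape of the nested walk.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an HTML element; this is NOT an MSHTML type.
struct Node {
    std::string tag;
    std::string text;
    std::vector<Node> children;
};

// Collects every descendant of `root` whose tag matches, in document order.
// Plays the role of a tag-name lookup scoped to a single element.
void findByTag(const Node& root, const std::string& tag,
               std::vector<const Node*>& found) {
    for (const Node& child : root.children) {
        if (child.tag == tag) {
            found.push_back(&child);
        }
        findByTag(child, tag, found);
    }
}

// Walks table -> tr -> th/td and returns one vector of cell texts per row.
// Rows of tables nested inside `table` would also be included in this sketch.
std::vector<std::vector<std::string>> tableCells(const Node& table) {
    std::vector<std::vector<std::string>> rows;
    std::vector<const Node*> trs;
    findByTag(table, "TR", trs);  // search this table, not the whole page
    for (const Node* tr : trs) {
        std::vector<std::string> cells;
        for (const Node& cell : tr->children) {  // direct children of the row
            if (cell.tag == "TH" || cell.tag == "TD") {
                cells.push_back(cell.text);
            }
        }
        rows.push_back(cells);
    }
    return rows;
}
```

The point of the design is that the row lookup starts from the table element in hand, so each table yields only its own rows.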
Dev s r'us
- Hello Sir,
I just ran it in the debugger; the same line is causing the run-time error.
What change do I have to make here, sir?
m_wndLinksList.AddString((LPCTSTR)bstr_t(pEle->getAttribute("tr", 5)));
- Here is some sample code to extract text from a very simple HTML page. It should be enough to show how it works:
IHTMLDocument2* pDoc;
BSTR tagName, tdtag, itext;
tdtag = SysAllocString( L"TD");
SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
if (CoInitializeEx( 0L, COINIT_MULTITHREADED) == S_OK)
__try{
    if (CoCreateInstance( CLSID_HTMLDocument, 0L, CLSCTX_INPROC, IID_IHTMLDocument2, (void**)&pDoc) == S_OK)
    __try{
        VARIANT *param;
        SafeArrayAccessData(psa, (LPVOID*)&param);
        param->vt = VT_BSTR;
        param->bstrVal = SysAllocString(
            L"<!doctype html><html><head><title>None</title></head>"
            L"<body>"
            L"<table>"
            L"<tr>"
            L"<th>Header</th>"
            L"<td>Text1</td>"
            L"<td>Text2</td>"
            L"</tr>"
            L"</table>"
            L"</body>"
            L"</html>");
        SafeArrayUnaccessData(psa); //unlock the array so SafeArrayDestroy can succeed
        //stop if either call fails (the original test used && here)
        if ((pDoc->write( psa) != S_OK)||(pDoc->close() != S_OK))
            return 1;

        IDispatch *all = 0L, *disp = 0L;
        IHTMLElement *body = 0L, *item = 0L, *td = 0L;
        IHTMLElementCollection *alls = 0L, *tds = 0L;
        IHTMLElement2 *tbl2 = 0L;
        long alen = 0, tdlen = 0;

        pDoc->get_body( &body);
        body->get_all( &all);
        body->Release(); body = 0L;
        all->QueryInterface( IID_IHTMLElementCollection, (void**)&alls);
        all->Release(); all = 0L;
        alls->get_length( &alen);

        VARIANT dummy;
        dummy.vt = VT_I4;
        for( int ai = 0; ai < alen; ai++) {
            dummy.intVal = ai;
            disp = 0L;
            alls->item( dummy, dummy, &disp);
            if (disp) {
                item = 0L;
                disp->QueryInterface( IID_IHTMLElement, (void**)&item);
                disp->Release(); disp = 0L;
                if (item){
                    tagName = 0L;
                    item->get_tagName( &tagName);
                    bool isTable = !lstrcmpW( tagName, L"TABLE"); //is the element a table?
                    SysFreeString( tagName); //the caller owns the returned BSTR
                    tbl2 = 0L;
                    if (isTable)
                        item->QueryInterface( IID_IHTMLElement2, (void**)&tbl2);
                    item->Release(); item = 0L;
                    if (tbl2){
                        tds = 0L;
                        tbl2->getElementsByTagName( tdtag, &tds); //get the td cells of this table
                        if (tds){
                            tds->get_length( &tdlen);
                            for (int tdi = 0; tdi < tdlen; tdi++){
                                dummy.intVal = tdi;
                                disp = 0L;
                                tds->item( dummy, dummy, &disp);
                                if (disp){
                                    td = 0L;
                                    disp->QueryInterface( IID_IHTMLElement, (void**)&td);
                                    if (td){
                                        td->get_innerText( &itext);
                                        wprintf( L"%s\r\n", (LPWSTR)itext);
                                        SysFreeString( itext);
                                        td->Release(); td = 0L;
                                    }
                                    disp->Release(); disp = 0L;
                                }
                                //a null disp has nothing to release
                            }
                            tds->Release(); tds = 0L;
                        }
                        tbl2->Release(); tbl2 = 0L;
                    }
                }
            }
        }
        if (alls){
            alls->Release(); alls = 0L;
        }
    }__finally {
        pDoc->Release(); pDoc = 0L;
    }
}__finally {
    SafeArrayDestroy( psa);
    SysFreeString( tdtag);
    CoUninitialize();
}
return 0;
I used a console application to print Text1, Text2, ...
Dev s r'us
- Marked As Answer by Nancy Shao MSFT, Moderator, Friday, October 23, 2009 6:48 AM
- Unmarked As Answer by Naveen HS, Wednesday, November 04, 2009 4:48 AM
- Wow, thanks a lot sir, I will try to implement this.
Once again, thanks a lot for helping me to solve this.
- Thank you very much sir, I implemented the same and it is working fine.


