Using Regular Expressions or MSHTML to retrieve an HTML element
-
Friday, April 13, 2012 7:48 AM
Hello experts,
I am wondering whether using a regular expression to find a matching HTML element is faster than using MSHTML and traversing every element to find the match. Could you please give me some advice on this?
Thanks.
All replies
-
Friday, April 13, 2012 8:45 AM
It depends on multiple factors: the complexity of your HTML file, the complexity of the regular expression, and so on.
I would suggest you measure the two solutions.
-
Friday, April 13, 2012 8:59 AM
Could we prove in theory which approach would be faster in each of the situations you mention?
-
Friday, April 13, 2012 9:18 AM
I do not think there are studies available on the topic; I did a quick search without success. You could create multiple HTML files of differing complexity and do some testing yourself. It might be interesting for others if you published your results.
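A rough way to run such a measurement is sketched below in C++. The harness times any callable with std::chrono. count_anchors_regex uses std::regex, while count_anchors_scan is only a hypothetical stand-in for the other side of the comparison; a real test would put the MSHTML traversal in its place. All function names are made up for illustration, and counting <a> tags is just an assumed example task.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Run `work` `iterations` times and return the average time per call in microseconds.
template <typename F>
double average_microseconds(F work, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// Candidate 1: count <a ...> opening tags with a regular expression.
std::size_t count_anchors_regex(const std::string& html) {
    static const std::regex anchor("<a\\b[^>]*>",
                                   std::regex::ECMAScript | std::regex::icase);
    const auto first = std::sregex_iterator(html.begin(), html.end(), anchor);
    return static_cast<std::size_t>(std::distance(first, std::sregex_iterator()));
}

// Candidate 2 (placeholder): a hand-written scan for "<a" followed by '>' or
// whitespace. Swap in the real element traversal when measuring MSHTML.
std::size_t count_anchors_scan(const std::string& html) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 2 < html.size(); ++i) {
        const char next = html[i + 2];
        if (html[i] == '<' && (html[i + 1] == 'a' || html[i + 1] == 'A') &&
            (next == '>' || next == ' ' || next == '\t' || next == '\n')) {
            ++count;
        }
    }
    return count;
}
```

To compare, call average_microseconds([&] { total += count_anchors_regex(html); }, 1000) and do the same for the other candidate, using several files of different size and nesting. Adding each result into a variable that is printed afterwards keeps the compiler from discarding the work.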
-
Friday, April 13, 2012 2:00 PM
Check this link; it may help you:
http://stackoverflow.com/questions/9336989/c-html-parser-to-get-the-content-of-a-given-tag
-
Monday, April 16, 2012 9:16 AM
From what I found searching the Internet, using regular expressions to parse HTML elements is not recommended. Is that correct?
-
Tuesday, April 17, 2012 2:58 AM Moderator
I think it depends on what you want to achieve.
Best regards,
Jesse
Jesse Jiang [MSFT]
-
Wednesday, April 18, 2012 3:58 AM
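HTML is a Type 2 grammar in the Chomsky hierarchy, and regular expressions are Type 3. You cannot express all of HTML in a regex; regular expressions are simply not powerful enough. For example, a Type 3 pattern cannot pair arbitrarily nested opening tags with their closing tags. If you want to parse HTML, use an HTML parser that can handle the complexity of your input.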
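The nesting problem can be seen in a small C++ sketch. inner_by_regex tries to grab the content of the first <div> with std::regex, and inner_by_depth does the same by counting open elements. Both function names are made up for illustration, and the depth counter is a toy that only understands bare <div> and </div>; it is not a substitute for a real HTML parser.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Regex attempt: capture whatever sits between <div> and the next </div>.
std::string inner_by_regex(const std::string& html) {
    static const std::regex pattern("<div>(.*?)</div>");
    std::smatch match;
    return std::regex_search(html, match, pattern) ? match[1].str() : std::string();
}

// Depth-counting scan: track how many <div> elements are open and stop at the
// </div> that closes the first one.
std::string inner_by_depth(const std::string& html) {
    const std::string open = "<div>";
    const std::string close = "</div>";
    const std::size_t start = html.find(open);
    if (start == std::string::npos) return std::string();

    const std::size_t content = start + open.size();
    std::size_t pos = content;
    int depth = 1;
    while (pos < html.size()) {
        assert(depth > 0);  // invariant: the first <div> is still open
        if (html.compare(pos, open.size(), open) == 0) {
            ++depth;
            pos += open.size();
        } else if (html.compare(pos, close.size(), close) == 0) {
            if (--depth == 0) return html.substr(content, pos - content);
            pos += close.size();
        } else {
            ++pos;
        }
    }
    return std::string();  // unbalanced input: no matching </div>
}
```

For the input <div>a<div>b</div>c</div>, inner_by_regex stops at the first </div> and returns a<div>b, while inner_by_depth returns a<div>b</div>c. A greedy pattern fails the other way, running on to the last </div> even when it belongs to a sibling element.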
The following is signature, not part of post
Please mark the post that answered your question as the answer, and mark other helpful posts as helpful, so they will appear differently to other users who visit your thread with the same problem.
Visual C++ MVP
- Edited by Sheng Jiang 蒋晟 MVP Wednesday, April 18, 2012 4:00 AM
-
Wednesday, April 18, 2012 7:38 AM
This seems to be the correct answer. However, I see that some people currently use regular expressions to parse HTML documents. If we cannot use regular expressions, would XPath be a better choice, since I only need to parse one specific HTML element?
-
Wednesday, April 18, 2012 1:02 PM
>However, I see some people are currently using Regex to parse HTML document.
Either they don't care about total correctness or their input has limited complexity, for example when the page structure is known and has limited depth.
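As an illustration of that limited case, here is a minimal C++ sketch that pulls the text out of an element known to contain only plain text, such as <title>. The function names first_text_element and is_plain_name are hypothetical, and the pattern deliberately refuses anything with child tags rather than guessing.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <regex>
#include <string>

// True if `tag` is non-empty and purely alphanumeric, so it is safe to splice
// into a regex pattern without escaping.
bool is_plain_name(const std::string& tag) {
    return !tag.empty() && std::all_of(tag.begin(), tag.end(), [](unsigned char c) {
        return std::isalnum(c) != 0;
    });
}

// Return the text of the first <tag>...</tag> element whose content has no
// child tags, or "" if there is no such element.
std::string first_text_element(const std::string& html, const std::string& tag) {
    assert(is_plain_name(tag));
    const std::regex pattern("<" + tag + "\\b[^>]*>([^<]*)</" + tag + "\\s*>",
                             std::regex::ECMAScript | std::regex::icase);
    std::smatch match;
    return std::regex_search(html, match, pattern) ? match[1].str() : std::string();
}
```

Calling first_text_element(page, "title") works on a page whose title is plain text, but the same call with "p" on <p>a<b>c</b></p> returns an empty string, because the content is no longer flat. That is the point where the regex approach stops being reliable.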
>so can XPath be a better choice
No. Parse performance has nothing to do with XPath. XPath is a query language; if you are able to use it, then the parsing is already done and you have XML as the result.
Visual C++ MVP
- Marked as answer by cplusplusdev Wednesday, April 18, 2012 1:34 PM
- Edited by Sheng Jiang 蒋晟 MVP Friday, April 27, 2012 2:51 AM
-
Wednesday, April 18, 2012 1:35 PM
Thank you very much for your answer.

