Extracting HTML fragments with C# and regular expressions

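  • Question

  • Given the HTML below, how can I extract the pieces I need with C# and a regular expression?
    1. The HTML represents 3 groups of received messages. A group that contains "unread.gif" is unread; otherwise it has been read.
    2. For both unread and read messages, extract the conversation count and theUrl (shown as thUrl in the sample below).
    3. Unread and read messages must be collected into two separate groups.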

    <div class="dxx_of" id="message1" onmouseover="msgOnmouseover(1)" onmouseout="msgOnmouseout(1)" />
    <div class="dxx1" style="padding:15px 10px;"><img src="http://www.microsoft.com/i2/unread.gif" width="14" height="10" title="" /></div>
    <div class="dxx2">
    				
    				<table class="aa" border="0" cellpadding="0" cellspacing="0" >
      					<colgroup>
    						<col width="463" />
    					</colgroup>
      					<tbody>
      						
        					<tr basestyle="oRowLine2">
          						<td valign="top" onclick="javascript:document.location='thUrl';" >
    								wa
    								<div><span class='c9'>共6条会话</span><a href="thUrl" class="sl">+展开</a></span></div>
    								<span class="c9"></span>
    							</td>
    						</tr>
    					</tbody>
    				</table>
    </div>
    <div class="c"></div>
    </div>
    
    <div class="dxx_of" id="message2" onmouseover="msgOnmouseover(2)" onmouseout="msgOnmouseout(2)" />
    <div class="dxx1" style="padding:15px 10px;"></div>
    <div class="dxx2">		
    				<table class="aa" border="0" cellpadding="0" cellspacing="0" >
      					<colgroup>
    						<col width="463" />
    					</colgroup>
      					<tbody>
      						
        					<tr basestyle="oRowLine2">
          						<td valign="top" onclick="javascript:document.location='thUrl';" >
    								wa
    								<div><span class='c9'>共3条会话</span><a href="thUrl1" class="sl">+展开</a></span></div>
    								<span class="c9"></span>
    							</td>
    						</tr>
    					</tbody>
    				</table>				
    </div>
    <div class="c"></div>
    </div>
    
    <div class="dxx_of" id="message3" onmouseover="msgOnmouseover(3)" onmouseout="msgOnmouseout(3)" />
    				<div class="dxx1" style="padding:15px 10px;"></div>
    <div class="dxx2">
      (a lot more HTML, same as above)
    </div>
    <div class="c"></div>
    </div>
    July 24, 2009 3:20

Answers

All replies

  • July 24, 2009 8:36
  • Hello, please refer to: http://msdn.microsoft.com/zh-cn/library/ms228595.aspx
    I tried the code below, but it matches nothing. I later found out that "." in a regular expression matches any character except \n, and the HTML contains many \r\n sequences.
    Is there a reliable way to split it into the 3 groups?
     Regex regex = new Regex("<div class=\"dxx_of\" id=\".+/>.+(?<htmlCode>.+).+<div class=\"c\"></div>");
     MatchCollection matchs = regex.Matches(resultHtml);
    
     if (matchs.Count > 0)
     {
        string html = matchs[0].Groups["htmlCode"].Value;
     }

    July 24, 2009 8:56
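
The behaviour described in the post above ("." not matching \n) can be changed with RegexOptions.Singleline. The following is only a minimal sketch, not the poster's final code: the class name SinglelineDemo, the shortened Sample string, and the lazy `.*?` quantifier are assumptions made for illustration; only the group name htmlCode and the two div markers come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // One message group: from the opening dxx_of div up to the first
    // <div class="c"></div> after it. The lazy .*? keeps each match
    // inside a single group (an assumption, not the poster's pattern).
    public const string Pattern =
        "<div class=\"dxx_of\" id=\"[^\"]+\"[^>]*>(?<htmlCode>.*?)<div class=\"c\"></div>";

    // Shortened stand-in for the HTML in the question (hypothetical data).
    public const string Sample =
        "<div class=\"dxx_of\" id=\"message1\" />\r\n" +
        "<div class=\"dxx1\"><img src=\"unread.gif\" /></div>\r\n" +
        "<div class=\"c\"></div>\r\n</div>\r\n" +
        "<div class=\"dxx_of\" id=\"message2\" />\r\n" +
        "<div class=\"dxx1\"></div>\r\n" +
        "<div class=\"c\"></div>\r\n</div>\r\n";

    public static MatchCollection FindGroups(string html, RegexOptions options)
    {
        return new Regex(Pattern, options).Matches(html);
    }

    public static void Main()
    {
        // Default options: '.' stops at \n, so no group is found.
        Console.WriteLine(FindGroups(Sample, RegexOptions.None).Count);       // 0
        // Singleline: '.' also matches \n, giving one match per group.
        Console.WriteLine(FindGroups(Sample, RegexOptions.Singleline).Count); // 2
    }
}
```

With the full HTML from the question, the same pattern would be expected to yield one match per dxx_of block, provided no `<div class="c"></div>` appears inside a group before its end.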
  • You can ask this in the regular expressions forum.

    Please mark the post that answered your question as the answer, and click the chartreuse pyramid floating over "Vote as helpful" to mark other helpful posts as helpful. This posting is provided "AS IS" with no warranties, and confers no rights.
    Visual C++ MVP
    July 24, 2009 10:59
    Moderator
  • Oh my god. So I would have to ask in English...
    July 25, 2009 17:34
  • Hello, the \r\n sequences do not play a decisive role in what you need.
    My suggestion is to remove them with the string Replace method before running the regular expression.
    July 26, 2009 10:14
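
The Replace suggestion above can be combined with the three requirements from the question roughly as follows. This is a sketch under stated assumptions rather than a confirmed answer from the thread: MessageSplitter, MessageInfo, Split, and the shortened Sample string are hypothetical names and data, both \r and \n are removed (slightly more than the \r\n mentioned in the reply), and a group with no count or link is recorded with Count 0 and a null Url.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MessageSplitter
{
    // Hypothetical holder for the two values the question asks for.
    public sealed class MessageInfo
    {
        public int Count;
        public string Url;
    }

    // Shortened stand-in for the HTML in the question (hypothetical data).
    public const string Sample =
        "<div class=\"dxx_of\" id=\"message1\" />\r\n" +
        "<div class=\"dxx1\"><img src=\"http://www.microsoft.com/i2/unread.gif\" /></div>\r\n" +
        "<div><span class='c9'>共6条会话</span><a href=\"thUrl\" class=\"sl\">+展开</a></div>\r\n" +
        "<div class=\"c\"></div>\r\n</div>\r\n" +
        "<div class=\"dxx_of\" id=\"message2\" />\r\n" +
        "<div class=\"dxx1\"></div>\r\n" +
        "<div><span class='c9'>共3条会话</span><a href=\"thUrl1\" class=\"sl\">+展开</a></div>\r\n" +
        "<div class=\"c\"></div>\r\n</div>\r\n";

    static readonly Regex GroupRegex = new Regex(
        "<div class=\"dxx_of\" id=\"[^\"]+\"[^>]*>(?<htmlCode>.*?)<div class=\"c\"></div>");
    static readonly Regex CountRegex = new Regex("共(?<count>[0-9]+)条会话");
    static readonly Regex UrlRegex = new Regex("<a href=\"(?<url>[^\"]*)\"");

    public static void Split(string html, List<MessageInfo> unread, List<MessageInfo> read)
    {
        // Remove line breaks first so '.' can span a whole group.
        string flat = html.Replace("\r", "").Replace("\n", "");

        foreach (Match m in GroupRegex.Matches(flat))
        {
            string body = m.Groups["htmlCode"].Value;
            Match count = CountRegex.Match(body);
            Match url = UrlRegex.Match(body);
            var info = new MessageInfo
            {
                Count = count.Success ? int.Parse(count.Groups["count"].Value) : 0,
                Url = url.Success ? url.Groups["url"].Value : null
            };
            // A group containing unread.gif is unread; otherwise it is read.
            (body.Contains("unread.gif") ? unread : read).Add(info);
        }
    }

    public static void Main()
    {
        var unread = new List<MessageInfo>();
        var read = new List<MessageInfo>();
        Split(Sample, unread, read);

        foreach (MessageInfo m in unread) Console.WriteLine("unread: " + m.Count + " " + m.Url);
        foreach (MessageInfo m in read) Console.WriteLine("read: " + m.Count + " " + m.Url);
    }
}
```

For the shortened sample this prints "unread: 6 thUrl" followed by "read: 3 thUrl1". The patterns depend on the exact markup shown in the question, so any change to the page layout would require adjusting them.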
  • A
    • Proposed as answer by chong6868 September 26, 2009 15:36
    September 26, 2009 15:36
  • A
    • Proposed as answer by chong6868 October 12, 2009 5:34
    October 12, 2009 5:34