HTML parser - HTML to string

  • Question

  • I have been looking for a converter that takes a string of HTML and returns a formatted HTML-free string.  The HTML is from an RSS description tag and so it does not just contain basic HTML.

    I know regular expressions can be used; however, I have yet to find an extensive implementation in C#.

    Ideally I am looking for a Silverlight-compatible library, but I can't seem to find one.

    If anyone knows of an extensive implementation it would be very helpful.

    Cheers

    Saturday, January 17, 2009 4:13 PM

Answers

  • Unless you want to use some sort of SGML parser (which can get pretty complicated), this approach will generally yield the best results:

     

    string output = Regex.Replace(input, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

    You can add or subtract tags delimited by the OR ( | ) sign as needed.
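
    To make the behaviour of that pattern concrete, here is a minimal sketch that wraps it in a helper. The class and method names (UnsafeTagFilter, Strip) are made up for illustration; the pattern itself is copied unchanged from the line above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnsafeTagFilter
{
    // The pattern from the answer, unchanged: matches the opening or closing
    // tag of any listed element, including attributes spread over several lines.
    private const string Pattern =
        @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>";

    // Hypothetical helper name: removes the listed tags and leaves
    // every other tag, and all text, in place.
    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }
}
```

    Note that only the tags themselves are removed. Calling UnsafeTagFilter.Strip("<p>Hi</p><script>alert(1)</script>") leaves the <p> element alone and also keeps the text that sat between the script tags, so this is a tag filter rather than a full HTML-to-text conversion.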

    Sunday, January 18, 2009 3:26 PM

All replies

  • Is the RSS feed strict XHTML? If so, couldn't you simply load that HTML into an XML document, select the appropriate node and then use the InnerText property to get just the text without the formatting HTML?

    I have not tried it myself but I think the idea should be useful.
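
    A rough sketch of that idea, with some assumptions: it uses LINQ to XML (XElement) rather than XmlDocument, since as far as I know XmlDocument is not available in Silverlight, and it reads the Value property, which plays the role of InnerText. The names XhtmlText and TryGetText, and the wrapper element "root", are invented for the example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlText
{
    // Returns the concatenated text of a well-formed fragment,
    // or null when the fragment cannot be parsed as XML.
    public static string TryGetText(string fragment)
    {
        try
        {
            // Wrap in a single root element (name chosen arbitrarily), because a
            // description may contain several top-level elements.
            XElement root = XElement.Parse("<root>" + fragment + "</root>");
            return root.Value;
        }
        catch (XmlException)
        {
            // Not well-formed: unclosed tags such as <br>, or
            // HTML-only entities such as &nbsp;, end up here.
            return null;
        }
    }
}
```

    This only works for well-formed input, and the text of adjacent elements is simply concatenated, so "<p>A</p><p>B</p>" comes back as "AB" with no paragraph break.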

    Sunday, January 18, 2009 1:39 PM
  • Well, it is the content of the <description> tags, so it could be many different things.

    Some examples are:

     

    <p>When:

    <abbr class="dtstart" title="20090120T1600">4pm</abbr>
    -
    <abbr class="dtend" title="20090120T1700">5pm, Tue, 20 Jan '09</abbr>
    </p>
    <p>Where: MS B3.03</p>

    <p>Boundary Properties of Graphs and the Hamiltonian Cycle Problem</p>

      

    This example is one I have been having trouble with.  It is generated from a Calendar and has these abbr tags.

    A simpler example is 

     

    <p>The Computing Society is holding a few talks on the C language. These will be held on Thursdays (in CS0.01) from 5:00 to 6:30 pm on the following dates:</p>
    
    <p>22/01/2009: Lecture1: Introduction to C</p>
    
    <p>29/01/2009: Lecture2: Advanced C</p>

     

    So I don't think I can assume it is XHTML, as it could easily not be well-formed.

    I would want the HTML to mean something, though; for instance, <p> tags would define a paragraph.  I don't think using InnerText would preserve this.
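
    To illustrate the kind of output meant here, this is a minimal regex-based sketch, not a real HTML parser. The choice of block tags, the small entity list and the names HtmlToText and ToPlainText are all assumptions made for the example. It turns closing block tags into blank lines and <br> into a line break before dropping every remaining tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlToText
{
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Line breaks in the HTML source carry no meaning, so collapse whitespace first.
        string text = Regex.Replace(html, @"\s+", " ");
        // <br> becomes a single line break.
        text = Regex.Replace(text, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        // The end of a block element (assumed set: p, div, li, h1-h6) becomes a blank line.
        text = Regex.Replace(text, @"</(p|div|li|h[1-6])\s*>", "\n\n", RegexOptions.IgnoreCase);
        // Drop every remaining tag, e.g. <p ...>, <abbr ...>, </abbr>.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // Decode a handful of common entities; &amp; goes last to avoid double decoding.
        text = text.Replace("&nbsp;", " ").Replace("&lt;", "<").Replace("&gt;", ">")
                   .Replace("&quot;", "\"").Replace("&amp;", "&");
        // Tidy up: no spaces around line breaks, at most one blank line in a row.
        text = Regex.Replace(text, @" *\n *", "\n");
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

    With the calendar example above, this should give "When: 4pm - 5pm, Tue, 20 Jan '09" and "Where: MS B3.03" as separate paragraphs. It will still be fooled by things like a ">" inside an attribute value, so a proper parser is the safer route for arbitrary feeds.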

    Sunday, January 18, 2009 2:19 PM
  • Unless you want to use some sort of SGML parser (which can get pretty complicated), this approach will generally yield the best results:

     

    string output = Regex.Replace(input, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

    You can add or subtract tags delimited by the OR ( | ) sign as needed.

    Sunday, January 18, 2009 3:26 PM

  • </head>
    <body>
     <div class="container">
    <div class="header">
    <div class="logo">
    <img src="img/logo.png">
    </div>

    <div class="links">
    <img src="img/reg_btn.gif">
    <img src="img/rss.gif">
    <img src="img/fb.gif">
    <img src="img/twitter.gif">
    <span>A</span><a href="" class="signin">Sign in</a>
    </div>
    </div>

    <div class="menu">
    <ul class="unorderlist">
    <li class="menulist"><a href="" class="link">HOME</a>
    <ul class="dropdown-content"><li><a href="" class="submenu">WHO WE ARE</a></li></ul></li>
    <li class="menulist"><a href="" class="link">ABOUTUS</a></li>

    </ul>
    </div>

    <div class="center">
    <div class="slideshow">
    <img src="img.jpg">
    </div>
    <div class="puzzle">yes<br>
    <div class="puzzlebg">
    <button class="button button2" name="button" value="2" id="5">2</button>
    </div>
    </div>
    </div>
    <div class="footer">
    <div class="copyrightto">
    Copyright
    </div>
    </div>
     </div>



    • Edited by Ramesh.p Tuesday, June 26, 2018 5:01 PM
    Tuesday, June 26, 2018 4:52 PM

  • .button {
     background-color: white;
     border: none;
     color: white;
     padding: 20px;
     text-align: center;
     text-decoration: none;
     display: inline-block;
     font-size: 30px;
     margin: 0px 0px;
     cursor: pointer;
     width:80px;
     border: 2px solid #4099DE;
     float:left;
    }

    .footer{
    background-color: #4099DE;
    float:left;
    width:940px;
    text-align: center;
    }

    .copyright{
    margin:0px auto;
    color:white;
    }

    .puzzlebg{
    float:left;
    width:320px;
    Background-image:url("../img/thumb_7.jpg");
    background-repeat: no-repeat;
    height:316px;
    background-size: 418px;
    }

    .dropdown-content {
        display: none;
        ;
        background-color:#292929;
        min-width: 160px;
        box-shadow: 0px 8px 16px 0px rgba(0,0,0,0.2);
        padding: 14px 0px;
        z-index: 1;
        text-decoration: none;
    #display:inline-block;
    }

    .menulist:hover .dropdown-content {
        display: block;
    }

    .submenu{
    color:white;
    text-decoration: none;
    }

    body{
    margin:0px;
    padding:0px;
    background-color:#F2F2F2;
    font-family:calibri;
    }
    .container{
    width:940px;
    margin:0px auto;
    }
    .header{
    float:left;
    width:940px;
    margin:0px auto;
    background-color:white;
    height: 85px;
    }
    .logo{
    width:167px;
    float:left;
    margin-top: 10px;
    }
    .links{
    float:right;
    width: 265px;
    text-decoration: none;
    margin-top: 10px;
    text-align: right;
    }

    .menu{
    word-spacing: 35px;
    Background-image:url("../img/menu.gif");
    width:940px;
    float:left;
    height: 45px;
    }

    .menulist{
    display:inline-block;
    list-style: none;
    }

    .link{
    float:left;
    display:inline-block;
    color: #888888;
    text-decoration: none;
    }

    .signin{
    float:right;
    color:#4396D9;
    }

    .center{
    float:left;
    }
    .puzzle{
    margin:0px auto;
    width:320px;
    text-align: center;
    color: red;
    font-weight:bold;
    }

    .slideshow{
    float:left;
    width:940px;
    }

    • Edited by Ramesh.p Tuesday, June 26, 2018 5:31 PM
    Tuesday, June 26, 2018 4:54 PM

  • <script>
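    $(document).ready(function () {
        // Value and id of the first (i, id1) and second (j, id2) clicked button; 0 means "none yet".
        var i = 0;
        var j = 0;
        var str;
        var id1 = 0;
        var id2 = 0;

        $(".button").click(function () {
            // Reveal the clicked button's label.
            $(this).css("color", "black");
            if (i == 0 && id1 == 0) {
                i = $(this).val();
                id1 = $(this).attr("id");
            } else {
                j = $(this).val();
                id2 = $(this).attr("id");
            }
            if ((i != 0) && (j != 0) && (id1 != 0) && (id2 != 0)) {
                if (i == j && id1 != id2) {
                    // Two different buttons with the same value: mark the pair as matched.
                    str = ".button" + i;
                    $(str).css("background-color", "#4896DE");
                    i = 0; j = 0; id1 = 0; id2 = 0;
                } else {
                    // No match: turn the first value's text white now, the second's after 500 ms.
                    str = ".button" + i;
                    $(str).css("color", "white");
                    var k = ".button" + $(this).val();
                    setTimeout(function () {
                        $(k).css("color", "white");
                    }, 500);
                    i = 0; j = 0; id1 = 0; id2 = 0;
                }
            }
        });
    });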
    </script>
    Tuesday, June 26, 2018 4:58 PM