parsing strings as HTML

  • Question

  • Hello all,

    My understanding is that regular expressions are the wrong tool for HTML strings, so how should I parse them in JavaScript/HTML5 instead? So far, I have been using

    var x = document.createElement("html");
    x.innerHTML = htmlString;

    and then running query selectors against x, but this feels like a workaround, and there should be a cleaner, more elegant solution. Does anyone know of one?

    All help is greatly appreciated and I always accept an answer!

    Saturday, September 1, 2012 9:36 PM

Answers

  • Are query selectors really that bad?

    <html>
      <head></head>
      <body>
        <div class="grocerylist">
          <div class="groceryitem" id="item1">
            <div class="price">1.00</div>
            <div class="name">Carrot</div>
          </div>
          <div class="groceryitem" id="item2">
            <div class="price">2.00</div>
            <div class="name">Celery</div>
          </div>
        </div>
      </body>
    </html>

    If you want to get a list of groceryitem, you could do:

    var items = document.querySelectorAll(".grocerylist .groceryitem");

    Then you could iterate over the items and extract each item's name and price from its child nodes. For more specific filtering, such as finding items that cost more than a certain amount, you could adjust the query selector to select only the price nodes, have JavaScript code pick out the ones that meet the criteria, and then take the parent node of each match.
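
    The filtering described here could be sketched roughly as follows, using the grocery markup above. The function name itemsCostingMoreThan and the shape of the returned objects are made up for illustration; root is assumed to be any node that exposes querySelectorAll, such as document.

```javascript
// Sketch: find grocery items whose price exceeds a threshold.
// `itemsCostingMoreThan` is an illustrative name, not an existing API.
// `root` is assumed to be any node with querySelectorAll (for example `document`).
function itemsCostingMoreThan(root, minPrice) {
  // Select only the price nodes, as suggested in the paragraph above.
  var priceNodes = root.querySelectorAll(".grocerylist .groceryitem .price");
  var matches = [];
  for (var i = 0; i < priceNodes.length; i++) {
    var priceNode = priceNodes[i];
    var price = parseFloat(priceNode.textContent);
    // Skip prices that do not parse as numbers.
    if (!isNaN(price) && price > minPrice) {
      // The parent of a matching price node is the groceryitem element.
      var item = priceNode.parentNode;
      var nameNode = item.querySelector(".name");
      matches.push({
        id: item.id,
        name: nameNode ? nameNode.textContent : "",
        price: price
      });
    }
  }
  return matches;
}
```

    With the sample markup, itemsCostingMoreThan(document, 1.5) would keep only the Celery item, since 1.00 is not greater than 1.5 and 2.00 is.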

    I don't think query selectors are that bad at extracting information. Granted, you'll have to write accompanying JavaScript code for any additional filtering, but that's just my opinion... I guess I've gotten used to it. Other than that, I've also seen some people use XPath on HTML... I wonder if that is something we could do here. HTML might not be strictly valid XML, but it might be possible.
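
    For what the XPath idea might look like, here is a rough sketch. It assumes the host implements the DOM XPath method document.evaluate, which not every engine does and which has not been verified for the environment in this thread. The function name selectGroceryItemsByXPath is made up for illustration.

```javascript
// Sketch: select the groceryitem elements with XPath instead of a CSS selector.
// Assumes `doc` is a Document whose host implements DOM XPath (doc.evaluate).
// XPathResult.ORDERED_NODE_SNAPSHOT_TYPE has the numeric value 7; the literal is
// used here so the sketch does not depend on a global XPathResult object.
var ORDERED_NODE_SNAPSHOT_TYPE = 7;

function selectGroceryItemsByXPath(doc) {
  var expression = "//div[@class='grocerylist']/div[@class='groceryitem']";
  var snapshot = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  var nodes = [];
  // Copy the snapshot into a plain array for easier iteration.
  for (var i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

    Note that @class='groceryitem' only matches when the class attribute is exactly that string, so an element with several classes would be missed, whereas the .groceryitem selector would still match it.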

    • Marked as answer by sddhhanover Sunday, September 2, 2012 7:29 AM
    Sunday, September 2, 2012 2:57 AM

All replies

  • The only reason I didn't like selectors was because of the document.createElement part and the innerHTML. As it turns out, innerHTML is a very expensive operation.
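
    One possible way to avoid the document.createElement step is the standard DOMParser interface, sketched here. This assumes the host's DOMParser accepts the "text/html" type, which has not been checked for the environment in this thread, and it makes no claim about being cheaper than innerHTML. The function name parseHtmlString and its optional parser argument are made up for illustration.

```javascript
// Sketch: parse an HTML string into a separate Document and query that.
// `parseHtmlString` is an illustrative name. The optional `parser` argument
// exists only so the sketch can be exercised without a browser; by default
// a real DOMParser is created.
function parseHtmlString(htmlString, parser) {
  var domParser = parser || new DOMParser();
  // "text/html" asks for the HTML parser rather than the XML one.
  return domParser.parseFromString(htmlString, "text/html");
}
```

    The returned document supports the same selectors, for example parseHtmlString(htmlString).querySelectorAll(".grocerylist .groceryitem").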
    • Edited by sddhhanover Sunday, September 2, 2012 7:40 AM
    Sunday, September 2, 2012 7:29 AM
  • Hi,

    I think you should use window.toStaticHTML to filter htmlString before you assign it to innerHTML. That filters out the dynamic content that may cause an exception when it is assigned to innerHTML, especially when you load HTML from other systems.
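
    A minimal sketch of that suggestion, assuming window.toStaticHTML is available in the host (it is not a cross-browser API). The helper name setFilteredHtml and its optional filter argument are made up for illustration.

```javascript
// Sketch: filter an HTML string before assigning it to innerHTML.
// `setFilteredHtml` is an illustrative name. `filter` is optional; when it is
// omitted, the helper falls back to window.toStaticHTML if the host provides it.
function setFilteredHtml(element, htmlString, filter) {
  var sanitize = filter;
  if (!sanitize && typeof window !== "undefined" &&
      typeof window.toStaticHTML === "function") {
    // Call through window so the native method keeps its expected receiver.
    sanitize = function (html) { return window.toStaticHTML(html); };
  }
  if (typeof sanitize !== "function") {
    throw new Error("No HTML filter available (window.toStaticHTML is missing)");
  }
  element.innerHTML = sanitize(htmlString);
  return element;
}
```

    Usage would look like setFilteredHtml(document.createElement("html"), htmlString), followed by the usual query selectors on the returned element.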


    woodhead is as woodhead does

    Monday, September 3, 2012 12:36 AM
  • I do, but I still get the "cannot inject dynamic content" error. I solved that issue by stripping the <head> tag, though I'm not sure why I have to do that.
    • Edited by sddhhanover Tuesday, September 4, 2012 12:53 AM
    Tuesday, September 4, 2012 12:53 AM