none
Parse HTML into SQL Columns RRS feed

  • Question

  • Dear All,

    I have an HTML that looks like this

    <!DOCTYPE html>
    <html>
        <head>
            <title>The wait operation timed out</title>
            <meta name="viewport" content="width=device-width" />
            <style>
             body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;} 
             p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
             b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
             H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
             H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
             pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
             .marker {font-weight: bold; color: black;text-decoration: none;}
             .version {color: gray;}
             .error {margin-bottom: 10px;}
             .expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
             @media screen and (max-width: 639px) {
              pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
             }
             @media screen and (max-width: 479px) {
              pre { width: 280px; }
             }
            </style>
        </head>

        <body bgcolor="white">

                <span><H1>Server Error in '/' Application.<hr width=100% size=1 color=silver></H1>

                <h2> <i>The wait operation timed out</i> </h2></span>

                <font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">

                <b> Description: </b>An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

                <br><br>

                <b> Exception Details: </b>System.ComponentModel.Win32Exception: The wait operation timed out<br><br>

                <b>Source Error:</b> <br><br>

                <table width=100% bgcolor="#ffffcc">
                   <tr>
                      <td>
                          <code><pre>

    Line 417:    public void insert_user()
    Line 418:    {
    <font color=red>Line 419:        insert_user();
    </font>Line 420:    }
    Line 421:</pre></code>

                      </td>
                   </tr>
                </table>

                <br>

                <b> Source File: </b> c:\inetpub\wwwroot\projects\users.cs<b> &nbsp;&nbsp; Line: </b> 419
                <br><br>

                <b>Stack Trace:</b> <br><br>

                <table width=100% bgcolor="#ffffcc">
                   <tr>
                      <td>
                          <code><pre></pre></code>
                      </td>
                   </tr>
                </table>

                <br>

        </body>
    </html>

    I want to  create 2 columns out of this HTML . The columns will be called Source Error and Source File and there values will be like

    Source File
     c:\inetpub\wwwroot\projects\users.cs

    Source Error
    Line 417:    public void insert_user()
    Line 418:    {
    <font color=red>Line 419:        insert_user();
    </font>Line 420:    }
    Line 421

    Any help would be greatly appreciated.

    Thursday, December 7, 2017 9:19 PM

All replies

  • SQL isn't really your answer here. Best way is to push it through a HTML parser that can build a Document Object Model (DOM) structure for you, then getting the relevant fields should become easier. 

    How and what to use will be based on competencies but for the most part, it's going to require some scripting

    Thursday, December 7, 2017 9:26 PM
  • I agree with Ryan that T-SQL is probably not your answer.

    Before rejecting that idea, sometimes HTML can be treated as xml.


    Russel Loski, MCSE Data Platform/Business Intelligence Twitter: @sqlmovers; blog: www.sqlmovers.com

    Thursday, December 7, 2017 9:31 PM
  • Can i get a script or query of how it will be handled?
    Thursday, December 7, 2017 9:32 PM
  • First step, try TSQL:

    DECLARE @X xml = '
    
     Vote  
    
    
    
    
    Dear All,
    
    I have an HTML that looks like this
    
    
    <!DOCTYPE html>
     <html>
         <head>
             <title>The wait operation timed out</title>
             <meta name="viewport" content="width=device-width" />
             <style>
              body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;} 
              p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
              b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
              H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
              H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
              pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
              .marker {font-weight: bold; color: black;text-decoration: none;}
              .version {color: gray;}
              .error {margin-bottom: 10px;}
              .expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
              @media screen and (max-width: 639px) {
               pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
              }
              @media screen and (max-width: 479px) {
               pre { width: 280px; }
              }
             </style>
         </head>
    
    ';

    Put the entire html document in the string.  If single quote occurs in the string, then escape it with double single quotes ''.   By work, it means that your script reads the html as xml without error.

    Now this is unlikely to work, but if it doesn't then you can use SQL to get the values you like.

    Otherwise, you are going to need to find some other language code (Python has libraries, I have used C# in the past).


    Russel Loski, MCSE Data Platform/Business Intelligence Twitter: @sqlmovers; blog: www.sqlmovers.com

    Thursday, December 7, 2017 9:56 PM
  • Ok that didn't work. Any one else has some ideas?
    Thursday, December 7, 2017 10:18 PM
  • Any body can help me out
    Thursday, December 7, 2017 10:41 PM
  • Yes, unfortunately, there are a few HTML elements that don't fit XML formats so even if it did work once, you won't necessarily have it working every time.

    How frequently does this need to be done? Does it require automation? 

    There will be plenty of parsers on the net but if you're looking at frequently/lots of HTML/automation then you will need to go down a script path but that's dependent on your competency (or the competency of whoever is to do it). Every "web-aware" language would be able to parse a HTML document (I'd be surprised if they couldn't anyway) so pick your poison and away you go... Google will be your big friend here

    Thursday, December 7, 2017 10:47 PM
  • Problem is that it has to be done in SQL. We can't use web-aware language
    Thursday, December 7, 2017 10:52 PM
  • Why? Do you hammer nails with a screwdriver?

    If it's only the "Source Error" section you want, you could search (CHARINDEX) for the first "<pre>" after the "Source Error" and extract everything up to the next "</pre>". Is there only one per file?

    Thursday, December 7, 2017 11:11 PM
  • Hi There,

    the only way for you is read the html source in to a SQL server variable and perform string manipulation and extract the info you require.

    for this you can use sql data type of nvarchar(max) and perform string manipulation.

    Note: it is not a recommended process. with the syntax that have in the html source will not be an easy job.

    instead consider CLR, or a vb script or java script and load the extracted information.

    Good luck

    Kumar

    Thursday, December 7, 2017 11:19 PM
  • It has to be done in something else.

    See my Introduction to Web Site Scraping. That uses C# and shows the type of thing that would be appropriate.

    Many times programmers think they can use regular expressions but they are also limited and more vulnerable to failure. SQL will definitely not have the ability to parse the HTML and navigate the structure of the document. Sometimes web pages have frames and that can really make things complicated. Sometimes web pages have scripts that add to the page. So without details of what you need to do it is foolish to say you can use SQL since it is likely it would be so complicated that it is essentially impossible. No one wants to say it is possible because it is unlikely it is but even if it is the answer would be too much work.

    The only thing people here can do is to help you convince management that it can't be done using SQL.



    Sam Hobbs
    SimpleSamples.Info

    • Proposed as answer by Xi Jin Friday, December 8, 2017 6:27 AM
    Friday, December 8, 2017 2:41 AM