OutOfRangeException When Using FSharp.Data HtmlProvider Tables Facility

  • Question

  • Hi --

    This one I don't really understand. In the code below, everything executes fine through the binding of filmResults; when I set a breakpoint on the next line, I can see that the filmResults array contains two values. Drilling into these, I can see that each value (of type HtmlProvider<...>) has a Tables collection, and each of those has a table named Table6. Yet when the Array.map function in the binding of detailsTables iterates over the array, it reads the first Table6 fine, but the second one throws an IndexOutOfRangeException. Has anyone encountered something like this before?

    I can't for the life of me see why anything would be wrong, unless, after processing the first Table6, the library expects the second Table6 to have the same structure as the first. They do have different structures: the first has five columns and the second has three (each call to the details page returns a table structured for whatever details are available for the id being fetched).

    Any thoughts would be greatly appreciated, as I am sort of dead in the water at the moment.  

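    open FSharp.Data       // HtmlProvider
    open Newtonsoft.Json   // JsonConvert

    // A record type with fields id, film and directors is defined elsewhere (not shown).
    type LumiereFilmStartingWith = HtmlProvider<"http://lumiere.obs.coe.int/web/films/index.php?letter=A">
    type LumiereFilmReleaseDetail = HtmlProvider<"http://lumiere.obs.coe.int/web/film_info/?id=64128">

    type Lumiere() =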
        member public this.StartingWithA() =
            let thePage = LumiereFilmStartingWith.GetSample()
            let tables = thePage.Tables
            let html = thePage.Html
            let ids = 
                html.Descendants ["a"]
                |> Seq.choose(fun x ->
                     x.TryGetAttribute("href")
                     |> Option.map(fun a -> a.Value()))
                |> Seq.filter(fun h -> h.Contains("?id="))
                |> Seq.map(fun h -> 
                              let delimiterIndex = h.LastIndexOf("?id=")
                              h.Substring(delimiterIndex+4))
                |> Seq.map int
                |> Seq.toArray

            let startingWithA = tables.``ABCDEFGHIJKLMNOPQRSTUVWXYZ[0-9]``.Rows
                                |> Array.mapi (fun i f -> {id = ids.[i]; film = f.Film; directors = f.Directors})

            let filmResults = startingWithA 
                              |> Array.map(fun fd -> 
                                              let url = "http://lumiere.obs.coe.int/web/film_info/?id=" + fd.id.ToString()
                                              LumiereFilmReleaseDetail.AsyncLoad url)
                              |> Array.take 2
                              |> Async.Parallel
                              |> Async.RunSynchronously

            // Accessing Table6 on the second result is where the IndexOutOfRangeException is thrown.
            let detailsTables = filmResults
                              |> Array.map(fun p -> p.Tables.Table6)

            let filmDetails = detailsTables 
                              |> Array.map(fun t -> seq {for i in 1 .. (t.Rows.Length-1) do yield JsonConvert.SerializeObject(t.Rows.[i]) })
            startingWithA



    In case it is helpful, here are the HTML snippets from the Html property of each of the two tables:



    <table class="fixed_layout_100">
      <thead>
        <tr>
          <th align="CENTER">Market</th><th align="CENTER">Distributor</th><th align="CENTER">Release date</th><th align="RIGHT">2015</th><th align="CENTER">Total since 2015</th>
        </tr>
      </thead>
      <tbody>
        <tr class="odd">
          <td align="RIGHT">
            <a href="?id=64128&market=FR" target="_top" title="Admissions (Market : France)">FR</a>
          </td><td align="CENTER">Ad Vitam</td><td align="CENTER">25/02/2015</td><td align="RIGHT">10 832</td><td align="RIGHT">10 832</td>
        </tr>
        <tr class="footer">
          <td align="RIGHT">
            <a href="/web/iso_codes/">EUR EU</a>
          </td><td align="CENTER"> </td><td align="CENTER"> </td><td align="RIGHT">10 832</td><td align="RIGHT">10 832</td>
        </tr>
        <tr class="footer">
          <td align="RIGHT">
            <a href="/web/iso_codes/">EUR OBS</a>
          </td><td align="CENTER"> </td><td align="CENTER"> </td><td align="RIGHT">10 832</td><td align="RIGHT">10 832</td>
        </tr>
      </tbody>
    </table>


    <table class="fixed_layout_100">
      <thead>
        <tr>
          <th align="CENTER">Market</th><th align="RIGHT">2009</th><th align="CENTER">Total since 2006</th>
        </tr>
      </thead>
      <tbody>
        <tr class="odd">
          <td align="RIGHT">
            <a href="?id=33601&market=PT" target="_top" title="Admissions (Market : Portugal)">PT</a>
          </td><td align="RIGHT">11</td><td align="RIGHT">11</td>
        </tr>
        <tr class="footer">
          <td align="RIGHT">
            <a href="/web/iso_codes/">EUR EU</a>
          </td><td align="RIGHT">11</td><td align="RIGHT">11</td>
        </tr>
        <tr class="footer">
          <td align="RIGHT">
            <a href="/web/iso_codes/">EUR OBS</a>
          </td><td align="RIGHT">11</td><td align="RIGHT">11</td>
        </tr>
      </tbody>
    </table>

    Tuesday, June 26, 2018 4:16 PM

All replies

  • I've looked into this further, and it is indeed what I suspected in the message above. I downloaded and built FSharp.Data and traced into the code. The type provider appears to hold on to the row converter it creates when it first looks at this page, fetched with the first movie id. When the same page is fetched again with a second movie id, the Table6 structure is different (three columns rather than the five shown for the first movie), but because the table has the same name in both page loads, the type provider appears to reuse the row converter created for the first one.
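
    To make the failure mode concrete, here is a minimal sketch in plain F#. It is not FSharp.Data's actual code: ReleaseRow, convertRow and tryConvert are hypothetical names standing in for a converter whose column positions were fixed from the five-column table, applied afterwards to a three-cell row.

```fsharp
// Illustrative stand-in for a row converter built from a five-column sample.
// ReleaseRow, convertRow and tryConvert are hypothetical, not FSharp.Data internals.
type ReleaseRow =
    { Market: string
      Distributor: string
      ReleaseDate: string
      Year: string
      Total: string }

// Reads cells by the positions fixed when the five-column table was inspected.
let convertRow (cells: string[]) =
    { Market = cells.[0]
      Distributor = cells.[1]
      ReleaseDate = cells.[2]
      Year = cells.[3]
      Total = cells.[4] }

// Cell values taken from the two HTML snippets in the question.
let firstFilmRow = [| "FR"; "Ad Vitam"; "25/02/2015"; "10 832"; "10 832" |]
let secondFilmRow = [| "PT"; "11"; "11" |]

// Wraps the conversion so the out-of-range access is reported rather than thrown.
let tryConvert cells =
    try
        Ok (convertRow cells)
    with :? System.IndexOutOfRangeException as ex ->
        Error ex.Message

let first = convertRow firstFilmRow    // five cells: converts normally
let second = tryConvert secondFilmRow  // three cells: cells.[3] is out of range
```

    The first row converts because it has the five cells the converter expects; the second fails as soon as the converter reaches for a fourth cell, which matches the exception I am seeing.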

    I am wondering whether there is some way to force the type provider to regenerate its row converter for Table6 on each page load. It seems incorrect for it to assume that the table structure will remain constant across all loads of the same page.
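
    One possible workaround, sketched here under assumptions rather than as a confirmed fix, is to skip the typed Table6 accessor for the details page and read the table through the untyped HtmlDocument API, so the column count comes from each page as loaded. readTable is a hypothetical helper, and selecting the table by the class "fixed_layout_100" is an assumption based on the snippets above; the real page may contain several tables with that class.

```fsharp
// Assumes the FSharp.Data package is referenced
// (in a script: #r "nuget: FSharp.Data").
open FSharp.Data

// Hypothetical helper: finds the first table whose class attribute equals cssClass
// and returns (headers, rows), with the column count taken from the page itself.
let readTable (cssClass: string) (doc: HtmlDocument) =
    doc.Descendants ["table"]
    |> Seq.tryFind (fun t -> t.AttributeValue("class") = cssClass)
    |> Option.map (fun table ->
        let headers =
            table.Descendants ["th"]
            |> Seq.map (fun th -> th.InnerText().Trim())
            |> Seq.toArray
        let rows =
            table.Descendants ["tr"]
            |> Seq.map (fun tr ->
                tr.Descendants ["td"]
                |> Seq.map (fun td -> td.InnerText().Trim())
                |> Seq.toArray)
            |> Seq.filter (fun cells -> cells.Length > 0) // drops the header row, which has no td cells
            |> Seq.toArray
        headers, rows)
```

    With this approach, the details pages would be fetched with HtmlDocument.AsyncLoad url instead of LumiereFilmReleaseDetail.AsyncLoad url, and each result passed to readTable "fixed_layout_100". The cost is that the cells come back as untyped strings, so any conversion has to be done by hand.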

    Wednesday, June 27, 2018 8:21 PM