Extracting links from an HTML file

  • Question

  • Hello guys,

    I have a huge HTML file containing thousands of image links, and I want to extract all the links in <img> tags.

    I have written some code:

    htmlcode = "<html><head><img src=http://imagesource.com/image.jpeg><img src=http://imagesource.com/image.jpeg></head></html>"
    openTag = "<img"
    closingTag = ">"
    index = Text.GetIndexOf(htmlcode,openTag)
    While char <> closingTag
      char = Text.GetSubText(htmlcode,index,1)
      TextWindow.Write(char)
      index = index + 1
    EndWhile

    This code extracts the whole <img> tag, but only the first one.

    How can I make it extract both tags in the HTML code?

    @LitDev, Nonki

    Please help me to sort out this problem.


    Merry Xmas!


    • Edited by 4mir '- Friday, March 29, 2013 1:10 PM
    Friday, March 29, 2013 1:08 PM

All replies

  • Since Text.GetIndexOf() only returns the position of the 1st occurrence of openTag in htmlcode,

    I suggest that you cut the already-processed part off htmlcode after each match, so the next Text.GetIndexOf() call finds the next occurrence of openTag in the string!
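
    A minimal sketch of why cutting helps, using a shortened sample string that is not from the original post (the variable name firstPos is also just illustrative). Text.GetIndexOf() returns the position of the first match, or 0 when there is none, so trimming the string past a match exposes the next one:

    ```smallbasic
    ' Hypothetical, shortened sample string (an assumption for illustration)
    htmlcode = "<img src=a.jpeg><img src=b.jpeg>"
    openTag = "<img"

    ' Position of the first "<img" in the string
    firstPos = Text.GetIndexOf(htmlcode, openTag)
    TextWindow.WriteLine(firstPos)

    ' Drop everything up to and including the first "<img"
    htmlcode = Text.GetSubTextToEnd(htmlcode, firstPos + Text.GetLength(openTag))

    ' The same call now reports the second tag's position in the shortened string
    TextWindow.WriteLine(Text.GetIndexOf(htmlcode, openTag))
    ```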


    Click on "Propose As Answer" if some post solves your problem or "Vote As Helpful" if some post has been useful to you! (^_^)

    Friday, March 29, 2013 1:21 PM
    Answerer
  • I was thinking the same thing, but I still can't figure out how to do it.

    Maybe like this?

    ' index points just past the closing ">" once the first loop has finished
    For i = index To Text.GetLength(htmlcode)
      char = Text.GetSubText(htmlcode, i, 1)
      TempCode = Text.Append(TempCode, char)
    EndFor
    htmlcode = TempCode
    TempCode = ""


    Merry Xmas!

    Friday, March 29, 2013 1:39 PM
  • Use the returned index position as a mark to cut off everything from the left!

    index = Text.GetIndexOf( htmlcode, openTag )

    htmlcode = Text.GetSubTextToEnd( htmlcode, index )
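
    One way these two lines could be put into a loop, as a rough sketch rather than a definitive solution. It reuses the sample string and the openTag / closingTag names from the question; tagEnd is a hypothetical helper variable, and the check for a missing ">" is an added assumption to keep the loop from running forever:

    ```smallbasic
    htmlcode = "<html><head><img src=http://imagesource.com/image.jpeg><img src=http://imagesource.com/image.jpeg></head></html>"
    openTag = "<img"
    closingTag = ">"

    index = Text.GetIndexOf(htmlcode, openTag)
    While index > 0
      ' Cut off everything to the left of the tag
      htmlcode = Text.GetSubTextToEnd(htmlcode, index)
      ' The tag now starts at position 1; find where it ends
      tagEnd = Text.GetIndexOf(htmlcode, closingTag)
      If tagEnd = 0 Then
        ' Unterminated tag: stop instead of looping forever
        index = 0
      Else
        ' Print the whole tag, from "<img" through ">"
        TextWindow.WriteLine(Text.GetSubText(htmlcode, 1, tagEnd))
        ' Drop the tag just printed, then look for the next one
        htmlcode = Text.GetSubTextToEnd(htmlcode, tagEnd + 1)
        index = Text.GetIndexOf(htmlcode, openTag)
      EndIf
    EndWhile
    ```

    The loop ends when Text.GetIndexOf() returns 0, meaning no further openTag is left in what remains of htmlcode.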


    Click on "Propose As Answer" if some post solves your problem or "Vote As Helpful" if some post has been useful to you! (^_^)

    Friday, March 29, 2013 1:57 PM
    Answerer
  • TextWindow.Title = "Link extractor from html file"
    htmlcode = File.ReadContents("E:\html.txt")
    openTag = "<img src="
    opentagLen = Text.GetLength(openTag) + 1
    closingTag = Text.GetCharacter(34)
    index = Text.GetIndexOf(htmlcode,openTag) + opentagLen
    count = 0
    While index <> opentagLen
      index = Text.GetIndexOf(htmlcode,openTag) + opentagLen
      If index <> opentagLen Then
        While char <> closingTag
          char = Text.GetSubText(htmlcode,index,1)
          If char <> closingTag Then
            link[count] = Text.Append(link[count],char)
          EndIf
          index = index + 1
        EndWhile
      EndIf
      char = ""
      htmlcode = Text.GetSubTextToEnd(htmlcode,index)
      count = count + 1
    EndWhile
    For i = 0 To Array.GetItemCount(link) - 1
      TextWindow.WriteLine(link[i])
    EndFor

    I finally did it... Thanks a lot, GotoLoop, for your help.

    Here is the screenshot.


    Merry Xmas!

    Friday, March 29, 2013 3:47 PM