Powershell to Loop Through Text File of URLs and Extract Needed Metadata String

  • Question

  • All,

    I'm trying to loop through a text file of URLs and extract certain metadata from the .html files, and I need some help on the best way to do this. The script below doesn't return anything, but it doesn't error either. I need to use the text file of URLs because the library is pretty large and I got out-of-memory errors when I ran this against the full site collection URL:

    $data=Get-Content "C:\Powershell\urlstoparse.txt"
    $sp = 'https://kmo.us.name.com/Sites/Tickets/'

     Write-Host "Site count" is  $data.Count
     
    function Get-DocInventory([string]$data) {

        $web = Get-SPWeb -Identity $sp
     
        foreach ($d in $data) {
     if ($list.BaseType -ne "DocumentLibrary") {
      continue
      }
      foreach ($item in $list.Items) {
       if($item.Name -like '*.html*'){
        $file = $item.File
        #get binary data, and decode into text
        $dt   = $file.OpenBinary()
        $encode = New-Object System.Text.ASCIIEncoding
        $text = $encode.GetString($dt)
        $url=$text|select-string -pattern '(http[s]?)(:\/\/)(www\.kmo\.com\/uploads)([^\s,]+)(?=")' -AllMatches | % { $_.Matches } | % { $_.Value }
        $dt = @{
         "Item URL" =  $item.Url
         "Item Name" = $list.Title
         "HTML File" = $url
        }
        New-Object PSObject -Property $dt
       }  
      }
     }
    }
    Get-DocInventory $data | Export-Csv -NoTypeInformation -Path "C:\Powershell\Tickets_Detail_Parse.csv"

    Wednesday, September 11, 2019 9:11 PM

Answers

  • Hi kmoneill,

    You could try the following script:

    # Read the file URLs, one per line
    $data = Get-Content "C:\Powershell\urlstoparse.txt"
    $sp = 'https://kmo.us.name.com/Sites/Tickets/'
    Write-Host "Site count" is  $data.Count
    $web = Get-SPWeb $sp
    $collection = @()
    foreach ($d in $data) {
        $file = $web.GetFile($d)
        # Get binary data, and decode into text
        $dt = $file.OpenBinary()
        $encode = New-Object System.Text.ASCIIEncoding
        $text = $encode.GetString($dt)
        $url = $text | Select-String -Pattern '(http[s]?)(:\/\/)(www\.kmo\.com\/uploads)([^\s,]+)(?=")' -AllMatches | % { $_.Matches } | % { $_.Value }

        $ExportItem = New-Object PSObject
        $ExportItem | Add-Member -MemberType NoteProperty -Name "Item URL" -Value $file.Url
        $ExportItem | Add-Member -MemberType NoteProperty -Name "Item Name" -Value $file.Item.ParentList.Title
        # Property names must be unique, so the matched links go in "HTML File";
        # multiple matches are joined so the CSV cell shows the URLs themselves
        $ExportItem | Add-Member -MemberType NoteProperty -Name "HTML File" -Value ($url -join '; ')
        $collection += $ExportItem
    }
    $collection | Export-Csv "C:\Powershell\Tickets_Detail_Parse.csv" -NoTypeInformation
    # Release the SPWeb object when done
    $web.Dispose()
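Since the original question mentions out-of-memory errors, one variation worth considering is to emit each row to the pipeline instead of appending to `$collection`, and to skip files that cannot be read. The following is only a sketch under stated assumptions: `Get-UploadLinkRow`, its `-ReadBytes` parameter, and the `example.invalid` URLs are hypothetical, and the script block stands in for a SharePoint call such as `$web.GetFile($u).OpenBinary()` so that the example runs without SharePoint. The pattern is a simplified form of the one in the script above.

```powershell
# Hypothetical helper, not part of any SharePoint module.
# $ReadBytes stands in for { param($u) $web.GetFile($u).OpenBinary() }.
function Get-UploadLinkRow {
    param(
        [string[]]$Url,
        [scriptblock]$ReadBytes,
        [string]$Pattern = 'https?://www\.kmo\.com/uploads[^\s,"]+'
    )
    foreach ($u in $Url) {
        try {
            $bytes = [byte[]](& $ReadBytes $u)
        }
        catch {
            # Report the failure on the warning stream and move on to the next URL.
            Write-Warning "Skipping $u : $($_.Exception.Message)"
            continue
        }
        $text  = [System.Text.Encoding]::ASCII.GetString($bytes)
        $links = @([regex]::Matches($text, $Pattern) | ForEach-Object { $_.Value })
        # Emit the row to the pipeline instead of appending to an array.
        [PSCustomObject]@{
            'Item URL'  = $u
            'HTML File' = $links -join '; '
        }
    }
}

# In-memory sample data used in place of SharePoint files.
$sample = @{
    'https://example.invalid/a.html' = '<a href="https://www.kmo.com/uploads/a.pdf">A</a>'
}
$reader = {
    param($u)
    if (-not $sample.ContainsKey($u)) { throw "file not found" }
    [System.Text.Encoding]::ASCII.GetBytes($sample[$u])
}
$urls = 'https://example.invalid/a.html', 'https://example.invalid/missing.html'
$rows = @(Get-UploadLinkRow -Url $urls -ReadBytes $reader)
```

In a real run the function output could be piped straight into `Export-Csv`, so rows are written as they are produced rather than held in one large array.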

    Feel free to let me know if there are any issues.

    Best Regards,

    Michael Han


    Please remember to mark the replies as answers if they helped. If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.


    • Marked as answer by kmoneill Friday, September 13, 2019 1:35 PM
    Friday, September 13, 2019 8:05 AM

All replies

  • Hi kmoneill,

    In your script, in the line "if ($list.BaseType -ne "DocumentLibrary")", how do you define the variable $list? Also, what kind of URLs are in your text file: library URLs or file URLs? Please share more details.

    Do you want to export the URLs of all html files in the site https://kmo.us.name.com/Sites/Tickets/ to a csv?
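A possible explanation for the script returning nothing without an error can be sketched in plain PowerShell. Two things appear to be involved: `$list` is never assigned, so `$list.BaseType` is `$null` and the `-ne` test is always true, which sends every iteration to `continue`; and the `[string]$data` parameter flattens the array of lines into a single string. The function names and `example.invalid` URLs below are hypothetical and only illustrate the behaviour.

```powershell
# Strict mode is off by default; set it explicitly so the sketch behaves the same everywhere.
Set-StrictMode -Off

# Hypothetical stand-in for the lines read from urlstoparse.txt.
$lines = @('https://example.invalid/a.html', 'https://example.invalid/b.html')

# A [string] parameter joins the array into one space-separated string,
# so the loop body runs once no matter how many lines were passed.
function Get-CountAsString([string]$data) {
    $n = 0
    foreach ($d in $data) { $n++ }
    $n
}

# A [string[]] parameter keeps one element per line.
function Get-CountAsArray([string[]]$data) {
    $n = 0
    foreach ($d in $data) { $n++ }
    $n
}

$asString = Get-CountAsString $lines
$asArray  = Get-CountAsArray $lines

# Make sure $list really is undefined, as in the original script.
Remove-Variable -Name list -ErrorAction SilentlyContinue
# $null -ne 'DocumentLibrary' is true, so the "continue" branch is always taken.
$skipsEveryItem = ($list.BaseType -ne 'DocumentLibrary')
```

With `Set-StrictMode -Version Latest` at the top of the script, the reference to the unassigned `$list` raises an error instead of silently evaluating to `$null`, which makes this kind of problem easier to spot.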

    Best Regards,

    Michael Han



    Thursday, September 12, 2019 7:47 AM
  • OK - thanks - so the .txt file has a list of URLs in it, one per line, like this:

    https://kmo.us.name.com/Sites/Tickets/new_tickets/IN55336/628894_129875sl14pt2.html
    https://kmo.us.name.com/Sites/Tickets/new_tickets/IN055337/628895_12_9875sl8pt1.html
    https://kmo.us.name.com/Sites/Tickets/new_tickets/IN055337/628895_129875sl14pt2.html

    So what I am trying to do is read the full URLs from the .txt file, then loop through each .html file, look for text in the .html that matches the regex, and output the matches to a .csv file.

    In the .csv I have three fields in the end:

    1) The source .html file example: https://kmo.us.name.com/Sites/Tickets/new_tickets/IN55336/628894_129875sl14pt2.html

    2) The string match URL from the regex example:

    http://kmo.com/uploads/files/updated7.pdf

    3) The document library where the .html file resides - example:

    new_tickets
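The three fields described above can be sketched without SharePoint, assuming the file contents have already been decoded to text. The HTML fragment is a hypothetical sample, and the `$relaxed` pattern is an assumed variant: the pattern used in the thread requires a `www.` prefix, while the example match given above (`http://kmo.com/uploads/files/updated7.pdf`) has none. The library name here is taken from the URL path, which may differ from the library's display title.

```powershell
# Values taken from the thread.
$sp = 'https://kmo.us.name.com/Sites/Tickets/'
$d  = 'https://kmo.us.name.com/Sites/Tickets/new_tickets/IN55336/628894_129875sl14pt2.html'

# Hypothetical HTML fragment standing in for the decoded file contents.
$text = '<a href="http://kmo.com/uploads/files/updated7.pdf">PDF</a> <a href="https://www.kmo.com/uploads/a.pdf">A</a>'

# Pattern from the thread: requires the "www." prefix.
$strict  = '(http[s]?)(:\/\/)(www\.kmo\.com\/uploads)([^\s,]+)(?=")'
# Assumed variant: "www." is optional and the match stops at the closing quote.
$relaxed = 'https?://(?:www\.)?kmo\.com/uploads[^\s,"]+'

$strictLinks  = @([regex]::Matches($text, $strict)  | ForEach-Object { $_.Value })
$relaxedLinks = @([regex]::Matches($text, $relaxed) | ForEach-Object { $_.Value })

# First path segment after the web URL, e.g. "new_tickets".
$library = $d.Substring($sp.Length).Split('/')[0]

# One CSV row per source file; multiple matches are joined into one cell.
$row = [PSCustomObject]@{
    'Item URL'  = $d
    'Item Name' = $library
    'HTML File' = $relaxedLinks -join '; '
}
$csv = $row | ConvertTo-Csv -NoTypeInformation
```

On this sample the stricter pattern finds only the `www.kmo.com` link, while the relaxed one finds both, so it is worth checking which host form the real files use before choosing a pattern.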

    Any additional questions, let me know.

    Thursday, September 12, 2019 12:45 PM
  • Excellent - thanks!
    Friday, September 13, 2019 1:36 PM