search program and file import

  • Question

  • I'm looking to make a program that can search and modify pages in a PDF file and then save the result as a new PDF file. I'm not sure I'm explaining this correctly, so here are the details. I have PDF files of about 500 pages each. Every page has an address on it, but pages with the same address are not necessarily consecutive: the same address may be on page 2 and then again on page 300. I'd like to load the PDF, enter a full or partial address, have the program search the whole file and delete every page that matches the entered text, and then save the edited PDF as a new file. Could someone point me in the right direction? I have some working knowledge of VB but have never created anything like this.

    thanks

    Thursday, October 12, 2017 3:10 PM

All replies

  • I feel sure you're not likely to accept this as the answer, but to do what you want, you really should look to third party stuff like this one:

    https://www.aspose.com/products/pdf/net?gclid=EAIaIQobChMI-6S4-bPr1gIVg1uGCh0YOQPtEAAYASAAEgKGx_D_BwE

    I've never used Aspose products but I know that Karen has endorsed them. They're very much not free though.


    "A problem well stated is a problem half solved.” - Charles F. Kettering

    Thursday, October 12, 2017 3:34 PM
  • Thanks, Frank. I took a quick look, and 3 grand is a little pricey. Would it be easier if the files were .doc or .txt? I can convert these PDFs to either one.

    thanks

    Thursday, October 12, 2017 3:39 PM
  • Gosh yes - a plain text file is easy to parse using a variety of things.

    Here's an easy one:

    Using rdr As New System.IO.StreamReader("filePathHere")
        Do While rdr.Peek() >= 0
            Dim itm As String = rdr.ReadLine.Trim
            ' Now process the line "itm"
        Loop
    End Using
    

    You might want to put a breakpoint in and step into that. That's set to read each line and the result of that line is stored in the variable "itm".
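
    As a rough idea of where that loop could go next, here is a minimal sketch that records which lines contain a full or partial address, ignoring case. The module name LineSearch and the function FindMatchingLines are made up for illustration; they are not from any library.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module LineSearch
    ' Hypothetical helper: returns the zero-based numbers of the lines
    ' that contain searchText, ignoring case.
    Function FindMatchingLines(filePath As String, searchText As String) As List(Of Integer)
        Dim matches As New List(Of Integer)
        Dim lineNumber As Integer = 0

        Using rdr As New StreamReader(filePath)
            Do While rdr.Peek() >= 0
                Dim itm As String = rdr.ReadLine().Trim()

                ' IndexOf with OrdinalIgnoreCase acts as a case-insensitive "contains"
                If itm.IndexOf(searchText, StringComparison.OrdinalIgnoreCase) >= 0 Then
                    matches.Add(lineNumber)
                End If

                lineNumber += 1
            Loop
        End Using

        Return matches
    End Function
End Module
```

    Note that an empty search string matches every line, so the caller would probably want to reject blank input first.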

    Try that and let's see how you make out.



    Thursday, October 12, 2017 3:45 PM
  • Yes, Aspose is pricey, yet worth every penny if you consider what it would take to go another route, such as using a free library or coding it yourself.

    I created an extraction program in VB.NET that uses Aspose in tandem with my own code to parse extremely large PDF documents and extract pages, based on business logic, into separate smaller PDF documents. I created this in 2005 and it is still running today with zero modifications. I have also used Aspose to obtain records from PDF documents to store in database tables.

    In yet another data-extraction program I found a bug in the Aspose library, reported the issue, and had a fix in a new build two days later.

    I've also used their Excel library in the past, and it rocks too.


    Please remember to mark the replies as answers if they help and unmark them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile but will not answer coding question on either.
    VB Forums - moderator
    profile for Karen Payne on Stack Exchange, a network of free, community-driven Q&A sites

    Thursday, October 12, 2017 4:27 PM
  • Thank you so much, Karen. I'm really trying to find the easiest way to sort through these files and remove the pages I don't need based on addresses, so that I can just enter an address or a list of addresses to remove. I can convert the original document to a variety of formats, whichever would be easiest to work with.
    Thursday, October 12, 2017 6:54 PM
  • Not sure what to tell you here.

    At a high level, the processing via Aspose is: open the file, which provides a Stream; loop through the stream, using HasNextPage to determine whether there is another page; then call GetNextPageText, which I convert to a string to parse. Each GetNextPageText call represents one page in the PDF, so if I parsed a page and didn't want it, I'd simply skip it (this is the part that applies to you).

    My process was to associate, say, a company with a number of pages, which I would store in an Integer property. After all processing was done, a second pass split the pages, relying on a List that held page numbers and other details to create the new PDF documents.
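
    The "parse each page, skip the ones you don't want" idea can be sketched without any PDF library at all. The code below assumes the text of each page has already been extracted into a list of strings, one per page; how that extraction happens is left open, and no Aspose calls are used. PageFilter and PagesToKeep are hypothetical names.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module PageFilter
    ' pageTexts holds the extracted text of each page, in page order.
    ' Returns the 1-based numbers of the pages that mention none of the addresses.
    Function PagesToKeep(pageTexts As IList(Of String), addresses As IEnumerable(Of String)) As List(Of Integer)
        Dim keep As New List(Of Integer)

        For i As Integer = 0 To pageTexts.Count - 1
            Dim text As String = pageTexts(i)

            ' True when any address occurs in this page, ignoring case
            Dim hit As Boolean = addresses.Any(
                Function(a) text.IndexOf(a, StringComparison.OrdinalIgnoreCase) >= 0)

            If Not hit Then keep.Add(i + 1)
        Next

        Return keep
    End Function
End Module
```

    The returned page numbers play the same role as the List of page numbers described above: a later step would use them to build the new document.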



    Thursday, October 12, 2017 7:19 PM
  • So you don't plan on doing any of this yourself then?

    *****

    Change it to a text file, parse through the lines and put a breakpoint in so you can figure out the next move.

    I'd use a collection class and set it up to recognize and ignore duplicates.

    That's just a thought though - only you see your document, not us.
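
    For what it's worth, one collection class that ignores duplicates is HashSet(Of String). This is only a sketch of one possible choice; the module and function names are made up, and trimming plus case-insensitive comparison are assumptions about how the addresses should be matched.

```vb
Imports System
Imports System.Collections.Generic

Module AddressSet
    ' Hypothetical helper: collects addresses, ignoring blank entries
    ' and duplicates that differ only in case or surrounding spaces.
    Function UniqueAddresses(entries As IEnumerable(Of String)) As HashSet(Of String)
        ' The comparer makes "12 Main St" and "12 main st" count as the same entry
        Dim addresses As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)

        For Each entry As String In entries
            If Not String.IsNullOrWhiteSpace(entry) Then
                ' Add returns False (and changes nothing) when the value is already present
                addresses.Add(entry.Trim())
            End If
        Next

        Return addresses
    End Function
End Module
```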



    Thursday, October 12, 2017 7:22 PM
  • I'd like the program to do it all: I import the file, enter the addresses to search for, and the program removes the pages that contain those addresses. Once again, thanks for all your help.
    Thursday, October 12, 2017 11:37 PM
  • JH,

    I wasn't walking away, but if you'll convert your file to a plain text file, you'll be miles ahead already.

    Using a StreamReader, you can parse that text and then from there, it's hard to say what to do because only you know what's in those files.

    Still though, have a look and see how far you get.

    Private Sub ReadTheTextFile(ByVal filePath As String)

        ' Ignore empty or missing paths
        If Not String.IsNullOrWhiteSpace(filePath) Then
            Dim fi As New IO.FileInfo(filePath)

            If fi.Exists Then
                Using rdr As New System.IO.StreamReader(fi.FullName)
                    ' Peek returns -1 once the end of the file is reached
                    Do While rdr.Peek() >= 0
                        Dim itm As String = rdr.ReadLine.Trim

                        ' Pause in the debugger on every non-blank line
                        If Not String.IsNullOrWhiteSpace(itm) Then
                            Stop
                        End If
                    Loop
                End Using
            End If
        End If

    End Sub

    "Stop" will work much like a breakpoint. When it does, hover your mouse over "itm" and you'll see the contents.

    Hopefully that will lead you to the next logical step.

    Try it out and we'll talk more tomorrow. :)



    Thursday, October 12, 2017 11:54 PM
  • Okay, here is a thought, purely conceptual, on how one might start splitting the data up into chunks. In this case our marker is any line containing "Page".

    Our file

    Page 1
    Yada Yada Ydata
    
    Page 2
    I'm on page two
    
    
    Page 3
    Last page
    
    Bye

    Okay, I put this together in about thirty minutes, with no regard for anything beyond very basic checks and with no optimization; we are in conceptual mode here (and it does work).

    Helper concrete class

    Public Class LineInfo
        Public Property Index As Integer
        Public Property Line As String
    End Class

    Code

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        ' File to read, expected next to the executable
        Dim fileName As String = IO.Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "TextFile1.txt")

        ' File contents, each line paired with its zero-based index
        Dim contents As List(Of LineInfo) = IO.File.ReadAllLines(fileName) _
            .Select(Function(line, index) New LineInfo With {.Line = line, .Index = index}) _
            .ToList

        ' Print the file contents with their indices
        For Each item In contents
            Console.WriteLine($"{item.Index,-4} {item.Line}")
        Next

        ' Marker lines: any line containing "Page" (case-sensitive)
        Dim pageIndices = contents.Where(Function(item) item.Line.Contains("Page")).ToList

        If pageIndices.Count = 0 Then
            Console.WriteLine("No page markers found")
            Return
        End If

        ' Every page except the last runs from the line after its marker
        ' up to the line before the next marker
        For x As Integer = 0 To pageIndices.Count - 2
            Dim first As Integer = pageIndices(x).Index + 1
            Dim count As Integer = pageIndices(x + 1).Index - first
            Dim lineNumbers As List(Of Integer) = Enumerable.Range(first, count).ToList
            Console.WriteLine("Page line numbers: " & String.Join(",", lineNumbers))
        Next

        ' The last page runs from the line after its marker to the end of the file
        Console.WriteLine($"Last page, from : {pageIndices(pageIndices.Count - 1).Index + 1} to {contents.Count - 1}")
    End Sub

    The code above first prints the file contents with their line indices, then prints the line numbers that belong to each page, excluding the "Page" marker lines themselves. The last page is reported as a range that starts on the line after its marker.

    Of course, this should all be in its own class. Once you have the contents of each page, using the indices and the file contents, you can then apply whatever logic your business needs call for.
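
    To make that last step concrete, here is one possible sketch, assuming the PDF has been converted to text and every page begins with a line containing a marker such as "Page". It groups the lines into pages and returns only the lines of pages that do not contain the address. PageRemover and RemovePagesContaining are hypothetical names.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module PageRemover
    ' Splits lines into pages at every line containing marker, then returns
    ' the lines of the pages that do not contain address (ignoring case).
    ' Lines before the first marker are treated as a page of their own.
    Function RemovePagesContaining(lines As IEnumerable(Of String), marker As String, address As String) As List(Of String)
        Dim pages As New List(Of List(Of String))

        For Each line As String In lines
            ' A marker line starts a new page; so does the very first line
            If pages.Count = 0 OrElse line.Contains(marker) Then
                pages.Add(New List(Of String))
            End If
            pages(pages.Count - 1).Add(line)
        Next

        Dim kept As New List(Of String)
        For Each page As List(Of String) In pages
            Dim hit As Boolean = page.Any(
                Function(l) l.IndexOf(address, StringComparison.OrdinalIgnoreCase) >= 0)

            ' Keep the whole page, marker line included, only when nothing matched
            If Not hit Then kept.AddRange(page)
        Next

        Return kept
    End Function
End Module
```

    The result could then be saved with IO.File.WriteAllLines. One caveat: a plain "Page" marker would also split on an address such as "Page Street", so a real document probably needs a stricter marker test.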




    Friday, October 13, 2017 12:43 AM