number of words in pdf document

  • Question

  • I'm using AcroAXPDF to view my PDF documents. How do I find the number of words in a PDF document?
    Monday, June 26, 2017 11:28 AM

All replies

  • The following uses iTextSharp, available on NuGet:

    https://www.nuget.org/packages/iTextSharp/

    I placed all the code in a form, but if this works for you, consider moving it into a class and calling it from there. Note: if the PDF is large or contains a great deal of text, consider wrapping the call with Async/Await so the UI stays responsive.

    Imports System.IO
    Imports System.Text
    Imports iTextSharp.text.pdf.parser
    
    Public Class Form1
        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
            ' Here the PDF is in the same folder as the executable
            Dim fileName = IO.Path.Combine(
                AppDomain.CurrentDomain.BaseDirectory, "MicrosoftTheSecurityDevelopmentLifecycle.pdf")
            Dim pdfContents As String = ExtractAllTextFromPdf(fileName)
            Dim wordCount As Integer = GetWordCountFromString(pdfContents)
            MessageBox.Show($"Word count {wordCount}")
        End Sub
        Public Function ExtractAllTextFromPdf(ByVal inputFile As String) As String
            'Sanity checks
            If String.IsNullOrEmpty(inputFile) Then
                Throw New ArgumentNullException("inputFile")
            End If
            If Not File.Exists(inputFile) Then
                Throw New FileNotFoundException("Cannot find inputFile", inputFile)
            End If
    
            'Open a file stream (not necessary but I like to control locks and permissions)
            Using SR As New FileStream(inputFile, FileMode.Open, FileAccess.Read, FileShare.Read)
                'Create a reader to read the PDF
                Dim reader As New iTextSharp.text.pdf.PdfReader(SR)
    
                'Create a buffer to store text
                Dim sb As New StringBuilder()
    
                'Use the PdfTextExtractor to get all of the text on a page-by-page basis
                For i As Integer = 1 To reader.NumberOfPages
                    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i))
                Next i
    
                'Release the reader's resources before returning
                reader.Close()
                Return sb.ToString()
            End Using
        End Function
        Public Function GetWordCountFromString(ByVal text As String) As Integer
            'Sanity check
            If String.IsNullOrEmpty(text) Then
                Return 0
            End If
    
            'Count the words: each run of non-whitespace characters is one word
            Return RegularExpressions.Regex.Matches(text, "\S+").Count
        End Function
    
    End Class
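
    As a rough sketch of the Async/Await suggestion above: the module below is hypothetical (the names AsyncWordCount, CountWords and CountWordsAsync are not part of iTextSharp or of the form code). It assumes the text extractor is passed in as a delegate, for example AddressOf ExtractAllTextFromPdf, and runs it with Task.Run so the extraction does not block the UI thread.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports System.Threading.Tasks

' Hypothetical helper module; names are illustrative only.
Public Module AsyncWordCount

    ' Same counting rule as GetWordCountFromString:
    ' every run of non-whitespace characters is one word.
    Public Function CountWords(text As String) As Integer
        If String.IsNullOrEmpty(text) Then
            Return 0
        End If
        Return Regex.Matches(text, "\S+").Count
    End Function

    ' Runs extraction and counting on a thread-pool thread.
    ' extractText is an assumed parameter: pass the function that
    ' turns a file name into the document's text.
    Public Function CountWordsAsync(fileName As String,
                                    extractText As Func(Of String, String)) As Task(Of Integer)
        Return Task.Run(Function() CountWords(extractText(fileName)))
    End Function

End Module
```

    In the form, the click handler would then be declared as Private Async Sub Button1_Click(...) and call Dim wordCount = Await CountWordsAsync(fileName, AddressOf ExtractAllTextFromPdf) before showing the message box.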
    


    Please remember to mark the replies as answers if they help and unmark them if they provide no help; this will help others who are looking for solutions to the same or a similar problem. Contact me via my Twitter (Karen Payne) or Facebook (Karen Payne) through my MSDN profile, but I will not answer coding questions on either.
    VB Forums - moderator

    Monday, June 26, 2017 12:52 PM
    Moderator
  • Hello audeamus,

    You can first extract the text from the PDF document and then count the words.

    The example below shows how to extract text from a PDF using a free PDF DLL; you could give it a try. Hope it's helpful.

    // Requires System.IO and System.Text, plus the PDF library's own namespace
    // Load the PDF file
    PdfDocument doc = new PdfDocument();
    doc.LoadFromFile(@"E:\Program Files\Sample.pdf");
    // Collect the text of every page
    StringBuilder s = new StringBuilder();
    foreach (PdfPageBase page in doc.Pages)
    {
        s.AppendLine(page.ExtractText());
    }
    // Write the extracted text to a TXT file
    File.WriteAllText("Extract text.txt", s.ToString());
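
    The counting step itself is not shown above. A minimal VB.NET sketch, assuming the text has already been written to a file such as "Extract text.txt" and that a "word" is any run of non-whitespace characters (the module and function names are hypothetical):

```vb
Imports System
Imports System.IO
Imports System.Text.RegularExpressions

' Hypothetical helper; counts words in a previously extracted text file.
Public Module ExtractedTextCounter

    Public Function CountWordsInFile(path As String) As Integer
        ' Read the whole extracted text into memory
        Dim text As String = File.ReadAllText(path)
        ' Each run of non-whitespace characters counts as one word
        Return Regex.Matches(text, "\S+").Count
    End Function

End Module
```

    For example, CountWordsInFile("Extract text.txt") would return the count for the file written by the snippet above. Note that this rule also counts stand-alone punctuation, such as a lone dash, as a word.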

    Tuesday, June 27, 2017 8:17 AM