Extract more than 10 lines of text from each Word / PDF document in a document library

  • Question

  • Is there a way to get more than 10 lines of text from each Word / PDF document stored in a document library in a site under a site collection? I need these lines of text to pass into an NLP (Natural Language Processing) tool.

    My thoughts:

    Option 1) Using server-side code, loop through each document in the library and read the text up to a certain size (20 KB, 40 KB, etc.). I know this works for Word docs, but I'm not sure whether it works for .pdf files.

    Option 2) Use the OOTB search: entering * in the search box returns all the Word docs and PDFs, but with this approach I'm getting only 3 lines of text per result. I'm researching whether I can change the display templates to get more text.

    Any help/thoughts much appreciated.


    Vijay Ji

    Thursday, February 2, 2017 9:18 PM

Answers

  • Hi,

    We can go with Option 1: use server-side code to read each document from the document library, then extract the text using C# code.
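    As a rough sketch of the "loop through the library" part, something like the following could work with the SharePoint server-side object model. The class name `LibraryReader`, the method `ForEachDocument`, and the callback shape are hypothetical names used for illustration, and the code assumes it runs on a SharePoint server with a reference to Microsoft.SharePoint.dll.

```csharp
using System;
using Microsoft.SharePoint; // server-side object model; available only on a SharePoint server

public static class LibraryReader
{
    // Hypothetical helper: calls handleDocument(serverRelativeUrl, bytes)
    // once for every .pdf or .docx file found in the given library.
    public static void ForEachDocument(string siteUrl, string libraryTitle,
                                       Action<string, byte[]> handleDocument)
    {
        using (SPSite site = new SPSite(siteUrl))
        using (SPWeb web = site.OpenWeb())
        {
            SPList library = web.Lists[libraryTitle];
            foreach (SPListItem item in library.Items)
            {
                SPFile file = item.File;
                if (file == null)
                {
                    continue; // not a file item
                }

                string name = file.Name.ToLowerInvariant();
                if (name.EndsWith(".pdf") || name.EndsWith(".docx"))
                {
                    // OpenBinary loads the whole file into memory; acceptable for a
                    // sketch, but large files may be better read via OpenBinaryStream.
                    handleDocument(file.ServerRelativeUrl, file.OpenBinary());
                }
            }
        }
    }
}
```

    The callback receives the raw bytes of each file. For PDFs, iTextSharp's PdfReader also has a constructor that accepts a byte array, so a fixed path on disk is not required. The text extraction itself is handled by methods like the two that follow.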

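    // PdfReader and PdfTextExtractor come from the iTextSharp library.
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    using Word = Microsoft.Office.Interop.Word;

    /// <summary>
    /// Reading Text from PDF document
    /// </summary>
    /// <returns>The text of all pages in the document.</returns>
    private string GetTextFromPDF()
    {
    	StringBuilder text = new StringBuilder();
    	using (PdfReader reader = new PdfReader("D:\\RentReceiptFormat.pdf"))
    	{
    		// iTextSharp page numbers are 1-based.
    		for (int i = 1; i <= reader.NumberOfPages; i++)
    		{
    			// AppendLine keeps the last line of one page from running into the first line of the next.
    			text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
    		}
    	}
    	return text.ToString();
    }

    /// <summary>
    /// Reading Text from Word document
    /// </summary>
    /// <returns>The text of all paragraphs in the document.</returns>
    private string GetTextFromWord()
    {
    	StringBuilder text = new StringBuilder();
    	object miss = System.Reflection.Missing.Value;
    	object path = @"D:\Articles2.docx";
    	object readOnly = true;
    	object saveChanges = false;
    	Word.Application word = new Word.Application();
    	try
    	{
    		Word.Document doc = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);

    		// The Paragraphs collection is 1-based.
    		for (int i = 1; i <= doc.Paragraphs.Count; i++)
    		{
    			text.Append(" \r\n " + doc.Paragraphs[i].Range.Text);
    		}

    		doc.Close(ref saveChanges, ref miss, ref miss);
    	}
    	finally
    	{
    		// Quit Word so that no WINWORD.EXE process is left running after each call.
    		word.Quit(ref saveChanges, ref miss, ref miss);
    	}

    	return text.ToString();
    }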
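
    Since only the first lines of each document are needed for the NLP tool, the extracted string can be trimmed afterwards. Below is a minimal sketch of such a helper; `TextTrimmer` and `FirstLines` are hypothetical names, not part of any library. It skips blank lines, and for Word output each paragraph counts as one line.

```csharp
using System;
using System.IO;
using System.Text;

public static class TextTrimmer
{
    // Returns at most maxLines non-blank lines from the extracted text,
    // trimmed and joined with "\r\n".
    public static string FirstLines(string text, int maxLines)
    {
        if (string.IsNullOrEmpty(text) || maxLines <= 0)
        {
            return string.Empty;
        }

        var result = new StringBuilder();
        int taken = 0;

        // StringReader.ReadLine splits on "\r", "\n" and "\r\n", which covers
        // both the PDF text and the paragraph marks returned by Word.
        using (var reader = new StringReader(text))
        {
            string line;
            while (taken < maxLines && (line = reader.ReadLine()) != null)
            {
                string trimmed = line.Trim();
                if (trimmed.Length == 0)
                {
                    continue; // skip blank lines
                }

                if (taken > 0)
                {
                    result.Append("\r\n");
                }
                result.Append(trimmed);
                taken++;
            }
        }

        return result.ToString();
    }
}
```

    For example, TextTrimmer.FirstLines(GetTextFromPDF(), 15) would return up to the first 15 non-blank lines of the PDF text.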

    More information:

    http://www.c-sharpcorner.com/blogs/reading-contents-from-pdf-word-text-files-in-c-sharp1

    https://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C

    Best Regards,

    Dennis




    Friday, February 3, 2017 2:35 AM
    Moderator

All replies

  • Thanks, Dennis, for the reply! Sorry for the delay.

    Vijay Ji

    Wednesday, May 15, 2019 1:27 AM