Remove strange characters when reading doc file in C#?

    Question

  • I am using C# and the Microsoft Word 12.0 Object Library to read data from a .doc file and then save the content to a text file (this is required by my project). My .doc file has some tables, and I need to read each row and column in those tables. The read operations succeed, but the data contains some strange characters (such as square ones), as in the attached image.

    Here is the code I used:

    private void btnRead_Click(object sender, EventArgs e)
    {
        try
        {
            Microsoft.Office.Interop.Word.ApplicationClass wordObject = new ApplicationClass();
            object file = textBox1.Text; //this is the path
            object nullobject = System.Reflection.Missing.Value;
            Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open
                (ref file, ref nullobject, ref nullobject, ref nullobject,
                ref nullobject, ref nullobject, ref nullobject, ref nullobject,
                ref nullobject, ref nullobject, ref nullobject, ref nullobject,
                ref nullobject, ref nullobject, ref nullobject, ref nullobject);
    
            docs.ActiveWindow.Selection.WholeStory();
            docs.ActiveWindow.Selection.Copy();
            IDataObject data = Clipboard.GetDataObject();
            String allData = "";
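            // Word collections are 1-based, so use <= to include the last table.
            for (int t = 1; t <= docs.Tables.Count; t++)
            {
                Table tbl = docs.Tables[t];
                // Likewise, <= so the last row is not skipped.
                for (int r = 1; r <= tbl.Rows.Count; r++)
                {
                    // Reads only the first two columns of each row.
                    for (int c = 1; c < 3; c++)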
                    {
                        allData += tbl.Cell(r, c).Range.FormattedText.Text.Trim() + Environment.NewLine;
                    }
                }
            }
            txtData.Text = allData;
            saveTextFile(allData);
    
            docs.Close(ref nullobject, ref nullobject, ref nullobject);
        }
        catch (Exception j)
        {
            MessageBox.Show(j.Message);
        }
    }
    
    private void saveTextFile(String data)
    { 
        try
        {
            // using flushes and closes the writer even if WriteLine throws.
            using (StreamWriter sw = new StreamWriter(txtOutput.Text.Trim()))
            {
                sw.WriteLine(data);
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.StackTrace);
        }
    }
    


    The strange characters do not always appear as the last two characters: in some lines there are three of them, and in others they appear in the middle.

    Does anyone know how to remove such characters?

    Friday, September 02, 2011 1:11 AM

Answers

  • That's just 0x0D (i.e. \r, carriage return, in C escape notation), nothing special. You may also see a lot of 0x07 (i.e. \a, the bell character) in other parts.

    While you can just use String.Replace() to remove or replace them, you have to work out what they are used for yourself.

    For the specification of the Word file format, you can visit here.
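
    As a rough sketch of the String.Replace() approach mentioned above: Trim() in the question's code does not help, because 0x07 is a control character rather than whitespace, so a trailing \r\a pair is left in place. The helper below is hypothetical (WordCellText and Clean are names made up for this example, not part of the Word object model). It assumes that 0x07 can simply be dropped and that 0x0D and 0x0B (a manual line break) inside a cell should become spaces; adjust that to whatever your output needs.

```csharp
using System;

public static class WordCellText
{
    // Hypothetical helper: strips the control characters that Word
    // includes in the text of a table cell.
    public static string Clean(string raw)
    {
        if (string.IsNullOrEmpty(raw))
        {
            return string.Empty;
        }

        return raw
            .Replace("\a", "")   // 0x07 (bell): drop it entirely
            .Replace('\r', ' ')  // 0x0D (carriage return): assumed to act as a separator
            .Replace('\v', ' ')  // 0x0B (vertical tab): assumed to act as a separator
            .Trim();             // remove the leading/trailing spaces left over
    }
}
```

    In the loop from the question, the call would then look something like allData += WordCellText.Clean(tbl.Cell(r, c).Range.Text) + Environment.NewLine;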

    • Marked as answer by gonnabe Friday, September 02, 2011 2:21 AM
    Friday, September 02, 2011 1:38 AM