Splitting Arabic text file

  • Question

  • I would like to split an Arabic text file into words, then display these words in a textbox.
    Saturday, June 29, 2013 9:15 AM

Answers

  • Arabic characters are Unicode characters. You have to read the file with the Unicode encoding it was saved in (StreamReader assumes UTF-8 unless told otherwise) and then use Regex to separate the words. See the page below:

    http://msdn.microsoft.com/en-us/library/az24scfc.aspx

    On that page, in the \b entry under Anchors, you will find the expression below, which should work as a starting point. Note that it matches words in pairs; to match one word at a time, use \b\w+\b instead.

    \b\w+\s\w+\b

    http://msdn.microsoft.com/en-us/library/az24scfc.aspx#grouping_constructs
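
    To see how the two expressions differ, here is a minimal sketch. The PatternDemo class, the MatchAll helper, and the five-word sample sentence are made up for illustration and are not part of the linked page. On the sample, the pair pattern should yield two matches (the fifth word has no partner and is skipped), while \b\w+\b should yield all five words.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PatternDemo
{
    // Returns every match of the pattern in the text, in order of appearance.
    public static string[] MatchAll(string text, string pattern)
    {
        return Regex.Matches(text, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    static void Main()
    {
        // Hypothetical sample sentence with five words.
        string sample = "مرحبا بكم في المكتبة الجديدة";

        string[] pairs = MatchAll(sample, @"\b\w+\s\w+\b"); // two words per match
        string[] words = MatchAll(sample, @"\b\w+\b");      // one word per match

        Console.WriteLine(pairs.Length); // 2
        Console.WriteLine(words.Length); // 5
    }
}
```

    In .NET, \w covers letters, digits, and non-spacing marks, so Arabic letters with diacritics should stay together as one word.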


    jdweng

    Saturday, June 29, 2013 1:10 PM
  • About encoding (especially Unicode):

    http://www.joelonsoftware.com/articles/Unicode.html


    Let's talk about MVVM: http://social.msdn.microsoft.com/Forums/en-US/wpf/thread/b1a8bf14-4acd-4d77-9df8-bdb95b02dbe2

    Saturday, June 29, 2013 10:06 PM
  • See if this code works

    using System;
    using System.IO;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace ConsoleApplication1
    {
        class Program
        {
            const string filename = @"c:\temp\arabicText.xls";

            static void Main(string[] args)
            {
                string document;
                // Encoding.Unicode is UTF-16 little-endian; use whichever encoding the file was saved in.
                using (StreamReader reader = new StreamReader(filename, Encoding.Unicode))
                {
                    document = reader.ReadToEnd();
                }

                // \b\w+\b matches one word at a time. (\b\w+\s\w+\b would match words
                // in pairs and skip a trailing word that has no partner.)
                string pattern = @"\b\w+\b";

                // Join the matched words with single spaces.
                string output = string.Join(" ",
                    Regex.Matches(document, pattern).Cast<Match>().Select(m => m.Value));

                // Without this, the console may show Arabic text as question marks.
                Console.OutputEncoding = Encoding.UTF8;
                Console.Write(output);
                Console.ReadLine();
            }
        }
    }


    jdweng

    Saturday, June 29, 2013 10:36 PM
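
    Since the question asks for the words in a textbox rather than on the console, the same idea can be sketched as a helper that returns the words one per line. WordLister, WordsAsLines, the sample file name, and textBox1 are hypothetical names introduced here; the sketch assumes the file is UTF-8 unless a byte order mark says otherwise.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class WordLister
{
    // Reads the file and returns its words separated by line breaks.
    public static string WordsAsLines(string path, Encoding fallback)
    {
        string document;
        // The third argument lets a byte order mark in the file override the fallback encoding.
        using (var reader = new StreamReader(path, fallback, true))
        {
            document = reader.ReadToEnd();
        }

        var words = Regex.Matches(document, @"\b\w+\b")
                         .Cast<Match>()
                         .Select(m => m.Value);
        return string.Join(Environment.NewLine, words);
    }

    static void Main()
    {
        // Hypothetical sample file, written here so the sketch runs on its own.
        string path = Path.Combine(Path.GetTempPath(), "arabicSample.txt");
        File.WriteAllText(path, "السلام عليكم\nمرحبا", Encoding.UTF8);

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(WordsAsLines(path, Encoding.UTF8));
    }
}
```

    In a Windows Forms application, the returned string could be assigned to a multiline textbox, for example textBox1.Multiline = true; textBox1.Text = WordLister.WordsAsLines(path, Encoding.UTF8);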
