Remove letters (All letters in alphabet) from String

    Question

  • I have a string of "text" that I extract from a specific file. Usually this string just contains a mobile telephone number, but in some instances it also contains a name or some other text.

     

    i.e. it should contain 0404 123 456

     

    but sometimes it has 0404 123 456 bob

     

    Can someone give me any pointers as to the simplest and easiest method of replacing any character that is not a digit (0-9 inclusive) with spaces? It needs to be a fairly quick method, as the process reads and writes anywhere from 50 to 50,000 records in a run.

     

    I could use the String.Replace method, but I don't want to have to write it for each letter in the alphabet.

     

    Cheers


    Mc

    Monday, May 21, 2007 5:40 AM

All replies

  • It would be possible. But I think you'd be better off writing a regular expression to just extract what you want, rather than worrying about "sanitizing" your text so that you can extract what you want.

     

    Let me know if you're not familiar with regular expressions.

    Monday, May 21, 2007 8:14 AM
  • Hi,

     

    You're wanting to use regular expressions. They are a way to perform quite fast processing of strings based on a pattern. You can also improve the speed by compiling the expression into an assembly, though there is a hit at startup.

     

    The class you need is the Regex class in the System.Text.RegularExpressions namespace. In your case I would recommend capturing all the telephone numbers rather than removing the names. Regular expressions are a whole topic in themselves, but I'll give you an example that works with the two numbers you posted; this might not work for your whole file though...

     

    Code Snippet

     

    ' Requires: Imports System.Text.RegularExpressions
    Sub Main()
        Dim sample As String = "0404 123 456 0404 123 456 bob"
        Dim pattern As String = "[0-9]{4}\s[0-9]{3}\s[0-9]{3}"
        Dim regex As New Regex(pattern, RegexOptions.Compiled)
        Dim matches As MatchCollection
        matches = regex.Matches(sample)
        ' Print each matched telephone number on its own line
        For Each match As Match In matches
            Console.Write(match.Value)
            Console.Write(Environment.NewLine)
        Next
        Console.ReadLine()
    End Sub

     

    The pattern above defines what will be matched. [0-9] indicates any digit between 0 and 9, {4} indicates that four digits are expected, and \s indicates a whitespace character (a space, but also a tab). So the whole pattern looks for a sequence of text containing 4 digits, a space, 3 digits, a space and another 3 digits. If you have a more complex format or a variation of telephone number then regular expressions can handle that too; you just have to modify the pattern.

     

    As I said before, regular expressions are a whole topic in themselves.
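
    For the literal request in the question (blanking out anything that is not 0-9, rather than capturing a fixed layout), a single Regex.Replace call may be enough. This is only a minimal sketch: the module name and the BlankNonDigits helper are made up for illustration and do not come from the thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BlankNonDigitsSketch
    ' Hypothetical helper. The Regex is created once and reused for every record.
    Private ReadOnly NonDigit As New Regex("[^0-9]", RegexOptions.Compiled)

    ' Replaces each character that is not 0-9 with one space,
    ' so the result has the same length as the input.
    Function BlankNonDigits(ByVal input As String) As String
        If input Is Nothing Then Return String.Empty
        Return NonDigit.Replace(input, " ")
    End Function

    Sub Main()
        ' Brackets make the trailing spaces visible.
        Console.WriteLine("[" & BlankNonDigits("0404 123 456 bob") & "]")
    End Sub
End Module
```

    Because existing spaces are also "not a digit", they are simply replaced by themselves; call Trim() on the result if the trailing padding is not wanted.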

    Monday, May 21, 2007 8:31 AM
  • ok, the easiest way to get only the numbers is to extract the numbers from the string rather than taking away all other characters

    Dim st As String = "0404 123 456 bob" ' fill in your own string here
    Dim t As Integer = st.Length          ' number of characters to check
    Dim c As Integer = 1                  ' GetChar positions start at 1
    Dim d As Char
    Dim r As String = ""

    Do Until c = t + 1
        d = GetChar(st, c)
        ' keep the character only if it is a digit
        If IsNumeric(d) Then
            r = r & d
        End If
        c = c + 1
    Loop


    and your resulting telephone number is the string r


    Monday, May 21, 2007 2:49 PM
  • Oh. My.

    Please don't use this solution :)
    Monday, May 21, 2007 3:39 PM
  • Just wanted to throw one more idea at you. 

     

    Dim oldstr As String = "123 456 7890 bob 999 888 7777 adam 555 555 5555 gg"
    Dim newstr As String = ""
    For Each c As Char In oldstr
        If IsNumeric(c) = True Then
            newstr &= c
        Else
            newstr &= " "
        End If
    Next

     

    Monday, May 21, 2007 4:07 PM
  • If you're considering using one of the examples using the IsNumeric() function, do yourself a favor and at least use a StringBuilder to append your output together.
    Monday, May 21, 2007 4:12 PM
  • There is no reason to use IsNumeric.

     

    Particularly when there is a shared method on the Char structure which allows you to identify letters and digits:

     

    Code Snippet

    Function ParseIt(ByVal thisString As String) As String
        ' Check for an 'invalid' string
        If thisString Is Nothing Then Return String.Empty
        ' Create a new empty stringbuilder
        Dim sb As New System.Text.StringBuilder
        ' Convert to a character array
        Dim charArray() As Char = thisString.ToCharArray
        ' Loop through each character in the array
        For Each c As Char In charArray
            ' If it's a digit, then append to the 'builder
            If Char.IsDigit(c) = True Then sb.Append(c)
        Next
        ' Return the Stringbuilder String
        Return sb.ToString
    End Function

     

    Note: this will only append DIGITS, and will ignore whitespace, control chars, etc. It is straightforward to get it to append whitespace (and any other characters) as necessary.
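
    As a rough illustration of that last point, the variant below also keeps whitespace, so the original grouping of the number survives. The name ParseItKeepSpaces and the final Trim() are assumptions added for this sketch, not part of the function above.

```vbnet
Imports System
Imports System.Text

Module KeepDigitsAndSpacesSketch
    ' Hypothetical variant of ParseIt: keeps digits and whitespace, drops everything else.
    Function ParseItKeepSpaces(ByVal thisString As String) As String
        If thisString Is Nothing Then Return String.Empty
        ' Pre-size the builder; the result can never be longer than the input.
        Dim sb As New StringBuilder(thisString.Length)
        For Each c As Char In thisString
            ' Note: Char.IsDigit is also True for non-ASCII Unicode decimal digits.
            If Char.IsDigit(c) OrElse Char.IsWhiteSpace(c) Then sb.Append(c)
        Next
        ' Trim the whitespace left behind where a trailing name was removed.
        Return sb.ToString().Trim()
    End Function

    Sub Main()
        Console.WriteLine(ParseItKeepSpaces("0404 123 456 bob"))
    End Sub
End Module
```

    If only the characters 0-9 should count, the test can be replaced with a range check such as c >= "0"c AndAlso c <= "9"c.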

    Monday, May 21, 2007 4:40 PM
    Moderator
  • woah... hang on folks !!

     

    Processing up to 50,000 telephone numbers in a reasonable time by calling boolean checks for numbers is just not going to happen. 50,000 numbers @ 11 required characters + n amount of unrequired characters = an extremely large amount of individual characters, all of which are being checked individually, and telephone numbers don't necessarily need to be numeric. You've just got to use regular expressions here.

    Monday, May 21, 2007 6:36 PM
  • >> Processing up to 50,000 telephone numbers in a reasonable time by calling boolean checks for numbers is just not going to happen. 50,000 numbers @ 11 required characters + n amount of unrequired characters = an extremely large amount of individual characters, all of which are being checked individually, and telephone numbers don't necessarily need to be numeric. You've just got to use regular expressions here.

     

     

    I'm not sure what equipment you're running (TRS-80??), but my PC will process 100,000 records of 38 characters each in under a second. Here's some code to create your test file... let me know how RegEx's performance is:

    Code Snippet

     

    'Code to create test file
    Dim SR As New System.IO.StreamWriter("C:\TEST.TXT")
    For I As Integer = 0 To 100000
        SR.WriteLine("123 456 7890 asdfkasdf ;asdfjk;askfdj ")
    Next
    SR.Close()

     

    Edit - Just out of curiosity, do you believe a RegEx NOT to be looking at each character individually? What magicians the folks at MS must be indeed!

    Monday, May 21, 2007 6:59 PM
  • What, who me?!  I have a ZX81, three hours to load Notepad.

     

    I did some benchmarks and on a relatively old machine here's what I found.

     

    Using regular expressions it found 10001 matches and took 78 milliseconds

    Using SJWhiteley's ParseIt method (sorry mate, needed another benchmark) it took 222 milliseconds.

     

    Ok, so not exactly going to ruin the user's day like I thought, but regular expressions are 2.846 times faster.

    I thought it might have taken a bit more time to loop over all characters to be honest; it does create a noticeable lag.

     

    RegEx is based on mathematical theories of equivalence with finite automata.

    Yeah, I don't know what that means either, but just think how impressed McWhirter's boss will be when he hears that.

     


     

    Monday, May 21, 2007 7:59 PM
  • That's interesting. I upped it to 1,000,000 records and SJ's code runs in about 3 seconds, the RegEx in about 10 seconds.

     

    Code Snippet

    Private Sub Button5_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button5.Click
        'Code to create test file
        'Dim SR As New System.IO.StreamWriter("C:\BP\TEST.TXT")
        'For I As Integer = 0 To 1000000
        ' Dim Sb As New System.Text.StringBuilder(22)
        ' SR.WriteLine("123 456 7890 asdfkasdf ;asdfjk;askfdj ")
        'Next
        'SR.Close()
        Debug.WriteLine(Now())
        Dim Sw As New System.IO.StreamReader("c:\bp\test.txt")
        Dim Rec As String = Sw.ReadLine
        Do While Sw.Peek > -1
            ParseIt(Rec)
            'RegExs(Rec)
            Rec = Sw.ReadLine
        Loop
        Sw.Close()
        'SR.Close()
        Debug.WriteLine(Now())
    End Sub

    Function ParseIt(ByVal thisString As String) As String
        ' Check for an 'invalid' string
        If thisString Is Nothing Then Return String.Empty
        ' Create a new empty stringbuilder
        Dim sb As New System.Text.StringBuilder
        ' Convert to a character array
        Dim charArray() As Char = thisString.ToCharArray
        ' Loop through each character in the array
        For Each c As Char In charArray
            ' If it's a digit, then append to the 'builder
            If Char.IsDigit(c) = True Then sb.Append(c)
        Next
        ' Return the Stringbuilder String
        Return sb.ToString
    End Function

    Function RegExs(ByVal Sample As String) As String
        Dim pattern As String = "[0-9]{3}\s[0-9]{3}\s[0-9]{3}"
        Dim regex As New Regex(pattern, RegexOptions.Compiled)
        Dim matches As MatchCollection
        matches = regex.Matches(Sample)
        'For Each match As Match In matches
        ' Console.Write(match.Value)
        ' Console.Write(Environment.NewLine)
        'Next
        Return matches(0).ToString
    End Function

     

    Monday, May 21, 2007 8:08 PM
  • no chance mate....

     

    I've just run it on 1,000,000 records, 5 times for each method (ParseIt and RegEx), and here are the results.

     

    ParseIt = 14052 milliseconds

    RegEx = 235 milliseconds

     

    You're using way more memory now, so the garbage collector may have kicked in during the regex.

     

    I don't believe this. I reran the code and I swear to you, the RegEx has gone down to 8 milliseconds for 1,000,000 records. Why? Because the regex has been compiled into an assembly.

     

    Imports System.Text.RegularExpressions
    Imports System.Diagnostics

    Module Module1

        Sub Main()
            'Code to create test file
            'Dim SR As New System.IO.StreamWriter("C:\TEST.TXT")
            'For I As Integer = 0 To 1000000
            ' SR.WriteLine("0123 456 789 asdfkasdf ;asdfjk;askfdj ")
            'Next
            'SR.Close()
            Dim watch As New Stopwatch
            Dim sample As String = System.IO.File.ReadAllText("C:\TEST.TXT")
            Dim pattern As String = "[0-9]{4}\s[0-9]{3}\s[0-9]{3}"

            watch.Start()
            ParseIt(sample)
            ParseIt(sample)
            ParseIt(sample)
            ParseIt(sample)
            ParseIt(sample)
            watch.Stop()
            Console.WriteLine("Parse It: {0}", watch.ElapsedMilliseconds)

            watch.Reset()
            watch.Start()
            ParseItRegEx(sample)
            ParseItRegEx(sample)
            ParseItRegEx(sample)
            ParseItRegEx(sample)
            ParseItRegEx(sample)
            watch.Stop()

            Console.WriteLine("Parse It Reg Ex: {0}", watch.ElapsedMilliseconds)
            Console.ReadLine()
        End Sub

        Function ParseItRegEx(ByVal thisString As String) As MatchCollection
            Dim pattern As String = "[0-9]{4}\s[0-9]{3}\s[0-9]{3}"
            Dim regex As New Regex(pattern, RegexOptions.Compiled)
            Return regex.Matches(thisString)
        End Function

        Function ParseIt(ByVal thisString As String) As String
            ' Check for an 'invalid' string
            If thisString Is Nothing Then Return String.Empty
            ' Create a new empty stringbuilder
            Dim sb As New System.Text.StringBuilder
            ' Convert to a character array
            Dim charArray() As Char = thisString.ToCharArray
            ' Loop through each character in the array
            For Each c As Char In charArray
                ' If it's a digit, then append to the 'builder
                If Char.IsDigit(c) = True Then sb.Append(c)
            Next
            ' Return the Stringbuilder String
            Return sb.ToString
        End Function

    End Module

    Monday, May 21, 2007 8:36 PM
  • Hi, the reason why your results are so different and why your regular expression is taking longer is that you're compiling the regular expression for each line of the file. If you change the RegExs function to this you will notice an improvement in the time taken.

     

    Code Snippet

    ' Declared once at module/class level, outside the function
    Dim pattern As String = "[0-9]{3}\s[0-9]{3}\s[0-9]{3}"
    Dim regex As New Regex(pattern, RegexOptions.Compiled)

    Function RegExs(ByVal Sample As String) As MatchCollection
        Dim matches As MatchCollection
        matches = regex.Matches(Sample)
        Return matches
    End Function

     

     

    My test code made one compile for the whole file so when it ran again using the same file it didn't need to compile again.

     

    Monday, May 21, 2007 8:47 PM
  • Gawd

     

    Didn't think it would be quite this complicated (although I did figure it would be regex, which I know nothing about)

     

    I think the various regex solutions will help me achieve the desired result. I will have to take into account the fact that the string of numbers could be 4 3 3 or it could be 10 or it could be 9 or it could be 3 3 3... It all depends on how the data was entered (the program it was created in doesn't format any of it)

     

    I think for the sake of ease, I will format the string first and remove all spaces, which means I should be working with a string of numbers 9 or 10 digits long that may have some letters on one end or the other
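
    A rough sketch of that plan, assuming the 9-or-10-digit length mentioned above: strip all whitespace first, then take the first run of 9 or 10 digits. The names ExtractNumber, Whitespace and DigitRun are hypothetical and only used for this illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CompactThenMatchSketch
    ' Both expressions are built once and reused for every record.
    Private ReadOnly Whitespace As New Regex("\s+", RegexOptions.Compiled)
    Private ReadOnly DigitRun As New Regex("[0-9]{9,10}", RegexOptions.Compiled)

    ' Hypothetical helper: removes whitespace, then returns the first run of
    ' 9 or 10 digits, or an empty string when no such run exists.
    Function ExtractNumber(ByVal raw As String) As String
        If raw Is Nothing Then Return String.Empty
        Dim compact As String = Whitespace.Replace(raw, "")
        Dim m As Match = DigitRun.Match(compact)
        If m.Success Then Return m.Value
        Return String.Empty
    End Function

    Sub Main()
        Console.WriteLine(ExtractNumber("0404 123 456 bob"))
        Console.WriteLine(ExtractNumber("04 041 23 456"))
    End Sub
End Module
```

    One caveat with this approach: removing the spaces first would also glue two separate numbers on the same line together, and a run longer than 10 digits is cut to its first 10.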

     

    I'll let you know what I come up with :) (and give someone a tick)

     

    Thanks for all the responses!

    Tuesday, May 22, 2007 7:42 AM
  • It looks like the RegEx will be the fastest. But you also have to think about the specific application: if it's only going to be done once, then a brute force method which takes 10 times as long is better than spending hours/days agonizing over the 'best' way.

     

    If these are the sorts of things you are going to be doing in other scenarios, then learning how to use regular expressions may be time well spent :)

    Tuesday, May 22, 2007 12:36 PM
    Moderator
  • >>ParseIt = 14052 milliseconds

    >>RegEx = 235 milliseconds

     

    The reason for that is that ParseIt is actually returning the appended string and RegEx isn't: it's built the collection of matches but hasn't appended them back into a single string. If you change it to return the same data, it slows down considerably.
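
    One way to make the two methods return comparable data is sketched below: the regex version walks its matches and appends them to a StringBuilder. It is worth knowing that Regex.Matches populates its MatchCollection lazily, so a timing that never enumerates the collection measures very little of the matching work. The function name ParseItRegExToString is made up for this sketch; the pattern is the 4-3-3 one used earlier in the thread.

```vbnet
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Module RegexToStringSketch
    ' Created once, not once per call.
    Private ReadOnly PhonePattern As New Regex("[0-9]{4}\s[0-9]{3}\s[0-9]{3}", RegexOptions.Compiled)

    ' Hypothetical helper: returns every match on its own line, so the
    ' regex version produces a String just like ParseIt does.
    Function ParseItRegExToString(ByVal thisString As String) As String
        If thisString Is Nothing Then Return String.Empty
        Dim sb As New StringBuilder()
        ' Enumerating the collection is what actually runs the matching.
        For Each m As Match In PhonePattern.Matches(thisString)
            sb.AppendLine(m.Value)
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.Write(ParseItRegExToString("0404 123 456 bob 0404 123 456"))
    End Sub
End Module
```

    Timing this function against ParseIt on the same file should give a fairer picture than timing a call that only returns the MatchCollection.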

     

    Anyhow, I'd like to have a look at the actual data file before saying either of these methods is the way to go. If the phone number is always the first field on a line, for example, the use of Substring could very well be in order, in combination with one of these methods.

    Tuesday, May 22, 2007 3:20 PM
  • why always speed?

    if he had 5 million telephone numbers he wouldn't be on this forum, he would hire a professional,
    because if you have 5 million telephone numbers, your company is big and has lots of money (well, most of the time)


    if he wants to learn VB, it's good to give lots of different ways to build the program
    and that's what we all did : )

    Friday, May 25, 2007 4:55 PM
  • I ended up using the ParseIt function that SJWhiteley provided, for the simple reason that it worked and enabled me to continue past this point in the application easily.

     

    The test data I was working with only has a couple of thousand records, not all of which have mobile numbers

     

    I also have to look further into regex, as it will be, I think, a better method in the end, but since I don't know enough about it at the moment (and don't have the time to study it in depth in the timeframe available) I will have to put it aside for now.

     

    The main problem I could see with regex at this time would have been determining the correct pattern to use, as the source data has no formatting function for the original input, thus the number could be 0404 123 456 or it could be 04 041 23 456 and so on.

     

    The ParseIt function seemed to have no visible effect on the process time and just rips the numbers out of the string, which is the primary aim. I can then format that string to a uniform standard
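
    That last formatting step could look something like the sketch below. It assumes the target layout is the 4-3-3 grouping from the original question (0404 123 456) and that only 10-digit results should be regrouped; DigitsOnly and FormatMobile are names invented for this example.

```vbnet
Imports System
Imports System.Text

Module UniformFormatSketch
    ' Same idea as ParseIt, but restricted to the ASCII digits 0-9.
    Function DigitsOnly(ByVal s As String) As String
        If s Is Nothing Then Return String.Empty
        Dim sb As New StringBuilder(s.Length)
        For Each c As Char In s
            If c >= "0"c AndAlso c <= "9"c Then sb.Append(c)
        Next
        Return sb.ToString()
    End Function

    ' Hypothetical formatter: regroups a 10-digit number as 4-3-3.
    ' Anything that is not exactly 10 digits is returned as bare digits.
    Function FormatMobile(ByVal raw As String) As String
        Dim digits As String = DigitsOnly(raw)
        If digits.Length <> 10 Then Return digits
        Return digits.Substring(0, 4) & " " & digits.Substring(4, 3) & " " & digits.Substring(7, 3)
    End Function

    Sub Main()
        Console.WriteLine(FormatMobile("04 041 23 456 bob"))
    End Sub
End Module
```

    Returning the bare digits for other lengths is just one possible choice; flagging those records for manual review may suit the data better.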

     

    Thanks guys! Much appreciated

     

    Mc

    Saturday, May 26, 2007 2:46 AM