Split text lines into words and select the correct ones

  • Question

  • The following code splits each line into words and stores the first word of each line in one list, the second word of each line in another list, and so on. It then selects the most frequent word from each list as the correct word.

    Module Module1
    
        Sub Main()
            Dim correctLine As String = ""
            Dim line1 As String = "Canda has more than ones official language"
            Dim line2 As String = "Canada has more than one oficial languages"
            Dim line3 As String = "Canada has nore than one official lnguage"
            Dim line4 As String = "Canada has nore than one offical language"
    
            Dim wordsOfLine1() As String = line1.Split(" ")
            Dim wordsOfLine2() As String = line2.Split(" ")
            Dim wordsOfLine3() As String = line3.Split(" ")
            Dim wordsOfLine4() As String = line4.Split(" ")
     
    
            For i As Integer = 0 To wordsOfLine1.Length - 1
                Dim wordAllLinesTemp As New List(Of String)(New String() {wordsOfLine1(i), wordsOfLine2(i), wordsOfLine3(i), wordsOfLine4(i)})
                Dim counts = From n In wordAllLinesTemp
                             Group n By n Into Group
                             Order By Group.Count() Descending
                             Select Group.First
                correctLine = correctLine & counts.First & " "
            Next
            correctLine = correctLine.Remove(correctLine.Length - 1)
            Console.WriteLine(correctLine)
            Console.ReadKey()
    
        End Sub
    
    End Module

    So this is my code. How can I make it work with lines that have different numbers of words? Each line here has 7 words, and the For loop works with this length (Length - 1). Suppose that line 3 contains only 5 words.


    • Edited by myahia72 Wednesday, February 28, 2018 4:20 PM
    Tuesday, February 27, 2018 8:21 PM

All replies

  • So I want to split all text line words into arrays and then apply voting method on these words.

    You haven't indicated what the problem is. How far into the process have you got, and what difficulty have you run into? Post the code you have so far.

    Tuesday, February 27, 2018 8:53 PM
  • Yeah, what is the question?

    Well, maybe you are asking how to "Split text lines into words", but then the "select the correct ones" part is very vague. Note that you should put the entire question in the body of the post; don't try to ask the entire question in the subject (title). There is absolutely no question in the body.

    As for splitting lines you need to decide how complex you want to make it. For example if you get "it.For" then is that the end of a sentence and the beginning of another and the space has been mistakenly omitted? What if you get ".Net" then is that another mistake? Some people (marketing types of people especially) exist to violate rules and like to do things however they want to so you might have a period in the middle of a name. How complicated do you need (want) to be? You need to decide that first.



    Sam Hobbs
    SimpleSamples.Info

    Tuesday, February 27, 2018 11:00 PM
  • What's the question? How to split lines of text? How to get a percentage of possible correct words for an index of the 5 string arrays? Provide you all the code for a graduate project idea you came up with? What do you want?

    La vida loca

    Wednesday, February 28, 2018 1:56 AM
  • Hi myahia72,

    If you want to split text lines into words, you can use the String.Split method:

    Dim str As String = "Canda has more than ones official language"
    Dim words() As String = str.Split(" ")
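    Note that splitting on a single space produces empty entries when the text contains consecutive or trailing spaces, which would shift words into the wrong positions later on. A minimal sketch of one way to avoid that, assuming a hypothetical helper named SplitWords (not part of the code in this thread):

```vbnet
Imports System

Module SplitSketch

    ' Hypothetical helper: splits on spaces and drops the empty entries
    ' that repeated or trailing spaces would otherwise produce.
    Function SplitWords(text As String) As String()
        Return text.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Two spaces after "Canada" and one trailing space.
        Dim words = SplitWords("Canada  has more ")
        Console.WriteLine(words.Length) ' 3
    End Sub

End Module
```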

    You said that array1 contains the first word from each line and that you select the correct one from it. Can you post your existing code here? It would help us understand what you want to do.

    Best regards,

    Cherry


    MSDN Community Support

    Wednesday, February 28, 2018 5:20 AM
  • I have posted the code and also I have updated the question
    Wednesday, February 28, 2018 6:01 PM
  • Instead of accessing wordsOfLineX(i) directly, create a lambda or helper method to safely get a string from the array, returning an empty string if the index is invalid.  For example:

    Module Module1
    
        Sub Main()
            Dim correctLine As String = ""
            Dim line1 As String = "Canda has more than ones official language"
            Dim line2 As String = "Canada has more than one oficial languages"
            Dim line3 As String = "Canada has nore than one official lnguage"
            Dim line4 As String = "Canada has nore than one offical language"
    
            Dim wordsOfLine1() As String = line1.Split(" ")
            Dim wordsOfLine2() As String = line2.Split(" ")
            Dim wordsOfLine3() As String = line3.Split(" ")
            Dim wordsOfLine4() As String = line4.Split(" ")
    
            Dim getWordSafely = Function(array As String(), index As Integer)
                                    If index > -1 AndAlso index < array.Length Then Return array(index)
                                    Return String.Empty
                                End Function
    
            For i As Integer = 0 To wordsOfLine1.Length - 1
                Dim wordAllLinesTemp As New List(Of String)(New String() {getWordSafely(wordsOfLine1, i), getWordSafely(wordsOfLine2, i),
                                                            getWordSafely(wordsOfLine3, i), getWordSafely(wordsOfLine4, i)})
                Dim counts = From n In wordAllLinesTemp
                             Group n By n Into Group
                             Order By Group.Count() Descending
                             Select Group.First
                correctLine = correctLine & counts.First & " "
            Next
            correctLine = correctLine.Remove(correctLine.Length - 1)
            Console.WriteLine(correctLine)
            Console.ReadKey()
    
        End Sub
    
    End Module

    Just keep in mind that an empty string could now be the predominant result for a position, depending on how many short lines there are.
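
    One way to keep the empty placeholder from winning the vote is to leave it out of the count. This is only a sketch: MostFrequentNonEmpty is a hypothetical helper (it is not part of the code above), and it assumes that a missing word should simply not vote.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module VotingSketch

    ' Hypothetical helper: returns the most frequent non-empty word,
    ' or an empty string when every entry is a placeholder.
    Function MostFrequentNonEmpty(words As IEnumerable(Of String)) As String
        Dim winner = words.
            Where(Function(w) Not String.IsNullOrEmpty(w)).
            GroupBy(Function(w) w).
            OrderByDescending(Function(g) g.Count()).
            Select(Function(g) g.Key).
            FirstOrDefault()
        Return If(winner, String.Empty)
    End Function

    Sub Main()
        ' Three short lines contributed placeholders for this position,
        ' so a plain majority vote would pick the empty string.
        Dim column = {"language", "", "", "", "language"}
        Console.WriteLine(MostFrequentNonEmpty(column)) ' language
    End Sub

End Module
```

    Inside the loop, this would replace the counts query, for example correctLine = correctLine & MostFrequentNonEmpty(wordAllLinesTemp) & " ".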


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    • Marked as answer by myahia72 Wednesday, February 28, 2018 8:49 PM
    Wednesday, February 28, 2018 8:42 PM
  • I have posted the code and also I have updated the question

    You have also made all the previous responses look like nonsense.  If you are provided with code for the project it should be posted as an additional post, not by rewriting your question. 

    The code works for lines of any length because the For loop runs from 0 to wordsOfLine?.Length - 1, not from 0 to 6.   You should work through that code line by line to ensure that you understand exactly what each statement does, because it is likely you will need to make changes to do what you describe.

    Wednesday, February 28, 2018 8:46 PM
  • Thanks very much
    Wednesday, February 28, 2018 8:53 PM
  • The code works for lines of any length because the For loop runs from 0 to wordsOfLine?.Length - 1, not from 0 to 6.   You should work through that code line by line to ensure that you understand exactly what each statement does, because it is likely you will need to make changes to do what you describe.

    But there's only one loop over the first line, so it becomes the maximum length line.  The following lines are all accessed by that same iteration variable so if they parsed shorter, there would be an index out of range exception when building the List(Of String).

    -EDIT-

    Though I agree that the post appeared to begin with a question about how to organize the words and now is more about dealing with one of the problems (varying length strings) that one might encounter with this kind of thing.


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"


    Wednesday, February 28, 2018 8:55 PM
  • I mentioned it in a reply to Acamar above but it is worth reiterating - the first line is deciding the maximum number of words to test.  It might be better to get the longest string and use that length:

            Dim maxLength = (Aggregate a In {wordsOfLine1, wordsOfLine2, wordsOfLine3, wordsOfLine4} Select a.Length Into Max)
    
            For i As Integer = 0 To maxLength - 1
                Dim wordAllLinesTemp As New List(Of String)(New String() {getWordSafely(wordsOfLine1, i), getWordSafely(wordsOfLine2, i),
                                                            getWordSafely(wordsOfLine3, i), getWordSafely(wordsOfLine4, i)})
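
    For an arbitrary number of lines, the same two ideas (a safe lookup and a loop bound taken from the longest line) might be combined as in the sketch below. The names MajorityLine and WordAt are hypothetical, the input is assumed to contain at least one line, and empty placeholders still take part in the vote as in the code above.

```vbnet
Imports System
Imports System.Linq

Module MaxLengthSketch

    ' Same idea as getWordSafely: an empty string stands in for a missing word.
    Function WordAt(words As String(), index As Integer) As String
        If index >= 0 AndAlso index < words.Length Then Return words(index)
        Return String.Empty
    End Function

    ' Hypothetical helper: votes position by position up to the longest line.
    Function MajorityLine(lines As String()) As String
        Dim wordArrays = lines.Select(Function(l) l.Split(" "c)).ToArray()
        Dim maxLength = wordArrays.Max(Function(w) w.Length)
        Dim winners(maxLength - 1) As String

        For i As Integer = 0 To maxLength - 1
            Dim index = i ' copy the loop variable before using it in a lambda
            winners(i) = wordArrays.
                Select(Function(w) WordAt(w, index)).
                GroupBy(Function(w) w).
                OrderByDescending(Function(g) g.Count()).
                First().Key
        Next

        ' Positions won by the placeholder leave trailing blanks; trim them.
        Return String.Join(" ", winners).TrimEnd()
    End Function

    Sub Main()
        ' The first line is deliberately the shortest one.
        Dim lines = {
            "Canada has more",
            "Canda has more than ones official language",
            "Canada has more than one oficial languages",
            "Canada has nore than one official lnguage",
            "Canada has nore than one offical language"
        }
        Console.WriteLine(MajorityLine(lines))
    End Sub

End Module
```

    With this input the sketch prints "Canada has more than one official language", even though the first line has only three words.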


    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    • Marked as answer by myahia72 Thursday, March 1, 2018 11:23 AM
    Wednesday, February 28, 2018 8:59 PM
  • But there's only one loop over the first line, so it becomes the maximum length line.  The following lines are all accessed by that same iteration variable so if they parsed shorter, there would be an index out of range exception when building the List(Of String).

    Then don't do it like that. If the lines do not have an equal number of words then OP has much bigger problems than the range exception: the whole voting concept becomes irrelevant. If lines with unequal numbers of words are allowed then there are several options. OP could ignore lines that don't match in number of words, or do some sort of similarity ranking to work out which column each word goes into (that is, where to insert a blank dummy word). Whatever the choice, just extending the lines so they match is going to corrupt the voting.
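
    The first option (ignoring lines whose word count does not match) could be sketched as follows. KeepLinesWithCommonLength is a hypothetical name, and treating the most common word count as the expected one is an assumption; if the expected count is known in advance it could be passed in instead.

```vbnet
Imports System
Imports System.Linq

Module FilterSketch

    ' Hypothetical helper: keeps only the lines whose word count equals the
    ' most common word count in the input (assumes at least one line).
    Function KeepLinesWithCommonLength(lines As String()) As String()
        Dim expected = lines.
            GroupBy(Function(l) l.Split(" "c).Length).
            OrderByDescending(Function(g) g.Count()).
            First().Key
        Return lines.
            Where(Function(l) l.Split(" "c).Length = expected).
            ToArray()
    End Function

    Sub Main()
        Dim lines = {
            "Canda has more than ones official language",
            "Canada has more than one oficial languages",
            "Canada has nore than one language",
            "Canada has nore than one offical language"
        }
        ' The six-word line is dropped; the three seven-word lines remain.
        For Each kept In KeepLinesWithCommonLength(lines)
            Console.WriteLine(kept)
        Next
    End Sub

End Module
```

    The surviving lines all have the same number of words, so the original voting loop can then run over them without any placeholder words.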

    Wednesday, February 28, 2018 9:11 PM
  • If the lines do not have an equal number of words then OP has much bigger problems than the range exception - the whole voting concept becomes irrelevant.  

    I completely agree that short lines may skew the results, but we don't really know what the expected results are supposed to be or what the input will actually look like.

    Reed Kimble - "When you do things right, people won't be sure you've done anything at all"

    Wednesday, February 28, 2018 9:18 PM
  • The following is a possibility. It will adapt to the number of words in each line. This does not do everything but it does most of it and the rest should be easy.

    Class classWordsOfLine
        Public line As String
        Public Words() As String
        Public Sub New(line As String)
            Me.line = line
            Words = line.Split(" ")
        End Sub
    End Class
    
    Module Module1
    
        Sub Main()
            Dim correctLine As String = ""
            Dim WordsOfLine(4) As classWordsOfLine
            Dim maxwords As Integer = 0
            '
            WordsOfLine(0) = New classWordsOfLine("Canda has more than ones official language")
            maxwords = Math.Max(maxwords, WordsOfLine(0).Words.Length)
            WordsOfLine(1) = New classWordsOfLine("Canada has more than one oficial languages")
            maxwords = Math.Max(maxwords, WordsOfLine(1).Words.Length)
            WordsOfLine(2) = New classWordsOfLine("Canada has nore than one official lnguage")
            maxwords = Math.Max(maxwords, WordsOfLine(2).Words.Length)
            WordsOfLine(3) = New classWordsOfLine("Canada has nore than one offical language")
            maxwords = Math.Max(maxwords, WordsOfLine(3).Words.Length)
            WordsOfLine(4) = New classWordsOfLine("Canada has nore than one language")
            maxwords = Math.Max(maxwords, WordsOfLine(4).Words.Length)
            '
            For fromx As Integer = 0 To maxwords - 1
                Dim words(4) As String
                Dim tox As Integer = 0
                For linex As Integer = 0 To WordsOfLine.Length - 1
                    ' if the number of words are less than the current index then don't try it
                    If WordsOfLine(linex).Words.Length - 1 >= fromx Then
                        words(tox) = WordsOfLine(linex).Words(fromx)
                        tox = tox + 1
                    End If
                Next
                ReDim Preserve words(tox - 1)
                ' words now has the words and just the right number of them
                Console.WriteLine(String.Join(" | ", words))
            Next
        End Sub
    
    End Module
    



    Sam Hobbs
    SimpleSamples.Info

    • Marked as answer by myahia72 Thursday, March 1, 2018 11:36 AM
    Wednesday, February 28, 2018 9:35 PM
  • I think the suggestion to ignore lines with missing words may be a good one, since I have about 70 lines from each run and I have 5 runs, so there will be five sets of 70 lines. The likelihood of lines with missing words is low, and ignoring those lines will not affect the results.
    Thursday, March 1, 2018 11:35 AM
  • Actually, the program here does not ignore the lines with missing words. Instead, it adds a word from the next line to the words array, as follows:

    Canda | Canada | Canada | Canada | Canada
    has | has | has | has | has
    more | more | nore | nore | nore
    than | than | than | than | than
    ones | one | one | one | one
    official | oficial | official | offical | language
    language | languages | lnguage | language

    I think this line

    If WordsOfLine(linex).Words.Length - 1 >= fromx Then

    should be changed to

    If WordsOfLine(linex).Words.Length >= maxwords Then
     
    • Edited by myahia72 Thursday, March 1, 2018 1:06 PM
    Thursday, March 1, 2018 12:04 PM
  • Yes the problem of what to do when there is a mismatch in the number of words is a design problem. The solution needs to be defined in the requirements.

    This is obviously a theoretical exercise intended to show a specific methodology not disclosed here. I agree that if the requirements were clarified then the implementation can be improved correspondingly.

    A more realistic implementation would likely include some kind of spell check. A dictionary would help for recognition of words in it. A sophisticated solution could use a Natural Language form of recognition of words that could help match words to columns when there are fewer words. This application could be much more complex so I certainly understand there are fundamental imperfections.
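
    As a rough illustration of the dictionary idea only: the sketch below prefers the most frequent candidate that appears in a word list and falls back to the plain majority otherwise. PickWord and the tiny word list are hypothetical stand-ins, not a real spell checker, and the candidate list is assumed to be non-empty.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module DictionarySketch

    ' Hypothetical helper: prefers the most frequent candidate found in the
    ' word list and falls back to the plain majority when none is known.
    Function PickWord(candidates As IEnumerable(Of String),
                      knownWords As ISet(Of String)) As String
        Dim ranked = candidates.
            GroupBy(Function(w) w).
            OrderByDescending(Function(g) g.Count()).
            Select(Function(g) g.Key).
            ToList()
        Dim known = ranked.FirstOrDefault(Function(w) knownWords.Contains(w))
        Return If(known, ranked.First())
    End Function

    Sub Main()
        ' Stand-in word list; a real one would be loaded from a file.
        Dim knownWords As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase) From {
            "Canada", "has", "more", "than", "one", "official", "language"
        }
        ' Third word of the five sample lines: the misspelling has the majority.
        Dim column = {"more", "more", "nore", "nore", "nore"}
        Console.WriteLine(PickWord(column, knownWords)) ' more
    End Sub

End Module
```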



    Sam Hobbs
    SimpleSamples.Info

    Thursday, March 1, 2018 3:40 PM