locked
Find the longest common substring

    Question

  • Hi, I have many strings which have the same length. The format likes

    TTTATGTCTATAATCCTTACCAAAAGTTACCTTGGAATAAGAAGAAGTCA
    GTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATT
    TGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCT
    GCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGA
    CAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACT
    CAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCA
    GCGGTTTCCAAGAGGATACAGTAGGTGTCTCTATCTCTTCCTATAGTGCA
    GGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGG
    AATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCC
    CTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGA
    CCCTCTCTCCATGGTATTGAGCCTACACATGGTTGTGAAACAAGAATTGC
    TTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACAT
    TCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAAT
    CCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGC
    CATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAG
    AGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTG
    CAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAA
    ACCTTCTCTACTCTCTTCAAACCACTTTTCTTTCACTCTCAGCATATATT
    TTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCT
    GTTTTTTCCCTCCAGCCCATCTAAAATTTACCTGCAACTTAAAAGTGATC
    

    I want to find the longest common substring among them. Thanks for help.

     

     

    Sunday, March 27, 2011 6:32 PM

Answers

  • This is tweaked a bit.  Doesn't run terribly fast, sucks CPU cycles like there's no tomorrow, but is okay on memory, and so far as I can tell (which is different than guaranteed) it seems to work correctly.  I would like for some of the other people posting here in this thread to have a look over this code to point out any flaws, and/or make recommendations for performance improvement.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace ConsoleApplication1
    {
    	class Program
    	{
    		static void Main(string[] args)
    		{
    			string str1 = @"TTTATGTCTATAATCCTTACCAAAAGTTACCTT"; //GGAATAAGAAGAAGTCAGTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATTTGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCTGCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGACAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACTCAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCAGCGGTTTCCAAGAGGATACAGTAGGTGTCTCTATCTCTTCCTATAGTGCAGGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCCCTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGACCCTCTCTCCATGGTATTGAGCCTACACATGGTTGTGAAACAAGAATTGCTTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACATTCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAATCCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGCCATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAGAGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTGCAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAAACCTTCTCTACTCTCTTCAAACCACTTTTCTTTCACTCTCAGCATATATTTTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCTGTTTTTTCCCTCCAGCCCATCTAAAATTTACCTGCAACTTAAAAGTGATC";
    			string str2 = @"TGTCTATCCAAAAGTTACCTTGAATCCTTAGAA"; //TAAGAAGAAGTCATTTAGTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATTTGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCTGCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGACAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACTCAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCAGCGGTTTCCAAGAGGGTGTCTCTATCTCTTCCTATAGTGCAAAGAATTGCAATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCCCTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGACCCTCTCTCCAATACAGTAGTGGTATTGAGCCTACACATGGTTGTGAAACTTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACATTCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAATCCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGCCATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAGAGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTGCAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAAACCTTCTCTACTCTCTTCAAACCACACTCTCAGCATATATTTTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCTGTTTTTTCCCTCCAGCCCATCTAACTGCAACTTAAAAGTGATCCTTTTCTTTGGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATTTAC";
    			string test = @"AGTAAAAAGAAGGCTGTTGT-------------"; //-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------";
    			
    			string longestCommonSubstring = FindLongestCommonSubstring(test, str1, str2);
    
    			if (longestCommonSubstring != null)
    				Console.WriteLine("Found: {0}", longestCommonSubstring);
    			else
    				Console.WriteLine("Did not find ANY common substrings (should never happen)");
    
    			Console.ReadKey();
    		}
    
    		static string FindLongestCommonSubstring(params string[] stringsToCheck)
    		{
    			string longestPattern = String.Empty; ;
    
    			// Since we're looking for the longest possible common substring, 
    			// start with the entire string as a pattern, then decrement the
    			// pattern by one until we find a match in all the strings
    
    			for (int patternLen = stringsToCheck[0].Length; patternLen > 0; patternLen--)
    			{
    				for (int patternStartPos = 0; patternStartPos + patternLen <= stringsToCheck[0].Length; patternStartPos++)
    				{
    					string pattern = stringsToCheck[0].Substring(patternStartPos, patternLen);
    					
    					// If this pattern occurs in all the strings, it is
    					// the longest common substring/sub-pattern
    					if (stringsToCheck.All(x => x.ContainsSimilar(pattern, .8))) // If all the strings contain this pattern where at least 80% of the characters match
    						return pattern;
    				}
    			}
    
    			return null;
    		}
    	}
    
    	public static class Extensions
    	{
    		public static bool ContainsSimilar(this string source, string pattern, double similarity)
    		{
    			for (int matchStartPos = 0; matchStartPos <= source.Length - pattern.Length; matchStartPos++)
    			{
    				// try comparing the pattern to the source with both their starts lined up initially
    				// each iteration of the loop effectively "moves" the pattern one character further
    				// down source for another comparison
    				int matchingChars = 0;
    				int charsRequiredForMatch = (int)Math.Ceiling((double)pattern.Length * similarity);
    				for (int i = 0; i < pattern.Length; i++)
    				{
    					if (source[matchStartPos + i] == pattern[i])
    					{
    						matchingChars++;
    						if ((double)matchingChars / (double)pattern.Length >= similarity)
    							return true;
    					}
    					
    					int charsRemaining = source.Length - (i + 1);
    					if (charsRemaining + matchingChars <= charsRequiredForMatch) // Give up if it's not possible to reach the required number of characters for a match
    						return false;
    				}
    			}
    			return false;
    		}
    	}
    }
    
    
    • Marked as answer by ardmore Wednesday, March 30, 2011 12:32 PM
    Tuesday, March 29, 2011 9:09 PM

All replies

  • What would "Common" mean? The longest string of the same letter?
    Sunday, March 27, 2011 7:24 PM
  • Common means same letters. For example, for the strings string1,string2,string3 and string4 etc. They all have many substrings like AGCT,AGCCGT,CCTGTA,..... They have different length. The shortest one is just one letter such as "A",or "G", or "C", or "T". I want to find the longest one which occuies in all of the strings.
    • Edited by ardmore Sunday, March 27, 2011 7:33 PM add
    Sunday, March 27, 2011 7:32 PM
  • Bad description. Easy answer: the longerst is the whole text!

    I cant see any connection how to look for the longest stirng.

    Sunday, March 27, 2011 7:34 PM
  • No, I mean in different strings. You misunderstood my question. For example

    str1 =

    TTTATGTCTATAATCCTTACCAAAAGTTACCTTGGAATAAGAAGAAGTCA
    GTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATT
    TGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCT
    GCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGA
    CAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACT
    CAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCA
    GCGGTTTCCAAGAGGATACAGTAGGTGTCTCTATCTCTTCCTATAGTGCA
    GGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGG
    AATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCC
    CTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGA
    CCCTCTCTCCATGGTATTGAGCCTACACATGGTTGTGAAACAAGAATTGC
    TTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACAT
    TCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAAT
    CCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGC
    CATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAG
    AGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTG
    CAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAA
    ACCTTCTCTACTCTCTTCAAACCACTTTTCTTTCACTCTCAGCATATATT
    TTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCT
    GTTTTTTCCCTCCAGCCCATCTAAAATTTACCTGCAACTTAAAAGTGATC
    

     

    str2 =TGTCTATCCAAAAGTTACCTTGAATCCTTAGAATAAGAAGAAGTCATTTA
    GTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATT
    TGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCT
    GCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGA
    CAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACT
    CAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCA
    GCGGTTTCCAAGAGGGTGTCTCTATCTCTTCCTATAGTGCAAAGAATTGC
    AATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCC
    CTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGA
    CCCTCTCTCCAATACAGTAGTGGTATTGAGCCTACACATGGTTGTGAAAC
    TTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACAT
    TCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAAT
    CCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGC
    CATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAG
    AGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTG
    CAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAA
    ACCTTCTCTACTCTCTTCAAACCACACTCTCAGCATATATT
    TTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCT
    GTTTTTTCCCTCCAGCCCATCTAACTGCAACTTAAAAGTGATCCTTTTCTTT
    GGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATTTAC

     

    str1 is different str2 but they have same length. I want to find the longest substring between str1 and str2.

     

     

    Sunday, March 27, 2011 8:02 PM
  • You\d need to write a simple algorithm that creates every possible substring combination in string A and try to finf the longest one that matches in string B.

    You'd probably want to optimisz a bit, as you probably have an idea of the amount of similarity between A and B. That way you could start near your expected common length and work your way up or down from there/

    The following isn't optimal, but it works:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace ConsoleApplication3
    {
      public class UniqueSubstringEnumerator : IEnumerator<String>, IEnumerable<string>
      {
        private int lpos;
        private int rpos;
        private string source;
    
        public UniqueSubstringEnumerator(string source)
        {
          this.source = source;
          lpos = 0;
          rpos = lpos + 1;
        }
    
        #region IEnumerator<string> Members
    
        public string Current
        {
          get { return source.Substring(lpos, rpos - lpos); }
        }
    
        #endregion
    
        #region IDisposable Members
    
        public void Dispose()
        {
        }
    
        #endregion
    
        #region IEnumerator Members
    
        object System.Collections.IEnumerator.Current
        {
          get { return this.Current; }
        }
    
        public bool MoveNext()
        {
          if (rpos >= source.Length)
          {
            if (lpos < source.Length - 1)
            {
              lpos = lpos + 1;
              rpos = lpos + 1;
              return true;
            }
            else
            {
              return false;
            }
          }
          else
          {
            rpos = rpos + 1;
            return true;
          }
        }
    
        public void Reset()
        {
          rpos = 1;
          lpos = 0;
        }
    
        #endregion
    
        #region IEnumerable<string> Members
    
        public IEnumerator<string> GetEnumerator()
        {
          return this;
        }
    
        #endregion
    
        #region IEnumerable Members
    
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
          return this;
        }
    
        #endregion
      }
    }
    
    

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace ConsoleApplication3
    {
      class Program
      {
        static void Main(string[] args)
        {
          string sourceA = "abcde";
          string sourceB = "acdeb";
          
          HashSet<string> partsB = new HashSet<string>(new UniqueSubstringEnumerator(sourceB).ToList().Distinct());
          
          var longest = from part in new UniqueSubstringEnumerator(sourceA)
                 where partsB.Contains(part)
                 orderby part.Length descending
                 select part;
    
          string longestMatch = longest.FirstOrDefault();
        }
      }
    }
    
    


    Recipient of the Microsoft Community Contributor Award 2011.
    Sunday, March 27, 2011 11:24 PM
  • a faster way would probably be to write a statemachine that does the matching directly on the char[]'s of both strings (or by using a MemoryStream), but the algorithm would remain mostly the same.
    Recipient of the Microsoft Community Contributor Award 2011.
    Sunday, March 27, 2011 11:26 PM
  • Hi

    check the code below.
    Bet this is about DNA-matching or homologue sequences.
    The max substring for the two samples strings is 326 characters long
    and starts at pos 549 (str1) and 499 (str2) resp.

    Christoph
    namespace ConsoleSubString
    {
      using System;
    
      /// <summary>
      /// Class that find the substring of maximum length shared by two strings
      /// and keeps the result
      /// </summary>
      public class MaxCommonSubstringFinder
      {
        /// <summary>
        /// the first string to compare
        /// </summary>
        public readonly string String1;
    
        /// <summary>
        /// the second string to compare
        /// </summary>
        public readonly string String2;
    
        /// <summary>
        /// the maximum length substring shared by <see cref="String1" /> and <see cref="String2" />
        /// </summary>
        public readonly string Substring;
    
        /// <summary>
        /// the length of <see cref="Substring" />
        /// </summary>
        public int Length { get { return Substring.Length; } }
    
        /// <summary>
        /// the start position of <see cref="Substring" /> in <see cref="String1" />
        /// </summary>
        public readonly int StartPos1;
    
        /// <summary>
        /// the start position of <see cref="Substring" /> in <see cref="String2" />
        /// </summary>
        public readonly int StartPos2;
    
        public MaxCommonSubstringFinder(string str1, string str2)
        {
          String1 = str1;
          String2 = str2;
    
          if (string.IsNullOrEmpty(str1) || string.IsNullOrEmpty(str2))
          {
            Substring = string.Empty;
            StartPos1 = -1;
            StartPos2 = -1;
          }
          else
          {
            Substring = FindMaxCommonSubstring(str1, str2, out StartPos1, out StartPos2);
          }
        }
        private static string FindMaxCommonSubstring(string str1, string str2, out int index1, out int index2)
        {
          string lastMatch = null;
          int lastLength = -1;
    
          index1 = -1;
          index2 = -1;
    
          for (int start = 0; start < str1.Length; start++)
          {
            for (int length = 0; length < str1.Length - start; length++)
            {
              string sub = str1.Substring(start, length);
              int pos = str2.IndexOf(sub);
              if (pos < 0)
                break;
    
              if (length > lastLength)
              {
    
                index1 = start;
                index2 = pos;
                lastMatch = sub;
                lastLength = length;
              }
            }
          }
          return lastMatch;
        }
    
      }
      class Program
      {
    
        static void Main(string[] args)
        {
    
           string str1 = "TTTATGTCTATAATCCTTACCAAAAGTTACCTTGGAATAAGAAGAAGTCA" +
                   "GTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATT" +
                   "TGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCT" +
                   "GCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGA" +
                   "CAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACT" +
                   "CAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCA" +
                   "GCGGTTTCCAAGAGGATACAGTAGGTGTCTCTATCTCTTCCTATAGTGCA" +
                   "GGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGG" +
                   "AATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCC" +
                   "CTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGA" +
                   "CCCTCTCTCCATGGTATTGAGCCTACACATGGTTGTGAAACAAGAATTGC" +
                   "TTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACAT" +
                   "TCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAAT" +
                   "CCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGC" +
                   "CATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAG" +
                   "AGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTG" +
                   "CAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAA" +
                   "ACCTTCTCTACTCTCTTCAAACCACTTTTCTTTCACTCTCAGCATATATT" +
                   "TTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCT" +
                   "GTTTTTTCCCTCCAGCCCATCTAAAATTTACCTGCAACTTAAAAGTGATC";
    
          string str2 = "TGTCTATCCAAAAGTTACCTTGAATCCTTAGAATAAGAAGAAGTCATTTA" +
                   "GTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATT" +
                   "TGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCT" +
                   "GCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGA" +
                   "CAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACT" +
                   "CAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCA" +
                   "GCGGTTTCCAAGAGGGTGTCTCTATCTCTTCCTATAGTGCAAAGAATTGC" +
                   "AATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCC" +
                   "CTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGA" +
                   "CCCTCTCTCCAATACAGTAGTGGTATTGAGCCTACACATGGTTGTGAAAC" +
                   "TTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACAT" +
                   "TCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAAT" +
                   "CCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGC" +
                   "CATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAG" +
                   "AGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTG" +
                   "CAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAA" +
                   "ACCTTCTCTACTCTCTTCAAACCACACTCTCAGCATATATT" +
                   "TTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCT" +
                   "GTTTTTTCCCTCCAGCCCATCTAACTGCAACTTAAAAGTGATCCTTTTCTTT" +
                   "GGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATTTAC";
    
          var mcss = new MaxCommonSubstringFinder(str1, str2);
          Console.WriteLine(mcss.Substring);
          Console.WriteLine(mcss.Length);
          Console.WriteLine(mcss.StartPos1);
          Console.WriteLine(mcss.StartPos2);
          Console.ReadKey(true);
        }
      }
    }
    
    

    Monday, March 28, 2011 12:07 AM
  • Thanks for everybody. This example is just for comparation between two strings. Actually I have 100 DNA sample strings, what is the idea for the solution? 
    Monday, March 28, 2011 12:11 AM
  • You need the maximum overlap between 100 DNA-samples?
    This is likely not too performant but returns the searched result

      public class MaxCommonSubstringFinder
      {
        public readonly List<string> InputStrings;
        public readonly List<int> MatchStartPositions;
    
        public readonly string MaxSubstring;
        public int Length { get { return MaxSubstring == null ? 0 : MaxSubstring.Length; } }
    
        public MaxCommonSubstringFinder(IEnumerable<string> strings)
        {
          this.InputStrings = new List<string>(strings);
    
          if (strings == null || strings.Count() == 0 || strings.Any(s => string.IsNullOrEmpty(s)))
          {
            this.MatchStartPositions = null;
            MaxSubstring = null;
          }
          else
          {
            MaxSubstring = FindMaxCommonSubstring(this.InputStrings, out this.MatchStartPositions);
          }
        }
        private static string FindMaxCommonSubstring(IList<string> strings, out List<int> startPositions)
        {
          string lastMatch = null;
          int lastLength = -1;
    
          string str1 = strings[0];
          var otherStrings = strings.Skip(1);
    
          for (int start = 0; start < str1.Length; start++)
          {
            for (int length = 0; length < str1.Length - (start -1); length++)
            {
              string sub = str1.Substring(start, length);
              if (otherStrings.All(s => s.IndexOf(sub) > -1) && length > lastLength)
              {
                lastMatch = sub;
                lastLength = length;
              }
            }
          }
          if (lastMatch != null)
          {
            startPositions = new List<int>(strings.Count);
            startPositions.AddRange(strings.Select(t => t.IndexOf(lastMatch)));
          }
          else
          {
            startPositions = null;
          }
          return lastMatch;
        }
    }
    
        static void Main(string[] args)
        {
    
          var strings = new string[] { "xxxxxBCDEyyy", "aaaBCDEzzz", "gggggggBCDE", "1.2.-3.-IV--BCDE#~~"};
    
          var mcss = new MaxCommonSubstringFinder(strings);
          Console.WriteLine("MaxSubstring: {0}", mcss.MaxSubstring);
          Console.WriteLine("Length: {0}", mcss.Length);
          Console.WriteLine("Startpositions: {0}", string.Join(", ", mcss.MatchStartPositions.Select(i => i.ToString())));
    
          Console.ReadKey(true);
        }
    }
    

    Monday, March 28, 2011 1:22 AM
  • Any update?

    Best Regards,


    Larcolais Gong[MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Please remember to mark the replies as answers if they help and unmark them if they provide no help.

    Monday, March 28, 2011 5:25 AM
  • You can break your inner loop as soon as you don't find a match, and start that loop at lastLength+1. You can break the outer loop if there are not enough characters to find a longer match.

    for (int start = 0; start < str1.Length; start++)
    {
        if (str1.Length - (start - 1) < lastLength) break;
    
        for (int length = lastLength + 1; length < str1.Length - (start -1); length++)
        {
            string sub = str1.Substring(start, length);
            if (otherStrings.Any(s => s.IndexOf(sub) == -1)
                break;
    
            if (length > lastLength)
            {
                lastMatch = sub;
                lastLength = length;
            }
        }
    }
    

     It should add a little speed.

     

    Monday, March 28, 2011 9:06 AM
  • Hi Christoph Basedau,

    Could you please explain your algorithm logic a little bit so I can understand the idea better? Copy and paste code is easy but hmm.. 

     


    • Edited by ardmore Monday, March 28, 2011 12:44 PM typo
    Monday, March 28, 2011 12:44 PM
  • Hmm, it doesnt look so easy at all :)
    Monday, March 28, 2011 1:48 PM
  • Because I have 100 samples, so it is impossible to find exactly match at all except only shorter substrings. If I change the condition: find at least 80% match the longest substring. How to modify the code?

    For example.

     9 TGGGTGGC   16
      | | | | | |.|
     4 TGGGTGCC   11 
    
    We found in string 1 position 9 to 16, in string 2 from position 4 to 11 the substrings are very similar. They have 7 letters match in the total 8 characters.

    Monday, March 28, 2011 1:52 PM
  • Hi ardmore,

    Of course I can explain it to you. The algorithm is pretty easy,
    it simply examines all possible/relevant substrings.

    First of all I use a class, to do the examination and I
    "return" the result as a property "MaxSubstring". The class also returns
    the "StartPositions" of the common substring for each input string.
    And it also stores the original "InputStrings"

    The algorithm can be described as follows:
    I take the first string of all input strings to create and examine any of its possible
    substrings.
    This is accomplished by two loops, an outer one that increments
    the startposition argument and an inner one that increments the length argument
    of the Substring method.

    Based on these two variables a new substring 'sub' is retrieved inside each iteration
    of the inner loop. Any such substring 'sub' is checked to be contained by all of the other input strings -
    This is done by the linq expression "All(s =>s.IndexOf(sub) > -1)", which
    can also expressed negatively: if (otherStrings.Any(s => s.IndexOf(sub) == -1 ) break;"

    If a subtring is "common", i.e. contained by all, I store it and its length in temporary variables named
    "lastMatch" and "lastLength". The lastLength variable will identify the longest common substring.
    lastMatch is the match/substring itself.
    Basically this simple test is performed for all following substrings.
    If there is another "common" substring it must exceed the current value of "lastLenght"
    to be stored as the current result (in lastMatch/lastLength).

    And that's it.

    After having created and checked all substrings, we have found the winner
    in the variable lastMatch.
    If there is no common substring lastMatch will be null.


    Concerning performance one can do some optimizations
    by not executing those loops that logically have no chance to yield a
    longer common substring:
    I.e. if the current value of "length" is smaller then 'lastLength',
    the substring cannot be the longer then the last match. So the inner loop starts with
    for (int length = lastLength + 1;...) . in Louis optimized version.

    Also at the end of the outer loop, there will be a point when the length argument
    cannot exceed lastLength, then the outer loop is exited.

    So the code you could use is the one Louis posted (Thanks @Loius!),
    it contains the speed-optimizations.

    Note that one only has do build and test all substrings of one input string only,
    not of all.


    Chris


    Monday, March 28, 2011 5:28 PM
  • Okay. You used

    if

    (otherStrings.All(s => s.IndexOf(sub) > -1) && length > lastLength)

    to find the exact match but how can I apply the priciple to at least 80% match?

    Monday, March 28, 2011 9:06 PM
  • Are you looking for strings that share only 80% of homology / identity
    such as:

    TTAAGCGCCG
    CCAAGCGCCG

    So you are looking for similarity not equality?
    This is much more expensive to implement.

    Also you need to precisely specify the meaning of similarity.
    It not to difficult if you compare only two strings.

    But if you have multiple strings, and let's say
    str1 and str2 share a 50-character-sequence of 90%-similarity (45 chars equal)
    also str1 and str3 have such a sequence,
    but the similarity between str2 and str3 is only 65%.
    How ill you rate the whole series?
    And what if all 3 pairs have a 90% sequence, but it's a different sequence
    for each pair, do these strings pass your test?

     

    Monday, March 28, 2011 11:44 PM
  • I use .net 3.5 and I have to find at least 80% (>=0.8) of homology/identity. It seems we must use loop rather than linq something?
    Tuesday, March 29, 2011 12:09 AM
  • You can use Linq but you need to write your own "evaluate similarity" method that you pass to Any/All, instead of
    IndexOf / Equals / Contains or such like.

    if
    (otherStrings.All(s => IsSimilarTo(s, sub, 0.8)) && length > lastLength)

    where IsSimilarTo is the TODO



     


    Tuesday, March 29, 2011 12:16 AM
  • Similar means not only chars equal but also position aligement

    TAC
    TGC

    T and T are similar, A nad G are not. C and C are similar.

    Thus the similarity rato=2/3

    TAC
    CAT

    In this format although chars are equal but positions are not lined up. Thus similarity rato=0

    Tuesday, March 29, 2011 12:51 AM
  • Hi
    thats clear for two strings.
    But if you have these samples:

    TAC
    TGC
    TGG

    you have a 66%  ratio bewteen 1. and 2.
    and 2. and 3.
    but 1. and 3. have only 33%
    So assuming the limit is at least 66% similarity
    will the test series in total be a match or not?





    Tuesday, March 29, 2011 1:01 AM
  • Hi
    thats clear for two strings.
    But if you have these samples:

    TAC
    TGC
    TGG

    you have a 66%  ratio bewteen 1. and 2.
    and 2. and 3.
    but 1. and 3. have only 33%
    So assuming the limit is at least 66% similarity
    will the test series pass the test or not?





    No, every ratio must >=0.8 here 66% on your assumption.
    Tuesday, March 29, 2011 1:10 AM
  • Yes I know, but 80% don't make too much sense in a 3-letter example,
    where you can have only 0%, 33% 66% and 100% matches (and thus 80% is as much as 100%)

    Still you have to specify what you actually want to compare:
    99 samples against one reference sample
    or all 100 samples with each other.

    What I wanted to point out by the prevoius post is: You can have the a 80% sequence
    between A and B, also B and C, also C and D, but that doesn't mean that you have one sequence
    that is homolgue by 80% and that is shared by all test samples




    Tuesday, March 29, 2011 8:21 AM
  • It should compare samples with each other rather than against one reference. My string is very long( 6000 chars but only four letters A,G,C,T).

    I just worry the CPU time.

    Tuesday, March 29, 2011 1:06 PM
  • The score is not based entire string.

    ACTGGTA 
    ACTTGCT

    In the above example, ACTGG is the result.

    ACTGG 
     | | | . | 
    ACTTG

    There are 4 letters match based on 5.

    The result is 4/5=0.8 not 4/7




    • Edited by ardmore Tuesday, March 29, 2011 1:17 PM add
    Tuesday, March 29, 2011 1:16 PM
  • We can try against one reference if cost cpu too much. 
    Tuesday, March 29, 2011 6:23 PM
  • Can you look at my code?
    List<string> common = new List<string>();
            int count = 0;
            foreach (string item1 in strings)
            {
              count++;
              Console.WriteLine("Processing " + count);
              for (int i = 0; i < item1.Length - 1; i++)
              {
                int substringlength = item1.Length - i;
                List<string> substrings1 = new List<string>();
                substrings1 = SubstringsFinder(item1, substringlength);
                for (int m = 0; m < substrings1.Count; m++)
                {
                  string source = substrings1[m];
                  foreach (string item2 in strings)
                  {
                    if (item1 == item2)
                      break;
                    List<string> substrings2 = new List<string>();
                    substrings2 = SubstringsFinder(item2, substringlength);
                    for (int n = 0; n < substrings2.Count; n++)
                    {
                      string target = substrings2[n];
                      string temp = SubstringMatch(source, target);
                      if (temp != null)
                        common.Add(temp);
                    }
                  }
                }
              }
    
    static List<string> SubstringsFinder(string line, int m)
        {
          List<string> lineList = new List<string>();
    
          for (int i = 0; i < line.Length - m + 1; i++)
          {
            lineList.Add(line.Substring(i, m));
          }
          return lineList;
        }
        static string SubstringMatch(string source, string target)
        {
          int count = 0;
          for (int i = 0; i < source.Length; i++)
          {
            if (source[i] == target[i])
              count++;
          }
          if (count / source.Length >= 0.8)
            return source;
          else
            return null;
        }
    It looks something wrong. Much cpu time no result.
    Tuesday, March 29, 2011 6:27 PM
  • The following methods can be used to calculate/evaluate string similarity:

    The first calculates the similarity between two strings.

    The other is more efficient, it takes an extra parameter which is the amount of similarity that is wanted and aborts early if such similarity cannot be achieved. This method only works if two substrings are of the same length. 

    You can use this similarity method instead of a String.Equals or Substring.Contains from the previous methods.

     

    call them like this:

    bool areSimilar = Similarity("TTTTT", "TTTCT") >= 0.80M

    or 

    bool areSimilar = AreSimilar("TTTTT", "TTTCT", 0.80M);

     

    public static decimal Similarity(string a, string b)
            {
                if (String.IsNullOrEmpty(a) || string.IsNullOrEmpty(b))
                {
                    throw new ArgumentNullException();
                }
     
                if (a.Length != b.Length)
                {
                    throw new InvalidOperationException();
                }
     
                a = a.ToUpperInvariant();
                b = b.ToUpperInvariant();
     
                decimal similarPairs = 0;
                for (int i = 0; i < a.Length; i++)
                {
                    if (a[i] == b[i])
                    {
                        similarPairs++;
                    }
                }
     
                return similarPairs / a.Length * 100;
            }
     
            public static bool AreSimilar(string a, string b, decimal minSimilarity)
            {
                if (String.IsNullOrEmpty(a) || string.IsNullOrEmpty(b))
                {
                    throw new ArgumentNullException();
                }
     
                if (a.Length != b.Length)
                {
                    throw new InvalidOperationException();
                }
     
                int minimumMatches = (int)Math.Ceiling(minSimilarity * a.Length);
                a = a.ToUpperInvariant();
                b = b.ToUpperInvariant();
     
                int similarPairs = 0;
                for (int i = a.Length - 1; i >= 0; i--)
                {
                    if (a[i] == b[i])
                    {
                        similarPairs++;
                    }
     
                    if (similarPairs >= minimumMatches)
                    {
                        return true;
                    }
     
                    if ((minimumMatches - similarPairs) > i)
                    {
                        return false;
                    }
                }
     
                return false;
            }

     


    Recipient of the Microsoft Community Contributor Award 2011.
    Tuesday, March 29, 2011 6:56 PM
  • Some of the other posts here pushed me in this direction, so I can't take full credit for it if it works (and won't take any credit if it doesn't).

    Since you're looking for the longest common substring, I figured you could start with checking if str1.Substring(0, str1.Length) occurs in all the other strings, then decrement the length each iteration of the loop, such that next you'd check if str1.Substring(0, str1.Lengt - 1) is in all strings, then str1.Substring(1, str.Length - 1), then str1.Substring(0, str.Length - 2), str1.Substring(1, str.Length - 2), str1.Substring(2, str.Length - 2), etc.  Then the first substring common to all the strings will necessarily be the longest common substring.

     

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace ConsoleApplication1
    {
    	class Program
    	{
    		static void Main(string[] args)
    		{
    			string str1 = @"TTTATGTCTATAATCCTTACCAAAAGTTACCTTGGAATAAGAAGAAGTCAGTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATTTGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCTGCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGACAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACTCAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCAGCGGTTTCCAAGAGGATACAGTAGGTGTCTCTATCTCTTCCTATAGTGCAGGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCCCTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGACCCTCTCTCCATGGTATTGAGCCTACACATGGTTGTGAAACAAGAATTGCTTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACATTCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAATCCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGCCATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAGAGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTGCAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAAACCTTCTCTACTCTCTTCAAACCACTTTTCTTTCACTCTCAGCATATATTTTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCTGTTTTTTCCCTCCAGCCCATCTAAAATTTACCTGCAACTTAAAAGTGATC";
    			string str2 = @"TGTCTATCCAAAAGTTACCTTGAATCCTTAGAATAAGAAGAAGTCATTTAGTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATTTGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCTGCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGACAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACTCAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCAGCGGTTTCCAAGAGGGTGTCTCTATCTCTTCCTATAGTGCAAAGAATTGCAATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCCCTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGACCCTCTCTCCAATACAGTAGTGGTATTGAGCCTACACATGGTTGTGAAACTTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACATTCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAATCCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGCCATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAGAGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTGCAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAAACCTTCTCTACTCTCTTCAAACCACACTCTCAGCATATATTTTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCTGTTTTTTCCCTCCAGCCCATCTAACTGCAACTTAAAAGTGATCCTTTTCTTTGGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATTTAC";
    			string test = @"-------------------------------------------------AGTAAAAAGAAGGCTGTTGTTCCGTGAAATA--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------";
    
    			string longestCommonSubstring = FindLongestCommonSubstring(str1, str2, test);
    
    			if (longestCommonSubstring != null)
    				Console.WriteLine("Found: {0}", longestCommonSubstring);
    			else
    				Console.WriteLine("Did not find ANY common substrings (should never happen)");
    
    			Console.ReadKey();
    		}
    
    		static string FindLongestCommonSubstring(params string[] stringsToCheck)
    		{
    			string longestPattern = String.Empty; ;
    
    			// Since we're looking for the longest possible common substring, 
    			// start with the entire string as a pattern, then decrement the
    			// pattern by one until we find a match in all the strings
    
    			for (int patternLen = stringsToCheck[0].Length; patternLen > 0; patternLen--)
    			{
    				for (int patternStartPos = 0; patternStartPos + patternLen < stringsToCheck[0].Length; patternStartPos++)
    				{
    					string pattern = stringsToCheck[0].Substring(patternStartPos, patternLen);
    					// If this pattern occurs in all the strings, it is
    					// the longest common substring/sub-pattern
    					if (stringsToCheck.All(x => x.Contains(pattern)))
    						return pattern;
    				}
    			}
    
    			return null;
    		}
    	}
    }
    
    
    Tuesday, March 29, 2011 7:22 PM
  • In 'SubstringMatch' you should add a cast to the line:

      if (count / source.Length >= 0.8)

    should change to:

      if (count / (double)source.Length >= 0.8)

    to get it to work. Otherwise the fraction is always 0 (because int / int returns int)



    I think you can condense your code to:

    public static void Main()
        {
    
          List<string> strings = /* get the DNA samples here */;
    
          List<string> common = new List<string>();
          foreach (string item1 in strings)
          {
            foreach (string item2 in strings)
            {
              if (item1 == item2)
                continue;
    
              for (int i = 0; i < item1.Length - 1; i++)
              {
                int substringLength = item1.Length - i;
                var substrings1 = SubstringsFinder(item1, substringLength);
                var substrings2 = SubstringsFinder(item2, substringLength);
                
                substrings1.ForEach(source => substrings2.ForEach(target =>
                                         {
                                           if (IsSubstringMatch(source, target))
                                             common.Add(source);
                                         }));
              }
            }
          }
        }
        
    
        private static List<string> SubstringsFinder(string line, int length)
        {
          int iterations = line.Length - length + 1;
          List<string> lineList = new List<string>(iterations);
          for (int i = 0; i < iterations; i++)
          {
            lineList.Add(line.Substring(i, length));
          }
          return lineList;
        }
        private static bool IsSubstringMatch(string source, string target)
        {
          int count = 0;
          for (int i = 0; i < source.Length; i++)
          {
            if (source[i] == target[i])
              count++;
          }
          return (count / (double) source.Length >= 0.8);
        }

    The performance will be probably still be poor since there are
    myriards of substrings to be created.
    Maybe you don't need to store the substrings but create them on the fly.





    Tuesday, March 29, 2011 7:42 PM
  • Ahh, I should have read all the posts. ;)
    Tuesday, March 29, 2011 8:01 PM
  • Okay, I want to steal your code and make some cheating tricks. First I want to find longest common substring since it look easily. Then since the score goal is 0.8, then we can expand the pattern(to add chars on the left or right) a little. I am still thinking...
    Tuesday, March 29, 2011 8:01 PM
  • This is tweaked a bit.  Doesn't run terribly fast, sucks CPU cycles like there's no tomorrow, but is okay on memory, and so far as I can tell (which is different than guaranteed) it seems to work correctly.  I would like for some of the other people posting here in this thread to have a look over this code to point out any flaws, and/or make recommendations for performance improvement.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace ConsoleApplication1
    {
    	class Program
    	{
    		static void Main(string[] args)
    		{
    			string str1 = @"TTTATGTCTATAATCCTTACCAAAAGTTACCTT"; //GGAATAAGAAGAAGTCAGTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATTTGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCTGCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGACAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACTCAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCAGCGGTTTCCAAGAGGATACAGTAGGTGTCTCTATCTCTTCCTATAGTGCAGGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCCCTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGACCCTCTCTCCATGGTATTGAGCCTACACATGGTTGTGAAACAAGAATTGCTTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACATTCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAATCCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGCCATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAGAGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTGCAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAAACCTTCTCTACTCTCTTCAAACCACTTTTCTTTCACTCTCAGCATATATTTTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCTGTTTTTTCCCTCCAGCCCATCTAAAATTTACCTGCAACTTAAAAGTGATC";
    			string str2 = @"TGTCTATCCAAAAGTTACCTTGAATCCTTAGAA"; //TAAGAAGAAGTCATTTAGTAAAAAGAAGGCTGTTGTTCCGTGAAATACTGTCTTTATGCCTCAGATTTGGAGTGCTCAGAGCCTCTGCAGCAAAGATTTGGCATGTGTCCTAGGCCTGCTCAGAGCAGCAAATCCCACCCTCTTGGAGAATGAGACTCATAGAGGGACAGCTCCCTCCTCAGAGGCTTCTCTAATGGGACTCCAAAGAGCAAACACTCAGCCCCATGAGGACTGGCCAGGCCAAGTGGTGTGTGGGAACAGGGAGCAGCGGTTTCCAAGAGGGTGTCTCTATCTCTTCCTATAGTGCAAAGAATTGCAATAGGGTTCAGGGCAGTGATAAATTCCAATGCTAATGTGGGATGAACCCCTGGGGGCTTCACCTCCTTCTTTTAGGATGTCTGTGGAAATTGAGAAAGACCCTCTCTCCAATACAGTAGTGGTATTGAGCCTACACATGGTTGTGAAACTTTGACTGAATCTCACTTTTTTCTAGGTGGACTCACAGAGTCCCTTACATTCATTATTAGTTCATTCGTTCATCCACTCATTATCTTACTCTGTATCAATCCCAGGGCTGAGAACTTACTTTCCCCATTGTCCTCATTTATCTAACAAGCCATTCTCCTATTGACCTCATTTTCTTAGTCTGATGAACACTGGCTGCTAGAGAAAATCTTACCAATAAGCAGATTGACTTCAGTGAATATTCATAGTCTGCAAGTTCAACTGGCTCTTTGAATGCCTGGTCTTTCCTTTGACTACCTTAAACCTTCTCTACTCTCTTCAAACCACACTCTCAGCATATATTTTCACTCTCCTATTTCCAAATAAAATAGAATTCATCAGACCGTTTCCCCTGTTTTTTCCCTCCAGCCCATCTAACTGCAACTTAAAAGTGATCCTTTTCTTTGGCTCCAAAGCTTATCAGTGCTCCATGGCCTGCATTGAGCAAGGTCCTGGAATTTAC";
    			string test = @"AGTAAAAAGAAGGCTGTTGT-------------"; //-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------";
    			
    			string longestCommonSubstring = FindLongestCommonSubstring(test, str1, str2);
    
    			if (longestCommonSubstring != null)
    				Console.WriteLine("Found: {0}", longestCommonSubstring);
    			else
    				Console.WriteLine("Did not find ANY common substrings (should never happen)");
    
    			Console.ReadKey();
    		}
    
    		static string FindLongestCommonSubstring(params string[] stringsToCheck)
    		{
    			string longestPattern = String.Empty; ;
    
    			// Since we're looking for the longest possible common substring, 
    			// start with the entire string as a pattern, then decrement the
    			// pattern by one until we find a match in all the strings
    
    			for (int patternLen = stringsToCheck[0].Length; patternLen > 0; patternLen--)
    			{
    				for (int patternStartPos = 0; patternStartPos + patternLen <= stringsToCheck[0].Length; patternStartPos++)
    				{
    					string pattern = stringsToCheck[0].Substring(patternStartPos, patternLen);
    					
    					// If this pattern occurs in all the strings, it is
    					// the longest common substring/sub-pattern
    					if (stringsToCheck.All(x => x.ContainsSimilar(pattern, .8))) // If all the strings contain this pattern where at least 80% of the characters match
    						return pattern;
    				}
    			}
    
    			return null;
    		}
    	}
    
    	public static class Extensions
    	{
    		public static bool ContainsSimilar(this string source, string pattern, double similarity)
    		{
    			for (int matchStartPos = 0; matchStartPos <= source.Length - pattern.Length; matchStartPos++)
    			{
    				// try comparing the pattern to the source with both their starts lined up initially
    				// each iteration of the loop effectively "moves" the pattern one character further
    				// down source for another comparison
    				int matchingChars = 0;
    				int charsRequiredForMatch = (int)Math.Ceiling((double)pattern.Length * similarity);
    				for (int i = 0; i < pattern.Length; i++)
    				{
    					if (source[matchStartPos + i] == pattern[i])
    					{
    						matchingChars++;
    						if ((double)matchingChars / (double)pattern.Length >= similarity)
    							return true;
    					}
    					
    					int charsRemaining = source.Length - (i + 1);
    					if (charsRemaining + matchingChars <= charsRequiredForMatch) // Give up if it's not possible to reach the required number of characters for a match
    						return false;
    				}
    			}
    			return false;
    		}
    	}
    }
    
    
    • Marked as answer by ardmore Wednesday, March 30, 2011 12:32 PM
    Tuesday, March 29, 2011 9:09 PM
  • Using this one get a wrong result.

     

    var strings = new string[] { "456ABCDEyyy", "891ABCDEljk", "1.3-SBCDE#" };

    The pattern is "ABCDE".

    Wednesday, March 30, 2011 2:45 PM
  • Cool!  Glad to hear it's solved. 

    Have a nice day, all!


    Michael Sun [MSFT]
    MSDN Community Support | Feedback to us
    Get or Request Code Sample from Microsoft
    Monday, April 04, 2011 1:18 AM
    Moderator
  • Hi,

       Have you looked at BLAST? It has been heavily optimized and runs well for me. It will find the best overall match, not the longest substring, but I suspect you could get the longest substring by setting the open gap penalty large enough. BLAST is freely available from NCBI at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.

    I have worked out running BLAST from C#.net if you want to discuss further.

     

     

       I have done similar stuff just by brute force and it is slow but works. I think that some of the other answers are essentially the same as this.

     

    Something like:

    string sequence1 = "CAGCATAC..."

    string sequenc2= "CGACATCAGACT..."

    for(int currentLength = sequence1.length;currentLength  >0; currentLength--)

    {

    foreach (int position = 1; position + currentLength  <= sequence1.length; position++)

    {
    if(sequence2.contains(sequence1.substring(position,currentLength)))

    {

    console.write("Foundit!" + sequence1.substring(position,currentLength))

    break;
    }

    }

    }

    Ethan

    ethan.strauss@promega.com

     


    Ethan Strauss
    Wednesday, April 06, 2011 7:13 PM
  • Under the help of friends, I got it. But I really don't know about BLAST. Could you please tell something of it?
    Wednesday, April 06, 2011 7:54 PM
  • HI,

       BLAST is a program used for matching DNA sequences. There is description of it here
    http://www.ncbi.nlm.nih.gov/blast/blast_help.shtml.

    It is not really optimized for longest subsequences, but it will work and is available. If you want to discuss in detail, why don't you contact me offline as I think this will get us into bioinformatics rather than C#.

    Ethan

    ethan.strauss@promega.com

     


    Ethan Strauss
    Thursday, April 07, 2011 3:42 PM