C# RegEx Pattern

  • Question

  • Hi,



    I need to create a Regex pattern that filters files by name.

    I have Files in Directory with names like this:

    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADJMENT_REPORT.20190331.20190227_072039.csv.txt
    SIT.SERTGHportSD.SIT - CV VAN RETAIL CARD...ADHAR__RISK_BY_RATING.20180630.20181018_134810.csv.txt  
    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADHAR_RISK_BY_PD.20180630.20181016_065751.csv.txt
    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADHAR_RISK_BY_PD.20180630.20181018_105254.csv.txt
    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADHAR_RISK_BY_PD.20183456.20181018_105254.csv.txt   
    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADHAR_RISK_BY_RATING..csv - Copy.txt
    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADHAR_RISK_BY_PD.20180630.20181022_040721.csv.txt
    SIT.SERTGHportSD.SIT - CV VAN RETAIL...ADHAR__RISK_BY_RATING.20180630.20181016_074829.csv.txt


    I need to get all the files whose name contains (VAN RETAIL) and whose first date is a valid date (e.g. 20190331). There are 8 files in total, but the pattern should exclude row 2 (name does not match), row 5 (first date is not a valid date) and row 6 (no date), and return the remaining 5 files.

    I tried the following, but it still returns rows 5 and 6.

    Regex reg = new Regex(@"\bVAN RETAIL\.", RegexOptions.IgnoreCase); 

    Any help is appreciated.

    Thanks
    Thursday, June 13, 2019 5:16 PM

Answers

  • Is there any particular reason you're stuck on using RE? Can it be done with RE? Yes. Will it be readable, maintainable and be usable if the rules change later? No.

    //Matches all strings with VAN RETAIL in them followed by any non-digits followed by an 8 digit number
    "VAN RETAIL[^0-9]*[0-9]{8}"
    

    This would match all but 6 because it doesn't have the date. 2 is still valid because it contains VAN RETAIL. 5 is valid because it is an 8 digit number.

    It is unclear whether your ... are because you're leaving stuff out or actually part of the filename. If you want to filter out row 2 AND the ... is part of the filename then a tweak to the RE allows that.

    //Matches all strings with VAN RETAIL followed by a dot followed by any non-digits followed by an 8 digit number
    "VAN RETAIL\.[^0-9]*[0-9]{8}"
    Rows 2 and 6 are excluded.
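
    Row 5 still gets through, because the pattern only checks for 8 digits. One way to drop it is to capture the digits and let DateTime decide whether they form a real date. The following is a minimal sketch, not a definitive implementation: the class name VanRetailFilter, the method IsCandidate and the group name "date" are made up for illustration, and it assumes the ... really is part of the file name.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative only.
static class VanRetailFilter
{
    // Same pattern as above, with a named group around the first 8-digit run.
    static readonly Regex Pattern = new Regex(
        @"VAN RETAIL\.[^0-9]*(?<date>[0-9]{8})",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // True when the name matches the pattern AND the captured digits are a real calendar date.
    public static bool IsCandidate ( string fileName )
    {
        var match = Pattern.Match(fileName);
        if (!match.Success)
            return false;

        // Custom format specifiers are case-sensitive: yyyy = year, MM = month, dd = day.
        return DateTime.TryParseExact(match.Groups["date"].Value, "yyyyMMdd",
            CultureInfo.InvariantCulture, DateTimeStyles.None, out _);
    }
}
```

    With the file names from the question, this should reject row 2 (no dot after VAN RETAIL), row 5 (20183456 is not a date) and row 6 (no digits), and accept the other five.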


    Michael Taylor http://www.michaeltaylorp3.net

    • Marked as answer by Inayat72 Thursday, June 13, 2019 7:36 PM
    Thursday, June 13, 2019 5:55 PM
    Moderator

All replies

  • You're not going to be able to accomplish all that with RE. RE is just about matching patterns. There is no RE (at least no reasonable one) that can tell you whether a numeric value like YYYYMMDD is actually a valid date. 20190229 is not valid but 20200229 is. If you really want to use RE for this then you can start with using it to filter down the list of files but you're then going to have to provide additional heuristics for the rest of it. Personally I would just use Directory.GetFiles to get the files that contain the starting values you expect and then use some code to filter the rest. It isn't clear to me how you can tell where the date starts though. Your file names have a large number of dots in them, so unless you can narrow down the format you're really not going to be able to filter this list very much. At a minimum you might agree that the 4th part of the file name is the date and the third part has to have this VAN RETAIL thing you mentioned.

    Once you've filtered down the files to the list of potential candidates then use Split to break them up into parts and then look for the portion that is a date, then try to convert it.

    //List of files that meet the basic filename requirements (e.g. has VAN RETAIL in name).
    var files = from f in Directory.GetFiles(...)
                where MatchesName(f)
                select f;
    
    bool MatchesName ( string filename )
    {
       //Break into parts
       var parts = filename.Split('.');
    
       //The 4th part must be the date
       if (parts.Length < 4)
          return false;
    
       var potentialDate = parts[3];
    
       //Try and convert it (format specifiers are case-sensitive: yyyy = year, MM = month, dd = day)
       return DateTime.TryParseExact(potentialDate, "yyyyMMdd", CultureInfo.InvariantCulture, DateTimeStyles.None, out var dt);
    }
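
    If the date is not always in the same position, a variation is to scan the parts for the first one that is exactly 8 digits and try to convert that. This is only a sketch under that assumption: FileNameDates, FirstEightDigitPart and HasValidFirstDate are hypothetical names, and the VAN RETAIL check is deliberately left out.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper; the names here are illustrative only.
static class FileNameDates
{
    // Returns the first dot-separated part that is exactly 8 digits, or null if there is none.
    public static string FirstEightDigitPart ( string fileName )
    {
        return fileName.Split('.')
                       .FirstOrDefault(p => p.Length == 8 && p.All(c => c >= '0' && c <= '9'));
    }

    // True only when that first 8-digit part converts to a real calendar date.
    public static bool HasValidFirstDate ( string fileName )
    {
        var candidate = FirstEightDigitPart(fileName);

        return candidate != null
            && DateTime.TryParseExact(candidate, "yyyyMMdd", CultureInfo.InvariantCulture,
                                      DateTimeStyles.None, out _);
    }
}
```

    Empty parts produced by consecutive dots are simply skipped, so it does not matter whether the ... is literal or not.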
    
    
    
    


    Michael Taylor http://www.michaeltaylorp3.net

    Thursday, June 13, 2019 5:35 PM
    Moderator
  • Is it not possible to get all the files where the name is like 'VAN RETAIL' and that contain a first 8-digit number (99999999)? I will then verify the date format myself, because some files don't have any numbers and the date is not always in the 3rd position.

    I want one Regex that combines 2 conditions: the name (like 'VAN RETAIL') and a first 8-digit number (99999999). If both pass then good, otherwise skip the file.

    Thanks


    • Edited by Inayat72 Thursday, June 13, 2019 5:46 PM
    Thursday, June 13, 2019 5:44 PM
  • So what is the best alternative if we don't use RE? The reason for using RE is that there are so many patterns: sometimes the file name starts with a particular word, sometimes it contains one, and there are other options too, so I am trying to develop a process that covers all these cases. It's hard to cover all of them without RE.

    Anyhow, thanks for all your help.

    Thursday, June 13, 2019 7:36 PM
  • RE is great for consistent patterns (all files starting with this or end with that or have this in the middle). As you start adding more conditions then RE becomes more hassle than it is worth in my opinion. RE is fast in most cases but rarely is this critical for an app.

    For simple stuff like "starts with" or "ends with" then use the corresponding String methods. Contains could also be used for finding things in the middle. If you need to break a string apart (such as getting that date) then RE does become more beneficial because you can use grouping. Often a combination of techniques is needed. For example you might find that starting with RE (like discussed in the earlier post) to eliminate the bulk of stuff and then adding some code to filter out the rest is sufficient. Other times you might go the other way. For example if you are dealing with a directory that has 1000s of files then running RE on each one is inefficient. Using a wildcard search on the file system to first filter out the obviously wrong filenames would improve performance. You could then use RE (with the basic filter removed now) to narrow things down even further.
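
    A rough sketch of that two-stage idea, assuming a hypothetical helper called TwoStageSearch.Find (the wildcard and the pattern you pass in are up to you):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative only.
static class TwoStageSearch
{
    // Stage 1: let the file system drop obviously wrong names using a wildcard.
    // Stage 2: run the regular expression only on the names that survived.
    public static List<string> Find ( string directory, string wildcard, Regex pattern )
    {
        return Directory.EnumerateFiles(directory, wildcard)
                        .Select(path => Path.GetFileName(path))
                        .Where(name => pattern.IsMatch(name))
                        .OrderBy(name => name, StringComparer.Ordinal)
                        .ToList();
    }
}
```

    For the files in this thread a call might look like TwoStageSearch.Find(folder, "*VAN RETAIL*", new Regex("[0-9]{8}")). Keep in mind that whether the wildcard match is case-sensitive depends on the platform.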

    Not sure what you mean by a process to cover all types but I'm going to assume that your filenames are all over the place and you're trying to determine which ones to handle. How you'd do that would depend on how the files are treated. If all files are treated the same but have wildly different names then the first question would be do you even gain anything by looking at the filename vs opening the file and parsing out the basic info instead? If the files are completely different (e.g. a.txt vs b.txt) then filename is useless. 

    However if you can discern the file type from the filename (e.g. A-*.txt are A files while B-*.txt are B files) then a combination of RE/String parsing would be sufficient. If you have multiple conditions then setting up filtering rules would be good. If your rules are pretty flexible then you might want to introduce a type/function to encapsulate those rules. Then put the filter rules into a list that you can enumerate through. Here's a simple example.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class Program
    {
        static void Main ( string[] args )
        {
            var filters = LoadFilters();
            var files = GetFiles();
    
            //Select all files where all the filter rules are met
            foreach (var filter in filters)
                files = files.Where(filter);
    
            foreach (var file in files)
                Console.WriteLine(file);
        }
    
        //Using a simple func type here - use an interface if the setup is more complex
        static IEnumerable<Func<string, bool>> LoadFilters ()
        {
            //Files must start with "A"
            yield return ( f ) => f.StartsWith("A", StringComparison.OrdinalIgnoreCase);
    
            //Files must end with ".txt"
            yield return ( f ) => String.Compare(Path.GetExtension(f), ".txt", true) == 0;
    
            //More complex rules can use a separate type
            //Files must have a second part that is a number
            yield return new SecondPartIsNumber().Matches;
        }
    
        //Sample file names standing in for a real directory listing
        static IEnumerable<string> GetFiles ()
        {
            yield return "A.file.txt";
            yield return "B.file.txt";
            yield return "A.1.file.txt";
            yield return "A.1.B.txt";
        }
    }
    
    //Demo moving filtering rules into standalone type
    class SecondPartIsNumber
    {
        public bool Matches ( string value )
        {
            var parts = value.Split('.');
    
            if (parts.Length < 2)
                return false;
    
            if (!Int32.TryParse(parts[1], out var result))
                return false;
    
            return true;
        }
    }


    Michael Taylor http://www.michaeltaylorp3.net

    Thursday, June 13, 2019 8:15 PM
    Moderator