Help Scrubbing Filenames in a Directory

  • Question

  • Hello all!

    I am trying to iterate through a directory and scrub file names of illegal characters, but I also want to make a few specific substitutions.

    I want to replace dashes '-' with underscores '_', and double quotes ' " ' with 'in'. All other illegal characters should be removed.

    I've started by trying to kill all illegal characters, but I can't get that to work.

    Public Class Cleaner
    
        Public fPath As String = ""
        Private ReadOnly fName As String = ""
    
        Public Sub New(ByVal fPath As String)
    
            'MsgBox(fPath)
            fPath = fPath & "\"
            'MsgBox(fPath)
    
            fName = Dir(fPath, vbDirectory)
            Do While fName <> ""
                If (GetAttr(fPath & fName) And vbDirectory) = vbDirectory Then
    
                    fName = System.Text.RegularExpressions.Regex.Replace(Input, "[\\/:*?""<>|\r\n]", "", System.Text.RegularExpressions.RegexOptions.Singleline)
                    MsgBox(fName)
    
                End If
                fName = Dir()
            Loop
        End Sub
    
    End Class

    Any help would be greatly appreciated.

    Thanks!

    Tuesday, June 25, 2019 4:14 PM

Answers

  • The file names come out like this:

    00318-COVER GIRL 108” x 78”(06-10-2019).psa

    The first dash I need to replace with an underscore because the ftp program I am using is janky as all hell, and the double quotes are causing a problem with the server that is receiving the file.  This same program outputs PDFs that sometimes have random \ in the file names.

    Hi

    Don't know if this will help. This version takes a test string

    00<>31//\\z8-COV||ER GIRL 1<>08” x 78”(06-10-20>|>"19).psa

    where the inch marks are actually curly quotes (Chr(148)) which I think your original string contained.

    The resultant cleaned string is

    0031z8_COVER GIRL 108in x 78in(06_10_2019).psa

    Option Strict On
    Option Explicit On
    Public Class Form1
    	Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    		' test string
    		Dim teststring As String = "00<>31//\\z8-COV||ER GIRL 1<>08" & Chr(148) & " x 78" & Chr(148) & "(06-10-20>|>""19).psa"
    
    		Dim CleanedString As String = CleanString(teststring)
    
    		' 0031z8_COVER GIRL 108in x 78in(06_10_2019).psa
    
    	End Sub
    	Function CleanString(s As String) As String
    		' Replace each curly inch mark (Chr(148)) with "in"
    		While s.Contains(Chr(148))
    			s = RemChr148(s, "in")
    		End While
    		' Remove every character that is invalid in a file name
    		For Each c As Char In IO.Path.GetInvalidFileNameChars
    			While s.Contains(c)
    				s = RemIllegal(s, c)
    			End While
    		Next
    		' Finally, swap dashes for underscores
    		Return s.Replace("-", "_")
    	End Function
    	Function RemIllegal(s As String, rep As String) As String
    		Dim ind As Integer = s.IndexOf(rep)
    		Dim s2 As String = s.Substring(0, ind)
    		Dim s3 As String = s.Substring(ind + 1, s.Length - ind - 1)
    		Return s2 & s3
    	End Function
    	Function RemChr148(s As String, Optional rep As String = Nothing) As String
    		Dim ind As Integer = s.IndexOf(Chr(148))
    		Dim s2 As String = s.Substring(0, ind)
    		If Not rep = Nothing Then
    			s2 &= rep
    		End If
    		Dim s3 As String = s.Substring(ind + 1, s.Length - ind - 1)
    		Return s2 & s3
    	End Function
    End Class
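
    For comparison, the same three rules (curly inch mark to "in", dash to underscore, any other invalid character dropped) could be applied in a single pass over the string. This is only a sketch, not the code from the reply above: the names FileNameCleaner and CleanFileName are made up for illustration, and it uses ChrW(&H201D) on the assumption that Chr(148) maps to the curly closing quote on a Windows-1252 system.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Text

    Module FileNameCleaner
        ' Hypothetical helper, not from the thread.
        ' The invalid set comes from Path.GetInvalidFileNameChars(),
        ' whose contents depend on the platform.
        Function CleanFileName(name As String) As String
            Dim invalid As Char() = Path.GetInvalidFileNameChars()
            Dim sb As New StringBuilder(name.Length)
            For Each c As Char In name
                If c = ChrW(&H201D) Then
                    sb.Append("in")      ' curly inch mark -> "in"
                ElseIf c = "-"c Then
                    sb.Append("_"c)      ' dash -> underscore
                ElseIf Array.IndexOf(invalid, c) < 0 Then
                    sb.Append(c)         ' keep legal characters
                End If                   ' anything else is dropped
            Next
            Return sb.ToString()
        End Function

        Sub Main()
            Dim test As String = "00<>31//\\z8-COV||ER GIRL 1<>08" & ChrW(&H201D) &
                " x 78" & ChrW(&H201D) & "(06-10-20>|>""19).psa"
            Console.WriteLine(CleanFileName(test))
        End Sub
    End Module
    ```

    On Windows, where <, >, /, \, | and " are all invalid in file names, this should print 0031z8_COVER GIRL 108in x 78in(06_10_2019).psa. Other platforms report a smaller set of invalid characters, so the result can differ there.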


    Regards Les, Livingston, Scotland

    • Marked as answer by stopiamwarren Friday, November 22, 2019 7:25 PM
    Wednesday, June 26, 2019 6:52 PM

All replies

  • Once you have a file name, try the following:

    ' Requires Imports System.IO and Imports System.Text.RegularExpressions
    ' this represents a bad file name
    Dim fName As String = """M""\a/ry/ h**ad:>> a\/:*?""| li*tt|le|| la""mb.?accdb"
    
    ' Build one character class from every invalid file name and path character
    Dim regexSearch = New String(Path.GetInvalidFileNameChars()) & New String(Path.GetInvalidPathChars())
    Dim r As New Regex($"[{Regex.Escape(regexSearch)}]")
    fName = r.Replace(fName, "")
    Console.WriteLine(fName)

    This would go inside your Do While loop.
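
    A minimal sketch of how that replacement might sit inside a directory loop, using Directory.GetFiles and File.Move instead of Dir(). The names DirectoryScrubber, CleanName and ScrubDirectory are hypothetical, and the dash-to-underscore step is added here only because the question asks for it. Files are skipped rather than overwritten when the cleaned name is already taken.

    ```vbnet
    Imports System
    Imports System.IO
    Imports System.Text.RegularExpressions

    Module DirectoryScrubber
        ' Hypothetical helper: removes invalid file-name characters with the
        ' regex approach, then swaps dashes for underscores.
        Function CleanName(name As String) As String
            Dim invalid As String = New String(Path.GetInvalidFileNameChars())
            Dim r As New Regex($"[{Regex.Escape(invalid)}]")
            Return r.Replace(name, "").Replace("-", "_")
        End Function

        ' Hypothetical helper: renames each file in folder whose cleaned
        ' name differs from its current name. Returns the number renamed.
        Function ScrubDirectory(folder As String) As Integer
            Dim renamed As Integer = 0
            For Each fullPath As String In Directory.GetFiles(folder)
                Dim oldName As String = Path.GetFileName(fullPath)
                Dim newName As String = CleanName(oldName)
                If newName = oldName OrElse newName = "" Then Continue For
                Dim target As String = Path.Combine(folder, newName)
                ' Skip rather than overwrite if the cleaned name already exists
                If File.Exists(target) Then Continue For
                File.Move(fullPath, target)
                renamed += 1
            Next
            Return renamed
        End Function

        Sub Main(args As String())
            ' Usage: pass the folder to scrub as the only argument
            If args.Length = 1 Then
                Console.WriteLine($"{ScrubDirectory(args(0))} file(s) renamed")
            End If
        End Sub
    End Module
    ```

    Directory.GetFiles returns only files, so no attribute check is needed to tell files from subdirectories.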


    Please remember to mark the replies as answers if they help and unmarked them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile but will not answer coding question on either.

    NuGet BaseConnectionLibrary for database connections.

    StackOverFlow
    profile for Karen Payne on Stack Exchange

    Tuesday, June 25, 2019 4:27 PM
    Moderator
  • If the file names were read from disk, then they are not illegal.

    Tuesday, June 25, 2019 4:51 PM
  • This is what I have now, but it doesn't even seem to go inside my If block.

    Imports System.Text.RegularExpressions
    Imports System.IO
    Public Class Cleaner
    
        Public fPath As String = ""
        Private ReadOnly fName As String = ""
    
        Public Sub New(ByVal fPath As String)
    
            'MsgBox(fPath)
            fPath = fPath & "\"
            'MsgBox(fPath)
    
            fName = Dir(fPath, vbDirectory)
    
    
            Do While fName <> ""
                If (GetAttr(fPath & fName) And vbDirectory) = vbDirectory Then
    
                    MsgBox(fName)
    
                    Dim fileName As String = """M""\a/ry/ h**ad:>> a\/:*?""| li*tt|le|| la""mb.?accdb"
    
                    Dim regexSearch = Path.GetInvalidFileNameChars() & New String(Path.GetInvalidPathChars())
                    Dim r As New Regex($"[{Regex.Escape(regexSearch)}]")
                    fileName = r.Replace(fName, "")
    
                    MsgBox(fName)
                    MsgBox(fileName)
    
                End If
                fName = Dir()
            Loop
        End Sub
    
    End Class


    Tuesday, June 25, 2019 4:54 PM
  • Did you check the chars in your file name against GetInvalidPathChars?

    """<>|" & vbNullChar & ChrW(1) & ChrW(2) & ChrW(3) & ChrW(4) & ChrW(5) & ChrW(6) & ChrW(7) & vbBack & vbTab & vbLf & vbVerticalTab & vbFormFeed & vbCr & ChrW(14) & ChrW(15) & ChrW(16) & ChrW(17) & ChrW(18) & ChrW(19) & ChrW(20) & ChrW(21) & ChrW(22) & ChrW(23) & ChrW(24) & ChrW(25) & ChrW(26) & ChrW(27) & ChrW(28) & ChrW(29) & ChrW(30) & ChrW(31)


    Tuesday, June 25, 2019 5:01 PM
    Moderator
  • Hello all!

    I am trying to iterate through a directory and scrub file names of illegal characters, but I also want to make a few specific substitutions.

    Hi

    If any of the filenames in a directory contain illegal characters, then they would never have been created successfully in the first place?


    Regards Les, Livingston, Scotland

    Tuesday, June 25, 2019 5:14 PM
  • The files are coming out of a software program that is allowed to write garbage in the filenames.  It's causing me issues downstream.
    Wednesday, June 26, 2019 2:53 PM
  • The path itself is being fed in from a dialog box that the user has to manually pick, I send the path into the class by value, then append the extra \ to complete the path.
    Wednesday, June 26, 2019 2:54 PM
  • The path itself is being fed in from a dialog box that the user has to manually pick, I send the path into the class by value, then append the extra \ to complete the path.
    Can you please provide several examples of file names that you believe need sanitizing?

    Wednesday, June 26, 2019 2:56 PM
    Moderator
  • The file names come out like this:

    00318-COVER GIRL 108” x 78”(06-10-2019).psa

    The first dash I need to replace with an underscore because the ftp program I am using is janky as all hell, and the double quotes are causing a problem with the server that is receiving the file.  This same program outputs PDFs that sometimes have random \ in the file names.
    Wednesday, June 26, 2019 3:20 PM
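  • Try the following.

    Requires a reference to System.Web in the project.

    ' Requires Imports System.Web and Imports System.Text.RegularExpressions
    ' ChrW(&H201D) is the curly closing quote; VB treats a literal ” as a string delimiter
    Dim fileName = "00318-COVER GIRL 108" & ChrW(&H201D) & " x 78" & ChrW(&H201D) & "(06-10-2019).psa"
    Dim result = HttpUtility.UrlEncode(fileName)
    ' Replace only the first dash with an underscore
    Dim dashRegex = New Regex(Regex.Escape("-"))
    Dim newFileName = dashRegex.Replace(result, "_", 1)

    Result

    00318_COVER+GIRL+108%e2%80%9d+x+78%e2%80%9d(06-10-2019).psa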


    Wednesday, June 26, 2019 4:03 PM
    Moderator