Parse CSV, ignoring commas inside string literals in VBA?

@Gimp said...

The current answers do not contain enough detail.

I'm running into the same problem. Looking for more detail in this answer.

To elaborate on @MRAB's answer:

Function ParseCSV(FileName)
    Dim Regex       'As VBScript_RegExp_55.RegExp
    Dim MatchColl   'As VBScript_RegExp_55.MatchCollection
    Dim Match       'As VBScript_RegExp_55.Match
    Dim FS          'As Scripting.FileSystemObject
    Dim Txt         'As Scripting.TextStream
    Dim CSVLine
    ReDim ToInsert(0)

    Set FS = CreateObject("Scripting.FileSystemObject")
    Set Txt = FS.OpenTextFile(FileName, 1, False, -2)
    Set Regex = CreateObject("VBScript.RegExp")

    Regex.Pattern = """[^""]*""|[^,]*"    '<- MRAB's answer
    Regex.Global = True

    Do While Not Txt.AtEndOfStream
        ReDim ToInsert(0)
        CSVLine = Txt.ReadLine
        For Each Match In Regex.Execute(CSVLine)
            If Match.Length > 0 Then
                ReDim Preserve ToInsert(UBound(ToInsert) + 1)
                ToInsert(UBound(ToInsert) - 1) = Match.Value
            End If
        Next
        InsertArrayIntoDatabase ToInsert
    Loop
    Txt.Close
End Function

You need to customize the InsertArrayIntoDatabase Sub for your own table. Mine has several text fields named f00, f01, etc...

Sub InsertArrayIntoDatabase(a())
    Dim rs As DAO.Recordset
    Dim i, n
    Set rs = CurrentDb().TableDefs("tbl").OpenRecordset()
    rs.AddNew
    For i = LBound(a) To UBound(a)
        n = "f" & Format(i, "00") 'fields in table are f00, f01, f02, etc..
        rs.Fields(n) = a(i)
    Next
    rs.Update
End Sub

Note that instead of using CurrentDb() in InsertArrayIntoDatabase(), you should really use a global variable that gets set to the value of CurrentDb() before ParseCSV() runs, because running CurrentDb() in a loop is very slow, especially on a very large file.


The first way to solve this problem is to look at the structure of the line from the csv file (int,int,"String literal, will have at most one comma", etc). A naive solution would be (Assuming that the line don't have any semicolons)

Function splitLine1(line As String) As String()

   Dim temp() As String
   'Splits the line in three. The string delimited by " will be at temp(1)
   temp = Split(line, Chr(34)) 'chr(34) = "

   'Replaces the commas in the numeric fields by semicolons
   temp(0) = Replace(temp(0), ",", ";")
   temp(2) = Replace(temp(2), ",", ";")

   'Joins the temp array with quotes and then splits the result using the semicolons
   splitLine1 = Split(Join(temp, Chr(34)), ";")

End Function

This function only solves this particular problem. Another way to do the job is using the regular expression object from VBScript.

Function splitLine2(line As String) As String()

    Dim regex As Object
    Set regex = CreateObject("vbscript.regexp")
    regex.IgnoreCase = True
    regex.Global = True

    'This pattern matches only commas outside quotes
    'Pattern = ",(?=([^"]*"[^"]*")*(?![^"]*"))"
    regex.Pattern = ",(?=([^" & Chr(34) & "]*" & Chr(34) & "[^" & Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))"

    'regex.replaces will replace the commas outside quotes with semicolons and then the
    'Split function will split the result based on the semicollons
    splitLine2 = Split(regex.Replace(line, ";"), ";")

End Function

This method seems much more cryptic, but does not deppends on the structure of the line

You can read more about regular expressions patterns in VBScript Here


A simple regex for parsing a CSV line, assuming no quotes inside quoted fields, is:

"[^"]*"|[^,]*

Each match will return a field.