KnowDotNet Visual Organizer

Regex for Comma Delimited Text

True Comma Delimited Text May or May Not Have Quotes

by Les Smith and Brian Davis
Print this Article Discuss in Forums

Need to parse a truly comma delimited text line?  It is not easy without the use of regular expressions.  This article demonstrates the use of Regular Expression to parse a truly Comma Delimited line that has optional double quotes with the possibility of nested quotes such as Excel can handle.

I was faced with parsing a truly comma delimited text file in a project recently and started writing a regular expression to handle it.  After several iterations, that were not giving me what I needed, I asked my partner Brian Davis to give me a Regex that would handle the problem.

Brian is the guru on Regular Expressions, and he sent me one that did not look anything like what I had been using.  He said he would explain it to me if I needed him to.  At the time, I was in the heat of the battle with the project and the Regex worked, so I am using the Regex shown below without being able to explain it.  It works great and if it will help you without having to understand it, then great!  If you need an explanation, drop Brian an email at BDavis@KnowDotNet.com.  

The code is illustrated in an abbreviated method, shown below.

The file that I am working with, optionally uses double quotes to enclose fields that could potentially have embedded commas, but does not use them for fields that can't have commas.  Right away, there is no easy way to parse this without a Regex.

   ''' <summary>
   ''' Use an enum to access the fields of the comma delimited
   ''' line so that if the fields change, you only change the enum
   ''' and not the rest of the code.
   ''' </summary>
   ''' <remarks></remarks>
   Public Enum FD
      SSN
      FirstName
      LastName
  
End Enum


I use the Enum shown above to access the fields of the line so that if the format of the line changes, I do not have to go looking through the code for places to change.  I simply change the Enum and recompile.  The line that I have to parse actually has about 40 fields in it, but I have only shown three for brevity in the article.

This is an updated expression since my first article on this subject.  I found that the customer was sending non-standard CSV format.  For example, the field "   "L" " is non standard, but the customer was sending it in and my application was failing in the Regex.  The Regex shown below correctly handles that invalid CSV format.  The valid format is " ""L"" ".  Embedded quotes should be doubled, but the customer may not always be right, but the customer is always the customer and sales-driven organizations must be able to handle what the customer sends in many times, like it or not.

   ''' <summary>
   ''' This method accepts a comma delimited line and parses it
   ''' </summary>
   ''' <param name="line" ></param>
   ''' <remarks></remarks>
   Public Sub ProcessLine(ByVal line As String)
      
Dim patt As String = _
          "(?m)
^((?("")""(?<quote>)|(?<=(^|,)))(?<field>(?(quote)(""""|[^""]|""(?!,))|[^,])*?)(?(quote)""(?<-quote>))(?($)|,))+"


      
Dim m As Match = Regex.Match(line, patt)

      
With m.Groups("field")
        
Dim ssn As String = .Captures(FD.SSN).Value
        
Dim FirstName As String = .Captures(FD.FirstName).Value
        
Dim lastName As String = .Captures(FD.LastName).Value
      
End With
   End Sub



If you have any comments or suggestions on this article or any other programming subject you would like to discuss, comment on my blog at Click Here.

Writing Add-Ins for Visual Studio .NET
Writing Add-ins for Visual Studio .NET
by Les Smith
Apress Publishing