KnowDotNet

Regex Basics - Named Groups, Backreferences, and Regex.Replace

by Brian Davis

A U.S. Social Security Number is a 9-digit number, but it can be formatted in several different ways.  Sometimes it appears as a number with no separation, like 123456789.  Other times, however, a space or a dash may be used to make it 123 45 6789 or 123-45-6789.  We might write an expression to cover these different separators that looks like this:

\d{3}[ -]?\d{2}[ -]?\d{4}

This would allow us to match the preceding examples, but it would also match things like 12345-6789 or 123-45 6789, which do not really look like well-formatted SSNs.  To ensure that the two separators are the same (both spaces, both dashes, or both absent), we can use a named group and a backreference:

\d{3}(?<separator>[ -]?)\d{2}\k<separator>\d{4}
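To make the difference between the two expressions concrete, here is a minimal VB.NET sketch that runs both against the sample strings used in this article. The module and function names (SeparatorDemo, MatchesLoose, MatchesStrict) are invented for illustration. The patterns are copied unchanged, so they are unanchored and would also match an SSN-like run inside a longer string.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SeparatorDemo
    ' First pattern from the article: each separator is optional and independent.
    Public Const LoosePattern As String = "\d{3}[ -]?\d{2}[ -]?\d{4}"

    ' Second pattern: the separator is captured once, then repeated by \k<separator>.
    Public Const StrictPattern As String = "\d{3}(?<separator>[ -]?)\d{2}\k<separator>\d{4}"

    ' Hypothetical helper: True when the loose pattern matches somewhere in the input.
    Public Function MatchesLoose(input As String) As Boolean
        Return Regex.IsMatch(input, LoosePattern)
    End Function

    ' Hypothetical helper: True when the backreference pattern matches somewhere in the input.
    Public Function MatchesStrict(input As String) As Boolean
        Return Regex.IsMatch(input, StrictPattern)
    End Function

    Sub Main()
        Dim samples As String() = {"123456789", "123 45 6789", "123-45-6789", "12345-6789", "123-45 6789"}
        For Each sample As String In samples
            Console.WriteLine("{0,-12} loose={1,-5} strict={2}", sample, MatchesLoose(sample), MatchesStrict(sample))
        Next
    End Sub
End Module
```

All five samples satisfy the first pattern, while only 123456789, 123 45 6789, and 123-45-6789 satisfy the second. The optional quantifier sits inside the group, so an empty separator is captured as an empty string and the backreference then matches an empty string as well.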

The (?<separator>[ -]?) captures the first separator (a space, a dash, or an empty string) in a group named separator, and \k<separator> then requires exactly the same text to appear again.  Named groups give us a readable way to refer back to portions of the text that have already been matched.  Rather than just match well-formatted SSNs, we could use other named groups to convert those SSNs into the format we desire.  If we are storing these SSNs in a database, for example, we may want to put them into a numeric field without separators, regardless of what separators were in the original.  This is accomplished by grouping the numeric portions of the SSN as well as the separators:

(?<first>\d{3})(?<separator>[ -]?)(?<second>\d{2})\k<separator>(?<third>\d{4})
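Once the numeric portions are in named groups, they can also be read individually from a Match object. The following VB.NET sketch shows one way to do that. SsnPartsDemo and SplitSsn are hypothetical names, and the pattern is the one above written as a single string.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SsnPartsDemo
    ' Pattern from the article, written as a single string.
    Public Const SsnPattern As String = "(?<first>\d{3})(?<separator>[ -]?)(?<second>\d{2})\k<separator>(?<third>\d{4})"

    ' Hypothetical helper: returns the three numeric parts of the first match,
    ' or Nothing when the input contains no match.
    Public Function SplitSsn(input As String) As String()
        Dim m As Match = Regex.Match(input, SsnPattern)
        If Not m.Success Then Return Nothing
        ' Each named group is looked up by its name on the Groups collection.
        Return New String() {m.Groups("first").Value, m.Groups("second").Value, m.Groups("third").Value}
    End Function

    Sub Main()
        Dim parts As String() = SplitSsn("123 45 6789")
        Console.WriteLine(String.Join("|", parts))   ' prints 123|45|6789
    End Sub
End Module
```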

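Using the Replace method, we can get rid of any separators that may be in the SSN by referring to the numeric groups as ${first}, ${second}, and ${third} in the replacement string and leaving the separator group out: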

Regex.Replace("123-45-6789","(?<first>\d{3})(?<separator>[ -]?)(?<second>\d{2})\k<separator>(?<third>\d{4})","${first}${second}${third}")
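The same call can be wrapped in small helper functions, and changing only the replacement string produces a different output format. The VB.NET sketch below assumes two hypothetical helpers, ToDigitsOnly and ToDashed, built around the article's pattern. Regex.Replace returns the input unchanged when nothing matches.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SsnReplaceDemo
    ' Pattern from the article.
    Public Const SsnPattern As String = "(?<first>\d{3})(?<separator>[ -]?)(?<second>\d{2})\k<separator>(?<third>\d{4})"

    ' Hypothetical helper: drops the separators, keeping only the digit groups.
    Public Function ToDigitsOnly(input As String) As String
        Return Regex.Replace(input, SsnPattern, "${first}${second}${third}")
    End Function

    ' Hypothetical helper: rewrites any accepted format with dashes.
    Public Function ToDashed(input As String) As String
        Return Regex.Replace(input, SsnPattern, "${first}-${second}-${third}")
    End Function

    Sub Main()
        Console.WriteLine(ToDigitsOnly("123-45-6789"))   ' prints 123456789
        Console.WriteLine(ToDashed("123 45 6789"))       ' prints 123-45-6789
        Console.WriteLine(ToDigitsOnly("123-45 6789"))   ' no match, prints 123-45 6789
    End Sub
End Module
```

Because the pattern is unanchored, Replace rewrites every SSN-like match it finds in the input and leaves the surrounding text as it was.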

Named groups, backreferences, and replacement expressions are also useful for formatting dates, reading comma-delimited files from Microsoft Excel or databases, and rearranging items in a list.
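As one illustration of the date case, the VB.NET sketch below applies the same three techniques to rewrite month/day/year dates in year-month-day order. The pattern, the group names, and the ToYearFirst helper are assumptions made up for this example, not something from the SSN expressions above. It does not validate the date or pad single digits with zeros.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DateReformatDemo
    ' Hypothetical pattern: month, separator, day, the same separator again, 4-digit year.
    Public Const UsDatePattern As String = "\b(?<month>\d{1,2})(?<sep>[/-])(?<day>\d{1,2})\k<sep>(?<year>\d{4})\b"

    ' Hypothetical helper: reorders each matched date as year-month-day.
    Public Function ToYearFirst(input As String) As String
        Return Regex.Replace(input, UsDatePattern, "${year}-${month}-${day}")
    End Function

    Sub Main()
        ' prints Due 2023-7-4 or 2023-12-25
        Console.WriteLine(ToYearFirst("Due 7/4/2023 or 12-25-2023"))
    End Sub
End Module
```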