Regex Basics - Simple Pattern Matching with IsMatch

Writing code to search for a substring within another string is quite simple. The String.IndexOf() method returns the zero-based starting position of the substring within the string, or -1 if it does not appear. Often, this is enough to get us where we want to go. One problem with this approach is that it may require several different calls to cover alternate cases. For instance, suppose we need to find the word "File" in an input string, but only when it is not followed by " System" or " Folder". Additionally, we don't care about capitalization: "File", "file", and "FiLe" are all valid. Doing this with normal string operations could take a significant amount of code:
Public Function IsWhatWeAreLookingFor(ByVal sInput As String) As Boolean
    ' Upper-case once so that every comparison below ignores capitalization
    Dim sUpper As String = sInput.ToUpper()
    Dim iPosition As Integer = sUpper.IndexOf("FILE")
    If iPosition > -1 Then
        ' If this "FILE" begins "FILE SYSTEM" or "FILE FOLDER", skip past it
        ' and search the remainder of the string
        If sUpper.IndexOf("FILE SYSTEM") = iPosition OrElse sUpper.IndexOf("FILE FOLDER") = iPosition Then
            Return IsWhatWeAreLookingFor(sInput.Substring(iPosition + 1))
        End If
        Return True
    Else
        ' "FILE" does not appear at all
        Return False
    End If
End Function
...
If IsWhatWeAreLookingFor(sInput) Then
    ...
End If
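The function above rests entirely on the return value of IndexOf(). As a minimal sketch of that behavior (the module name and the sample string are made up for illustration), the following shows the position result, the -1 result, and why the input is upper-cased before searching:

```vb
Imports System

Module IndexOfDemo
    Sub Main()
        ' Hypothetical sample input, not taken from the article
        Dim sInput As String = "Open the File menu"

        ' Zero-based position of the first occurrence
        Console.WriteLine(sInput.IndexOf("File"))             ' prints 9

        ' The search is case-sensitive, so a different case is reported as absent
        Console.WriteLine(sInput.IndexOf("file"))             ' prints -1

        ' Upper-casing both sides makes the search ignore capitalization
        Console.WriteLine(sInput.ToUpper().IndexOf("FILE"))   ' prints 9
    End Sub
End Module
```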
A regular expression is capable of defining what we want to look for in a general way. It allows for many different variations of text to be matched in a single expression:
If Regex.IsMatch(sInput, "\bFILE\b(?!\s(SYSTEM|FOLDER))", RegexOptions.IgnoreCase) Then
    ...
End If
This one line of code replaces what would have been several lines of code using IndexOf(). The expression used looks like this:
\bFILE\b(?!\s(SYSTEM|FOLDER))
Breaking it down, the "\b" ensures a word boundary, so the expression finds the word "FILE" only when it appears as an entire word. This means that it will not match "files" or "metafile", but will match "File this away" or "I deleted the file yesterday" or "What file?". Next comes the grouping construct "(?!...)". This is called a zero-width negative look-ahead assertion (everything in regular expressions has a cool name). It means that the expression looks ahead at the text following "FILE" and tests it against the subexpression inside the parentheses, but it does not consume any of the text it examines, so that text never becomes part of the match. Inside the look-ahead, "\s" matches a single white-space character (space, tab, vertical tab, line feed, carriage return, or form feed). Next comes a group containing the alternation construct "|", which matches either "SYSTEM" or "FOLDER". If the text after "FILE" is a white-space character followed by either of these, the negative assertion fails and so does the match at that position. If, however, something other than " SYSTEM" or " FOLDER" appears after "FILE" (including nothing at all), the negative assertion succeeds and the expression matches.
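The breakdown above can be checked with a small console program. This is only a sketch: the module name LookaheadDemo, the helper MatchesFile, and the constant sPattern are hypothetical names introduced here, and the sample strings beyond those quoted in the text are made up. The last two lines illustrate the zero-width part, since the reported match covers only the word itself.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LookaheadDemo
    ' The expression discussed in the text (VB string literals need no backslash escaping)
    Const sPattern As String = "\bFILE\b(?!\s(SYSTEM|FOLDER))"

    ' Hypothetical helper wrapping the IsMatch call from the article
    Function MatchesFile(ByVal sInput As String) As Boolean
        Return Regex.IsMatch(sInput, sPattern, RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim samples() As String = {"File this away", "What file?", "files",
                                   "metafile", "the file system", "a File Folder icon"}
        ' Expected: True, True, False, False, False, False
        For Each s As String In samples
            Console.WriteLine("{0,-22} {1}", s, MatchesFile(s))
        Next

        ' The look-ahead inspects what follows but consumes nothing,
        ' so the match is just the four characters of the word
        Dim m As Match = Regex.Match("Delete the file now", sPattern, RegexOptions.IgnoreCase)
        Console.WriteLine("{0} at {1}", m.Value, m.Index)   ' prints "file at 11"
    End Sub
End Module
```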
The string operation solution involves several lines of code in a recursively called function, but the Regex solution needs just one line. Our example is a simple one; a more complex problem could turn hundreds of lines of parsing code into one expression.