Friday 31 July 2015

R: Regular Expressions

Regular expressions are a powerful tool for text processing. A regular expression is a sequence of characters that forms a search pattern, mainly used for pattern matching in strings. Regular expressions are used to search, edit, or manipulate text and data. Once you are comfortable with them, you will find that many text-processing tasks become much shorter to write.
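As a rough illustration of searching and editing, the sketch below uses base R's grepl, sub and gsub. The vector x is made-up sample data, not something from a real dataset.

```r
# Hypothetical sample data, used only for illustration
x <- c("apple pie", "banana split", "cherry tart")

# Search: does each element contain the pattern "an"?
has_an <- grepl("an", x)

# Edit: replace only the first "a" in each element
first_a <- sub("a", "A", x)

# Edit: replace every "a" in each element
all_a <- gsub("a", "A", x)

print(has_an)   # FALSE  TRUE FALSE
print(first_a)  # "Apple pie"    "bAnana split" "cherry tArt"
print(all_a)    # "Apple pie"    "bAnAnA split" "cherry tArt"
```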

Regular expressions are composed of ordinary characters and meta characters such as *, ., ?, ^ and $. Meta characters have a special meaning instead of matching themselves.

Meta characters

Meta character   Meaning
?                Match the preceding item zero or one time
*                Match the preceding item zero or more times
+                Match the preceding item one or more times
.                Match any single character

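? : Match the preceding item zero or one time
For example, colou?r matches both "color" and "colour", because the "u" is optional. In the output below, gregexpr reports the starting position of each match and, under match.length, the number of characters each match spans.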
> gregexpr("colou?r", "color is wrongly typed as colour")
[[1]]
[1]  1 27
attr(,"match.length")
[1] 5 6
attr(,"useBytes")
[1] TRUE
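Positions and lengths can be hard to read, so it may help to pull out the matched text itself. A minimal sketch, assuming base R's regmatches is applied to the gregexpr result (the variable names are only for illustration):

```r
text <- "color is wrongly typed as colour"

# gregexpr returns match positions; regmatches turns them into substrings
m <- gregexpr("colou?r", text)
found <- regmatches(text, m)[[1]]

print(found)  # "color"  "colour"
```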

* : Match the preceding item zero or more times
For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.

> gregexpr("ab*c", "abc bc abc abbbc ac babbc")
[[1]]
[1]  1  8 12 18 22
attr(,"match.length")
[1] 3 3 5 2 4
attr(,"useBytes")
[1] TRUE
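To test several strings at once, grepl can be used instead of gregexpr. The sketch below assumes a small made-up vector called words; grepl returns TRUE when the pattern is found anywhere in an element.

```r
# Hypothetical test strings
words <- c("ac", "abc", "abbc", "adc", "bc")

# "ab*c" needs an "a", then any number of "b" (including none), then "c"
star_hits <- grepl("ab*c", words)

print(star_hits)  # TRUE  TRUE  TRUE FALSE FALSE
```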


+ : Match the preceding item one or more times
For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".

> gregexpr("ab+c", "abc bc abc abbbc ac babbc")
[[1]]
[1]  1  8 12 22
attr(,"match.length")
[1] 3 3 5 4
attr(,"useBytes")
[1] TRUE
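The difference between * and + shows up only where the repeated item is absent. As a small sketch (variable names are illustrative), extracting the matches of both patterns from the same string and comparing them leaves just "ac":

```r
text <- "abc bc abc abbbc ac babbc"

# Matched substrings for each pattern
star <- regmatches(text, gregexpr("ab*c", text))[[1]]
plus <- regmatches(text, gregexpr("ab+c", text))[[1]]

# Matches found by "ab*c" that "ab+c" does not find
only_star <- setdiff(star, plus)

print(only_star)  # "ac"
```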


. : Match any single character
For example, a.c matches "abc", "acc", "azc", "axc", "a1c", "a7c" and so on. The "." can stand for any character, but exactly one must be present, so "ac" does not match.

> gregexpr("a.c", "abc bc abc abbbc ac babbc axc azc")
[[1]]
[1]  1  8 27 31
attr(,"match.length")
[1] 3 3 3 3
attr(,"useBytes")
[1] TRUE
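Because "." matches any character, it also matches a literal dot, a letter or a space. If only a real dot should match, one option is to escape it (written "\\." inside an R string) or to pass fixed = TRUE so the pattern is treated as plain text. A minimal sketch with made-up inputs:

```r
# Hypothetical inputs: a literal dot, a letter, and a space between "a" and "c"
x <- c("a.c", "abc", "a c")

any_char <- grepl("a.c", x)                # "." matches any single character
escaped  <- grepl("a\\.c", x)              # "\\." matches only a literal dot
literal  <- grepl("a.c", x, fixed = TRUE)  # pattern treated as plain text

print(any_char)  # TRUE TRUE TRUE
print(escaped)   # TRUE FALSE FALSE
print(literal)   # TRUE FALSE FALSE
```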


Quantifiers
Quantifiers specify the number of occurrences to match against.

Quantifier   Meaning
{n}          The preceding item is matched exactly n times.
{n,}         The preceding item is matched n or more times.
{n,m}        The preceding item is matched at least n times, but not more than m times.

> gregexpr("a{2}", "aaa, abcd, aabc, bcaad, aaaa")
[[1]]
[1]  1 12 20 25 27
attr(,"match.length")
[1] 2 2 2 2 2
attr(,"useBytes")
[1] TRUE

> gregexpr("a{2,4}", "aaa, abcd, aabc, bcaad, aaaa")
[[1]]
[1]  1 12 20 25
attr(,"match.length")
[1] 3 2 2 4
attr(,"useBytes")
[1] TRUE

> gregexpr("a{2,}", "aaa, abcd, aabc, bcaad, aaaa")
[[1]]
[1]  1 12 20 25
attr(,"match.length")
[1] 3 2 2 4
attr(,"useBytes")
[1] TRUE
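Looking at the matched text makes the outputs above easier to interpret. In this sketch (variable names are illustrative), a{2} splits "aaaa" into two matches of two characters, while a{2,4} takes as many "a" characters as it can, up to four. The last lines also check that + behaves like the quantifier {1,} on the earlier example string.

```r
text <- "aaa, abcd, aabc, bcaad, aaaa"

# Exactly two "a" per match: "aaaa" yields two separate matches
exact <- regmatches(text, gregexpr("a{2}", text))[[1]]

# Between two and four "a" per match: each run is taken as a whole
ranged <- regmatches(text, gregexpr("a{2,4}", text))[[1]]

print(exact)   # "aa" "aa" "aa" "aa" "aa"
print(ranged)  # "aaa"  "aa"   "aa"   "aaaa"

# "+" is shorthand for {1,}: both patterns match at the same positions
t2 <- "abc bc abc abbbc ac babbc"
plus_pos  <- as.integer(gregexpr("ab+c", t2)[[1]])
bound_pos <- as.integer(gregexpr("ab{1,}c", t2)[[1]])
print(identical(plus_pos, bound_pos))  # TRUE
```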


