Tuesday, May 27, 2008

What are Regular Expressions?

Regular expressions are a way of describing a pattern. Using such a pattern with PHP, you can match, examine, replace, and edit strings with a great deal of flexibility. This guide covers the basics of Perl Compatible Regular Expressions, or PCRE, and how to use preg_match(), preg_replace(), and preg_split().
Let's dive right into some basic examples, and how to use them.

Pattern Matching
Using preg_match(), we can perform Perl-style pattern matching on a string. The preg_match() function returns 1 if a match is found, 0 if there was no match, and false if an error occurred. Optionally, you can also store the matches in an array by passing a variable as the third parameter. This can be very helpful for validating data.
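As a rough sketch of the optional third parameter, the snippet below captures parts of a date. The input string and the variable names are made up for illustration.

```php
<?php
// Hypothetical input string, chosen only for this example.
$string = 'Posted on May 27, 2008';

// preg_match() fills $matches: index 0 holds the whole match,
// indexes 1, 2, 3 hold the text captured by each pair of parentheses.
$result = preg_match('/([A-Za-z]+) ([0-9]+), ([0-9]{4})/', $string, $matches);

if ($result === 1) {
    echo 'Whole match: ' . $matches[0] . "\n";
    echo 'Month: ' . $matches[1] . "\n";
    echo 'Year: ' . $matches[3] . "\n";
} elseif ($result === 0) {
    echo "No match\n";
} else {
    // preg_match() returned false, so the pattern could not be evaluated
    echo "Error while matching\n";
}
```

The simplest use needs only the first two parameters: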

$string = "football";
if (preg_match('/foo/', $string)) {
    // matched correctly
}


This would correctly match, because the word football has foo in it. Now let's try a more complicated idea, like validating an email address.

$string = "first.last@domain.uno.dos";
// the first character must be a letter or underscore, not a digit
if (preg_match(
    '/^[a-zA-Z_][a-zA-Z0-9_]+([.][a-zA-Z0-9_]+)*[@][a-zA-Z0-9_]+([.][a-zA-Z0-9_]+)*[.][a-zA-Z]{2,4}$/',
    $string)) {
    // valid email address
}


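This example checks that an email address follows a common form. It is a simplified check and will reject some valid addresses (for example, ones with a hyphen in the domain). Now let's go into what the various characters that define our pattern do.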

PCRE follows Perl's pattern syntax, so the pattern must be enclosed in a pair of delimiters. We're going to use / as our delimiter.

The ^ at the beginning and the $ at the end anchor the pattern to the start and the end of the string. Without the $, for example, the pattern would still match if there were more data after the email address.
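To see the anchors at work, here is a small sketch that runs the same word through patterns with and without ^ and $. The input string is made up for the example.

```php
<?php
// Hypothetical input, chosen only for illustration.
$input = 'football season';

$results = [
    'anywhere' => preg_match('/foo/', $input),        // foo appears somewhere
    'at start' => preg_match('/^foo/', $input),       // the string starts with foo
    'at end'   => preg_match('/foo$/', $input),       // the string does not end with foo
    'whole'    => preg_match('/^football$/', $input), // extra text after football
];

print_r($results);
```

The first two checks return 1 and the last two return 0, because the string neither ends with foo nor consists only of football.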

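[ and ] define a character class, a set of characters that are acceptable at that position. For instance, a-z allows all lowercase letters, A-Z all uppercase letters, 0-9 the digits 0 through 9, and _ an underscore. A + after the class means one or more of those characters, and a * after a group means zero or more repetitions of it.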
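A minimal sketch of character classes, using made-up inputs. Each pattern is anchored with ^ and $, and the + after the class asks for one or more characters from it.

```php
<?php
// Only lowercase letters are allowed by [a-z].
$lower = preg_match('/^[a-z]+$/', 'hello');          // matches
$mixed = preg_match('/^[a-z]+$/', 'Hello');          // fails on the capital H

// Letters, digits and the underscore, as in the email pattern.
$word  = preg_match('/^[a-zA-Z0-9_]+$/', 'user_42'); // matches
$space = preg_match('/^[a-zA-Z0-9_]+$/', 'user 42'); // fails on the space

var_dump($lower, $mixed, $word, $space);
```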

The { and } define how many times the preceding item must repeat. In this example, [a-zA-Z]{2,4} means the final section must be 2 to 4 letters long, like the uk in .co.uk or the info in .info.
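The {2,4} range can be tried on its own. This sketch tests a few hypothetical endings against just that part of the email pattern.

```php
<?php
// The last piece of the email pattern, anchored so it must match the whole input.
$pattern = '/^[a-zA-Z]{2,4}$/';

$accepted = [];
foreach (['uk', 'info', 'c', 'museum'] as $ending) {
    // Record whether each ending has between 2 and 4 letters.
    $accepted[$ending] = preg_match($pattern, $ending) === 1;
}

var_dump($accepted);
```

Here uk and info pass, while c (too short) and museum (too long) are rejected, which also shows one limit of the email pattern above.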

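( and ) group sections of the pattern together, so that a quantifier such as * applies to the whole group, and they also capture the text the group matched. Inside a group, | separates alternatives: (a|b|c) would match a or b or c.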
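Here is a short sketch of a group with alternatives. The file names are invented for the example.

```php
<?php
// The group accepts exactly one of three extensions.
$pattern = '/^photo[.](jpg|png|gif)$/';

// $matches[1] receives whichever alternative the group matched.
$matched = preg_match($pattern, 'photo.png', $matches);
echo 'Extension: ' . $matches[1] . "\n";

// bmp is not one of the alternatives, so this does not match.
$other = preg_match($pattern, 'photo.bmp');
var_dump($other);
```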

A single period (.) will match any single character (except a newline, by default). Inside a character class, as in [.], it matches a literal period.
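The difference between . and [.] is easy to check with a couple of made-up strings:

```php
<?php
// A bare dot accepts any single character in the middle.
$dotAny     = preg_match('/^a.c$/', 'abc');   // matches
$dotPeriod  = preg_match('/^a.c$/', 'a.c');   // matches too

// Inside a character class the dot is only a literal period.
$classAny    = preg_match('/^a[.]c$/', 'abc'); // no match
$classPeriod = preg_match('/^a[.]c$/', 'a.c'); // matches

var_dump($dotAny, $dotPeriod, $classAny, $classPeriod);
```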

Certain symbols have to be escaped with a backslash (\) when you want to use them literally instead of to control your regular expression. These characters are \ ( ) [ ] { } . * ? + ^ | $ and whichever delimiter you chose, which is / here.
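As a sketch of escaping, the snippet below matches a literal price in an invented string, first by escaping by hand and then with PHP's preg_quote() function, which adds the backslashes for you.

```php
<?php
// Hypothetical text containing several special characters.
$text = 'Total: $5.00 (approx.)';

// Unescaped, $ means "end of string", so this cannot match.
$unescaped = preg_match('/$5.00/', $text);

// Escaped by hand with backslashes.
$manual = preg_match('/\$5\.00 \(approx\.\)/', $text);

// preg_quote() escapes the special characters; the second argument
// names the delimiter so that it gets escaped as well.
$quoted = preg_quote('$5.00 (approx.)', '/');
$auto = preg_match('/' . $quoted . '/', $text);

var_dump($unescaped, $manual, $auto);
```

preg_quote() is handy when the literal text comes from a variable rather than being typed into the pattern.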

Pattern Replacing
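preg_replace() will allow you to replace anything that your regular expression matches with what you define. A simple example of replacing text is a simple comment remover.

// # is the delimiter, so / needs no escaping; the s modifier lets . match newlines
$val = preg_replace('#/\*.*?\*/#s', '', $val);


This will remove multi-line comments in the form of /* comment */ from CSS and PHP files. The * characters are escaped so that they are matched literally, and .*? matches as little as possible, so each comment is removed separately instead of everything from the first /* to the last */. The parameters you pass are the regular expression, what you want to replace it with, and the string to use. The result is returned rather than written back to the original variable. If you want to use the sub patterns you've defined in your regular expression, $0 is set to the entire match, and $1, $2, and so forth are set to the individual matches for each sub pattern.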
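To illustrate $0, $1, $2 and so on in the replacement string, here is a small sketch that reorders a date and wraps a number in brackets. The inputs are made up.

```php
<?php
// Three sub patterns: year, month, day.
$date = '2008-05-27';
$swapped = preg_replace(
    '/^([0-9]{4})-([0-9]{2})-([0-9]{2})$/',
    '$2/$3/$1', // month/day/year, built from the captured pieces
    $date
);
echo $swapped . "\n";

// $0 stands for the entire match.
$wrapped = preg_replace('/[0-9]+/', '[$0]', 'room 101');
echo $wrapped . "\n";
```

The replacement strings are in single quotes so that PHP does not try to treat $1 or $0 as variables.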

Pattern Splitting
preg_split() can split a string into pieces by something more complicated than just one or two characters. For example, a way to grab all tags regardless of the spacing around the commas (for a plain comma with no spacing, explode() would work better) could be...

$tags = preg_split('/\s*,\s*/', 'my, tags ,unevenly,   spaced');
print_r($tags);
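
For comparison, here is a sketch of how explode() and preg_split() treat a messier, made-up tag list. It also uses the PREG_SPLIT_NO_EMPTY flag, which the example above does not, to drop empty pieces.

```php
<?php
// Hypothetical input with uneven spacing, a doubled comma and a trailing comma.
$input = 'my, tags ,,unevenly ,  spaced,';

// explode() splits on the comma only, so spaces and empty pieces remain.
$exploded = explode(',', $input);

// preg_split() treats a comma plus any surrounding whitespace as one separator.
// -1 means no limit; PREG_SPLIT_NO_EMPTY discards empty pieces.
$tags = preg_split('/\s*,\s*/', $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($exploded);
print_r($tags);
```

The preg_split() call yields just the four tags, while the explode() result still needs trimming and filtering.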
