Using Perl Compatible Regular Expressions with PHP
Intended Audience
Overview
Learning Objectives
Definitions
Background Information
Prerequisites
PCRE Syntax
How the Scripts Work
Script Overview
E-mail Validation
Scripts
About The Author
Intended Audience
This tutorial is intended for the PHP programmer interested in
using Perl Compatible Regular Expressions (or PCRE for short) to match or
replace values within a target string.
A basic understanding of PHP is assumed, and a brief overview of
Perl-style pattern syntax is given in this tutorial. Knowledge of the intended
use of regular expressions will be helpful, although this tutorial should make
that clear.
Readers interested in learning more about Regular Expressions
and PHP before reading this tutorial are encouraged to do so by referring
to:
Zend.com’s PHP manual entry on PCRE
http://www.zend.com/manual/ref.pcre.php
Zend.com’s PHP manual entry on preg_* functions
http://www.zend.com/manual/function.preg-replace.php
Overview
This tutorial will show you how to replace, match and
otherwise manipulate strings within a target by using regular expressions. A
regular expression is used for complex string manipulation in PHP and can be
very handy when one needs to validate a value, grab information from an outside
source, locate information within a page, or replace all of one specified string
or value with another. It can be a great timesaver by doing all of the “find and
replace” work that one would otherwise have to do by hand. Regular Expressions
can also take care of much of the double checking that one often does when, for
example, making sure that e-mail addresses submitted to a listserv are correct,
or that URLs submitted for links are in the proper format.
Learning Objectives
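In this tutorial, you will learn how to use the following PHP
functions:
preg_replace()
preg_match()
preg_match_all()
Note that PHP has no preg_replace_all() function; preg_replace() already
replaces every match unless a limit is given.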
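As a quick orientation, here is a minimal sketch of the three functions side by side. The sample sentence and the variable names are made up for illustration and do not come from the scripts later in this tutorial.

```php
<?php
// Sample text invented for this illustration.
$text = 'My car is parked next to another car.';

// preg_match() stops at the first match and returns 1 (found) or 0 (not found).
$found = preg_match('/car/', $text, $first);

// preg_match_all() returns the number of matches and collects every one of them.
$count = preg_match_all('/car/', $text, $all);

// preg_replace() replaces every match unless a limit is passed as a fourth argument.
$replaced = preg_replace('/car/', 'truck', $text);

echo $found, "\n";     // 1
echo $first[0], "\n";  // car
echo $count, "\n";     // 2
echo $replaced, "\n";  // My truck is parked next to another truck.
```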
Definitions
Regular Expression: A pattern that describes a set of strings,
used for complex string matching and manipulation in PHP.
PCRE: Perl Compatible Regular Expressions – use the syntax widely
used in Perl. The alternative regular expression functions in PHP employ
POSIX-extended syntax.
Target: The string (for example, the contents of a text file or
web page) that will be “searched” by the regular expression.
Object: The entity within the target that will be searched
for.
Background Information
When using regular expressions, one thing is more important
than all the rest: syntax. The syntax sets the exact definition of what you are
searching for. Without the proper syntax, the regular expression can match far
more than intended, or it can simply not function at all. To avoid constant
syntax problems, a plan of attack and a bit of patience are usually the key. It
is usually prudent, especially depending on the restrictions one has in place,
to first over-specify the pattern and then simplify it. This amount of scrutiny
ensures that the object you are searching for is the only one grabbed, instead
of outputting large amounts of unnecessary and unwanted matches.
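To illustrate the difference between a loose and an over-specified pattern, here is a small sketch. The sample sentence is invented, and \d (any digit) comes from the pattern syntax reference rather than from the list of meta-characters covered in this tutorial.

```php
<?php
// Sample target invented for this illustration.
$target = 'Order 66 shipped on 2007-01-02 to suite 12.';

// Too loose: any run of digits counts, so unrelated numbers are grabbed too.
$looseCount = preg_match_all('/\d+/', $target, $loose);

// Over-specified: four digits, a hyphen, two digits, a hyphen, two digits.
$strictCount = preg_match_all('/\d{4}-\d{2}-\d{2}/', $target, $strict);

echo $looseCount, "\n";    // 5
echo $strictCount, "\n";   // 1
echo $strict[0][0], "\n";  // 2007-01-02
```

Starting from the strict form and loosening it only where needed keeps unwanted matches out of the result.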
Prerequisites
As a prerequisite to understanding Perl Compatible Regular
Expressions, it is recommended that you:
Review the links outlined above in the Intended Audience
section
Become familiar with, and regularly refer to, the Pattern Syntax
section of the Zend manual when constructing future regular
expressions:
http://www.zend.com/manual/reference.pcre.pattern.syntax.php
PCRE Syntax
The types of regular expressions that will be covered in this
tutorial are called Perl Compatible Regular Expressions, or PCRE for short. What
this means is that the regular expressions associated with these PHP functions
closely follow the syntax used in Perl regular expressions. This syntax is
primarily used with the preg functions in PHP, i.e. preg_replace, preg_match,
preg_quote, preg_split, preg_grep, etc. These functions tend to be slightly
faster than their POSIX compatible relatives (eregi, ereg, ereg_replace, etc.),
which were deprecated in PHP 5.3 and removed in PHP 7.
The syntax used can be confusing, but it ensures a very specific set of search
criteria. The best place to find a large portion of this syntax is in the PHP
manual:
http://www.zend.com/manual/reference.pcre.pattern.syntax.php
The first type of syntax to cover involves meta-characters.
These characters are stand-ins that cause the expression to behave in a
certain way. They come in handy because regular expressions are used to
find patterns, rather than fixed text, that represent the object you are
searching for within a target. For example, if you were searching for an e-mail
address in a Web page, you would look for the combination of an @ symbol
followed by a period and then a predictable array of endings, i.e. com, org,
net, etc. A complete list of these meta-characters can be found at the
aforementioned link; a few of the most important ones are illustrated here.
/ indicates a delimiter (marks the beginning and end of an expression; pattern modifiers follow the closing delimiter)
^ indicates the start of the target string to match
$ indicates the end of the target string to match
\ is used as a general escape character
{ } encloses a minimum, maximum value – used to indicate how many times the preceding character or group may repeat
( ) encloses a subpattern
| separates alternative patterns
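The meta-characters listed above can be tried out with a few one-line checks. This is only a sketch: the test strings and variable names are invented for illustration.

```php
<?php
// ^ and $ anchor the pattern to the start and end of the string.
$exact   = preg_match('/^car$/', 'car');         // 1: the whole string is "car"
$partial = preg_match('/^car$/', 'carriage');    // 0: extra characters follow "car"

// { } repeats the preceding item: here, between 2 and 4 letters "a".
$repeat  = preg_match('/^a{2,4}$/', 'aaa');      // 1
$tooMany = preg_match('/^a{2,4}$/', 'aaaaa');    // 0

// \ escapes a meta-character so that it is matched literally.
$literal = preg_match('/3\.5/', 'version 3.5');  // 1
$escaped = preg_match('/3\.5/', 'version 365');  // 0: "\." must be a real dot

// ( ) groups a subpattern and | separates the alternatives inside it.
$either  = preg_match('/^(cat|dog)s$/', 'dogs'); // 1

echo "$exact $partial $repeat $tooMany $literal $escaped $either\n"; // 1 0 1 0 1 0 1
```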
So, for example:
/car/ (note the beginning and closing / for delimiters,
these enclose your expression)
indicates that the regular expression is looking for the
letters “car”. So, if a sentence looked like this:
I own a car now.
A match would be found, and the match would return “car”.
Alternatively, if we had put the word carriage into the sentence instead of car,
a match would still be made, but it would only return the word “car” since that
is what was requested in the regular expression.
Try it yourself:
<?php
// Search the sentence for "car"; the matches are stored in $output.
preg_match('/car/', 'I own a car now.', $output);
echo $output[0]; // prints: car
?>
So, for another example:
/car.*/
indicates that the regular expression is looking for the
letters “car”, followed by any further characters on the line. So, if a
sentence looked like this:
I own a carriage now.
A match would be found, and the match would return
“carriage now.”
Alternatively, if we had used the first sentence, a match
would have been made and the words “car now.” would have been returned. This is
because the regular expression requested a match starting with “car” and all
characters after it on that line.
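Because .* matches every character up to the end of the line, the match includes everything after the word as well. If only the word itself is wanted, one option (a sketch, not part of the original scripts) is to use \w* instead, which matches letters, digits and underscores and therefore stops at the first space.

```php
<?php
$sentence = 'I own a carriage now.';

// .* is greedy: it runs to the end of the line.
preg_match('/car.*/', $sentence, $greedy);

// \w* only matches "word" characters, so it stops at the first space.
preg_match('/car\w*/', $sentence, $word);

echo $greedy[0], "\n"; // carriage now.
echo $word[0], "\n";   // carriage
```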
Try it yourself:
<?php
// .* matches every remaining character up to the end of the line.
preg_match('/car.*/', 'I own a carriage now.', $output);
echo $output[0]; // prints: carriage now.
?>
So, one more example:
/ca(r|nyon)/
indicates that the regular expression is looking for either the
letters “car” or the word “canyon”. So, if a sentence looked
like this:
I own a car now.
A match would be found, and the match would return “car”.
Alternatively, if the word car had been replaced with canyon, a match would have
been made and the word “canyon” would have been returned. This is because the
regular expression requested a match starting with “ca” and either “r” or “nyon”
ending the match.
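The parentheses do more than group the alternatives: preg_match() also stores what each subpattern matched in the output array. The following sketch uses a made-up sentence to show the difference between index 0 (the full match) and index 1 (the first subpattern).

```php
<?php
// Sample sentence invented for this illustration.
preg_match('/ca(r|nyon)/', 'I hiked the canyon.', $output);

echo $output[0], "\n"; // canyon  (the full match)
echo $output[1], "\n"; // nyon    (what the parenthesised subpattern matched)
```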
Try it yourself:
<?php
// The subpattern (r|nyon) accepts either ending after "ca".
preg_match('/ca(r|nyon)/', 'I own a car now.', $output);
echo $output[0]; // prints: car
?>
How the Scripts Work
In this tutorial, the following example is given:
E-mail Validation: uses a regular expression to make sure that
an e-mail address has a valid format. This can be used to
validate e-mail addresses submitted via a form or to search a target page for
e-mail addresses and display them.
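For the second use, searching a page for addresses and displaying them, a minimal sketch with preg_match_all() might look like this. The page text is invented, and the pattern is a validation-style pattern with the ^ and $ anchors left out so that it can match anywhere in the text.

```php
<?php
// Sample page text invented for this illustration.
$page = 'Contact sales@example.com or support@example.org for help.';

// No ^ or $ here, so addresses can be found anywhere inside the text.
$pattern = '/[A-Za-z0-9_\-]+@([A-Za-z0-9_\-]+\.)+[A-Za-z]{2,4}/';

$count = preg_match_all($pattern, $page, $matches);

// $matches[0] holds the full text of every match.
foreach ($matches[0] as $address) {
    echo $address, "\n";
}
```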
Script Overview
Having read the introduction, you should have an understanding
of what regular expressions are. In the following example, the versatility of
regular expressions is illustrated. Remember, regular expressions are tools that
can be used in a variety of ways, not just those illustrated here. They can
make otherwise tedious and lengthy jobs a breeze. At the end of the tutorial,
the regular expression is combined with PHP in context to show how it would
be used.
E-mail Validation
The regular expression this tutorial covers is also one of the
most popular uses of regular expressions. By validating an e-mail address, one
can ensure that at least the format is correct (although it cannot verify that
the address actually exists). This can prevent accidental submissions of partial
e-mail addresses to your database or form, it can ensure that an e-mail address
submitted to a listserv is well formed, or it can be used to search and replace
all the contact e-mail addresses on your website with a different or updated
e-mail address. All of these uses can employ a regular expression that makes
the job of checking e-mail addresses by hand obsolete.
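The search-and-replace use can be sketched with preg_replace() and preg_quote(). Both addresses and the page text below are hypothetical; preg_quote() is used so that the period in the old address is matched literally.

```php
<?php
// Hypothetical addresses and page text, used only for illustration.
$old  = 'webmaster@example.com';
$new  = 'contact@example.org';
$page = 'Mail webmaster@example.com for help. Press: webmaster@example.com';

// preg_quote() escapes characters such as "." that are special in a pattern;
// the second argument also escapes the "/" delimiter.
$pattern = '/' . preg_quote($old, '/') . '/';

$updated = preg_replace($pattern, $new, $page);
echo $updated, "\n";
// Mail contact@example.org for help. Press: contact@example.org
```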
Code Flow
Assign the result of the match to a variable
($okay).
Invoke the preg_match function to test the target string
($emailfield) against the desired validation pattern.
<?php
$emailfield = 'user@example.com'; // sample value; normally this comes from the form
$okay = preg_match('/^[A-Za-z0-9_\-]+@([A-Za-z0-9_\-]+\.)+[A-Za-z]{2,4}$/', $emailfield);
?>
Items to match:
Begin with a delimiter /
Then indicate the beginning of the string with ^
Then, [A-Za-z0-9_\-] is any character A-Z, a-z, 0-9, _ or -.
(Writing the range as A-z would also admit the punctuation characters that sit
between Z and a in ASCII, such as [ and ^, so the two ranges are spelled out.)
Then, indicate that this pattern occurs one or more times with the +
symbol.
Then, add an @ after the plus to look for the @ symbol
in the e-mail address (a dead giveaway for validation scripts).
Now repeat the previous criteria for matching (letters A-Z and
a-z, the numbers 0-9, _ or -), followed by an escaped period \.
Putting ( ) around that subset and adding a + after it tells
the expression to look for one or more of these dot-terminated parts (the
“example.” in example.com, or both “bbb.” and “ccc.” in bbb.ccc.pl)
Then [A-Za-z] with a minimum/maximum bracket {2,4} tells it to look
for an ending of two to four letters (i.e. com, net, de, au, etc.)
Finally, the $ indicates the end of the target
string
And the expression is closed with our ending delimiter
/
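The steps above can be checked against a handful of sample addresses. This is a sketch only: looksLikeEmail() is a hypothetical helper name, and the sample addresses are chosen for illustration.

```php
<?php
// Hypothetical helper; the pattern follows the steps described above.
function looksLikeEmail($address)
{
    $pattern = '/^[A-Za-z0-9_\-]+@([A-Za-z0-9_\-]+\.)+[A-Za-z]{2,4}$/';
    return preg_match($pattern, $address) === 1;
}

$samples = array(
    'aaa@bbb.pl',            // valid
    'aaa@bbb.ccc.pl',        // valid: more than one dot-terminated part
    'aaa@bbb',               // invalid: no ending after a period
    'aaa@bbb..pl',           // invalid: two periods in a row
    'my.email@example.com',  // invalid: this pattern allows no dot before the @
);

foreach ($samples as $sample) {
    echo $sample, ': ', looksLikeEmail($sample) ? 'valid' : 'invalid', "\n";
}
```

The last sample shows a limit of this pattern rather than a bad address: a dot before the @ and endings longer than four letters (such as .museum) are rejected, so the character class and the {2,4} range would need widening to accept them.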
Scripts
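E-mail Validations (email.php)
<?php
// Run the check only after the form has been submitted.
// $_POST is used because register_globals no longer exists in current PHP.
if (isset($_POST['submit'])) {
    $emailfield = isset($_POST['emailfield']) ? $_POST['emailfield'] : '';
    $okay = preg_match(
        '/^[A-Za-z0-9_\-]+@([A-Za-z0-9_\-]+\.)+[A-Za-z]{2,4}$/',
        $emailfield
    );
    if ($okay) {
        echo "E-mail is validated";
    } else {
        echo "E-mail is incorrect";
    }
} else {
?>
<form method="POST" action="email.php">
E-mail address: <input type="text" name="emailfield">
<br><input type="submit" name="submit" value="Validate">
</form>
<?php
}
?>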
If an e-mail address is submitted that does not fit our criteria,
the script reports “E-mail is incorrect”. If it does fit our specified
criteria, the script reports “E-mail is validated”.

Conclusion
That’s all there is to it. With this example, you can readily
see how useful regular expressions can be. When effectively implemented, they
are tools that give programmers a great deal of power. While PHP has also
offered the POSIX compatible regular expressions, which generally have slightly
easier syntax, the PCRE regular expressions are generally reputed to have
widespread acceptance and powerful implementations, and they tend to be faster
when used with large files or large strings.
About The Author
Patrick Delin is a System/Database/Web Administrator with over
8 years of development experience and 3 years of PHP experience. During the past
two years, he has enjoyed his ongoing relationship with the PHP community and
Zend Technologies and has written a number of tutorials.
He can be contacted at:
pdelin@unl.edu.


4 comments to “Using Perl Compatible Regular Expressions with PHP”
January 2nd, 2007 at 4:17 pm
Hi,
for e-mail address:
aaa@bbb.ccc.pl
aaa@bbb.pl
your code:
/^[A-z0-9_\-]+[@][A-z0-9_\-]+([.][A-z0-9_\-]+)+[A-z]{2,4}$/
first you should change some + to *, like following:
^[A-z0-9_\-]+[@][A-z0-9_\-]+([.][A-z0-9_\-]*)*[A-z]{2,4}$
It allows us to enter "aaa@bbb.pl" e-mail address.
Then should add [.] to your expression so it looks as follows:
^[A-z0-9_\-]+[@][A-z0-9_\-]+([.][A-z0-9_\-]*)*[.][A-z]{2,4}$
It enables us to enter "aaa@bbb.pl" e-mail address.
This needs additional changes to avoid entering "aaa@bbbb…..pl"
Then perhaps we have to make changes to meet e-mail address standards, like (1st sign of e-mail address cannot be number, etc).
Best regards,
Slawomir Naborczyk
February 15th, 2007 at 5:01 am
This regex should be good enough I guess-
/^[^0-9][A-z0-9_-]+[@][A-z0-9_-]+([.][A-z0-9_-]+)*[.][A-z]{2,4}$/
* checks for a number at the beginning
* takes care of emails like abcd@corp.abcd.co.in
* also checks for multiple periods in the mail ID
Regards,
Sunil Jagadish
http://suniljagadish.wordpress.com
October 17th, 2008 at 5:05 am
dot museum (6 characters) is a valid TLD.
April 8th, 2009 at 8:43 pm
None of these examples will accept an email address with a dot (.) before the @ sign, like my.email@example.com. A quick improvement, based on the above examples, would be the pattern
/^[A-z0-9_.-]+@[A-z0-9_-]+(\.[A-z0-9_-]+)*\.([A-z]{2,4}|museum)$/