PHP Regular Expressions Introduction - Web Development and Design | Tutorial for Java, PHP, HTML, Javascript

Sunday, July 7, 2019

PHP Regular Expressions Introduction

PHP Regular Expressions



Introduction

Regular expressions are an intricate and powerful tool for matching patterns and manipulating text. Though not as fast as plain-vanilla string matching, regular expressions are extremely flexible. They allow you to construct patterns to match almost any conceivable combination of characters with a simple—albeit terse and punctuation-studded—grammar. If your website relies on data feeds that come in text files—data feeds like sports scores, news articles, or frequently updated headlines—regular expressions can help you make sense of them.

This chapter gives a brief overview of basic regular expression syntax and then focuses on the functions that PHP provides for working with regular expressions. For a bit more detailed information about the ins and outs of regular expressions, check out the PCRE section of the PHP online manual and Appendix B of Learning PHP 5 by David Sklar (O’Reilly). To start on the path to regular expression wizardry, read the comprehensive Mastering Regular Expressions by Jeffrey E.F. Friedl (O’Reilly).

Regular expressions are handy when transforming plain text into HTML, and vice versa. Luckily, because these are such helpful subjects, PHP has many built-in functions to handle these tasks, explained by recipes in other chapters. Those recipes tell how to escape HTML entities, cover stripping HTML tags, and show how to convert plain text to HTML and HTML to plain text. Another recipe gives information on matching and validating email addresses.

Over the years, the functionality of regular expressions has grown from its basic roots to incorporate increasingly useful features. As a result, PHP offers two different sets of regular expression functions. The first set includes the traditional (or POSIX) functions, whose names each begin with ereg (for extended regular expressions; the ereg functions themselves are already an extension of the original feature set). The other set includes the Perl-compatible family of functions, whose names are prefixed with preg (for Perl-compatible regular expressions).

The preg functions use a library that mimics the regular expression functionality of the Perl programming language. This is a good thing because Perl allows you to do a variety of handy things with regular expressions, including nongreedy matching, forward and backward assertions, and even recursive patterns.
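
As a rough illustration of those three Perl-style features, here is a minimal sketch. The sample strings are invented for the example and are not from any real page.

```php
<?php
// Hypothetical sample text, invented for this sketch.
$html = '<b>one</b> and <b>two</b>';

// Greedy: .+ grabs as much as it can, so one match spans both elements.
preg_match('{<b>.+</b>}', $html, $greedy);
// Nongreedy: .+? stops at the first </b> it can reach.
preg_match('{<b>.+?</b>}', $html, $lazy);
print $greedy[0] . "\n";   // <b>one</b> and <b>two</b>
print $lazy[0] . "\n";     // <b>one</b>

// Lookahead (?=...) and lookbehind (?<=...) test the surrounding text
// without making it part of the match.
preg_match('/\d+(?= dollars)/', 'It costs 25 dollars', $ahead);
preg_match('/(?<=\$)\d+/', 'It costs $25', $behind);
print $ahead[0] . ' ' . $behind[0] . "\n";   // 25 25

// Recursive pattern: (?R) re-enters the whole pattern, which lets it
// match parentheses nested to any depth.
preg_match('/\((?:[^()]++|(?R))*\)/', 'f(a(b)c)d', $nested);
print $nested[0] . "\n";   // (a(b)c)
```

None of these constructs is available in the ereg functions, which is one practical reason to prefer the preg family.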

There’s no longer any reason to use the ereg functions: they were officially deprecated as of PHP 5.3.0 and removed entirely in PHP 7.0.0. They offer fewer features, and they’re slower than the preg functions. However, the ereg functions existed in PHP for many years prior to the introduction of the preg functions, so they still turn up in legacy code and in the habits of many programmers. Thankfully, the two sets of functions take their arguments in the same order, so it’s easy to switch from one to the other without too much confusion; the most visible difference is that preg patterns must be wrapped in delimiters, which ereg patterns do not use. (We list how to do this while avoiding the major gotchas.)

Think of a regular expression as a program in a very restrictive programming language. The only task of a regular expression program is to match a pattern in text. In regular expression patterns, most characters just match themselves. That is, the regular expression rhino matches strings that contain the five-character sequence rhino. The fancy business in regular expressions is due to a handful of punctuation and symbols called metacharacters. These symbols don’t literally match themselves, but instead give commands to the regular expression matcher.

The most frequently used metacharacters include the period (.), asterisk (*), plus sign (+), and question mark (?). (To match a literal metacharacter in a pattern, precede the character with a backslash.)

  • The period means “match any character,” so the pattern .at matches bat, cat, and even rat.

  • The asterisk means “match 0 or more of the preceding object.” (So far, the only objects we know about are characters.)

  • The plus is similar to asterisk, but means “match one or more of the preceding object.” So .+at matches brat, sprat, and even the cat inside of catastrophe, but not plain at. To match at, replace the + with an *.

  • The question mark means “the preceding object is optional.” That is, it matches 0 or 1 of the object that precedes it. colou?r matches both color and colour.
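
The four metacharacters above can be tried out with a short script. This is only a sketch: the patterns come from the list, while the extra subjects (such as colr) and the final backslash-escaping check are additions made for illustration.

```php
<?php
// Each pattern from the list above, paired with sample subject strings.
$checks = [
    '/.at/'     => ['bat', 'cat', 'rat', 'at'],
    '/.+at/'    => ['brat', 'sprat', 'catastrophe', 'at'],
    '/.*at/'    => ['at'],
    '/colou?r/' => ['color', 'colour', 'colr'],
];

foreach ($checks as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        // preg_match() returns 1 on a match and 0 when there is none.
        $result = preg_match($pattern, $subject) ? 'matches' : 'does not match';
        print "$pattern $result $subject\n";
    }
}

// A backslash turns a metacharacter into a literal: \. matches only a period.
var_dump(preg_match('/3\.14/', '3.14'));  // int(1)
var_dump(preg_match('/3\.14/', '3514'));  // int(0)
```

Plain at fails against both .at and .+at because each requires at least one character before the a, while .*at accepts it.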


To apply * and + to objects greater than one character, place the sequence of characters that make up the object inside parentheses. Parentheses allow you to group characters for more complicated matching and also capture the part of the pattern that falls inside them. A captured sequence can be referenced by preg_replace() to alter a string, and all captured matches can be stored in an array that’s passed as a third parameter to preg_match() and preg_match_all(). The preg_match_all() function is similar to preg_match(), but it finds all nonoverlapping matches inside a string, instead of stopping at the first match.
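
To make grouping and capturing concrete, here is a small sketch. The subject strings are invented for the example.

```php
<?php
// Hypothetical sample string, invented for this sketch.
$log = 'ha haha hahaha';

// Parentheses make "ha" a single object, so + applies to the whole group.
preg_match('/(ha)+/', $log, $first);
print $first[0] . "\n";                 // ha

// preg_match_all() keeps going after the first match and returns the count.
$count = preg_match_all('/(ha)+/', $log, $all);
print $count . "\n";                    // 3
print implode(',', $all[0]) . "\n";     // ha,haha,hahaha

// Captured groups land in the match array: index 0 is the whole match,
// index 1 the first group, index 2 the second.
preg_match('/(\d+)-(\d+)/', 'pages 12-34', $range);
print $range[1] . ' to ' . $range[2] . "\n";   // 12 to 34

// preg_replace() can refer to the captures as $1 and $2 in the replacement.
$swapped = preg_replace('/(\d+)-(\d+)/', '$2-$1', 'pages 12-34');
print $swapped . "\n";                  // pages 34-12
```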

Example Using preg functions

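       // Sample inputs, assumed here so the snippet runs on its own
       $html = '<title>Recipes</title><ul><li>One</li><li>Two</li></ul>';
       $bold = 'This is <b>important</b>.';

       if (preg_match('{<title>.+</title>}', $html)) {
              print "The page has a title!\n";
       }

       if (preg_match_all('/<li>/', $html, $matches)) {
              print 'Page has ' . count($matches[0]) . " list items\n";
       }

       // turn bold into italic

       $italics = preg_replace('/(<\/?)b(>)/', '$1i$2', $bold);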

Normally, the pattern delimiter character, which starts and ends the pattern string, is /. Because the pattern delimiter character needs to be backslash-escaped if it appears as a literal inside the pattern, the slash is a clumsy delimiter when matching HTML or XML. The preceding code uses open and close curly braces as delimiters in the first pattern string to avoid this problem. Any nonalphanumeric, nonwhitespace character (except backslash) can be a pattern delimiter character. If you use an open-bracket character as the opening delimiter, you can use a corresponding close bracket as the closing delimiter.
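
The following sketch writes the same pattern with three different delimiters. The markup in $html is a made-up sample.

```php
<?php
// Hypothetical sample markup, invented for this sketch.
$html = '<p>See <a href="/docs/">the docs</a></p>';

// With / as the delimiter, every literal slash needs a backslash.
$withSlashes = preg_match('/<a href="\/docs\/">.+<\/a>/', $html);

// With # as the delimiter, the slashes can stay as they are.
$withHashes = preg_match('#<a href="/docs/">.+</a>#', $html);

// Bracket-style delimiters come in pairs: { opens and } closes.
$withBraces = preg_match('{<a href="/docs/">.+</a>}', $html);

// All three calls describe the same pattern, so each one reports a match.
var_dump($withSlashes, $withHashes, $withBraces);
```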

If you want to match strings with a specific set of characters, create a character class by putting the characters you want inside square brackets. The character class [aeiou] matches any one of the characters a, e, i, o, and u. You can also put ranges inside of square brackets to form a character class. The class [a-z] matches all lowercase English letters. The class [a-zA-Z0-9] matches digits and English letters. The class [a-zA-Z0-9_] matches digits, English letters, and the underscore.
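
A brief sketch of these character classes in use, with an invented sample string:

```php
<?php
// Hypothetical sample text, invented for this sketch.
$text = 'user_42 ate 3 apples';

// [aeiou] matches one lowercase vowel at a time.
preg_match_all('/[aeiou]/', $text, $vowels);
print implode('', $vowels[0]) . "\n";     // ueaeae

// [a-zA-Z0-9_]+ matches runs of letters, digits, and underscores.
preg_match_all('/[a-zA-Z0-9_]+/', $text, $words);
print implode('|', $words[0]) . "\n";     // user_42|ate|3|apples

// [0-9]+ matches runs of digits only.
preg_match_all('/[0-9]+/', $text, $numbers);
print implode(',', $numbers[0]) . "\n";   // 42,3
```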

So far, all the patterns we’ve seen match anything that contains text that corresponds to the pattern. That is, [a-z0-9]+ matches grapefruit and c3p0, but it also matches grr!!! and *******p. All four of those strings meet the condition that [a-z0-9]+ sets out: “one or more of a digit or lowercase English letter.”

Anchoring your pattern enables matching against strings that only contain characters that the pattern describes. The caret (^) and the dollar sign ($) anchor the pattern at the beginning and the end of the string, respectively. Without them, a match can occur anywhere in the string. So whereas [a-z0-9]+ means “one or more of a digit or lowercase English letter,” ^[a-z0-9]+ means “begins with one or more of a digit or lowercase English letter,” [a-z0-9]+$ means “ends with one or more of a digit or lowercase English letter,” and ^[a-z0-9]+$ means “contains only one or more of a digit or lowercase English letter.”
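
The four variants can be compared side by side. In this sketch the labels and the subject !!!grr are additions made for illustration.

```php
<?php
// The four anchoring variants discussed above, with descriptive labels.
$patterns = [
    'anywhere' => '/[a-z0-9]+/',
    'at start' => '/^[a-z0-9]+/',
    'at end'   => '/[a-z0-9]+$/',
    'whole'    => '/^[a-z0-9]+$/',
];
// Sample subjects; "!!!grr" is invented for this sketch.
$subjects = ['grapefruit', 'c3p0', 'grr!!!', '!!!grr'];

$results = [];
foreach ($subjects as $subject) {
    foreach ($patterns as $label => $pattern) {
        // Store true when the pattern matches the subject.
        $results[$subject][$label] = preg_match($pattern, $subject) === 1;
    }
    // List the labels of the patterns that matched this subject.
    $hits = array_keys(array_filter($results[$subject]));
    print $subject . ': ' . implode(', ', $hits) . "\n";
}
// grapefruit: anywhere, at start, at end, whole
// c3p0: anywhere, at start, at end, whole
// grr!!!: anywhere, at start
// !!!grr: anywhere, at end
```

One detail worth knowing when validating input: by default $ also matches just before a newline that ends the string. Using \z instead of $, or adding the D pattern modifier, insists on the true end of the string.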

Example Matching with character classes and anchors

       $thisFileContents = file_get_contents(__FILE__);
       // http://php.net/language.variables gives a regular expression for
       // valid variable names in php. Beginning the pattern with \$ matches
       // a literal $
       $matchCount = preg_match_all('/\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*/',
                                    $thisFileContents, $matches);
       print "Matches: $matchCount\n";
       foreach ($matches[0] as $variableName) {
              print "$variableName\n";
       }

This example prints each variable name it uses:

       Matches: 8
       $thisFileContents
       $matchCount
       $thisFileContents
       $matches
       $matchCount
       $matches
       $variableName
       $variableName

If it’s easier to define what you’re looking for by its complement, use that. To make a character class match the complement of what’s inside it, begin the class with a caret. A caret outside a character class anchors a pattern at the beginning of a string; a caret inside a character class means “match everything except what’s listed in the square brackets.” For example, the character class [^aeiou] matches everything but lowercase English vowels.

Note that the opposite of [aeiou] isn’t [bcdfghjklmnpqrstvwxyz]. The character class [^aeiou] also matches uppercase vowels such as A, E, I, O, and U, digits such as 1, 2, and 3, every character of a URL such as http://www.cnpq.br/, and even the characters in emoticons such as :).
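
A short sketch of the negated class in action. The first subject string is invented for the example.

```php
<?php
// Collect every character that is not a lowercase vowel.
preg_match_all('/[^aeiou]/', 'Audio 123 :)', $kept);
$joined = implode('', $kept[0]);
print $joined . "\n";   // Ad 123 :)

// The URL from the text contains no lowercase vowels at all, so the
// anchored pattern accepts the whole string.
$urlOk = preg_match('/^[^aeiou]+$/', 'http://www.cnpq.br/');
var_dump($urlOk);       // int(1)

// A class that lists the consonants is much narrower: it rejects digits.
$consonantOnly = preg_match('/^[bcdfghjklmnpqrstvwxyz]+$/', '123');
var_dump($consonantOnly);   // int(0)
```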

The vertical bar (|), also known as the pipe, specifies alternatives. The following example uses the pipe to find various possibilities for image filenames in a block of text.

Example Matching with |

       $text = "The files are cuddly.gif, report.pdf, and cute.jpg.";
       if (preg_match_all('/[a-zA-Z0-9]+\.(gif|jpe?g)/',$text,$matches)) {
              print "The image files are: " . implode(',',$matches[0]);
       }

This example prints:

       The image files are: cuddly.gif,cute.jpg
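
Because the alternatives sit inside parentheses, the chosen alternative is also captured. The sketch below is a variation on the example, not part of it: it adds a second group for the base name, an extra file (photo.jpeg) to exercise the optional e, and the PREG_SET_ORDER flag so that each match arrives as its own array.

```php
<?php
// Variation on the example text; photo.jpeg is added for this sketch.
$text = "The files are cuddly.gif, report.pdf, cute.jpg, and photo.jpeg.";

// Group 1 captures the base name, group 2 the alternative that matched.
preg_match_all('/([a-zA-Z0-9]+)\.(gif|jpe?g)/', $text, $matches, PREG_SET_ORDER);

$lines = [];
foreach ($matches as $m) {
    // $m[0] is the full filename, $m[1] the base name, $m[2] the extension.
    $lines[] = "$m[1] is a $m[2] file";
}
print implode("\n", $lines) . "\n";
// cuddly is a gif file
// cute is a jpg file
// photo is a jpeg file
```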

We’ve covered just a small subset of the world of regular expressions. We provide some additional details in later recipes, but the PHP website also has some very useful information on Perl-compatible regular expressions. The Pattern Modifiers and Pattern Syntax pages in the PCRE section of the online manual are especially detailed and informative.
