PHP Regular Expressions Matching Words - Web Development and Design | Tutorial for Java, PHP, HTML, Javascript

Monday, July 8, 2019

PHP Regular Expressions Matching Words

PHP Regular Expressions 


Matching Words

Problem

You want to pull out all words from a string.

Solution

The simplest way to do this is to match runs of the PCRE “word character” escape sequence, \w:

       $text = "Knock, knock. Who's there? r2d2!";
       // preg_match_all() returns the number of matches; the words land in $matches[0]
       $count = preg_match_all('/\w+/', $text, $matches);
       var_dump($matches[0]);

Discussion

The \w escape sequence matches letters, digits, and underscores. It does not include other punctuation. So the output from the preceding code is:

       array(6) {
            [0]=>
            string(5) "Knock"
            [1]=>
            string(5) "knock"
            [2]=>
            string(3) "Who"
            [3]=>
            string(1) "s"
            [4]=>
            string(5) "there"
            [5]=>
            string(4) "r2d2"
       }

This is mostly correct, except that Who's is split into Who and s. To extend this pattern to handle English contractions properly, we can match either a single word character or an apostrophe sandwiched between two word characters:

       $text = "Knock, knock. Who's there? r2d2!";
       $pattern = "/(?:\w'\w|\w)+/";
       // The return value is the match count; the words are in $matches[0]
       $count = preg_match_all($pattern, $text, $matches);
       var_dump($matches[0]);

(The ?: syntax in this pattern prevents the text that matches the parenthesized subpattern from being “captured.”)
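To make the effect of ?: concrete, here is a minimal sketch that runs the same pattern with and without it. The variable names ($capturing, $nonCapturing, $quoted) and the second sample string are illustrative assumptions, not part of the original recipe.

```php
<?php
$text = "Knock, knock. Who's there? r2d2!";

// Capturing group: the result gains a second array at index 1,
// holding whatever the group matched on its last repetition in each word.
preg_match_all("/(\w'\w|\w)+/", $text, $capturing);

// Non-capturing group: only the full matches at index 0 are collected.
preg_match_all("/(?:\w'\w|\w)+/", $text, $nonCapturing);

print count($capturing) . "\n";                // 2
print count($nonCapturing) . "\n";             // 1
print implode(', ', $nonCapturing[0]) . "\n";  // Knock, knock, Who's, there, r2d2

// An apostrophe is kept only when it sits between two word characters,
// so leading and trailing apostrophes are left out of the matches.
preg_match_all("/(?:\w'\w|\w)+/", "'Tis the dogs' bowl", $quoted);
print implode(', ', $quoted[0]) . "\n";        // Tis, the, dogs, bowl
```

Since only $matches[0] is used in this recipe, the non-capturing form avoids building an array that would be thrown away.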

With the addition of the u modifier, the pattern and the subject string are treated as UTF-8, so \w also matches letters and digits outside the ASCII range. For example:

       $fr = 'Toc, toc. Qui est là? R2D2!';
       $fr_count = preg_match_all('/\w+/u', $fr, $matches);
       print "The French words are:\n\t";
       print implode(', ', $matches[0]) . "\n";

       $kr = '노크, 노크. 거기 누구입니까? R2D2!';
       $kr_count = preg_match_all('/\w+/u', $kr, $matches);
       print "The Korean words are:\n\t";
       print implode(', ', $matches[0]) . "\n";

This prints:

       The French words are:
                       Toc, toc, Qui, est, là, R2D2
       The Korean words are:
                       노크, 노크, 거기, 누구입니까, R2D2

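Without that u at the end of each pattern, the string is processed byte by byte, and the bytes that encode non-ASCII characters generally do not match \w. Those characters are then dropped from the matches (là is cut down to l, for instance), producing incorrect results.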
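The difference is easy to check side by side. The following is a small sketch that assumes the script file is saved as UTF-8; the variable names $bytewise and $unicode are illustrative. The exact result without u can depend on the active locale, so only the Unicode-aware output is shown as expected output.

```php
<?php
$fr = 'Toc, toc. Qui est là? R2D2!';

// Without /u the subject is scanned one byte at a time, so the two
// bytes that encode "à" in UTF-8 are not seen as a single letter.
preg_match_all('/\w+/', $fr, $bytewise);

// With /u the pattern and subject are treated as UTF-8,
// and \w matches "à" as one word character.
preg_match_all('/\w+/u', $fr, $unicode);

print implode(', ', $unicode[0]) . "\n";   // Toc, toc, Qui, est, là, R2D2

// Show the byte-wise result for comparison (locale-dependent).
var_dump($bytewise[0]);
```

If the subject is not valid UTF-8, a pattern with the u modifier fails to match rather than falling back to bytes, so it is worth confirming the encoding of any input before relying on it.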
