PHP Files Processing Every Word in a File - Web Development and Design | Tutorial for Java, PHP, HTML, Javascript

Wednesday, July 10, 2019

PHP Files Processing Every Word in a File

PHP Files 


Processing Every Word in a File

Problem

You want to do something with every word in a file. For example, you want to build a concordance of how many times each word is used to compute similarities between documents.

Solution

Read in each line with fgets(), separate the line into words, and process each word:

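       // $php_errormsg was removed in PHP 8, so report errors with error_get_last()
       $fh = fopen('great-american-novel.txt','r')
              or die(error_get_last()['message'] ?? "can't open file");
       // fgets() returns false at end of file, so no separate feof() check is needed
       while (($s = fgets($fh)) !== false) {
              $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
              // process words
       }
       fclose($fh) or die(error_get_last()['message'] ?? "can't close file");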
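
As one way to fill in the "process words" step, here is a minimal sketch of the concordance mentioned in the Problem. The function name build_concordance() is hypothetical and not part of the original recipe, and lowercasing each word is an assumption about how words should be counted. A small temporary file stands in for the novel so the example runs on its own:

```php
<?php
// Hypothetical helper (not from the original recipe): map each word to its count.
function build_concordance(string $path): array
{
    $counts = [];
    $fh = fopen($path, 'r');
    if ($fh === false) {
        return $counts; // unreadable file: return an empty concordance
    }
    while (($s = fgets($fh)) !== false) {
        foreach (preg_split('/\s+/', $s, -1, PREG_SPLIT_NO_EMPTY) as $word) {
            $word = strtolower($word); // assumption: counting is case-insensitive
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    fclose($fh);
    return $counts;
}

// Demonstration on a small temporary file instead of a real novel.
$tmp = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($tmp, "the cat and the hat\nThe end\n");
$concordance = build_concordance($tmp);
unlink($tmp);

arsort($concordance); // most frequent words first
print_r($concordance);
```

Because the split is on whitespace only, punctuation stays attached to words, so "end" and "end." would be counted separately unless the words are cleaned up first.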

Discussion

This example calculates the average word length in a file:

       $word_count = $word_length = 0;

       if ($fh = fopen('great-american-novel.txt','r')) {
              // fgets() returns false at end of file
              while (($s = fgets($fh)) !== false) {
                     $words = preg_split('/\s+/',$s,-1,PREG_SPLIT_NO_EMPTY);
                     foreach ($words as $word) {
                            $word_count++;
                            $word_length += strlen($word);
                     }
              }
              fclose($fh);
       }

       // avoid dividing by zero when the file is empty or could not be opened
       if ($word_count > 0) {
              printf("The average word length over %d words is %.02f characters.",
                     $word_count,
                     $word_length/$word_count);
       }

Processing every word proceeds differently depending on how “word” is defined. The code in this recipe uses the Perl-compatible regular expression engine’s \s whitespace metacharacter, which includes space, tab, newline, carriage return, and formfeed.
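
To make the whitespace definition concrete, the following short sketch splits a made-up string whose words are separated by a space, a tab, a carriage return plus newline, a form feed, and a double space. The sample string is an illustration only and does not come from the recipe:

```php
<?php
// Words separated by a space, a tab, CR/LF, a form feed, and two spaces.
$s = "one two\tthree\r\nfour\ffive  six";

// \s+ treats any run of whitespace as a single delimiter;
// PREG_SPLIT_NO_EMPTY drops empty pieces at the start or end of the string.
$words = preg_split('/\s+/', $s, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Each kind of separator, including the doubled space, yields exactly one split, so the result holds the six words and nothing else.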

Another approach breaks a line into words by splitting on a single space, which is useful when the words have to be rejoined with spaces afterward. The Perl-compatible engine also has a word-boundary assertion (\b) that matches between a word character (a letter, digit, or underscore) and a nonword character (anything else). Using \b instead of \s to delimit words most noticeably treats words with embedded punctuation differently. The term 6 o’clock is two words when split by whitespace (6 and o’clock); it’s four words when split by word boundaries (6, o, ', and clock), not counting the space between 6 and o, which the split also returns as a piece of its own.
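
The difference between the two delimiters can be checked with a short sketch. It uses a straight apostrophe in the sample term, and filtering out whitespace-only pieces after the \b split is an extra step assumed here so that only the words remain:

```php
<?php
$term = "6 o'clock";

// Split on runs of whitespace.
$by_space = preg_split('/\s+/', $term, -1, PREG_SPLIT_NO_EMPTY);

// Split at word boundaries, then drop pieces that are only whitespace
// (the space between "6" and "o" comes back as a piece of its own).
$by_boundary = array_values(array_filter(
    preg_split('/\b/', $term, -1, PREG_SPLIT_NO_EMPTY),
    function ($piece) { return trim($piece) !== ''; }
));

print_r($by_space);    // 6, o'clock
print_r($by_boundary); // 6, o, ', clock
```

Which delimiter is appropriate depends on whether punctuation inside a token should count as part of the word.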
