PHP Regular Expressions: Capturing Text Inside HTML Tags

Monday, July 8, 2019

Problem

You want to capture text inside HTML tags. For example, you want to find all the heading tags in an HTML document.

Solution

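Use preg_match_all() with a nongreedy quantifier and a backreference, as in the following example.

Example: Capturing HTML headings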

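    $html = file_get_contents(__DIR__ . '/example.html');
    // ([1-6]) captures the heading level; \1 requires the same level in the closing tag.
    // (.+?) is nongreedy, so each match stops at the first matching closing tag.
    preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $matches);
    foreach ($matches[2] as $text) {
        print "Heading: $text\n";
    }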

Discussion

Robust parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it’s significantly easier to validate and parse.

For instance, the pattern in the headings example can’t deal with attributes inside the heading tags, and it only finds headings whose opening and closing tags match. So <h1>Dr. Strangelove</h1> is OK, because it’s wrapped inside <h1></h1> tags, but <h2>How I Learned to Stop Worrying and Love the Bomb</h3> is not, because the opening tag is <h2>, whereas the closing tag is not.
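To make those two limitations concrete, here is a minimal sketch that runs the heading pattern against a short inline string instead of example.html. The sample markup, the variable names, and the attribute-tolerant variant of the pattern are assumptions added for illustration. The variant only loosens the opening tag; it is not a general HTML parser.

```php
<?php
// Hypothetical sample input standing in for example.html.
$html = '<h1>Dr. Strangelove</h1>'
      . '<h3 class="sub">Credits</h3>'
      . '<h2>How I Learned to Stop Worrying and Love the Bomb</h3>';

// The pattern from the headings example: the opening tag may not carry attributes.
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $strict);

// Illustrative variant (an assumption, not from the original example):
// \b[^>]* lets the opening tag carry attributes before the closing ">".
preg_match_all('@<h([1-6])\b[^>]*>(.+?)</h\1>@is', $html, $loose);

print "Strict: " . implode(' | ', $strict[2]) . "\n";
print "Loose:  " . implode(' | ', $loose[2]) . "\n";
```

The strict pattern finds only Dr. Strangelove. The variant also finds Credits. Neither matches the mismatched <h2>...</h3> pair. Because the s flag lets the dot span lines and tags, a mismatched opening tag followed later by an unrelated matching closing tag would be captured together with everything in between, so the order of headings in the input matters.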

The same technique also works for finding the text inside <strong> and <em> tags.

Example: Extracting text from HTML tags

    $html = file_get_contents(__DIR__ . '/example.html');
    preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);
    foreach ($matches[2] as $text) {
        print "Text: $text\n";
    }

However, the second example breaks on nested elements. If example.html contains <strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>, the pattern returns a single match for the whole <strong> element and doesn’t capture the text inside the <em></em> tags as a separate item.

This isn’t a problem in the headings example: because headings are block-level elements, it’s illegal to nest them. However, as inline elements, <strong> and <em> tags can legally be nested.
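The nesting behavior is easy to confirm with a short sketch. The inline string below is the markup quoted above, used in place of example.html; the variable names are illustrative.

```php
<?php
// The nested markup from the discussion, inlined instead of read from a file.
$html = '<strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying '
      . 'and Love the Bomb</em></strong>';

preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);

// Matching resumes after </strong>, so the inner <em> element is never
// reported on its own; it only appears as raw markup inside the one match.
print count($matches[2]) . " match(es)\n";
foreach ($matches[2] as $text) {
    print "Text: $text\n";
}
```

This produces one match, and its captured text still contains the <em>...</em> markup rather than listing the emphasized text as a second item.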

Regular expressions can be moderately useful for parsing small amounts of HTML, especially if the structure of that HTML is reasonably constrained (or you’re generating it yourself). For more generalized and robust HTML parsing, use the Tidy extension.

It provides an interface to the popular libtidy HTML cleanup library. After Tidy has cleaned up your HTML, you can use its methods for getting at parts of the document. Or if you’ve told Tidy to convert your HTML to XHTML, you can use all of the XML manipulation power of SimpleXML or the DOM extension to slice and dice your HTML document.
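As a rough sketch of that approach, the function below cleans the markup with Tidy when the extension is loaded and then reads the headings with the DOM extension and XPath. The function name heading_texts(), the sample markup, and the fallback to the raw input when Tidy is missing are assumptions for illustration, not part of the original recipe.

```php
<?php
// Hypothetical helper: returns the text of every h1-h6 element in $html.
function heading_texts(string $html): array
{
    // Repair the markup with Tidy if the extension is available
    // (assumption: fall back to the raw input when it is not installed).
    if (function_exists('tidy_repair_string')) {
        $clean = tidy_repair_string($html, ['output-xhtml' => true], 'utf8');
        if (is_string($clean) && $clean !== '') {
            $html = $clean;
        }
    }

    // Parse with the DOM extension; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // XPath returns the headings in document order, attributes and all.
    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $node) {
        // textContent flattens nested inline tags such as <em>.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$sample = '<h1 class="title">Dr. Strangelove</h1>'
        . '<h2>or: <em>How I Learned</em></h2>';
print_r(heading_texts($sample));
```

Unlike the regular expression, this version is not thrown off by attributes on the heading tags or by inline elements nested inside them, because the parser builds a tree rather than scanning for matching strings.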
