Can You Read This Regex for Extracting HTML Tags?

The following is often cited as a regular expression for extracting HTML tags.
Now, can you read it? (It's okay if it takes some time.)

<("[^"]*"|\'[^\']*\'|[^\'">])*>

Explanation

The overall pattern is zero or more repetitions of the following group, enclosed between < and >.

("[^"]*"|\'[^\']*\'|[^\'">])
  • Because the group may repeat zero times, the empty tag <> also matches.

image 01 match <>
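As a quick check of the "zero or more" behaviour, here is a minimal Python sketch. The names `TAG` and `is_tag` are made up for this illustration, and the pattern is simply the one from the article written as a raw string.

```python
import re

# The pattern from the article. A raw triple-quoted string lets both quote
# characters appear without extra escaping at the Python level.
TAG = re.compile(r"""<("[^"]*"|\'[^\']*\'|[^\'">])*>""")

def is_tag(text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return TAG.fullmatch(text) is not None

print(is_tag("<>"))   # the group repeats zero times
print(is_tag("<p>"))  # the group repeats once
print(is_tag("p"))    # no angle brackets at all
```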

This part uses the following three patterns:

  1. Zero or more repetitions of any character other than a double quote ("), enclosed in double quotes.
"[^"]*"
  • Conversely, a string such as <"""> does not match.
    • (The first two quotation marks form a pair, but the third one is never closed.)

image 02 unmatch

image 03 unmatch
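The closed and unclosed cases can be compared side by side. This is a small sketch with made-up sample strings; `TAG` is a hypothetical name for the article's pattern.

```python
import re

TAG = re.compile(r"""<("[^"]*"|\'[^\']*\'|[^\'">])*>""")

# Two closed cases and two unclosed ones.
samples = ['<"">', '<""">', '<"a>', '<a title=">">']

# fullmatch() succeeds only if the pattern covers the entire string.
results = {s: TAG.fullmatch(s) is not None for s in samples}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```

Note the last sample: a > inside a closed pair of double quotes is consumed by the quoted alternative, so it does not end the tag early.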

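  2. Zero or more repetitions of any character other than a single quote ('), enclosed in single quotes.

The backslash (\) escapes the character immediately following it. A single quote has no special meaning in a regular expression, so \' simply matches a literal ' here; the escape is harmless and is presumably there so that the pattern can be embedded in a single-quoted string literal.

\'[^\']*\'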

image 04 unmatch

  3. Any single character except ', ", or >.
[^\'">]
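One way to keep the three alternatives readable is to write the same pattern in Python's verbose mode, where whitespace and # comments outside character classes are ignored. The sketch below is only an illustration; `TAG_VERBOSE` is a made-up name, and the unnecessary backslashes before the single quotes are dropped.

```python
import re

TAG_VERBOSE = re.compile(r"""
    <                 # opening angle bracket
    (
        "[^"]*"       # 1. a double-quoted string
      | '[^']*'       # 2. a single-quoted string
      | [^'">]        # 3. any one character except ', ", or >
    )*                # zero or more of the alternatives above
    >                 # closing angle bracket
""", re.VERBOSE)

for s in ['<>', '<a href="x">', "<a href='x'>", '<"a>', "<'a>"]:
    print(repr(s), TAG_VERBOSE.fullmatch(s) is not None)
```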

Summary

From the above, we can see that this regular expression matches strings enclosed between < and >, but does not match when a quotation mark inside the tag is left unclosed (that is, when a " or ' has no partner). A quoted section may itself contain >, so an attribute value such as ">" does not end the tag early. This is what makes the pattern effective for extracting HTML tags.

<("[^"]*"|\'[^\']*\'|[^\'">])*>
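To see the extraction in action, the following sketch pulls tags out of a made-up HTML fragment. Because the pattern contains a capturing group, `re.findall` would return the group contents rather than whole tags, so the sketch uses `finditer` and `group(0)` instead. `TAG`, `html`, and `tags` are names chosen for this example.

```python
import re

TAG = re.compile(r"""<("[^"]*"|\'[^\']*\'|[^\'">])*>""")

# A made-up fragment; note the > inside the quoted attribute value.
html = '<p class="a>b">Hello <b>world</b></p>'

# group(0) is the whole match, i.e. the complete tag.
tags = [m.group(0) for m in TAG.finditer(html)]
print(tags)  # ['<p class="a>b">', '<b>', '</b>', '</p>']
```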

A Bit of Practical Talk

Having read this far, some of you might have noticed that the following also matches.

<あいうえお>

image 05 match

The reason is this part.

[^\'">]

This can cause problems in real work.

For example, suppose your task is to remove tags from the HTML of an online supermarket's advertisement page. If the page contains a line such as <Recommended by the Store Manager> Eggs from XX Prefecture (8 pcs) 299 yen [Limit one per person], then <Recommended by the Store Manager> is treated as an HTML tag.

image 06 match
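The effect on a tag-stripping task can be sketched as follows. The sample line is invented for illustration, and `TAG` is again a hypothetical name for the article's pattern.

```python
import re

TAG = re.compile(r"""<("[^"]*"|\'[^\']*\'|[^\'">])*>""")

# A made-up advertisement line: the first <...> is a label, not an HTML tag.
line = "<あいうえお> Eggs (8 pcs) <b>299 yen</b>"

# Removing every match also removes the label.
stripped = TAG.sub("", line)
print(repr(stripped))  # ' Eggs (8 pcs) 299 yen'
```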

One possible fix is to restrict the characters allowed inside the tag, for example by specifying a Unicode range in the character class. Either way, it is important to read a regular expression carefully and verify it against real data before relying on it.
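One possible reading of the Unicode-range idea is to exclude non-ASCII characters from the third alternative. The sketch below is an assumption about how this might be done, not the author's implementation; `ASCII_TAG` is a made-up name. It would not help with labels written entirely in ASCII, and it also rejects unquoted non-ASCII text inside real tags.

```python
import re

# Same structure as the original pattern, but the third alternative also
# excludes every code point from U+0080 upward (assumed interpretation).
ASCII_TAG = re.compile(r"""<("[^"]*"|'[^']*'|[^'">\u0080-\U0010FFFF])*>""")

print(ASCII_TAG.search("<あいうえお>"))                        # None
print(ASCII_TAG.fullmatch("<p>") is not None)                # ordinary tag
print(ASCII_TAG.fullmatch('<a title="あ">') is not None)      # quoted value
```

Non-ASCII text is still accepted inside quoted attribute values, because only the unquoted alternative was narrowed.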
