Translated by AI
Can You Read This Regex for Extracting HTML Tags?
The following is often cited as a regular expression for extracting HTML tags.
Now, can you read it? (It's okay if it takes some time.)
<("[^"]*"|\'[^\']*\'|[^\'">])*>
Explanation
This pattern consists of zero or more repetitions of the following pattern enclosed in <>.
("[^"]*"|\'[^\']*\'|[^\'">])
- Since it is "zero or more times," <> also matches.
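
As a quick check of the "zero or more repetitions" reading, the following minimal Python sketch compiles the pattern and tests a few strings. The helper name `is_tag` and the sample strings are assumptions made for illustration; they do not come from the article.

```python
import re

# The pattern from the article, written as a raw string so the
# backslashes reach the regex engine unchanged.
TAG = re.compile(r'''<("[^"]*"|\'[^\']*\'|[^\'">])*>''')

def is_tag(text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return TAG.fullmatch(text) is not None

print(is_tag("<>"))               # True: the group repeats zero times
print(is_tag("<p>"))              # True: the group repeats once
print(is_tag("<div class='x'>"))  # True: plain characters plus one quoted string
print(is_tag("<p"))               # False: the closing > is missing
```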

The group is an alternation of the following three patterns:
- Zero or more repetitions of any character other than a double quote ("), enclosed in double quotes.
"[^"]*"
- Conversely, patterns like <"""> do not match. (The third quotation mark is never closed.)


- Zero or more repetitions of any character other than a single quote ('), enclosed in single quotes.
The backslash (\) escapes the character immediately following it. A single quote is not a regex metacharacter, so \' simply matches a literal '.
\'[^\']*\'

- Any single character except ', ", or >.
[^\'">]
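
To make the three alternatives concrete, the sketch below compiles each one separately and checks it against a small sample. The names `DOUBLE_QUOTED`, `SINGLE_QUOTED`, `OTHER_CHAR`, `FULL`, and `matches` are hypothetical, chosen only for this illustration.

```python
import re

# The three alternatives, compiled separately for illustration.
DOUBLE_QUOTED = re.compile(r'"[^"]*"')
SINGLE_QUOTED = re.compile(r"\'[^\']*\'")
OTHER_CHAR = re.compile(r"""[^\'">]""")

# The full pattern from the article.
FULL = re.compile(r'''<("[^"]*"|\'[^\']*\'|[^\'">])*>''')

def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the given pattern."""
    return pattern.fullmatch(text) is not None

print(matches(DOUBLE_QUOTED, '"a > b"'))   # True: > is allowed inside double quotes
print(matches(SINGLE_QUOTED, "'a \" b'"))  # True: " is allowed inside single quotes
print(matches(OTHER_CHAR, "a"))            # True
print(matches(OTHER_CHAR, ">"))            # False: > is excluded
print(matches(FULL, '<"">'))               # True: one closed pair of quotes
print(matches(FULL, '<""">'))              # False: the third " is never closed
```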
Summary
From the above, we can see that this regular expression matches strings enclosed in <>. It is also structured so that it does not match when a quotation mark inside the tag is left unclosed (that is, when a " or ' has no matching partner), and a > that appears inside a quoted string does not end the tag. This is why it is considered effective for extracting HTML tags.
<("[^"]*"|\'[^\']*\'|[^\'">])*>
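
The behavior summarized above can be sketched in Python as follows. The sample HTML string is an assumption made up for this illustration. `re.finditer` is used with `group(0)` because the pattern contains a capturing group, and `re.findall` would return the contents of that group rather than the whole tag.

```python
import re

TAG = re.compile(r'''<("[^"]*"|\'[^\']*\'|[^\'">])*>''')

# Hypothetical input: the title attribute contains a > inside quotes.
html = '<a href="next.html" title="a > b">Next</a><br>'

# group(0) is the entire match, i.e. the whole tag.
tags = [m.group(0) for m in TAG.finditer(html)]
print(tags)
# ['<a href="next.html" title="a > b">', '</a>', '<br>']

# An unclosed quotation mark prevents any match.
print(TAG.search('<a title="oops>'))  # None
```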
A Bit of Practical Talk
Having read this far, some of you might have noticed that the following also matches.
<あいうえお>

The reason is the third alternative, which accepts any character other than ', ", or >, including Japanese text.
[^\'">]
This can cause problems in real work.
For example, suppose your task is to remove tags from the HTML of an online supermarket's advertisement page. If the page contains text such as <Store Manager Recommended Item> Eggs from XX Prefecture (8 pcs) 299 yen [Limit one per person], then <Store Manager Recommended Item> will be treated as an HTML tag.
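
The following sketch reproduces this kind of false positive. The function name `strip_tags` and the sample advertisement text are assumptions for illustration, and the label is shortened to one without quotation marks.

```python
import re

TAG = re.compile(r'''<("[^"]*"|\'[^\']*\'|[^\'">])*>''')

def strip_tags(html: str) -> str:
    """Remove everything the pattern considers a tag."""
    return TAG.sub("", html)

# Hypothetical advertisement text: the angle-bracket label is content, not markup.
ad = "<p><Recommended Item> Eggs (8 pcs) 299 yen</p>"

print(repr(strip_tags(ad)))            # ' Eggs (8 pcs) 299 yen' (the label is lost)
print(repr(strip_tags("<あいうえお>")))  # '' (also removed as if it were a tag)
```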

One possible fix is to restrict the permitted characters by specifying a Unicode range, so that text such as <あいうえお> is no longer matched. In practice, it is important to read regular expressions carefully and verify them against actual data.
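
One way to read the "Unicode range" idea is to require the tag name to consist of ASCII characters. The sketch below works under that assumption and is not the article's own solution. `STRICT_TAG`, `KNOWN_TAGS`, and `strip_known_tags` are hypothetical names, and the allow-list is an extra step added here because an ASCII-only label would still pass the character restriction.

```python
import re

# Assumption: a tag starts with an optional / and an ASCII letter,
# followed by ASCII letters, digits, or hyphens. The rest of the tag
# uses the same three alternatives as the original pattern.
STRICT_TAG = re.compile(
    r'''</?([A-Za-z][A-Za-z0-9-]*)(?:"[^"]*"|'[^']*'|[^'">])*>'''
)

# Hypothetical allow-list; a real one would depend on the data.
KNOWN_TAGS = {"p", "a", "br", "div", "span"}

def strip_known_tags(html: str) -> str:
    """Remove a match only when its tag name is in KNOWN_TAGS."""
    def repl(m: re.Match) -> str:
        return "" if m.group(1).lower() in KNOWN_TAGS else m.group(0)
    return STRICT_TAG.sub(repl, html)

print(STRICT_TAG.search("<あいうえお>"))       # None: the name is not ASCII
print(strip_known_tags("<p><Recommended Item> Eggs</p>"))
# <Recommended Item> Eggs
```

This sketch does not handle comments or doctype declarations, so it should still be checked against the actual data before use.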