Translated by AI
Handling Exceptions in NG List-Based Keyword Filtering
In this article, I will explore how to handle "keywords" that require some form of processing, specifically focusing on how to prevent false positives (accidental replacements).
As examples, let's consider the words "Baku," "Moe," and "Mani."
Suppose we simply want to replace the word "Baku" with "##" as an NG (No Good) word.
"The luck is Baku-agari. Dreaming is Baku."
→
"The luck is ##-agari. Dreaming is ##."
As you can see, a naive replacement also hits the "Baku" inside "Baku-agari" (which means "sky-high").
I only wanted to replace the standalone word "Baku," but...
Definition of Terms
Exception List (pre_list): A list used for initial replacement to protect specific terms.
NG List (ng_list): A list of words to be marked and displayed as "##."
Aside
By the way, I'm using this setup to simplify the example.
I personally question the method of restricting display via an NG list altogether.
This is because false positives are common, and it is easy to bypass these filters by adjusting how you write, making it impossible to protect general users from malicious ones.
However, I do think it has some effect in suppressing casual violations or deterring careless behavior.
Note that I only use this technique as part of a tool that displays text corrections; content regulation is not my goal.
Please bear with me on this point.
Also, if you extend this tool, you can build your own ruby (furigana annotation) display tool.
Of course, building the dictionary is a lot of work...
Exception Handling Part 1
This involves regular expressions. If there are very few exceptions to the NG words, or if you want to do it in a single line, you can write exceptions using regular expressions.
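Baku(?!-?agari)
Like this. If you have up to about 10 exceptions, this method works fine.
Baku(?!-?(?:agari|sagari|birth|turn|space|sale|buy|eat))
Since it's a regular expression, a [character class] is convenient when an exception suffix is a single character. The glosses "birth," "turn," "space," "sale," "buy," and "eat" are longer than one character in this romanized rendering, so they are written here as alternatives instead.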
"The luck is Baku-agari. Dreaming is ##."
As you can see, they can coexist.
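The one-line approach can be sketched as follows. This is a minimal illustration, assuming the hyphenated romanization used in the sample sentence; the names `exceptionSuffixes` and `maskBaku` are hypothetical and not part of the original tool.

```javascript
// Assumed exception suffixes; extend this list as needed.
const exceptionSuffixes = ['agari', 'sagari'];

// Negative lookahead: match "Baku" only when it is NOT followed by
// an optional hyphen and one of the exception suffixes.
const ngPattern = new RegExp(
  `Baku(?!-?(?:${exceptionSuffixes.join('|')}))`,
  'g'
);

function maskBaku(text) {
  return text.replace(ngPattern, '##');
}

console.log(maskBaku('The luck is Baku-agari. Dreaming is Baku.'));
// "The luck is Baku-agari. Dreaming is ##."
```

Building the pattern from an array keeps the exceptions readable, but the result is still a single expression, so it shares the scaling limits described next.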
Exception Handling Part 2
However, as the scale grows or the processing for "Baku" branches into two or more logic paths, this one-liner reaches its limits.
Therefore, I adopt the following method: match the exception list first and swap those matches for placeholders, apply the NG list, and then restore the placeholders afterward.
pre_list.txt
Baku(?=-?(?:agari|sagari|birth|turn|space|sale|buy|eat))
Mani(?=-?(?:car|relic|jewel|sutra))
By doing this, I replace the matches in the text during pre-processing.
"The luck is %ID0%-agari. Dreaming is Baku."
The exceptions are replaced first like this, and the original matched text is stored in an Array or similar structure.
Next, I replace "Baku" using the NG list.
"The luck is %ID0%-agari. Dreaming is ##."
Then, I restore the placeholders from the stored exceptions.
"The luck is Baku-agari. Dreaming is ##."
First, let's assume we use % as a special escape character.
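let text = `The luck is Baku-agari. Dreaming is Baku`;
const array = []; // protected matches, indexed by placeholder ID

// Pre-escape since % is used as a special character
text = text.replaceAll('%', '%A');

// 1. Swap exception matches for placeholders (%ID0%, %ID1%, ...)
const pre_list = [
  /Baku(?=-?(?:agari|sagari|birth|turn|space|sale|buy|eat))/g,
  /Mani(?=-?(?:car|relic|jewel|sutra))/g,
];
for (const re of pre_list) {
  text = text.replaceAll(re, (match) => {
    array.push(match);
    return `%ID${array.length - 1}%`;
  });
}

// 2. Mask the NG words
const ng_list = ['Baku', 'Moe', 'Mani'];
for (const item of ng_list) {
  text = text.replaceAll(item, '##');
}

// 3. Restore the protected matches. Going in reverse order also restores
//    a placeholder that ended up inside a later match. The function form
//    keeps "$" in a stored match from being read as a replacement pattern.
for (let i = array.length - 1; i >= 0; i--) {
  text = text.replaceAll(`%ID${i}%`, () => array[i]);
}

// Undo the % escape
text = text.replaceAll('%A', '%');
console.log(text); // The luck is Baku-agari. Dreaming is ##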
What do you think?
It was easier than expected, right?
Well, in this age of AI, perhaps we no longer need to maintain these things by hand.
Still, since AI can have usage limits, it's good to keep this method in mind for times when you need a casual implementation.
The rest depends on the management cost of the list.
Word Boundary Problem
(?<![ァ-ヴ])Baku(?![ァ-ヴ])
or
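(?<=^|[^ァ-ヴ])Baku(?=[^ァ-ヴ]|$)
Since \b does not work for Japanese in many environments, you can avoid false matches by spelling out the "word boundaries" yourself in this way. Here [ァ-ヴ] is a katakana range, so both forms match "Baku" only when the characters on either side are not katakana. These patterns assume that the keyword and its neighbors are written in katakana.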
However, if you want to replace "American Tapir" (America-Baku) and "Malayan Tapir" (Malay-Baku) with "America-##" and "Malay-##," this doesn't fit your needs.
On the other hand, you want to leave something like "Sabaku-neko" (Desert cat) as it is.
If you know that this "Sabaku-neko" is an exception, it's easy.
Plan A
- Add "Sabaku-neko" to the pre-match list and handle as an exception.
- Keep plain "Baku" in the NG list without boundary conditions, so that the regular replacement turns "America-Baku" into "America-##."
Plan B
The reverse is also possible.
- Split the NG list into two as follows:
(?<=^|[^ァ-ヴ])Baku(?=[^ァ-ヴ]|$)
(?<=America|Malay)Baku
AB Evaluation
If you want to support unknown terms like "White Tapir," Plan A is superior.
However, because terms like "Kyabba-club" or "Sabaku-ran" (Server Clan) might get caught, you need to tune it according to your needs.
If you also need to consider "Kyabakura," you must pick up the entire list of existing, known terms containing "Baku."
Furthermore, words such as "Bakuhatsu" (Explosion), "Bakusui" (Deep sleep), "Bakuen" (Explosion flames), and "Enbaku" (Oats) are sometimes written in Katakana.
If they don't appear frequently, this won't be a problem, so it depends on the case. If you want to support them as thoroughly as possible, though, you'll have many more cases to consider.
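The difference between the two plans can be sketched as below. The katakana strings are my assumed spellings of the romanized examples (Baku, America-Baku, Sabaku-neko, and a hypothetical unknown term for "White Tapir"), and `planA` and `planB` are hypothetical names. To stay short, the sketch skips the % pre-escaping step.

```javascript
// Plan A: protect known exceptions first, then mask every remaining "バク".
const planAExceptions = [/サバクネコ/g]; // pre-match list (assumed entry)

function planA(text) {
  const saved = [];
  let out = text;
  for (const re of planAExceptions) {
    // push() returns the new length, so length - 1 is the stored index.
    out = out.replace(re, (m) => `%ID${saved.push(m) - 1}%`);
  }
  out = out.replace(/バク/g, '##');
  return out.replace(/%ID(\d+)%/g, (_, n) => saved[Number(n)]);
}

// Plan B: mask only a standalone "バク", or "バク" after a known prefix.
const planBPatterns = [
  /(?<=^|[^ァ-ヴ])バク(?=[^ァ-ヴ]|$)/g,
  /(?<=アメリカ|マレー)バク/g,
];

function planB(text) {
  return planBPatterns.reduce((t, re) => t.replace(re, '##'), text);
}

console.log(planA('アメリカバク')); // "アメリカ##"
console.log(planA('サバクネコ'));   // "サバクネコ"
console.log(planA('シロバク'));     // "シロ##"  (unknown term is still masked)

console.log(planB('アメリカバク')); // "アメリカ##"
console.log(planB('サバクネコ'));   // "サバクネコ"
console.log(planB('シロバク'));     // "シロバク" (unknown term slips through)
```

Plan A fails open toward masking and needs an exception list; Plan B fails open toward leaving text alone and needs a list of known prefixes.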
Performance Improvements and More
Because the pre_list and ng_list processing applies its replacements one pattern at a time, it is also possible to package all the patterns into a single regular expression and do everything in one big replaceAll call.
Each call to replace or replaceAll scans the whole text and builds a new string, so with a long list the repeated calls add up, and the single-pass method is likely to run faster.
However, whether you actually need that much performance is something you should consider as well.
Also, once everything is merged, it becomes difficult to identify which entry is causing an error unless you test the entries one by one.
During list registration and testing, it might be a good idea to run an error check for each item.
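One possible shape for the single-pass version is sketched below. It is an assumption-laden illustration rather than the original implementation: the names `preSources`, `ngWords`, `escapeRegExp`, `validate`, and `filterOnce` are hypothetical. The exceptions go into one capture group; when that group matched, the text is returned unchanged, and otherwise it is masked.

```javascript
// Exception patterns (regex source strings) and plain NG words.
const preSources = [
  'Baku(?=-?(?:agari|sagari))',
  'Mani(?=-?(?:car|relic|jewel|sutra))',
];
const ngWords = ['Baku', 'Moe', 'Mani'];

// Escape regex metacharacters so plain words are matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Compile each entry on its own so a broken one is reported by index.
function validate(sources) {
  sources.forEach((src, i) => {
    try {
      new RegExp(src);
    } catch (e) {
      throw new Error(`entry ${i} is invalid: ${e.message}`);
    }
  });
}

validate(preSources);

// Group 1 holds the exceptions; the second branch holds the NG words.
const combined = new RegExp(
  `(${preSources.join('|')})|(?:${ngWords.map(escapeRegExp).join('|')})`,
  'g'
);

function filterOnce(text) {
  // keep is defined only when the exception branch matched.
  return text.replace(combined, (match, keep) =>
    keep !== undefined ? match : '##'
  );
}

console.log(filterOnce('The luck is Baku-agari. Dreaming is Baku.'));
// "The luck is Baku-agari. Dreaming is ##."
```

Because the exception branch is tried first at each position, no placeholders or restore step are needed in this variant. The per-entry `validate` pass is what keeps errors traceable after the merge.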
List Maintenance
Regarding the order of the list:
- Keep them sorted by Unicode or similar code order.
- Manage them in a DB, and use a sub-program to output them to a file whenever you make changes.
- Pre-sort them by length (longest first).
- Perform duplicate checks.
These kinds of things become necessary.
For example, if you want to replace both "Al" and "Alcohol," you must apply "Alcohol" first; otherwise, "Alcohol" becomes "##cohol" and the replacement does not work as intended.
You need to either perform the sorting in the program or maintain the dictionary in advance.
function sort_length(array) {
  array.sort((a, b) => b.length - a.length);
}
If you just need simple sorting, this is sufficient. However, if you want to sort by Unicode order in addition to that, you need further code.
Note that the JavaScript Array.sort method has been a "stable sort" since ECMAScript 2019, so if you perform a regular sort first and then call sort_length above, you can sort by length and then by Unicode (UTF-16) order for each length.
As an aside, Array.sort defaults to UTF-16 code unit order, so characters in the upper part of the BMP, such as CJK compatibility ideographs, certain full-width symbols, and half-width Katakana, sort after characters that are encoded as surrogate pairs.
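The two-pass sort described above, combined with the duplicate check, might look like this. `sortForMatching` is a hypothetical helper name; unlike sort_length, it returns a new array instead of sorting in place.

```javascript
// Returns a de-duplicated copy sorted longest-first, with UTF-16 code unit
// order as the tie-breaker inside each length (relies on stable sort).
function sortForMatching(words) {
  return [...new Set(words)]
    .sort()                               // pass 1: default code unit order
    .sort((a, b) => b.length - a.length); // pass 2: stable sort by length
}

const ngList = sortForMatching(['Al', 'Moe', 'Alcohol', 'Baku', 'Al']);
console.log(ngList); // [ 'Alcohol', 'Baku', 'Moe', 'Al' ]

let text = 'Alcohol and Al';
for (const word of ngList) {
  text = text.replaceAll(word, '##');
}
console.log(text); // "## and ##"
```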
Pre-caching
In some cases, the filter is not called just once; it might be called tens of thousands of times. In such situations, building the list once in an external cache (such as a global variable), performing the checks and sorting there, and then reusing it can improve runtime performance.
Text data is roughly 1,000 times smaller than video or image data, but when loops like these are nested, the amount of work can multiply quickly. Therefore, it is practical to do the heavy work beforehand and skip it at call time. Since the cached data is small compared to video, simply creating it is generally fine.
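A minimal sketch of such a cache, assuming a module-level variable and a hypothetical `loadWords` callback that supplies the list. The `buildCount` counter exists only to show how often the heavy work runs.

```javascript
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

let cachedPattern = null; // module-level cache, built on first use
let buildCount = 0;       // demonstration only

function getNgPattern(loadWords) {
  if (cachedPattern === null) {
    buildCount += 1;
    // Heavy work done once: duplicate check, length sort, compilation.
    const words = [...new Set(loadWords())].sort(
      (a, b) => b.length - a.length
    );
    cachedPattern = new RegExp(words.map(escapeRegExp).join('|'), 'g');
  }
  return cachedPattern;
}

function mask(text, loadWords) {
  // replace() starts from the beginning for a global RegExp,
  // so sharing the cached object between calls is safe here.
  return text.replace(getNgPattern(loadWords), '##');
}

const loadWords = () => ['Al', 'Alcohol', 'Moe', 'Al'];
let last = '';
for (let i = 0; i < 10000; i++) {
  last = mask('Alcohol and Moe', loadWords);
}
console.log(last);       // "## and ##"
console.log(buildCount); // 1
```

If the list can change at runtime, the cache also needs an invalidation step (for example, resetting `cachedPattern` to null after an edit), which this sketch leaves out.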
Aside: Regular Expressions
Even though regular expressions are easy to use, nesting one quantifier inside another, such as a + inside a +, can make them slow in no time.
You should have some knowledge of regular expression backtracking, processing time, and processing volume.
It is easy to write + or *, but specifying a limit like {1,20} lets you define a constraint such as "between 1 and 20 characters."
Because the engine then gives up after 20 characters, a failed match attempt can finish much faster than one that scans the rest of the text.
Furthermore, since a shortest match (more accurately, a non-greedy match) is often sufficient for * or +, using .*?, .+?, or {1,20}? is also recommended.
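The effect of greedy, non-greedy, and bounded quantifiers can be compared on a small sample. The bracketed sample text is made up for illustration.

```javascript
const sample =
  'see [Baku] and [a very long bracketed note that exceeds twenty chars] and [Moe]';

// Greedy: .+ runs to the end of the line, then backtracks to the LAST "]".
const greedy = /\[(.+)\]/.exec(sample)[1];

// Non-greedy: stops at the first "]" after each "[".
const lazy = [...sample.matchAll(/\[(.+?)\]/g)].map((m) => m[1]);

// Bounded and non-greedy: each attempt gives up after 20 characters,
// so the long bracketed note is skipped entirely.
const bounded = [...sample.matchAll(/\[(.{1,20}?)\]/g)].map((m) => m[1]);

console.log(greedy.startsWith('Baku] and [a very')); // true
console.log(lazy.length);                            // 3
console.log(bounded);                                // [ 'Baku', 'Moe' ]
```

A nested pattern such as (a+)+ is deliberately not run here, since on a long non-matching input it can take a very long time to fail.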