Common mistakes when searching with grep -f and how to fix them
Dear grep. Thank you for everything as always.
We love grep, don't we? grep.
I rely on it heavily in my daily operations.
Client: "I've placed a file with 3,000 records, including these 5 IDs, into S3. Have they been imported into the DB?"
Me: "They haven't been imported into the DB... Since there are no errors, let me check the original file."
Suppose, as in this exchange, you want to check whether the IDs saved in a text file (compare.txt) appear in the original file (org.csv).
In such cases, running grep -f compare.txt org.csv prints every line of org.csv that matches any of the patterns listed in compare.txt, one pattern per line.
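As a minimal sketch of what -f does, the following compares it with passing the same patterns through -e. The file names data_small.csv and ids_small.txt are hypothetical stand-ins, not the files used in the rest of this article.

```shell
# Hypothetical miniature versions of the data file and the pattern list.
printf '1,test1\n2,test2\n3,test3\n' > data_small.csv
printf 'test1\ntest3\n' > ids_small.txt

# Each line of ids_small.txt is read as one pattern.
from_file=$(grep -f ids_small.txt data_small.csv)

# The same search with the patterns given directly on the command line.
from_args=$(grep -e test1 -e test3 data_small.csv)

# Both searches select the same lines.
echo "$from_file"
```

A line is printed if it matches at least one of the patterns, so the order of the patterns in the list does not matter.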
However, sometimes it just doesn't work as expected...
I'll record the common mistakes I make and their solutions here.
Please assume we are using the following text files.
Original file (org.csv)
# org.csv
id,name
1,test1
2,test2
3,test3
4,test4
5,test5
6,test6
7,test7
8,test8
9,test9
10,test10
11,test11
12,test12
13,test13
14,test14
15,test15
16,test16
17,test17
18,test18
19,test19
20,test20
21,test21
22,test22
23,test23
24,test24
25,test25
26,test26
27,test27
28,test28
29,test29
30,test30
File to compare (compare.txt)
# compare.txt
test10
test21
test25
test28
# Note: the file ends with an empty line (a line containing only a newline)
Common Mistakes
Character Encoding or Line Endings are Different
This is the first one.
In particular, non-engineer clients often provide lists in Excel format.
(Like org.csv.xlsm...)
In such cases, don't forget to check the character encoding and line endings of both files and make them consistent.
(Forgetting to convert the line endings is an especially common slip...)
Since I use a Mac, I have to be especially careful with data received from Windows machines.
(The same applies on a Mac if the data is still an Excel file.)
# Check character encoding
nkf -g org.csv
# Convert to UTF-8 and LF line endings
cp org.csv bk_org.csv && nkf -Luw --overwrite org.csv
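If nkf is not installed, a rough check for Windows line endings is possible with standard tools. This is only a sketch under the assumption that the sole problem is CRLF line endings; it does not convert the character encoding. The file names crlf.txt and lf.txt are hypothetical.

```shell
# Build a small sample with Windows (CRLF) line endings.
printf 'test10\r\ntest21\r\n' > crlf.txt

# Count the lines that contain a carriage return; non-zero means CRLF is present.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" crlf.txt)
echo "lines with CR before: $cr_lines"

# Strip the carriage returns into a new file, leaving plain LF endings.
tr -d '\r' < crlf.txt > lf.txt

# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
lf_cr_lines=$(grep -c "$cr" lf.txt || true)
echo "lines with CR after: $lf_cr_lines"
```

A pattern such as test10 followed by a hidden carriage return will not match a line that ends in plain test10, which is why the check is worth running on both files.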
The Comparison File Contains Empty Lines
With the character encoding fixed, I quickly check how many lines of org.csv match the patterns in compare.txt.
grep -f compare.txt org.csv | wc -l
> 32
Huh...
Why is it matching every single line?!
That was my reaction, but the grep manual (man grep) explains it:
-f file, --file=file
Read one or more newline separated patterns from file. Empty pattern lines match every input line. Newlines are not considered part of a pattern. If file is empty, nothing is matched.
Empty pattern lines match every input line.
It clearly states that an empty pattern line matches every input line.
This happened because I had been told so many times to "end a text file with a newline to satisfy POSIX" that I added an extra blank line. The single newline that terminates the last pattern is harmless; it is a line containing nothing but a newline that becomes an empty pattern.
So I delete the lines that contain only a newline and then run the comparison again.
cp compare.txt bk_compare.txt && sed '/^$/d' bk_compare.txt > compare.txt && grep -f compare.txt org.csv | wc -l
> 4
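The effect can be reproduced in a scratch directory. The sketch below is an illustration under assumptions, not the exact files above: it builds a stand-in org.csv with only the header and 30 rows (so the first count is 31 rather than 32), and it writes the filtered patterns to a new file, compare_clean.txt (a hypothetical name), instead of overwriting compare.txt.

```shell
# Work in a scratch directory so real files are untouched.
workdir=$(mktemp -d)
cd "$workdir"

# A stand-in for org.csv: the header plus 30 rows.
{
  echo 'id,name'
  i=1
  while [ "$i" -le 30 ]; do
    echo "$i,test$i"
    i=$((i + 1))
  done
} > org.csv

# A pattern file that ends with an extra blank line.
printf 'test10\ntest21\ntest25\ntest28\n\n' > compare.txt

# The blank line acts as an empty pattern, so every line matches.
with_blank=$(grep -c -f compare.txt org.csv)
echo "with blank line: $with_blank"

# Drop the blank lines into a new file and search again.
grep -v '^$' compare.txt > compare_clean.txt
without_blank=$(grep -c -f compare_clean.txt org.csv)
echo "without blank line: $without_blank"
```

Here grep -c counts the matching lines directly, in place of piping to wc -l. Note that compare_clean.txt still ends with a single newline after test28, and that alone does not produce an empty pattern.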