Translated by AI
Command Names as Seen by the Linux Kernel
Introduction
If you use Linux, you probably run all kinds of commands every day. We usually refer to them by their "command name", but what that term means depends on the context. In this article, I describe what a "command name" is from the perspective of the Linux kernel.
First, I'll provide a short conclusion, followed by a specific explanation, and finally, I'll describe what prompted me to investigate this and the subsequent investigation process.
Conclusion
- From the Linux kernel's perspective, a command name is the first 15 bytes of the basename of the executable file name (the file name with the directory part removed).
- It is stored as a null-terminated string in a 16-byte field named comm within a structure called task_struct, which exists for each process (more precisely, kernel-level thread) in the kernel's memory.
- This allows the kernel to identify processes at a low cost and with better readability than a PID.
- This command name is used in kernel logs and by tools in the procps package such as ps and pgrep. The 15-byte limit mentioned above is why long command names appear cut off in the middle.
Investigation Process
Software versions used for investigation
- Linux kernel: v5.15
- procps: 3.3.17
Background
What prompted me to investigate the points in the "Conclusion" section was that the pgrep command I was using in one of my own programs did not work as expected. pgrep treats its argument as a regular expression and prints the PIDs of running processes whose names match it. For example, the following runs a script named "foo.sh" that sleeps forever and then shows its PID with pgrep.
$ cat foo.sh
#!/bin/bash
sleep infinity
$ ./foo.sh &
[2] 1086408
$ pgrep "foo\.sh"
1086408
However, when I did the same thing for a script named "foo-bar-baz-hoge-huga.sh" that does exactly the same thing as "foo.sh", pgrep displayed nothing.
$ cat foo-bar-baz-hoge-huga.sh
#!/bin/bash
sleep infinity
$ ./foo-bar-baz-hoge-huga.sh &
[2] 1086868
$ pgrep "foo-bar-baz-hoge-huga\.sh"
$
Thinking this was strange, I looked at man pgrep and found the following description:
NOTES
The process name used for matching is limited to the 15 characters present in the output of /proc/pid/stat.
When I actually checked the /proc/pid/stat file for foo-bar-baz-hoge-huga.sh, I got the following string:
$ cat /proc/601235/stat
601235 (foo-bar-baz-hog) S 593786 601235 593786 34817 601419 4194304 224 0 0 0 0 0 0 0 20 0 1 0 5735606 8617984 900 18446744073709551615 94266299658240 94266300571405 140732967030208 0 0 0 65536 4 65538 1 0 0 17 1 0 0 0 0 0 94266300816048 94266300864080 94266304847872 140732967036675 140732967036712 140732967036712 140732967038941 0
The string displayed inside the parentheses in the second field, which shows the command name, was not the entire script name, but indeed matched only the first 15 characters.
I understood the specification and realized that my use of pgrep was incorrect, but I decided to find out exactly where this 15-character limit comes from.
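To make the mismatch concrete, here is a minimal Python sketch of the matching step. It is not pgrep's actual implementation: the function name matches_like_pgrep and the constant COMM_LIMIT are made up for illustration, and the sketch only assumes what the man page states, namely that the pattern is a regular expression applied to a process name limited to 15 characters.

```python
import re

COMM_LIMIT = 15  # limit stated in the NOTES section of man pgrep


def matches_like_pgrep(pattern: str, script_name: str) -> bool:
    """Hypothetical model: search a regex in the truncated process name."""
    # Only the first 15 characters are visible to the matcher
    # (ASCII names assumed, so characters and bytes coincide).
    visible_name = script_name[:COMM_LIMIT]
    return re.search(pattern, visible_name) is not None


if __name__ == "__main__":
    # Short name: the whole name is visible, so the pattern matches.
    print(matches_like_pgrep(r"foo\.sh", "foo.sh"))
    # Long name: the pattern is longer than the visible name, so it cannot match.
    print(matches_like_pgrep(r"foo-bar-baz-hoge-huga\.sh", "foo-bar-baz-hoge-huga.sh"))
    # A pattern that fits within the first 15 characters does match.
    print(matches_like_pgrep(r"foo-bar-baz-hog", "foo-bar-baz-hoge-huga.sh"))
```

In practice, the same NOTES section of the man page points to the -f option, which matches against the full command line instead of the process name.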
Reading the procfs manual
Files under the /proc/ directory are provided by a filesystem called procfs. Unlike filesystems like ext4 or XFS that manage data on a disk, procfs exists to allow users to obtain information from the kernel or change the kernel's state through files. I won't go into the details of procfs here, but if you're interested, please refer to this video.
First, let's check the specifications for the /proc/pid/stat file. Specifications for files under procfs are documented in man procfs. Here is an excerpt from the relevant section:
/proc/[pid]/stat
Status information about the process. This is used by ps(1). It is defined in the kernel source file fs/proc/array.c.
...
(2) comm %s
The filename of the executable, in parentheses. Strings longer than TASK_COMM_LEN (16) characters (including the terminating null byte) are silently truncated. This is visible
whether or not the executable is swapped out.
So the second field of the /proc/pid/stat file holds the name of the executable file in parentheses, and the name is handled as a null-terminated string of at most TASK_COMM_LEN (16) bytes including the terminating null byte; anything beyond that is silently dropped. Subtracting the 1 byte for the null character from 16 bytes leaves 15 bytes, which matches the information in the pgrep manual.
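As a quick check of this reading of the manual, the following Python sketch extracts the second field from a stat line and measures its length. The helper name parse_stat_comm is hypothetical, and the sample line is a shortened copy of the output shown earlier. Searching for the last ")" rather than the first is a defensive assumption on my part, since a file name may itself contain parentheses or spaces.

```python
TASK_COMM_LEN = 16  # value quoted in the procfs manual


def parse_stat_comm(stat_line: str) -> str:
    """Return the text between the first "(" and the last ")" of a stat line."""
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    return stat_line[start + 1:end]


# Shortened copy of the /proc/601235/stat output shown earlier.
sample = "601235 (foo-bar-baz-hog) S 593786 601235 593786 34817"
comm = parse_stat_comm(sample)

# The name occupies TASK_COMM_LEN - 1 bytes; the last byte is for the null.
print(comm, len(comm.encode()), TASK_COMM_LEN - 1)
```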
Identifying the handler for the /proc/pid/stat file
Next, I examined the kernel source to identify where this string is actually output and where the data is stored. Since the procfs manual stated that the /proc/pid/stat file is defined in the fs/proc/array.c file in the kernel source, I decided to look at that file first.
The relevant code seemed to be the following part within the do_task_stat() function:
Calling the seq_puts() function writes the string given as its argument to the file. In the code above, lines 562 and 564 output "(" and ")", so the proc_task_name() call between them on line 563 presumably outputs the command name to the file.
Before looking into the contents of proc_task_name(), I decided to check whether the do_task_stat() function is really called when the /proc/pid/stat file is read. Tracing its callers showed that it is called from two functions: proc_tid_stat() and proc_tgid_stat().
In the kernel, TID is the thread ID and TGID (thread group ID) identifies the process, so I can infer that proc_tgid_stat() is likely the caller we are looking for. Since procfs also exposes per-thread status under the /proc/pid/task directory, proc_tid_stat() is likely the handler for the /proc/pid/task/tid/stat file.
Tracing the callers of these functions further led me to the fs/proc/base.c file, where the handlers called when users read or write the various files in procfs are registered. There I confirmed that the proc_tgid_stat() function is registered to be called when accessing the /proc/tgid/stat file, or in other words, the /proc/pid/stat file.
In summary, I found the following:
- The user reads the /proc/pid/stat file.
- The proc_tgid_stat() function is called.
- The do_task_stat() function is called.
- The proc_task_name() function is called, and the command name is sent to the file output.
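The sequence above can be pictured with a small Python model. This is only a sketch that borrows the kernel function names. The dictionary HANDLERS, the read_file() helper, and the dict-based task are hypothetical stand-ins for the registrations in fs/proc/base.c, and only the first two fields of the stat line are produced.

```python
def proc_task_name(task: dict) -> str:
    """Stand-in for the step that writes the command name."""
    return task["comm"]


def do_task_stat(task: dict, whole: bool) -> str:
    """Build only the PID and the parenthesized name; other fields are omitted.

    The whole flag is illustrative and unused here, because both callers
    produce the same first two fields in this sketch.
    """
    return f'{task["pid"]} ({proc_task_name(task)})'


def proc_tgid_stat(task: dict) -> str:
    """Process-level handler."""
    return do_task_stat(task, whole=True)


def proc_tid_stat(task: dict) -> str:
    """Thread-level handler."""
    return do_task_stat(task, whole=False)


# Hypothetical dispatch table: which handler serves which procfs file.
HANDLERS = {
    "/proc/pid/stat": proc_tgid_stat,
    "/proc/pid/task/tid/stat": proc_tid_stat,
}


def read_file(path: str, task: dict) -> str:
    """Look up the handler registered for the path and call it."""
    return HANDLERS[path](task)


task = {"pid": 601235, "comm": "foo-bar-baz-hog"}
print(read_file("/proc/pid/stat", task))
```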
Identifying the source of command name information
Looking at the implementation of the proc_task_name() function, it looks like this:
I will omit the details, but if the process identified by the PID is a normal program, the evaluation result of the if statement on line 103 will be false. This result is true only for special processes created within the kernel.
Furthermore, the escape argument of the proc_task_name() function only affects how the string is written out: the if statement on line 108 decides whether it goes through the seq_escape_str() function on line 109, which escapes certain special characters, or is printed as is. Either way, the data obtained by the __get_task_comm() function (presumably a null-terminated string) is what ends up in the /proc/pid/stat file, so I won't explain the escaping in detail.
Now, let's look at the contents of the __get_task_comm() function.
You can see that the value of tsk->comm, or more precisely, the value of the comm field in the structure named task_struct, is the source of the command name. A task_struct structure exists for each thread. Let's look at the definition of the task_struct structure.
We can see that the comm field is a char array of length 16. The procfs manual also mentioned that the length of TASK_COMM_LEN is 16 bytes.
Checking where the task_struct->comm value is set
The following __set_task_comm() function is what sets the task_struct->comm value.
The caller of the __set_task_comm() function is the begin_new_exec() function.
This function is called when the execve() system call is invoked to run a new program in a process, and bprm->filename contains the name of the executable file as a null-terminated string. Here, you can see that the executable file name is passed through the kbasename() function and then saved in task_struct->comm. Like the basename() function in the standard C library, kbasename() returns the filename with the directory portion removed. In other words, if the executable file name is "./foo.sh", "foo.sh" is stored in task_struct->comm, and if it is "./foo-bar-baz-hoge-huga.sh", the truncated "foo-bar-baz-hog" is stored. We have finally arrived at the definition of the "command name" from the perspective of the /proc/pid/stat file, or in other words, the Linux kernel.
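The two steps described here, taking the basename and then fitting it into the 16-byte field, can be sketched in Python as follows. The function names kbasename_sketch and comm_for are hypothetical and this is a model of the behavior described in this article, not the kernel code.

```python
TASK_COMM_LEN = 16  # size of the comm field, including the terminating null


def kbasename_sketch(path: str) -> str:
    """Return everything after the last "/" (the whole string if there is none)."""
    return path.rsplit("/", 1)[-1]


def comm_for(filename: str) -> str:
    """Model the name that ends up in comm for a given executable file name."""
    raw = kbasename_sketch(filename).encode()
    # Keep at most 15 bytes, leaving one byte for the null terminator.
    truncated = raw[:TASK_COMM_LEN - 1]
    # Truncation is per byte, so a multi-byte character may be cut in half;
    # this sketch simply drops such a partial tail when decoding.
    return truncated.decode(errors="ignore")


if __name__ == "__main__":
    for name in ("./foo.sh", "./foo-bar-baz-hoge-huga.sh", "/usr/bin/sleep"):
        print(name, "->", comm_for(name))
```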
Looking at the procps source
Finally, I read the procps source and found that the string pgrep matches against is, as written in the man page, the content of the second field of the /proc/pid/stat file (at most 15 characters) with the "(" and ")" removed.
Since it's not doing anything particularly unusual, I'll omit the explanation of the procps source code.
Column: Thinking about the definition of command names
We have learned that what the Linux kernel refers to as a command name is the first 15 bytes of the executable file's basename, but why is it processed with basename, and why is it truncated to a maximum of 15 bytes? I believe the reasons are likely as follows.
In order to identify a process through kernel logs and other means, it is useful to have information that can be easily viewed as a string, separate from the PID. The executable filename can be used for this. However, storing the filename exactly as it is within the task_struct structure would consume a large amount of kernel memory and could potentially lead to security vulnerabilities if a malicious user runs a program with an extraordinarily long filename. Therefore, it is impractical to store the entire filename.
One might think, "The executable filename is in the process's memory, so why not just look at that value?"... but it's not that simple. When the kernel accesses a process's memory, if that memory has been swapped out, it must be swapped back in before it can be read, which is a cumbersome task. In addition to being troublesome, this approach cannot be used for purposes such as outputting to the kernel log when the system is low on memory; you cannot increase memory usage when memory is already scarce.
I suspect that the basename, such as "foo.sh", is used instead of the filename as specified at execution time, such as "./foo.sh", or the full path because the basename was judged to be enough to identify the process at a glance.
Closing Thoughts
In this article, I explained the reasoning behind the Linux kernel's command name specification. I also shared the process of exploring the source code to find answers to minor questions that arise during computer use, providing a hands-on look at the source code reading experience. While these details may not be immediately applicable, I hope they serve as helpful trivia.
I've posted the video of this source code reading on YouTube, so please take a look if you're interested.
As a side note, I was reminded of the value of open-source software, which allows us to quickly inspect the source code in situations like this. That's all.