💻

The importance of using loggers

に公開

Introduction

No matter what your role is, if your job involves coding you've most certainly have to show some kind of output, either to the user or to yourself. For commodity, most of us use the default print statement of the language we are working with (System.out.print on Java, print on Python, etc.).
Everything is nice and beautiful until we actually need to modify our code: Some print statements are not needed, some should be shown only under certain conditions, and some other should be actually saved somehwere to be checked later. In some cases, a simple "Find and Replace" will be enough, but in some others, we actually have to spend time thinking and reorganzing them, which most people, including myself, don't find fun at all.

TL;DR. Just give me the code (in Python)

In a hurry? Got it! Here you are:

  • Clone this repository: https://github.com/DXC-Lab-Linkage/quick_python_logging_setup
  • Copy and paste logging_setup.py and logging_config.json into the root directory of your project
  • Take a look at log_test.py to know how to call the logger. For the lazy, here it is: Just copy-paste this code to the top of your code (after the imports), and then just use logger.info, logger.debug, logger.warning or logger.error as you need. Note that the loggers are configured to output debug messages only to the file and not to stdout.
import logging
from logging_setup import LoggingSetup

LoggingSetup.setup_logging(LoggingSetup.MAIN_LOGGER_NAME)
logger = logging.getLogger(LoggingSetup.MAIN_LOGGER_NAME)
  • Enjoy! Make changes as you see fit.

Print-Man

In case you're not aware, it's advised to provide some kind of output to the users (even if the users are ourselves) in order to let them know that something is actually happening and the program is not frozen, stuck or has crashed. This is particularly important when your program is doing some operation or processing that will take long, and while "long" is subjective, the fact that the program goes silent for more than a couple of seconds is worrisome for some users, especially for those who don't know what is happening inside.

Say your program has to process a CSV file and produce a new CSV file with the results of that processing. You need to apply 5 different operations to each of the rows in the file, and each operation receives as input the output of the previous one. Let's call them operation_1 ... operation_5. Assuming you actually paid attention to the "Introduction to Programming" class, you would nicely have these operations separated in functions within your code.

# Let's assume that you will process all rows in one operation before applying the next one

import pandas as pd

def operation_1(input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something with the rows

def operation_2(input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something with the rows

def operation_3(input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something with the rows

def operation_4(input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something with the rows

def operation_5(input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something with the rows

def main() -> None:
    input_rows = pd.read_csv("my_file.csv")

    # This is for the example. Not necessarily the best implementation
    for operation in (operation_1, operation_2, operation_3, operation_4, operation_5):
        input_rows = operation(input_rows)

    input_rows.to_csv("my_processed_file.csv", index=False)        

As a proof-of-concept, your file has 10 rows. You run the script. There are 2 possible scenarios:

  1. Your code works flawlessly and your output is correct (rare, but possible)
  2. Your code fails at some operation and an error/exception is raised (much more common)

I don't think someone would think that, in the case 1 above, the program is ready and you're done, but just to be sure: Nope. Your program is not ready, even if you yourself are the end user. "Why?" you ask? Well, let's consider the following:

  • Your input now has hundreds of thousands of rows

"Not a problem" you say. You run your program... but nothing seems to be happening. Does this worry you? Your program may very well be running normally, but even so, how can you tell? You would have to wait until the very end to see if the output file was created correctly. And even if that happens, if you happen to find some rows incorrectly processed, how can you tell where the issue occurred?

I may be old (well, I am now...), but I'm still amazed when a younger fellow does not know about debuggers, especially in Python. Enter "print-man": Add print statements to wherever they are needed in order to know the progress and/or the state of the program at that line.

def main() -> None:
    input_rows = pd.read_csv("my_file.csv")

    # This is for the example. Not necessarily the best implementation
    for operation in (operation_1, operation_2, operation_3, operation_4, operation_5):
        print(f"Now applying {operation}. Please wait")
        input_rows = operation(input_rows)
        print(f"{operation} successfully applied"}

    input_rows.to_csv("my_processed_file.csv", index=False)   

Well, that's a bit better. That would at least show something between the operations, and it would be helpful if an error occurs, because at least you would know which operation caused the issue. Imagine that operation_2 is the problem. "Print-man" appears:

def operation_2(input_rows: pd.DataFrame) -> pd.DataFrame:
    print(f"Now entering operation_2")
    # Do something with the rows
    print("Apply inner function to the rows")
    result_rows = input_rows.apply(some_inner_function, axis=1)
    # Do something else before returning
    print("Final touches before ending operation_2")
    result_rows = final_touches_to_the data(result_rows)

    return result_rows

Even with this simple approach, at least we have something. It's not ideal, but it's better than nothing.

What's the problem?

First of all: The "print-man" approach is not bad per se. It allows to have at least some hints at progress or errors in a quick way. The problem is when we rely too much on it and we end up with a lot of statements that are not necessarily useful, or when they are useful there is no way to check them all because we're printing them only. We could definitely redirect the output to a file when we run the program, but we would need to remember to do so, and if we forget, we may end up having to run it again against all the data.

$> python my_script.py > my_script_output.txt & 

(& added to send the process to background)

Secondly, not all messages should be the same, nor all of them have to be shown in all cases. Why would the user need to see that an inner function is being called on operation_2? It would be useful in case we're checking an error, but not necessarily when we just want to know the progress.

We could use some kind of flag to help us know when certain messages are needed, either as a constant, or as a parameter to the command

  1. As a constant
# Set this to True to print messages regarding to debugging
DEBUG_MODE = False

...

# Somewhere in the code
if DEBUG_MODE:
    print("This is a debug message")
# Do something
print("This is a normal message)
  1. As parameter
from argparse import ArgumentParser, Namespace
from functools import partial


def handle_arguments() -> Namespace:
    parser = ArgumentParser(description="Some nice description here")
    parser.add_argument("--debug_mode", "-d", action="store_true", help="If set, debug messages will be printed")
    ...

    return parser.parse_args()


def operation_1(debug_mode: bool, input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something

def operation_2(debug_mode: bool, input_rows: pd.DataFrame) -> pd.DataFrame:
    # Do something

...

def main() -> None:
    args = handle_arguments()

    debug_mode = args.debug_mode

    for operation in (
        partial(operation_1, debug_mode), partial(operation_2, debug_mode),...
    ):
        input_rows = operation(input_rows)

...

While this approach looks better, it introduces more code, and more code means more things to maintain. We could define a global variable with the argument value, but that introduces a whole new problem (please don't use global variables...)

Thirdly, if we need to show only some part of the output to someone else, there will be a lot of non-relevant information in those statements. At least we would have some output, but we would need to refine it, get rid of text that we don't need, etc.

Loggers

Enter loggers. Most of the programming languages provide some way to log messages (I say most because languages like Brainfuck exist).

I am sure that most, if not all, of the people reading this know about the existence of loggers. Then, why we don't use them?

For starters, we want to output something quickly. Print statements are there for a reason. There is no need to do something too complicated. Even C's printf is not esoteric once you know how it's used. Python has f-strings, str.format and some other ways to build strings with variable values in them.

value = 10
print(f"This is my value: {value}")

print("This is my value: {value}".format(value=value))

Then, just out of habit we tend to use print statements as our default output method. As I mentioned before, it's not necessarily bad. The issue raises when those print statements are going to be more than a reference of what is happening, and some of the times we can't foresee that.

Loggers exist to have a reliable and customizable output, which can be stdout, but it's not limited to it. They also have levels associated, which are the way to differentiate the use of whatever it's being output. The "problem" (note the quotes) with loggers is that we need to spend time configuring them if we want them to be fully functional.

Sure, in Python we can just get the root Logger and start using it.

import logging

logger = logging.getLogger()

logger.info("Hello world!")

Or in Java, providing the class name:

import java.util.logging.*; // Sue me for importing everything :P

Logger logger = Logger.getLogger(MyClass.class.getName());
logger.log(Level.INFO, "Hello world!")

In the simplest way, yes, this will work, but using a logger like this is pretty much "overkill". Better stick with print statements in these cases.
In order to take advantage of loggers, it's necessary to configure them, and while in most cases the configuration is very intuitive, it's not necessarily straightforward. You need to know and setup at least:

  • Loggers: The actual objects that will be used to log messages
  • Formatters: Objects that handle the format of the log messages
  • Handlers: Interfaces that control where the log messages will be output
  • Levels: Labels to show what type of log messages are being written

Instead of having to worry only about the actual output, now you have to worry about 4 other components as well, which means that you'll have to spend time first learning how to setup them, and then actually setting them up.

Plan ahead

While it's true that you need to spend time learning about loggers in the language you're using (although the concepts are similar in many of them), most of the times this will happen AFTER you already have a good chunk of code already done and working, maybe with a bunch of print statements. Editing your working code is not pleasant at all, because you may cause further issues, or even if you're careful and are using version control software, changing print statements to logger methods needs a pull request (git) / a new branch for it before merging with the trunk (subversion) / etc. And let's be real: Configuring loggers and changing statements is time you would like to spend anywhere else in your code, especially if you have to meet a deadline.

Although nowadays you can use LLMs to do some of the tedious work for you, there may be occasions in which you can't use them (sounds unreal, but there are companies that don't want to use them for privacy reasons, even if the code you intend to send to the LLM does not contain any kind of private information). And yet, even if you're able to use them, the best approach when it comes to logging is to do it before starting your work.

The time you'll spend setting things up won't disappear, but there are still advantages to plan using loggers in advance:

  • You just have to learn once how to configure loggers, formatters, handlers and levels. The knowledge acquired can be reused in other pieces of code.
  • You will be using logger from the beginning. There's no need to worry about having to replace print statements later.
  • It will be easier to separate debug, info, error messages
  • If you ever need to check something quickly with print statements, it will be easier to get rid of them becuase you'll know all of them are temporary; your intended log messages will be using loggers, so find/replace will be enough to remove all print statements.

Write logs to files too

It may sound unnecessary, but if you have experienced situations in which you wish you had saved log messages in a file only to find out that you need to rerun your code (sometimes for hours!), you'll definitely agree with the idea of writing logs to file.

Log files are extremely useful. Using them allows you to:

  • Check afterwards how your code behaved
  • Have a way to keep track of any potential issues or rare cases with the use of debug messages
  • Be able to have something to show even after your code ended and you've got the output. This is especially true when some clients ask you for certain output

In my experience, there have been times when I do the first executions of a code and either everything goes smoothly or some very unusual issue arises, only to not being able to pinpointing what exactly happened, in which case I have had to apply "Print-man" and run everything again. Sometimes it's very frustrating. So, to avoid that, I try to use loggers from the very beginning and write messages to files. It's been quite helpful, and in more times than I'm willing to accept, having log files has saved me of situations on which I have to prove that certain things happened or did not happen in the code. Recently, something similar happened to a young colleague: The colleague needed to show some results that were obtained after several hours of processing. Said colleague used print statements and everything was thought to be fine... except when (s)he found out that the output had been truncated and the most important part was not shown. The colleague was using Jupyter Notebooks, and while this may fall in the category of setting up cells correctly and using display instead of print, when I heard his/her problem I immediately suggested to spend time learning and setting up loggers, and to add loggers to file, because I have in similar situations in the past. The colleague was forced to rerun the code again and wait for several hours to get the results, I could not change that, but I'd like to think that that was the best moment to teach him/her about the importance of loggers.

OK, you have a point, now show me how to do it

There are a ton of excellent tutorials about loggers out there, for most of the languages at least, so writing here how to do it explicitly in Python (which is the language I currently use) is unnecessary. But, what I can do is to provide you with some Python code that, at least, will allow you to be using loggers and to log to a file and to stdout at the same time.

You can find the code and instructions at the top of this article, in the section TL;DR: Just give me the code (in Python)

In conclusion

Logging is a part of the coding process that should be taken into account since the very beginning to avoid potential headaches. While it's very intuitive in most of the cases, it does take time to learn how to use them properly, but it's time very well spent, and you will have to learn it only once. Afterwards, it pays off.

Also, I strongly recommend to log messages to files as well. They can be very helpful whenever you need to check what your code output.

Last, the "print-man" approach is not bad, but you should not abuse it. It's quite useful when you need to check something quickly, but if you need to debug something meaningful, consider using a debugger instead.

DXC Lab

Discussion