Text Annotation Tool with Streamlit
I built a text annotation tool with Streamlit, and this article walks through the steps. The tool adds 5W1H annotations to text.
Recently at work, I needed to do some simple text annotation to train a machine learning model, so I wrote an annotation tool with minimal functionality in Python.
What is Streamlit
Streamlit is a framework for creating web applications in Python. Both the code and the resulting app are simple and easy to follow.
Operating Environment
- Ubuntu 20.04
- Python 3.8.10
Environment Setup
pip3 install streamlit
This walkthrough adds 5W1H annotations to toy.txt, a small text file I wrote myself. The code expects it at ./dataset/toy.txt.
toy.txt
今月末の学会で、自分が口頭発表の際に使うためのデモ動画を、最優先で作る。
自分の机の上にある、講義で使う参考書をなるはやで会場まで持ってきてほしい。
自社の代表が出席する取締役会が明日正午から開始される。
Implementation
We will implement the TextAnnotator class.
Full implementation (toy.py)
import streamlit as st

class TextAnnotator:
    def __init__(self):
        self.path_dataset = f"./dataset/toy.txt"
        self.path_out = f"./dataset/annotated_toy.tsv"
        self.lines = self.load_text()
        return

    def load_text(self):
        with open(self.path_dataset, mode="r", encoding="utf-8") as f:
            return f.readlines()

    def run(self):
        st.title("5W1Hアノテーションツール", anchor=None)
        annots = []
        for idx, line in enumerate(self.lines):
            line = line.replace("\n", "")
            st.header(f"{idx+1}番目\n")
            st.write(line)
            id = idx+1
            when = "when_" + st.text_input('when(いつ)', '-' , key=id)
            where = "where_" + st.text_input('where(どこで)', '-' ,key=id*300+1)
            who = "who_" + st.text_input('who(だれが)', '-' , key=id*300+2)
            why = "why_" + st.text_input('why(なんで)', '-' , key=id*300+3)
            what = "what_" + st.text_input('what(なにを)', '-' , key=id*300+4)
            how = "how_" + st.text_input('how (どうした)', '-', key=id*300+5)
            annot = f"{line}\t{when},{where},{who},{why},{what},{how}\n"
            annots.append(annot)
        if st.button('完了', key=id*300+7):
            with open(self.path_out, mode="w", encoding="utf-8") as o:
                o.writelines(annots)
        return

if __name__ == "__main__":
    ta = TextAnnotator()
    ta.run()
- import
Import streamlit under the alias st, which the rest of the code uses.
import streamlit as st
- __init__
Specify the input file containing the text to be annotated and the output destination file. Then, load the input text.
class TextAnnotator:
    def __init__(self):
        self.path_dataset = f"./dataset/toy.txt"
        self.path_out = f"./dataset/annotated_toy.tsv"
        self.lines = self.load_text()
        return
- load_text
Assuming each line holds one sentence to annotate, load_text reads the file into a list of lines.
    def load_text(self):
        with open(self.path_dataset, mode="r", encoding="utf-8") as f:
            return f.readlines()
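As a minimal sketch of the same loading step outside the class, the function below also strips line endings and skips blank lines, so an empty line in the file does not become an empty annotation entry. The name load_sentences is hypothetical and not part of toy.py.

```python
from pathlib import Path


def load_sentences(path):
    """Read one sentence per line, dropping line endings and blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling load_sentences("./dataset/toy.txt") would return the three sample sentences as a list, assuming the file is laid out as shown above.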
- run
We will create the GUI using the Streamlit library.
    def run(self):
        st.title("5W1Hアノテーションツール", anchor=None)
        annots = []
        for idx, line in enumerate(self.lines):
            line = line.replace("\n", "")
            st.header(f"{idx+1}番目\n")
            st.write(line)
            id = idx+1
            when = "when_" + st.text_input('when(いつ)', '-' , key=id)
            where = "where_" + st.text_input('where(どこで)', '-' ,key=id*300+1)
            who = "who_" + st.text_input('who(だれが)', '-' , key=id*300+2)
            why = "why_" + st.text_input('why(なんで)', '-' , key=id*300+3)
            what = "what_" + st.text_input('what(なにを)', '-' , key=id*300+4)
            how = "how_" + st.text_input('how (どうした)', '-', key=id*300+5)
            annot = f"{line}\t{when},{where},{who},{why},{what},{how}\n"
            annots.append(annot)
        if st.button('完了', key=id*300+7):
            with open(self.path_out, mode="w", encoding="utf-8") as o:
                o.writelines(annots)
        return
The explanation below focuses on the Streamlit calls.
st.title("5W1Hアノテーションツール", anchor=None)
st.write(line)
st.title() creates the page title, and st.write() displays the sentence to be annotated.
when = "when_" + st.text_input('when(いつ)', '-' , key=id)
st.text_input() creates a text box, and the entered text, prefixed with "when_", is stored in the variable when. '-' is the default string shown in the text box. An empty annotation is awkward to handle in post-processing, so the box is pre-filled with a placeholder that, in my experience, is unlikely to appear in the input text. The key must be unique for each text box, so we assign an integer that no other text box uses. With this scheme (id for the first box, id*300+1 through id*300+5 for the rest), the keys stay unique as long as there are no more than 300 sentences.
The other 5W1H fields follow the same pattern, as shown in the full implementation above, so I will skip them.
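As a sketch of an alternative to hand-picked integers, the keys can be built from the field name and the sentence number, since Streamlit also accepts strings for key. The helper names below (widget_key, string_keys, integer_keys, has_duplicates) are hypothetical and not part of toy.py; integer_keys only reproduces the integer scheme from the full implementation so that the two can be compared.

```python
FIELDS = ("when", "where", "who", "why", "what", "how")


def widget_key(sentence_no, field):
    """Hypothetical helper: build a string key such as 'when_1'."""
    return f"{field}_{sentence_no}"


def string_keys(n_sentences):
    """All text-box keys for sentences numbered 1..n_sentences."""
    return [widget_key(no, field)
            for no in range(1, n_sentences + 1)
            for field in FIELDS]


def integer_keys(n_sentences):
    """Text-box keys produced by the integer scheme used in toy.py."""
    keys = []
    for no in range(1, n_sentences + 1):
        keys.append(no)  # key of the "when" box
        # keys of the "where" .. "how" boxes
        keys.extend(no * 300 + offset for offset in range(1, 6))
    return keys


def has_duplicates(keys):
    """True if any key appears more than once."""
    return len(keys) != len(set(keys))


print(has_duplicates(integer_keys(300)))  # False
print(has_duplicates(integer_keys(301)))  # True: 301 is used twice
print(has_duplicates(string_keys(301)))   # False
```

In the tool itself, this would be used as st.text_input('when(いつ)', '-', key=widget_key(id, "when")), which avoids the limit on the number of sentences.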
annot = f"{line}\t{when},{where},{who},{why},{what},{how}\n"
annots.append(annot)
The annotation for one sentence is formatted as a single line (the sentence, a tab, then the six labels separated by commas), stored in the variable annot, and appended to the list annots.
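The line format can be sketched as a small standalone helper, which makes it easy to check without launching the app. FIELDS and format_annotation are hypothetical names, not part of toy.py; missing fields fall back to the same '-' placeholder that the text boxes use.

```python
FIELDS = ("when", "where", "who", "why", "what", "how")


def format_annotation(sentence, values, default="-"):
    """Build one output line: sentence, tab, six comma-separated labels."""
    tags = ",".join(f"{field}_{values.get(field, default)}" for field in FIELDS)
    return f"{sentence}\t{tags}\n"


row = format_annotation("The board meeting starts at noon.", {"when": "noon"})
print(repr(row))
# 'The board meeting starts at noon.\twhen_noon,where_-,who_-,why_-,what_-,how_-\n'
```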
if st.button('完了', key=id*300+7):
    with open(self.path_out, mode="w", encoding="utf-8") as o:
        o.writelines(annots)
return
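st.button() creates a button. It also needs a unique key, so we assign an integer that no other widget uses. st.button() returns True on the run triggered by the click, and at that point annots, which holds the annotation results, is written to the output file. Because the file is opened with mode="w", each press overwrites the previous output.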
After creating the code, run the following command in the terminal to launch the annotation tool in your browser.
streamlit run toy.py
Results
Here are the results of running the tool. As you scroll, the second and third texts will be displayed.

Launch screen of the 5W1H text annotation tool
From there, you copy and paste the relevant spans into the input fields by hand. In an actual project, I annotated about 300 sentences in 2.5 hours. There may have been a smarter way, but perseverance won out over ingenuity. It is plain, steady work, but reading each sentence while annotating sharpened my understanding of the data, which made it worthwhile. Going through every record also let me catch and remove unexpected data, which was a significant benefit.
This is what it looks like after annotating.

Annotation of the 1st sentence

Annotation of the 2nd sentence

Annotation of the 3rd sentence
After filling in everything and pressing the "Done" (完了) button shown at the bottom left of the third image, the annotated dataset will be output.

Dataset annotated with 5W1H
Then, you can extract specific 5W1H items to use as training data, generate 5W1H sentences, or use them for any other desired purpose.
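As a minimal sketch of that extraction step, the functions below parse lines in the output format and collect the values of one field, skipping the '-' placeholder. parse_annotation_line and extract_field are hypothetical names, and the sketch assumes that annotation values contain no commas or tabs.

```python
def parse_annotation_line(row):
    """Split one output line into the sentence and a {field: value} dict."""
    sentence, _, tags = row.rstrip("\n").partition("\t")
    labels = {}
    for tag in tags.split(","):
        # Split at the first underscore only, so values may contain "_".
        field, _, value = tag.partition("_")
        labels[field] = value
    return sentence, labels


def extract_field(rows, field, placeholder="-"):
    """Return (sentence, value) pairs for rows where the field was filled in."""
    pairs = []
    for row in rows:
        sentence, labels = parse_annotation_line(row)
        value = labels.get(field, placeholder)
        if value != placeholder:
            pairs.append((sentence, value))
    return pairs


rows = [
    "Sentence A\twhen_tomorrow,where_-,who_-,why_-,what_-,how_-\n",
    "Sentence B\twhen_-,where_-,who_-,why_-,what_-,how_-\n",
]
print(extract_field(rows, "when"))  # [('Sentence A', 'tomorrow')]
```

The rows could come from reading ./dataset/annotated_toy.tsv with open(..., encoding="utf-8") and readlines().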
Conclusion
I built a very simple annotation tool with Streamlit, a framework for creating web applications in Python. Because the task was simple, I built it without researching existing text annotation tools in depth. Software that extracts 5W1H automatically may well exist, but I am satisfied to have found that I could build a tool like this from scratch. I have gained one more skill.