
Compressing 300-Dimensional Word Vectors into 1 Dimension


I ran a follow-up experiment on Japanese data with a method that represents word embeddings, such as those produced by Word2Vec, in a single dimension. This article summarizes the procedure and results.

For this experiment, I referred to "WordTour," presented by a researcher at Kyoto University. The author's own explanatory slides are available here:
https://www.slideshare.net/joisino/word-tour-onedimensional-word-embeddings-via-the-traveling-salesman-problem-naacl-2022?from_action=save

Given the recent development of large-scale language models, it is clear that word vectors are extremely effective. On the other hand, as mentioned in the materials, when considering system operation in environments without abundant computational resources, there are challenges such as being "memory-intensive," "time-consuming," and "difficult to interpret."

To address these challenges, this research explored whether text could be made one-dimensionally continuous, inspired by the fact that pixel values form a continuous one-dimensional scale.

Operating Environment

  • Ubuntu 20.04
  • Python 3.8.10

Environment Setup

The author has made the implementation public.
https://github.com/joisino/wordtour

How to Build WordTour by Yourself

I will proceed with the environment setup by referring to the section above. Since I will also prepare Japanese word embeddings this time, I will perform additional setup steps.

setup.sh

# Install build tools, then fetch and build the WordTour repository
sudo apt install wget unzip build-essential
git clone https://github.com/joisino/wordtour.git
cd wordtour
./download.sh
make
# Additional dependencies for Japanese tokenization used in this article
pip3 install sudachipy sudachidict_core

Preparing Japanese Word Embeddings

This time, I use "chiVe," Japanese word embeddings published by Works Applications. I downloaded the v1.2 mc90 gensim-format file (0.6 GB) and placed it in a directory of my choice.
https://github.com/WorksApplications/chiVe

What is WordTour?

WordTour is a study that explores making text one-dimensionally continuous. Because words are discrete, notions that come naturally with continuous data, such as "the word adjacent to this one," are not directly defined. Converting words into vectors lets us speak of vectors near a given word vector, or of small perturbations of it, but such nearby vectors generally do not correspond to any specific word.

Therefore, the study considers the problem of outputting one-dimensional word embeddings from a high-dimensional word vector space. In doing so, it defines and formalizes a condition called "soundness," which ensures that words embedded close to each other have similar meanings. Additionally, by adding a condition that the beginning and end of the word embeddings coincide, the word embeddings are formalized as a loop rather than a path.

All that remains is to solve this problem. The formulation is a Traveling Salesman Problem (TSP), which is NP-hard in general; however, modern solvers are known to solve instances with approximately 100,000 vertices to proven optimality. By feeding the problem into such a solver, the one-dimensional word embedding can be obtained as its solution.
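To make the formulation concrete, here is a toy sketch of my own (not the LKH solver that WordTour actually uses): brute-forcing the shortest closed tour over a handful of hypothetical 2D "word vectors" shows how the TSP objective keeps similar words adjacent on the loop.

```python
# Toy TSP over hypothetical 2D "word vectors": two clusters, drinks and colors.
import itertools
import math

words = {
    "beer":   (0.0, 0.0),
    "coffee": (0.2, 0.1),
    "tea":    (0.4, 0.0),
    "red":    (3.0, 3.0),
    "black":  (3.2, 3.1),
}

def tour_length(order):
    """Total Euclidean length of the closed loop visiting `order`."""
    return sum(math.dist(words[a], words[b])
               for a, b in zip(order, order[1:] + order[:1]))

# Fix the first word to remove rotational symmetry, then try all permutations.
first, *rest = list(words)
best = min(([first] + list(p) for p in itertools.permutations(rest)),
           key=tour_length)
print(best)  # drinks and colors each stay contiguous on the loop
```

On real data this enumeration is hopeless (n! candidate tours), which is why WordTour relies on a dedicated TSP solver instead.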

Implementation

In this verification, I focus on word vectors for nouns. Since the TSP solver used in WordTour can solve instances of up to approximately 100,000 vertices exactly, I limit the vocabulary to 10,000 words, comfortably within that range. First, I implement a class called VectorDumper that extracts only nouns from chiVe's word vectors. The data it outputs has the same format as the text version of chiVe; because the text version is large, I instead dump only the noun word vectors from the gensim version.

The implementation uses sudachipy to output the part-of-speech (POS) information of chiVe's word vectors and writes the word vectors corresponding to 10,000 nouns into tour.txt. Please see the implementation for details.

Complete implementation: VectorDumper
import yaml
import gensim

from sudachipy import tokenizer
from sudachipy import dictionary


class VectorDumper:

    def __init__(self, opt):
        self.opt = opt
        # Load the gensim-format chiVe vectors and their vocabulary
        self.word2vec = gensim.models.KeyedVectors.load(opt["path"]["chive"])
        self.vocabs = self.word2vec.index_to_key

        # Sudachi tokenizer, used only to look up part-of-speech tags
        self.tokenizer = dictionary.Dictionary().create()
        self.mode = tokenizer.Tokenizer.SplitMode.C
        self.target_pos = ["名詞"]  # nouns only

    def tokenize(self, token):
        return self.tokenizer.tokenize(token, self.mode)

    def dump_controlled_vocab(self):
        with open(self.opt["path"]["out"], mode="w", encoding="utf-8") as o:
            counter = 0
            for vocab in self.vocabs:
                if counter >= 10000:  # stop once 10,000 nouns are written
                    break

                pos = self.tokenize(vocab)[0].part_of_speech()[0]
                if pos in self.target_pos:
                    counter += 1
                    # One line per word: the word followed by its vector values
                    vector = self.word2vec[vocab]
                    target = list(map(str, vector.tolist()))
                    line = " ".join([vocab] + target) + "\n"
                    o.write(line)

    def run(self):
        self.dump_controlled_vocab()


if __name__ == "__main__":
    # Read config.yaml (paths to the chiVe model and the output file)
    with open("config.yaml", "r") as yml:
        opt = yaml.safe_load(yml)

    vd = VectorDumper(opt)
    vd.run()

Running the above class will output a file with 10,000 lines, where words and their word vectors are combined into one line in the following format. In my environment, I output it to ./dev/tour.txt.

映画 0.02665202133357525 0.026810171082615852 ...
先生 -0.07289369404315948 0.08926983177661896 ...
メール 0.07681874185800552 0.12336508184671402 ...
場所 0.019091689959168434 -0.08042081445455551 ...
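As a sanity check, this line format is easy to parse back into (word, vector) pairs. The helper below is my own, not part of either repository:

```python
def parse_embedding_line(line):
    """Split one tour.txt line into the word and its float vector."""
    word, *components = line.rstrip("\n").split(" ")
    return word, [float(x) for x in components]

sample = "映画 0.02665202133357525 0.026810171082615852\n"
word, vector = parse_embedding_line(sample)
print(word, len(vector))  # 映画 2
```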

Next, input the above file into WordTour. Run the following commands in a terminal opened in the directory where you cloned the WordTour repository.

# Convert tour.txt into the TSP input format for LKH (10000 = vocabulary size)
./make_LKH_file ./dev/tour.txt 10000 > ./LKH-3.0.6/wordtour.tsp
cp wordtour.par ./LKH-3.0.6/wordtour.par
# Build and run the LKH solver
cd LKH-3.0.6
make
./LKH wordtour.par
cd ..
# Convert the solver's tour back into an ordered word list
python3 generate_order_file.py
cat wordtour.txt

The output wordtour.txt contains the word embeddings embedded in one dimension.
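Because soundness puts related words on neighboring lines, one practical use of the file is a cheap "related words" lookup. The helper below is a hypothetical utility of my own, demonstrated on a toy tour standing in for the contents of wordtour.txt:

```python
def tour_neighbors(tour, word, k=1):
    """Return the k predecessors and k successors of `word` on the loop."""
    i = tour.index(word)
    n = len(tour)
    return [tour[(i + d) % n] for d in range(-k, k + 1) if d != 0]

# Toy tour standing in for the lines of wordtour.txt.
tour = ["酒", "ビール", "コーヒー", "茶", "黒", "白"]
print(tour_neighbors(tour, "コーヒー"))  # ['ビール', '茶']
print(tour_neighbors(tour, "酒"))       # ['白', 'ビール'] (the tour wraps)
```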

Implementation Results

Here is an excerpt from the output of wordtour.txt.

こと (thing/matter)
物 (object)
...
はじめ (beginning)
中心 (center)
メイン (main)
テーマ (theme)
タイトル (title)
...
同時 (simultaneous)
瞬間 (moment)
一気 (at once)
急 (sudden)
途中 (midway)
何度 (how many times)
一度 (once)
度 (degree/times)
熱 (heat)
空気 (air)
匂い (smell)
香り (scent)
味 (taste)
酒 (sake/alcohol)
ビール (beer)
コーヒー (coffee)
茶 (tea)
黒 (black)
白 (white)
赤 (red)
色 (color)
カラー (color)
...
週 (week)
週間 (weeks)
箇月 (months)
年間 (years)
年 (year)
以来 (since)
振り (after a while)
久々 (after a long time)
久し振り (long time no see)
先日 (the other day)
昨日 (yesterday)
今日 (today)
本日 (today - formal)
明日 (tomorrow)
楽 (easy/comfortable)
対応 (support/response)
サポート (support)
プログラム (program)
実行 (execution)
設定 (setting)
モード (mode)
クリア (clear)
プレー (play)
ゲーム (game)
アニメ (anime)
漫画 (manga)
小説 (novel)
物語 (story)
ストーリー (story)
人物 (person)
キャラ (character)
キャラクター (character)
...
00
0
1
2
3
4
5
6
7
8
9
11
12
14
13
16
18
17
19
21
22
23
26
24
25
15
10
20
30
40
60
50
100
1000
円 (yen)
ドル (dollar)
米 (US/rice)
金 (gold/money)
お金 (money)
価値 (value)
...
事 (matter)

Impressions of the Implementation

Let's examine how the words transition. Here is a further excerpt focusing on specific connections:

同時 (simultaneous)
瞬間 (moment)
一気 (at once)
急 (sudden)
途中 (midway)
何度 (how many times)
一度 (once)
------ Words above relate to duration and timing
度 (degree/times)  ← Connects from "once" (一度) to "degree" (the 度 of 温度, temperature)
------ Words below relate to gases and atmosphere
熱 (heat)
空気 (air)
匂い (smell)
香り (scent)
------
味 (taste)  ← Connects from "scent" (sensed via five senses) to "taste," then moves toward drinks
------
酒 (sake/alcohol)
ビール (beer)
コーヒー (coffee)
------
茶 (tea)  ← Transitions from drinks to the "color" of the drink
黒 (black)
白 (white)
赤 (red)
色 (color)
カラー (color)

Intuitive Understanding of 1D Word Embeddings

From a human perspective, the connections feel quite intuitive, although clearly not every word fits its neighbors perfectly. As the author's explanatory materials note, WordTour separates the properties desired of word embeddings into soundness and completeness. Since WordTour aims to satisfy only soundness (every connection it draws is correct, but some related words may be left out), this tendency in the chiVe-based 1D embedding appears consistent with the design.
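One rough way to put a number on soundness (my own heuristic, not an evaluation from the paper) is to compare the mean cosine similarity of adjacent pairs on the tour against that of a scrambled ordering; a sound tour should score higher:

```python
# Soundness proxy: adjacent tour pairs should be more similar than scrambled pairs.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def adjacent_mean_sim(order, vecs):
    """Mean cosine similarity over consecutive (cyclic) pairs of the order."""
    pairs = list(zip(order, order[1:] + order[:1]))
    return sum(cosine(vecs[a], vecs[b]) for a, b in pairs) / len(pairs)

# Toy vectors spread along a semicircle: the sorted order is a good tour.
vecs = {i: (math.cos(math.radians(30 * i)), math.sin(math.radians(30 * i)))
        for i in range(6)}
good = [0, 1, 2, 3, 4, 5]   # neighbors differ by 30 degrees
bad = [0, 2, 4, 1, 3, 5]    # scrambled ordering
print(adjacent_mean_sim(good, vecs) > adjacent_mean_sim(bad, vecs))  # True
```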

Word Learning Status as Seen from 1D Word Embeddings

In addition, the excerpt of the output shows the numerals embedded as one contiguous run. Simply inspecting chiVe's word vectors this way already seems to yield various insights.

For example, the fact that semantically coherent runs of words emerge suggests that chiVe's embeddings have successfully learned similar meanings for related words. With a Word2Vec model whose vector training had gone poorly, or one still in the middle of training, such semantic connections would not appear.

1D Word Embeddings from the Perspective of Dimensionality Reduction

Conventional dimensionality reduction typically brings to mind Principal Component Analysis (PCA), which aims to represent the original data in as few dimensions as possible. It is a powerful technique, but applying it to word vectors is tricky: after learning vectors with high representational power (e.g., 300 dimensions), reducing them further when no individual dimension can be interpreted leaves me unsure of what the reduced representation actually shows. PCA is more typically used when the inputs are interpretable but redundant, with the principal components helping to determine how many dimensions are truly needed.

In this case, since no clear interpretation is attached to any individual dimension of the word embeddings, applying PCA directly feels difficult. One-dimensional word embeddings, by contrast, use all dimensions of the word vectors to satisfy soundness: they do not discard the information carried by the learned dimensions (though they do give up completeness), and the resulting embedding is interpretable, as seen in the implementation results. Furthermore, the "memory-intensive" and "time-consuming" challenges mentioned at the beginning are addressed, since compressing to one dimension makes the embedding far more memory- and cost-efficient than using the raw word vectors directly.
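For contrast, here is what a bare-bones PCA-to-1D looks like (a pure-Python power-iteration sketch of my own, not WordTour code): each word does receive a single coordinate, but that coordinate lies along a linear mix of all input dimensions with no human-readable meaning.

```python
def pca_1d(vectors, iters=200):
    """Project vectors onto their first principal component (power iteration)."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - means[j] for j in range(d)] for v in vectors]
    w = [1.0] * d
    for _ in range(iters):
        # Apply X^T X to w, then renormalize, to find the leading direction.
        proj = [sum(r[j] * w[j] for j in range(d)) for r in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    # Final 1D coordinates: projections onto the principal direction.
    return [sum(r[j] * w[j] for j in range(d)) for r in centered]

coords = pca_1d([[0.0, 0.0], [1.0, 0.1], [2.0, 0.2], [3.0, 0.3]])
print(coords)  # monotonically increasing scores along the dominant direction
```

The scores order the points correctly, but unlike the tour output, nothing about the axis itself tells us what "large" or "small" means for a word.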

Elements Needed to Satisfy Completeness

While we have found 1D word embeddings to be interpretable and seemingly useful, there are also challenges. For example, as noted in the author's own explanatory materials, only two words can ever be placed next to the word "animal." Since there are far more than two kinds of animals, it is fundamentally impossible to represent all similar words without omission in a single dimension. Addressing this would require, for instance, methods that capture the structural relationships among words.

Japanese WordNet does preserve hypernym/hyponym relations between words, but ideally such hierarchical structure, including newly trending words, could be learned from large monolingual corpora in an unsupervised manner.

Summary

I applied WordTour, a method for acquiring memory-efficient, cost-effective, and interpretable 1D word embeddings, to the Japanese word embedding vector chiVe. I confirmed from the output that the word embeddings actually satisfy soundness, and I shared my thoughts on the elements necessary for dimensionality reduction and satisfying completeness.

In the field of natural language processing, the development of large-scale language models is progressing with tremendous momentum, but there are still many risks and hurdles to using generative models in a corporate setting. While such development is important, I personally want to continue striving to develop algorithms and models that are a bit easier to interpret, user-friendly, compact, and more powerful than conventional ones, gradually integrating them into everyday engineering tasks.

Discussion

ggp

I have one question: is the Japanese version of the program you created published on GitHub or elsewhere?
I would like to reproduce the Japanese version myself, but I am not sure where to start.

kaeru39

@ggp Thank you for your question.
I'm afraid I have not published a Japanese version of the program; for this article I only swapped out the input file of the wordtour repository.

The steps I followed were:
1) Run the wordtour repository according to its readme.md
2) Since the input file was English data, download the chiVe Japanese embeddings and implement the VectorDumper class described in the article, confirming that its output has the same format as the English data
3) Replace the input file with the Japanese data and run wordtour again

That said, since I have received this comment, I will consider making time to publish it on GitHub. Thank you!