
【LLM Method】TF-IDF explained

Published 2024/09/21

1. What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural language processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (often called a corpus). It combines two key concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).

・TF (term frequency) measures how often the word appears in the document.
・IDF (inverse document frequency) measures how rare the word is across the whole dataset.

TF-IDF formula:
\text{TF-IDF} = TF(\text{term frequency in the document}) \times IDF(\text{rarity of the term across the whole dataset})

2. More Details

Here's a breakdown:

1. Term Frequency (TF)

Definition:
It measures how frequently a term (word) appears in a document.
Formula:
TF(t,d) = \dfrac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}

  • The idea is that if a word appears more frequently in a document, it's likely important. However, TF alone doesn't account for how common a word is across all documents (for example, "the" may appear in a sentence many times, but it wouldn't be an element that characterizes the sentence).
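As a quick sketch, the TF formula above can be implemented in a few lines of Python (a plain whitespace split stands in for real tokenization, and the function name is just for illustration):

```python
from collections import Counter

def term_frequency(term: str, document: list[str]) -> float:
    """TF: number of times `term` occurs, divided by the total number of tokens."""
    return Counter(document)[term] / len(document)

doc = "the cat sat on the mat".split()
print(term_frequency("the", doc))  # 2 occurrences / 6 tokens ≈ 0.333
```

Note how the common word "the" gets the highest TF here even though it says little about the sentence, which is exactly the weakness IDF addresses next.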

2. Inverse Document Frequency (IDF)

Definition:
It measures how unique or rare a term is across all documents in the corpus. Common words like "the" or "is" appear frequently in many documents and are less informative, while rare words (e.g., specific domain terms) are more informative.
Formula:
IDF(t, D) = \log\left(\dfrac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)

  • Words that appear in many documents will have a low IDF score, while words that appear in only a few documents will have a high IDF score.
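The IDF formula can be sketched the same way (a minimal version that assumes the term appears in at least one document, otherwise the division would fail):

```python
import math

def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """IDF: log of (total documents / documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

corpus = [
    "machine learning is great".split(),
    "learning is fun with machine learning".split(),
    "learning deep learning models".split(),
]
print(inverse_document_frequency("learning", corpus))  # log(3/3) = 0.0
print(inverse_document_frequency("deep", corpus))      # log(3/1) ≈ 1.099
```

A term found in every document scores exactly zero, which is why practical implementations often add smoothing to the formula.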

3. TF-IDF Score

  • Definition:
    The TF-IDF score is the product of TF and IDF for each term in a document. It reflects both how frequently a word appears in a document and how rare that word is across the entire corpus.
  • Formula:
    \text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)
    • This score increases when a term appears frequently in a document (high TF) but is rare in the overall corpus (high IDF).
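Putting the two parts together gives the full score (a self-contained sketch; the zero guard for terms absent from the corpus is an added assumption, not part of the formula above):

```python
import math
from collections import Counter

def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF: term frequency in `document` times inverse document frequency in `corpus`."""
    tf = Counter(document)[term] / len(document)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
print(tf_idf("cat", corpus[0], corpus))  # (1/6) * log(2/1) ≈ 0.116
print(tf_idf("the", corpus[0], corpus))  # (2/6) * log(2/2) = 0.0
```

"the" occurs twice in the first document (high TF) yet scores zero because it appears in every document, while "cat" scores well from a single occurrence.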

4. Example

Let's say you're analyzing three documents:

  1. "Machine learning is great."
  2. "Learning is fun with machine learning."
  3. "Learning deep learning models."
  • The term "learning" appears in all documents, so its IDF will be low.
  • The term "deep" appears only in the third document, so its IDF will be high.
  • The TF for "learning" might be high in some documents, but due to its commonality across all documents, the final TF-IDF score for "learning" will be lower than for rarer terms like "deep."

5. Use Cases

  • Search Engines: TF-IDF helps rank documents based on relevance to search queries by identifying important keywords.
  • Text Classification: TF-IDF is commonly used as a feature representation for machine learning models in tasks like sentiment analysis or topic modeling.
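As a toy illustration of the search-engine use case, documents can be ranked by summing the TF-IDF scores of the query terms (a minimal sketch with whitespace tokenization; real engines add smoothing, normalization, and much more):

```python
import math
from collections import Counter

def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    tf = Counter(document)[term] / len(document)
    df = sum(1 for doc in corpus if term in doc)
    return tf * (math.log(len(corpus) / df) if df else 0.0)

def rank(query: str, corpus: list[list[str]]) -> list[int]:
    """Return document indices ordered by summed TF-IDF of the query terms."""
    scores = [sum(tf_idf(t, doc, corpus) for t in query.split()) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

corpus = [
    "machine learning is great".split(),
    "learning is fun with machine learning".split(),
    "learning deep learning models".split(),
]
print(rank("deep learning", corpus))  # the third document (index 2) ranks first
```

"learning" contributes nothing to any score (it appears everywhere, so its IDF is zero), so the rare term "deep" alone decides the ranking.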

Summary

TF-IDF is a well-known and useful technique for NLP tasks. It is also used in data analysis competitions such as Kaggle, for example as an auxiliary loss that indicates the rank of disease symptoms.
