TF-IDF

TF-IDF, or Term Frequency-Inverse Document Frequency, is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is widely used in information retrieval and text mining to identify significant words that can help in classifying and clustering documents.

  • Term Frequency (TF): This measures how frequently a term appears in a document. It is calculated by dividing the number of times a term appears in a document by the total number of terms in that document.

  • Inverse Document Frequency (IDF): This measures how rare, and therefore how informative, a term is across the entire corpus. It is calculated by taking the logarithm of the total number of documents divided by the number of documents containing the term. The more documents a term appears in, the lower its IDF and the less weight it carries.

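The TF-IDF score of a term \(t\) in a document \(d\) is the product of the two quantities:

\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \times \log \frac{N}{n_t} \]

where \(f_{t,d}\) is the number of times \(t\) appears in \(d\), \(N\) is the total number of documents in the corpus, and \(n_t\) is the number of documents that contain \(t\).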
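As a rough illustration, the TF and IDF definitions in the list above can be sketched in a few lines of Python. This is a minimal sketch and not a reference implementation: the function names (`tf`, `idf`, `tf_idf`), the whitespace tokenization, the natural logarithm, and the choice to score a term found in no document as 0 are all assumptions made for the example. Production libraries typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Term frequency: count of `term` divided by the total tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)


def idf(term, corpus_tokens):
    """Inverse document frequency: log(total documents / documents containing `term`)."""
    n_docs = len(corpus_tokens)
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    if n_containing == 0:
        # Assumption for this sketch: a term absent from the corpus scores 0
        # rather than raising a division-by-zero error.
        return 0.0
    return math.log(n_docs / n_containing)


def tf_idf(term, doc_tokens, corpus_tokens):
    """TF-IDF score: the product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)


# Toy corpus (hypothetical data), tokenized by splitting on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for word in ("the", "cat", "mat"):
    print(word, tf_idf(word, corpus[0], corpus))
```

In this toy corpus, "the" appears in every document, so its IDF is log(3/3) = 0 and its TF-IDF score is 0 even though it is the most frequent word in the first document. "mat" appears in only one document and so receives the highest score of the three.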