
Introduction

Given the finite set of all the words in a language, \(\nu\), a sentence in the language is a sequence of words

\[ x_1 \, x_2 \cdots x_n \]

  • Where
    • \(n \ge 1\)
    • \(x_1 \cdots x_{n-1} \in \nu\)
    • \(x_n\) is a special symbol STOP \(\notin \nu\)

The set of all words in a language is assumed to be finite.
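
As a small illustrative example (assuming the words the, dog, and barks belong to \(\nu\)), the following is a sentence with \(n = 4\):

\[ x_1 \, x_2 \, x_3 \, x_4 \;=\; \text{the} \;\; \text{dog} \;\; \text{barks} \;\; \text{STOP} \]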

Let \(\nu^{\dagger}\) be the infinite set of all sentences with the vocabulary \(\nu\).

A language model consists of a finite set \(\nu\) and a function \(p(x_1, x_2, \cdots, x_n)\), such that:

  • For any \(⟨ x_1 \cdots x_n ⟩ \in \nu^\dagger, \, p(x_1, x_2, \cdots, x_n) \ge 0\)

  • \(\sum_{⟨ x_1 \cdots x_n ⟩ \in \nu^\dagger} p(x_1, x_2, \cdots, x_n) = 1\)

So, \(p(x_1, x_2, \cdots, x_n)\) is a probability distribution over the set of sentences \(\nu^\dagger\).

Applications of Language Modeling

The distribution \(p(x_1 \cdots x_n)\) indicates how probable a sentence is in the language. Such a distribution can prove useful in speech recognition or machine translation: candidate outputs generated by these systems can be scored against the language model to check how probable the sentences are.

Frequency Based Modeling

A frequency-based model is given by,

\[ p(x_1, x_2, \cdots, x_n) = \frac{c(x_1 \cdots x_n)}{N} \]

  • Where
    • \(c(x_1 \cdots x_n)\) is the number of times \(x_1 \cdots x_n\) occurs in the training corpus
    • N is the total number of sentences in the corpus.

One major drawback of such a model is that it would assign probability 0 to any sentence not seen in the training corpus.
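
To make this concrete, here is a minimal sketch of a frequency-based model in Python (the toy corpus and sentences below are made up purely for illustration):

```python
from collections import Counter

# Toy training corpus: each sentence is a tuple of words ending with STOP.
corpus = [
    ("the", "dog", "barks", "STOP"),
    ("the", "cat", "sleeps", "STOP"),
    ("the", "dog", "barks", "STOP"),
]

counts = Counter(corpus)   # c(x_1 ... x_n) for every sentence seen in training
N = len(corpus)            # total number of sentences in the corpus

def p_freq(sentence):
    """Frequency-based estimate: c(x_1 ... x_n) / N."""
    return counts[tuple(sentence)] / N

print(p_freq(("the", "dog", "barks", "STOP")))  # 0.666... (seen twice out of 3)
print(p_freq(("the", "cat", "barks", "STOP")))  # 0.0 -- unseen sentence gets probability 0
```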

Markov Models for Fixed Length Sequences

Consider a sequence of random variables, \(X_1, X_2, \cdots, X_n\), where each random variable can take any value in the finite set \(\nu\). For now, \(n\) is assumed to be a fixed number.

The language model aims to find the probability of \(x_1 \cdots x_n\), where \(n \ge 1\) and \(x_i \in \nu\) for \(i = 1 \cdots n\), i.e., to model the joint probability

\[ P(X_1 = x_1, X_2 = x_2, \cdots, X_n = x_n) \]

If \(n\) is a fixed number, there are \(|\nu|^n\) possible sequences of the form \(x_1 \cdots x_n\), which makes it infeasible to enumerate (or estimate a separate parameter for) every possible sequence when \(n\) and \(|\nu|\) are large. This is where Markov models help to build a more compact model.

First-order Markov processes make the assumption that the identity of an element in a sequence depends only on the identity of the previous element, i.e., \(X_i\) is conditionally independent of \(X_1 \cdots X_{i-2}\) given the value of \(X_{i-1}\):

\[
\begin{aligned}
P(X_1 = x_1, \cdots, X_n = x_n) &= P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \cdots, X_{i-1} = x_{i-1}) \\
&= P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_{i-1} = x_{i-1})
\end{aligned}
\]

The first step in the equation above is exact, by the chain rule of probability. The second step is a result of the first-order Markov assumption.

Similarly, second-order Markov models assume that the identity of an element in a sequence depends only on the identity of the previous two elements, i.e., \(X_i\) is conditionally independent of \(X_1 \cdots X_{i-3}\) given the values of \(X_{i-1}\) and \(X_{i-2}\):

\[ P(X_1 = x_1, \cdots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) \]

It is assumed that \(x_{-1} = x_0 = *\), where \(*\) is a special start symbol.

Markov Sequences for Variable-length Sentences

The value \(n\), assumed to be a fixed number in the previous section, is now considered a random variable itself, and the \(n\)th word is always equal to the special symbol STOP.

Using the second-order Markov assumption,

\[ P(X_1 = x_1, \cdots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) \]

  • Where
    • \(n \ge 1\)
    • \(x_n =\) STOP
    • \(x_i \in \nu\) for \(i=1 \cdots (n-1)\)

The process of generating a sequence from the above distribution is as follows (a small code sketch follows the list):

  • Initialize \(i=1\), and \(x_0, x_{-1} = *\)
  • Generate \(x_i\) from the distribution \(P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})\)
  • If \(x_i =\) STOP, return the sequence \(x_1 \cdots x_i\); otherwise, set \(i = i + 1\) and repeat the previous step.
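
A minimal sketch of this generation procedure, assuming a hypothetical function q(w, u, v) that returns \(P(X_i = w \mid X_{i-2} = u, X_{i-1} = v)\) and a vocabulary that includes the STOP symbol:

```python
import random

def generate_sentence(q, vocab_with_stop):
    """Sample a sentence from a second-order (trigram) model.

    q(w, u, v) is assumed to give the probability of word w following the
    bigram (u, v); '*' is the start symbol and 'STOP' ends the sentence.
    """
    u, v = "*", "*"          # x_{-1} = x_0 = *
    sentence = []
    while True:
        words = list(vocab_with_stop)
        weights = [q(w, u, v) for w in words]
        x = random.choices(words, weights=weights, k=1)[0]
        if x == "STOP":
            return sentence
        sentence.append(x)
        u, v = v, x          # shift the context window
```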

Trigram Language Models

Trigram language models are a direct application of second-order Markov models to the language modeling problem. Each sentence is modeled as a sequence of \(n\) random variables, \(X_1, \cdots, X_n\), where \(n\) is itself a random variable.

A trigram model consists of a finite set \(\nu\), and a parameter

\[ q(w \mid u, v) \]

  • Where
    • u, v, w is a trigram
    • \(w \in \nu \cup \{STOP\}\)
    • \(u, v \in \nu \cup \{*\}\)

The value of \(q(w \mid u, v)\) can be interpreted as the probability of seeing the word \(w\) immediately after the bigram \(u, v\).

So, for any sequence \(x_1 \cdots x_n\), where \(x_i \in \nu\) for \(i = 1 \cdots (n-1)\) and \(x_n =\) STOP, the probability of the sentence under the trigram language model is

\[ p(x_1, \cdots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1}) \]

  • Where \(x_0 = x_{-1} = *\)

Trigram assumption: each word depends only on the previous two words. This is essentially the second-order Markov assumption applied to sentences.
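
As a concrete illustration (again assuming the, dog, and barks are in \(\nu\)), the trigram model factorizes the sentence "the dog barks STOP" as

\[ p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \cdot q(\text{dog} \mid *, \text{the}) \cdot q(\text{barks} \mid \text{the}, \text{dog}) \cdot q(\text{STOP} \mid \text{dog}, \text{barks}) \]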

The only step remaining in the trigram language model is the estimation of the parameters \(q(w \mid u, v)\). Since the total number of words is \(|\nu|\), the total number of possible parameters is on the order of \(|\nu|^3\). This is a very large number, and hence some kind of indirect estimation process is needed.
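
For a rough sense of scale (assuming, say, a vocabulary of \(10{,}000\) words),

\[ |\nu| = 10^4 \quad \Rightarrow \quad |\nu|^3 = 10^{12} \text{ possible trigram parameters.} \]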

Maximum Likelihood Estimates

This is the most natural solution to the estimation problem described above.

For any \(w, u, v\),

\[ q_{ML}(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)} \]

  • Where
    • \(c(u,v,w)\) is the number of times the trigram \(u, v, w\) is seen in the corpus
    • \(c(u,v)\) is the number of times the bigram \(u, v\) is seen in the corpus

Many of the counts c(u,v,w) and c(u,v) would be 0. This affects the estimates and presents the following flaws (a counting sketch follows the list):

  • \(q(w|u,v) = 0\) whenever \(c(u,v,w) = 0\), which underestimates the probability of the many reasonable trigrams that simply never appear in the training corpus.
  • If the denominator \(c(u,v)\) is 0, then \(q(w|u,v)\) is undefined.
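
A minimal counting sketch of the maximum-likelihood estimate, on a made-up toy corpus (the helper below is purely illustrative):

```python
from collections import defaultdict

def ml_estimator(corpus):
    """Build q_ML(w | u, v) = c(u, v, w) / c(u, v) from a list of sentences."""
    tri = defaultdict(int)   # c(u, v, w)
    bi = defaultdict(int)    # c(u, v), counted as trigram contexts
    for sentence in corpus:
        words = ["*", "*"] + sentence + ["STOP"]
        for u, v, w in zip(words, words[1:], words[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1

    def q_ml(w, u, v):
        if bi[(u, v)] == 0:
            raise ZeroDivisionError("q_ML(w | u, v) is undefined when c(u, v) = 0")
        return tri[(u, v, w)] / bi[(u, v)]

    return q_ml

q = ml_estimator([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
print(q("dog", "*", "the"))   # 0.5  -- c(*, the, dog) = 1, c(*, the) = 2
print(q("cow", "*", "the"))   # 0.0  -- unseen trigram, the first flaw above
```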

Perplexity

Perplexity is one of the evaluation metrics for language models and is calculated on held-out data after the model is trained on some corpus. The held-out data is not used for parameter estimation of the language model.

Consider a test dataset consisting of sentences \(s_1, \cdots, s_m\); then \(p(s_i)\) gives the probability of sentence \(s_i\) under the language model.

A basic measure of the quality of a language model is the probability it assigns to the entire test set, given by

\[ \prod_{i=1}^{m} p(s_i) \]

So, the higher this quantity is, the better the language model is at modeling unseen sentences.

Perplexity is a direct transformation of this basic definition. Let \(M\) be the total number of words in the test corpus; then the average log probability under the model is

\[ l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i) \]

which is the log probability of the entire corpus, divided by the total number of words in the corpus. Again, the higher this value, the better the language model.

Then, perplexity is defined as

\[ \text{Perplexity} = 2^{-l} \]

  • Where
    • \(l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)\), the average log probability defined above

The perplexity is a positive number. The smaller the value of perplexity, the better the language model is at modeling unseen data. Perplexity becomes a quantity to minimize because of the negative exponent applied in the definition.
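
A minimal sketch of this computation, given per-sentence probabilities under some model (the numbers below are made up purely to illustrate the formula):

```python
import math

def perplexity(sentence_probs, sentence_lengths):
    """Perplexity = 2^(-l), where l is the average log2 probability per word."""
    M = sum(sentence_lengths)                          # total words in the test corpus
    l = sum(math.log2(p) for p in sentence_probs) / M  # average log probability
    return 2 ** (-l)

# Three test sentences of 4 words each, all assigned probability (1/1000)^4.
probs = [(1 / 1000) ** 4] * 3
lengths = [4] * 3
print(perplexity(probs, lengths))   # approximately 1000
```

The result, roughly 1000, previews the intuition in the next section: a model that spreads probability uniformly over 1000 options per word has perplexity 1000.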

Intuition for Perplexity

Let the vocabulary \(\nu\) have \(N\) words, and let the model predict a uniform distribution over the vocabulary, i.e.,

\[ q(w \mid u, v) = \frac{1}{N} \quad \text{for all } u, v, w \]

Then, evaluating the sentence probability formula above under this uniform model,

\[ p(s_i) = \left( \frac{1}{N} \right)^{n_i} \]

  • Where \(n_1, n_2, \cdots, n_m\) are the numbers of words in each sentence of the test sample, and \(M = \sum_{i=1}^{m} n_i\).

Substituting this into the definition of perplexity,

\[ l = \frac{1}{M} \sum_{i=1}^{m} \log_2 \left( \frac{1}{N} \right)^{n_i} = \frac{1}{M} \sum_{i=1}^{m} n_i \log_2 \frac{1}{N} = -\log_2 N \]

\[ \text{Perplexity} = 2^{-l} = 2^{\log_2 N} = N \]

So, under a uniform distribution model, the perplexity is equal to the vocabulary size. Perplexity can be considered the effective vocabulary size under the model.

Properties of perplexity:

  • If, for any trigram \(u, v, w\) in the test data, the estimated probability \(q(w \mid u, v) = 0\), then the perplexity will be \(\infty\). This is consistent with the idea that a good model should not assign probability zero to unseen data, and that a good model has low perplexity (see the short derivation after this list).

  • If perplexity is the measure used to evaluate a language model, then estimates of 0 should be avoided at all costs.
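
To spell out the first property: if even one trigram in some test sentence \(s_i\) is assigned probability zero, then

\[ q(w \mid u, v) = 0 \;\Rightarrow\; p(s_i) = 0 \;\Rightarrow\; \log_2 p(s_i) = -\infty \;\Rightarrow\; l = -\infty \;\Rightarrow\; \text{Perplexity} = 2^{-l} = \infty \]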

Linear Interpolation

The following trigram, bigram, and unigram maximum-likelihood estimates are defined,

\[ q_{ML}(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}, \qquad q_{ML}(w \mid v) = \frac{c(v, w)}{c(v)}, \qquad q_{ML}(w) = \frac{c(w)}{c()} \]

  • Where
    • \(c(u,v,w)\) is the number of times trigram u, v, w occurs
    • \(c(v,w)\) is the number of times bigram v, w occurs
    • \(c(w)\) is the number of times unigram w occurs
    • \(c()\) is the total number of words seen in the training corpus

Properties of these models:

  • The unigram model never faces the issue of its numerator or denominator being 0, so the estimate is always well defined and greater than 0. But it completely ignores the context and hence discards valuable information.
  • The trigram model uses the context better than the unigram model, but many of its counts will be 0, rendering the estimate 0 or undefined.
  • The bigram model falls between the two extremes.

Linear interpolation uses all three of these estimates to define the trigram estimate as follows,

\[ q(w \mid u, v) = \lambda_1 \, q_{ML}(w \mid u, v) + \lambda_2 \, q_{ML}(w \mid v) + \lambda_3 \, q_{ML}(w) \]

  • Where
    • \(\lambda_1 + \lambda_2 + \lambda_3 = 1\)
    • \(\lambda_i \ge 0\) for all \(i\)

i.e., the interpolated estimate is a weighted average of the three maximum-likelihood estimates.
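
A minimal sketch of the interpolated estimate, assuming the three maximum-likelihood estimators are available as callables (the lambda values here are arbitrary illustrative weights; in practice they are tuned, for example on held-out data):

```python
def q_interpolated(w, u, v, q_ml_tri, q_ml_bi, q_ml_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linearly interpolated trigram estimate.

    q_ml_tri(w, u, v), q_ml_bi(w, v) and q_ml_uni(w) are assumed to return
    the trigram, bigram and unigram maximum-likelihood estimates.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0
    return l1 * q_ml_tri(w, u, v) + l2 * q_ml_bi(w, v) + l3 * q_ml_uni(w)
```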

Discounting Methods and Katz Back-off

Discounting with Katz back-off is an alternative estimation method that is commonly used in practice.

Consider a bigram language model, where the following parameter is to be estimated,

\[ q(w \mid v) \]

  • Where
    • \(w \in \nu \cup \) {STOP}
    • \(v \in \nu \cup\) {*}

Discounted counts are used to reflect the intuition that counts taken directly from the training corpus systematically over-estimate the probability of bigrams seen in the corpus, and hence underestimate the bigrams not seen in the corpus. So, the discounted count is given by,

\[ c^*(v, w) = c(v, w) - 0.5 \]

  • Where
    • \(c^*(v, w)\) is the discounted count
    • \(c(v,w)\) is the count of the bigram \(v, w\), for bigrams with \(c(v,w) \gt 0\)
    • 0.5 is the discount value

Using the discounted count, the bigram estimate can be written as,

\[ q(w \mid v) = \frac{c^*(v, w)}{c(v)} \quad \text{for bigrams with } c(v, w) > 0 \]

i.e., use the discounted count in the numerator and the regular count in the denominator.

For any context \(v\), there is then a missing probability mass, defined as,

\[ \alpha(v) = 1 - \sum_{w : \, c(v, w) > 0} \frac{c^*(v, w)}{c(v)} \]

The intuition behind discounting methods is to divide this missing mass among the words \(w\) for which \(c(v,w) = 0\).

Formally, for any \(v\), define the sets

\[ \mathcal{A}(v) = \{ w : c(v, w) > 0 \}, \qquad \mathcal{B}(v) = \{ w : c(v, w) = 0 \} \]

Then the estimate is given by,

\[ q_{BO}(w \mid v) =
\begin{cases}
\dfrac{c^*(v, w)}{c(v)} & \text{if } w \in \mathcal{A}(v) \\[2ex]
\alpha(v) \, \dfrac{q_{ML}(w)}{\sum_{w' \in \mathcal{B}(v)} q_{ML}(w')} & \text{if } w \in \mathcal{B}(v)
\end{cases} \]

i.e., if \(c(v,w) \gt 0\), return \({c^*(v, w) \over c(v)}\); otherwise, divide the remaining probability mass \(\alpha(v)\) among the unseen words in proportion to their unigram estimates \(q_{ML}(w)\).
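
A simplified sketch of the bigram back-off estimate, assuming raw bigram and unigram counts are available as dictionaries (the helper name is hypothetical, and corner cases such as a context with no unseen words are not handled):

```python
def q_bo_bigram(w, v, c_bigram, c_unigram, total_words, discount=0.5):
    """Katz back-off estimate q_BO(w | v) for a bigram model.

    c_bigram maps (v, w) pairs to counts, c_unigram maps words to counts,
    and total_words is c(), the total number of words in training.
    """
    # c(v): number of times v occurs as a context, i.e. followed by some word.
    c_v = sum(count for (v2, _), count in c_bigram.items() if v2 == v)

    if c_bigram.get((v, w), 0) > 0:
        # w is in A(v): use the discounted count over the context count.
        return (c_bigram[(v, w)] - discount) / c_v

    # Missing probability mass alpha(v) left over by the discounting.
    alpha = 1.0 - sum((count - discount) / c_v
                      for (v2, _), count in c_bigram.items() if v2 == v)

    # w is in B(v): share alpha(v) in proportion to the unigram estimates.
    unseen = [w2 for w2 in c_unigram if c_bigram.get((v, w2), 0) == 0]
    q_ml_total = sum(c_unigram[w2] / total_words for w2 in unseen)
    return alpha * (c_unigram[w] / total_words) / q_ml_total
```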

This method can be generalized to a trigram language model in a recursive way, i.e., for any bigram \((u, v)\) define the sets

\[ \mathcal{A}(u, v) = \{ w : c(u, v, w) > 0 \}, \qquad \mathcal{B}(u, v) = \{ w : c(u, v, w) = 0 \} \]

and the discounted trigram count,

\[ c^*(u, v, w) = c(u, v, w) - 0.5 \]

  • Where 0.5 is again the discount value.

Then the trigram model is given by,

\[ q_{BO}(w \mid u, v) =
\begin{cases}
\dfrac{c^*(u, v, w)}{c(u, v)} & \text{if } w \in \mathcal{A}(u, v) \\[2ex]
\alpha(u, v) \, \dfrac{q_{BO}(w \mid v)}{\sum_{w' \in \mathcal{B}(u, v)} q_{BO}(w' \mid v)} & \text{if } w \in \mathcal{B}(u, v)
\end{cases} \]

where

\[ \alpha(u, v) = 1 - \sum_{w \in \mathcal{A}(u, v)} \frac{c^*(u, v, w)}{c(u, v)} \]

\(\alpha(u, v)\) is the missing probability mass of the bigram \((u, v)\). Note that this missing mass is distributed in proportion to the bigram back-off estimates \(q_{BO}(w \mid v)\) defined above.
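
Following the same pattern, a sketch of the recursive trigram case, reusing a bigram back-off function such as the hypothetical q_bo_bigram above:

```python
def q_bo_trigram(w, u, v, c_trigram, q_bo_bigram_fn, vocab_with_stop, discount=0.5):
    """Katz back-off estimate q_BO(w | u, v), backing off to the bigram model.

    c_trigram maps (u, v, w) triples to counts; q_bo_bigram_fn(w, v) returns
    a bigram back-off estimate; vocab_with_stop is the set nu union {STOP}.
    """
    # c(u, v): number of times the bigram (u, v) occurs as a context.
    c_uv = sum(count for (u2, v2, _), count in c_trigram.items() if (u2, v2) == (u, v))

    if c_trigram.get((u, v, w), 0) > 0:
        # w is in A(u, v): discounted trigram count over the context count.
        return (c_trigram[(u, v, w)] - discount) / c_uv

    # Missing probability mass alpha(u, v).
    alpha = 1.0 - sum((count - discount) / c_uv
                      for (u2, v2, _), count in c_trigram.items() if (u2, v2) == (u, v))

    # w is in B(u, v): share alpha(u, v) in proportion to the bigram estimates.
    unseen = [w2 for w2 in vocab_with_stop if c_trigram.get((u, v, w2), 0) == 0]
    return alpha * q_bo_bigram_fn(w, v) / sum(q_bo_bigram_fn(w2, v) for w2 in unseen)
```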

References

Language Modeling - Michael Collins
