Next: 11.8 References Up: 11 Mathematical Methods Previous: 11.6 Finite State Technology

11.7 Optimization and Search in Speech and Language Processing

John Bridle
Dragon Systems UK Ltd., Cheltenham, UK

An optimal search method is one that always finds the best solution (or a best solution, if there is more than one). For our purposes best is in terms of the value of a criterion function, which defines a score for any possible solution. Optimization can be defined as the process of finding a best solution, but it is also used, more loosely, meaning to find a sequence of better and better solutions.

Optimization and search are vital to modern speech and natural language processing systems. Although there are optimization techniques which are not normally thought of as search, and search methods which are not optimal or not defined as optimizing anything, most often we are dealing with search methods which seek to optimize a well-defined criterion, and usually we understand the search method in terms of the way it approximates to an optimal search.

Three well-known problems for which optimal search methods are important are: training a set of models for speech recognition, decoding an acoustic pattern in terms of a sequence of word models, and parsing an errorful symbol string with the least number of assumptions of errors. Speech recognition and parsing are combinatorial optimization problems: a solution is a sequence (or more complicated data structure) of symbols, which in a speech recognition system might represent words. Training the parameters of a set of models is a problem in (non-linear) continuous optimization: a solution is a vector of real numbers (e.g., probabilities and parameters of probability density functions). It is common, in speech and language processing, to reserve the term search for combinatorial problems, particularly speech recognition, and to use optimization for continuous problems such as training.

We must never forget that an optimal solution is not necessarily correct: it is only optimal given the way the problem is posed. Speech recognition based on optimal search has the important and useful property that we can usually categorize recognition errors as model errors (the recognition criterion was inappropriate) or search errors (we failed to find the mathematically optimal solution), by checking the value of the criterion for the correct answer as well as the incorrect answer. In the case of model errors we must improve the structure of the models, or the method of training or the training data itself.

A search method for ASR is usually expected to be optimal (to find the best-scoring answer) or at least a good (and preferably controllable) approximation to optimal. The main properties of search methods, apart from optimality, are speed and use of computing resources. Other attributes include delay (not the same as speed) and the provision of extra information, such as alternative answers or details like scores and positions.

For some problems the optimal solution can be computed directly, using standard numerical algorithms. A feed-forward neural network (see section gif) can be used as a classifier without performing a search, and matrix inversion can be used as the basis for some training methods. However, the most successful methods for dealing with timescales and sequences in speech recognition are based on searches.

The most desirable search methods are those that are provably optimal. This is the case with the standard speech recognition algorithms, which are based on dynamic programming and the Markov property. The standard training methods use an efficient re-estimation approach, which usually converges to at least a local optimum in a few iterations through the training data. The standard model structures and training criteria are chosen partly because they are compatible with such efficient search methods. Less efficient methods, such as gradient descent or optimization by simulated annealing, can be used for more difficult (but possibly more appropriate) models and training criteria.

11.7.1 Dynamic Programming-based Search for Speech Recognition

The most important search techniques for combinatorial problems in speech recognition are based on dynamic programming (DP) principles [Bel57]. DP is often the key to efficient searches for the optimum path through a graph structure, when paths are evaluated in terms of an accumulation of scores along a path. The Viterbi algorithm [For73] is an application of DP, and the forward-backward algorithm [Jel76] is very closely related. Three speech recognition problems usually solved by DP are those of unknown timescales, unknown word sequence and unknown word boundary position. Our first example is a simple whole-word speech recognizer.

Consider a simple isolated word discrimination system with one word-model per vocabulary word. Each word model consists of a sequence of states, each state corresponding to a position along the word. Associated with each state is information (the output distribution) about the range of acoustic data (spectrum shapes, etc.) to be expected at that point in the word. Most modern speech recognition systems use word models composed of shared sub-word models (see section 1.5), [Por88].

When given an unknown sound pattern (a sequence of frames of spectrum measurements) we shall decide which word of the vocabulary has been said by picking the word model which fits best.

Different utterances of the same word can have very different timescales: both the overall duration and the details of timing can vary greatly. The degree of fit is usually in two parts: the fit to the spectrum and the fit to the timescale. To keep our example simple, we assume that all we know (or want to use) about the timescales of words is that states are used in strict sequence, each one being used one or more times before moving on to the next one. There are many more elaborate schemes (section 1.5), most of which penalize time-scale distortion. We shall also assume that best means minimum sum of individual degree-of-fit numbers, which might be distances, or strain energy, or negative log probabilities.

If we knew how to align the states of the model with the frames of the unknown pattern we could score the model (and hence the word hypothesis) by combining the spectrum fit scores. We define the model score as the result of choosing the best alignment. We find that score (and the alignment, if we want it) using a DP algorithm, as follows: we introduce an optimal partial score F(i,t), the score for optimally aligning the first i states of the model to the first t frames of the input pattern. We are interested in F(N,T), the score for optimally aligning all N states of the model to all T frames of the input pattern.

For our simple example F can be computed from the DP step:

    F(i,t) = d_i(t) + min[ F(i,t-1), F(i-1,t-1) ],

where d_i(t) is the degree of fit of the i-th state of the word model to the measured spectrum at time t. (If we are going to align state i with frame t we must align frame t-1 with state i or with state i-1.) F(N,T) can be computed by starting with F(1,1) = d_1(1) and working through to frame T.
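This minimum-sum recursion over states and frames can be sketched in a few lines of Python. The function name and the fit callback are illustrative, not from the text; the fit function plays the role of d_i(t), and only state 1 may consume frame 1, so all other entries of the first column stay at infinity.

```python
import math

def align_score(model_states, frames, fit):
    """Optimal alignment score F(N, T) for one word model.

    model_states: list of N state descriptions
    frames: list of T observed frames
    fit: fit(state, frame) -> non-negative degree of fit (lower is better)
    """
    N, T = len(model_states), len(frames)
    INF = math.inf
    # F[i][t] = best score aligning the first i+1 states to the first t+1 frames
    F = [[INF] * T for _ in range(N)]
    F[0][0] = fit(model_states[0], frames[0])
    for t in range(1, T):
        for i in range(N):
            prev = F[i][t - 1]                     # stay in state i
            if i > 0:
                prev = min(prev, F[i - 1][t - 1])  # advance from state i-1
            if prev < INF:
                F[i][t] = fit(model_states[i], frames[t]) + prev
    return F[N - 1][T - 1]
```

With scalar "spectra" and an absolute-difference fit, a pattern that dwells in each state in turn scores zero against a matching two-state model.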

Simple connected word recognition is done by processing all the word models together, and allowing the scores to propagate from the end of one word to the start of another. In practice we choose between words that could end at the current input frame, and propagate just the best one. A pair of arrays, indexed by frame number, can keep track of the identity of the best word ending at each input frame, and the best time for it to start [Vin71]. The start-time for the current word is propagated with F within the words. There are several alternatives and elaborations, e.g., [Ney84].
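The word-level book-keeping can be sketched as follows, in a deliberately simplified form: the DP here runs over precomputed word-versus-segment scores rather than interleaving with the within-word recursion as a real one-pass decoder would, and the names connected_decode and seg_score are illustrative assumptions.

```python
import math

def connected_decode(words, T, seg_score):
    """Best segmentation of T frames into a word sequence.

    words: list of word identities
    seg_score: seg_score(word, s, t) -> score of the word against
               frames s..t-1 (lower is better)
    Returns (best_total_score, best_word_sequence).
    """
    INF = math.inf
    best = [INF] * (T + 1)   # best[t]: best score of a word sequence covering frames 0..t-1
    back = [None] * (T + 1)  # (start_frame, word) of the best word ending at t
    best[0] = 0.0
    for t in range(1, T + 1):
        for s in range(t):
            if best[s] == INF:
                continue
            for w in words:
                sc = best[s] + seg_score(w, s, t)
                if sc < best[t]:
                    best[t] = sc
                    back[t] = (s, w)
    # trace back the word sequence from the final frame
    seq, t = [], T
    while t > 0:
        s, w = back[t]
        seq.append(w)
        t = s
    return best[T], list(reversed(seq))
```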

It is possible to operate such a connected word recognizer continuously using partial trace-back: when all active optimal partial paths agree about the interpretation of some past stretch of input data then nothing in the future data can change our interpretation of that stretch: we can output that result and recover the associated storage.

In a large-vocabulary system we can save a lot of storage and computation by keeping only the relatively good-scoring hypotheses. This pruning or beam search [Low76] can also reduce the delay in the partial trace-back.
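The pruning step itself is simple: at each frame, hypotheses whose accumulated score falls too far behind the current best are discarded. A minimal sketch (names and the dict representation are assumptions for illustration):

```python
def prune_beam(hypotheses, beam_width):
    """Beam pruning: keep only hypotheses whose accumulated score is
    within beam_width of the best score at this frame.

    hypotheses: dict mapping hypothesis id -> accumulated score (lower is better)
    """
    if not hypotheses:
        return {}
    best = min(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s <= best + beam_width}
```

A wider beam makes a search error less likely at the cost of more surviving hypotheses per frame.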

The connected-word search method outlined above processes each input frame as it arrives, and considers all word endings at that time. There is another important class of methods, called stack-decoders, in which words which start at the same time are processed together [Pau92].

We have considered the tasks of finding optimal time alignments and word sequences. It can also be very useful to find close-scoring alternative sequences [SA91], which can be analyzed subsequently using sources of knowledge that it is inconvenient to incorporate into the main search: for instance, alternative acoustic scoring methods or application-specific systems.

Most real-time large-vocabulary systems rely for their speed on a multi-resolution search: an initial fast match, using a relatively crude set of models, eliminates many possibilities, and the fine match can then be done with much less work. Sometimes the searches are performed in alternating forward and backward directions, with a combination of scores used for pruning [ASP91]. Additional discussion of search in HMM-based systems can be found in section 1.5.
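The fast-match/fine-match idea can be sketched as below; the function names, the shortlist size, and the two scoring callbacks are illustrative assumptions, not from the text.

```python
def two_pass_recognize(utterance, words, fast_score, fine_score, shortlist=3):
    """Multi-resolution search: a cheap fast match ranks the vocabulary,
    then the expensive fine match rescores only a short list of survivors.

    fast_score, fine_score: score(word, utterance) -> lower is better
    """
    # First pass: crude models eliminate most of the vocabulary.
    candidates = sorted(words, key=lambda w: fast_score(w, utterance))[:shortlist]
    # Second pass: detailed models decide among the survivors.
    return min(candidates, key=lambda w: fine_score(w, utterance))
```

The fast match must be conservative: if it drops the correct word, no fine match can recover it.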

11.7.2 Training/Learning as Optimization

Training is the process by which useful information in training data is incorporated into models or other forms used in recognition. The term learning is used for the complete system that performs the training. We normally talk about training hidden Markov models, but many artificial neural networks are defined so that they include the training algorithm, so we can refer to the neural network as learning from the data.

Most early artificial intelligence research on automatic learning focussed on problems of learning structure. A classic problem is learning a grammar from example sentences. Learning structure directly is a combinatorial problem, and is very difficult. Among the techniques available are genetic algorithms [Gol89] and optimization by simulated annealing [KGV83]. Most established learning methods used in speech recognition avoid the difficulties by replacing discrete optimization problems with continuous problems. As an example, the rules of a regular grammar can be replaced by a set of probabilities, and the probabilities define a continuous (but bounded) space within which continuous optimization methods can be used. Many types of neural networks can be seen as the result of generalizing discrete, logical systems of rules so that continuous optimization methods can be applied.

In training, we are usually trying to optimize the value of a function E of the training data and of unknown parameters θ, and E is made up of a sum over the training instances x_p: E = Σ_p E_p, where E_p = E(x_p, θ). Evaluation of E for a given θ needs a pass over all the training data. An example from HMMs is the probability of the model generating the training data (usually a very small number!)
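As a concrete instance of a criterion of this additive form, the sketch below computes a negative log-likelihood as a sum of per-instance terms; the one-dimensional Gaussian and the function name are illustrative choices, not from the text.

```python
import math

def criterion(data, mean, var):
    """Negative log-likelihood E(theta) = sum_p E_p, where E_p is the
    negative log probability of training instance x_p under a
    one-dimensional Gaussian with parameters theta = (mean, var).
    Evaluating E requires one pass over all the training data."""
    E = 0.0
    for x in data:
        E += 0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var)
    return E
```

Because the likelihood of the data is a product of very small probabilities, working with the (negative) log, as here, keeps the numbers in a usable range.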

Available methods depend on the type of response surface (the form of E as a function of θ) and the amount of extra information available, such as derivatives.

One of the simplest cases is when E is quadratic in θ. There are standard methods for finding the optimum in a number of evaluations not much more than the dimensionality of θ [PFTV88].

When E is smooth and we can also compute the partial derivatives of E with respect to the components of θ, ∂E/∂θ_j, then there are many gradient-based methods. In the simplest case (gradient descent or ascent) we adjust each θ_j in proportion to ∂E/∂θ_j. Variations of gradient descent, such as momentum smoothing, adaptive step size, conjugate gradients and quasi-Newton methods, are available in the literature, e.g., [PFTV88], and many have been applied for training neural networks of the multi-layer logistic perceptron type [RHW86]. The motivation and application of these methods in speech recognition is discussed in section gif.
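The simplest case, plain gradient descent, can be sketched as follows (the function names, fixed step size, and fixed iteration count are illustrative assumptions; the variations mentioned above replace the fixed step with something adaptive):

```python
def gradient_descent(grad, theta, step=0.1, iters=100):
    """Plain gradient descent: repeatedly move theta against the
    gradient of the criterion E.

    grad: grad(theta) -> list of partial derivatives dE/dtheta_j
    theta: initial parameter vector (list of floats)
    """
    for _ in range(iters):
        g = grad(theta)
        # adjust each component in proportion to (minus) its partial derivative
        theta = [t - step * gj for t, gj in zip(theta, g)]
    return theta
```

On a smooth bowl-shaped criterion such as a quadratic, this converges to the minimum; on the non-convex criteria typical of neural network training it finds only a local optimum.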

However, the most important methods in use in speech recognition are of the expectation-maximization (E-M) type, which are particularly applicable to maximum likelihood training of stochastic models such as HMMs. The basic idea is not difficult. As an example, consider the problem of constructing the type of word model described in section gif. Assume we have a crude version of the word model, and wish to improve it. We first align the states of the model to each of several examples of the word using the DP method outlined above, then re-estimate the output distributions associated with each state, using statistics (such as the mean) of the data frames aligned to that state. This cycle can be repeated, but usually converges well in a few iterations. In the Baum-Welch procedure the alignment is a more sophisticated, probabilistic, assignment of frames to states, computed using the forward-backward algorithm, and convergence is provable for the most common forms of model. For details see section 1.5, or e.g., [Por88].
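The align-then-re-estimate cycle can be sketched as below for the simplest case: scalar frames, squared-error fit, and state means as the only re-estimated statistics. This is a Viterbi-style (hard-assignment) simplification of the Baum-Welch procedure described above, and all names are illustrative.

```python
import math

def viterbi_align(means, frames):
    """Best left-to-right alignment of scalar frames to states with the
    given means, using squared error as the degree of fit.
    Returns the state index assigned to each frame."""
    N, T = len(means), len(frames)
    INF = math.inf
    F = [[INF] * T for _ in range(N)]
    back = [[0] * T for _ in range(N)]
    F[0][0] = (frames[0] - means[0]) ** 2
    for t in range(1, T):
        for i in range(N):
            stay, advance = F[i][t - 1], F[i - 1][t - 1] if i > 0 else INF
            prev, back[i][t] = (stay, i) if stay <= advance else (advance, i - 1)
            if prev < INF:
                F[i][t] = (frames[t] - means[i]) ** 2 + prev
    # trace back the state sequence from the final state
    states, i = [N - 1], N - 1
    for t in range(T - 1, 0, -1):
        i = back[i][t]
        states.append(i)
    return list(reversed(states))

def reestimate(means, examples, iters=5):
    """Viterbi-style training: align each example to the states, then set
    each state's mean to the average of the frames assigned to it."""
    for _ in range(iters):
        sums = [0.0] * len(means)
        counts = [0] * len(means)
        for frames in examples:
            for t, i in enumerate(viterbi_align(means, frames)):
                sums[i] += frames[t]
                counts[i] += 1
        means = [s / c if c else m for s, c, m in zip(sums, counts, means)]
    return means
```

Starting from a crude model, the means typically settle after a handful of such iterations, mirroring the fast convergence noted above.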

11.7.3 Future Directions

It is difficult to separate progress in search and optimization algorithms from the way that the problem is posed (usually this means the form of the models). Dynamic programming style algorithms rely on a key property of the structure of the models: the pattern of immediate dependencies between random variables must form a singly connected graph. More general (but efficient) search and optimization methods would allow the exploration of models in which, for instance, multiple parallel factors explain the data more naturally and more parsimoniously.


