The smallest unit of speech
The field of automated lip-reading draws heavily on work done in audio-based ASR, which is largely built around phonemes. Phonemes are the smallest units of speech, such as the /m/ sound in mind and the /th/ sound in think. ASR systems recognize these phonetic units, and "then the sequence is fed into a language model which generates hypotheses for words and sentences," Bear writes in an earlier paper.
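For readers who want to see that pipeline in concrete terms, the toy Python sketch below mimics it under some simplifying assumptions: a recognized phoneme sequence is looked up in a small pronunciation lexicon, and a crude word-probability table stands in for the language model. The lexicon, phoneme symbols and probabilities are invented for illustration and are not taken from Bear's work or any real ASR system.

```python
# Toy sketch of the ASR pipeline described above (illustrative only).
# A recognizer has produced a phoneme sequence; a pronunciation lexicon
# plus a crude unigram "language model" turn it into ranked word hypotheses.

LEXICON = {
    "mind":  ["m", "ay", "n", "d"],
    "mined": ["m", "ay", "n", "d"],   # homophone: same phonemes, different word
    "think": ["th", "ih", "ng", "k"],
}

# Made-up unigram word probabilities standing in for a real language model.
UNIGRAM = {"mind": 0.6, "mined": 0.1, "think": 0.3}

def word_hypotheses(phonemes):
    """Return words whose pronunciation matches the phoneme sequence,
    ranked by language-model probability."""
    matches = [w for w, pron in LEXICON.items() if pron == phonemes]
    return sorted(matches, key=lambda w: UNIGRAM.get(w, 0.0), reverse=True)

if __name__ == "__main__":
    # The acoustic front end has recognized this phoneme sequence:
    print(word_hypotheses(["m", "ay", "n", "d"]))  # ['mind', 'mined']
```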

Visual speech recognition often works with visemes, the visual counterparts of phonemes. For instance, the closed position of the lips that forms the /m/ sound is visually distinct from the open-lipped /o/ sound.

Phonemes and their accompanying language models have been studied for decades. It is therefore understandable that the younger field of visual ASR draws on this body of knowledge by replacing phonemes with visemes, Bear notes in her earlier work: "Many lip-reading systems recognize the visual units, visemes, and then feed the sequence into an acoustic language model."

Warring factions
The problem with visemes, however, is that there are fewer of them than there are phonemes. The phonemes /p/, /b/ and /m/ all sound different, but they look the same on the lips. Compared to audio systems, visual systems therefore have far fewer distinguishable units to work with. This is one of the reasons visual ASR still underperforms audio-based systems.
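The collapse from phonemes to visemes can be shown with a small sketch. The grouping below is invented for illustration; real viseme inventories, including those Bear studies, vary in size and membership.

```python
# Minimal sketch of why visemes carry less information than phonemes:
# several phonemes that sound different map to the same mouth shape.
# This particular many-to-one grouping is illustrative only.

PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",  # lips pressed closed
    "f": "labiodental", "v": "labiodental",             # lower lip touches teeth
    "th": "dental", "dh": "dental",                     # tongue between teeth
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence into its (ambiguous) viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

# "pat", "bat" and "mat" start with different phonemes...
print(to_visemes(["p"]), to_visemes(["b"]), to_visemes(["m"]))
# ...but all three begin with the same viseme, so a lip-reader cannot
# tell them apart from the first mouth shape alone.
```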

This has led some ASR scientists, including Bear, to ask whether the viseme is the best visual signal to work with. There appear to be two warring factions in the visual ASR community, one side championing visemes as the basic unit and the other elevating phonemes as best equipped for the task.