Lip reading using CNN and LSTM

Amit Garg, Jonathan Noyola, Sameep Bagadia
amit93@stanford.edu, jnoyola@stanford.edu, sameepb@stanford.edu

Abstract

Here we present various methods to predict words and phrases from only video without any audio signal. We employ a VGGNet pre-trained on human faces of celebrities from IMDB and Google Images [1], and explore different ways of using it to handle these image sequences. The VGGNet is trained on images concatenated from multiple frames in each sequence, and is also used in conjunction with LSTMs for extracting temporal information. While the LSTM models fail to outperform other methods for a variety of reasons, the concatenated image model that uses nearest-neighbor interpolation performed well, achieving a validation accuracy of 76%.

1. Introduction

Visual lip-reading plays an important role in human-computer interaction in noisy environments where audio speech recognition may be difficult. It can also be extremely useful as a hearing aid for the hearing-impaired. However, similar to speech recognition, lip-reading systems also face several challenges due to variances in the inputs, such as facial features, skin colors, speaking speeds, and intensities. To simplify the problem, many systems are restricted to limited numbers of phrases and speakers. To further aid in lip-reading, more visual input data can be gathered in addition to color image sequences, such as depth image sequences.

The dataset used here is a set of image sequences (i.e., low-rate videos) that each show a person speaking a word or phrase. The goal is to classify these sequences. One of the main issues that hinders older methods is that sequence lengths, and hence the number of features per sequence, vary widely. Therefore, various methods for capturing temporal information were used to take all input features into account.

The first method consisted of concatenating a fixed number of images from a sequence into one larger image, which can then be passed through a VGGNet (see Figure 1) that used a set of weights pre-trained on faces [1]. This packed each sequence towards the front, leaving blank spaces at the ends of shorter sequences. The second method was similar, except we used nearest-neighbor interpolation to stretch and normalize the number of images per sequence.

The third method first passed each individual image through the VGGNet to extract a set of features, and then passed each sequence of features through several LSTM layers, retrieving the classification label from the final output. This method was attempted both with the VGGNet frozen, which speeds up training, and end-to-end, which takes longer but allows the VGGNet to be trained further on this particular dataset.

In order to make the problem tractable, we formulate it as a classification problem of detecting what words or phrases are being spoken out of a fixed set of known words and phrases. Each method received a single image sequence as input, and produced a single word or phrase classification label as output.

2. Related Work

In this section, we describe the related work done in this field. Most of the work done in lip reading uses non-neural-network approaches. They extract various features out of the image and then use machine learning approaches like SVMs to classify what was spoken.

Rekik et al. [2] propose HMMs as a way to perform lip reading using only image and depth information. The system consists of two main blocks: feature extraction and speech recognition. The first step estimates the speaker's face pose using a 3D face model, including a 3D mouth patch to detect the mouth. This is followed by motion and appearance descriptors to generate features for the model, for instance HOG (histogram of oriented gradients). The second step segments the speech video to look for frames corresponding to the utterance. The features on these frames are fed to the HMM for classification. The dataset is the same as what we
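
To make the two frame-concatenation strategies from the Introduction concrete, the following is a minimal sketch of how the frame slots of the concatenated image might be filled. It is not the authors' code: the function names, the `None` marker for a blank slot, the truncation of sequences longer than the slot count, and the row-major grid layout are all assumptions made for illustration.

```python
from typing import List, Optional


def front_pack_indices(n_frames: int, n_slots: int) -> List[Optional[int]]:
    """First method: pack frames towards the front, leaving blanks at the end.

    None marks a blank slot. Dropping frames beyond n_slots is an assumption;
    the text does not say how longer sequences were handled.
    """
    used = min(n_frames, n_slots)
    return list(range(used)) + [None] * (n_slots - used)


def nearest_neighbor_indices(n_frames: int, n_slots: int) -> List[int]:
    """Second method: nearest-neighbor resampling along the time axis.

    Each slot j samples the source frame under the slot's centre,
    floor((j + 0.5) * n_frames / n_slots), computed with integers only.
    Short sequences are stretched (frames repeat); long ones are thinned.
    """
    if n_frames <= 0 or n_slots <= 0:
        raise ValueError("n_frames and n_slots must be positive")
    return [((2 * j + 1) * n_frames) // (2 * n_slots) for j in range(n_slots)]


def to_grid(slots: list, cols: int) -> List[list]:
    """Arrange the slots row by row into a grid (hypothetical layout)."""
    return [slots[i:i + cols] for i in range(0, len(slots), cols)]


if __name__ == "__main__":
    # A 3-frame sequence placed into 6 slots, shown as a 2 x 3 grid.
    print(to_grid(front_pack_indices(3, 6), cols=3))
    print(to_grid(nearest_neighbor_indices(3, 6), cols=3))
```

With front packing, a short sequence occupies only the first slots of the grid, whereas nearest-neighbor resampling spreads it over every slot, so the same relative moment of an utterance tends to land in the same grid position regardless of speaking speed.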
