Abstract:
We have chosen the paper “LIPNET: END-TO-END SENTENCE-LEVEL
LIPREADING” as the base paper for our Final Year Project.
Lip-reading is the task of decoding text from the movement of a speaker’s mouth.
Traditional approaches separated the problem into two stages: designing or learning visual
features, and prediction. Newer deep lip-reading approaches are end-to-end trainable
(Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models
trained end-to-end performs only word classification, rather than sentence-level
sequence prediction. Studies have shown that human lip-reading performance improves
for longer words (Easton & Basala, 1982), indicating the importance of
features capturing temporal context in an ambiguous communication channel. Motivated by this
observation, our project presents a model that maps a sequence of video frames to text, making use
of spatiotemporal convolutions, a recurrent neural network, and the connectionist
temporal classification (CTC) loss, trained entirely end-to-end. The result is an end-to-end
sentence-level lip-reading model that simultaneously learns spatiotemporal visual features and a
sequence model.
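
As a rough illustration of how a model trained with the connectionist temporal classification loss can turn one prediction per video frame into text, the sketch below applies the standard greedy (best-path) CTC decoding rule: collapse consecutive repeated labels, then drop the blank label. The function name `ctc_greedy_decode`, the blank symbol `_`, and the frame labels are assumptions made for this example only; they are not taken from the base paper or from our implementation.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC blank label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then remove blanks."""
    # groupby merges runs of identical neighbouring labels into one entry
    collapsed = [label for label, _ in groupby(frame_labels)]
    # blanks only separate characters, so they are dropped from the output
    return "".join(label for label in collapsed if label != blank)


# One label per video frame (made-up values, not real model output)
frames = list("__bb_ii_nn__")
print(ctc_greedy_decode(frames))
```

Because a blank between two identical labels keeps them apart, the frame sequence `aa_a` decodes to `aa` rather than `a`, which is how CTC can represent doubled letters.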