Abstract:
Deepfake videos are created using artificial intelligence to generate realistic content that falsely depicts individuals saying or doing things they never did. These videos are crafted with algorithms such as convolutional neural networks (CNNs) and generative adversarial networks (GANs), trained on extensive datasets of a target's images and videos. This allows the model to capture and replicate facial features and expressions accurately, making the manipulated videos appear highly realistic. Despite potential legitimate uses, such as in entertainment, deepfake technology poses serious risks, including misinformation and defamation, making it crucial to develop effective detection methods. These methods typically analyze inconsistencies in facial features, movements, and audio, along with anomalies absent in genuine media. However, evolving deepfake technologies continually challenge detection capabilities, necessitating ongoing advancements in detection techniques that focus on facial analysis.

Our research comprises two phases. In the first phase, we used a time-distributed convolutional network to extract features from videos. The time-distributed layer enables us to create a single representation for a set of video frames rather than treating each frame independently. These embeddings were optimized with metric learning, combining a contrastive loss with the cross-entropy loss to create a network whose embeddings rely less on image-level anomalies and are more attuned to abnormal patterns in facial structure. Our experiments on the FaceForensics dataset, which includes various deepfake generation methods, demonstrated that our network generalizes across different generation techniques.

In the second phase of our thesis, we aimed to enhance efficiency by focusing on key facial areas for deepfake detection. By extracting and analyzing facial landmarks, which represent the face's structure, we observed changes in expressions and emotions during speech over the course of a video. Through experiments, we identified an effective set of landmarks and created a dataset from three deepfake datasets of varying complexity. Local feature descriptors, selected for their size, time complexity, and performance, were used to generate a feature vector for each landmark coordinate. For classification, we employed a graph convolutional neural network, which is well suited to sparse data and enables a lightweight, robust deepfake detector. The graph construction involved segmenting facial regions and establishing semantic relationships based on their impact on natural and manipulated speech, relationships we validated experimentally. The graph network was trained to detect relationships between facial landmarks and their temporal changes, enabling the algorithm to identify inconsistencies or unnatural movements indicative of manipulation.

Our work has assessed the efficacy of employing facial features, as opposed to generic image features, for deepfake detection. Our temporal network demonstrated superior generalization compared to competing approaches, achieving over 90% accuracy across different datasets, results comparable to state-of-the-art (SOTA) work. Additionally, our implementation of the graph convolutional network offered a lightweight yet highly effective deepfake detection solution.
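To make the first phase concrete, the sketch below shows one way a time-distributed encoder with a joint contrastive and cross-entropy objective could be implemented in PyTorch. The backbone, layer sizes, pooling choice, and loss weighting are illustrative assumptions, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDistributedEncoder(nn.Module):
    """Applies the same CNN to every frame, then pools to one clip embedding."""
    def __init__(self, embed_dim=128, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(                      # per-frame feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)           # frame features -> embedding
        self.head = nn.Linear(embed_dim, num_classes)  # real/fake logits

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.proj(self.cnn(clips.flatten(0, 1)))  # shared weights over time
        embedding = feats.view(b, t, -1).mean(dim=1)      # one vector per clip
        return embedding, self.head(embedding)

def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Pull same-label clip embeddings together, push different-label ones apart."""
    d = F.pairwise_distance(emb_a, emb_b)
    return (same_label * d.pow(2)
            + (1 - same_label) * F.relu(margin - d).pow(2)).mean()

# Joint objective (the weight lam is a tunable assumption):
# loss = F.cross_entropy(logits, labels) + lam * contrastive_loss(e_a, e_b, same)
```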
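For the second phase, per-landmark features could be computed along the following lines. The choice of ORB as the local descriptor and the external landmark detector are hypothetical stand-ins for the descriptors compared in the thesis.

```python
import cv2

def landmark_descriptors(gray_frame, landmarks, patch_size=31):
    """Compute a local feature descriptor at each facial landmark coordinate.

    gray_frame: single-channel uint8 frame.
    landmarks:  iterable of (x, y) coordinates from any landmark detector.
    Returns one descriptor row per surviving keypoint (OpenCV drops
    keypoints whose patch falls outside the image border).
    """
    orb = cv2.ORB_create()                       # ORB is one possible choice
    keypoints = [cv2.KeyPoint(float(x), float(y), patch_size)
                 for (x, y) in landmarks]
    keypoints, desc = orb.compute(gray_frame, keypoints)
    return desc                                  # shape: (n_kept, 32), uint8
```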
Our research contributions include a time-distributed CNN-LSTM network for deepfake detection that leverages metric learning for optimized video embeddings. We enhanced our approach by analyzing physiological and local feature descriptors of facial structures, creating a comprehensive graphical facial feature vector. This analysis facilitated the development of a specialized graph kernel based on correlations between facial landmarks, improving speech-pattern analysis. Building on these foundations, we introduced a lightweight deepfake detection framework using a graph convolutional network with neighborhood normalization, sketched below. This framework utilizes spatial data and a correlation-based graph kernel for more effective deepfake classification, enhancing the generalization capabilities of our detection system.
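The sketch below illustrates, under stated assumptions, how a correlation-based graph kernel and neighborhood normalization could fit together: adjacency built by thresholding temporal correlations between landmark trajectories (a plain Pearson correlation standing in for the thesis's kernel), followed by symmetrically normalized graph convolutions.

```python
import torch
import torch.nn as nn

def correlation_adjacency(trajectories, tau=0.5):
    # trajectories: (N, T) motion signal per landmark; connect landmarks whose
    # temporal behavior is strongly correlated (threshold tau is an assumption).
    corr = torch.corrcoef(trajectories).abs()
    return (corr > tau).float()

def normalize_adjacency(adj):
    # Neighborhood normalization: A_hat = D^{-1/2} (A + I) D^{-1/2}
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_hat):                # x: (N, in_dim) node features
        return torch.relu(adj_hat @ self.lin(x))  # aggregate normalized neighbors

class LandmarkGCN(nn.Module):
    """Two GCN layers over landmark nodes, mean-pooled to graph-level logits."""
    def __init__(self, in_dim=32, hidden=64, num_classes=2):
        super().__init__()
        self.g1 = GCNLayer(in_dim, hidden)
        self.g2 = GCNLayer(hidden, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x, adj_hat):
        h = self.g2(self.g1(x, adj_hat), adj_hat)
        return self.head(h.mean(dim=0))           # pool nodes -> real/fake logits
```

Because the adjacency is sparse and shared across frames, the per-clip cost is dominated by a few small matrix products, which is consistent with the lightweight-detector goal stated above.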