Abstract:
Speech synthesizers can produce highly accurate audio that mimics a person’s voice. This capability can be exploited to create false audio recordings, making it difficult to distinguish real from fake communications. Modern speech synthesis technologies, such as deep learning-based models, can produce synthetic speech that is nearly indistinguishable from a human voice. These techniques can reproduce not only the tone and rhythm of a voice but also sophisticated speech patterns, making detection difficult for both human listeners and automated systems. The continuous advance of speech synthesis technologies raises another challenge: the available datasets are outdated and limited. Older datasets may not accurately reflect the current state of synthesized speech, which reduces their diversity and applicability. To address this, we built an extended dataset that combines the ASVspoof 2019 dataset with 3,000 synthesized audio samples that we generated using the ElevenLabs API. We also developed a classifier to distinguish real from synthesized audio. The model takes Mel spectrograms as input and shows promising results, achieving 99.5% accuracy on our dataset. We also compared our results with previously available models for synthesized speech classification, and our model achieved the lowest equal error rate (EER), 1.02%. These findings highlight the potential of our technique to advance the field of synthetic speech classification. Our model improves detection accuracy while also providing a framework for future research on building more resilient defenses against increasingly advanced synthetic speech generation technologies. The study’s findings contribute to a larger effort to protect the authenticity of communication and improve security in the digital age.
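For readers unfamiliar with the equal error rate, the following is a minimal sketch of how such a figure can be computed from detector scores. It is not the evaluation code used in this study. The function name `equal_error_rate`, the convention that a higher score means "more likely synthesized", and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Return (eer, threshold) for a set of detector scores.

    Assumed conventions (hypothetical, not taken from the study):
      scores: higher means "more likely synthesized"
      labels: 1 for synthesized audio, 0 for real audio
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    synthesized = scores[labels == 1]
    real = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one value above the
    # maximum so that "flag nothing" is also considered.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A sample is flagged as synthesized when score >= threshold.
    false_alarm = np.array([(real >= t).mean() for t in thresholds])
    miss = np.array([(synthesized < t).mean() for t in thresholds])

    # The EER is the operating point where the two error rates meet;
    # with finitely many samples we take the closest point and average.
    i = int(np.argmin(np.abs(false_alarm - miss)))
    return (false_alarm[i] + miss[i]) / 2.0, thresholds[i]


if __name__ == "__main__":
    # Toy scores, invented for this example only.
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    toy_labels = [0, 0, 1, 1]
    eer, threshold = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.2%} at threshold {threshold:.2f}")
```

A lower EER means the detector's false-alarm and miss rates are both small at the balanced operating point, which makes it a useful complement to accuracy when the classes are imbalanced.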