Abstract:
The advancement of machine learning and artificial intelligence has revolutionized multimedia production by automating traditionally manual processes. "AutoFoley: Enhancing Videos with Automatic Audio Generation" introduces a system that transforms silent videos into rich, immersive experiences by generating synchronized audio tracks. The system employs a Long-term Recurrent Convolutional Network (LRCN), which combines convolutional layers for spatial feature extraction with recurrent layers for temporal pattern analysis. AutoFoley processes video data by detecting objects, segmenting scenes, and generating contextually relevant audio using pre-trained models and curated datasets. Built on open-source frameworks such as PyTorch, the system produces accurately synchronized, high-quality sound effects, while a Flask backend connects users to the processing pipeline. AutoFoley thus simplifies the traditionally labor-intensive craft of Foley artistry, offering content creators, filmmakers, and multimedia professionals an automated, efficient, and scalable way to enhance video content with realistic, synchronized audio effects.
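
To illustrate the LRCN architecture the abstract describes, the following is a minimal PyTorch sketch: a pretrained CNN extracts spatial features from each frame, and an LSTM models their temporal evolution. The ResNet-18 backbone, layer sizes, and output dimension are illustrative assumptions, not the system's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class LRCN(nn.Module):
    """Sketch of an LRCN: per-frame CNN features fed to an LSTM."""

    def __init__(self, feature_dim=512, hidden_dim=256, out_dim=128):
        super().__init__()
        # Pretrained CNN backbone for spatial features (assumption: ResNet-18).
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier head
        # LSTM over the per-frame feature sequence for temporal patterns.
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        # Project each timestep to an audio-feature vector
        # (assumption: e.g. one spectrogram frame per video frame).
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.view(b * t, c, h, w)).view(b, t, -1)
        seq, _ = self.lstm(feats)
        return self.head(seq)  # (batch, time, out_dim)

model = LRCN()
video = torch.randn(2, 16, 3, 224, 224)  # 2 clips of 16 RGB frames each
audio_features = model(video)
print(audio_features.shape)  # torch.Size([2, 16, 128])
```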