This research presents a system that narrates the surroundings of visually impaired individuals in spoken language using a mobile camera, together with a comparative analysis of several pre-trained models for generating descriptive captions. Globally, approximately 2.2 billion people live with some form of visual impairment or blindness. To address this challenge, we propose an integrated solution that helps visually impaired users comprehend their environment by describing video streams with advanced generative AI techniques.
The cornerstone of our methodology is a pre-trained GPT-4 Vision multimodal model, trained on an extensive dataset comprising 13 million tokens. In addition, we have engineered a client-server socket communication framework so that computationally intensive tasks, particularly video stream preprocessing, are performed primarily server-side.
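To make this split concrete, the sketch below shows one way the client-server pipeline could be wired up: the client captures camera frames, JPEG-compresses them, and streams them over a length-prefixed TCP socket, while the server decodes each frame and queries a GPT-4 Vision style model for a description. The host/port, helper names, prompt, and model identifier are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of the client-server split described above.
# Host/port, framing, and describe_frame's prompt/model are assumptions,
# not the project's exact implementation.
import base64
import socket
import struct

import cv2  # OpenCV: camera capture and JPEG encoding on the client

HOST, PORT = "127.0.0.1", 5050  # assumed server address


def run_client(camera_index: int = 0) -> None:
    """Client side: grab frames from the camera and stream them to the server."""
    cap = cv2.VideoCapture(camera_index)
    with socket.create_connection((HOST, PORT)) as sock:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            _, jpeg = cv2.imencode(".jpg", frame)  # compress before sending
            payload = jpeg.tobytes()
            # Length-prefixed framing so the server knows where each image ends.
            sock.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the connection (raise if the client disconnects)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("client disconnected")
        buf += chunk
    return buf


def describe_frame(b64_jpeg: str) -> str:
    """Server side: ask a GPT-4 Vision style model for a short scene description.

    Assumes the OpenAI Python SDK (>= 1.0) with an API key in OPENAI_API_KEY;
    the model name and prompt are placeholders.
    """
    from openai import OpenAI

    response = OpenAI().chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this scene in one sentence for a blind user."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_jpeg}"}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content


def run_server() -> None:
    """Server side: receive frames, preprocess, and caption them."""
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while True:
                size = struct.unpack(">I", _recv_exact(conn, 4))[0]
                jpeg_bytes = _recv_exact(conn, size)
                caption = describe_frame(base64.b64encode(jpeg_bytes).decode())
                print(caption)  # in the full system this would be spoken aloud via TTS
```

Keeping the heavy steps (decoding, preprocessing, and the model call) on the server means the mobile client only has to capture and compress frames, which matches the design goal stated above.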
A key aspect of our research is the evaluation of generated captions, which are compared against reference captions using established metrics such as BLEU and ROUGE. Recognizing the semantic limitations of these n-gram-based metrics, we also employ a semantic similarity metric for a more nuanced comparison. Together, these measures provide a thorough assessment of how accurately and contextually our system describes scenes for visually impaired users.
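As an illustration of this evaluation step, the sketch below scores a single candidate caption against a reference using commonly available packages (NLTK for BLEU, rouge-score for ROUGE, and sentence-transformers for embedding-based semantic similarity). The specific libraries, embedding model, and example sentences are assumptions, not necessarily the exact tools used in our experiments.

```python
# Sketch of the caption evaluation step; library choices are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "a man crosses the street while holding a white cane"
candidate = "a person with a cane is crossing a busy road"

# BLEU: n-gram precision against the reference, smoothed for short sentences.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (unigrams and longest common subsequence).
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(
    reference, candidate
)

# Semantic similarity: cosine similarity between sentence embeddings,
# which rewards paraphrases that BLEU/ROUGE would penalize.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb_ref, emb_cand = model.encode([reference, candidate], convert_to_tensor=True)
semantic = util.cos_sim(emb_ref, emb_cand).item()

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
print(f"Semantic similarity: {semantic:.3f}")
```

Because the embedding-based score compares meaning rather than surface n-grams, a correct paraphrase of the reference can score highly even when its BLEU and ROUGE values are low, which is why it complements the overlap metrics in our evaluation.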