This research presents a system that narrates the surroundings of visually impaired individuals in spoken language using a mobile camera, together with a comparative analysis of several pre-trained models for generating descriptive captions. Globally, approximately 2.2 billion people live with some form of visual impairment or blindness. To address this challenge, we propose an integrated solution that helps visually impaired users comprehend their environment by describing video streams with advanced generative AI techniques.
The cornerstone of our methodology is a pre-trained GPT-4 Vision multimodal model, trained on an extensive dataset comprising 13 million tokens. In addition, we have engineered a client-server socket communication framework so that computationally intensive tasks, particularly video stream preprocessing, are performed primarily server-side.
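To make this split concrete, the sketch below shows one way the client-server pipeline could be wired up: the client captures camera frames, JPEG-compresses them, and streams them over a length-prefixed TCP socket, while the server decodes each frame and queries a GPT-4 Vision style model for a description. The host/port, helper names, prompt, and model identifier are illustrative assumptions rather than our exact implementation.

```python
# Minimal sketch of the client-server split described above.
# Host/port, framing, and describe_frame's prompt/model are assumptions,
# not the project's exact implementation.
import base64
import socket
import struct

import cv2  # OpenCV: camera capture and JPEG encoding on the client

HOST, PORT = "127.0.0.1", 5050  # assumed server address


def run_client(camera_index: int = 0) -> None:
    """Client side: grab frames from the camera and stream them to the server."""
    cap = cv2.VideoCapture(camera_index)
    with socket.create_connection((HOST, PORT)) as sock:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            _, jpeg = cv2.imencode(".jpg", frame)  # compress before sending
            payload = jpeg.tobytes()
            # Length-prefixed framing so the server knows where each image ends.
            sock.sendall(struct.pack(">I", len(payload)) + payload)


def _recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the connection (raise if the client disconnects)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("client disconnected")
        buf += chunk
    return buf


def describe_frame(b64_jpeg: str) -> str:
    """Server side: ask a GPT-4 Vision style model for a short scene description.

    Assumes the OpenAI Python SDK (>= 1.0) with an API key in OPENAI_API_KEY;
    the model name and prompt are placeholders.
    """
    from openai import OpenAI

    response = OpenAI().chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this scene in one sentence for a blind user."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64_jpeg}"}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content


def run_server() -> None:
    """Server side: receive frames, preprocess, and caption them."""
    with socket.create_server((HOST, PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while True:
                size = struct.unpack(">I", _recv_exact(conn, 4))[0]
                jpeg_bytes = _recv_exact(conn, size)
                caption = describe_frame(base64.b64encode(jpeg_bytes).decode())
                print(caption)  # in the full system this would be spoken aloud via TTS
```

Keeping the heavy steps (decoding, preprocessing, and the model call) on the server means the mobile client only has to capture and compress frames, which matches the design goal stated above.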
A key aspect of our research is the evaluation of generated captions, which are compared against reference captions using established metrics such as BLEU and ROUGE. Recognizing the semantic limitations of these n-gram-based metrics, we also employ a semantic similarity metric for a more nuanced comparison. Together, these measures provide a thorough assessment of how accurately and contextually our system describes scenes for visually impaired users.
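As an illustration of this evaluation step, the sketch below scores a single candidate caption against a reference using commonly available packages (NLTK for BLEU, rouge-score for ROUGE, and sentence-transformers for embedding-based semantic similarity). The specific libraries, embedding model, and example sentences are assumptions, not necessarily the exact tools used in our experiments.

```python
# Sketch of the caption evaluation step; library choices are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

reference = "a man crosses the street while holding a white cane"
candidate = "a person with a cane is crossing a busy road"

# BLEU: n-gram precision against the reference, smoothed for short sentences.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (unigrams and longest common subsequence).
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(
    reference, candidate
)

# Semantic similarity: cosine similarity between sentence embeddings,
# which rewards paraphrases that BLEU/ROUGE would penalize.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb_ref, emb_cand = model.encode([reference, candidate], convert_to_tensor=True)
semantic = util.cos_sim(emb_ref, emb_cand).item()

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
print(f"Semantic similarity: {semantic:.3f}")
```

Because the embedding-based score compares meaning rather than surface n-grams, a correct paraphrase of the reference can score highly even when its BLEU and ROUGE values are low, which is why it complements the overlap metrics in our evaluation.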