
# Amazon ML Challenge - Team DBkaScam

This repository contains the code and resources used by Team DBkaScam for the Amazon ML Challenge 2024. The team consisted of the following members:

1. Arnav Goel
2. Medha Hira
3. Mihir Aggarwal
4. AS Poornash

We were ranked among the Top 10 teams on the leaderboard and had the honor of presenting our solution to Amazon Scientists in the Grand Finale, where we secured an impressive 6th Rank. Here is a viewing link to our final presentation: Link

## Overview

In this challenge, our goal was to create an efficient machine learning model to extract entity values from images. Our solution presents an ensemble over open-source vision-language models such as MiniCPM-2.6 and Qwen2-VL-7B, where each component's OCR capabilities are enhanced through a distinct prompting framework.

## Dataset

- We used the dataset provided by Amazon, which included ~230,000 training images and ~130,000 test images. The dataset is available at the following link: Link.
- Notably, the size of the training set required downsampling and EDA to curate smaller subsets, both for performing supervised fine-tuning efficiently and for building the few-shot exemplar pool.

## Solution Architecture

*(Figure: solution architecture diagram of the voting ensemble.)*

Our approach was designed to optimize accuracy while ensuring generalization. We therefore chose a voting-ensemble approach comprising the two models shown above, and we detail the strategies for each leg of the ensemble in the sections below.
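As a concrete illustration of the voting step itself, the sketch below takes one normalized prediction per leg and returns the majority answer. This is a minimal sketch under stated assumptions (a simple majority with a fall-back to the strongest single leg), not the exact aggregation logic from our pipeline.

```python
from collections import Counter

def ensemble_predict(predictions: list[str]) -> str:
    """Majority vote over normalized predictions, one per ensemble leg.

    Each entry is an extracted entity string, already normalized
    (e.g. "500.0 gram"). On disagreement across all legs, we fall back
    to the first prediction, assumed to come from the strongest leg.
    """
    counts = Counter(p for p in predictions if p)  # ignore empty outputs
    if not counts:
        return ""
    value, freq = counts.most_common(1)[0]
    # Require agreement between at least two legs; otherwise trust leg 1.
    return value if freq >= 2 else predictions[0]

# Example: two legs agree, the third disagrees on the unit.
print(ensemble_predict(["500.0 gram", "500.0 gram", "0.5 kilogram"]))
# -> "500.0 gram"
```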

### Zero-Shot Prompting (ZSP)

*(Figure: the curated zero-shot prompt used with MiniCPM-2.6.)*

The image above shows the curated zero-shot prompt we used to query MiniCPM-2.6 with a product image and extract the required entity value. The output format is deliberately constrained to reduce model hallucinations and simplify post-processing.
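For illustration, a prompt of this kind can be assembled programmatically as below. This is a minimal sketch: the entity names, unit lists, and wording here are placeholders, and the authoritative prompt is the one in the figure above.

```python
# Hypothetical mapping from entity category to its allowed units;
# the challenge appendix defines the real mapping.
ALLOWED_UNITS = {
    "item_weight": ["gram", "kilogram", "ounce", "pound"],
    "height": ["centimetre", "inch", "foot", "metre"],
}

def build_zsp_prompt(entity_name: str) -> str:
    """Build a zero-shot prompt that pins the answer to a fixed format."""
    units = ", ".join(ALLOWED_UNITS[entity_name])
    return (
        f"Look at the product image and extract the value of '{entity_name}'.\n"
        f"Answer strictly as '<number> <unit>' using one of: {units}.\n"
        "If the value is not visible in the image, answer with an empty string."
    )

print(build_zsp_prompt("item_weight"))
```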

### Dynamic Few-Shot Learning (FSL)

*(Figure: the dynamic few-shot prompt template.)*

We employed few-shot learning to correct mistakes the model made frequently under our ZSP approach. We curated a pool of few-shot exemplars built from these frequent errors, segregated by entity category. At inference time, we sampled three exemplars matching the input image and provided them to the model to better steer its output. The prompt for this is shown above.
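A minimal sketch of the exemplar-sampling step, assuming a pool keyed by entity category; the pool contents, file names, and category names here are hypothetical.

```python
import random

# Hypothetical exemplar pool: frequent-error cases keyed by entity category,
# each entry pairing an exemplar image with its gold answer.
EXEMPLAR_POOL = {
    "item_weight": [
        ("exemplar_weight_1.jpg", "2.5 kilogram"),
        ("exemplar_weight_2.jpg", "350.0 gram"),
        ("exemplar_weight_3.jpg", "1.2 pound"),
        ("exemplar_weight_4.jpg", "16.0 ounce"),
    ],
    # ... one list per entity category
}

def sample_exemplars(entity_name: str, k: int = 3) -> list[tuple[str, str]]:
    """Pick k exemplars from the same category as the query image."""
    pool = EXEMPLAR_POOL.get(entity_name, [])
    return random.sample(pool, min(k, len(pool)))
```

The sampled exemplar images and their gold answers can then be interleaved into the conversation ahead of the query image.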

### Supervised Fine-Tuning (SFT)

- We utilized LLaMA-Factory (Zheng et al., 2024) for performing parameter-efficient SFT on Qwen2-VL-7B using 8-bit QLoRA (a rough stand-alone equivalent is sketched below).
- We fine-tuned Qwen2-VL-7B on 150,000 samples with a batch size of 16 for 1 epoch for this experiment.
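Since the actual run went through LLaMA-Factory's recipes, the sketch below is only a rough stand-alone equivalent in plain `transformers` + `peft`: load the base model in 8-bit and attach LoRA adapters. The adapter rank, alpha, and target modules shown are illustrative defaults, not our exact hyperparameters.

```python
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 8-bit quantization (QLoRA-style training).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Attach low-rank adapters; these hyperparameters are illustrative
# defaults, not the values used in our run.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters are trainable
```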

## Post-Processing

Post-processing was an important part of our solution, ensuring that edge cases were handled and the challenge guidelines were followed (a minimal sketch of these rules follows the list below).

1. Handling Edge Cases in Data:
   - Fractions and mixed fractions in images are converted to decimals using regular expressions.
   - Symbols like single (') and double (") quotes, typically representing feet and inches, are standardized to match the training-set unit mapping.
2. Managing Ranges and Unknown Symbols:
   - For value ranges (e.g., a-b), the higher value is selected per the predefined guidelines.
   - Symbols not listed in the reference appendix are removed using instruction-tuned few-shot learning and rule-based algorithms.
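A minimal sketch of such rules, assuming raw model outputs like `1 1/2 inch` or `10-20 gram`; the exact patterns and unit names are placeholders, not our production rules.

```python
import re

def normalize_fractions(text: str) -> str:
    """Convert mixed fractions ('1 1/2') and plain fractions ('3/4') to decimals."""
    def mixed(m: re.Match) -> str:
        return str(int(m.group(1)) + int(m.group(2)) / int(m.group(3)))
    text = re.sub(r"\b(\d+)\s+(\d+)/(\d+)\b", mixed, text)
    text = re.sub(r"\b(\d+)/(\d+)\b",
                  lambda m: str(int(m.group(1)) / int(m.group(2))), text)
    return text

def standardize_quotes(text: str) -> str:
    """Map " and ' (inches/feet) to the unit names used in the training set."""
    text = re.sub(r'(\d+(?:\.\d+)?)\s*"', r"\1 inch", text)
    text = re.sub(r"(\d+(?:\.\d+)?)\s*'", r"\1 foot", text)
    return text

def resolve_range(text: str) -> str:
    """For a range like '10-20 gram', keep the higher value per the guidelines."""
    m = re.match(r"\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(.*)", text)
    if m:
        hi = max(float(m.group(1)), float(m.group(2)))
        return f"{hi} {m.group(3)}".strip()
    return text

print(normalize_fractions("1 1/2 inch"))  # -> "1.5 inch"
print(standardize_quotes('5.5"'))         # -> "5.5 inch"
print(resolve_range("10-20 gram"))        # -> "20.0 gram"
```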

## Results

The table below shows the various model combinations we tried and the final score achieved by our best-performing ensemble.

| Strategy | Model | F1 Score |
|---|---|---|
| ZSP | MiniCPM-2.6 | 66.2 |
| ZSP | InternVL2-8B | 65.9 |
| ZSP + PostProc | MiniCPM-2.6 | 69.3 |
| ZSP + PostProc | InternVL2-8B | 68.2 |
| SFT | Qwen2-7B-SFT | 64.8 |
| FSL | Qwen2-7B | 70.9 |
| Ensemble Methods | ZSP-1 + ZSP-2 | 68.5 |
| Ensemble Methods | SFT + ZSP-1 | 70.7 |
| Ensemble Methods | SFT + FSL | 71.4 |
| Ensemble Methods | SFT + FSL + ZSP-1 | 71.8 |