This repository contains the code and resources used by Team DBkaScam for the Amazon ML Challenge 2024. The team consisted of the following members:
We ranked among the Top 10 teams on the leaderboard and had the honor of presenting our solution to Amazon scientists in the Grand Finale, where we secured 6th rank. Here is a viewing link to our final presentation: Link
In this challenge, our goal was to create an efficient machine learning model to extract entity values from images. Our solution is an ensemble of open-source vision-language models, MiniCPM-2.6 and Qwen2-VL-7B, where each component's OCR capabilities are enhanced through a distinct prompting framework.
- We used the dataset provided by Amazon, which included ~230,000 training images and ~130,000 test images. The dataset is available at the following link: Link.
- Notably, the size of the training dataset mandated downsampling and EDA to derive smaller subsets for efficiently performing supervised fine-tuning and curating few-shot exemplars.
Our approach was designed to optimize accuracy and ensure generalization. We therefore chose a voting-ensemble approach comprising the two models above. We detail the strategies for each leg of the ensemble below:
The image above shows the curated zero-shot prompt we used to query MiniCPM-2.6 with a product image and extract the required entity value. The format is deliberately specific, to reduce model hallucinations and simplify post-processing.
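The exact wording lives in the prompt image above; as a rough, entirely hypothetical illustration, a format-constrained prompt of this kind could be assembled like so (the template text below is our reconstruction, not the team's actual prompt):

```python
def build_prompt(entity_name: str) -> str:
    """Build a format-constrained zero-shot prompt for a given entity.

    The wording here is a hypothetical sketch; the real prompt is shown
    in the image above.
    """
    return (
        f"Extract the value of '{entity_name}' from the product image.\n"
        "Answer strictly in the format '<number> <unit>' (e.g. '34 gram').\n"
        "If the value is not visible, answer with an empty string."
    )
```

Constraining the answer to `<number> <unit>` is what makes the downstream regex-based post-processing tractable.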
We employed few-shot learning to correct frequently observed model mistakes in our ZSP approach. We curated a pool of few-shot exemplars consisting of frequent model errors, segregated by category. At inference time, we sampled three exemplars based on the input image to better steer the model output. The prompt for the same is shared above.
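The category-conditioned sampling step can be sketched as follows; the pool contents and function name are illustrative placeholders (the real pool was curated from frequent model errors):

```python
import random

# Hypothetical exemplar pool keyed by entity category; the actual pool
# was curated from frequent model errors observed during ZSP.
EXEMPLARS = {
    "item_weight": [
        ("img_1.jpg", "500 gram"),
        ("img_2.jpg", "1.2 kilogram"),
        ("img_3.jpg", "250 milligram"),
        ("img_4.jpg", "3 pound"),
    ],
}

def sample_exemplars(entity_name: str, k: int = 3, seed=None):
    """Draw up to k few-shot exemplars for the input image's category."""
    pool = EXEMPLARS.get(entity_name, [])
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

Sampling by category keeps the exemplars relevant to the entity being extracted, rather than showing the model unrelated error cases.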
- We utilised LLaMA-Factory (Zheng et al., 2024) to perform parameter-efficient SFT on Qwen2-VL-7B using 8-bit QLoRA.
- We fine-tuned Qwen2-VL-7B on 150,000 samples with a batch size of 16 for 1 epoch in this experiment.
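A LLaMA-Factory run of this shape is typically driven by a YAML config; the fragment below is a hypothetical sketch matching the hyperparameters above (key names follow the library's published examples and may need adjusting for your version, and the dataset name is a placeholder):

```yaml
# Hypothetical LLaMA-Factory SFT config sketch -- not the team's actual file.
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
stage: sft
finetuning_type: lora
quantization_bit: 8          # 8-bit QLoRA
dataset: amazon_ml_subset    # placeholder dataset name
template: qwen2_vl
per_device_train_batch_size: 16
num_train_epochs: 1.0
```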
Post-processing was an important part of our solution, ensuring that edge cases were handled and the guidelines were adhered to.
- Handling Edge Cases in Data:
  - Fractions and mixed fractions in images are processed using regex expressions to convert them into decimals.
  - Symbols like single (') and double (") quotes, typically representing feet and inches, are standardized to match the training-set mapping.
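These two normalizations can be sketched with standard-library regexes; the exact unit spellings ("foot", "inch") are assumptions based on typical entity-unit mappings, not the team's exact rules:

```python
import re
from fractions import Fraction

def normalize(value: str) -> str:
    """Convert (mixed) fractions to decimals and expand quote symbols."""
    # Mixed fraction, e.g. "1 1/2 inch" -> "1.5 inch"
    m = re.fullmatch(r"(\d+)\s+(\d+)/(\d+)\s*(.*)", value)
    if m:
        whole, num, den, rest = m.groups()
        dec = int(whole) + Fraction(int(num), int(den))
        return f"{float(dec):g} {rest}".strip()
    # Plain fraction, e.g. "3/4 cup" -> "0.75 cup"
    m = re.fullmatch(r"(\d+)/(\d+)\s*(.*)", value)
    if m:
        num, den, rest = m.groups()
        return f"{float(Fraction(int(num), int(den))):g} {rest}".strip()
    # Quote symbols: 5' -> 5 foot, 6" -> 6 inch (assumed unit spellings)
    value = re.sub(r"(\d+(?:\.\d+)?)\s*'", r"\1 foot", value)
    value = re.sub(r'(\d+(?:\.\d+)?)\s*"', r"\1 inch", value)
    return value
```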
- Managing Ranges and Unknown Symbols:
  - For data ranges (e.g., a-b), the higher value is selected based on predefined guidelines.
  - Symbols not listed in the reference appendix are removed using instruction-tuned few-shot learning and rule-based algorithms.
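The range rule is mechanical enough to sketch directly; the function name and the assumption that the unit trails the range are ours:

```python
import re

def resolve_range(value: str) -> str:
    """For 'a-b <unit>' ranges, keep the higher bound per the guidelines."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(.*)", value)
    if not m:
        return value  # not a range; leave unchanged
    lo, hi, unit = m.groups()
    return f"{max(float(lo), float(hi)):g} {unit}".strip()
```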
The results table shows the various model combinations we tried and the final results achieved by our best-performing configuration.
| Strategy | Model | F1 Score |
|---|---|---|
| ZSP | MiniCPM-2.6 | 66.2 |
| ZSP | InternVL2-8B | 65.9 |
| ZSP + PostProc | MiniCPM-2.6 | 69.3 |
| ZSP + PostProc | InternVL2-8B | 68.2 |
| SFT | Qwen2-7B-SFT | 64.8 |
| FSL | Qwen2-7B | 70.9 |
| Ensemble Methods | ZSP-1 + ZSP-2 | 68.5 |
| Ensemble Methods | SFT + ZSP-1 | 70.7 |
| Ensemble Methods | SFT + FSL | 71.4 |
| Ensemble Methods | SFT + FSL + ZSP-1 | 71.8 |
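The ensemble rows combine the per-model outputs by voting; a minimal sketch of a majority vote with ties broken by model priority (the tie-breaking rule is our assumption) could look like:

```python
from collections import Counter

def vote(predictions: list) -> str:
    """Majority vote over model predictions, ignoring empty answers.

    Ties are broken by list order, i.e. earlier (higher-priority) models
    win -- an assumed rule, not necessarily the team's exact scheme.
    """
    filtered = [p for p in predictions if p]
    if not filtered:
        return ""
    counts = Counter(filtered)
    top = max(counts.values())
    for p in filtered:  # first prediction reaching the top count wins
        if counts[p] == top:
            return p

# Two legs agree on "500 gram", one dissents -> "500 gram" wins.
vote(["500 gram", "500 gram", "0.5 kilogram"])
```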