This repository contains the code and resources used by Team DBkaScam for the Amazon ML Challenge 2024. The team consisted of the following members:
We ranked among the Top 10 teams on the leaderboard and had the honor of presenting our solution to Amazon scientists in the Grand Finale, where we secured 6th rank. Here is a viewing link to our final presentation: Link
In this challenge, our goal was to create an efficient machine learning model to extract entity values from images. Our solution is an ensemble of open-source vision-language models, MiniCPM-2.6 and Qwen2-VL-7B, where each component's OCR capabilities are enhanced through a distinct prompting framework.
- We used the dataset provided by Amazon, which included ~230,000 training images and ~130,000 test images. The dataset is available at the following link: Link.
- Notably, the size of the training dataset mandated downsampling and EDA to derive smaller subsets for efficiently performing supervised fine-tuning and curating few-shot exemplars.
Our approach was designed to optimize accuracy and ensure generalization. We therefore chose a voting-ensemble approach comprising the two models above. We detail the strategies for each leg of the ensemble below:
The image above shows the curated zero-shot prompt we used to query MiniCPM-2.6 with a product image and extract the required entity value. The format is deliberately specific, to reduce model hallucinations and simplify post-processing.
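The exact wording lives in the prompt image above; as a rough, entirely hypothetical illustration, a format-constrained prompt of this kind could be assembled like so (the template text below is our reconstruction, not the team's actual prompt):

```python
def build_prompt(entity_name: str) -> str:
    """Build a format-constrained zero-shot prompt for a given entity.

    The wording here is a hypothetical sketch; the real prompt is shown
    in the image above.
    """
    return (
        f"Extract the value of '{entity_name}' from the product image.\n"
        "Answer strictly in the format '<number> <unit>' (e.g. '34 gram').\n"
        "If the value is not visible, answer with an empty string."
    )
```

Constraining the answer to `<number> <unit>` is what makes the downstream regex-based post-processing tractable.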
We employed few-shot learning to correct frequently observed model mistakes in our ZSP approach. We curated a pool of few-shot exemplars consisting of frequent model errors, segregated by category. At inference time, we sampled three exemplars based on the input image to better steer the model output. The prompt for the same is shared above.
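The category-conditioned sampling step can be sketched as follows; the pool contents and function name are illustrative placeholders (the real pool was curated from frequent model errors):

```python
import random

# Hypothetical exemplar pool keyed by entity category; the actual pool
# was curated from frequent model errors observed during ZSP.
EXEMPLARS = {
    "item_weight": [
        ("img_1.jpg", "500 gram"),
        ("img_2.jpg", "1.2 kilogram"),
        ("img_3.jpg", "250 milligram"),
        ("img_4.jpg", "3 pound"),
    ],
}

def sample_exemplars(entity_name: str, k: int = 3, seed=None):
    """Draw up to k few-shot exemplars for the input image's category."""
    pool = EXEMPLARS.get(entity_name, [])
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

Sampling by category keeps the exemplars relevant to the entity being extracted, rather than showing the model unrelated error cases.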
- We utilised LLaMA-Factory (Zheng et al., 2024) to perform parameter-efficient SFT on Qwen2-VL-7B using 8-bit QLoRA.
- We fine-tuned Qwen2-VL-7B on 150,000 samples with a batch size of 16 for 1 epoch in this experiment.
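A LLaMA-Factory run of this shape is typically driven by a YAML config; the fragment below is a hypothetical sketch matching the hyperparameters above (key names follow the library's published examples and may need adjusting for your version, and the dataset name is a placeholder):

```yaml
# Hypothetical LLaMA-Factory SFT config sketch -- not the team's actual file.
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
stage: sft
finetuning_type: lora
quantization_bit: 8          # 8-bit QLoRA
dataset: amazon_ml_subset    # placeholder dataset name
template: qwen2_vl
per_device_train_batch_size: 16
num_train_epochs: 1.0
```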
Post-processing was an important part of our solution, ensuring that edge cases were handled and the guidelines were adhered to.
- Handling Edge Cases in Data:
  - Fractions and mixed fractions in images are processed using regex expressions to convert them into decimals.
  - Symbols like single (') and double (") quotes, typically representing feet and inches, are standardized to match the training-set mapping.
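These two normalizations can be sketched with standard-library regexes; the exact unit spellings ("foot", "inch") are assumptions based on typical entity-unit mappings, not the team's exact rules:

```python
import re
from fractions import Fraction

def normalize(value: str) -> str:
    """Convert (mixed) fractions to decimals and expand quote symbols."""
    # Mixed fraction, e.g. "1 1/2 inch" -> "1.5 inch"
    m = re.fullmatch(r"(\d+)\s+(\d+)/(\d+)\s*(.*)", value)
    if m:
        whole, num, den, rest = m.groups()
        dec = int(whole) + Fraction(int(num), int(den))
        return f"{float(dec):g} {rest}".strip()
    # Plain fraction, e.g. "3/4 cup" -> "0.75 cup"
    m = re.fullmatch(r"(\d+)/(\d+)\s*(.*)", value)
    if m:
        num, den, rest = m.groups()
        return f"{float(Fraction(int(num), int(den))):g} {rest}".strip()
    # Quote symbols: 5' -> 5 foot, 6" -> 6 inch (assumed unit spellings)
    value = re.sub(r"(\d+(?:\.\d+)?)\s*'", r"\1 foot", value)
    value = re.sub(r'(\d+(?:\.\d+)?)\s*"', r"\1 inch", value)
    return value
```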
- Managing Ranges and Unknown Symbols:
  - For data ranges (e.g., a-b), the higher value is selected based on predefined guidelines.
  - Symbols not listed in the reference appendix are removed using instruction-tuned few-shot learning and rule-based algorithms.
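The range rule is mechanical enough to sketch directly; the function name and the assumption that the unit trails the range are ours:

```python
import re

def resolve_range(value: str) -> str:
    """For 'a-b <unit>' ranges, keep the higher bound per the guidelines."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*(.*)", value)
    if not m:
        return value  # not a range; leave unchanged
    lo, hi, unit = m.groups()
    return f"{max(float(lo), float(hi)):g} {unit}".strip()
```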
The results table shows the various model combinations we tried and the final results achieved by our best-performing configuration.
| Strategy | Model | F1 Score |
|---|---|---|
| ZSP | MiniCPM-2.6 | 66.2 |
| ZSP | InternVL2-8B | 65.9 |
| ZSP + PostProc | MiniCPM-2.6 | 69.3 |
| ZSP + PostProc | InternVL2-8B | 68.2 |
| SFT | Qwen2-7B-SFT | 64.8 |
| FSL | Qwen2-7B | 70.9 |
| Ensemble Methods | ZSP-1 + ZSP-2 | 68.5 |
| Ensemble Methods | SFT + ZSP-1 | 70.7 |
| Ensemble Methods | SFT + FSL | 71.4 |
| Ensemble Methods | SFT + FSL + ZSP-1 | 71.8 |
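The ensemble rows combine the per-model outputs by voting; a minimal sketch of a majority vote with ties broken by model priority (the tie-breaking rule is our assumption) could look like:

```python
from collections import Counter

def vote(predictions: list) -> str:
    """Majority vote over model predictions, ignoring empty answers.

    Ties are broken by list order, i.e. earlier (higher-priority) models
    win -- an assumed rule, not necessarily the team's exact scheme.
    """
    filtered = [p for p in predictions if p]
    if not filtered:
        return ""
    counts = Counter(filtered)
    top = max(counts.values())
    for p in filtered:  # first prediction reaching the top count wins
        if counts[p] == top:
            return p

# Two legs agree on "500 gram", one dissents -> "500 gram" wins.
vote(["500 gram", "500 gram", "0.5 kilogram"])
```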