Implement Vision Transformers and cyclic refine/retrain endpoint #67
Labels
esmero-nlp
Natural Language Processing as API
ML Xperiments
Distrust through research and (in)validation
What?
Since we began our ML explorations, some models and approaches have matured. Our existing image models (MobileNet + YOLO) do a decent job of finding "similarities" (and YOLO is not bad at image segmentation), but honestly the results are not good enough for actual Field-specific matching.
That said, Vision Transformers (ViT) seem to offer better zero-shot, semantics-based similarity embedding generation, and I want to give them a try.
The Google ViT, which can also be refined by re-training on a few extra images (hence the "cyclic" idea), generates embeddings of dimension 768, which also matches the dimension our Archipelago Strawberryfield code for SBFlavors already uses.
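To make that concrete, here is a minimal sketch of pulling a 768-dim embedding from an image. Assumptions (not our endpoint code): the `google/vit-base-patch16-224-in21k` checkpoint via Hugging Face transformers, CLS-token pooling, and a placeholder image path.

```python
from PIL import Image
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("some_ado_image.jpg").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLS-token hidden state: one vector per image, hidden size 768,
# the same dimension SBFlavors already handles.
embedding = outputs.last_hidden_state[:, 0]
print(embedding.shape)  # torch.Size([1, 768])
```

And the "cyclic" refine step could look roughly like this: bolt a small classification head on top and re-train on a handful of curated images. Dummy tensors stand in for a real batch; the label count and learning rate are placeholders, not decisions.

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=3  # hypothetical label count
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Stand-in batch: 4 already-preprocessed images and their labels
pixel_values = torch.randn(4, 3, 224, 224)
labels = torch.tensor([0, 1, 2, 0])

model.train()
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()  # one gradient step per "cycle" of new images
optimizer.step()
optimizer.zero_grad()
```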
I will also open one tomorrow for CLIP, which uses the same vector space for text and images (an idea I had when we started, but these people might be smarter than me). This allows a phrase like "Has a red Car and a blue one" and an image of a "red Car" to be encoded as compatible vectors, so dot products can be computed between "textual representations" and "images", but also "image to image". The Apple one, trained on 5 billion images, might be a good experiment!
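For reference, a hedged sketch of that shared vector space, assuming the openly available `openai/clip-vit-base-patch32` checkpoint (not necessarily the Apple-trained one) and a placeholder image:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("red_car.jpg").convert("RGB")  # placeholder path
texts = ["Has a red Car and a blue one", "a photo of a cat"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# L2-normalize so the dot product is a cosine similarity; the same math
# works for text-to-image and image-to-image comparisons.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(text_emb @ image_emb.T)  # higher score = closer match
```

Because both modalities land in the same normalized space, a single dot product serves "text finds image" and "image finds image" queries alike.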
All of this is to be evaluated and follows the same rules as before: no data is shared with the outside, and all vectors are indexed internally.
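As a toy illustration of "indexed internally": once embeddings exist, the whole similarity step reduces to normalized dot products that can run entirely on our own infrastructure. Random vectors stand in for real embeddings here; the actual index backend is out of scope.

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 768)).astype(np.float32)  # stored embeddings
index /= np.linalg.norm(index, axis=1, keepdims=True)

query = rng.normal(size=768).astype(np.float32)  # a new image's embedding
query /= np.linalg.norm(query)

scores = index @ query                 # cosine similarities, all local
top5 = np.argsort(scores)[::-1][:5]    # best matches, never leaving our servers
print(top5, scores[top5])
```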