From f2294274cdad0c5b401a0c82d429a8ce1bd9e5e5 Mon Sep 17 00:00:00 2001
From: Anssi Kostiainen
Date: Thu, 11 Jan 2024 11:14:02 +0200
Subject: [PATCH 1/2] Revise use cases with transformers

Add new use cases:
- Text-to-image
- Speech Recognition
- Text Generation
- (Image Segmentation now refers to [SegAny] also)

Add new reference models:
- Text-to-image: stable-diffusion-v1-5
- Image segmentation: segment-anything
- Speech-to-text: whisper-tiny.en
- Text generation: t5-small, m2m100_418M, gpt2, llama-2-7b

Remove redundant local reference: [POWERFUL-FEATURES]
---
 index.bs | 216 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 209 insertions(+), 7 deletions(-)

diff --git a/index.bs b/index.bs
index 178a924e..0198225d 100644
--- a/index.bs
+++ b/index.bs
@@ -437,8 +437,8 @@ A user joins a teleconference via a web-based video conferencing application at
 her desk since no meeting room in her office is available. During the
 teleconference, she does not wish that her room and people in the background are
 visible. To protect the privacy of the other people and the surroundings, the
-application runs a machine learning model such as [[DeepLabv3+]] or
-[[MaskR-CNN]] to semantically split an image into segments and replaces
+application runs a machine learning model such as [[DeepLabv3+]], [[MaskR-CNN]]
+or [[SegAny]] to semantically split an image into segments and replaces
 segments that represent other people and background with another picture.
 
 ### Skeleton Detection ### {#usecase-skeleton-detection}
@@ -490,6 +490,20 @@ For better accessibility, a web-based presentation application provides
 automatic image captioning by running a machine learning model such as
 [[im2txt]] which predicts explanatory words of the presentation slides.
 
+### Text-to-image ### {#usecase-text-to-image}
+
+Images are a core part of modern web experiences. An ability to in a
+privacy-preserving manner generate images based on text input enables visual
+personalization and adaptation of web applications and content. For example, a web
+application can use as an input a natural language description on the web page
+or a description provided by the user within a text prompt to produce an
+image matching the text description. This text-to-image use case enabled by
+latent diffusion model architecture [[LDM]] forms the basis for additional
+text-to-image use cases. For example, inpainting where a portion of an existing
+image on the web page is selectively modified using the newly generated content,
+or the converse, outpainting, where an original image is extended beyond its
+original dimensions filling the empty space with generated content.
+
 ### Machine Translation ### {#usecase-translation}
 
 Multiple people from various countries are talking via a web-based real-time
@@ -520,6 +534,29 @@ noise suppression using Recurrent Neural Network such as [[RNNoise]] for
 suppressing background dynamic noise like baby cry or dog barking to improve
 audio experiences in video conferences.
 
+### Speech Recognition ### {#usecase-speech-recognition}
+
+Speech recognition, also known as speech to text, enables recognition and
+translation of spoken language into text. Example applications of speech
+recognition include transcription, automatic translation, multimodal interaction,
+real-time captioning and virtual assistants. Speech recognition improves
+accessibility of auditory content and makes it possible to interact with such
+content in a privacy-preserving manner in a textual form. Examples of common
+use cases include watching videos or participating online meetings using
+real-time captioning. Models such as [[Whisper]] approach humans in their accuracy
+and robustness and are well positioned to improve accessibility of such use cases.
+
+### Text Generation ### {#usecase-text-generation}
+
+Various text generation use cases are enabled by large language models (LLM) that
+are able to perform tasks where a general ability to predict the next item
+in a text sequence is required. This class of models can translate texts, answer
+questions based on a text input, summarize a larger body of text, or generate
+text output based on a textual input. LLMs enable better performance compared to
+older models based on RNN, CNN, or LSTM architectures and further improve the
+performance of many other use cases discussed in this section.
+Examples of LLMs include [[t5-small]], [[m2m100_418M]], [[gpt2]], and [[llama-2-7b]].
+
 ### Detecting fake video ### {#usecase-detecting-fake-video}
 
 A user is exposed to realistic fake videos generated by ‘deepfake’ on the web.
@@ -6530,6 +6567,25 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
     ],
     "date": "January 2018"
   },
+  "SegAny": {
+    "href": "https://arxiv.org/abs/2304.02643",
+    "title": "Segment Anything",
+    "authors": [
+      "Alexander Kirillov",
+      "Alex Berg",
+      "Chloe Rolland",
+      "Eric Mintun",
+      "Hanzi Mao",
+      "Laura Gustafson",
+      "Nikhila Ravi",
+      "Piotr Dollar",
+      "Ross Girshick",
+      "Spencer Whitehead",
+      "Wan-Yen Lo",
+      "Tete Xiao"
+    ],
+    "date": "April 2023"
+  },
   "PoseNet": {
     "href": "https://medium.com/tensorflow/real-time-human-pose-estimation-in-the-browser-with-tensorflow-js-7dd0bc881cd5",
     "title": "Real-time Human Pose Estimation in the Browser with TensorFlow.js",
@@ -6607,6 +6663,18 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
     ],
     "date": "September 2016"
   },
+  "LDM": {
+    "href": "https://arxiv.org/abs/2112.10752",
+    "title": "High-Resolution Image Synthesis with Latent Diffusion Models",
+    "authors": [
+      "Robin Rombach",
+      "Andreas Blattmann",
+      "Dominik Lorenz",
+      "Patrick Esser",
+      "Björn Ommer"
+    ],
+    "date": "April 2022"
+  },
   "GNMT": {
     "href": "https://github.com/tensorflow/nmt",
     "title": "Neural Machine Translation (seq2seq) Tutorial",
@@ -6680,6 +6748,19 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
     ],
     "date": "September 2017"
   },
+  "Whisper": {
+    "href": "https://arxiv.org/abs/2212.04356",
+    "title": "Robust Speech Recognition via Large-Scale Weak Supervision",
+    "authors": [
+      "Alec Radford",
+      "Jong Wook Kim",
+      "Tao Xu",
+      "Greg Brockman",
+      "Christine McLeavey",
+      "Ilya Sutskever"
+    ],
+    "date": "December 2022"
+  },
   "GRU": {
     "href": "https://arxiv.org/pdf/1406.1078.pdf",
     "title": "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation",
@@ -6772,12 +6853,133 @@ Thanks to Dwayne Robinson for his work investigating and providing recommendatio
     ],
     "date": "November 2019"
   },
-  "POWERFUL-FEATURES": {
-    "href": "https://w3c.github.io/webappsec-secure-contexts/",
-    "title": "Secure Contexts",
+  "t5-small": {
+    "href": "https://jmlr.org/papers/volume21/20-074/20-074.pdf",
+    "title": "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer",
+    "authors": [
+      "Colin Raffel",
+      "Noam Shazeer",
+      "Adam Roberts",
+      "Katherine Lee",
+      "Sharan Narang",
+      "Michael Matena",
+      "Yanqi Zhou",
+      "Wei Li",
+      "Peter J. Liu"
+    ],
+    "date": "June 2020"
+  },
+  "m2m100_418M": {
+    "href": "https://arxiv.org/abs/2010.11125",
+    "title": "Beyond English-Centric Multilingual Machine Translation",
     "authors": [
-      "Mike West"
-    ]
+      "Angela Fan",
+      "Shruti Bhosale",
+      "Holger Schwenk",
+      "Zhiyi Ma",
+      "Ahmed El-Kishky",
+      "Siddharth Goyal",
+      "Mandeep Baines",
+      "Onur Celebi",
+      "Guillaume Wenzek",
+      "Vishrav Chaudhary",
+      "Naman Goyal",
+      "Tom Birch",
+      "Vitaliy Liptchinsky",
+      "Sergey Edunov",
+      "Edouard Grave",
+      "Michael Auli",
+      "Armand Joulin"
+    ],
+    "date": "October 2020"
+  },
+  "gpt2": {
+    "href": "https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf",
+    "title": "Language Models are Unsupervised Multitask Learners",
+    "authors": [
+      "Alec Radford",
+      "Jeffrey Wu",
+      "Rewon Child",
+      "David Luan",
+      "Dario Amodei",
+      "Ilya Sutskever"
+    ],
+    "date": "February 2019"
+  },
+  "llama-2-7b": {
+    "href": "https://arxiv.org/abs/2307.09288",
+    "title": "Llama 2: Open Foundation and Fine-Tuned Chat Models",
+    "authors": [
+      "Hugo Touvron",
+      "Louis Martin",
+      "Kevin Stone",
+      "Peter Albert",
+      "Amjad Almahairi",
+      "Yasmine Babaei",
+      "Nikolay Bashlykov",
+      "Soumya Batra",
+      "Prajjwal Bhargava",
+      "Shruti Bhosale",
+      "Dan Bikel",
+      "Lukas Blecher",
+      "Cristian Canton Ferrer",
+      "Moya Chen",
+      "Guillem Cucurull",
+      "David Esiobu",
+      "Jude Fernandes",
+      "Jeremy Fu",
+      "Wenyin Fu",
+      "Brian Fuller",
+      "Cynthia Gao",
+      "Vedanuj Goswami",
+      "Naman Goyal",
+      "Anthony Hartshorn",
+      "Saghar Hosseini",
+      "Rui Hou",
+      "Hakan Inan",
+      "Marcin Kardas",
+      "Viktor Kerkez",
+      "Madian Khabsa",
+      "Isabel Kloumann",
+      "Artem Korenev",
+      "Punit Singh Koura",
+      "Marie-Anne Lachaux",
+      "Thibaut Lavril",
+      "Jenya Lee",
+      "Diana Liskovich",
+      "Yinghai Lu",
+      "Yuning Mao",
+      "Xavier Martinet",
+      "Todor Mihaylov",
+      "Pushkar Mishra",
+      "Igor Molybog",
+      "Yixin Nie",
+      "Andrew Poulton",
+      "Jeremy Reizenstein",
+      "Rashi Rungta",
+      "Kalyan Saladi",
+      "Alan Schelten",
+      "Ruan Silva",
+      "Eric Michael Smith",
+      "Ranjan Subramanian",
+      "Xiaoqing Ellen Tan",
+      "Binh Tang",
+      "Ross Taylor",
+      "Adina Williams",
+      "Jian Xiang Kuan",
+      "Puxin Xu",
+      "Zheng Yan",
+      "Iliyan Zarov",
+      "Yuchen Zhang",
+      "Angela Fan",
+      "Melanie Kambadur",
+      "Sharan Narang",
+      "Aurelien Rodriguez",
+      "Robert Stojnic",
+      "Sergey Edunov",
+      "Thomas Scialom"
+    ],
+    "date": "July 2023"
   }
 }

From 1d0f5c04967fdd4cec6c3da7f942d8eeb1265e1f Mon Sep 17 00:00:00 2001
From: Anssi Kostiainen
Date: Thu, 18 Jan 2024 21:40:21 +0200
Subject: [PATCH 2/2] Revise use cases with transformers, grammar fixes

Improve grammar based on review comments
---
 index.bs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/index.bs b/index.bs
index 0198225d..490b9a7c 100644
--- a/index.bs
+++ b/index.bs
@@ -492,8 +492,8 @@ automatic image captioning by running a machine learning model such as
 
 ### Text-to-image ### {#usecase-text-to-image}
 
-Images are a core part of modern web experiences. An ability to in a
-privacy-preserving manner generate images based on text input enables visual
+Images are a core part of modern web experiences. An ability to generate images
+based on text input in a privacy-preserving manner enables visual
 personalization and adaptation of web applications and content. For example, a web
 application can use as an input a natural language description on the web page
 or a description provided by the user within a text prompt to produce an
@@ -542,7 +542,7 @@ recognition include transcription, automatic translation, multimodal interaction
 real-time captioning and virtual assistants. Speech recognition improves
 accessibility of auditory content and makes it possible to interact with such
 content in a privacy-preserving manner in a textual form. Examples of common
-use cases include watching videos or participating online meetings using
+use cases include watching videos or participating in online meetings using
 real-time captioning. Models such as [[Whisper]] approach humans in their accuracy
 and robustness and are well positioned to improve accessibility of such use cases.