layout: post

title: Artificial Intelligence Paper Review

tags: ai, ml, paper review, Segmentation, Classification, Inpainting, Image Editing, Face Swap, Video Generation, Diffusion Model, Volume Rendering, Virtual Try On, Voice Conversion, tts, Large Language Model, speech recognition, Object Detection, Fundamental, RAG

comments: false

Retrieval-Augmented Generation

Text Embeddings by Weakly-Supervised Contrastive Pre-training

(DRAGIN) Dynamic Retrieval Augmented Generation based on the Information Needs of Large Language Models

(DeepRAG) Thinking to Retrieval Step by Step for Large Language Models

Large Language Model

(ChipNeMo) Domain-Adapted LLMs for Chip Design

Continual Pre-training of Language Models

(Reuse, Don’t Retrain) A Recipe for Continued Pretraining of Language Models

(SteerLM) Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

Nemotron-4 340B Technical Report

(LogParser-LLM) Advancing Efficient Log Parsing with Large Language Models

Harnessing LLMs for High-level Reasoning Over Spatiotemporal Sensor Traces

(Penetrative AI) Making LLMs Comprehend the Physical World

Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies

DeepSeek-V3 Technical Report

(DeepSeek-R1) Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Visual Language Model

(Video-LLaVA) Learning United Visual Representation by Alignment Before Projection

(VILA) On Pre-training for Visual Language Models

Sigmoid Loss for Language Image Pre-Training

(NVILA) Efficient Frontier Visual Language Models

(Template Matters) Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training

(DePlot) One-shot Visual Language Reasoning by Plot-to-Table Translation

(MatCha) Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

(Pix2Struct) Screenshot Parsing as Pretraining for Visual Language Understanding

Large Model Optimization

(AWQ) Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration

Speculative Decoding

Computer Vision

Segmentation

Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

PortraitNet: Real-time Portrait Segmentation Network for Mobile Device

Real-time Hair Segmentation and Recoloring on Mobile GPUs

TTVOS: Lightweight Video Object Segmentation with Adaptive Template Attention Module and Temporal Consistency Loss

SINet: Extreme Lightweight Portrait Segmentation Networks with Spatial Squeeze Modules and Information Blocking Decoder

(PP-LiteSeg) A Superior Real-Time Semantic Segmentation Model

(SemPLeS) Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation

Object Detection

Scaled-YOLOv4: Scaling Cross Stage Partial Network

Pose Estimation

(OpenPose) Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Image Classification

(Background Splitting) Finding Rare Classes in a Sea of Background

Image Inpainting

(PiiGAN) Generative Adversarial Networks for Pluralistic Image Inpainting

Recurrent Feature Reasoning for Image Inpainting

Image Editing

Spatially-invariant Style-codes Controlled Makeup Transfer

Adaptive semantic attribute decoupling for precise face image editing

(Arbitrary Facial Attribute Editing) Only Change What You Want

Face Swap

(SimSwap) An Efficient Framework For High Fidelity Face Swapping

(MobileFaceSwap) A Lightweight Framework for Video Face Swapping

(MobileFSGAN) Migrating Face Swap to Mobile Devices: A Lightweight Framework and a Supervised Training Solution

(A new face swap method for image and video domains) a technical report

(Smooth-Swap) A Simple Enhancement for Face-Swapping with Smoothness

Region-Aware Face Swapping

GHOST — A New Face Swap Approach for Image and Video Domains

Video Generation

PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering

(MakeItTalk) Speaker-Aware Talking-Head Animation

First Order Motion Model for Image Animation

(DaGAN) Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Thin-Plate Spline Motion Model for Image Animation

(SadTalker) Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Diffusion Model

(InstructPix2Pix) Learning to Follow Image Editing Instructions

High-Resolution Image Synthesis with Latent Diffusion Models

Null-text Inversion for Editing Real Images using Guided Diffusion Models

Volume Rendering

(NeRF) Representing Scenes as Neural Radiance Fields for View Synthesis

(R2L) Distilling Neural Radiance Field to Neural Light Field for Efficient Novel View Synthesis

Real-Time Neural Light Field on Mobile Devices

(Instant-NGP) Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

(MobileNeRF) Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

(Re-ReND) Real-time Rendering of NeRFs across Devices

(BakedSDF) Meshing Neural SDFs for Real-Time View Synthesis

Virtual Try On

(ARShoe) Real-Time Augmented Reality Shoe Try-on System on Smartphones

Natural Language

Text-to-Speech

(YourTTS) Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

(VITS) Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

(NaturalSpeech2) Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

(NaturalSpeech) End-to-End Text to Speech Synthesis with Human-Level Quality

Voice Conversion

Voice Conversion With Just Nearest Neighbors

Low-latency Real-time Voice Conversion on CPU

(QuickVC) Any-To-Many Voice Conversion Using Inverse Short-Time Fourier Transform for Faster Conversion

Speech Recognition

(Whisper) Robust Speech Recognition via Large-Scale Weak Supervision

(WhisperX) Time-Accurate Speech Transcription of Long-Form Audio

Music Fingerprinting

(SpectroMap) Peak detection algorithm for audio fingerprinting

Music Augmentation and Denoising for Peak-Based Audio Fingerprinting

Fundamental

Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Searching for MobileNetV3

Supervised Contrastive Learning

(Wavelet Knowledge Distillation) Towards Efficient Image-to-Image Translation

(Teachers Do More Than Teach) Compressing Image-to-Image Models

Coordinate Attention for Efficient Mobile Network Design

Image Augmentations for GAN Training

Improved Consistency Regularization for GANs

(GraN-GAN) Piecewise Gradient Normalization for Generative Adversarial Networks

Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis

(GAN Compression) Efficient Architectures for Interactive Conditional GANs

Improving GANs with A Dynamic Discriminator

Systematic Analysis and Removal of Circular Artifacts for StyleGAN