Improved training efficiency, longer context window, RLHF fine-tuned chat versions.
Pretrained models (7B–65B) on public datasets, optimized tokenization, smaller vocabulary size.
Large-scale autoregressive model, few-shot learning, 175B parameters.
Replaces some input tokens with plausible alternatives and trains a discriminator to detect which tokens were replaced, instead of predicting masked tokens.
Converts all NLP tasks into a text-to-text format, uses encoder-decoder.
Factorized embeddings, cross-layer parameter sharing, sentence order prediction.
Removes NSP, longer training, dynamic masking, larger batch sizes.
Permutation-based training, bidirectional context, avoids MLM limitations.
Bidirectional masked language modeling (MLM), next sentence prediction (NSP), transformer encoder-only architecture.
Self-attention, positional encoding, encoder-decoder architecture, layer normalization.
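As a quick illustration of the core operation these papers build on, here is a minimal NumPy sketch of scaled dot-product attention; the function name and toy shapes are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy usage: 4 positions, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```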
Uses hierarchical transformers with lightweight decoders for efficient segmentation.
Improves window-based attention, enables training at extreme resolutions.
Adopts masked image modeling (MIM) similar to BERT for image pretraining.
Applies transformers directly to images by splitting them into patches.
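A minimal sketch of the patch-splitting step, assuming square images whose sides divide evenly by the patch size; the linear projection and position embeddings that follow are omitted, and the shapes are illustrative.

```python
import numpy as np

def image_to_patches(img, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Each patch becomes one 'token' of length patch_size * patch_size * C,
    which a linear projection would then map to the transformer width.
    """
    H, W, C = img.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = (img
               .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * C))
    return patches  # (num_patches, patch_dim)

tokens = image_to_patches(np.zeros((224, 224, 3)))   # (196, 768)
```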
Introduces local self-attention via shifted windows, improving computational efficiency.
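A rough sketch of the shifted-window idea, assuming the feature map divides evenly into windows; the attention masking that prevents interaction across the cyclic-shift boundary is omitted, and the names and sizes are illustrative.

```python
import numpy as np

def shifted_window_partition(x, window=7, shift=3):
    """Cyclically shift an (H, W, C) feature map, then split it into
    non-overlapping window x window blocks for local self-attention."""
    H, W, C = x.shape
    x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))    # shifted-window step
    x = (x.reshape(H // window, window, W // window, window, C)
          .transpose(0, 2, 1, 3, 4)
          .reshape(-1, window * window, C))
    return x  # (num_windows, window*window, C): tokens grouped per local window

windows = shifted_window_partition(np.zeros((56, 56, 96)))  # (64, 49, 96)
```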
Applies transformers for object detection without needing anchor boxes.
Uses autoregressive transformers for image generation.
Explores training transformers for images using a similar approach to GPT.
Trained on both images and text, setting new benchmarks for vision-language models.
Unified image-to-text transformer that generates text from images using a single-stage model.
Combines language and vision encoders, achieving state-of-the-art multimodal performance.
Designed for vision-language tasks, enabling few-shot multimodal learning.
Integrates vision and language models for few-shot learning tasks.
Extends BEiT to unify vision and language learning under a masked prediction framework.
Uses caption generation and image-text matching to improve multimodal learning.
Uses contrastive learning with large-scale web image-text pairs for alignment.
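A minimal sketch of the symmetric image-text contrastive objective used by this line of work, assuming a batch of already-computed embeddings; the temperature value and function names are illustrative.

```python
import numpy as np

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; every other
    entry in the same row or column acts as a negative.
    """
    img_emb = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature           # (B, B)

    def cross_entropy(lgts):
        lgts = lgts - lgts.max(axis=-1, keepdims=True)
        log_probs = lgts - np.log(np.exp(lgts).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(log_probs))              # targets are the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```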
Trains vision and language models together for zero-shot learning.
Uses GPT-style transformers to generate images from text prompts.
Combines visual transformers with GPT for caption generation.
Trains image representations using textual supervision.
Introduces the Gemini family of multimodal models, demonstrating capabilities across image, audio, video, and text understanding.
NSA (Natively trainable Sparse Attention) enhances long-context modeling by integrating hardware-aligned optimizations and dynamic hierarchical sparsity, achieving faster, more efficient training without compromising model performance.
State-space models as an alternative to transformers for sequence processing, offering efficiency benefits.
RetNet: Retentive Networks (July 2023). Replaces self-attention with retention mechanisms for improved efficiency and long-context handling.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (May 2022). Introduces FlashAttention, an algorithm that computes exact attention with reduced memory usage and improved speed, enhancing transformer efficiency.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (April 2022). Scaled MoE to trillion-parameter models, activating only a fraction of the network at a time.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (August 2021). Presents ALiBi, a method that allows transformers to handle longer sequences than they were trained on by incorporating linear biases in attention.
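A small sketch of the linear attention biases, assuming a causal setup and a head count that is a power of two; the slope schedule below follows the paper's geometric sequence, and the names are illustrative.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head linear distance penalties added to the attention scores.

    Head h (0-indexed) uses slope 2**(-8 * (h + 1) / num_heads), so nearby
    tokens are penalized less than distant ones.
    """
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]               # j - i
    distance = np.minimum(distance, 0)                   # future positions get no bias (causal mask handles them)
    slopes = 2.0 ** (-8.0 * (np.arange(1, num_heads + 1) / num_heads))
    return slopes[:, None, None] * distance              # (heads, seq, seq), added to QK^T
```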
Perceiver IO: General Perception with Iterative Attention (July 2021). Extends Perceiver by allowing outputs of variable size, making it applicable to diverse tasks.
Improved positional encoding method for long-context transformers.
Processes arbitrary input modalities (text, images, audio) using a cross-attention mechanism.
Reduces self-attention complexity from O(n²) to O(n), making Transformers more efficient.
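A minimal sketch of one way attention becomes linear in sequence length: apply a positive feature map to queries and keys and use associativity so the n × n matrix is never formed. This illustrates the O(n) idea rather than the paper's exact construction; names and the feature map are illustrative.

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0) + 1e-6):
    """O(n) attention: compute phi(Q) @ (phi(K)^T @ V) instead of
    softmax(Q K^T) @ V, so no n x n attention matrix is materialized."""
    Qf, Kf = feature_map(Q), feature_map(K)               # (n, d)
    KV = Kf.T @ V                                         # (d, d_v), cost O(n * d * d_v)
    normalizer = Qf @ Kf.sum(axis=0, keepdims=True).T     # (n, 1)
    return (Qf @ KV) / normalizer                         # (n, d_v)
```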
Uses sparse attention to handle sequences up to 8× longer than standard Transformers.
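One common sparsity pattern is a local sliding window, sketched below as a boolean mask; the paper's exact pattern (global tokens, strides, or dilation) may differ, and the names are illustrative.

```python
import numpy as np

def sliding_window_mask(seq_len, window=4):
    """Boolean attention mask where each token attends only to neighbors
    within +/- window positions, giving O(n * window) instead of O(n^2)."""
    pos = np.arange(seq_len)
    return np.abs(pos[:, None] - pos[None, :]) <= window  # (seq, seq) booleans
```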
Uses locality-sensitive hashing (LSH) for self-attention, reducing memory requirements.
Introduces a recurrence mechanism to process longer-range dependencies efficiently.
Replaces fixed layers with adaptive computation, allowing dynamic depth per token.
Introduces sparse attention for efficient long-sequence processing.
Introduces the Transformer architecture with self-attention and no recurrence.
Introduced Mixture-of-Experts (MoE) to improve efficiency by routing inputs to specialized subnetworks.
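A toy sketch of expert routing, using top-1 gating for simplicity; load-balancing losses, capacity limits, and the paper's exact gating are omitted, and all names and shapes are illustrative.

```python
import numpy as np

def moe_top1_forward(x, gate_W, experts):
    """Route each token to its single highest-scoring expert (top-1 gating),
    so only one expert's parameters are used per token."""
    logits = x @ gate_W                                   # (tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                        # chosen expert per token
    out = np.zeros((x.shape[0], experts[0](x[:1]).shape[-1]))
    for e, expert in enumerate(experts):
        idx = np.where(choice == e)[0]
        if idx.size:
            out[idx] = probs[idx, e:e + 1] * expert(x[idx])  # scale by gate probability
    return out

# Toy usage: 2 experts implemented as simple linear maps
rng = np.random.default_rng(0)
experts = [lambda h, W=rng.normal(size=(16, 16)): h @ W for _ in range(2)]
y = moe_top1_forward(rng.normal(size=(8, 16)), rng.normal(size=(16, 2)), experts)
```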
Multilingual fine-tuning with extended vocabulary, language-specific adapters for non-English support.
Analyzed scaling laws for transformer depth vs width, explored layer-wise importance for reasoning tasks.
Introduced Mixture of Experts (MoE), improved efficiency, retrieval-augmented generation (RAG) integration.
Pretrained (7B, 13B, 70B) models, RLHF for fine-tuned chat, improved safety alignment.
Pretrained models (7B–65B) on public datasets, optimized tokenization, smaller vocabulary size.
Uses 4-bit quantization for efficient training of large models, reducing GPU memory usage.
Efficient fine-tuning for transformers, reducing computational costs while maintaining performance.
Post-training quantization (PTQ) for reducing model size without significant accuracy loss.
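A minimal sketch of one simple post-training quantization scheme (symmetric per-tensor int8); real PTQ methods are considerably more sophisticated, and the names here are illustrative.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor post-training quantization to int8.

    Weights are stored as int8 plus one float scale; dequantization is
    W_q * scale, so model size drops ~4x versus float32 at some accuracy cost.
    """
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

W = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
W_q, scale = quantize_int8(W)
W_hat = W_q.astype(np.float32) * scale        # dequantized approximation of W
```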
Introduces Diffusion Transformers (DiT), replacing the U-Net backbone in diffusion models with a transformer and achieving state-of-the-art image generation results.
Requires fewer steps than DDPMs, achieving high-quality generation more efficiently.
Consistency Models (February 2023). Presents consistency models, a novel approach enabling single-step or few-step sampling in diffusion models while maintaining high-quality generation.
Introduces Imagen, a text-to-image diffusion model outperforming previous generative models by leveraging large-scale language models.
Presents a photorealistic text-to-image generation method using a combination of language models and diffusion models.
Compresses input data into a latent space before diffusion, reducing computational cost.
Introduces Latent Diffusion Models (LDMs), significantly improving computational efficiency for high-resolution image generation.
Proposes a probabilistic model that generates high-quality images through iterative denoising steps, forming the foundation for modern diffusion models.
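A minimal sketch of the forward noising step and the training target, using the linear beta schedule from the paper; the denoising network and the reverse sampling loop are omitted, and the function names are illustrative.

```python
import numpy as np

def ddpm_noise_and_target(x0, t, betas, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.

    A denoising network is trained to predict eps from (x_t, t); generation
    then runs the learned reverse (denoising) chain step by step.
    """
    alphas_bar = np.cumprod(1.0 - betas)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps   # training loss: || eps_theta(x_t, t) - eps ||^2

betas = np.linspace(1e-4, 0.02, 1000)      # linear noise schedule from the paper
rng = np.random.default_rng(0)
x_t, eps = ddpm_noise_and_target(rng.normal(size=(32, 32, 3)), t=500, betas=betas, rng=rng)
```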
Lays the foundation of modern diffusion models by introducing score matching with Langevin dynamics for high-dimensional generative modeling.
Introduces QLoRA, a method that fine-tunes large models using low-rank adaptation while keeping base models quantized, significantly reducing memory requirements.
Compares various parameter-efficient fine-tuning (PEFT) strategies, such as LoRA, adapters, and prefix tuning, highlighting trade-offs in efficiency and performance.
Introduces LoRA, a method that freezes most model weights while injecting trainable low-rank matrices, enabling efficient fine-tuning with minimal additional parameters.
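A minimal sketch of a LoRA-augmented linear layer, assuming NumPy tensors and illustrative rank and alpha values; the zero-initialized B matrix means the layer starts out identical to the frozen pretrained weight.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << d).

    Only A and B (r * (d_in + d_out) parameters) are updated during
    fine-tuning; the pretrained W stays fixed.
    """
    def __init__(self, W, rank=8, alpha=16, rng=np.random.default_rng(0)):
        d_out, d_in = W.shape
        self.W = W                                          # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero-init => no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(np.random.default_rng(1).normal(size=(64, 64)))
y = layer(np.ones((4, 64)))    # identical to x @ W.T until B is trained
```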
Uses reinforcement learning to align models with human preferences.
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-Tuning (October 2021). Shows that prompt tuning with continuous embeddings achieves results competitive with full fine-tuning in large language models.
Proposes prefix-tuning, a method that optimizes soft prompts instead of model parameters, enabling lightweight task adaptation.
Develops AdapterFusion, allowing multiple adapters trained on different tasks to be effectively combined without full fine-tuning.
Introduces adapter layers that allow efficient task adaptation by adding small, trainable modules to frozen pretrained models.
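A minimal sketch of a bottleneck adapter inserted after a (frozen) transformer sublayer, with ReLU as a stand-in nonlinearity and illustrative sizes.

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.

    Only W_down and W_up are trained; the surrounding transformer layer
    stays frozen, so each task adds only a small number of new parameters.
    """
    z = np.maximum(h @ W_down, 0.0)     # (tokens, bottleneck)
    return h + z @ W_up                 # residual connection back to model width

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 768))
out = adapter(h, rng.normal(scale=0.01, size=(768, 64)), rng.normal(scale=0.01, size=(64, 768)))
```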
Introduces AWQ, a quantization technique that considers activation outliers to improve the efficiency of large language models.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (October 2022). Presents GPTQ, a method for post-training quantization of GPT models, maintaining accuracy while reducing model size.
DeepSpeed ZeRO: Towards Training Trillion Parameter Models (July 2022). Describes DeepSpeed ZeRO, a system that enables efficient training of large-scale models by partitioning model states across devices.
Ground-breaking papers in GenAI