Improved training efficiency, longer context window, RLHF fine-tuned chat versions.
Pretrained models (7B–65B) on public datasets, optimized tokenization, smaller vocabulary size.
Large-scale autoregressive model, few-shot learning, 175B parameters.
Replaces some input tokens with plausible alternatives and trains a discriminator to detect which tokens were replaced, instead of predicting masked tokens.
Converts all NLP tasks into a text-to-text format, uses encoder-decoder.
Factorized embeddings, cross-layer parameter sharing, sentence order prediction.
Removes NSP, longer training, dynamic masking, larger batch sizes.
Permutation-based training, bidirectional context, avoids MLM limitations.
Bidirectional masked language modeling (MLM), next sentence prediction (NSP), transformer encoder-only architecture.
Self-attention, positional encoding, encoder-decoder architecture, layer normalization.
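As a quick illustration of the core operation these papers build on, here is a minimal NumPy sketch of scaled dot-product attention; the function name and toy shapes are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)   # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Toy usage: 4 positions, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```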
Uses hierarchical transformers with lightweight decoders for efficient segmentation.
Improves window-based attention, enables training at extreme resolutions.
Adopts masked image modeling (MIM) similar to BERT for image pretraining.
Applies transformers directly to images by splitting them into patches.
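A minimal sketch of the patch-splitting step, assuming square images whose sides divide evenly by the patch size; the linear projection and position embeddings that follow are omitted, and the shapes are illustrative.

```python
import numpy as np

def image_to_patches(img, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Each patch becomes one 'token' of length patch_size * patch_size * C,
    which a linear projection would then map to the transformer width.
    """
    H, W, C = img.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = (img
               .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * C))
    return patches  # (num_patches, patch_dim)

tokens = image_to_patches(np.zeros((224, 224, 3)))   # (196, 768)
```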
Introduces local self-attention via shifted windows, improving computational efficiency.
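A rough sketch of the shifted-window idea, assuming the feature map divides evenly into windows; the attention masking that prevents interaction across the cyclic-shift boundary is omitted, and the names and sizes are illustrative.

```python
import numpy as np

def shifted_window_partition(x, window=7, shift=3):
    """Cyclically shift an (H, W, C) feature map, then split it into
    non-overlapping window x window blocks for local self-attention."""
    H, W, C = x.shape
    x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))    # shifted-window step
    x = (x.reshape(H // window, window, W // window, window, C)
          .transpose(0, 2, 1, 3, 4)
          .reshape(-1, window * window, C))
    return x  # (num_windows, window*window, C): tokens grouped per local window

windows = shifted_window_partition(np.zeros((56, 56, 96)))  # (64, 49, 96)
```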
Applies transformers for object detection without needing anchor boxes.
Uses autoregressive transformers for image generation.
Explores training transformers for images using a similar approach to GPT.
Trained on both images and text, setting new benchmarks for vision-language models.
Unified image-to-text transformer that generates text from images using a single-stage model.
Combines language and vision encoders, achieving state-of-the-art multimodal performance.
Designed for vision-language tasks, enabling few-shot multimodal learning.
Integrates vision and language models for few-shot learning tasks.
Extends BEiT to unify vision and language learning under a masked prediction framework.
Uses caption generation and image-text matching to improve multimodal learning.
Uses contrastive learning with large-scale web image-text pairs for alignment.
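A minimal sketch of the symmetric image-text contrastive objective used by this line of work, assuming a batch of already-computed embeddings; the temperature value and function names are illustrative.

```python
import numpy as np

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matching pairs sit on the diagonal of the similarity matrix; every other
    entry in the same row or column acts as a negative.
    """
    img_emb = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature           # (B, B)

    def cross_entropy(lgts):
        lgts = lgts - lgts.max(axis=-1, keepdims=True)
        log_probs = lgts - np.log(np.exp(lgts).sum(axis=-1, keepdims=True))
        return -np.mean(np.diag(log_probs))              # targets are the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```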
Trains vision and language models together for zero-shot learning.
Uses GPT-style transformers to generate images from text prompts.
Combines visual transformers with GPT for caption generation.
Trains image representations using textual supervision.
Introduces the Gemini family of multimodal models, demonstrating capabilities across image, audio, video, and text understanding.
NSA (Natively trainable Sparse Attention) enhances long-context modeling by integrating hardware-aligned optimizations and dynamic hierarchical sparsity, achieving faster, more efficient training without compromising model performance.
State-space models as an alternative to transformers for sequence processing, offering efficiency benefits.
RetNet: Retentive Networks (July 2023). Replaces self-attention with retention mechanisms for improved efficiency and long-context handling.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (May 2022). Introduces FlashAttention, an algorithm that computes exact attention with reduced memory usage and improved speed, enhancing transformer efficiency.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (April 2022). Scaled MoE to trillion-parameter models, activating only a fraction of the network at a time.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (August 2021). Presents ALiBi, a method that allows transformers to handle longer sequences than they were trained on by incorporating linear biases in attention.
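A small sketch of the linear attention biases, assuming a causal setup and a head count that is a power of two; the slope schedule below follows the paper's geometric sequence, and the names are illustrative.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Per-head linear distance penalties added to the attention scores.

    Head h (0-indexed) uses slope 2**(-8 * (h + 1) / num_heads), so nearby
    tokens are penalized less than distant ones.
    """
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]               # j - i
    distance = np.minimum(distance, 0)                   # future positions get no bias (causal mask handles them)
    slopes = 2.0 ** (-8.0 * (np.arange(1, num_heads + 1) / num_heads))
    return slopes[:, None, None] * distance              # (heads, seq, seq), added to QK^T
```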
Perceiver IO: General Perception with Iterative Attention (July 2021). Extends Perceiver by allowing outputs of variable size, making it applicable to diverse tasks.
Improved positional encoding method for long-context transformers.
Processes arbitrary input modalities (text, images, audio) using a cross-attention mechanism.
Reduces self-attention complexity from O(n²) to O(n), making Transformers more efficient.
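A minimal sketch of one way attention becomes linear in sequence length: apply a positive feature map to queries and keys and use associativity so the n × n matrix is never formed. This illustrates the O(n) idea rather than the paper's exact construction; names and the feature map are illustrative.

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0) + 1e-6):
    """O(n) attention: compute phi(Q) @ (phi(K)^T @ V) instead of
    softmax(Q K^T) @ V, so no n x n attention matrix is materialized."""
    Qf, Kf = feature_map(Q), feature_map(K)               # (n, d)
    KV = Kf.T @ V                                         # (d, d_v), cost O(n * d * d_v)
    normalizer = Qf @ Kf.sum(axis=0, keepdims=True).T     # (n, 1)
    return (Qf @ KV) / normalizer                         # (n, d_v)
```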
Uses sparse attention to handle sequences up to 8× longer than standard Transformers.
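One common sparsity pattern is a local sliding window, sketched below as a boolean mask; the paper's exact pattern (global tokens, strides, or dilation) may differ, and the names are illustrative.

```python
import numpy as np

def sliding_window_mask(seq_len, window=4):
    """Boolean attention mask where each token attends only to neighbors
    within +/- window positions, giving O(n * window) instead of O(n^2)."""
    pos = np.arange(seq_len)
    return np.abs(pos[:, None] - pos[None, :]) <= window  # (seq, seq) booleans
```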
Uses locality-sensitive hashing (LSH) for self-attention, reducing memory requirements.
Introduces a recurrence mechanism to process longer-range dependencies efficiently.
Replaces fixed layers with adaptive computation, allowing dynamic depth per token.
Introduces sparse attention for efficient long-sequence processing.
Introduces the Transformer architecture with self-attention and no recurrence.
Introduced Mixture-of-Experts (MoE) to improve efficiency by routing inputs to specialized subnetworks.
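A toy sketch of expert routing, using top-1 gating for simplicity; load-balancing losses, capacity limits, and the paper's exact gating are omitted, and all names and shapes are illustrative.

```python
import numpy as np

def moe_top1_forward(x, gate_W, experts):
    """Route each token to its single highest-scoring expert (top-1 gating),
    so only one expert's parameters are used per token."""
    logits = x @ gate_W                                   # (tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                        # chosen expert per token
    out = np.zeros((x.shape[0], experts[0](x[:1]).shape[-1]))
    for e, expert in enumerate(experts):
        idx = np.where(choice == e)[0]
        if idx.size:
            out[idx] = probs[idx, e:e + 1] * expert(x[idx])  # scale by gate probability
    return out

# Toy usage: 2 experts implemented as simple linear maps
rng = np.random.default_rng(0)
experts = [lambda h, W=rng.normal(size=(16, 16)): h @ W for _ in range(2)]
y = moe_top1_forward(rng.normal(size=(8, 16)), rng.normal(size=(16, 2)), experts)
```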
Multilingual fine-tuning with extended vocabulary, language-specific adapters for non-English support.
Analyzed scaling laws for transformer depth vs width, explored layer-wise importance for reasoning tasks.
Introduced Mixture of Experts (MoE), improved efficiency, retrieval-augmented generation (RAG) integration.
Pretrained (7B, 13B, 70B) models, RLHF for fine-tuned chat, improved safety alignment.
Pretrained models (7B–65B) on public datasets, optimized tokenization, smaller vocabulary size.
Uses 4-bit quantization for efficient training of large models, reducing GPU memory usage.
Efficient fine-tuning for transformers, reducing computational costs while maintaining performance.
Post-training quantization (PTQ) for reducing model size without significant accuracy loss.
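A minimal sketch of one simple post-training quantization scheme (symmetric per-tensor int8); real PTQ methods are considerably more sophisticated, and the names here are illustrative.

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor post-training quantization to int8.

    Weights are stored as int8 plus one float scale; dequantization is
    W_q * scale, so model size drops ~4x versus float32 at some accuracy cost.
    """
    scale = np.abs(W).max() / 127.0
    W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return W_q, scale

W = np.random.default_rng(0).normal(size=(256, 256)).astype(np.float32)
W_q, scale = quantize_int8(W)
W_hat = W_q.astype(np.float32) * scale        # dequantized approximation of W
```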
Introduces Diffusion Transformers (DiT), replacing the U-Net backbone in diffusion models with a transformer and achieving state-of-the-art image generation results.
Requires fewer steps than DDPMs, achieving high-quality generation more efficiently.
Consistency Models (February 2023). Presents consistency models, a novel approach enabling single-step or few-step sampling in diffusion models while maintaining high-quality generation.
Introduces Imagen, a text-to-image diffusion model outperforming previous generative models by leveraging large-scale language models.
Presents a photorealistic text-to-image generation method using a combination of language models and diffusion models.
Compresses input data into a latent space before diffusion, reducing computational cost.
Introduces Latent Diffusion Models (LDMs), significantly improving computational efficiency for high-resolution image generation.
Proposes a probabilistic model that generates high-quality images through iterative denoising steps, forming the foundation for modern diffusion models.
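A minimal sketch of the forward noising step and the training target, using the linear beta schedule from the paper; the denoising network and the reverse sampling loop are omitted, and the function names are illustrative.

```python
import numpy as np

def ddpm_noise_and_target(x0, t, betas, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.

    A denoising network is trained to predict eps from (x_t, t); generation
    then runs the learned reverse (denoising) chain step by step.
    """
    alphas_bar = np.cumprod(1.0 - betas)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps   # training loss: || eps_theta(x_t, t) - eps ||^2

betas = np.linspace(1e-4, 0.02, 1000)      # linear noise schedule from the paper
rng = np.random.default_rng(0)
x_t, eps = ddpm_noise_and_target(rng.normal(size=(32, 32, 3)), t=500, betas=betas, rng=rng)
```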
Lays the foundation of modern diffusion models by introducing score matching with Langevin dynamics for high-dimensional generative modeling.
Introduces QLoRA, a method that fine-tunes large models using low-rank adaptation while keeping base models quantized, significantly reducing memory requirements.
Compares various parameter-efficient fine-tuning (PEFT) strategies, such as LoRA, adapters, and prefix tuning, highlighting trade-offs in efficiency and performance.
Introduces LoRA, a method that freezes most model weights while injecting trainable low-rank matrices, enabling efficient fine-tuning with minimal additional parameters.
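A minimal sketch of a LoRA-augmented linear layer, assuming NumPy tensors and illustrative rank and alpha values; the zero-initialized B matrix means the layer starts out identical to the frozen pretrained weight.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << d).

    Only A and B (r * (d_in + d_out) parameters) are updated during
    fine-tuning; the pretrained W stays fixed.
    """
    def __init__(self, W, rank=8, alpha=16, rng=np.random.default_rng(0)):
        d_out, d_in = W.shape
        self.W = W                                          # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero-init => no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(np.random.default_rng(1).normal(size=(64, 64)))
y = layer(np.ones((4, 64)))    # identical to x @ W.T until B is trained
```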
Uses reinforcement learning to align models with human preferences.
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-Tuning (October 2021). Shows that prompt tuning with continuous embeddings achieves results competitive with full fine-tuning in large language models.
Proposes prefix-tuning, a method that optimizes soft prompts instead of model parameters, enabling lightweight task adaptation.
Develops AdapterFusion, allowing multiple adapters trained on different tasks to be effectively combined without full fine-tuning.
Introduces adapter layers that allow efficient task adaptation by adding small, trainable modules to frozen pretrained models.
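A minimal sketch of a bottleneck adapter inserted after a (frozen) transformer sublayer, with ReLU as a stand-in nonlinearity and illustrative sizes.

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.

    Only W_down and W_up are trained; the surrounding transformer layer
    stays frozen, so each task adds only a small number of new parameters.
    """
    z = np.maximum(h @ W_down, 0.0)     # (tokens, bottleneck)
    return h + z @ W_up                 # residual connection back to model width

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 768))
out = adapter(h, rng.normal(scale=0.01, size=(768, 64)), rng.normal(scale=0.01, size=(64, 768)))
```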
Introduces AWQ, a quantization technique that considers activation outliers to improve the efficiency of large language models.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (October 2022). Presents GPTQ, a method for post-training quantization of GPT models, maintaining accuracy while reducing model size.
DeepSpeed ZeRO: Towards Training Trillion Parameter Models (July 2022). Describes DeepSpeed ZeRO, a system that enables efficient training of large-scale models by partitioning model states across devices.
Ground-breaking papers in GenAI