Research News
Papers and technical reports from the Apertus project. See the News area for general announcements.
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
Shows that respecting robots.txt opt-outs causes minimal performance degradation.
Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks
Research on memorization patterns and copyright risks in LLMs.
Quantifying Training Data Retention in Large Language Models: An Analysis of Pretraining Factors and Mitigation Strategies
Analysis of memorization and mitigation strategies applied in Apertus.
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Multilingual evaluation benchmark across 44 languages.
Deriving Activation Functions Using Integration
The xIELU activation function used in the Apertus architecture.
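The paper's recipe is to choose an activation's gradient first and obtain the activation itself by integration. As a minimal illustration of that recipe (not the exact xIELU parameterisation, which uses trainable coefficients), integrating the ELU gradient piecewise recovers ELU:

```python
import numpy as np

def elu_grad(x, alpha=1.0):
    # The chosen gradient: 1 for x > 0, alpha * exp(x) for x <= 0.
    return np.where(x > 0, 1.0, alpha * np.exp(x))

def elu_from_integration(x, alpha=1.0):
    # Piecewise antiderivative of elu_grad, with the integration constant
    # fixed so both pieces meet at x = 0: x for x > 0, alpha*(exp(x)-1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

# Sanity check: the numerical derivative of the integrated activation
# matches the gradient we started from.
xs = np.linspace(-4.0, 4.0, 4001)
assert np.allclose(np.gradient(elu_from_integration(xs), xs), elu_grad(xs), atol=1e-2)
```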
Global MMLU: Multilingual Evaluation
Understanding and addressing cultural and linguistic biases in multilingual evaluation.
Towards Fully FP8 GEMM LLM Training at Scale
A new LLM architecture that enables stable, fully FP8 GEMM training with substantial throughput gains.
Understanding and Minimising Outlier Features
Methods to reduce outlier features and improve quantisation without slowing down convergence.
Scaling Laws and Compute-Optimal Training
How scaling-law experiments can be performed with reduced compute and GPU hours.
Training Dynamics of the Cooldown Stage
Performance impact of the cooldown (decay) phase in Warmup-Stable-Decay learning rate schedules.
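For context, here is a minimal sketch of a Warmup-Stable-Decay schedule with a linear cooldown; the cooldown length and decay shape are the knobs whose training dynamics this line of work examines. The function and its parameter values are illustrative, not taken from the paper's code.

```python
def wsd_lr(step, max_lr, warmup_steps, total_steps, cooldown_steps):
    """Warmup-Stable-Decay: linear warmup to max_lr, a long constant
    plateau, then a cooldown to zero over the final cooldown_steps."""
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)          # warmup
    if step < total_steps - cooldown_steps:
        return max_lr                                        # stable plateau
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, cooldown_steps)       # linear cooldown

# Example: 100 warmup steps, 10k total steps, cooldown over the last 2k steps.
schedule = [wsd_lr(s, 3e-4, 100, 10_000, 2_000) for s in range(10_000)]
```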
The AdEMAMix Optimizer: Better, Faster, Older
A mixture of two exponential moving averages of gradients that improves performance on language and image tasks.
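The optimizer pairs Adam's fast gradient EMA with a much slower one and mixes both into the update. Below is a schematic NumPy sketch of a single step; it omits the warmup schedulers on alpha and beta3 described in the paper, and the default values shown are assumptions for illustration.

```python
import numpy as np

def ademamix_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8):
    """One schematic AdEMAMix step: Adam's fast EMA m1 plus a slow EMA m2
    of the gradients, mixed into the numerator of the update."""
    state["t"] += 1
    t = state["t"]
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * grad     # fast EMA
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * grad     # slow EMA
    state["v"]  = beta2 * state["v"]  + (1 - beta2) * grad**2  # second moment
    m1_hat = state["m1"] / (1 - beta1 ** t)                    # bias correction
    v_hat  = state["v"]  / (1 - beta2 ** t)
    return theta - lr * (m1_hat + alpha * state["m2"]) / (np.sqrt(v_hat) + eps)

# Usage: state = {"t": 0, "m1": 0.0, "m2": 0.0, "v": 0.0}
#        theta = ademamix_step(theta, grad, state)
```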
Benchmarking optimizers for large language model pretraining
A comprehensive evaluation of recent optimization techniques in LLM pretraining.
Quantile reward policy optimization
Alignment with pointwise regression and exact partition functions.
ConLID
Supervised Contrastive Learning for Low-Resource Language Identification.
Parity-aware byte-pair encoding
Improving cross-lingual fairness in tokenization.
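As a rough sketch of what "parity-aware" can mean in practice: instead of always merging the globally most frequent pair, merge selection can take the worst-compressed language into account. The toy implementation below picks each merge from whichever language currently has the most tokens per character; it is an assumed simplification for illustration, not the paper's actual objective or algorithm.

```python
from collections import Counter

def pair_counts(words):
    # Count adjacent symbol pairs in a tokenized corpus (list of symbol tuples).
    counts = Counter()
    for w in words:
        counts.update(zip(w, w[1:]))
    return counts

def merge(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    a, b = pair
    out = []
    for w in words:
        new, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                new.append(a + b); i += 2
            else:
                new.append(w[i]); i += 1
        out.append(tuple(new))
    return out

def parity_aware_bpe(corpora, num_merges):
    """`corpora` maps language -> list of words. At each step, pick the best
    merge for the language currently encoded with the most tokens per
    character (the worst-off), instead of the globally most frequent pair."""
    corpora = {lang: [tuple(w) for w in words] for lang, words in corpora.items()}
    chars = {lang: sum(len(w) for w in ws) for lang, ws in corpora.items()}
    merges = []
    for _ in range(num_merges):
        cost = {lang: sum(len(w) for w in ws) / chars[lang] for lang, ws in corpora.items()}
        worst = max(cost, key=cost.get)           # worst-compressed language
        counts = pair_counts(corpora[worst])
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges.append(pair)
        corpora = {lang: merge(ws, pair) for lang, ws in corpora.items()}
    return merges

print(parity_aware_bpe({"en": ["hello", "help"], "de": ["hallo", "halt"]}, 4))
```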
Going over Fine Web with a Fine-Tooth Comb
Technical report on indexing FineWeb for problematic-content search and retrieval.
Low-Perplexity LLM-Generated Sequences and Where To Find Them
Tracing texts back to training sources, revealing how data impacts model behavior.
Mixtera: A data plane for foundation model training
Lets users define and dynamically adjust data mixtures during training without introducing performance bottlenecks.
This list is continuously expanded: please visit our 📖 Zotero group for further literature.
