Research News

Papers and technical reports from the Apertus project. See the News area for general announcements.

Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs

Fan, Sabolčec, Ansaripour, Tarun, Jaggi, Bosselut, Schlag

Shows that respecting robots.txt opt-outs causes minimal performance degradation.
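For context, respecting an opt-out amounts to checking every URL against the site's robots.txt before ingesting it. Python's standard library expresses the check directly (the user-agent string below is a placeholder; the paper's data pipeline is of course more involved):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Skip any URL the site has opted out of crawling.
if rp.can_fetch("MyCrawlerBot", "https://example.com/page.html"):
    ...  # fetch and keep the page
```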

Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks

Xu, Bosselut, Schlag

Research on memorization patterns and copyright risks in LLMs.

Quantifying Training Data Retention in Large Language Models: An Analysis of Pretraining Factors and Mitigation Strategies

Yixuan Xu (Master's thesis)

Analysis of memorization and mitigation strategies applied in Apertus.

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Romanou et al.

Multilingual evaluation benchmark across 44 languages.

Deriving Activation Functions Using Integration

Huang, Schlag

The xIELU activation function used in the Apertus architecture.
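The paper's recipe is to pick the gradient you want an activation to have and integrate it to obtain the activation itself. A hedged illustration in PyTorch, assuming an affine gradient on the positive side and an ELU-style one on the negative side; this is the general construction, not the exact xIELU parameterization, which is given in the paper:

```python
import torch

def integrated_activation(x, alpha_p=0.8, alpha_n=0.8, beta=0.5):
    """Activation derived by integrating a chosen gradient (illustrative).

    Positive side: integrating g(x) = 2*alpha_p*x + beta gives
        alpha_p*x**2 + beta*x.
    Negative side: integrating g(x) = alpha_n*(exp(x) - 1) + beta gives
        alpha_n*(exp(x) - x - 1) + beta*x.
    Both pieces and their first derivatives agree at x = 0 by construction.
    """
    pos = alpha_p * x**2 + beta * x
    neg = alpha_n * (torch.exp(x) - 1 - x) + beta * x
    return torch.where(x > 0, pos, neg)
```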

Global MMLU: Multilingual Evaluation

Singh et al.

Understanding and addressing cultural and linguistic biases in multilingual evaluation.

Towards Fully FP8 GEMM LLM Training at Scale

Hernández-Cano et al.

A new LLM architecture that enables fully FP8 GEMM training at scale, with substantial throughput gains.
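As background, low-precision GEMMs rely on scaling tensors into the narrow FP8 dynamic range before multiplying. A toy simulation of that scaling idea, assuming per-tensor dynamic scaling with the E4M3 format (maximum finite value 448); this is background intuition only, not the paper's method:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_fp8_sim(x):
    """Simulate per-tensor FP8 (E4M3) quantization with dynamic scaling.

    A real FP8 GEMM casts to an 8-bit float; here we only emulate the
    dynamic-range handling: scale the tensor so its absolute maximum
    maps to the FP8 maximum, then clip. (Mantissa rounding is omitted.)
    """
    scale = E4M3_MAX / max(np.abs(x).max(), 1e-12)
    return np.clip(x * scale, -E4M3_MAX, E4M3_MAX), scale

def fp8_gemm_sim(a, b):
    """Matmul on 'FP8' operands; accumulate in FP32 and undo the scales."""
    a_q, sa = quantize_fp8_sim(a)
    b_q, sb = quantize_fp8_sim(b)
    return (a_q @ b_q) / (sa * sb)
```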

Understanding and Minimising Outlier Features

He, Noci, Paliotta, Schlag, Hofmann

Methods to reduce outlier features (OFs) and improve quantisation without slowing down convergence.

Scaling Laws and Compute-Optimal Training

Hägele, Bakouch, Kosson, Ben Allal, Von Werra, Jaggi

How scaling-law experiments can be performed with reduced compute and fewer GPU hours.

Training Dynamics of the Cooldown Stage

Dremov, Hägele, Kosson, Jaggi

Performance impacts of the cooldown (decay) stage in Warmup-Stable-Decay learning-rate schedules.
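For reference, a Warmup-Stable-Decay (WSD) schedule holds the learning rate constant between a linear warmup and a final cooldown; the paper studies what happens during that last stage. A minimal sketch with a linear cooldown (the cooldown shape is itself one of the knobs such work examines):

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, cooldown_steps):
    """Warmup-Stable-Decay schedule: linear warmup, constant plateau,
    then a linear cooldown to zero over the final `cooldown_steps`."""
    if step < warmup_steps:                      # warmup
        return peak_lr * step / warmup_steps
    if step < total_steps - cooldown_steps:      # stable plateau
        return peak_lr
    # cooldown: decay linearly, reaching zero at total_steps
    return peak_lr * (total_steps - step) / cooldown_steps
```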

The AdEMAMix Optimizer: Better, Faster, Older

Pagliardini, Ablin, Grangier

An optimizer that combines two exponential moving averages (EMAs) of gradients to improve performance on language and image tasks.
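To sketch the idea: alongside Adam's fast EMA of gradients, AdEMAMix keeps a second, much slower EMA and mixes the two in the update's numerator. A simplified single-parameter step, omitting the paper's bias corrections and its warmup schedules for beta3 and alpha:

```python
import torch

def ademamix_step(p, g, state, lr=1e-4, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, wd=0.0):
    """One AdEMAMix-style update (sketch). `state` holds zero-initialized
    tensors m1, m2, v shaped like the parameter p."""
    state["m1"] = beta1 * state["m1"] + (1 - beta1) * g    # fast EMA
    state["m2"] = beta3 * state["m2"] + (1 - beta3) * g    # slow EMA
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g  # second moment
    update = (state["m1"] + alpha * state["m2"]) / (state["v"].sqrt() + eps)
    p -= lr * (update + wd * p)  # decoupled (AdamW-style) weight decay
```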

Benchmarking optimizers for large language model pretraining

Semenov, Pagliardini, Jaggi

A comprehensive evaluation of recent optimization techniques in LLM pretraining.

Quantile reward policy optimization

Matrenok, Moalla, Gulcehre

Alignment with pointwise regression and exact partition functions.

ConLID

Foroutan, Saydaliev, Kim, Bosselut

Supervised Contrastive Learning for Low-Resource Language Identification.

Parity-aware byte-pair encoding

Foroutan et al.

Improving cross-lingual fairness in tokenization.
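Plain BPE chooses merges by global pair frequency, which lets high-resource languages dominate the vocabulary. A hedged sketch of one parity-aware selection rule, assuming per-language corpora and a max-min criterion (help the currently worst-compressed language); the paper's exact rule may differ:

```python
from collections import Counter

def next_parity_merge(corpora):
    """Choose the next BPE merge in a parity-aware way (assumed criterion).

    `corpora` maps language -> list of token sequences (lists of strings).
    Instead of merging the globally most frequent pair, find the language
    whose text currently compresses worst (most tokens per character) and
    merge that language's most frequent adjacent pair.
    """
    def tokens_per_char(seqs):
        tokens = sum(len(s) for s in seqs)
        chars = sum(len("".join(s)) for s in seqs)
        return tokens / max(chars, 1)

    worst = max(corpora, key=lambda lang: tokens_per_char(corpora[lang]))
    pairs = Counter()
    for seq in corpora[worst]:
        pairs.update(zip(seq, seq[1:]))
    return worst, pairs.most_common(1)[0][0]  # pair to merge next
```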

Going over Fine Web with a Fine-Tooth Comb

Marinas, Kucherenko, Kucharavy

Technical report on indexing Fine Web for problematic content search and retrieval.

Low-Perplexity LLM-Generated Sequences and Where To Find Them

Wuhrmann, Kucharavy, Kucherenko

Tracing generated text back to its training sources, revealing how training data shapes model behavior.

Mixtera: A data plane for foundation model training

Böther et al.

Lets users define and dynamically adjust data mixtures during training without introducing performance bottlenecks.
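At its simplest, a data mixture is a set of weights over sources that the loader samples from; a data plane like Mixtera manages this, including mid-training re-weighting, at scale. A toy illustration with hypothetical names, not Mixtera's actual API:

```python
import itertools
import random

def mixture_stream(sources, weights):
    """Yield samples from several data sources according to mixture weights.

    `sources` maps a name to an (infinite) iterator of samples; `weights`
    maps the same names to sampling probabilities. Mutating `weights`
    between steps changes the mixture on the fly.
    """
    names = list(sources)
    while True:
        (name,) = random.choices(names, weights=[weights[n] for n in names])
        yield next(sources[name])

# Example: 70/30 mixture of two toy sources, adjustable mid-training.
weights = {"web": 0.7, "code": 0.3}
stream = mixture_stream(
    {"web": itertools.cycle(["web doc"]),
     "code": itertools.cycle(["code snippet"])},
    weights,
)
sample = next(stream)
weights["code"] = 0.5  # re-weight the mixture without restarting the stream
```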

This list is continuously expanded: please visit our 📖 Zotero group for further literature.