My old supervisor, Prof. Rainer Nagel, would sometimes ask me whether I am still doing mathematics. The answer has become complicated. Agentic coding tools are now strong enough to essentially set up an entire machine learning experiment and submit the job on Kubernetes; I review code and logic, give specifications, and so on. This leaves time for reading papers.

I miss giving talks, and I thought it would be nice to write up some things I have been thinking about as material for a seminar talk, one I can give when I visit old acquaintances who are still in academia and are always grateful for a good talk.

With the advent of NatureLM, it is clear that…

SM-2 algorithm
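
For reference, a minimal Python sketch of the classic SM-2 update (quality grades 0-5, ease factor floored at 1.3); the function signature and rounding are my own choices, not SuperMemo's.

```python
def sm2_update(quality, reps, interval, ease):
    """One SM-2 review: returns (new_reps, new_interval_days, new_ease)."""
    if quality < 3:
        # Failed recall: restart the repetition sequence, keep the ease factor.
        return 0, 1, ease
    # Ease-factor update from the original SM-2 description, floored at 1.3.
    ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    if reps == 0:
        interval = 1
    elif reps == 1:
        interval = 6
    else:
        interval = round(interval * ease)
    return reps + 1, interval, ease
```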

Improvements to the SM algorithms and the issues they address.

Markov Decision Processes.

  • Stochastic shortest path: minimizing memorization cost
  • FSRS algorithm (Free Spaced Repetition Scheduler)

The Half-Life Regression model deployed by Duolingo is easy to make nonlinear.

Neural architectures transform interval prediction paradigms

Recent advances in neural networks and reinforcement learning have revolutionized spaced repetition by enabling personalized, adaptive algorithms that learn from massive datasets. LSTM/GRU models capture temporal dependencies in review sequences through recurrent architectures, with gate activations such as f_t = σ(W_f·[h_{t-1}, x_t] + b_f) for the forget gate and analogous equations for the input and output gates. These models achieve superior performance by learning complex patterns in review histories that hand-crafted algorithms miss, with attention mechanisms identifying which past reviews most influence current recall probability.
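
As a concrete (if simplified) illustration, here is a PyTorch-style sketch of such a recurrent recall predictor; the per-review feature layout and layer sizes are assumptions of mine, not a published architecture.

```python
import torch
import torch.nn as nn

class RecallLSTM(nn.Module):
    """Predict recall probability from a sequence of past reviews.

    Each review step is encoded as a small feature vector, e.g.
    (grade, log elapsed days, item difficulty) -- an assumed layout.
    """
    def __init__(self, n_features=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, reviews):                    # (batch, seq_len, n_features)
        _, (h_n, _) = self.lstm(reviews)           # h_n: (1, batch, hidden)
        return torch.sigmoid(self.head(h_n[-1]))   # recall probability in (0, 1)
```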

The Half-Life Regression (HLR) framework, deployed at scale by Duolingo, formulates the learning objective as ℓ(⟨p, Δ, x⟩; Θ) = (p - p̂_Θ)² + α(h - ĥ_Θ)² + λ||Θ||₂², where p̂_Θ = 2^(-Δ/ĥ_Θ) represents predicted recall probability and ĥ_Θ = 2^(Θ·x) estimates the half-life. Neural variants replace the linear model Θ·x with deep networks ĥ_Θ = NN(x; Θ), enabling non-linear feature interactions. The incorporation of lexeme tag features (sparse indicators for word types) and linguistic complexity metrics (word frequency, morphological complexity, semantic similarity from Word2Vec/BERT embeddings) enables word-specific difficulty estimation crucial for vocabulary learning applications.
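
A small sketch of that objective as written; the neural variant would simply replace the inner product Θ·x with a small network. Variable names and default weights are mine.

```python
import numpy as np

def hlr_loss(p, delta, x, theta, h_true=None, alpha=0.01, lam=0.1):
    """HLR loss for one observation <p, Delta, x> with parameters theta."""
    h_hat = 2.0 ** np.dot(theta, x)          # estimated half-life (days)
    p_hat = 2.0 ** (-delta / h_hat)          # predicted recall probability
    loss = (p - p_hat) ** 2 + lam * np.dot(theta, theta)
    if h_true is not None:                   # optional half-life supervision term
        loss += alpha * (h_true - h_hat) ** 2
    return loss
```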

Reinforcement learning formulations model spaced repetition as a Markov Decision Process with state space S = (difficulty, time_delay, memory_strength), action space A = {1, 2, …, T_max} representing review intervals in days, and reward function R(s,a) = recall_probability - λ × review_cost. The Bellman equation V(s) = max_a [R(s,a) + γ Σ_{s′} P(s′|s,a) V(s′)] defines the optimal value function, with Deep Q-Networks (DQN) approximating Q-values using neural networks, e.g. Q(s_t, a_t; θ) = LSTM(Dense(s_t; θ)). Policy gradient methods like PPO optimize the objective L^{PPO}(θ) = E[min(r_t(θ)Â_t, clip(r_t(θ), 1-ε, 1+ε)Â_t)], providing stable learning in the high-variance environment of human memory.
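
To make the formulation concrete, here is a minimal tabular Q-learning sketch with the reward above; the interval discretization and hyperparameters are illustrative assumptions, and a DQN would replace the table with a network.

```python
import numpy as np
from collections import defaultdict

ACTIONS = [1, 2, 4, 8, 16, 32]    # candidate review intervals in days (assumed grid)
Q = defaultdict(lambda: np.zeros(len(ACTIONS)))   # discretized state -> action values

def reward(recall_prob, review_cost, lam=0.1):
    """R(s, a) = recall_probability - lambda * review_cost, as above."""
    return recall_prob - lam * review_cost

def q_update(state, action_idx, r, next_state, alpha=0.1, gamma=0.95):
    """One Q-learning step toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[next_state])
    Q[state][action_idx] += alpha * (target - Q[state][action_idx])
```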

Transformer architectures have recently shown promise for spaced repetition, with SAINT+ separating exercise and response information processing through attention mechanisms. Graph neural networks model hierarchical knowledge structures where concepts “encompass” sub-concepts, with message passing m_{ij}^{(l)} = Message(h_i^{(l)}, h_j^{(l)}) propagating information through the knowledge graph. The LECTOR system, published in 2024, leverages large language models for semantic similarity assessment, achieving a 90.2% success rate by addressing semantic confusion in vocabulary learning. These neural approaches achieve 2.9-4.8× training speedup while using only 34-50% of the training data, demonstrating the power of learned representations over hand-crafted features.

Implementation architecture demands careful optimization

Production deployment of spaced repetition algorithms requires careful attention to numerical stability, with exponential forgetting curves requiring log-space computations to handle exp(-large_values) without underflow. The stability calculation S_{n+1} = S_n × SInc(S_n, R_n, D) needs safeguards for extreme parameter values, while gradient-based optimization requires proper scaling across parameters with different ranges. Gradient clipping and batch normalization prove essential for stable neural network training, with regularization terms preventing overfitting to individual user patterns.
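
A minimal sketch of the log-space retrievability computation and a clamped stability update; the bounds here are illustrative guards, not FSRS's actual safeguards.

```python
import math

def log_retrievability(elapsed_days, stability):
    """log R(t) = -t / S for an exponential forgetting curve.

    Staying in log space lets downstream code (e.g. log-loss) avoid
    evaluating exp() on very large negative arguments at all.
    """
    return -elapsed_days / max(stability, 1e-9)   # guard against S ~ 0

def retrievability(elapsed_days, stability):
    return math.exp(max(log_retrievability(elapsed_days, stability), -700.0))

def update_stability(s, s_inc, s_min=0.1, s_max=36500.0):
    """S_{n+1} = S_n * SInc(...), clamped to an assumed plausible range."""
    return min(max(s * s_inc, s_min), s_max)
```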

Efficient data structures significantly impact system performance at scale. Priority queues implemented as Fibonacci heaps provide O(1) amortized decrease-key operations for updating review priorities, while calendar queues optimize time-based scheduling with O(1) average insertion and deletion. The compact representation for the FSRS algorithm requires only 16 bytes per card (4 floats for stability, difficulty, timestamp, and computed retrievability), enabling millions of cards per gigabyte of memory. Balanced trees maintaining cards ordered by recall probability support efficient selection of items for review, with B+ trees providing cache-friendly access patterns for database-backed implementations.
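
Two of these ideas in miniature: the 16-byte card record packed with the standard-library struct module (the field order is my assumption), and a review queue keyed by predicted recall probability, here using an ordinary binary heap rather than a Fibonacci heap.

```python
import heapq
import struct

# Four 32-bit floats: stability, difficulty, last-review timestamp, retrievability.
CARD = struct.Struct("<4f")
assert CARD.size == 16        # 16 bytes per card

def pack_card(stability, difficulty, last_review_ts, retrievability):
    return CARD.pack(stability, difficulty, last_review_ts, retrievability)

# Cards with the lowest predicted recall probability surface first.
queue = []
heapq.heappush(queue, (0.62, "card-123"))
heapq.heappush(queue, (0.31, "card-456"))
most_at_risk = heapq.heappop(queue)   # -> (0.31, "card-456")
```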

Parallelization strategies enable scaling to millions of users, with data parallelism supporting independent optimization per user or collection. Parallel FSRS optimization can be implemented with a process pool: parallel_optimize(collections) submits one task per collection and gathers results asynchronously via futures. Model parallelism distributes neural network layers across devices, while pipeline parallelism overlaps computation for different batches. GPU acceleration provides 2.9-4.8× speedup for neural approaches through batch processing of review sequences and matrix operations for stability calculations. Memory-efficient attention mechanisms using sparse patterns or local attention windows enable processing longer review histories without exhausting GPU memory.
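
A sketch of this per-collection data parallelism with a standard-library process pool; optimize_collection is a placeholder for whatever per-user parameter fit is actually used.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def optimize_collection(reviews):
    """Placeholder for a per-user FSRS parameter fit; returns dummy stats here."""
    return {"n_reviews": len(reviews)}

def parallel_optimize(collections, max_workers=8):
    """Each collection is fitted independently, so this is embarrassingly parallel."""
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(optimize_collection, c): i
                   for i, c in enumerate(collections)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```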

The evaluation infrastructure requires sophisticated metrics beyond simple accuracy. Log loss L = -Σ_i [y_i log(p_i) + (1-y_i) log(1-p_i)] measures calibration quality, while AUC represents the probability that the algorithm assigns a higher recall probability to recalled versus forgotten cards. The newer RMSE(bins) metric addresses calibration gaming by binning reviews by predicted probability and measuring RMS differences within bins. Production systems must handle millisecond response times for scheduling decisions, robust handling of irregular review patterns where users may skip days or weeks, and long-term parameter stability across algorithm updates. A/B testing frameworks enable continuous improvement while maintaining user experience, with multi-armed bandit approaches balancing exploration of new algorithms against exploitation of proven methods.
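
A sketch of both metrics; the equal-width probability bins below are a simplification of how RMSE(bins) is actually computed in the FSRS benchmark.

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    """L = -sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def rmse_bins(y, p, n_bins=20):
    """Bin reviews by predicted probability, then take the size-weighted RMSE
    between mean prediction and observed recall rate within each bin."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    sq_err, total = 0.0, 0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            n = int(mask.sum())
            sq_err += n * (p[mask].mean() - y[mask].mean()) ** 2
            total += n
    return float(np.sqrt(sq_err / total))
```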