Is the Ebbinghaus forgetting curve still considered accurate today?

The general *shape* of the curve (rapid loss in the first hours and days, then slower decline) has been replicated many times, most rigorously by Murre and Dros in 2015. But the specific numbers people quote ("you forget 70% in 24 hours") are a paraphrase of Ebbinghaus's single-subject self-experiment with nonsense syllables. His actual 24-hour savings was about 34%, which measured relearning effort, not direct recall. Modern analyses (Rubin and Wenzel, 1996) also show that four different mathematical functions fit forgetting data about equally well; the exponential decay in most textbook diagrams is one candidate among several.

What is the optimal gap between review sessions?

Cepeda and colleagues (2008) ran the largest study to date (1,354 adults across four retention intervals) and found that the optimal gap shrinks as a share of the desired retention interval, so "roughly 10 to 20%" is only a mid-range rule of thumb. For a one-week retention test the optimum was 20–40% of the interval; for a one-year test it dropped to 5–10%. Erring long is much cheaper than erring short: recall climbs steeply up to the optimum and declines only gradually past it.

Why do flashcard apps use expanding intervals if the evidence is weak?

Expanding intervals, the schedule used by SuperMemo, Anki, FSRS, and Sticky, were proposed by Paul Pimsleur in 1967 as a theoretical idea, not an empirical finding. The strongest direct test (Karpicke and Bauernschmidt, 2011) compared expanding, equal, and contracting schedules with absolute spacing held constant. Absolute spacing produced a 200% advantage over massing, but the three schedule shapes were statistically indistinguishable. Apps work because they space retrievals over long intervals, not because they expand them specifically. The convention is reasonable; the evidence for *expansion* per se is thinner than the marketing implies.

Does spaced repetition help with conceptual understanding or just memorisation?

Both, but the evidence is much stronger for the recall layer. Karpicke and Blunt (2011) showed retrieval practice beat concept mapping even on a delayed concept-map-creation test. But Barnett and Ceci (2002) and Day and Goldstone (2012) make a separate point: knowledge acquired in one context rarely transfers to novel problem contexts without explicit comparison, analogical reasoning, or structural alignment cues. Spaced repetition optimises a *retention* function, which is necessary but not sufficient for higher-order learning. Use it for facts and vocabulary; build clinical reasoning, fluent speaking, or problem-solving on top of it.

How long do the effects of spaced repetition last?

The Bahrick family ran the longest study in the literature: a 9-year self-experiment on foreign-language vocabulary published in 1993. With the right schedule, 13 sessions spaced 56 days apart produced the same 5-year retention as 26 sessions spaced 14 days apart. Harry Bahrick's separate 1984 study tested 733 adults on Spanish learned up to 50 years earlier and found a "permastore" plateau: well-acquired material decayed for 3–6 years, then stayed roughly stable for the next 25. Spaced repetition can produce genuinely durable memory.

Why does spaced repetition feel less effective than cramming?

Because performance in the moment is a misleading signal of learning. Soderstrom and Bjork (2015) systematised this in their review of the *learning versus performance* distinction: conditions that boost short-term performance often fail to support long-term retention. Kornell (2009) demonstrated the consequence: 90% of participants objectively learned more in a spaced flashcard condition than a massed one, but a *majority* told the experimenters afterwards that they thought massing had worked better. Spaced practice surfaces what you've forgotten since the last session. The discomfort is the work.

Are there any conditions where spaced repetition doesn't work?

Yes, several. Donovan and Radosevich's 1999 meta-analysis found the spacing effect shrinks substantially as task complexity grows: d ≈ 0.97 for simple motor skills but only 0.07–0.11 for the most complex tasks. Dempster (1988) lists boundary conditions including paraphrased rather than verbatim repetition, very young children, and immediate-recall tests. Kerfoot's urology RCT (2007) found a spaced-education benefit on the targeted online tests but no transfer to the broader Urology In-Service Examination. And studies of motivated populations consistently report larger effects than studies of disengaged learners. The technique requires the learner to actually do it.

The Science of Spaced Repetition: A Research Review

You sit down the night before an exam, cram for six hours, and walk into the room feeling prepared. Two weeks later, you can barely recall a quarter of what you studied. The frustrating thing isn't that you forgot. It's that you forgot so much, so fast.

The conventional answer is that you should have spaced your study sessions out instead. That advice is correct. It is also one of the most-replicated findings in learning science, with roots in experiments from 1885 and a chain of replications that runs through invertebrate biology, university lecture halls, and the algorithms that run the flashcard app on your phone.

The harder question, the one most blog posts about spaced repetition skip, is how much of the popular story is actually supported by the data, and how much is folklore. The famous Ebbinghaus forgetting curve is real, but the most-shared version of it is wrong about the numbers. The spacing effect is robust, but smaller in rigorous studies than the spectacular ratios you'll see quoted. The algorithms in modern flashcard apps work, but a lot of their design choices rest on thinner evidence than the marketing implies.

This article is an attempt at the long-form, citation-honest version. Every empirical claim links to a primary source. Where the data is weaker than the popular telling, we say so.

What Ebbinghaus actually measured

In 1885, a German psychologist named Hermann Ebbinghaus published the first quantitative measurement of how memory fades over time[1]. He memorised lists of meaningless three-letter syllables (wid, zof, kep) and then re-learned them at intervals from twenty minutes to a month, tracking how much study time he saved on the second pass. His data showed that forgetting was rapid right after learning and then slowed: he saved 58% of his time when relearning after twenty minutes, 34% after one day, and 21% after a month.

Over a century later, a Dutch graduate student named Joeri Dros repeated the experiment with modern methodology under the supervision of Jaap Murre[2]. Their curve looked, in their own words, "remarkably" similar to the 1885 version. The Ebbinghaus forgetting curve is one of the most replicated findings in psychology.

That is the part that holds up. Now the part that doesn't.

The line you'll see in every infographic, "you forget 70% of what you learn in 24 hours", is a paraphrase that doesn't survive contact with the original paper. Ebbinghaus used the savings method, which measures how much faster you re-learn material the second time around, not how much you directly recall. His 24-hour savings was about 34%, meaning relearning took roughly two-thirds the original time, not that two-thirds of the information had vanished. He was also a single subject experimenting on himself, using nonsense syllables deliberately stripped of the prior knowledge that helps real learning stick.
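The savings method reduces to one line of arithmetic. A minimal sketch, assuming relearning effort is measured in seconds; the function name and the sample timings are illustrative, not Ebbinghaus's raw data:

```python
def savings_percent(original_time: float, relearning_time: float) -> float:
    """Percentage of the original learning effort saved on relearning."""
    if original_time <= 0:
        raise ValueError("original_time must be positive")
    return 100.0 * (original_time - relearning_time) / original_time


# HYPOTHETICAL timings chosen to reproduce the 34% figure quoted above:
# a list that took 1000 s to learn and 660 s to relearn a day later.
one_day = savings_percent(1000.0, 660.0)
print(f"savings after one day: {one_day:.0f}%")
print(f"relearning cost: {100 - one_day:.0f}% of the original time")
```

A savings score of 34% says the second pass cost about two-thirds of the first. It says nothing directly about what fraction of the syllables could have been recalled cold.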

The shape of his curve generalises. The specific numbers are properties of one person's memory.

The mathematical form is also less settled than the textbook story suggests. Rubin and Wenzel fit 105 different functions to 210 published retention datasets in 1996 and concluded that four families (logarithmic, power, exponential-square-root, and hyperbolic-square-root) describe forgetting data about equally well[3]. Wixted and Ebbesen make the strongest case for the power function specifically[4], but the available data can't reliably separate it from the other three. The clean exponential drawn in most diagrams is one candidate among several.
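For reference, the four families can be written as two-parameter functions of elapsed time t. The forms below are the usual parameterisations as best we can state them; the symbols y (retention), b and m (free parameters fitted per dataset) are our notation, not taken from the cited papers:

```latex
\begin{align*}
\text{logarithmic:} \quad & y = b - m \ln t \\
\text{power:} \quad & y = b\, t^{-m} \\
\text{exponential in } \sqrt{t}\text{:} \quad & y = b\, e^{-m \sqrt{t}} \\
\text{hyperbolic in } \sqrt{t}\text{:} \quad & y = \frac{1}{b + m \sqrt{t}}
\end{align*}
```

All four fall quickly at first and flatten later, which is why retention data alone struggles to tell them apart.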

The defensible claim is narrow: forgetting decelerates over time, fast at first and slow later. Anyone who tells you the exact equation is selling something.

Figure: The Ebbinghaus forgetting curve, then and now. Savings (% reduction in relearning time) by retention interval, log x-axis. Series: Ebbinghaus (1885) and Murre & Dros (2015).

Source: Ebbinghaus (1885), replicated by Murre & Dros (2015). The Ebbinghaus forgetting curve is one of the most-replicated findings in psychology, but the popular “you forget 70% in 24 hours” is a paraphrase; his actual 24-hour savings was about 34%.

The spacing effect: what 317 experiments converge on

If memory decays without review, the question becomes how to schedule the reviews. The answer (distribute them across multiple sessions instead of packing them into one) is older than psychology as a profession, and at this point it has been demonstrated more often than almost any other finding in the field.

The single most comprehensive synthesis is a 2006 meta-analysis by Cepeda, Pashler, Vul, Wixted and Rohrer of 839 effect sizes drawn from 317 verbal-recall experiments[5]. Their headline result: the inter-study interval and the retention interval jointly determine final retention. The interval that maximises long-term recall grows as the retention interval grows. Massed presentation reliably underperforms even modest spacing.

Two years later the same group ran the largest single test of the spacing effect ever attempted: 1,354 adults, 32 trivia facts each, study gaps from 0 to 105 days, and retention intervals of 7, 35, 70, or 350 days[6]. The paper that resulted mapped what they called a temporal ridgeline: the optimal gap as a function of how long you want to remember something.

The headline finding is more nuanced than it usually gets quoted. The optimal gap, as a proportion of the retention interval, shrinks as the retention interval grows. For a one-week test, the best gap was 20 to 40% of the interval (about a day or two). For a one-year test, it dropped to 5 to 10% (about a month). At the optimal gap, recall improved by 64% overall versus zero-day spacing.

Crucially: the cost of erring long is much smaller than the cost of erring short. The curve rises steeply, peaks, then declines gradually. If you don't know exactly when to review, schedule it later rather than sooner.
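The two anchor points above can be turned into a rough planning helper. This is a sketch under a loud assumption: Cepeda et al. report optimal ratios for specific retention intervals, and the log-linear interpolation between the one-week and one-year anchors below is our own simplification, not a formula from the paper. The function name is hypothetical.

```python
import math

# Anchor points quoted in the text: (retention interval in days, low ratio, high ratio).
WEEK = (7.0, 0.20, 0.40)
YEAR = (350.0, 0.05, 0.10)


def suggested_gap_days(retention_days: float) -> tuple[float, float]:
    """Return a (low, high) range for the study gap, in days.

    ASSUMPTION: the optimal ratio is interpolated linearly in
    log(retention interval) between the two anchors and held constant
    outside them. The paper does not provide this formula.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    clamped = min(max(retention_days, WEEK[0]), YEAR[0])
    # Position between the anchors on a log scale: 0 at 7 days, 1 at 350 days.
    w = (math.log(clamped) - math.log(WEEK[0])) / (math.log(YEAR[0]) - math.log(WEEK[0]))
    low_ratio = WEEK[1] + w * (YEAR[1] - WEEK[1])
    high_ratio = WEEK[2] + w * (YEAR[2] - WEEK[2])
    return retention_days * low_ratio, retention_days * high_ratio


for days in (7, 35, 70, 350):
    low, high = suggested_gap_days(days)
    print(f"remember for {days:>3} days: review after roughly {low:.1f} to {high:.1f} days")
```

When in doubt, lean toward the high end of the range, since erring long costs less than erring short.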

Figure: The optimal gap depends on how long you want to remember. Cepeda et al. (2008): final-test recall by study gap, for four retention intervals.

Source: Cepeda et al. (2008). The optimal gap between sessions grows with the retention interval, but as a proportion of it the optimum shrinks (from 20–40% of a 1-week test to 5–10% of a 1-year test).

The effect is not confined to lab undergraduates. In 2011 Sobel, Cepeda and Kapler ran a small but elegant study in two Ontario fifth-grade classrooms[10]. Thirty-nine children learned eight GRE-level vocabulary words from their regular teacher. Half were re-taught a minute later; half a week later. Five weeks after that, the spaced group recalled 20.8% of the definitions against 7.5% for the massed group, nearly three times as many, with no extra teaching time.

Figure: Fifth-grade vocabulary recall, five weeks later. Same children, same words, same total teaching time; only the gap differs.

Source: Sobel, Cepeda & Kapler (2011). 39 fifth-graders learned GRE-level vocabulary either back-to-back or with a one-week gap. Five weeks later, the spaced group recalled nearly three times as many definitions (d = 0.48).

Set against this is a meta-analysis by Donovan and Radosevich from a decade earlier[11]. Across 112 effect sizes (N = 8,980), the overall weighted spacing effect was d = 0.46, confirming the basic pattern but smaller than the headline numbers from individual studies. More importantly, the effect varied dramatically by task: d = 0.97 for simple motor tasks, d = 0.42 for verbal/cognitive content, and d = 0.07–0.11 for the most complex tasks they coded. Methodologically rigorous studies showed d ≈ 0.40; lower-rigor studies inflated to d = 1.22. The "spacing doubles learning" claim attaches to the low-rigor end of the literature.

For the kind of material flashcards target (vocabulary, definitions, facts), the realistic effect size for a student is therefore close to half a standard deviation. That is large by educational-research standards, but it is not a miracle.
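One way to make these d values concrete is the "probability of superiority": the chance that a randomly chosen learner from the spaced condition outscores a randomly chosen learner from the massed condition. The sketch below assumes normally distributed scores with equal variance in both groups, an idealisation the meta-analysis itself does not claim; the function name is ours.

```python
import math


def probability_of_superiority(d: float) -> float:
    """P(spaced learner scores higher than massed learner) for Cohen's d.

    ASSUMPTION: both groups are normal with equal variance, so the
    difference of two random scores is normal with mean d and sd sqrt(2)
    (in pooled-standard-deviation units).
    """
    z = d / math.sqrt(2.0)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Effect sizes quoted above from Donovan and Radosevich (1999).
for label, d in [("complex tasks", 0.07), ("verbal/cognitive", 0.42),
                 ("overall", 0.46), ("simple motor", 0.97)]:
    print(f"{label:>16}: d = {d:.2f} -> {probability_of_superiority(d):.0%}")
```

Under those assumptions d = 0 is a coin flip, and the probability climbs slowly from there, which is one way to see why an effect near d = 0.4 is real without being miraculous.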

Why retrieval, not re-exposure, does the work

Here is the most important distinction in this entire literature: spaced repetition is spaced retrieval, not spaced re-reading.

The single canonical demonstration was published in Science in 2008[14]. Karpicke and Roediger had 120 Washington University undergraduates learn 40 Swahili–English word pairs until they could recall each one. Then the researchers manipulated what happened next. In some conditions, items that had been recalled once stayed in the test pool. In others, recalled items were dropped from testing and kept in the study pool.

A week later, students whose items were repeatedly retrieved recalled 80% of them. Students whose items were repeatedly studied, even though they saw the answer on every trial, recalled only 36%. Re-exposure produced almost no benefit. Retrieval produced essentially all of it.

That is why spaced-repetition apps show you a question and wait for an answer instead of just flashing the card.

That 80%-versus-36% number deserves a caveat, because it gets repeated everywhere. It compares against an unusually unfavourable baseline: items dropped from further testing once they had been recalled once. The conservative test-vs-equal-restudy meta-analytic estimate is g = 0.50 (Rowland 2014)[16], which works out to roughly half a standard deviation. Still large, but not four standard deviations. Both numbers are real; they answer different questions.

The other essential finding is that the testing effect grows with the retention interval. In a 2006 study Roediger and Karpicke showed that when students were tested five minutes after studying a prose passage, repeated reading outperformed repeated testing[15]. But the advantage flipped at two days and grew larger at one week. Carpenter and colleagues extended the comparison out to 42 days and found the same pattern[18]. Rowland's meta-analysis confirmed it across 159 effect sizes: testing's advantage was g = 0.41 at retention intervals under a day, but g = 0.69 once the interval crossed a day. The longer the gap before the final test, the more testing wins.

Figure: Restudy wins now; retrieval wins later. Roediger & Karpicke (2006): proportion recalled at three retention intervals.

Source: Roediger & Karpicke (2006). Repeated studying outperforms repeated testing on an immediate test, but the lines have crossed by the two-day test, and at one week testing is well ahead. The benefit of retrieval grows with the retention interval.

For students preparing for an exam tomorrow, this is a small effect. For anyone trying to remember something a month or a year from now, it's the difference between a strategy that works and one that doesn't.

A common objection is that retrieval practice is "just memorisation": useful for vocabulary, but not for the kind of conceptual understanding real learning requires. In 2011 Karpicke and Blunt published a Science paper that addressed this directly[21]. Students read science texts and then either re-studied them, drew elaborative concept maps, or practised retrieval. A week later, retrieval practice beat concept mapping on a short-answer test, and it did so again in a second experiment where the final test was itself concept-map creation. The retrieval benefit is not a quirk of paired-associate learning. It extends to building usable conceptual knowledge from text.

The meta-analysis by Adesope, Trevisan and Sundararajan in 2017 synthesises 272 effect sizes from 188 experiments[17] and confirms the pattern across age groups, education levels, and material types. Practice testing beats restudying (g = 0.51) and substantially beats doing nothing (g = 0.93). Roediger and Butler's 2011 Trends in Cognitive Sciences review adds the consistent finding that feedback amplifies the testing benefit[22], relevant for flashcard apps, which almost always show the answer after the user's attempt.

The mechanism is biochemical

The reason spacing works is not psychological in any deep sense. It is biochemical.

When you learn something, neurons fire and trigger a cascade of intracellular signals: calcium enters the cell, cAMP rises, protein kinase A switches on, the extracellular signal-regulated kinase ERK activates, and a transcription factor called CREB ultimately turns on the genes that build the new proteins your synapses need to hold a long-term memory. Smolen, Zhang and Byrne's 2016 review in Nature Reviews Neuroscience lays out the evidence that this cascade has time constants[23]. Hit it too quickly (massed learning) and you saturate the system. Spacing your sessions lets the cascade peak, reset, and re-peak. That re-peaking is what reliably drives the CREB-dependent transcription that late-phase long-term potentiation (L-LTP) requires.

The cellular work is some of the most beautiful evidence in the literature, because the timing of the behavioural spacing effect maps directly onto the timing of the underlying molecular events. In Aplysia (the sea slugs Eric Kandel made famous), Philips, Tzvetkova and Carew showed in 2007 that two training shocks produced long-term memory only when delivered 45 minutes apart[24]. Fifteen minutes was too soon; 60 minutes was too late. Independently, they measured MAPK activation in the same neurons and found it peaked transiently, at 45 minutes after a single shock. The behavioural window and the molecular window coincided exactly.

The story repeats at the mammalian level. Kramár and colleagues showed in 2012 that in rat hippocampal slices, additional theta-burst stimulation added new LTP on top of previously saturated potentiation only if the bouts were spaced at least an hour apart[25]. Seese and colleagues in 2014 used a Fragile X mouse model to demonstrate that three brief training trials spaced 60 minutes apart rescued memory in animals whose massed-training memory was impaired[26], and that pharmacologically blocking the ERK1/2 pathway abolished the rescue. Computational models by Kim and colleagues in 2010 predict that the L-LTP threshold for PKA dependence emerges at intervals above roughly 60 seconds[27], because shorter intervals deplete adenylyl cyclase before the cAMP signal can summate.

In November 2024, Kukushkin and colleagues at NYU pushed the framework into territory that surprised even cellular neuroscientists[30]. They engineered human kidney cells to express a CREB-driven reporter, then "trained" the cells with chemical pulses spaced minutes apart or delivered as a single massed dose. Four spaced pulses produced about 2.8 times more CREB activity at 24 hours than the same total stimulation delivered all at once. ERK or CREB inhibitors abolished the difference. The press headlines about "memories live in your kidneys" overshot what the paper actually shows. But it is strong evidence that the spacing effect is a generic property of certain signalling cascades. Memory neurons are special because they hooked this machinery up to behaviour, not because they invented the spacing effect.

Figure: Spacing is a fractal phenomenon. The same principle, five orders of magnitude apart in time.

Timescale | Level | Source | What the interval marks
~60 sec | Sub-cellular | Kim, Huang, Abel & Blackwell (2010) | Inter-train interval above which L-LTP becomes PKA-dependent.
~45 min | Cellular (invertebrate) | Philips, Tzvetkova & Carew (2007) | MAPK activation window for two-trial long-term memory in Aplysia.
~60 min | Circuit (mammalian) | Kramár et al. (2012); Seese et al. (2014) | Inter-bout interval at which additional LTP appears in rodent hippocampus.
~12 hr | Systems (sleep) | Mazza et al. (2016) | A night between sessions roughly halves the trials needed to reach criterion.
days to weeks | Behavioural (human) | Cepeda et al. (2008) | Optimal gap between sessions for week-to-year retention intervals.

The “right” interval depends on which signalling cascade you’re trying to catch, and the cascades operate at every scale from seconds (PKA) to weeks (human study schedules).

Sleep is not just a gap. Diekelmann and Born's 2010 review of the memory function of sleep summarises decades of work showing that slow-wave sleep actively replays hippocampal activity and redistributes new memories to the neocortex[28]. The practical consequence is striking: Mazza and colleagues in 2016 showed that learners who slept between two French-Swahili vocabulary sessions needed roughly half as many trials to re-reach 100% accuracy as learners who stayed awake for the same 12 hours[29]. The retention advantage persisted at one week and was still visible at six months. A study schedule that spans at least one night between sessions is qualitatively different from one that crams them into a single waking period.

The cellular timescales (minutes to hours) are not the same as the optimal gaps a human studying for an exam should use. The molecular work doesn't tell you whether to review your Spanish vocabulary tomorrow or next week. But it tells you something more important: the spacing effect isn't a happy accident of cognition. It is what the underlying biology requires to make memory durable.

From paper schedules to FSRS

Spacing only works in practice if something decides when to show you each card. The history of how that "something" has been built is, surprisingly, a story with very thin empirical roots until quite recently.

The first formal proposal was Paul Pimsleur's graduated-interval recall, published in 1967 in the Modern Language Journal[31]. He suggested reviewing each item at exponentially expanding intervals: 5 seconds, 25 seconds, 2 minutes, 10 minutes, and so on out to 2 years. It was a theoretical proposal with no experiment behind it; the schedule cannot adapt to individual learners or to items of different difficulty. Five years later, the German journalist Sebastian Leitner described his box system: get a card right, move it to the next box; get it wrong, send it back. Both schemes work. Both are blind to the obvious fact that some cards are harder than others, and some learners are faster than others.
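Both paper-era schemes fit in a few lines. The sketch below assumes Pimsleur's intervals follow a constant multiplier of 5 (our reading of the first four intervals quoted above, not a quotation from the 1967 paper) and assumes a five-box Leitner variant in which a miss sends the card back to box 1. The box count and all function names are illustrative.

```python
def pimsleur_intervals(steps: int = 11, base_seconds: float = 5.0,
                       multiplier: float = 5.0) -> list[float]:
    """Graduated-interval schedule in seconds: 5, 25, 125, 625, ...

    ASSUMPTION: a constant multiplier of 5, inferred from the first intervals.
    """
    return [base_seconds * multiplier ** i for i in range(steps)]


def leitner_next_box(box: int, correct: bool, n_boxes: int = 5) -> int:
    """Move a card up one box on success, back to box 1 on failure.

    ASSUMPTION: five boxes and a full reset on failure; variants differ.
    """
    if not 1 <= box <= n_boxes:
        raise ValueError("box out of range")
    return min(box + 1, n_boxes) if correct else 1


def humanise(seconds: float) -> str:
    """Format a duration using the largest unit that gives a value >= 1."""
    for unit, size in [("years", 365 * 86400), ("days", 86400),
                       ("hours", 3600), ("minutes", 60)]:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} seconds"


print([humanise(s) for s in pimsleur_intervals()])

box = 1
for answer in (True, True, False, True):
    box = leitner_next_box(box, answer)
print("card ends in box", box)
```

Neither function takes the card or the learner as an input beyond a right/wrong answer, which is exactly the limitation described above. With a strict multiplier of 5, the later intervals only approximately match the rounded figures usually quoted for Pimsleur's schedule.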

The first per-item adaptive algorithm was SM-2, written by a Polish student named Piotr Wozniak in 1987 as the practical core of his master's thesis[32]. SM-2 added a single number per card (the easiness factor, starting at 2.5) that grew when you got the card right and shrank when you didn't. After two fixed starting intervals, each new interval was the previous interval multiplied by the easiness factor. Wozniak reported 89% retention on 10,255 English vocabulary items, studying about 41 minutes a day for a year.
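The update rule is short enough to sketch. The version below follows the commonly published SM-2 description as best we can reconstruct it: answers are self-graded 0 to 5, the first two successful intervals are fixed at 1 and 6 days, and the easiness factor never drops below 1.3. Those details go beyond what this article states, and implementations differ on the fine points, so treat them as assumptions; the class and function names are ours.

```python
from dataclasses import dataclass


@dataclass
class Card:
    easiness: float = 2.5   # starting easiness factor
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # gap until the next review


def sm2_review(card: Card, quality: int) -> Card:
    """Apply one SM-2 style review; quality is a 0-5 self-grade.

    ASSUMPTION: the constants (1 day, 6 days, floor of 1.3) and the choice
    to adjust easiness after every review follow a common reading of SM-2.
    """
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")

    if quality < 3:
        # Failed recall: restart the repetition sequence for this card.
        repetitions, interval = 0, 1
    else:
        repetitions = card.repetitions + 1
        if repetitions == 1:
            interval = 1
        elif repetitions == 2:
            interval = 6
        else:
            interval = round(card.interval_days * card.easiness)

    # Easiness rises slightly on a perfect answer and falls on hard ones.
    miss = 5 - quality
    easiness = max(1.3, card.easiness + 0.1 - miss * (0.08 + miss * 0.02))
    return Card(easiness, repetitions, interval)


card = Card()
for grade in (5, 4, 5, 3):
    card = sm2_review(card, grade)
    print(f"grade {grade}: next review in {card.interval_days} days, "
          f"easiness {card.easiness:.2f}")
```

The whole per-card state is three numbers, and nothing in the rule is fitted to data: every constant was chosen by hand.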

Consider what that means. SM-2 was a single-subject self-experiment. It was never RCT'd. It is also, three decades later, the algorithm that runs Anki and, with minor variations, most of the spaced-repetition apps you have ever used. Twenty-five years of flashcard software rests on a master's thesis. Adoption is not evidence.

In 2016 the field shifted. Burr Settles and Brendan Meeder at Duolingo published a paper at the ACL conference describing Half-Life Regression: the first algorithm trained on real user data rather than hand-tuned by its inventor[33]. Their model fit a forgetting curve to 12.9 million practice traces from Duolingo learners and used it to predict how long until each word would slip below half-recall. HLR cut prediction error by about 45% versus Duolingo's previous Leitner-based scheduler. A live A/B test on 3.3 million students lifted daily app activity by 12%.
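The core of Half-Life Regression is a two-line model: recall probability halves every h days, and h is an exponential function of a weighted feature sum. The sketch below assumes that published form (p = 2^(-Δ/h), h = 2^(θ·x)); the feature names and weights are invented for illustration and are not Duolingo's fitted values.

```python
import math


def half_life_days(weights: dict[str, float], features: dict[str, float]) -> float:
    """Estimated half-life h = 2 ** (theta . x), in days."""
    score = sum(weights[name] * value for name, value in features.items())
    return 2.0 ** score


def recall_probability(days_since_practice: float, half_life: float) -> float:
    """Predicted recall p = 2 ** (-delta / h)."""
    return 2.0 ** (-days_since_practice / half_life)


# HYPOTHETICAL weights and features, for illustration only.
weights = {"bias": 1.0, "sqrt_times_correct": 0.8, "sqrt_times_wrong": -0.5}
features = {"bias": 1.0,
            "sqrt_times_correct": math.sqrt(9),
            "sqrt_times_wrong": math.sqrt(1)}

h = half_life_days(weights, features)
for delta in (0.0, h, 2 * h):
    print(f"{delta:6.1f} days since practice -> p = {recall_probability(delta, h):.2f}")
```

In the real system the weights are learned from practice logs rather than set by hand, which is the shift the paragraph above describes.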

Buried in the paper is a caveat: even the best model achieved an AUC of only 0.54 on individual recall, barely above chance. Schedulers are good at aggregate timing, not at predicting what any one learner will remember tomorrow.

The current state of the art descends from a 2022 KDD paper by Ye and colleagues at MaiMemo, a Chinese language-learning app[34]. They modelled 220 million review logs and built a scheduler they called Stochastic Shortest Path – Minimise Memorisation Cost, which chooses review times to minimise expected total review cost subject to a target retention. Their memory model represents each item with three numbers: Difficulty, Stability, and Retrievability. This conceptual step (splitting how strong the memory is right now from how fast it decays) is what separates a 1990s algorithm from a 2020s one.

The Free Spaced Repetition Scheduler (FSRS), descended from Ye et al.'s work and refined by an open-source community, has been built into Anki since version 23.10 (late 2023). An open benchmark on 9,999 Anki user collections (about 350 million reviews) reports that FSRS produces more accurate recall predictions than SM-2 on 99.6% of users[35]. Simulations suggest students need roughly 20 to 30% fewer reviews to hit the same retention.
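The split between stability and retrievability can be sketched with a single forgetting curve. The version below assumes the power-law form used by FSRS-4.5, in which stability S is the interval at which predicted recall falls to 90%. That formula and its constants come from the open-source FSRS documentation as we understand it, not from this article, and the real scheduler also updates Difficulty and Stability after every review, which is omitted here.

```python
# ASSUMPTION: constants as documented for FSRS-4.5; other versions differ.
DECAY = -0.5
FACTOR = 19.0 / 81.0  # chosen so that retrievability(S, S) equals 0.9


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted recall probability after elapsed_days for a given stability."""
    return (1.0 + FACTOR * elapsed_days / stability) ** DECAY


def next_interval(stability: float, target_retention: float = 0.9) -> float:
    """Days until predicted recall decays to target_retention."""
    return stability / FACTOR * (target_retention ** (1.0 / DECAY) - 1.0)


for s in (1.0, 10.0, 100.0):
    print(f"S = {s:5.1f} d: R after 30 d = {retrievability(30, s):.2f}, "
          f"review at 85% target in {next_interval(s, 0.85):.1f} d")
```

Retrievability changes with every day that passes; stability changes only when a review happens. Keeping them separate is what lets the scheduler aim at a chosen retention target instead of a fixed multiplier.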

Figure: From paper boxes to neural-network schedulers. Major spaced-repetition algorithms, 1885–2024, grouped as foundational, hand-tuned (orange), and data-driven or machine-learned (purple). The conceptual lineage shows a clear inflection around 2016, when machine learning replaced hand-tuned formulas.

No peer-reviewed RCT has tested the obvious question: whether students using FSRS actually outperform students using SM-2. The 20-to-30% reduction is from simulation on retrospective log data, not from a controlled classroom trial. SuperMemo's modern algorithms (SM-11 through SM-18) are proprietary and have never been independently benchmarked. The engineering case for FSRS is genuinely strong. The clinical-trial-style evidence is mostly absent.

A note on expanding intervals

Sticky, like Anki, SuperMemo, and FSRS, schedules expanding intervals. Cards you get right come back later; cards you get wrong come back sooner. This is the convention that everyone in the field follows.

The empirical case for expansion specifically, as opposed to simply longer absolute spacing, is much thinner than the convention's universality implies. In 2007 Karpicke and Roediger directly compared expanding (1-5-9 trials between tests) and equally spaced (5-5-5) schedules[20]. On a 10-minute final test, expanding won by about 10 percentage points, replicating an earlier finding by Landauer and Bjork. But on a 2-day delayed test, equally spaced retrieval produced better long-term retention. What mattered was delaying the first retrieval, not the shape of the schedule that followed.

Four years later Karpicke and Bauernschmidt ran the cleaner version of the experiment[19]. They crossed three absolute-spacing levels (short, medium, long) with three relative-spacing schedules (expanding, equal, contracting), plus a no-spacing control. Long absolute spacing produced 75% week-later recall; massed retrieval produced 25%. A 200% improvement, in their words. But within each absolute-spacing condition, the expanding, equal, and contracting bars were statistically indistinguishable.
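The design is easier to see with the schedules written out. The sketch below builds expanding, equal, and contracting schedules that share the same total number of intervening trials, using the 1-5-9 and 5-5-5 patterns from the 2007 study as the template. The contracting 9-5-1 pattern and the helper name are our illustration, not the exact materials from either paper.

```python
def schedule(shape: str, mean_gap: int = 5, step: int = 4) -> list[int]:
    """Three gaps (trials between successive tests) with a fixed total.

    ASSUMPTION: three retrievals and a symmetric step around the mean gap.
    """
    patterns = {
        "expanding": [mean_gap - step, mean_gap, mean_gap + step],
        "equal": [mean_gap, mean_gap, mean_gap],
        "contracting": [mean_gap + step, mean_gap, mean_gap - step],
    }
    if shape not in patterns:
        raise ValueError(f"unknown shape: {shape}")
    return patterns[shape]


for shape in ("expanding", "equal", "contracting"):
    gaps = schedule(shape)
    print(f"{shape:>11}: gaps {gaps}, total spacing {sum(gaps)}")
```

Holding the total constant is what lets an experiment separate absolute spacing (the sum of the gaps) from schedule shape (their order).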

Figure: Absolute spacing drives the ladder. Schedule shape doesn’t. Karpicke & Bauernschmidt (2011): one-week recall by absolute spacing × schedule (expanding, equal, contracting).

Source: Karpicke & Bauernschmidt (2011). Absolute spacing produces a 200% improvement in long-term retention. But within each level of absolute spacing, expanding, equal, and contracting schedules (the templates that flashcard apps fight over) are statistically indistinguishable.

The lesson for a flashcard app, and for Sticky specifically, is that the technique works because of the long absolute intervals between reviews, not because of the expanding pattern that those intervals happen to follow. Expanding intervals are a reasonable engineering convention. They are not a proven optimum. The 200% improvement in long-term retention comes from spacing your reviews over weeks and months. The exact shape of that schedule is a smaller second-order effect.

This is the kind of detail that gets buried in product marketing. We think it shouldn't.

Does it work in the wild?

Lab studies are tractable; classrooms are not. The most important question about spaced repetition isn't whether it works in a 90-minute session with undergraduates memorising trivia. It's whether the effect survives the transition into messy real environments, and what its limits are when it does.

The strongest evidence comes from Harry Bahrick's life work on language vocabulary retention. In 1984 he tested 733 adults on Spanish they had learned in high school or college, with retention intervals up to fifty years[8]. The forgetting curve was steep for the first three to six years, and then, for the next twenty-five years, it essentially stopped. Bahrick called the plateau the permastore. Students who had taken five or more Spanish courses still recognised more than 60% of the vocabulary a quarter-century later. What predicted permastore retention wasn't how often they used Spanish afterwards. It was how thoroughly they had learned it in the first place.

The Bahrick family then ran a nine-year self-experiment to ask how spacing affected the path into permastore[7]. Each of four authors learned 300 English-foreign-language word pairs under different intersession schedules. The headline finding: 13 relearning sessions spaced 56 days apart produced the same five-year retention as 26 sessions spaced 14 days apart. Half the work for the same outcome. The 56-day group acquired the words slightly more slowly during training, but that effort during learning was exactly what made the memory stick.

The study has a sample size of four. It is also irreplaceable, because nobody else has tracked the same material in the same subjects across nine years. It earns its credibility from converging evidence (Cepeda's 1,354-subject Internet study, Sobel's fifth-grade classroom, and the Bahrick & Phelps 8-year Spanish vocabulary study[9]), not from its standalone power.

In medical education, where the consequences of forgetting are unusually serious, the evidence base is stronger. Kerfoot and colleagues' 2007 RCT on third-year medical students reported Cohen's d = 1.01 on year-end tests when spaced-education emails had finished six to eight months earlier[37], the kind of effect size that is hard to ignore in education research. A separate RCT the same year tested 537 urology residents and found that spaced delivery beat bolus delivery on online tests[36]. Larsen, Butler and Roediger's 2009 RCT on medical and nurse-practitioner residents showed repeated testing produced 6-month retention scores about 13 percentage points higher than repeated studying[38]. A 2013 follow-up found testing produced large effects (η² = 0.33) on 6-month free recall, with self-explanation adding a smaller boost on top[39].

The caveats matter. Kerfoot's urology RCT found that the spaced-education benefit did not transfer to the broader Urology In-Service Examination: spaced delivery helped on tests that resembled what residents had practised, but not on a broader exam covering material they had not practised. Most applied "spaced repetition" studies also confound spacing with testing. They can't tell you which component is doing the work, because the intervention is naturally delivered as both at once.

The industry-funded evidence is the largest by volume and the most conflicted by source. The Duolingo HLR paper is the deepest published well of real-world data. Wothe and colleagues' 2023 observational survey at one US medical school reported that daily Anki users scored about 4.5 points higher on USMLE Step 1 (p = 0.039), but with an 18% response rate and no randomisation, the causal claim is weak[40]. The Anki-and-USMLE literature is suggestive, not dispositive: students who self-select into a demanding workflow are likely also more conscientious, study longer, and have stronger prior preparation.

On balance: spaced repetition works outside the lab, but the effect is messier, smaller, and conditional on the learner's engagement. No commercial flashcard app has yet been validated by an independent randomised trial against an active comparator.

Where spaced repetition underperforms

A real literature review owes its reader the inconvenient parts. The case for spaced repetition is, on balance, well-supported, but the case has limits, and the limits are interesting.

Learners are bad judges of their own learning

The most uncomfortable finding in this literature is also the most replicated. In Nate Kornell's 2009 flashcard experiments, 90 percent of participants objectively learned more in the spaced condition than the massed one[42]. Yet a majority told the experimenters afterwards that they thought massing had worked better. Massed practice feels productive in the moment. Spaced practice surfaces what you've forgotten since the last review, which feels like failure even though it is the work.

Karpicke, Butler and Roediger surveyed 177 Washington University undergraduates in 2009 about how they actually study[43]. 83.6% reported rereading as a primary strategy. Only 11% mentioned any form of self-testing. The strategy with the strongest evidence base is the one used least.

Figure: How 177 students study. Practising recall, the strategy psychologists agree works best, is the one students use least. Source: Karpicke, Butler & Roediger (2009).

Soderstrom and Bjork systematised this in a 2015 review they titled Learning versus Performance[41]. Conditions that boost short-term performance can fail to support long-term learning, and sometimes actively harm it. The fluency you feel during massed practice is a signal of current accessibility, not of storage durability. Any app that only shows users their session-level success rate is reinforcing the wrong feedback loop.

Spaced repetition is a retention technology, not a transfer technology

In 2002, Susan Barnett and Stephen Ceci published a 25-year review of the transfer-of-learning literature in Psychological Bulletin[44]. Their conclusion: near transfer is common; far transfer (applying what you learned in one context to a meaningfully new problem) is rare, and the conditions under which it reliably occurs are mostly untested.

Day and Goldstone's 2012 follow-up[45] sharpens the implication for spaced repetition specifically. Transfer requires structural alignment cues during learning, analogical comparison across cases, or abstraction that goes beyond rote retention. Spaced repetition scheduling alone provides none of these. It maximises a function (durable recall of items as practised) that is necessary but not sufficient for higher-order learning. Medical students who know every pharmacology card on Anki still need clinical rotations. Language learners with 5,000 spaced cards still need to speak.

The strongest field evidence is short-window

The largest deployed evaluation of an SR algorithm is Tabibian and colleagues' 2019 PNAS paper, which used about 12 million Duolingo sessions to derive an "optimal" review schedule[47]. Users whose behaviour aligned more closely with the optimum forgot less than those who didn't. It's a real result. It also covers, in the authors' own words, "only 2 wk" of observation: a limitation they explicitly call out as central. The marketing claim that spaced repetition helps you remember things for years is well-supported by Bahrick's permastore work. It is not supported, yet, by modern app-level data on real users at scale.

Doing it is the hard part

Dunlosky and colleagues' comprehensive 2013 review for Psychological Science in the Public Interest evaluated ten learning techniques and rated only two as high utility: distributed practice and practice testing[46]. But the rating came with a load-bearing condition the popular write-ups skip: "Students require explicit instruction in scheduling study sessions across multiple weeks. […] Most students begin to prepare and study only when they are reminded that the next exam is tomorrow. By that point, cramming is their only option."

The technique requires the learner to actually do it. The largest gains in the literature are recorded in highly motivated populations (medical students, intrinsically motivated language learners) who keep using the technique long enough for it to compound. The effect on a disengaged or low-motivation learner is probably much smaller, but no one has cleanly measured it.

A 1988 critique that has aged well

Frank Dempster wrote a paper in 1988 titled "The spacing effect: A case study in the failure to apply the results of psychological research"[13]. His thesis was that even a well-established psychological effect can fail to penetrate education when the conditions that demonstrate it in the lab (short retention intervals, controlled materials, passive learners) are weak proxies for the classroom. The digital tools that have appeared since 1988 partially address some of Dempster's nine impediments. The lab-to-classroom translation problem he named is, nearly four decades later, still very much with us.

Bjork & Bjork's reframing

The most useful single concept for putting all of this together comes from Robert and Elizabeth Bjork's 2011 chapter on desirable difficulties[12]. The conditions that produce the most durable learning (spacing, retrieval, interleaving, variability) are precisely the conditions that make practice feel harder. Cramming feels like learning. Spacing feels like forgetting. The discomfort is the signal that the work is happening.

What this means for how you actually study

The article so far has been about evidence. This section is about what falls out of it. No new citations; just the practical synthesis.

Review just before you forget, not just after you learn. The optimal gap is roughly 10 to 20% of how long you want to remember. For a test one week away, space your reviews a day or two apart. For a test a year away, push the gaps to weeks or a month.
Err long, not short. The recall curve climbs steeply to its peak and then declines gradually. A gap that's too long costs you less than a gap that's too short.
Retrieve, don't re-read. Cover the answer. Try to recall it. Then check. The benefit grows the further out the test is.
Sleep is a non-trivial gap. Schedule sessions that span nights, not just hours. The consolidation that happens overnight is qualitatively different from the same elapsed time spent awake.
Don't trust how it feels. Massed practice feels productive; spaced practice feels like you're failing. Trust the delayed test, not the cram.
Use spaced repetition for the retention layer. Use it for vocabulary, facts, definitions, formulas: anything where the goal is to reliably retrieve the right answer when prompted. For higher-order skills (clinical reasoning, fluent speaking, problem-solving) use it as the foundation, then build on top of it with practice that requires the integration spaced retrieval doesn't.
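The first two rules can be turned into a small calculation. This is a minimal sketch, not a scheduler: the function name and defaults are hypothetical, the default fractions are the 10 to 20% rule of thumb, and the two example calls plug in the ranges reported for Cepeda and colleagues' one-week and one-year tests (20 to 40% and 5 to 10%).

```python
from typing import Tuple


def review_gap_days(retention_days: float,
                    low: float = 0.10,
                    high: float = 0.20) -> Tuple[float, float]:
    """Return (shortest, longest) suggested gap between reviews, in days.

    retention_days: how long after the last review you need to remember.
    low, high: gap as a fraction of that interval. The defaults are a
    rule of thumb, not fitted values; the best fraction shrinks as the
    retention interval grows.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    if not 0 < low <= high:
        raise ValueError("need 0 < low <= high")
    return retention_days * low, retention_days * high


# Fractions taken from the ranges quoted for one-week and one-year tests.
for label, days, low, high in [("one week", 7, 0.20, 0.40),
                               ("one year", 365, 0.05, 0.10)]:
    shortest, longest = review_gap_days(days, low, high)
    print(f"{label}: gaps of about {shortest:.1f} to {longest:.1f} days")
```

Because a gap that is too long costs less than one that is too short, leaning toward the upper end of the returned range is the safer default.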

The science is unusually clear on what works. The hard part is the doing.

Key sources

Every empirical claim in this article is linked to a primary source below. Citations are numbered in the order they first appear.

Ebbinghaus, H. (1885/1913). Memory: A contribution to experimental psychology (H. A. Ruger & C. E. Bussenius, Trans.). Teachers College, Columbia University.
Murre, J. M. J., & Dros, J. (2015). Replication and analysis of Ebbinghaus' forgetting curve. PLOS ONE, 10(7), e0120644.
Source DOI: 10.1371/journal.pone.0120644
Rubin, D. C., & Wenzel, A. E. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103(4), 734–760.
Source DOI: 10.1037/0033-295X.103.4.734
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological Science, 2(6), 409–415.
Source DOI: 10.1111/j.1467-9280.1991.tb00175.x
Cepeda, N. J., Pashler, H., Vul, E., Wixted, J. T., & Rohrer, D. (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin, 132(3), 354–380.
Source DOI: 10.1037/0033-2909.132.3.354
Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science, 19(11), 1095–1102.
Source DOI: 10.1111/j.1467-9280.2008.02209.x
Bahrick, H. P., Bahrick, L. E., Bahrick, A. S., & Bahrick, P. E. (1993). Maintenance of foreign language vocabulary and the spacing effect. Psychological Science, 4(5), 316–321.
Source DOI: 10.1111/j.1467-9280.1993.tb00571.x
Bahrick, H. P. (1984). Semantic memory content in permastore: Fifty years of memory for Spanish learned in school. Journal of Experimental Psychology: General, 113(1), 1–35.
Source DOI: 10.1037/0096-3445.113.1.1
Bahrick, H. P., & Phelps, E. (1987). Retention of Spanish vocabulary over 8 years. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13(2), 344–349.
Source DOI: 10.1037/0278-7393.13.2.344
Sobel, H. S., Cepeda, N. J., & Kapler, I. V. (2011). Spacing effects in real-world classroom vocabulary learning. Applied Cognitive Psychology, 25(5), 763–767.
Source DOI: 10.1002/acp.1747
Donovan, J. J., & Radosevich, D. J. (1999). A meta-analytic review of the distribution of practice effect: Now you see it, now you don't. Journal of Applied Psychology, 84(5), 795–805.
Source DOI: 10.1037/0021-9010.84.5.795
Bjork, E. L., & Bjork, R. A. (2011). Making things hard on yourself, but in a good way: Creating desirable difficulties to enhance learning. In M. A. Gernsbacher, R. W. Pew, L. M. Hough, & J. R. Pomerantz (Eds.), Psychology and the real world: Essays illustrating fundamental contributions to society (pp. 56–64). Worth Publishers.
Dempster, F. N. (1988). The spacing effect: A case study in the failure to apply the results of psychological research. American Psychologist, 43(8), 627–634.
Source DOI: 10.1037/0003-066X.43.8.627
Karpicke, J. D., & Roediger, H. L. (2008). The critical importance of retrieval for learning. Science, 319(5865), 966–968.
Source DOI: 10.1126/science.1152408
Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255.
Source DOI: 10.1111/j.1467-9280.2006.01693.x
Rowland, C. A. (2014). The effect of testing versus restudy on retention: A meta-analytic review of the testing effect. Psychological Bulletin, 140(6), 1432–1463.
Source DOI: 10.1037/a0037559
Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701.
Source DOI: 10.3102/0034654316689306
Carpenter, S. K., Pashler, H., Wixted, J. T., & Vul, E. (2008). The effects of tests on learning and forgetting. Memory & Cognition, 36(2), 438–448.
Source DOI: 10.3758/MC.36.2.438
Karpicke, J. D., & Bauernschmidt, A. (2011). Spaced retrieval: Absolute spacing enhances learning regardless of relative spacing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(5), 1250–1257.
Source DOI: 10.1037/a0023436
Karpicke, J. D., & Roediger, H. L. (2007). Expanding retrieval practice promotes short-term retention, but equally spaced retrieval enhances long-term retention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(4), 704–719.
Source DOI: 10.1037/0278-7393.33.4.704
Karpicke, J. D., & Blunt, J. R. (2011). Retrieval practice produces more learning than elaborative studying with concept mapping. Science, 331(6018), 772–775.
Source DOI: 10.1126/science.1199327
Roediger, H. L., & Butler, A. C. (2011). The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences, 15(1), 20–27.
Source DOI: 10.1016/j.tics.2010.09.003
Smolen, P., Zhang, Y., & Byrne, J. H. (2016). The right time to learn: Mechanisms and optimization of spaced learning. Nature Reviews Neuroscience, 17(2), 77–88.
Source DOI: 10.1038/nrn.2015.18
Philips, G. T., Tzvetkova, E. I., & Carew, T. J. (2007). Transient mitogen-activated protein kinase activation is confined to a narrow temporal window required for the induction of two-trial long-term memory in Aplysia. Journal of Neuroscience, 27(50), 13701–13705.
Source DOI: 10.1523/JNEUROSCI.4262-07.2007
Kramár, E. A., Babayan, A. H., Gavin, C. F., Cox, C. D., Jafari, M., Gall, C. M., Rumbaugh, G., & Lynch, G. (2012). Synaptic evidence for the efficacy of spaced learning. Proceedings of the National Academy of Sciences, 109(13), 5121–5126.
Source DOI: 10.1073/pnas.1120700109
Seese, R. R., Wang, K., Yao, Y. Q., Lynch, G., & Gall, C. M. (2014). Spaced training rescues memory and ERK1/2 signaling in fragile X syndrome model mice. Proceedings of the National Academy of Sciences, 111(47), 16907–16912.
Source DOI: 10.1073/pnas.1413335111
Kim, M., Huang, T., Abel, T., & Blackwell, K. T. (2010). Temporal sensitivity of protein kinase A activation in late-phase long-term potentiation. PLOS Computational Biology, 6(2), e1000691.
Source DOI: 10.1371/journal.pcbi.1000691
Diekelmann, S., & Born, J. (2010). The memory function of sleep. Nature Reviews Neuroscience, 11(2), 114–126.
Source DOI: 10.1038/nrn2762
Mazza, S., Gerbier, E., Gustin, M.-P., Kasikci, Z., Koenig, O., Toppino, T. C., & Magnin, M. (2016). Relearn faster and retain longer: Along with practice, sleep makes perfect. Psychological Science, 27(10), 1321–1330.
Source DOI: 10.1177/0956797616659930
Kukushkin, N. V., Carney, R. E., Tabassum, T., & Carew, T. J. (2024). The massed-spaced learning effect in non-neural human cells. Nature Communications, 15, 9635.
Source DOI: 10.1038/s41467-024-53922-x
Pimsleur, P. (1967). A memory schedule. The Modern Language Journal, 51(2), 73–75.
Source DOI: 10.1111/j.1540-4781.1967.tb06700.x
Woźniak, P. A. (1990). Optimization of Learning (Master's thesis). University of Technology in Poznań, Poland. Algorithm SM-2 documentation: https://super-memory.com/english/ol/sm2.htm
Settles, B., & Meeder, B. (2016). A trainable spaced repetition model for language learning. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1848–1858.
Source DOI: 10.18653/v1/P16-1174
Ye, J., Su, J., & Cao, Y. (2022). A stochastic shortest path algorithm for optimizing spaced repetition scheduling. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4381–4390.
Source DOI: 10.1145/3534678.3539081
Expertium & Ye, J. (2024). Benchmark of Spaced Repetition Algorithms. open-spaced-repetition project. https://expertium.github.io/Benchmark.html
Kerfoot, B. P., DeWolf, W. C., Masser, B. A., Church, P. A., & Federman, D. D. (2007). Spaced education to teach urology residents: A randomized controlled trial. Journal of Urology, 177(4), 1481–1487.
Source DOI: 10.1016/j.juro.2006.11.074
Kerfoot, B. P., Baker, H. E., Koch, M. O., Connelly, D., Joseph, D. B., & Ritchey, M. L. (2007). Spaced education improves the retention of clinical knowledge by medical students: A randomised controlled trial. Medical Education, 41(1), 23–31.
Source DOI: 10.1111/j.1365-2929.2006.02644.x
Larsen, D. P., Butler, A. C., & Roediger, H. L. (2009). Repeated testing improves long-term retention relative to repeated study: A randomised controlled trial. Medical Education, 43(12), 1174–1181.
Source DOI: 10.1111/j.1365-2923.2009.03518.x
Larsen, D. P., Butler, A. C., & Roediger, H. L. (2013). Comparative effects of test-enhanced learning and self-explanation on long-term retention. Medical Education, 47(7), 674–682.
Source DOI: 10.1111/medu.12141
Wothe, J. K., Wanberg, L. J., Hohle, R. D., Sakher, A. A., Bosacker, L. E., Khan, F., Olson, A. P. J., & Satin, D. J. (2023). Academic and wellness outcomes associated with use of Anki spaced repetition software in medical school. Journal of Medical Education and Curricular Development, 10.
Source DOI: 10.1177/23821205231173289
Soderstrom, N. C., & Bjork, R. A. (2015). Learning versus performance: An integrative review. Perspectives on Psychological Science, 10(2), 176–199.
Source DOI: 10.1177/1745691615569000
Kornell, N. (2009). Optimising learning using flashcards: Spacing is more effective than cramming. Applied Cognitive Psychology, 23(9), 1297–1317.
Source DOI: 10.1002/acp.1537
Karpicke, J. D., Butler, A. C., & Roediger, H. L. (2009). Metacognitive strategies in student learning: Do students practise retrieval when they study on their own? Memory, 17(4), 471–479.
Source DOI: 10.1080/09658210802647009
Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128(4), 612–637.
Source DOI: 10.1037/0033-2909.128.4.612
Day, S. B., & Goldstone, R. L. (2012). The import of knowledge export: Connecting findings and theories of transfer of learning. Educational Psychologist, 47(3), 153–176.
Source DOI: 10.1080/00461520.2012.696438
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013). Improving students' learning with effective learning techniques: Promising directions from cognitive and educational psychology. Psychological Science in the Public Interest, 14(1), 4–58.
Source DOI: 10.1177/1529100612453266
Tabibian, B., Upadhyay, U., De, A., Zarezade, A., Schölkopf, B., & Gomez-Rodriguez, M. (2019). Enhancing human learning via spaced repetition optimization. Proceedings of the National Academy of Sciences, 116(10), 3988–3993.
Source DOI: 10.1073/pnas.1815156116

About this review

This article was researched and written by the team at Sticky, a spaced-repetition learning app. We have an obvious commercial stake in the technique. We wrote this anyway because we think the popular version of the story understates how much we know and overstates how much of it has been proven. We've tried to give both their due.

