Tracking Japanese Progress: What to Measure and What to Ignore
Tracking Japanese progress means deciding which metrics actually move with your ability and which only feel like they do. When you know how to measure Japanese progress, you do not have to answer "am I improving?" by feel. You can adjust the plan instead of guessing.
Overview
This page splits the metric universe into two lists: a short set worth tracking, and a shorter set to deliberately ignore. For each useful metric, it names the tool that produces the number and the change a stalled reading should trigger.
The goal is not a dashboard with twenty gauges. It is a small set of honest signals, checked on the right schedule, that feed back into your study plan.
There is no Japanese to memorize here. The subject is the instrumentation around your study, so the examples are metrics and tools rather than sentences.
Why Progress Feels Invisible
Acquisition is non-linear and back-loaded
The "plateau" is a recognized phenomenon in language learning: the stage at which learners stop noticing further progress. It typically appears in the move from lower-intermediate toward upper-intermediate and advanced levels.1
Richards frames this plateau as a structural barrier between intermediate and advanced proficiency, not an individual failure. It is characteristic of the transition itself.1
One driver is a gap between receptive skills, such as listening and reading, and productive skills, such as speaking and writing. Learners can make considerable progress in listening and reading while still feeling inadequate in speaking. The skills they practice daily may improve, but the skill they judge themselves by can still feel flat.1
Richards also notes that at the plateau, fluency can advance at the expense of complexity. Vocabulary development can stall through over-use of lower-level words and limited acquisition of advanced words and collocations.1 A learner can therefore be improving on several dimensions while the single self-judged dimension feels flat.
Why a single number lies
The Japanese-Language Proficiency Test (JLPT) itself scores competence in separate sections (Language Knowledge, Reading, Listening), each with its own minimum, because one composite number can hide a weak skill. An examinee can clear the overall pass mark and still fail if one section falls below its minimum. That failure mode is dissected in JLPT Scoring Deep Dive: The Section-Minimum Trap.23
The official "Summary of Linguistic Competence Required for Each Level" describes competence by skill, covering reading and listening plus the underlying vocabulary and grammar.4 That reinforces a basic point: proficiency has several dimensions and cannot be captured by a single number.
If even a standardized exam refuses to collapse skills into one gate, a self-tracked dashboard should not either. This is why J-Compass recommends a small portfolio of metrics rather than one headline number.
What to Measure
Hours invested (not weeks elapsed)
Published JLPT study-hour estimates are expressed as hours of study, not calendar time, and they vary widely by background. Compiled ranges are roughly N5 ≈ 350–500 hours, N4 ≈ 600–1,000 hours, N3 ≈ 950–1,700 hours (no kanji background), N2 ≈ 1,600–2,800 hours, and N1 ≈ 3,000–4,800 hours.5
The same compilation reports that learners with a prior kanji background need materially fewer hours, about 30–50% fewer for several levels.5 That is why hours invested, not weeks on the calendar, is the input that better predicts where a learner stands.
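The kanji-background adjustment is plain arithmetic, and a short sketch makes the resulting ranges concrete. The snippet below is an illustration only. The names `QUOTED_HOURS` and `kanji_adjusted_range` are hypothetical, the quoted ranges are treated as the no-kanji baseline, and the 30–50% reduction is assumed to apply uniformly to every level, which the compilation states only for "several levels".

```python
# Hour ranges as quoted in the text, treated here as the baseline for
# learners without a kanji background (an assumption for illustration).
QUOTED_HOURS = {
    "N5": (350, 500),
    "N4": (600, 1000),
    "N3": (950, 1700),
    "N2": (1600, 2800),
    "N1": (3000, 4800),
}


def kanji_adjusted_range(level, min_cut=0.30, max_cut=0.50):
    """Return an (optimistic, conservative) hour range for a kanji-background learner.

    Optimistic: the low end of the quoted range with the largest cut applied.
    Conservative: the high end of the quoted range with the smallest cut applied.
    """
    low, high = QUOTED_HOURS[level]
    return round(low * (1 - max_cut)), round(high * (1 - min_cut))


if __name__ == "__main__":
    for level in QUOTED_HOURS:
        print(level, kanji_adjusted_range(level))
```

The width of the adjusted range is the point: even with background accounted for, the estimate stays a band rather than a single target.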
Hours can come from immersion trackers, time trackers, and Anki's built-in time statistics, which report time spent reviewing.6 The mechanics of any specific timer belong to its own J-Compass tool article; this page only notes that such tools produce the number.
J-Compass treats logged hours as a leading input: necessary but not sufficient. The outcome metrics below confirm whether the hours converted into ability.
Deck size and retention
In FSRS (Free Spaced Repetition Scheduler), Anki's current scheduling algorithm, desired retention is a target you set. The Anki manual describes desired retention as the value that "controls how likely you are to remember cards when they are scheduled for a review."7
Desired retention is distinct from measured, or true, retention, which is your actual recall performance. Anki reports the latter in its statistics through the answer-buttons graph and the true-retention figures.76
The manual keeps desired retention within a sensible band and warns that workload rises steeply near the top. It advises keeping the value below about 97% and notes that review load increases very quickly above roughly 90%. The permissible range is 0.70–0.97, widened to 0.70–0.99 in newer versions.78
Mature-card count and true-retention percentage are the meaningful pair to watch. Raw reviews per day is not, because a high review count at low retention reflects re-learning churn rather than durable knowledge.76 For why scheduled review produces durable recall in the first place, see Spaced Repetition and the Forgetting Curve: Why Reviewing on a Schedule Works.
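To make the difference between the target and the measurement concrete, here is a minimal sketch that computes a pass rate from review tallies and compares it with the desired-retention setting. It is a simplification under stated assumptions: `ReviewTally`, `measured_retention`, and `retention_gap` are hypothetical names, the tallies are assumed to be copied by hand from the statistics screen, and Anki's own true-retention calculation may count reviews differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewTally:
    """Hand-copied counts for one period (hypothetical structure)."""
    passed: int  # reviews answered with anything other than "Again"
    failed: int  # reviews answered "Again"


def measured_retention(tally: ReviewTally) -> Optional[float]:
    """Share of reviews recalled successfully, or None if there were no reviews."""
    total = tally.passed + tally.failed
    if total == 0:
        return None
    return tally.passed / total


def retention_gap(desired: float, tally: ReviewTally) -> Optional[float]:
    """Measured minus desired retention; negative means recall is below the target."""
    measured = measured_retention(tally)
    if measured is None:
        return None
    return measured - desired


if __name__ == "__main__":
    month = ReviewTally(passed=850, failed=150)
    print(measured_retention(month))   # pass rate for the period
    print(retention_gap(0.90, month))  # distance from the 90% target
```

A persistently negative gap alongside a high daily review count is the churn pattern described above.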
JPDB, a Japanese vocabulary and SRS tool, exposes equivalent statistics for Japanese specifically. It also computes vocabulary coverage for a given title: the share of unique words in a chosen anime, novel, or game that a learner already knows.910 That lets "deck size" be read as comprehension of real material rather than a bare card count. The tool's mechanics belong in its own article, Beyond Anki: SRS Tools and Approaches Compared.
Mock-test scores
The JLPT reports scores on a 0–180 total scale. N1–N3 have three scoring sections of 0–60 each. N4–N5 have two: a combined Language Knowledge and Reading section scored 0–120, plus Listening scored 0–60.2
Passing requires both the overall pass mark and every sectional minimum. The official rule states a candidate fails "if there is even one scoring section where the score is below the sectional pass mark... no matter how high the total score."2
| Level | Overall pass | Sectional minimums |
|---|---|---|
| N5 | 80 / 180 | 38/120 combined, 19/60 Listening |
| N4 | 90 / 180 | 38/120 combined, 19/60 Listening |
| N3 | 95 / 180 | 19/60 per section |
| N2 | 90 / 180 | 19/60 per section |
| N1 | 100 / 180 | 19/60 per section |
Pass marks and sectional minimums above are the official figures.311
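The two-part rule (overall mark plus every sectional minimum) is easy to misapply when only the total is recorded. The sketch below encodes the table above and checks a mock result against both conditions. The section keys and the function name `evaluate_mock` are hypothetical labels chosen for illustration, not official terminology.

```python
# Pass marks and sectional minimums copied from the table above.
PASS_RULES = {
    "N5": {"overall": 80, "sections": {"language_reading": 38, "listening": 19}},
    "N4": {"overall": 90, "sections": {"language_reading": 38, "listening": 19}},
    "N3": {"overall": 95, "sections": {"language": 19, "reading": 19, "listening": 19}},
    "N2": {"overall": 90, "sections": {"language": 19, "reading": 19, "listening": 19}},
    "N1": {"overall": 100, "sections": {"language": 19, "reading": 19, "listening": 19}},
}


def evaluate_mock(level, scores):
    """Return (passed, total, weak_sections) for one mock result.

    `scores` maps each section key for the level to its scaled score.
    A result passes only if the total meets the overall mark AND
    no section is below its minimum.
    """
    rule = PASS_RULES[level]
    missing = set(rule["sections"]) - set(scores)
    if missing:
        raise ValueError(f"missing section scores: {sorted(missing)}")
    total = sum(scores[name] for name in rule["sections"])
    weak = sorted(
        name for name, minimum in rule["sections"].items() if scores[name] < minimum
    )
    passed = total >= rule["overall"] and not weak
    return passed, total, weak


if __name__ == "__main__":
    # A total of 110 clears the N2 overall mark, but Listening is under 19.
    print(evaluate_mock("N2", {"language": 50, "reading": 43, "listening": 17}))
```

Logging the returned weak-section list next to the total keeps the disqualifying section visible in the record.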
The JLPT applies Item Response Theory scaling, a statistical method for adjusting scores across test versions, so that scaled scores stay comparable across sessions of differing difficulty.2 That comparability is what makes a periodically retaken, same-format mock score useful over time, rather than just a raw correct-count.
A mock test is the strongest lagging indicator because it is timed, sectioned, and scored against fixed criteria. You need not sit the official exam to use its format as a yardstick. J-Compass suggests retaking the same format every 2–3 months, under the conditions described in How to Take a JLPT Mock Test Properly.
Reading speed
Report reading speed as characters per minute on level-appropriate text, sampled periodically. The number to watch is your own trend over time, not an absolute benchmark or target.
At constant text difficulty, characters per minute is a clean proxy for decoding automaticity: the sub-skill that makes word and sentence recognition faster, freeing working memory for comprehension. Rising speed on matched-difficulty text indicates the decode step is getting easier.
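A periodic sample reduces to two numbers, characters read and seconds elapsed. The sketch below shows one way to turn them into characters per minute and a relative trend. The function names are hypothetical, and the counting convention (every non-whitespace character counts, punctuation included) is an assumption; what matters is keeping the same convention across samples.

```python
def count_characters(text):
    """Count non-whitespace characters (assumed convention; punctuation included)."""
    return sum(1 for ch in text if not ch.isspace())


def chars_per_minute(char_count, seconds):
    """Convert one timed reading sample into characters per minute."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return char_count * 60 / seconds


def relative_change(samples):
    """Fractional change from the oldest to the newest cpm sample.

    `samples` is ordered oldest first. Returns None when there are fewer
    than two samples, because a single reading has no trend.
    """
    if len(samples) < 2:
        return None
    return (samples[-1] - samples[0]) / samples[0]


if __name__ == "__main__":
    january = chars_per_minute(300, 90)  # 300 characters read in 90 seconds
    april = chars_per_minute(330, 90)    # matched-difficulty text, same duration
    print(january, april, relative_change([january, april]))
```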
Absolute speed figures by level, and the method for sampling them, belong to the canonical reading-speed article, Japanese Reading Speed Milestones: cpm by Level. This page cites only the practice of sampling at constant difficulty. It deliberately avoids any fabricated benchmark.
Comprehension and lookup rate
JPDB's coverage calculator quantifies comprehension for a chosen work as the percentage of unique words a learner already knows. That gives an objective stand-in for "how much of this can I follow."910
At constant difficulty, a falling words-looked-up-per-page count, or pauses-per-scene rate, is a real gain signal. It separates improvement from the confound of simply reading easier material.
Pair the behavioral count (lookups per page) with JPDB coverage, its tool-backed cousin. One is what you observe while reading. The other is what the tool computes for the title.
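The two readings in this section can be sketched as simple ratios. This is an illustration of the definitions given above, not JPDB's implementation or API: the function names are hypothetical, and the word list for a title is assumed to be already segmented into words.

```python
def unique_word_coverage(title_words, known_words):
    """Share of a title's unique words that appear in the learner's known set.

    Returns None for an empty title, since there is nothing to cover.
    """
    unique = set(title_words)
    if not unique:
        return None
    return len(unique & set(known_words)) / len(unique)


def lookups_per_page(lookups, pages):
    """Behavioral count: dictionary lookups divided by pages read."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return lookups / pages


if __name__ == "__main__":
    # Placeholder tokens stand in for segmented words of a title.
    title = ["w1", "w2", "w2", "w3", "w4"]
    known = {"w1", "w2", "w9"}
    print(unique_word_coverage(title, known))  # 2 of 4 unique words known
    print(lookups_per_page(12, 8))
```

Repeated words count once in the coverage figure, which is why it can differ noticeably from how a page feels while reading.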
Conversation comfort
The JLPT "Can-do Self-Evaluation List" was built from a self-evaluation survey of roughly 65,000 examinees between September 2010 and December 2011. Examinees rated about 30 activity statements per skill across Listening, Speaking, Reading, and Writing.12 The official site says this captures "what successful JLPT examinees of each level think they can do," and that it is neither a syllabus nor a guarantee of proficiency.12
That sets a precedent: a self-rating can be useful, and even officially modeled, when it is anchored to concrete can-do behaviors rather than mood. The honest version asks whether you can follow a phone call or sustain a ten-minute exchange, not whether you feel fluent.
The metrics worth self-rating here are latency, English-fallback frequency, and a monthly recorded speaking sample. Each is anchored to a behavior you can observe again, which keeps it separate from the vibes trap below.
What NOT to Measure
Vibes and "do I feel fluent today"
Self-reported feeling is mood-dependent and not comparable across days. In measurement terms, it lacks a stable referent: there is no fixed thing being measured. The JLPT explicitly separates verifiable competence, the sectioned and criterion-scored exam, from self-impression, which it treats as supplementary only.212
The fix is not to stop reflecting. It is to anchor the reflection to a behavior: a recorded sample, a lookup rate, or a mock section score, each of which has a fixed referent you can re-measure.
Percent of native / percent fluent
"Percent of native" has no defined denominator. There is no single quantity called "total native knowledge" against which a learner can be measured as a fraction, so the figure has nothing to divide by.
The official competence references describe proficiency qualitatively, by skill and by can-do activity, not as a percentage of an idealized native total.412 That apparatus offers criterion bands and can-do statements instead of a percent-of-native number, which shows that the field measures proficiency against criteria rather than as a fraction.
Raw streak length and raw review counts
A long streak or a high reviews-per-day count can coexist with low true retention. Anki separates review-volume statistics from retention statistics because volume does not imply durable recall.76
A streak measures habit adherence: whether you showed up. That answers a different question from acquisition: whether knowledge stuck. The two are complementary. The error is substituting the habit metric for the progress metric.
Comparison to other learners
Hour-to-level data show that the same JLPT level can require very different hour totals depending on prior kanji background, roughly 30–50% fewer hours for several levels for kanji-background learners.5 Two learners at the same calendar point can therefore be at very different places for reasons unrelated to effort quality.
Different first languages, hour budgets, and goals make cross-learner comparison invalid in the literal measurement sense, because there is no shared scale. Compare yourself to your own prior samples instead, where the scale is fixed.
How to Turn a Metric Into a Plan Change
Read the leading indicators weekly, the lagging ones quarterly
In the standard definition, leading indicators are inputs you can directly control and that change early. Lagging indicators are outputs that are easy to measure but slow to move, and they confirm results after the fact.13
J-Compass maps the metrics onto that split as follows: hours invested, deck load, and lookup rate are leading and fast, so read them weekly. Mock-test scores and reading speed are lagging and slow, so read them quarterly. The cadence follows directly from which side of the split each metric sits on.13 The assignment and the cadence are J-Compass methodology, not part of the cited definition.
The split is easier to remember as a picture than as a list.
Decision table: metric stalls → what to change
A reading is only useful if it names a lever. Each row below pairs a stalled signal with the one input to adjust. The numeric thresholds and the method live in each lever's own article.
| If this stalls | Change this lever |
|---|---|
| Mock Listening section lags the other sections | Shift hours toward listening input |
| True retention drops while reviews climb | Cut daily new cards |
| Lookup rate stays high at fixed difficulty | Drop input difficulty a step |
| Reading speed flat at constant difficulty | Increase volume of matched-difficulty reading |
This section provides the routing logic: which lever a given stall points to. It does not re-derive each lever from scratch.
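The routing logic in the table can be kept next to a study log as a plain lookup. The sketch below is one possible encoding: the signal keys and the function name `levers_for` are hypothetical, and deciding that a signal has actually stalled is left to the reader and to each lever's own article.

```python
# Stalled signal -> lever, mirroring the decision table above.
# The dictionary keys are hypothetical labels for the table rows.
LEVERS = {
    "mock_listening_lags": "Shift hours toward listening input",
    "retention_drops_reviews_climb": "Cut daily new cards",
    "lookup_rate_high": "Drop input difficulty a step",
    "reading_speed_flat": "Increase volume of matched-difficulty reading",
}


def levers_for(stalled_signals):
    """Return the lever for each stalled signal, in the order given."""
    unknown = [s for s in stalled_signals if s not in LEVERS]
    if unknown:
        raise KeyError(f"no lever defined for: {unknown}")
    return [LEVERS[s] for s in stalled_signals]


if __name__ == "__main__":
    print(levers_for(["lookup_rate_high", "reading_speed_flat"]))
```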
When to stop tracking
Repeated self-measurement is itself a behavioral intervention. Self-monitoring produces reactivity: the act of measuring changes the measured behavior, a phenomenon closely related to the Hawthorne effect. It is documented, not folklore.14
Metrics also degrade when over-optimized. Goodhart's law says that "when a measure becomes a target, it ceases to be a good measure." The effect has been observed empirically, for example in the gaming of publication counts once they were pushed as targets.151613
Because measuring has a time cost and can distort behavior, simplify or pause the tracking if logging starts crowding out studying. The tracker is instrumentation, not the workout.
Good to know
Treating desired retention as if it were measured retention
A common slip is to say "my retention is 90% because I set FSRS desired retention to 90%." Desired retention is only the target the scheduler aims at. Your measured true retention is the separate figure you read from Anki's statistics.76
The accurate statement keeps the two apart: desired retention is set to 90%, and measured true retention this month is whatever the stats report. If you conflate them, you hide whether the scheduler is actually hitting its target.76
Reporting a total mock score while ignoring a failing section
Saying "I scored 110/180 on the N2 mock, so I'd pass" is wrong whenever a single section sits below its minimum. A 110/180 total with a Listening section of 17/60 is a fail, because 17 is under the 19 minimum.23
The JLPT fails any candidate with one section below its sectional minimum regardless of total, so a healthy-looking total can mask a disqualifying weak skill.23 Read the section scores, not just the headline number.
Using "% of native" as a progress number
"I'm about 40% of a native speaker now" is not a measurement, because there is no defined "total native knowledge" denominator for the percentage to move against.412 The number cannot change in a way you can interpret, so it cannot tell you what to do next.
The actionable replacement is anchored to criteria: which can-do items you can now handle on the N3 list, and your mock N3 Listening score out of 60. The official apparatus uses exactly these criterion bands and can-do statements in place of a percent figure.412
Track the input, not just the output
You can only adjust levers you log. Hours and deck load are the controllable inputs, or leading indicators. Scores and reading speed are the outputs, or lagging indicators.13
This rule compresses the leading-and-lagging split into something you can recall while planning. If a number is not something you can directly turn up or down, it is an output to confirm, not a lever to pull.
Sample, don't surveil
A periodic reading-speed or speaking sample captures the trend at low cost. Continuous self-monitoring, by contrast, creates measurement reactivity and time overhead.14
Sampling keeps the audit trail without the trap. Check the trend on a schedule rather than watching the gauge constantly.
Where "Goodhart's law" actually comes from
The popular wording, "when a measure becomes a target, it ceases to be a good measure," is not economist Charles Goodhart's own 1975 sentence. His point was about statistical regularities collapsing under control pressure. The familiar wording is anthropologist Marilyn Strathern's 1997 generalization, later popularized across many fields.1516
Knowing the provenance keeps a learner from over-reading the law as a hard economic theorem. It is better understood as a robust empirical caution. It is a reason to pick metrics you cannot easily game, not a law of nature.
See also
- Auditing Your Japanese Study Time: A 7-Day Protocol to Find Where the Hours Go
- Why You Can Read Japanese But Can't Speak It: Closing the Output Gap
- How Many New Anki Cards Per Day: Computing Your Sustainable Ceiling
- How Long Does It Take to Learn Japanese? Setting Realistic Goals and the One-Year Trap
- Japanese Reading Speed Milestones: cpm by Level