Word Frequency in Japanese: Why the First 1,000 Words Cover ~80%
Japanese word frequency is sharply uneven. The most common ~1,000 words are widely said to cover roughly 80% of everyday speech, and somewhat less in written text and news (closer to ~70–76%).1 That "coverage" figure counts words on the page, not how much you understand. This lopsided pattern has a name: Zipf's law.2
Overview
This article separates two numbers that learner resources often blur. The first is coverage: the share of running words in a text or conversation that you would recognize. The second is comprehension: whether you actually understand what is being said.3,4
Solid Japanese token-coverage tables with clear methods are scarce. For that reason, the exact percentages here are presented as ranges with their sources named, not as hard facts.1 The shape of the curve, however, is robust. It is what turns "which words first" into a study-time decision.
The "top 1,000 words ≈ 80%" figure is repeated across learner resources. However, it is hard to find a peer-reviewed, corpus-cited Japanese token-coverage table that states it with a clear method. The one verifiable Japanese token analysis gives ~76% at 1,000 words for fiction.1 Treat 80% as a reasonable ballpark for speech, not a measured statistic.
What "coverage" actually means
"Text coverage" is "the percentage of running words in the text known by the readers."3 Here, a running word means each word as it appears in the text. The unit being counted is the running word, not the dictionary headword. Everything that follows depends on that distinction.
Tokens vs. types vs. word families
Three counting units can sit behind the word "word," and they produce very different totals.
| Unit | Definition | Example |
|---|---|---|
| Token | Each running word as it appears, every occurrence counted | "the cat sat" = 3 tokens |
| Type | Each distinct word form | "the cat sat on the mat" = 5 types ("the" counted once) |
| Word family | A base word plus its inflections and transparent derivations | nation, national, nationally, nationalism = 1 family5 |
Coverage is measured in tokens, because what matters is how much of the actual running text you recognize.3 Vocabulary-size targets, by contrast, are usually counted in word families or lemmas.5
The counting unit drives the apparent list size: "Different ways of counting lexical items will lead to vastly different results."6 The same coverage target can require a very different "number of words" depending on whether you count tokens, types, lemmas, or families.
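The token/type distinction in the table can be checked with a few lines of Python. This is a minimal sketch: the helper name `count_tokens_and_types` is hypothetical, and whitespace splitting is an assumption that only works for space-delimited languages such as English. Japanese text would first need a morphological analyzer to segment it into words.

```python
from collections import Counter


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (tokens, types) for lowercased, whitespace-separated text."""
    tokens = text.lower().split()      # every occurrence counts as a token
    types = Counter(tokens)            # each distinct form counts once
    return len(tokens), len(types)


tokens, types = count_tokens_and_types("the cat sat on the mat")
print(tokens, types)  # 6 5
```

The same sentence yields 6 tokens but 5 types, which is why a coverage percentage (tokens) and a vocabulary size (types, lemmas, or families) cannot be compared directly.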
The word-family construct was developed for English morphology, meaning the way English words change form.5 Japanese frequency references such as the BCCWJ-based dictionary count by lemma or short-unit word rather than English-style families.7 When a Japanese source and an English source both say "5,000 words," they are not necessarily counting the same thing.
Coverage is not comprehension
80% token coverage means roughly one running word in five is unknown.4 That may sound manageable until you see what it does to understanding.
In Hu and Nation's study, the link was steep: "With a text coverage of 80% ... no one gained adequate comprehension. With a text coverage of 90%, a small minority gained adequate comprehension. With a text coverage of 95% ... a few more gained adequate comprehension, but they were still a small minority. At 100% coverage, most gained adequate comprehension."4
High token coverage therefore does not equal understanding. The unknown words tend to be the content-bearing ones, which is why the gap between 80% coverage and actual comprehension is so wide.
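The "one word in five" arithmetic generalizes to any coverage level. The sketch below assumes nothing beyond the definition of token coverage. The function name `words_per_unknown` is hypothetical.

```python
def words_per_unknown(coverage_pct: float) -> float:
    """Average number of running words per one unknown word."""
    if not 0 <= coverage_pct < 100:
        raise ValueError("coverage_pct must be at least 0 and below 100")
    return 100 / (100 - coverage_pct)


for pct in (80, 90, 95, 98):
    print(f"{pct}% coverage: about 1 unknown word in {words_per_unknown(pct):.0f}")
```

Moving from 80% to 95% coverage cuts the density of unknown words from one in five to one in twenty, which is a fourfold change hidden behind a 15-point difference.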
Zipf's law: why a few words do so much work
The reason a small vocabulary covers so much text is not luck. It is a statistical regularity that holds across languages.
The rank-frequency rule
Zipf's law states that, in a corpus, a word's frequency is approximately inversely proportional to its frequency rank: f ∝ 1/rank.2 In plain terms, the most frequent word occurs about twice as often as the second, three times as often as the third, and so on down the list.2
Zipf attributed the pattern to a "principle of least effort": speakers economize by reusing a small set of high-frequency words rather than creating new ones.2 The pattern is cross-linguistic and appears in Japanese. Still, Japanese has mixed script, many homophones, and complex morphology, so it deviates from the clean power law at the extreme tails more than purely phonographic languages do.2
Why this produces a steep then flat curve
Because frequency falls off as roughly 1/rank, the first words you learn each buy a large slice of coverage. Each later word buys progressively less. This is the mechanical origin of both the long tail and diminishing returns.2
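The steep-then-flat shape falls straight out of the 1/rank rule. The sketch below computes cumulative coverage under an idealized Zipf distribution. The vocabulary size of 100,000 is a purely hypothetical assumption, and real corpora deviate from the ideal law, so the printed percentages illustrate the shape only and are not estimates for Japanese.

```python
def zipf_coverage(top_n: int, vocab_size: int) -> float:
    """Share of tokens covered by the top_n ranks under an ideal 1/rank law."""
    if not 1 <= top_n <= vocab_size:
        raise ValueError("top_n must be between 1 and vocab_size")
    weights = [1 / rank for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_n]) / sum(weights)


VOCAB_SIZE = 100_000  # hypothetical total number of distinct words

previous = 0.0
for n in (1_000, 2_000, 3_000, 4_000, 5_000):
    current = zipf_coverage(n, VOCAB_SIZE)
    # Gain contributed by the most recent block of 1,000 ranks.
    print(f"top {n:>5}: {current:6.1%}  (+{current - previous:.1%})")
    previous = current
```

Each successive block of 1,000 ranks adds less than the one before it, because the weights being added shrink as 1/rank.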
A real-text analysis shows the shape clearly. In D. H. Lawrence's Lady Chatterley's Lover (about 121,000 tokens), the first 1,000 most frequent word families covered 80.88% of running words. The second 1,000 added only 7.21 points, and the third 1,000 added 3.14. By the sixth 1,000, each additional thousand families bought only a tiny increase.3
That is an English novel, used only to show the steep-then-flat shape. The same curve drives the Japanese figures below.
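The per-band gains quoted above for the novel can be turned into a running total. This sketch only does arithmetic on the three figures already cited. No other bands are filled in.

```python
from itertools import accumulate

# Coverage points added by the 1st, 2nd, and 3rd 1,000 word families (cited above).
band_gains = [80.88, 7.21, 3.14]

cumulative = list(accumulate(band_gains))
for band, (gain, total) in enumerate(zip(band_gains, cumulative), start=1):
    print(f"families up to {band * 1000}: +{gain:.2f} points, cumulative {total:.2f}%")
```

By that arithmetic the first 3,000 families reach 91.23% of running words, and more than 80 of those points come from the first thousand alone.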
The Japanese coverage curve
Here are the concrete numbers, by register, with each source named and each soft figure flagged as soft.
The curve by register: spoken, written, news
The single verifiable Japanese-token analysis available is Mike Kamermans' corpus derived from over 65 million words of fiction. In that corpus, the top 500 words cover ~70.3% of running words, the top 1,000 ~76.2%, and the top 10,000 ~94.1%.1 These are written fiction figures.
The widely repeated "~80% at 1,000 words" applies to everyday speech. Speech tends to run higher than written text at the same rank because conversation leans harder on a small core of function words. No primary Japanese table with a clear method could be found for the spoken figure, so it belongs here as an order-of-magnitude estimate consistent with the Zipfian shape, not a measured number.1
| Vocabulary size | Spoken (everyday) | Written / fiction | Source quality |
|---|---|---|---|
| Top ~1,000 | ~80% (commonly repeated, unverified) | ~76% (Kamermans fiction corpus)1 | written: secondary; spoken: unsourced |
| Top ~5,000 | ~95% (commonly repeated) | rising toward ~90%+ | unverified for Japanese |
| Top ~10,000 | ~98–99% (commonly repeated) | ~94% (Kamermans fiction corpus)1 | written: secondary; spoken: unsourced |
The only solid anchor in this table is the Kamermans fiction curve.1 The 95%-at-5,000 and 99%-at-10,000 figures for Japanese are commonly repeated but unverified against a corpus study. The closest real number, ~94% at 10,000 words for fiction, is below the repeated 99%. The figures vary with the corpus and the counting unit.
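Taking the three Kamermans fiction figures as given, the return per word studied can be computed for each stretch of the list. This is a rough sketch: it assumes coverage rises evenly within each stretch, which is a simplification, and the function name `gain_per_100_words` is hypothetical.

```python
# (vocabulary size, cumulative token coverage in %) from the fiction corpus cited above.
anchors = [(500, 70.3), (1_000, 76.2), (10_000, 94.1)]


def gain_per_100_words(points):
    """Average coverage points gained per 100 words within each stretch."""
    rows = []
    prev_size, prev_cov = 0, 0.0
    for size, cov in points:
        rate = (cov - prev_cov) / (size - prev_size) * 100
        rows.append((prev_size, size, rate))
        prev_size, prev_cov = size, cov
    return rows


for start, end, rate in gain_per_100_words(anchors):
    print(f"words {start + 1:>5} to {end:>6}: {rate:6.2f} points per 100 words")
```

On these figures, 100 words in the first 500 return roughly 14 coverage points, 100 words in the next 500 return about 1.2, and 100 words between ranks 1,001 and 10,000 return about 0.2.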
For calibration only, English analogues sit in the same range. The first 1,000 BNC word families covered 77.86% of the written LOB corpus and 80.88% of the Lady Chatterley's Lover novel.3 Spoken English runs higher at the same rank for the same reason spoken Japanese does.
Where these numbers come from
The authoritative Japanese frequency reference is the Balanced Corpus of Contemporary Written Japanese (BCCWJ), built by the National Institute for Japanese Language and Linguistics (NINJAL). It contains 104.3 million words across genres including general books, magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents. The sampled material spans roughly 1976–2006, with the main body from 1986 to 2006.8 NINJAL publishes the BCCWJ frequency list (語彙表, vocabulary table) free for research and educational use.9
The main learner-facing frequency list is Tono, Yamazaki, and Maekawa's A Frequency Dictionary of Japanese (Routledge, 2013). It is derived from a balanced corpus of about 100 million words and ranks the 5,000 most frequent words from spoken, fiction, non-fiction, and news material.7
A balanced corpus matters because it deliberately mixes registers, so its ranks are not skewed toward one genre. Novel-derived lists over-represent literary vocabulary. Subtitle- or Wikipedia-derived lists over-represent conversational or encyclopedic vocabulary. That is why coverage figures differ across lists, and why any honest figure has to name the corpus behind it.
Nation notes another reason figures vary: coverage measured by counting the words that actually occur in a corpus reads higher, for the same number of words, than coverage measured by applying a frequency-rank list to an independent text.3 The method, not just the corpus, moves the number.
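The two measurement methods can be contrasted on toy data. In this sketch both texts are invented English sentences, chosen only because they split on whitespace, and the helper names `top_words` and `coverage` are hypothetical. It builds a frequency list from one text, then measures coverage on that same text and on an independent one.

```python
from collections import Counter


def top_words(tokens, n):
    """The n most frequent word forms in a token list."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def coverage(tokens, known):
    """Share of running tokens that belong to the known set."""
    return sum(token in known for token in tokens) / len(tokens)


# Toy texts (assumptions for illustration, not corpus data).
source = "the cat sat on the mat and the dog sat on the rug".split()
independent = "the bird flew over the mat and a cat slept".split()

known = top_words(source, 3)  # list built from the source text itself

print(f"same text:        {coverage(source, known):.1%}")
print(f"independent text: {coverage(independent, known):.1%}")
```

The list scores higher on the text it was built from than on the independent text, which is the direction of the effect Nation describes. The size of the gap here is an artifact of the tiny samples.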
The 95% and 98% comprehension thresholds
If 80% coverage leaves a reader stranded, the next question is how much coverage is enough. The research that answers it is English-corpus work. The principle transfers to Japanese even where the exact word counts do not. The 95–98% band that this research identifies is the coverage level to aim for when choosing how easy your input should be.
Hu and Nation (2000) found that 95% coverage is the level at which a minority of readers reach adequate comprehension. A regression model indicated that 98% coverage, one unknown word in fifty, "would be needed for most learners to gain adequate comprehension."4
Nation (2006) translated those thresholds into vocabulary size: "If 98% coverage of a text is needed for unassisted comprehension, then a 8,000 to 9,000 word-family vocabulary is needed for comprehension of written text and a vocabulary of 6,000 to 7,000 for spoken text."3 About 4,000 word families plus proper nouns reach roughly 95% coverage.3 Laufer and Ravenhorst-Kalovski (2010) put the "optimal" threshold near 8,000 families for 98%. They put an acceptable threshold near 4,000–5,000 families for 95%.10
The thresholds also need a hedge. Even at 98% coverage, few L2 learners in some studies gained adequate comprehension. The influential 98% figure rests on a single regression with 66 university students, with mixed replication since.3,4 State the threshold, but do not oversell its precision.
The diminishing-returns argument
The curve is not just a description. It is an instruction for how to spend study time.
Why frequency order beats textbook or random order early on
Because the curve is steep at the top, the earliest words you learn each return the most coverage per word. In the English analyses above, the first 1,000 families covered about 78–81% of running words. The sixth 1,000 added under one percentage point.3 Learning high-frequency words first maximizes comprehension gained per word studied.
The pedagogical implication is stated directly in the literature: high-frequency words deserve first priority, and learners should build a good command of them before chasing rarer vocabulary.3
Textbook order is organized by grammar and theme. Random order ignores frequency entirely. Both spend early effort on lower-coverage words and climb the curve more slowly than frequency order does.
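The gap between frequency order and random order can be simulated under the same idealized 1/rank assumption. Everything numeric here is hypothetical: a 20,000-word vocabulary, 1,000 words studied, and a fixed random seed. Textbook order is not modeled, because it depends on the particular textbook.

```python
import random


def coverage_of(known_ranks, vocab_size):
    """Token coverage of a set of known ranks under an ideal 1/rank law."""
    total = sum(1 / rank for rank in range(1, vocab_size + 1))
    return sum(1 / rank for rank in known_ranks) / total


VOCAB_SIZE = 20_000   # hypothetical number of distinct words
STUDIED = 1_000       # hypothetical number of words learned so far

rng = random.Random(0)  # fixed seed so the run is repeatable
frequency_order = range(1, STUDIED + 1)
random_order = rng.sample(range(1, VOCAB_SIZE + 1), STUDIED)

print(f"frequency order: {coverage_of(frequency_order, VOCAB_SIZE):.1%}")
print(f"random order:    {coverage_of(random_order, VOCAB_SIZE):.1%}")
```

With the same number of words studied, the frequency-ordered learner covers far more running text, because a random draw mostly lands in the long tail where each word carries very little weight.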
Where the curve stops paying and immersion takes over
Past the top few thousand words, each additional 1,000 buys progressively less coverage. The remaining words grow increasingly domain- and topic-specific.3 Nation's own example: words like surf or asteroid dominate a single text only because that text is about surfing or astronomy, not because they are generally frequent.3
Once you have climbed the steep part, grinding a generic frequency list returns little. The efficient source of further vocabulary becomes reading and immersion, harvesting words in context where they actually matter to you.
Good to know
Why your textbook's word count doesn't match the curve
Textbook and JLPT vocabulary targets count headwords or word families, which are dictionary-style units. Coverage figures count running tokens.3,5 "Words to know" and "words of coverage" are different quantities, which is why a textbook's vocabulary count rarely lines up with a coverage percentage. Because families are coarser than lemmas, which are coarser than types, "5,000 words" can mean very different real workloads depending on the unit.6,5
The "1 word in 5" trap in real reading
80% coverage sounds high, but it means one running word in five is unknown. At that density, "no one gained adequate comprehension" in the Hu and Nation experiment.4 The felt difficulty is worse than the number suggests because the unknown words cluster on content words, the nouns and verbs that carry the meaning. The known 80% is disproportionately function words.
The pattern is visible in ordinary sentences. The particles and copula (forms like です / だ, often translated as "is" or "are") below are among the highest-frequency items in any Japanese corpus, so the meaning hinges on a handful of content words.
土曜日は週の最後の日です。11
"Saturday is the last day of the week."
結局は誰でも自分で学ぶしかない。12
"Everyone must learn on their own in the end."
我々は、最善を尽くしさえすればよいのだ。13
"All we have to do is to try our best."
In each sentence, the particles (は, の, を, で, さえ, ば) and the copula (です / だ) are cheap, high-frequency coverage. The load is carried by content words such as 土曜日, 週, 学ぶ, and 最善.
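The split between cheap function words and load-bearing content words can be counted for the first example sentence. This is an illustration only: the segmentation below was done by hand and is an assumption, as is the small `FUNCTION_WORDS` set. A real analysis would segment the text with a morphological analyzer and use a fuller function-word inventory.

```python
# Particles and copula forms named in the text above (a partial, illustrative set).
FUNCTION_WORDS = {"は", "の", "を", "で", "さえ", "ば", "です", "だ"}

# Hand-segmented tokens for: 土曜日は週の最後の日です。
sentence = ["土曜日", "は", "週", "の", "最後", "の", "日", "です"]


def function_share(tokens):
    """Share of running tokens that are in the function-word set."""
    return sum(token in FUNCTION_WORDS for token in tokens) / len(tokens)


content = [token for token in sentence if token not in FUNCTION_WORDS]
print(f"function-word share: {function_share(sentence):.0%}")  # 50%
print("content words:", content)
```

Under this segmentation, half of the running tokens are function words, yet a learner who knows only those four tokens still cannot tell what the sentence is about.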
Particles and function words inflate spoken coverage
A large share of the very top frequency ranks in Japanese are grammatical particles and copula or auxiliary forms (は, が, の, を, に, です, and so on). Spoken language leans on these even more heavily than written language does. That is why spoken coverage at a given rank tends to exceed written coverage.
The English analogue supports the direction. The first 1,000 BNC word families covered about 78% of the written LOB corpus, while spoken corpora run higher because conversation reuses a small core. Nation also notes that very common spoken words like hullo, goodbye, and pal rank low in a written-weighted corpus precisely because speech is underweighted there.3 The same register asymmetry is expected in Japanese, though no Japanese-specific particle-share percentage could be found in a primary corpus study.
Climb the curve in order with a frequency-ordered SRS deck
A spaced-repetition deck that presents words in a sensible frequency-informed order lets a learner work down the steepest part of the curve first, where each word returns the most coverage. For that, J-Compass recommends Amenokori: a curated, level-graded library of more than 10,000 vocabulary and grammar entries spanning N5 through N1. It is level-mapped so a beginner meets common, high-coverage words before rarer ones and climbs the curve in order without curating a word list by hand.14
Scheduling is built around the FSRS (Free Spaced Repetition Scheduler) algorithm. Amenokori describes it as the same scheduler powering modern Anki, calibrated to the learner's own memory, with the mobile app offering seven quiz types across its question bank.14,15 Note that the landing page describes the kanji set, not the whole vocabulary library, as "sorted by frequency,"14 so it is best treated as a level-graded, frequency-informed deck rather than a strictly frequency-ranked word list.
See also
- Passive vs. Active Vocabulary in Japanese: The Two-Speed Problem
- How Many Kanji Do You Need? A Realistic Count
- Should You Learn Kanji in Frequency Order, School Order, or Pedagogical Order?
- Top 50 Kanji Radicals by Frequency: The 70% Coverage List for Jōyō Kanji
- Intensive vs. Extensive Reading in Japanese
- How to Learn Japanese: The Complete Roadmap from Zero to Fluency