Word Frequency in Japanese: Why the First 1,000 Words Cover ~80%

Japanese word frequency is sharply uneven. The most common ~1,000 words are widely said to cover roughly 80% of everyday speech, and somewhat less in written text and news (closer to ~70–76%).¹ That "coverage" figure counts words on the page, not how much you understand. This lopsided pattern has a name: Zipf's law.²

Overview

This article separates two numbers that learner resources often blur. The first is coverage: the share of running words in a text or conversation that you would recognize. The second is comprehension: whether you actually understand what is being said.³⁴

Solid Japanese token-coverage tables with clear methods are scarce. For that reason, the exact percentages here are presented as ranges with their sources named, not as hard facts.¹ The shape of the curve, however, is robust. It is what turns "which words first" into a study-time decision.

The headline number is an order-of-magnitude claim

The "top 1,000 words ≈ 80%" figure is repeated across learner resources. However, it is hard to find a peer-reviewed, corpus-cited Japanese token-coverage table that states it with a clear method. The one verifiable Japanese token analysis gives ~76% at 1,000 words for fiction.¹ Treat 80% as a reasonable ballpark for speech, not a measured statistic.

What "coverage" actually means

"Text coverage" is "the percentage of running words in the text known by the readers."³ Here, a running word means each word as it appears in the text. The unit being counted is the running word, not the dictionary headword. Everything that follows depends on that distinction.

Tokens vs. types vs. word families

Three counting units can sit behind the word "word," and they produce very different totals.

Unit	Definition	Example
Token	Each running word as it appears, every occurrence counted	"the cat sat" = 3 tokens
Type	Each distinct word form	"the cat sat on the mat" = 5 types ("the" counted once)
Word family	A base word plus its inflections and transparent derivations	nation, national, nationally, nationalism = 1 family⁵
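
The three units in the table above can be counted in a few lines of Python. This is a minimal sketch: the whitespace split and the hand-written `family_of` map are illustrative assumptions, not how any cited study segments or groups words.

```python
from collections import Counter

text = "the cat sat on the mat"
tokens = text.split()      # every occurrence counts
types = Counter(tokens)    # distinct word forms, with their counts

print(len(tokens))  # 6 tokens
print(len(types))   # 5 types ("the" occurs twice but is one type)

# Hypothetical family map, written by hand for illustration only.
family_of = {
    "nation": "nation",
    "national": "nation",
    "nationally": "nation",
    "nationalism": "nation",
}
families = {family_of[w] for w in ["nation", "national", "nationally", "nationalism"]}
print(len(families))  # 1 family
```

The same six-token string yields 6, 5, or fewer "words" depending on the unit, which is the whole reason list sizes from different sources are hard to compare.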

Coverage is measured in tokens, because what matters is how much of the actual running text you recognize.³ Vocabulary-size targets, by contrast, are usually counted in word families or lemmas.⁵
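
As a rough sketch of that definition, token coverage is the number of known running words divided by the total number of running words. The function name `token_coverage`, the sample sentence, and the `known_words` set below are all made up for illustration.

```python
def token_coverage(tokens, known_words):
    """Share of running words (tokens) that appear in known_words."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    known = sum(1 for t in tokens if t in known_words)
    return known / len(tokens)

# Hypothetical reader who knows every word except "tanuki".
tokens = "i saw a small tanuki in the garden".split()
known_words = {"i", "saw", "a", "small", "in", "the", "garden"}

print(f"{token_coverage(tokens, known_words):.1%}")  # 87.5% (7 of 8 tokens)
```

Note that the one unknown token is also the word the sentence is about, which anticipates the gap between coverage and comprehension.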

The counting unit drives the apparent list size: "Different ways of counting lexical items will lead to vastly different results."⁶ The same coverage target can require a very different "number of words" depending on whether you count tokens, types, lemmas, or families.

The word-family unit is English-built

The word-family construct was developed for English morphology, meaning the way English words change form.⁵ Japanese frequency references such as the BCCWJ-based dictionary count by lemma or short-unit word rather than English-style families.⁷ When a Japanese source and an English source both say "5,000 words," they are not necessarily counting the same thing.

Coverage is not comprehension

80% token coverage means roughly one running word in five is unknown.⁴ That may sound manageable until you see what it does to understanding.

In Hu and Nation's study, the link was steep: "With a text coverage of 80% ... no one gained adequate comprehension. With a text coverage of 90%, a small minority gained adequate comprehension. With a text coverage of 95% ... a few more gained adequate comprehension, but they were still a small minority. At 100% coverage, most gained adequate comprehension."⁴

High token coverage therefore does not equal understanding. The unknown words tend to be the content-bearing ones, which is why the gap between 80% coverage and actual comprehension is so wide.

Zipf's law: why a few words do so much work

The reason a small vocabulary covers so much text is not luck. It is a statistical regularity that holds across languages.

The rank-frequency rule

Zipf's law states that, in a corpus, a word's frequency is approximately inversely proportional to its frequency rank: f ∝ 1/rank.² In plain terms, the most frequent word occurs about twice as often as the second, three times as often as the third, and so on down the list.²

Zipf attributed the pattern to a "principle of least effort": speakers economize by reusing a small set of high-frequency words rather than creating new ones.² The pattern is cross-linguistic and appears in Japanese. Still, Japanese has mixed script, many homophones, and complex morphology, so it deviates from the clean power law at the extreme tails more than purely phonographic languages do.²

Why this produces a steep then flat curve

Because frequency falls off as roughly 1/rank, the first words you learn each buy a large slice of coverage. Each later word buys progressively less. This is the mechanical origin of both the long tail and diminishing returns.²
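
The steep-then-flat shape can be reproduced with an idealized model. The sketch below assumes frequency is exactly 1/rank and that the language has 100,000 distinct words; both are simplifying assumptions chosen for illustration, not properties of any cited corpus.

```python
from itertools import accumulate

def zipf_coverage(vocab_size, checkpoints):
    """Cumulative coverage of the top-k ranks when frequency is exactly 1/rank."""
    weights = [1.0 / r for r in range(1, vocab_size + 1)]
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    return {k: cumulative[k - 1] / total for k in checkpoints}

VOCAB_SIZE = 100_000  # assumed number of distinct words
checkpoints = [1_000, 2_000, 3_000, 4_000, 5_000, 6_000]
coverage = zipf_coverage(VOCAB_SIZE, checkpoints)

previous = 0.0
for k in checkpoints:
    gain = coverage[k] - previous  # coverage added by this block of 1,000 ranks
    print(f"top {k:>5}: {coverage[k]:6.1%}  (+{gain:.1%})")
    previous = coverage[k]
```

The absolute percentages depend heavily on the assumed vocabulary size and on the exponent being exactly 1, so they should not be compared with the corpus figures in this article. Only the shrinking gain per thousand words is the point.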

A real-text analysis shows the shape clearly. In D. H. Lawrence's Lady Chatterley's Lover (about 121,000 tokens), the first 1,000 most frequent word families covered 80.88% of running words. The second 1,000 added only 7.21 points, and the third 1,000 added 3.14. By the sixth 1,000, each additional thousand families bought only a tiny increase.³
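
Adding up the gains quoted above gives the running total for that novel. This is plain arithmetic on the three figures in the paragraph; nothing else is taken from the study.

```python
from itertools import accumulate

# Coverage added by the 1st, 2nd, and 3rd 1,000 word families (percentage points).
gains = [80.88, 7.21, 3.14]
cumulative = [round(c, 2) for c in accumulate(gains)]

print(cumulative)  # [80.88, 88.09, 91.23]
```

Three thousand families reach about 91%, and two thirds of that last 10-point climb comes from the second thousand alone.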

That is an English novel, used only to show the steep-then-flat shape. The same curve drives the Japanese figures below.

The Japanese coverage curve

Here are the concrete numbers, by register, with each source named and each soft figure flagged as soft.

The curve by register: spoken, written, news

The single verifiable Japanese-token analysis available is Mike Kamermans' corpus derived from over 65 million words of fiction. In that corpus, the top 500 words cover ~70.3% of running words, the top 1,000 ~76.2%, and the top 10,000 ~94.1%.¹ These are written fiction figures.

The widely repeated "~80% at 1,000 words" applies to everyday speech. Speech tends to run higher than written text at the same rank because conversation leans harder on a small core of function words. No primary Japanese table with a clear method could be located for the spoken figure, so it belongs here as an order-of-magnitude estimate consistent with the Zipfian shape, not a measured number.¹

Vocabulary size	Spoken (everyday)	Written / fiction	Source quality
Top ~1,000	~80% (commonly repeated, unverified)	~76% (Kamermans fiction corpus)¹	written: secondary; spoken: unsourced
Top ~5,000	~95% (commonly repeated)	rising toward ~90%+	unverified for Japanese
Top ~10,000	~98–99% (commonly repeated)	~94% (Kamermans fiction corpus)¹	written: secondary; spoken: unsourced

Read these as ranges, not measurements

The only solid anchor in this table is the Kamermans fiction curve.¹ The ~95%-at-5,000 and ~98–99%-at-10,000 figures for Japanese are commonly repeated but have not been verified against a corpus study. The closest measured number, ~94% at 10,000 words for fiction, falls well below the repeated 98–99%, although spoken and fiction registers are not directly comparable. The figures vary with the corpus and the counting unit.

For calibration only, English analogues sit in the same range. The first 1,000 BNC word families covered 77.86% of the written LOB corpus and 80.88% of the Lady Chatterley's Lover novel.³ Spoken English runs higher at the same rank for the same reason spoken Japanese does.

Where these numbers come from

The authoritative Japanese frequency reference is the Balanced Corpus of Contemporary Written Japanese (BCCWJ), built by the National Institute for Japanese Language and Linguistics (NINJAL). It contains 104.3 million words across genres including general books, magazines, newspapers, business reports, blogs, internet forums, textbooks, and legal documents. The sampled material spans roughly 1976–2006, with the main body from 1986 to 2006.⁸ NINJAL publishes the BCCWJ frequency list (語彙表, vocabulary table) free for research and educational use.⁹

The main learner-facing frequency list derived from a balanced corpus of about 100 million words is Tono, Yamazaki, and Maekawa's A Frequency Dictionary of Japanese (Routledge, 2013). It ranks the 5,000 most frequent words from spoken, fiction, non-fiction, and news material.⁷

A balanced corpus matters because it deliberately mixes registers, so its ranks are not skewed toward one genre. Novel-derived lists over-represent literary vocabulary. Subtitle- or Wikipedia-derived lists over-represent conversational or encyclopedic vocabulary. That is why coverage figures differ across lists, and why any honest figure has to name the corpus behind it.

Counting words in vs. applying a list out

Nation notes another reason figures vary: coverage measured by counting the words that actually occur in a corpus reads higher, for the same number of words, than coverage measured by applying a frequency-rank list to an independent text.³ The method, not just the corpus, moves the number.
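
The effect is easy to see on a toy scale. In the sketch below, the two "corpora" are invented sentences and the helper names are hypothetical; real studies work with millions of tokens, but the direction of the gap is the same.

```python
from collections import Counter

def top_n_words(tokens, n):
    """The n most frequent word forms in a token list."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def coverage(tokens, word_list):
    """Share of tokens that appear in word_list."""
    return sum(t in word_list for t in tokens) / len(tokens)

# Tiny made-up texts standing in for a source corpus and an independent text.
source = "the cat sat on the mat and the dog sat on the rug".split()
independent = "the bird sat on a branch and the cat watched the bird".split()

top3 = top_n_words(source, 3)                # list built from the source corpus
in_sample = coverage(source, top3)           # counting words "in"
out_of_sample = coverage(independent, top3)  # applying the list "out"

print(f"in-sample:     {in_sample:.1%}")      # 61.5%
print(f"out-of-sample: {out_of_sample:.1%}")  # 41.7%
```

A list always fits the text it was built from better than it fits new text, so a coverage figure should say which of the two it is.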

The 95% and 98% comprehension thresholds

If 80% coverage leaves a reader stranded, the next question is how much coverage is enough. The research that answers it is English-corpus work. The principle transfers to Japanese even where the exact word counts do not. The 95-to-98% band that this research identifies is the comprehension threshold to aim for when choosing how easy your input should be.

Hu and Nation (2000) found that 95% coverage is the level at which a minority of readers reach adequate comprehension. A regression model indicated that 98% coverage, one unknown word in fifty, "would be needed for most learners to gain adequate comprehension."⁴
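
Converting coverage percentages into unknown-word density makes the thresholds concrete. The sketch below is simple arithmetic; the 300-word page length is an assumption picked only to give a sense of scale.

```python
def one_unknown_in(coverage_pct):
    """Average number of running words per unknown word at a given coverage."""
    return 100 / (100 - coverage_pct)

PAGE_WORDS = 300  # assumed page length, for scale only

for pct in (80, 90, 95, 98):
    unknown_per_page = PAGE_WORDS * (100 - pct) // 100
    print(
        f"{pct}% coverage: 1 unknown word in {one_unknown_in(pct):.0f}, "
        f"about {unknown_per_page} per {PAGE_WORDS}-word page"
    )
```

On those assumptions, moving from 80% to 98% coverage cuts the unknown words on a page from about 60 to about 6.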

Nation (2006) translated those thresholds into vocabulary size: "If 98% coverage of a text is needed for unassisted comprehension, then a 8,000 to 9,000 word-family vocabulary is needed for comprehension of written text and a vocabulary of 6,000 to 7,000 for spoken text."³ About 4,000 word families plus proper nouns reach roughly 95% coverage.³ Laufer and Ravenhorst-Kalovski (2010) put the "optimal" threshold near 8,000 families for 98%. They put an acceptable threshold near 4,000–5,000 families for 95%.¹⁰

These word-family counts are English, not Japanese targets

The 4,000 / 8,000 word-family figures are derived from English BNC data.³⁴¹⁰ The principle transfers to Japanese: unaided reading needs far more than 80% coverage. The precise counts do not necessarily transfer. Do not treat "8,000 word families" as a proven Japanese reading target.

The thresholds also need a hedge. In some studies, few L2 learners gained adequate comprehension even at 98% coverage. The influential 98% figure rests on a single regression with 66 university students, with mixed replication since.³⁴ State the threshold, but do not oversell its precision.

The diminishing-returns argument

The curve is not just a description. It is an instruction for how to spend study time.

Why frequency order beats textbook or random order early on

Because the curve is steep at the top, the earliest words you learn each return the most coverage per word. In the English analyses above, the first 1,000 families covered about 78–81% of running words. The sixth 1,000 added under one percentage point.³ Learning high-frequency words first maximizes comprehension gained per word studied.

The pedagogical implication is stated directly in the literature: high-frequency words deserve first priority, and learners should build a good command of them before chasing rarer vocabulary.³

Textbook order is organized by grammar and theme. Random order ignores frequency entirely. Both spend early effort on lower-coverage words and climb the curve more slowly than frequency order does.

Where the curve stops paying and immersion takes over

Past the top few thousand words, each additional 1,000 buys progressively less coverage. The remaining words grow increasingly domain- and topic-specific.³ Nation's own example: words like surf or asteroid dominate a single text only because that text is about surfing or astronomy, not because they are generally frequent.³

Once you have climbed the steep part, grinding a generic frequency list returns little. The efficient source of further vocabulary becomes reading and immersion, harvesting words in context where they actually matter to you.

Good to know

Why your textbook's word count doesn't match the curve

Textbook and JLPT vocabulary targets count headwords or word families, which are dictionary-style units. Coverage figures count running tokens.³⁵ "Words to know" and "words of coverage" are different quantities, which is why a textbook's vocabulary count rarely lines up with a coverage percentage. Because families are coarser than lemmas, which are coarser than types, "5,000 words" can mean very different real workloads depending on the unit.⁶⁵

The "1 word in 5" trap in real reading

80% coverage sounds high, but it means one running word in five is unknown. At that density, "no one gained adequate comprehension" in the Hu and Nation experiment.⁴ The felt difficulty is worse than the number suggests because the unknown words cluster on content words, the nouns and verbs that carry the meaning. The known 80% is disproportionately function words.

The pattern is visible in ordinary sentences. The particles and the copula (forms like です / だ, often translated as "is" or "are") in the sentences below are among the highest-frequency items in any Japanese corpus, so the meaning hinges on a handful of content words.

土曜日(どようび)は週(しゅう)の最後(さいご)の日(ひ)です。¹¹
"Saturday is the last day of the week."

結局(けっきょく)は誰(だれ)でも自分(じぶん)で学(まな)ぶしかない。¹²
"Everyone must learn on their own in the end."

我々(われわれ)は、最善(さいぜん)を尽(つ)くしさえすればよいのだ。¹³
"All we have to do is to try our best."

In each sentence, the particles (は, の, を, で, さえ, ば) and the copula (です / だ) are cheap, high-frequency coverage. The load is carried by content words such as 土曜日, 週, 学ぶ, and 最善.
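
The first example sentence shows the split in numbers. The segmentation below was done by hand for illustration; a real pipeline would use a morphological analyzer, and the choice of what counts as a function word here is an assumption.

```python
# Hand-segmented tokens of the first example sentence (illustrative only).
tokens = ["土曜日", "は", "週", "の", "最後", "の", "日", "です"]
function_words = {"は", "の", "です"}  # particles and copula in this sentence

function_count = sum(t in function_words for t in tokens)
content_words = [t for t in tokens if t not in function_words]

print(f"{function_count}/{len(tokens)} tokens are function words")  # 4/8
print(content_words)  # ['土曜日', '週', '最後', '日']
```

A reader who knows only the three function words already has 50% token coverage of this sentence and still cannot tell what it is about.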

Particles and function words inflate spoken coverage

A large share of the very top frequency ranks in Japanese are grammatical particles and copula or auxiliary forms (は, が, の, を, に, です, and so on). Spoken language leans on these even more heavily than written language does. That is why spoken coverage at a given rank tends to exceed written coverage.

The English analogue supports the direction. The first 1,000 BNC word families covered about 78% of the written LOB corpus, while spoken corpora run higher because conversation reuses a small core. Nation also notes that very common spoken words like hullo, goodbye, and pal rank low in a written-weighted corpus precisely because speech is underweighted there.³ The same register asymmetry is expected in Japanese, though no Japanese-specific particle-share percentage could be located in a primary corpus study.

Climb the curve in order with a frequency-ordered SRS deck

A spaced-repetition deck that presents words in a sensible frequency-informed order lets a learner work down the steepest part of the curve first, where each word returns the most coverage. For that, J-Compass recommends Amenokori: a curated, level-graded library of more than 10,000 vocabulary and grammar entries spanning N5 through N1. It is level-mapped so a beginner meets common, high-coverage words before rarer ones and climbs the curve in order without curating a word list by hand.¹⁴

Scheduling is built around the FSRS (Free Spaced Repetition Scheduler) algorithm. Amenokori describes it as the same scheduler powering modern Anki, calibrated to the learner's own memory, with the mobile app offering seven quiz types across its question bank.¹⁴¹⁵ Note that the landing page describes the kanji set, not the whole vocabulary library, as "sorted by frequency,"¹⁴ so it is best treated as a level-graded, frequency-informed deck rather than a strictly frequency-ranked word list.

References

1. Kamermans, Mike ("Pomax"). Japanese word-frequency analysis derived from a novel corpus of over 65 million words. Nihongoresources.com. (Figures reproduced and discussed at offbeatband.com, "The Most Commonly Used Japanese Words by Frequency.") http://www.offbeatband.com/2010/12/the-most-commonly-used-japanese-words-by-frequency/
2. Zipf, George Kingsley. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Cambridge, Mass.: Addison-Wesley Press, 1949.
3. Nation, I. S. P. "How Large a Vocabulary Is Needed for Reading and Listening?" The Canadian Modern Language Review, vol. 63, no. 1, 2006, pp. 59–82.
4. Hu, Marcella, and I. S. P. Nation. "Unknown Vocabulary Density and Reading Comprehension." Reading in a Foreign Language, vol. 13, no. 1, 2000, pp. 403–430.
5. Bauer, Laurie, and I. S. P. Nation. "Word Families." International Journal of Lexicography, vol. 6, no. 4, 1993, pp. 253–279.
6. Schmitt, Norbert, Xiangying Jiang, and William Grabe. "The Percentage of Words Known in a Text and Reading Comprehension." The Modern Language Journal, vol. 95, no. 1, 2011, pp. 26–43.
7. Tono, Yukio, Makoto Yamazaki, and Kikuo Maekawa. A Frequency Dictionary of Japanese. Routledge Frequency Dictionaries. London and New York: Routledge, 2013.
8. 国立国語研究所 (National Institute for Japanese Language and Linguistics, NINJAL). 『現代日本語書き言葉均衡コーパス』(Balanced Corpus of Contemporary Written Japanese, BCCWJ). https://clrd.ninjal.ac.jp/bccwj/en/
9. NINJAL. BCCWJ Word List (frequency list / 語彙表). https://clrd.ninjal.ac.jp/bccwj/en/freq-list.html
10. Laufer, Batia, and Geke C. Ravenhorst-Kalovski. "Lexical Threshold Revisited: Lexical Text Coverage, Learners' Vocabulary Size and Reading Comprehension." Reading in a Foreign Language, vol. 22, no. 1, 2010, pp. 15–30.
11. Tatoeba Project. Sentence #124456. CC BY 2.0 FR. https://tatoeba.org/en/sentences/show/124456
12. Tatoeba Project. Sentence #75139. CC BY 2.0 FR. https://tatoeba.org/en/sentences/show/75139
13. Tatoeba Project. Sentence #186183. CC BY 2.0 FR. https://tatoeba.org/en/sentences/show/186183
14. Amenokori. Product landing page. https://amenokori.com/
15. Amenokori. Mobile app page. https://amenokori.com/mobile-app/
