Japanese Pronunciation Drills: A Daily 5-Minute Protocol with Minimal Pairs, Shadowing, and Record-and-Compare

Japanese pronunciation drills are short, daily, targeted repetitions. They train articulation and timing as motor skills, unlike passive listening, which trains only perception.¹ This article sets out a four-stage daily protocol of roughly five minutes. It also explains the self-correction loop and escalation criteria that make the protocol adaptive.

Overview

Why drills, not just exposure

Pronunciation teaching reliably outperforms exposure alone. A meta-analysis of 77 unique samples from 1982 to 2017 found a large overall positive effect of pronunciation instruction on L2 outcomes.¹ An earlier meta-analysis of 86 studies reported similarly large effects favoring instruction over no instruction.²

The effect is not confined to perception. Adult Japanese speakers of English who completed perceptual identification training on /r/ versus /l/ showed measurable improvement in their production of the contrast, rated by listeners blind to condition.³ A follow-up tested the same speakers three months after training and found both perceptual and production gains were retained.⁴

Speech production is a motor skill in the sense that matters here: it relies on practiced articulatory routines whose timing and coordination improve with rehearsal.⁵ For motor skills, spacing practice sessions across days works better than putting the same total time into fewer, longer sessions. This is one of the more robust findings in the skill-learning literature.⁶ ⁷

Daily cadence does more work than session length

Distributed practice (many short sessions across days) reliably beats massed practice (few long sessions) for motor-skill acquisition, and the spacing benefit scales with inter-session interval up to about a day.⁶ ⁷ The protocol's daily, brief shape is built around this finding.

Pronunciation instruction is most effective when it targets monitored production of specific segmental features (individual sounds) or suprasegmental features (patterns such as timing and pitch), rather than generic communicative practice.¹ The four-stage protocol below keeps attention narrow within each stage. It also feeds yesterday's diagnostic into today's drill selection.

What this protocol does not cover

This article does not redescribe segmental phonology, the sound system of Japanese: consonants, vowels, devoicing, the mora-N, geminates, and rendaku. Those topics live in the dedicated atom articles in the pronunciation series.

It also does not redescribe pitch-accent theory or run pitch-accent minimal-pair drills as their own routine. Pitch-accent perception is its own training problem with its own stimulus sets. The protocol below includes pitch in Stage 3 (passage shadowing) and Stage 4 (record-and-compare), rather than as a standalone stage.

Shadowing as a listening-comprehension drill is also out of scope. Shadowing has been used both as a pronunciation drill (the framing here) and as a listening drill, where the gains are on auditory parsing rather than articulation.⁸⁹ A systematic review of L2 shadowing for pronunciation teaching notes the literature has often conflated the two goals; this article treats shadowing strictly as a pronunciation drill.¹⁰

The 5-minute daily protocol

The protocol stacks four short stages in a fixed order. Each stage's time budget is protocol guidance, calibrated against the distributed-practice literature. It is not a research-validated dose. The "5 minutes" floor compresses training paradigms that typically used much longer sessions.³ ⁶ ⁷

The feedback step is the load-bearing piece: Stage 4's diagnostic feeds Stage 1's pair selection for the next session. That is what makes the protocol adaptive rather than rote.
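As a rough sketch, the four stages and their time budgets can be written down as data and checked against the five-minute total. The stage names and durations come from this article's stage headings; the `Stage` class and the helper names are illustrative assumptions, not part of any published tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    name: str
    seconds: int


# Stage names and budgets as given in this article's stage headings.
PROTOCOL = (
    Stage(1, "segmental minimal pairs", 60),
    Stage(2, "mora-timing taps", 60),
    Stage(3, "short-passage shadowing", 120),
    Stage(4, "record-and-compare", 60),
)


def total_seconds(stages=PROTOCOL) -> int:
    """Sum the time budgets of all stages."""
    return sum(stage.seconds for stage in stages)


def schedule(stages=PROTOCOL):
    """Return (stage number, start offset, end offset) tuples in seconds."""
    offsets = []
    start = 0
    for stage in stages:
        offsets.append((stage.number, start, start + stage.seconds))
        start += stage.seconds
    return offsets


if __name__ == "__main__":
    for number, start, end in schedule():
        print(f"Stage {number}: {start:>3}s to {end:>3}s")
    print(f"Total: {total_seconds() // 60} minutes")
```

The offsets are only a pacing aid for a timer; the budgets themselves remain protocol guidance rather than a validated dose.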

Stage 1: segmental minimal pairs (60 seconds)

A minimal pair is a pair of words that differ in exactly one phonological feature. In Japanese, the three feature dimensions that most reliably distinguish minimal pairs are vowel length, consonant gemination (single versus geminate, e.g. /k/ vs /kk/), and pitch accent.⁵

Perceptual training transfers to production. Training native English speakers to perceive the Japanese short-versus-long vowel contrast generalized from training words to new words and to new speaking rates. It also transferred from isolated words to words embedded in sentences.¹¹ Training Japanese listeners on /r/ versus /l/ with stimuli from multiple talkers produced perceptual gains. Those gains generalized to novel items and novel talkers, and they transferred to production without any explicit production instruction.³

The drill pushes back on the perceptual-magnet effect. In this effect, L1 phoneme categories warp the perceptual space around their prototype. That is why an L1-Japanese listener and an L1-English listener can hear the same acoustic token and assign it to different categories.¹²

In practice, pick three to five pairs that target the learner's L1 weak spots. For L1 English speakers, recurring trouble spots are long versus short vowels, geminate versus single consonants, the ら-row, つ, and ふ. The ら-row, つ, ふ trio is teaching-tradition framing, not a single-source claim, and individual learners should let Stage 4 confirm or revise that list. Say each pair twice and rotate the set based on Stage 4's diagnostic.

来(き)て vs 切手(きって)⁵
"come (te-form)" vs "(postage) stamp"

おばさん vs おばあさん⁵
"aunt" vs "grandmother"

病院(びょういん) vs 美容院(びよういん)⁵
"hospital" vs "beauty parlor"

These pairs are everyday vocabulary (N5 to N4 by JLPT band) and recur across textbook series. They are chosen because the contrast carries meaning in routine sentences, not because the vocabulary is difficult.⁵

Stage 2: mora-timing taps (60 seconds)

Japanese is mora-timed in this sense: the mora is the unit of speech segmentation and rhythm that Japanese listeners actively use to parse running speech. Segmentation experiments showed Japanese listeners using moraic units while listeners with other L1s did not.¹³ ¹⁴

Strict acoustic isochrony, meaning every mora having exactly the same duration, is not robustly attested. The mora is a perceptual-cognitive timing unit, and Japanese listeners' use of it is better established than its acoustic uniformity.¹⁵ A moraic geminate consonant (the small っ in きって) and a moraic-N (the ん in さん) each occupy one mora of timing on their own, distinct from the consonant or vowel they neighbor.⁵

Vowel-length contrasts are also durational at the mora level: /i/ is one mora, /ii/ is two, which is why おじさん (4 moras) and おじいさん (5 moras) are heard as distinct words rather than the same word at different speaking rates.⁵

The drill is simple. Tap one beat per mora on the desk (or against the thumb) while saying a short word with at least one of: long vowel, geminate, or mora-N. Tapping makes the mora count external and forces the mouth to follow it.

The five words below cover the geminate, long vowel, mora-N, yō-on, and a longer everyday utterance. Tap counts are from Vance's per-mora segmentation.⁵

Word	Reading	Mora segmentation	Tap count
切手	きって	き / っ / て	3
学校	がっこう	が / っ / こ / う	4
先生	せんせい	せ / ん / せ / い	4
旅行	りょこう	りょ / こ / う	3
ありがとうございました	ありがとうございました	あ / り / が / と / う / ご / ざ / い / ま / し / た	11

A yō-on is one mora, not two

りょ in 旅行 is one mora because the small ょ following り forms a yō-on, a palatalized cluster occupying a single moraic slot. Tapping ょ as a separate beat is the most common over-tap error.⁵

The 11-tap test of ありがとうございました is a useful self-diagnostic. Learners whose L1 is not mora-timed, whether stress-timed (English) or syllable-timed (Spanish), tend to undercount it as five or six beats.¹³ ¹⁴
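The tap counts in this stage follow a simple rule that can be sketched in code: every kana is one mora, except that a small ゃ/ゅ/ょ (or other small vowel kana) merges into the preceding kana. The small っ, the ん, and the long-vowel mark ー each keep their own beat. The function names below are illustrative, and the sketch assumes kana-only input; kanji, punctuation, and spaces are not handled.

```python
import unicodedata

# Small kana that merge with the preceding kana into a single mora.
# The small っ/ッ is deliberately absent: a geminate is its own mora.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_moras(kana: str) -> list[str]:
    """Split a kana-only string into moras (assumes no kanji or punctuation)."""
    # NFC keeps voiced kana such as が as a single code point.
    kana = unicodedata.normalize("NFC", kana)
    moras: list[str] = []
    for ch in kana:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch  # bind the small kana to the previous mora
        else:
            moras.append(ch)  # ordinary kana, っ, ん, and ー each take a beat
    return moras


def tap_count(kana: str) -> int:
    """Number of taps for the Stage 2 drill: one per mora."""
    return len(split_moras(kana))


if __name__ == "__main__":
    for word in ["きって", "がっこう", "せんせい", "りょこう", "ありがとうございました"]:
        print(word, " / ".join(split_moras(word)), tap_count(word))
```

Run on the Stage 1 pairs, the same counter shows why they contrast in timing: きて has two moras against three for きって, and おばさん has four against five for おばあさん.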

Stage 3: short-passage shadowing (120 seconds)

Shadowing is a paced auditory tracking task: you immediately vocalize what you hear.¹⁶ It was imported from interpreter training into Japanese EFL (English as a Foreign Language) classrooms in the early 1990s as a listening drill. It is now also used as a pronunciation drill.⁸⁹

A systematic review of shadowing for L2 pronunciation teaching reports that shadowing training improves phonemic perception, word recognition, and segmental and suprasegmental pronunciation features. The largest benefits appeared in learners who already had basic phoneme-level competence.¹⁰ Hamada's classroom studies of shadowing with Japanese learners of English used 10-to-15-minute sessions, three to four times a week, for six weeks. They found significant gains in phonemic perception across proficiency groups, plus listening-comprehension gains in lower-proficiency learners.⁸ Complete beginners often show smaller or noisier gains.⁹ ¹⁰

Visible articulation is worth seeking out. Hirata and Kelly found that the speaker's lips and face enhanced English learners' perception of Japanese vowel-length contrasts, which is a reason to prefer video sources (or live audio with a visible speaker) over audio-only sources for early shadowing.¹⁷

The drill: one 15-to-30-second native clip, run twice. In Pass 1, read along with the script. In Pass 2, set the script aside and shadow with a half-beat lag.

今日(きょう)は天気(てんき)がいいです。⁵
"The weather is nice today."

駅(えき)まで歩(ある)いて十分(じゅっぷん)くらいかかります。⁵
"It takes about ten minutes on foot to the station."

昨日(きのう)友達(ともだち)と映画(えいが)を見(み)に行(い)きました。⁵
"Yesterday I went to see a movie with a friend."

よろしくお願(ねが)いします。⁵
"Pleased to meet you / I appreciate your help in advance."

すみません、もう一度(いちど)言(い)ってください。⁵
"Excuse me, could you say that again?"

These five passages use register-neutral polite forms (です/ます), the default everyday register and the register used in news and most published listening material. They are the right starting target for Stage 3.⁵ A 15-to-30-second clip at natural news pace will contain roughly two to four utterances of this length, which is the window the protocol assumes.⁸

Stage 4: record-and-compare (60 seconds)

Audio self-recording is a recognized teaching tool for raising learners' awareness of their own pronunciation features. Recordings have been used in phonetics-and-phonology classroom protocols as a self-awareness intervention.¹⁸ Higher awareness of L2 phonology correlates with more accurate L2 pronunciation. Classroom interventions that explicitly raise phonological awareness produce measurable pronunciation gains.¹⁹

Pronunciation instruction is most effective when it targets monitored production of specific segmental or suprasegmental features.¹ Recording and comparing is how a self-study learner generates that monitored-production signal without a teacher.

The drill: record yourself saying the same passage you just shadowed in Stage 3. Play your recording and the native recording back-to-back. Note one specific gap, such as missed vowel length, a short mora count, pitch falling one mora too early, or a devoiced vowel that was not devoiced. Use that gap to select tomorrow's Stage 1 pairs.

One named gap, not a checklist

End each Stage 4 with one explicit observation, such as "the u in です was voiced", "the っ in きって had no hold", or "the pitch on 病院 fell one mora too early". A single named gap drives tomorrow's drill; a vague "needs work" does not.¹

Recommended audio sources

The protocol assumes free or low-cost sources for the clips used in Stages 1 and 3. The four below cover the range. The recommended use case for each follows its description.

Forvo

Forvo describes itself as the largest pronunciation dictionary in the world, with words pronounced by native speakers. It uses a crowdsourced model in which registered users upload audio recordings, and the platform indexes pronunciations across more than 400 languages.²⁰ The Japanese section indexes over 100,000 pronunciations contributed by native-speaker users, organized by word and accessible by search. Entries typically carry multiple recordings per word from different speakers.²¹

Single-word native recordings make Forvo the right fit for Stage 1 (minimal-pair drill stimuli), where the target is one word per recording rather than running speech.

NHK Easy News audio

NHK News Web Easy is the public broadcaster's simplified-Japanese news site, with articles rewritten in easier vocabulary and grammar. Furigana appears above kanji on every article, and an audio playback button reads each article aloud.²² The audio is machine-generated rather than recorded by a human reader.²³

Its 30-to-90-second news clips with synchronized script make NHK Easy News the default Stage 3 source from week two onward, once the learner has a functioning mora-tap in place.

JapanesePod101 audio lessons

JapanesePod101 is a commercial audio-lesson platform from Innovative Language Learning. It delivers Japanese instruction through short audio and video lessons centered on real-world conversations. Lessons include slowed-down audio, line-by-line breakdowns of native dialogue, associated word lists, and voice-recording tools that let learners record themselves and compare against the native track.²⁴

Dialogue-format clips with conversational register and line-by-line pacing fit Stage 3 when news pacing is too fast. The line-by-line audio also supports the transition from Stage 1 to Stage 3.

OJAD Suzuki-kun

OJAD (Online Japanese Accent Dictionary) is hosted by the University of Tokyo. It contains over 9,000 nouns and 3,500 declinable words with audio samples from male and female speakers, and it indexes approximately 42,300 conjugated forms.²⁵ Suzuki-kun is OJAD's prosody-tutor feature: it takes user-supplied text, predicts accent and intonation (including variants), and renders synthesized speech with a visible pitch contour. Users can adjust speech rate and select between voice options.²⁵

The synthesized output is generated, not recorded by a native speaker; the underlying accent prediction is trained on the dictionary data, and the prosody is rendered for arbitrary input sentences.²⁵ When the learner wants to drill a sentence and cannot find a native recording, Suzuki-kun provides a target pitch contour and synthesized read-aloud for comparison with a Stage 4 recording. Guidance on how to interpret Suzuki-kun's pitch contour lives in the dedicated OJAD walkthrough article.

The map is one-to-many on purpose. Choose the source by the stage's stimulus shape, not the other way around.

The record-and-flinch self-correction loop

"Record-and-flinch" is this article's label for the audio-self-recording awareness mechanism. It is not a term of art from the literature. The term names the underlying mechanism: audio self-recording raises phonological awareness, which correlates with pronunciation gains.¹⁸ ¹⁹

Why hearing your own voice feels wrong, and why that is useful

A learner's awareness of their own L2 phonology is the leverage point for improvement; awareness gains correlate with pronunciation gains in classroom studies of phonological awareness interventions.¹⁹

Audio self-recording externalizes the speaker's own output. It breaks the loop in which speakers monitor their own speech using the internal forward model that produced it. The recording is a third-party signal the learner can evaluate as if it were someone else's.¹⁸

This externalization is the diagnostic mechanism. The discomfort of hearing one's own recorded voice is not a bug to be desensitized to. It is the signal that the internal monitoring channel and the external acoustic channel disagree. That disagreement is where the next gap to fix is located.¹⁸

One gap per session, not a checklist

Instructional effectiveness is greatest when it targets specific segmental or suprasegmental features rather than global pronunciation. Protocols that ask learners to attend to everything at once produce weaker outcomes than protocols that isolate one feature at a time.¹

In practice, each Stage 4 ends with one named gap, not a list. "The u in です was voiced" is operational; "my pronunciation needs work" is not.

Feeding tomorrow's stage 1 from today's gap

Perceptual training transfers to production, and the transfer requires training stimuli that include the contrast the learner gets wrong. A learner who confuses long versus short vowels gains from training on long-versus-short pairs. The same learner gains less, or more slowly, from training on ら-row pairs.³

The four-stage protocol is adaptive in this specific sense: the diagnostic from Stage 4 selects the next day's minimal-pair set. The training stimuli therefore match the contrast the learner has been failing on.¹ ³
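The selection step can be sketched as a lookup from the named gap to the pairs that exercise it. The gap labels and the `select_pairs` helper are hypothetical names invented for illustration; the pairs themselves are the Stage 1 examples from this article, written in kana.

```python
from typing import Optional

Pair = tuple[str, str]

# Hypothetical gap labels (not terms from the literature), each mapped to
# the Stage 1 minimal pairs from this article that exercise that contrast.
PAIRS_BY_GAP: dict[str, list[Pair]] = {
    "geminate": [("きて", "きって")],
    "vowel_length": [("おばさん", "おばあさん")],
    "yoon_vs_full_kana": [("びょういん", "びよういん")],
}


def select_pairs(gap: Optional[str], max_pairs: int = 5) -> list[Pair]:
    """Put pairs matching yesterday's gap first, then fill with the rest."""
    targeted = PAIRS_BY_GAP.get(gap, []) if gap else []
    rest = [
        pair
        for label, pairs in PAIRS_BY_GAP.items()
        if label != gap
        for pair in pairs
    ]
    return (targeted + rest)[:max_pairs]


if __name__ == "__main__":
    # Yesterday's Stage 4 note: a long vowel came out short.
    for first, second in select_pairs("vowel_length"):
        print(f"{first} vs {second}")
```

With a larger personal pair list, the same ordering keeps the failing contrast at the front of the 60-second block while the cap of five pairs respects the three-to-five guidance from Stage 1.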

The loop closes on itself. The only fresh input from outside the loop is the audio sources Stages 1 and 3 draw on.

When to escalate to a tutor

Plateau signals

Training-induced gains are not unbounded. Long-term retention studies show retained gains at three months post-training. But the gains plateau within the training window and do not continue scaling with more identical practice.⁴

Pronunciation acquisition for adult learners is not uniformly successful across phonemes. Studies of /r/-/l/ training with adult Japanese speakers report group-mean gains alongside persistent individual variability. Some learners still fail to converge on near-native production after extended training.³⁴

In protocol terms, a specific gap that survives five or more sessions of targeted Stage 1 work, or a sound that has not budged after four weeks, signals that the self-study loop has reached its diagnostic ceiling. The learner cannot diagnose what they cannot perceive. An outside ear is the next step.
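The two plateau thresholds above can be expressed as a small check. This is a minimal sketch: the function name and the practice-log fields it takes are assumptions, and it presumes the gap in question is still audible in the latest Stage 4 recording.

```python
from datetime import date

SESSION_THRESHOLD = 5   # targeted Stage 1 sessions the gap has survived
PLATEAU_DAYS = 4 * 7    # four weeks without movement


def should_escalate(targeted_sessions: int, first_noted: date, today: date) -> bool:
    """Return True when a still-present gap meets either plateau signal."""
    stuck_by_sessions = targeted_sessions >= SESSION_THRESHOLD
    stuck_by_time = (today - first_noted).days >= PLATEAU_DAYS
    return stuck_by_sessions or stuck_by_time


if __name__ == "__main__":
    # Gap first noted on 1 January, drilled in three sessions, checked on 2 February.
    print(should_escalate(3, date(2024, 1, 1), date(2024, 2, 2)))
```

Either condition alone is enough, which matches the "or" in the protocol wording.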

What a diagnostic tutor session looks like

Pronunciation instruction that produces measurable gains is typically explicit and feature-specific. It targets articulation (tongue placement, voicing onset, lip rounding) at the segmental level, or pitch and timing at the suprasegmental level.¹ ²

A tutor session at this stage is for articulation diagnosis, not conversation practice. The learner brings the gap from their plateau. The tutor returns a feature-level observation, such as "the tongue tip is too far back on ら-row" or "the geminate hold is releasing 50 ms early". A single 30-to-45-minute session is the typical shape.

italki, Preply, and Wasabi are commercial tutor marketplaces; university-affiliated Japanese-language centers are an alternative venue. The protocol does not endorse any specific platform.

What this section is not

This section is not a claim that tutors are required for pronunciation. The training literature shows substantial self-administered gains from perceptual identification training alone, without explicit production instruction.³ The protocol is built so most learners reach functional pronunciation without escalating.

It is also not an endorsement of any specific tutor or platform. Venues are named only as known categories of provision.

Good to know

Read the script before you shadow, not during

Shadowing as defined in the L2 literature is immediate vocalization of auditorily presented stimuli, not delayed reading aloud.¹⁶ A learner who reads the script while audio plays is not shadowing. They are sight-reading at the audio's pace, which trains a different and weaker skill.

In protocol terms, the script is for the first listen-and-read pass (Stage 3 pass 1). For pass 2, the actual shadow, set the script aside.

Faster is not better; native pace will come

Training with varied speaking rates produces transfer to new rates, but the training is most efficient when the input is intelligible. Chasing native news pace from week one maximizes acoustic exposure but minimizes per-clip uptake.¹¹ Shadowing studies similarly find that gains are largest when the learner can already track the phoneme stream. Pushing pace above that threshold reduces returns.⁸ ¹⁰

Start Stage 3 at 0.75x playback. Most platforms support this directly, including NHK Easy News and Suzuki-kun. Let the pace ramp on its own over weeks.

Five minutes is a floor, not a ceiling

The "5 minutes" figure is protocol guidance, not a meta-analytic finding. What the literature does support is the underlying shape: distributed practice (many short sessions across days) reliably outperforms massed practice (few long sessions) for motor-skill acquisition.⁶ ⁷ A 5-minute daily routine maps onto the distributed end of that continuum. A single 35-minute weekly session, the same total time, maps onto the massed end.

Motor-skill consolidation depends on spacing across days specifically, not just across hours. Retention-test gains scale with inter-session interval up to roughly a day.⁷ Daily cadence beats session length. Longer sessions are useful, but not at the cost of daily cadence.

Do not stack two drill articles in one session

Pronunciation instruction is most effective when it targets specific features rather than diffusing attention across many. A single five-minute block that tries to drill both segmental contrasts and pitch-accent minimal pairs at once dilutes the per-feature signal.¹

If the learner is also running a pitch-accent minimal-pairs drill, alternate days or run the pitch drill as a second five-minute block at a different time of day.

Drilling すみません and よろしくお願いします without their pragmatic context

Both are register-marked formulae whose pronunciation in fluent native speech is reduced and compressed. For example, すみません is often heard as すいません [sɯimaseɴ] in casual speech. Drilling the canonical citation form is correct for Stage 1 minimal-pair work and Stage 2 mora taps. The reduced forms are appropriate Stage 3 material only when paired with a register-matched clip.⁵

What to do when you cannot speak aloud

Motor-program rehearsal can occur sub-vocally, meaning without audible speech. The motor-learning literature treats silent rehearsal as a related but reduced form of practice. It preserves some of the consolidation benefits of distributed practice without the acoustic feedback.⁷

Silent articulation drills (lip, tongue, and jaw shaping without voicing) are a viable substitute for Stage 1 on a day when the learner cannot speak aloud. They are not a substitute for Stage 4, because Stage 4 requires the acoustic recording.

Miscounting moras when a yō-on is present

The wrong reading taps the small ょ as its own beat (り / ょ / こ / う, four taps for 旅行) or skips it entirely. The correct reading binds りょ into a single mora. That gives three taps (りょ / こ / う):

旅行(りょこう)⁵
"(a) trip"

The small ょ does not occupy its own moraic slot. The りょ cluster is a single mora: a palatalized consonant plus vowel.⁵

Treating the geminate (small っ) as silent rather than as a held mora

The wrong reading collapses きって to "kit-e" with no audible hold on the /t/. This makes きって ("stamp") merge toward きて ("come, te-form"). The correct reading places a measurable closure on the /t/ that occupies one mora of timing. That gives three taps (き / っ / て):

切手(きって)⁵
"(postage) stamp"

The first half of a geminate stop is one mora of silent or constricted articulation. Without the hold, the length contrast is lost.⁵

Mistaking the mora-N for a syllable-final consonant

The wrong reading treats せんせい as "sen-sei", two beats. The correct reading taps four (せ / ん / せ / い):

先生(せんせい)⁵
"teacher"

The ん is moraic and counts as its own beat. It is not a coda consonant attached to the syllable nucleus.⁵

"Tap-don't-listen" for Stage 2

The failure mode in mora-timing drills is saying the word at a natural English-syllable rhythm, then mentally matching the output to the correct mora count after the fact. The tap is the intervention: the hand commits to one beat per mora before the mouth does. That forces the mouth to comply.¹³ ¹⁴

"Flinch tomorrow's pair" for Stage 4

The diagnostic from Stage 4 (the one specific gap) directly selects the next day's Stage 1 pair set. This is the adaptive loop. A flinch, the moment of audible disagreement between your recording and the native, is the diagnostic. The flinch produces tomorrow's pairs.¹ ¹⁸ ¹⁹

References

1. Saito, Kazuya, and Luke Plonsky. "Effects of Second Language Pronunciation Teaching Revisited: A Proposed Measurement Framework and Meta-Analysis." Language Learning 69 (3): 652–708. Wiley. https://onlinelibrary.wiley.com/doi/abs/10.1111/lang.12345
2. Lee, Junkyu, Jeffrey Jang, and Luke Plonsky. "The Effectiveness of Second Language Pronunciation Instruction: A Meta-Analysis." Applied Linguistics 36 (3): 345–366. Oxford University Press. https://academic.oup.com/applij/article-abstract/36/3/345/2422438
3. Bradlow, Ann R., David B. Pisoni, Reiko Akahane-Yamada, and Yoh'ichi Tohkura. "Training Japanese Listeners to Identify English /r/ and /l/: IV. Some Effects of Perceptual Learning on Speech Production." The Journal of the Acoustical Society of America 101 (4): 2299–2310. https://pubs.aip.org/asa/jasa/article/101/4/2299/558452
4. Bradlow, Ann R., Reiko Akahane-Yamada, David B. Pisoni, and Yoh'ichi Tohkura. "Training Japanese Listeners to Identify English /r/ and /l/: Long-Term Retention of Learning in Perception and Production." Attention, Perception, & Psychophysics 61 (5): 977–985. https://link.springer.com/article/10.3758/BF03206911
5. Vance, Timothy J. The Sounds of Japanese. Cambridge University Press, 2008.
6. Donovan, John J., and David J. Radosevich. "A Meta-Analytic Review of the Distribution of Practice Effect: Now You See It, Now You Don't." Journal of Applied Psychology 84 (5): 795–805. https://gwern.net/doc/psychology/spaced-repetition/1999-donovan.pdf
7. Shea, Charles H., Quinn Lai, Charles Black, and Jin-Hoon Park. "Spacing Practice Sessions Across Days Benefits the Learning of Motor Skills." Human Movement Science 19 (5): 737–760. Elsevier. https://www.sciencedirect.com/science/article/abs/pii/S016794570000021X
8. Hamada, Yo. "Shadowing: Who Benefits and How? Uncovering a Booming EFL Teaching Technique for Listening Comprehension." Language Teaching Research 20 (1): 35–52. Sage. https://journals.sagepub.com/doi/abs/10.1177/1362168815597504
9. Hamada, Yo. "Shadowing: What Is It? How to Use It. Where Will It Go?" RELC Journal 50 (3): 386–393. Sage. https://journals.sagepub.com/doi/full/10.1177/0033688218771380
10. Shimizu, Hiroshi. "A Systematic Review of Research on the Use of Shadowing for Second Language Pronunciation Teaching." Taylor & Francis. https://www.tandfonline.com/doi/full/10.1080/29984475.2025.2546827
11. Hirata, Yukari. "Training Native English Speakers to Perceive Japanese Length Contrasts in Word Versus Sentence Contexts." The Journal of the Acoustical Society of America 116 (4): 2384–2394. https://pubmed.ncbi.nlm.nih.gov/15532669/
12. Iverson, Paul, and Patricia K. Kuhl. "Perceptual Magnet and Phoneme Boundary Effects in Speech Perception: Do They Arise from a Common Mechanism?" Perception & Psychophysics 62 (4): 874–886. https://link.springer.com/content/pdf/10.3758/BF03206929.pdf
13. Otake, Takashi, Giyoo Hatano, Anne Cutler, and Jacques Mehler. "Mora or Syllable? Speech Segmentation in Japanese." Journal of Memory and Language 32 (2): 258–278.
14. Cutler, Anne, and Takashi Otake. "Mora or Phoneme? Further Evidence for Language-Specific Listening." Journal of Memory and Language 33 (6): 824–844.
15. Otake, Takashi. "Mora and Mora-Timing." In The Handbook of Japanese Linguistics (Mineharu Nakayama et al., eds.). Wiley.
16. Lambert, Sylvie. "Shadowing." Meta: Journal des Traducteurs 37 (2): 263–273. (Term-of-art definition.)
17. Hirata, Yukari, and Spencer D. Kelly. "Effects of Lips and Hands on Auditory Learning of Second-Language Speech Sounds." Journal of Speech, Language, and Hearing Research 53 (2): 298–310.
18. Couper, Graeme. "Talking About Pronunciation: Audio Recordings as a Self-Awareness Tool for Improving Second Language Pronunciation in the Phonetics and Phonology Classroom." Conference materials. https://www.academia.edu/8019666/
19. Trofimovich, Pavel. "Language Awareness and Second Language Pronunciation: A Classroom Study." Language Awareness 21 (4): 345–366. https://www.researchgate.net/publication/232937316
20. Forvo Media S.L. Forvo: The Pronunciation Dictionary. Platform self-description. (platform) https://forvo.com/
21. Forvo Media S.L. Japanese Pronunciation Dictionary. Platform self-description. (platform) https://forvo.com/languages/ja/
22. 日本放送協会 (NHK). NHK NEWS WEB EASY (やさしい日本語で書いたニュース). Platform self-description. (platform) https://www3.nhk.or.jp/news/easy/
23. Tofugu, LLC. "NHK News Web Easy Review." (limitation: secondary; cited only for platform-feature observations not visible on the source itself, e.g. furigana toggle, machine-generated audio.) https://www.tofugu.com/japanese-learning-resources-database/nhk-news-web-easy/
24. Innovative Language Learning. JapanesePod101. Platform self-description. (platform) https://www.japanesepod101.com/
25. 峯松信明研究室, 東京大学. OJAD (Online Japanese Accent Dictionary) and Suzuki-kun prosody tutor. Platform self-description. (platform) https://www.gavo.t.u-tokyo.ac.jp/ojad/
