
How Markov Chains, Born from Russian Poetry, Sparked AI & Autocorrect

Published by Brav

TL;DR

  • Markov chains trace back to Andrey Markov’s 19th-century study of Russian poetry.
  • These chains underpin everything from Google’s PageRank to your phone’s autocorrect.
  • LLMs like ChatGPT share a core principle: predicting the next symbol.
  • Build a simple letter-generator in minutes to see Markov chains in action.
  • Know the limits: real language models use transformers that remember far more than Markov’s memoryless steps.

Why this matters

I’ve spent years debugging a predictive-typing feature that never seemed to understand context. When I first stumbled on a 2025 article about how Markov chains fueled Google and autocorrect, I realized I was standing on the shoulders of a man who’d never seen a computer. Markov’s work gives us a mental model that ties probability, poetry, and modern AI together—removing the mystery from our daily tech.

  • Pain point: Most of us don’t know the historical roots of AI.
  • Pain point: We’re confused about how a Markov chain “works” in practice.
  • Pain point: It feels impossible to connect a poem to a language model.

Understanding Markov’s story turns those gaps into a clear narrative: a mathematician who turned letters into a statistical machine that predicted the future.

Core concepts

The Markov property

A process is memoryless if the future depends only on the present, not on the sequence of events that led there.
Mathematically:

P(X_{n+1} = x | X_n = x_n, X_{n-1} = x_{n-1}, …, X_0 = x_0) = P(X_{n+1} = x | X_n = x_n)

This is exactly what Markov demonstrated with a chain of vowels and consonants in Eugene Onegin—he counted 20 000 letters, labeled each as vowel or consonant, and calculated transition probabilities between the two categories. That tiny binary chain was the first statistical model to predict a letter based only on the preceding one, the birthplace of modern predictive text.
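Markov's result amounts to a 2×2 transition matrix. A minimal sketch of the idea, with made-up probabilities (not his actual counts):

```python
# Illustrative transition matrix over Markov's two states.
# The numbers are invented for demonstration; Markov's real figures
# came from hand-counting 20 000 letters of Eugene Onegin.
P = {
    'consonant': {'consonant': 0.33, 'vowel': 0.67},
    'vowel':     {'consonant': 0.87, 'vowel': 0.13},
}

# Memorylessness in action: the distribution over the next letter class
# depends only on the current state, not on anything earlier.
state = 'consonant'
print(P[state])  # {'consonant': 0.33, 'vowel': 0.67}
```

Each row sums to 1, since the next letter must be one of the two classes—that constraint is what makes the matrix a valid probability model.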

From poetry to probability theory

Markov’s biography is the story of an underfunded student in St. Petersburg who walked on crutches as a child and, through sheer disciplined teaching and study, built what we now call the Markov chain (Markov chain — Wikipedia, 2025). His 1906 paper, “On a certain class of mathematical statistical processes,” extended the law of large numbers and the central limit theorem to dependent random variables, laying the groundwork for the theory of stochastic processes (Markov chain — Wikipedia, 2025).

The Markov chain as a building block

Parameter | Use case | Limitation
Transition matrix | Predict next symbol from current state | Assumes independence of past states
Stationary distribution | Long-run behavior of a system | Requires time-homogeneous transitions
Autocorrect & PageRank | Estimate most likely next word or page | Needs large data to estimate probabilities

Markov’s chain is foundational to stochastic processes. Later mathematicians—Wiener, Kolmogorov, Bernstein—built on it, eventually feeding into the transformer architecture that powers ChatGPT and other LLMs.
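The stationary distribution from the table above can be found by power iteration: apply the transition matrix repeatedly until the state distribution stops changing. A sketch with assumed probabilities:

```python
# Power iteration on a 2-state chain. Rows: current state; columns: next.
# The probabilities are illustrative, not Markov's measured values.
P = [[0.33, 0.67],   # from consonant
     [0.87, 0.13]]   # from vowel

dist = [0.5, 0.5]    # any starting distribution converges for this chain
for _ in range(100):
    dist = [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

# The result is the long-run share of consonants vs vowels,
# independent of where the chain started.
print([round(d, 3) for d in dist])
```

This is the same computation, at toy scale, that PageRank performs over the web's link graph: the stationary distribution of a random surfer's Markov chain.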

Language models and next-token prediction

LLMs like ChatGPT are, at their core, next-token predictors. The OpenAI GPT-4 API documentation describes the model as an autoregressive transformer that generates a probability distribution over the next token given the entire past context (OpenAI GPT-4 — API Documentation, 2023). That is an evolution of the Markov idea: instead of looking at one previous state, the transformer can condition on thousands of tokens, but the mathematical principle—predict the next symbol—remains the same.
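To make the contrast concrete, here is a toy autoregressive sampler. `toy_model` is a hypothetical stand-in for a trained network: it returns a next-token distribution given the full context, whereas a Markov chain would be restricted to looking at the last token only.

```python
import random

def toy_model(context):
    # Hypothetical rule standing in for a trained network: after "the",
    # prefer a noun; otherwise prefer starting a new noun phrase.
    if context and context[-1] == 'the':
        return {'cat': 0.6, 'dog': 0.4}
    return {'the': 0.8, 'sat': 0.2}

def generate(n_tokens):
    context = []
    for _ in range(n_tokens):
        dist = toy_model(context)          # conditions on the whole history
        token = random.choices(list(dist), list(dist.values()))[0]
        context.append(token)
    return ' '.join(context)

print(generate(6))
```

A real transformer replaces the hand-written rule with billions of learned parameters, but the sampling loop—distribution in, token out, append, repeat—is exactly this shape.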

How to apply it

Let’s build a tiny Markov-chain letter generator to see how the math plays out.

import random
import collections

# Build a two-state chain (0 = consonant, 1 = vowel) from raw text.
# Markov did this by hand for the first 20 000 letters of Eugene Onegin.
VOWELS = set("aeiouаеёиоуыэюя")  # Latin and Russian vowels

def build_chain(text):
    transitions = collections.defaultdict(collections.Counter)
    prev = None
    for ch in text.lower():
        if not ch.isalpha():
            continue  # skip spaces, punctuation, digits
        cur = 1 if ch in VOWELS else 0
        if prev is not None:
            transitions[prev][cur] += 1
        prev = cur
    # Normalize counts to probabilities
    return {k: {c: v / sum(cnt.values()) for c, v in cnt.items()}
            for k, cnt in transitions.items()}

def generate(probs, length=200):
    state = random.choice(list(probs))  # start from a state we have data for
    out = []
    for _ in range(length):
        out.append('vowel' if state else 'consonant')
        state = random.choices(list(probs[state]),
                               list(probs[state].values()))[0]
    return ' '.join(out)

# Example: load a small sample of Onegin
sample = 'Ещё раз смотришь ты… и над тем не смотришь.'
chain = build_chain(sample)
print(generate(chain))

Metrics you can measure

Metric | How to compute | Why it matters
Per-token entropy | -sum(p * log2(p)) | Indicates how predictable the chain is; lower entropy means better predictive power
State distribution | Count of 0/1 occurrences | Shows whether vowels dominate, affecting model bias
Transition consistency | Ratio of observed transitions to total | Detects data sparsity; high consistency suggests a reliable chain

By tweaking the transition matrix or adding more symbols (e.g., whole words), you can see the trade-off between simplicity and accuracy.

Pitfalls & edge cases

  • Memorylessness vs. real language: Human language has long-range dependencies. Markov chains ignore them, so your model will fail on phrases that rely on context beyond one word.
  • Data sparsity: With only 20 000 letters, the transition counts for rare patterns are noisy. Larger corpora reduce this problem but increase computational cost.
  • Transformers vs. Markov chains: Transformers can attend to any part of the input, but they are heavy. For tiny devices, a Markov chain might be the only feasible solution (think early cell-phone autocorrect).
  • Open question: How did Markov’s background (a poor family, crutches, political dissent) shape his analytical rigor? It is tempting to speculate that hardship pushed him toward concise, tractable models, but that link is biographical conjecture, not established fact.

Quick FAQ

Q: How did Markov’s upbringing influence his work?
A: Markov grew up in poverty, used crutches until age 10, and taught himself advanced mathematics—traits that fostered a drive to find elegant, universal patterns.

Q: What exactly is the Markov property?
A: The future state depends only on the current state, not on the sequence of events that led there.

Q: Why do LLMs still use next-token prediction?
A: Predicting the next token is the simplest statistical task that scales with context length; transformers are just a powerful implementation of that idea.

Q: Can Markov chains power modern autocorrect?
A: Simple Markov models underpinned early autocorrect; modern systems use more complex probabilistic language models.

Q: Are there cases where Markov chains outperform transformers?
A: In very low-resource environments or when only short sequences matter, Markov chains can be faster and require less memory.

Q: What are the limitations of Markov chains for language modeling?
A: They cannot capture long-range dependencies and suffer from data sparsity when state spaces grow.

Conclusion

Markov chains are more than an academic curiosity—they are the conceptual ancestor of every predictive text, search engine ranking, and LLM you use. If you’re a student or a hobbyist looking to grasp AI fundamentals, start by building a simple Markov chain. Then layer on transformers and attention mechanisms. The history of AI is a lineage: from Andrey Markov’s vowel counts to ChatGPT’s billions of tokens. Embrace the story, experiment with code, and watch the math come to life.

Actionable next steps

  1. Grab the 20 000-letter dataset from Eugene Onegin (public domain) and run the code above.
  2. Replace the binary states with entire words; observe how entropy changes.
  3. Compare your Markov chain predictions with a pre-trained GPT-3.5 Turbo API call to see how much additional context matters.
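Step 2 can be sketched like this; the training sentence is a toy stand-in for a real corpus:

```python
import collections
import random

# Word-level Markov chain: states are whole words instead of the
# vowel/consonant labels, so the state space grows with the vocabulary.
def build_word_chain(text):
    words = text.lower().split()
    chain = collections.defaultdict(collections.Counter)
    for prev, cur in zip(words, words[1:]):
        chain[prev][cur] += 1
    return chain

def generate_words(chain, start, n=10):
    out = [start]
    for _ in range(n):
        counts = chain.get(out[-1])
        if not counts:
            break  # dead end: this word was never followed by anything
        out.append(random.choices(list(counts), list(counts.values()))[0])
    return ' '.join(out)

corpus = "the cat sat on the mat and the cat ran"
chain = build_word_chain(corpus)
print(generate_words(chain, 'the'))
```

With word states the transition table is much sparser—most word pairs never occur—which is exactly the data-sparsity trade-off discussed in the pitfalls above.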

Who should read this?
Students, science enthusiasts, devs building low-resource text tools, or anyone curious about the hidden math behind everyday AI.


Glossary

  • Markov chain: A stochastic process where the next state depends only on the current state.
  • Stochastic process: A collection of random variables indexed by time or space.
  • Central limit theorem: A statistical theorem that states the sum of many independent random variables tends toward a normal distribution.
  • Transformer: A neural network architecture that uses self-attention to process sequences of tokens.
  • LLM (Large Language Model): A model that can generate human-like text by predicting the next token.
  • Autocorrect: A feature that automatically corrects misspelled words based on statistical likelihoods.

Last updated: December 19, 2025
