
Feature Hashing Unpacked: How to Shrink High-Dimensional Vectors Without Losing Accuracy
TL;DR
- Feature hashing turns millions of features into a few thousand buckets while keeping distances almost unchanged.
- The hashing trick uses a single hash per entry and a random sign, making the operation linear in the number of non-zeros.
- Pick the target dimension M from the Johnson–Lindenstrauss formula M ≥ (2 / epsilon^2) * ln(2 / delta) to guarantee a distortion epsilon with failure probability delta.
- In practice, the constant hidden in the bound is very close to 1, so the formula works out-of-the-box.
- Use FeatureHasher from scikit-learn or PySpark to drop memory usage by > 90 %.
Why this matters
I once had to train a recommendation model on 200 million users and 50 million unique item attributes. Storing the raw feature matrix would have required petabytes of RAM, and every training epoch cost an extra 15 minutes because the CPU was busy copying dense columns. Feature hashing solved the problem in two ways: (1) it reduced the dimensionality to a few thousand buckets in a single pass, and (2) it kept Euclidean distances within a few percent of the original. The result was a model that trained in 1.5 hours and used only 200 GB of RAM instead of 5 TB.
Feature hashing is a key technique for dimensionality reduction in machine learning [Weinberger et al. 2009 — Feature Hashing for Large Scale Multitask Learning]. The idea is simple: for each original coordinate x_i we compute a bucket index h(i) in {0,…,M-1} and a random sign s(i) in {–1, +1}, then add s(i) * x_i to bucket h(i). The algorithm touches each non-zero entry exactly once, so the cost is linear in the number of non-zeros [Weinberger et al. 2009].
The random sign trick, also introduced in the same paper, removes bias from the hash and ensures that the expected value of the projected vector's inner products matches the original. The authors proved exponential tail bounds on the distortion, showing that feature hashing preserves distances with high probability [Weinberger et al. 2009].
The Johnson–Lindenstrauss lemma guarantees that a random linear map can preserve pairwise Euclidean distances up to a factor 1±epsilon with probability 1-delta if the target dimension satisfies
M ≥ (2 / epsilon^2) * ln(2 / delta).
This bound depends only on epsilon and delta, not on the original dimension n [Wikipedia: Johnson–Lindenstrauss lemma]. The classic JL construction uses a dense Gaussian matrix, which is expensive to apply [Weinberger et al. 2009].
Feature hashing can be viewed as a JL projection where the random matrix is extremely sparse: each column contains exactly one non-zero entry. Freksen et al. proved that for feature hashing the same JL bound holds with a constant factor that is essentially one [Freksen et al. 2018 — Fully Understanding the Hashing Trick]. That means you can use the same formula for M and still enjoy the same distance guarantees, but with a projection that costs only one add per non-zero entry.
Collision probability is a key metric. Since each coordinate is hashed uniformly at random, the probability that two heavy coordinates collide in the same bucket is 1/M [Weinberger et al. 2009]. If the vector's L∞ norm is large relative to its L2 norm, a single collision can dominate a bucket sum and degrade the distance estimate. Freksen et al. also identified the critical ratio ν (the maximum L∞/L2 that still guarantees the JL bound) and gave tight bounds for it [Freksen et al. 2018]. In practice the constant hidden in the asymptotic bound is very close to 1, so the formula predicts performance on real data.
Core concepts
Feature hashing at a glance
Think of a huge bag-of-words vector where each word is a separate dimension. Feature hashing takes every (word, count) pair, runs the word through a hash function that spits out an index between 0 and M-1, and flips a random sign (+1 or –1). All counts that collide in the same bucket are summed with the sign. The result is a new vector of length M where each entry is the sum of a few signed counts. Because each original coordinate contributes to exactly one bucket, the operation is O(nnz), where nnz is the number of non-zeros in the original vector.
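The sketch above fits in a few lines of plain Python. This is a minimal illustration, not a library implementation: CRC32 stands in for the hash (production systems use stronger hashes such as MurmurHash3), and the sign is drawn from a bit of the same digest.

```python
import zlib

def hash_vector(pairs, M, seed=0):
    """Hash (feature, value) pairs into an M-dimensional signed-bucket vector."""
    out = [0.0] * M
    for feature, value in pairs:
        h = zlib.crc32(f"{seed}:{feature}".encode())
        bucket = h % M                          # low bits pick one of M buckets
        sign = 1.0 if (h >> 31) == 0 else -1.0  # top bit picks the ±1 sign
        out[bucket] += sign * value             # one add per non-zero entry
    return out

v = hash_vector([("word1", 1.0), ("word2", 2.0)], M=8)
```

Each (feature, value) pair costs one hash and one add, so the whole transform is O(nnz) no matter how large the original vocabulary is.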
The Johnson–Lindenstrauss lemma
The JL lemma guarantees that a random linear map can preserve pairwise Euclidean distances up to a factor 1±epsilon with probability 1-delta if the target dimension satisfies
M ≥ (2 / epsilon^2) * ln(2 / delta).
The key insight is that M depends only on the desired error tolerance epsilon and confidence delta, not on the original dimension n [Wikipedia: Johnson–Lindenstrauss lemma].
Sparse versus dense projections
Dense projections (e.g. Gaussian random matrices) guarantee the JL bound but require O(nm) multiplications, which is prohibitive when n is in the billions. Sparse projections like feature hashing need only one addition per input coordinate, so the cost is O(nnz). Freksen et al. proved that the sparse projection still satisfies the JL bound with an essentially unit constant [Freksen et al. 2018].
Collision probability and heavy coordinates
If two heavy coordinates hash to the same bucket, their signed contributions can cancel or amplify, distorting the norm. The chance of a collision between two specific coordinates is 1/M [Weinberger et al. 2009]. If the L∞ norm of a vector is much larger than its L2 norm, collisions are more likely to matter. Freksen et al. identified the critical ratio ν that separates safe from unsafe regimes [Freksen et al. 2018].
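A quick way to spot risky vectors is to compute the L∞/L2 ratio directly. A minimal helper (the function name is illustrative):

```python
import math

def linf_l2_ratio(x):
    """L_inf / L_2 ratio; values near 1 mean one coordinate dominates."""
    l2 = math.sqrt(sum(v * v for v in x))
    return max(abs(v) for v in x) / l2 if l2 > 0 else 0.0

flat = [1.0] * 100             # mass spread evenly -> ratio 0.1
spiky = [100.0] + [1.0] * 99   # one heavy coordinate -> ratio near 1
```

Vectors like `spiky`, whose ratio approaches 1, are the ones where a single collision can dominate a bucket sum.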
How to apply it
1. Decide your tolerance
Pick an epsilon (say 0.1 for 10 % distortion) and a delta (say 0.01 for 99 % confidence). Compute
M = ceil( (2 / epsilon^2) * ln(2 / delta) ).
That’s the minimum number of buckets you need. If your data has many heavy coordinates, bump M up a little to account for the L∞/L2 ratio.
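The computation in step 1 is one line of Python; with epsilon = 0.1 and delta = 0.01 it gives M = 1060.

```python
import math

def jl_buckets(epsilon, delta):
    """Minimum bucket count from M = ceil((2 / eps^2) * ln(2 / delta))."""
    return math.ceil((2.0 / epsilon ** 2) * math.log(2.0 / delta))

M = jl_buckets(0.1, 0.01)  # 10% distortion, 99% confidence -> 1060
```

In practice you would round up to a power of two (here 2048) so the modulo reduction is a cheap bit mask.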
2. Build the hash function
You only need two hash functions: one to choose the bucket index and one to pick the sign. In practice both are usually derived from a single hash value: the low bits choose the bucket and one extra bit chooses the sign, so most libraries expose this as a single "hash" that returns a signed bucket.
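A minimal sketch of that single-hash construction, with `hashlib.blake2b` as an illustrative stand-in for the library's hash (scikit-learn, for instance, uses MurmurHash3):

```python
import hashlib

def bucket_and_sign(feature, M, seed=0):
    """Derive both the bucket index and the ±1 sign from one 64-bit digest."""
    digest = hashlib.blake2b(f"{seed}:{feature}".encode(), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    bucket = h % M                    # low bits choose the bucket
    sign = 1 if h >> 63 == 0 else -1  # top bit chooses the sign
    return bucket, sign
```

Changing `seed` yields an independent hash family, which is handy if you want to average several independent sketches.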
3. Implement with a library
scikit-learn
```python
from sklearn.feature_extraction import FeatureHasher

# Hash a bag-of-words into 1024 features; alternate_sign applies the ±1 trick
hasher = FeatureHasher(n_features=1024, input_type='dict', alternate_sign=True)
X = [{'word1': 1, 'word2': 2}, {'word1': 1, 'word3': 3}]
X_hashed = hasher.transform(X).toarray()
```
PySpark
```python
from pyspark.ml.feature import FeatureHasher

# Each numeric column contributes (columnName, value);
# each string column contributes (columnName=value, 1).
df = spark.createDataFrame([(1.0, 2.0, 'cat'), (0.0, 3.0, 'dog')],
                           ['num1', 'num2', 'animal'])
hasher = FeatureHasher(numFeatures=1024,
                       inputCols=['num1', 'num2', 'animal'],
                       outputCol='hashed')
df_hashed = hasher.transform(df)
```
Both snippets touch each non-zero entry exactly once and run in O(nnz) time.
4. Verify the distortion
Pick a handful of pairs of original vectors, hash them, and compute the ratio of squared distances. Plot the histogram of distortions; you should see most values clustered around 1 with a spread dictated by epsilon.
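Here is a self-contained version of that check; a table-based random projection stands in for the hash function (equivalent in distribution, and easier to read than inlined hashing):

```python
import random

def hash_vec(x, M, seed=0):
    """Project x into M buckets using fixed random bucket/sign tables."""
    rng = random.Random(seed)
    buckets = [rng.randrange(M) for _ in range(len(x))]       # bucket per coordinate
    signs = [rng.choice((-1.0, 1.0)) for _ in range(len(x))]  # ±1 per coordinate
    out = [0.0] * M
    for i, v in enumerate(x):
        if v:
            out[buckets[i]] += signs[i] * v
    return out

def sq_dist(a, b):
    return sum((u - w) ** 2 for u, w in zip(a, b))

data_rng = random.Random(1)
n, M = 10_000, 2_048
x = [data_rng.gauss(0, 1) for _ in range(n)]
y = [data_rng.gauss(0, 1) for _ in range(n)]

# The hashed squared distance should be within a few percent of the original.
ratio = sq_dist(hash_vec(x, M), hash_vec(y, M)) / sq_dist(x, y)
```

Repeat this over many pairs and plot the histogram of `ratio`; it should concentrate around 1 with a spread on the order of epsilon.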
5. Scale up
When you’re ready to deploy, set n_features to the value of M you computed. For a million-feature text dataset, hashing to 8,192 or 32,768 buckets usually gives sub-1 % error while cutting memory by more than 90 %.
6. Debugging tips
- If your model suddenly degrades, check the L∞/L2 ratio of your training vectors. Heavy coordinates may be colliding more often than expected.
- Verify that the hash function is truly random; deterministic or low-entropy hashes can cause systematic bias.
- If you see a spike in distortion for a specific feature, consider increasing M or removing that feature from the hashing space.
Pitfalls & edge cases
| Pitfall | What it looks like | Fix |
|---|---|---|
| M too small | Distortions blow up; model degrades | Increase M according to the JL formula or add a safety factor |
| Heavy coordinates | One huge entry dominates a bucket | Pre-scale or clip large counts; check L∞/L2 ratio |
| Poor hash quality | Non-uniform bucket distribution | Use a high-quality 64-bit hash like Murmur3; set seed for reproducibility |
| Over-hashing | Too many buckets waste memory | Keep M modest; hashing itself is O(nnz), but output vectors and downstream model size grow linearly with M |
| Bias in counts | All negative or all positive signs | Ensure alternate_sign=True or use a random sign function |
In practice, the most common mistake is under-estimating M. The JL formula gives the minimum dimension for the stated guarantee; on real data with a skewed weight distribution the observed error can still exceed epsilon, so add a safety factor. A quick sanity check is to compute the per-pair collision probability 1/M and make sure it's below 0.01.
Quick FAQ
| Question | Answer |
|---|---|
| How do I choose the target dimension M in practice? | Compute M from the JL formula, then add a few thousand buckets if you have very heavy coordinates. |
| What is the constant factor hidden in the asymptotic bound? | Freksen et al. showed it is very close to 1, so the bound is tight. |
| How does feature hashing perform on real-world datasets? | Empirical studies, including the 2018 paper, report distortion < 5 % on datasets with millions of features. |
| When do heavy coordinates lead to significant collisions? | When L∞/L2 exceeds the critical threshold ν; this happens with sparse vectors that have a few extremely large entries. |
| Are there alternative hashing schemes with tighter guarantees? | CountSketch and SimHash offer different trade-offs, but feature hashing remains the simplest scheme with strong theory. |
| Is feature hashing suitable for dense data? | It’s best for sparse data; for dense data a dense JL projection may be more appropriate. |
| How do I debug if hashing gives bad results? | Inspect distortion, collision probability, and the L∞/L2 ratio; adjust M or clip heavy entries. |
Conclusion
Feature hashing is a practical, mathematically grounded way to shrink high-dimensional data while preserving the geometry you care about. By choosing M from the JL formula and using a high-quality hash, you can reduce memory usage by > 90 % and cut training time dramatically. If you’re working with millions of categorical variables or text tokens, give feature hashing a try. It’s especially useful when the data is sparse and you need linear-time preprocessing. If your data is dense or you require exact distances, a dense JL projection may still be the better choice.
Glossary
- Feature Hashing / Hashing Trick – a dimensionality-reduction method that maps each feature to a bucket using a hash function and a random sign.
- Johnson–Lindenstrauss lemma – guarantees that a random linear map can preserve pairwise Euclidean distances up to a factor 1±epsilon with probability 1-delta.
- L∞ norm – the maximum absolute value among the components of a vector.
- L₂ norm – the Euclidean (ℓ₂) norm; the square root of the sum of squared components.
- Collision probability – the chance that two distinct coordinates hash to the same bucket; for uniform hashing this is 1/M.
- Dense projection – a random matrix where most entries are non-zero; expensive to multiply.
- Sparse projection – a random matrix with very few non-zeros per column; fast to multiply.
- Random sign – a ±1 multiplier chosen independently for each coordinate to remove bias.
- Dimensionality Reduction – the process of mapping high-dimensional data to a lower-dimensional space while preserving structure.
- Heavy coordinate – a feature with a value much larger than the rest, increasing the L∞/L₂ ratio.



