
Bloom Filters: Space-Efficient Username Availability Check
Learn how Bloom Filters provide a space-efficient probabilistic data structure for checking username availability across millions of users with O(1) lookup time.
Introduction
Imagine you're building a website with 20 million users, and a new user comes to sign up. They want to register with the username @batman. How do you efficiently check if this username is already taken?
This seemingly simple problem becomes a significant engineering challenge at scale. Let's explore different solutions and understand why Bloom Filters are the elegant answer to this problem.
The Problem: Username Availability at Scale
When a new user tries to sign up with a username like @batman, we need to:
- Check our database of 20 million existing users
- Determine if
@batmanalready exists - Return the result quickly to provide a good user experience
The Naive Approach: Direct Database Query
The straightforward solution is to query the database directly:
SELECT * FROM users WHERE username = 'batman';But what happens when 5,000 users come for signup simultaneously?
- Database and server get more load
- Performance degrades significantly
- Time complexity: O(n) for each lookup
- This is bad performance-wise
Solution 1: HashMap
Our first optimization is to use a HashMap to store all usernames:
const usernameMap = {
'@batman': true,
'@superman': true,
'@wonderwoman': true,
// ... millions more entries
};HashMap Lookup
When a user enters their desired username, we simply check:
// O(1) lookup time
const isUsernameTaken = usernameMap[username];Advantages:
- Lookup time is O(1) - constant time!
- Much faster than database queries
Disadvantages:
- Space will escalate dramatically
- If 1,000 users are added every day → 1,000 new entries daily
- A lot of space will be needed to store all usernames
For 20 million users with average username length of 15 characters:
- Each entry ≈ 20 bytes (username + boolean + overhead)
- Total memory ≈ 400 MB just for usernames!
Solution 2: Bloom Filters
This is where Bloom Filters come to the rescue. A Bloom Filter is a space-efficient probabilistic data structure that works on the principle of false positives.
Key Characteristics
- If username is taken → Bloom Filter will say "definitely taken"
- If username is not taken → Bloom Filter will say "maybe available"
Wait, why "maybe"? Let's understand the two scenarios:
Scenario 1: Username IS Taken (@batman exists)
@batman (Taken) → Bloom Filter says "Taken"
✅ This is correct! If the Bloom Filter says it's taken, it IS taken for sure.
Scenario 2: Username is NOT Taken (@batman1 doesn't exist)
@batman1 (Not Taken) → Bloom Filter might say "Taken"
✅ This is okay! A false positive means we might reject an available username, but we'll never allow a duplicate. The user can simply try another username.
Why False Positives are Acceptable
| Actual State | Bloom Filter Says | Result | Acceptable? |
|---|---|---|---|
| Taken | Taken | Correct | ✅ Yes |
| Taken | Not Taken | Would be BAD | ❌ Never |
| Not Taken | Taken | False Positive | ✅ Yes |
| Not Taken | Not Taken | Correct | ✅ Yes |
The critical point: Bloom Filters NEVER produce false negatives. If something is in the set, it will always be detected.
How Bloom Filters Work
A Bloom Filter consists of two main components:
- Hash Functions - Convert input to array indices
- Bit Array (Buckets) - Store presence information as bits
The Bit Array
Let's take a simple example with a 10-bucket bit array:
Index: 0 1 2 3 4 5 6 7 8 9
Bits: [0] [0] [0] [0] [0] [0] [0] [0] [0] [0]
Each bucket stores just a single bit (0 or 1), making it extremely space-efficient.
Adding a Username: @batman
When we add @batman, the hash function might return indices: 0, 2, 4, 8
hash('@batman') → [0, 2, 4, 8]We set these bits to 1:
Index: 0 1 2 3 4 5 6 7 8 9
Bits: [1] [0] [1] [0] [1] [0] [0] [0] [1] [0]
↑ ↑ ↑ ↑
Adding Another Username: @superman
Hash function returns: 1, 4, 6, 8
hash('@superman') → [1, 4, 6, 8]Updated bit array:
Index: 0 1 2 3 4 5 6 7 8 9
Bits: [1] [1] [1] [0] [1] [0] [1] [0] [1] [0]
↑ ↑
(new) (new)
Checking Username Availability
Check: Is @superman available?
hash('@superman') → [1, 4, 6, 8]Check each index:
- Index 1 → 1 (taken)
- Index 4 → 1 (taken)
- Index 6 → 1 (taken)
- Index 8 → 1 (taken)
All bits are 1 → Username is TAKEN ✅
Check: Is @lemon available?
hash('@lemon') → [2, 3, 4, 7]Check each index:
- Index 2 → 1 (taken by @batman)
- Index 3 → 0 (not set!)
Even one 0 means the username is definitely available! ✅
Check: Is @appleuser available? (False Positive Example)
hash('@appleuser') → [1, 2, 6, 8]Let's check step by step:
- Index 1 → 1 (taken by @superman)
- Index 2 → 1 (taken by @batman)
- Index 6 → 1 (taken by @superman)
- Index 8 → 1 (taken by @batman & @superman)
All bits are 1 → Bloom Filter says "Taken"
But wait! We know @appleuser was never added. This is a false positive!
Why False Positives Happen
Since we know there is no @appleuser in our hash function, but all bits at positions [1, 2, 6, 8] are taken by other usernames, the Bloom Filter incorrectly reports it as taken.
This is the reason it's called a false positive.
Reducing False Positives
The false positive rate depends on:
- Bucket size - More buckets = fewer collisions
- Number of hash functions - More functions = better distribution
- Number of elements - More elements = more collisions
Impact of Bucket Size
| Bucket Size | False Positive Rate |
|---|---|
| 10 | High collisions |
| 1,000 | Lower collisions |
| 1,000,000 | Very low collisions |
With a bucket size of 1,000 instead of 10:
- Number of collisions decreases
- We can insert more users reliably
- Still much smaller than storing actual usernames!
Space Efficiency: The Real Magic
For 20 million users:
| Approach | Space Required |
|---|---|
| HashMap | ~400 MB |
| Bloom Filter | ~1 MB |
A Bloom Filter with 1% false positive rate for 20 million items needs only about ~20 MB of space!
Real-World Applications
Bloom Filters are used everywhere:
- Chrome Browser - Checks if a URL is malicious against a list of known bad URLs
- Gmail - Spam detection
- Databases - Check if a key exists before expensive disk lookups (LSM trees)
- CDNs - Avoid caching one-hit-wonders
- Blockchain - Bitcoin uses Bloom Filters for wallet synchronization
Implementation in JavaScript
Here's a simple Bloom Filter implementation:
class BloomFilter {
constructor(size = 1000, hashCount = 3) {
this.size = size;
this.hashCount = hashCount;
this.bitArray = new Array(size).fill(0);
}
// Simple hash functions
hash(str, seed) {
let hash = 0;
for (let i = 0; i < str.length; i++) {
hash = (hash * seed + str.charCodeAt(i)) % this.size;
}
return Math.abs(hash);
}
// Get all hash positions for a string
getHashPositions(str) {
const positions = [];
for (let i = 0; i < this.hashCount; i++) {
positions.push(this.hash(str, i * 10 + 1));
}
return positions;
}
// Add an element
add(element) {
const positions = this.getHashPositions(element);
positions.forEach((pos) => {
this.bitArray[pos] = 1;
});
}
// Check if element might exist
mightContain(element) {
const positions = this.getHashPositions(element);
return positions.every((pos) => this.bitArray[pos] === 1);
}
}
// Usage
const usernameFilter = new BloomFilter(10000, 4);
// Add existing usernames
usernameFilter.add('@batman');
usernameFilter.add('@superman');
// Check availability
console.log(usernameFilter.mightContain('@batman')); // true (definitely taken)
console.log(usernameFilter.mightContain('@wonderwoman')); // false (definitely available)
console.log(usernameFilter.mightContain('@newuser')); // might be false positiveWhen to Use Bloom Filters
Use Bloom Filters when:
- Space efficiency is critical
- False positives are acceptable
- You need O(1) lookup time
- The dataset is large
Don't use Bloom Filters when:
- You need to delete elements (use Counting Bloom Filter instead)
- False positives are unacceptable
- You need to retrieve actual values (Bloom Filters only check membership)
Conclusion
Bloom Filters are a brilliant example of trading accuracy for efficiency. By accepting a small probability of false positives, we gain:
- O(1) constant time lookups
- Massive space savings (often 10x or more)
- Scalability for millions or billions of elements
Next time you're designing a system that needs to check membership in a large set, consider whether a Bloom Filter might be the perfect solution!
Have questions about Bloom Filters or other system design concepts? Feel free to reach out!

