Bloom Filters: Space-Efficient Username Availability Check

Introduction

Imagine you're building a website with 20 million users, and a new user comes to sign up. They want to register with the username @batman. How do you efficiently check if this username is already taken?

This seemingly simple problem becomes a significant engineering challenge at scale. Let's explore different solutions and understand why Bloom Filters are the elegant answer to this problem.

The Problem: Username Availability at Scale

When a new user tries to sign up with a username like @batman, we need to:

Check our database of 20 million existing users
Determine if @batman already exists
Return the result quickly to provide a good user experience

The Naive Approach: Direct Database Query

The straightforward solution is to query the database directly:

SELECT * FROM users WHERE username = 'batman';

But what happens when 5,000 users come for signup simultaneously?

Database and server get more load
Performance degrades significantly
Time complexity: O(n) for each lookup
This is bad performance-wise

Solution 1: HashMap

Our first optimization is to use a HashMap to store all usernames:

const usernameMap = {
  '@batman': true,
  '@superman': true,
  '@wonderwoman': true,
  // ... millions more entries
};

HashMap Lookup

When a user enters their desired username, we simply check:

// O(1) lookup time
const isUsernameTaken = usernameMap[username];

Advantages:

Lookup time is O(1) - constant time!
Much faster than database queries

Disadvantages:

Space will escalate dramatically
If 1,000 users are added every day → 1,000 new entries daily
A lot of space will be needed to store all usernames

For 20 million users with average username length of 15 characters:

Each entry ≈ 20 bytes (username + boolean + overhead)
Total memory ≈ 400 MB just for usernames!

Solution 2: Bloom Filters

This is where Bloom Filters come to the rescue. A Bloom Filter is a space-efficient probabilistic data structure that works on the principle of false positives.

Key Characteristics

If username is taken → Bloom Filter will say "definitely taken"
If username is not taken → Bloom Filter will say "maybe available"

Wait, why "maybe"? Let's understand the two scenarios:

Scenario 1: Username IS Taken (@batman exists)

@batman (Taken) → Bloom Filter says "Taken"

✅ This is correct! If the Bloom Filter says it's taken, it IS taken for sure.

Scenario 2: Username is NOT Taken (@batman1 doesn't exist)

@batman1 (Not Taken) → Bloom Filter might say "Taken"

✅ This is okay! A false positive means we might reject an available username, but we'll never allow a duplicate. The user can simply try another username.

Why False Positives are Acceptable

Actual State	Bloom Filter Says	Result	Acceptable?
Taken	Taken	Correct	✅ Yes
Taken	Not Taken	Would be BAD	❌ Never
Not Taken	Taken	False Positive	✅ Yes
Not Taken	Not Taken	Correct	✅ Yes

The critical point: Bloom Filters NEVER produce false negatives. If something is in the set, it will always be detected.

How Bloom Filters Work

A Bloom Filter consists of two main components:

Hash Functions - Convert input to array indices
Bit Array (Buckets) - Store presence information as bits

The Bit Array

Let's take a simple example with a 10-bucket bit array:

Index:  0   1   2   3   4   5   6   7   8   9
Bits:  [0] [0] [0] [0] [0] [0] [0] [0] [0] [0]

Each bucket stores just a single bit (0 or 1), making it extremely space-efficient.

Adding a Username: @batman

When we add @batman, the hash function might return indices: 0, 2, 4, 8

hash('@batman') → [0, 2, 4, 8]

We set these bits to 1:

Index:  0   1   2   3   4   5   6   7   8   9
Bits:  [1] [0] [1] [0] [1] [0] [0] [0] [1] [0]
        ↑       ↑       ↑               ↑

Adding Another Username: @superman

Hash function returns: 1, 4, 6, 8

hash('@superman') → [1, 4, 6, 8]

Updated bit array:

Index:  0   1   2   3   4   5   6   7   8   9
Bits:  [1] [1] [1] [0] [1] [0] [1] [0] [1] [0]
            ↑               ↑
            (new)           (new)

Checking Username Availability

Check: Is @superman available?

hash('@superman') → [1, 4, 6, 8]

Check each index:

Index 1 → 1 (taken)
Index 4 → 1 (taken)
Index 6 → 1 (taken)
Index 8 → 1 (taken)

All bits are 1 → Username is TAKEN ✅

Check: Is @lemon available?

hash('@lemon') → [2, 3, 4, 7]

Check each index:

Index 2 → 1 (taken by @batman)
Index 3 → 0 (not set!)

Even one 0 means the username is definitely available! ✅

Check: Is @appleuser available? (False Positive Example)

hash('@appleuser') → [1, 2, 6, 8]

Let's check step by step:

Index 1 → 1 (taken by @superman)
Index 2 → 1 (taken by @batman)
Index 6 → 1 (taken by @superman)
Index 8 → 1 (taken by @batman & @superman)

All bits are 1 → Bloom Filter says "Taken"

But wait! We know @appleuser was never added. This is a false positive!

Why False Positives Happen

Since we know there is no @appleuser in our hash function, but all bits at positions [1, 2, 6, 8] are taken by other usernames, the Bloom Filter incorrectly reports it as taken.

This is the reason it's called a false positive.

Reducing False Positives

The false positive rate depends on:

Bucket size - More buckets = fewer collisions
Number of hash functions - More functions = better distribution
Number of elements - More elements = more collisions

Impact of Bucket Size

Bucket Size	False Positive Rate
10	High collisions
1,000	Lower collisions
1,000,000	Very low collisions

With a bucket size of 1,000 instead of 10:

Number of collisions decreases
We can insert more users reliably
Still much smaller than storing actual usernames!

Space Efficiency: The Real Magic

For 20 million users:

Approach	Space Required
HashMap	~400 MB
Bloom Filter	~1 MB

A Bloom Filter with 1% false positive rate for 20 million items needs only about ~20 MB of space!

Real-World Applications

Bloom Filters are used everywhere:

Chrome Browser - Checks if a URL is malicious against a list of known bad URLs
Gmail - Spam detection
Databases - Check if a key exists before expensive disk lookups (LSM trees)
CDNs - Avoid caching one-hit-wonders
Blockchain - Bitcoin uses Bloom Filters for wallet synchronization

Implementation in JavaScript

Here's a simple Bloom Filter implementation:

class BloomFilter {
  constructor(size = 1000, hashCount = 3) {
    this.size = size;
    this.hashCount = hashCount;
    this.bitArray = new Array(size).fill(0);
  }

  // Simple hash functions
  hash(str, seed) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      hash = (hash * seed + str.charCodeAt(i)) % this.size;
    }
    return Math.abs(hash);
  }

  // Get all hash positions for a string
  getHashPositions(str) {
    const positions = [];
    for (let i = 0; i < this.hashCount; i++) {
      positions.push(this.hash(str, i * 10 + 1));
    }
    return positions;
  }

  // Add an element
  add(element) {
    const positions = this.getHashPositions(element);
    positions.forEach((pos) => {
      this.bitArray[pos] = 1;
    });
  }

  // Check if element might exist
  mightContain(element) {
    const positions = this.getHashPositions(element);
    return positions.every((pos) => this.bitArray[pos] === 1);
  }
}

// Usage
const usernameFilter = new BloomFilter(10000, 4);

// Add existing usernames
usernameFilter.add('@batman');
usernameFilter.add('@superman');

// Check availability
console.log(usernameFilter.mightContain('@batman')); // true (definitely taken)
console.log(usernameFilter.mightContain('@wonderwoman')); // false (definitely available)
console.log(usernameFilter.mightContain('@newuser')); // might be false positive

When to Use Bloom Filters

Use Bloom Filters when:

Space efficiency is critical
False positives are acceptable
You need O(1) lookup time
The dataset is large

Don't use Bloom Filters when:

You need to delete elements (use Counting Bloom Filter instead)
False positives are unacceptable
You need to retrieve actual values (Bloom Filters only check membership)

Conclusion

Bloom Filters are a brilliant example of trading accuracy for efficiency. By accepting a small probability of false positives, we gain:

O(1) constant time lookups
Massive space savings (often 10x or more)
Scalability for millions or billions of elements

Next time you're designing a system that needs to check membership in a large set, consider whether a Bloom Filter might be the perfect solution!

Have questions about Bloom Filters or other system design concepts? Feel free to reach out!