posted 7 years ago

I started reading The Drunkard's Walk and it struck me like lightning -- I maybe shouldn't generate short ids randomly.

The idea is to generate a reasonably short *unique* alphanumeric string before creating each new record. It's much easier to reason about `site.co/n8V34`

than whole mongo ids like `site.co/537d11c63437320002010000`

.

The psuedocode was as follows.

```
id = generate_new_id()
while id_already_exists(id)
id = generate_new_id()
end
```

There are `26 + 26 + 10`

alphanumeric characters, or about 60. With only 2 characters we could accomodate over 3600 records and worry about it later. Right?

(All of the following code is in Julia.)

The Birthday Problem is well-known and the premise is: How many people need enter a room until two people share the same birthday? Assume date of birth is random and ignore leap years. There are 365 days, so obviously if there are 366 people, two *must* share the same birthday. 100% probability. Beyond that we need probability theory.

The first person is first, so there's a (0 / 365) chance she has the same birthday as another. The second person has a (1 / 365) chance of having the same birthday as the first person. The third a (2 / 365) chance of sharing with either the first or second person. We add the probabilities because any would be a winning condition.

```
sum(map(x -> x / 365.0, 1:365)) # 183.0%
```

Well fuck we're over 100% for the last person. Something went wrong! The problem is that this violates the principle of the *sample space*. Every possibility must be considered. The first person's exact birthday is counter-intuitively important.

For example, if the first birthday is Jan 1, the probability of the second birthday also being Jan 1 is `(1 / 365) * (1 / 365) = (1 / 365^2)`

. The problem is that the probability of the first birthday being Jan 1 and the second being Jan 2 is also `(1 / 365^2)`

. Now, the probability of the first being Jan 1, the second being Jan 2, and the third being Jan 1 is `(1 / 365^3)`

. However that's only one winning combination considering person 3. It gets complicated.

```
npermutations(x) = permutations(365, x) |> it -> size(collect(it), 1)
probability(n) = ((365 ^ n) - npermutations(n)) / (365 ^ n)
probability(2) # ~0.3%
probability(3) # ~0.8%
probability(365) # ~100.0%
```

Now the 365th person has a 100% probability. Much better. If you try some of the middle terms tho you might break your computer.

A better way to solve the problem is to calculate the probability there isn't a special birthday pair. The first person enters the room with a (365 / 365) chance there are no matches. The second person has a (364 / 365) chance there remain no matches. The third person walks in with a (363 / 365) chance.

```
probability(n) = map(x -> (365 - x) / 365, 1:(n - 1)) |> xs -> reduce(*, xs)
1 - probability(2) # ~0.3%
1 - probability(3) # ~0.8%
1 - probability(365) # ~100.0%
```

Simple enough. Double-check with experimental data? Why not.

Install Gadfly, the Julia charting library.

```
$ julia -e 'Pkg.add("Gadfly")'
```

To simulate the birthday problem we add a random number from 1 to 365 representing one day to a set. If the number's already in the set, two people share the same birthday and we return the count.

```
using Gadfly
const days = 365
const runs = 1000
function runsim()
birthdays = Set()
while true
birthday = rand(1:days)
if birthday in birthdays
return length(birthdays) + 1
end
push!(birthdays, birthday)
end
end
function runresults()
results = Dict()
for i = 1:runs
result = runsim()
results[result] = get(results, result, 0) + 1
end
results
end
results = runresults()
xs = sort(collect(keys(results)))
ys = map(x -> results[x], xs)
p = plot(x=xs, y=ys, Guide.xlabel("People"), Guide.ylabel("Frequency"))
draw(SVG("birthdays.svg", 1000px, 500px), p)
```

Next we can compare the simulated distribution with the theoretical distribution.

```
# change runs
const runs = 100000
function prob(n)
pct = map(x -> (365 - x) / 365, 1:(n - 1)) |> xs -> reduce(*, xs)
int(floor((1 - pct) * runs))
end
function probresults()
results = Dict()
results[1] = 0
for i = 2:days
results[i] = prob(i)
end
# no double counting
for i = days:-1:2
results[i] -= results[i - 1]
end
results
end
sims = runresults()
xs = sort(collect(keys(sims)))
ys = map(x -> sims[x], xs)
actual = layer(x=xs, y=ys, Geom.point)
probs = probresults()
ys = map(x -> probs[x], xs)
theory = layer(x=xs, y=ys, Geom.line)
graph = plot(actual, theory, Guide.xlabel("People"), Guide.ylabel("Frequency"))
draw(SVG("graph.svg", 1000px, 500px), graph)
```

Not bad at all.

It doesn't take very many random walks to get a collision. The numbers for alphanumerics are pretty startling. With only 40 records the probability of a collision among the records is over 99%.

So now you're probably thinking the algorithm in psuedocode at the beginning is flawed. Nope. There's nothing wrong with it. The Birthday Problem isn't a proper isomorphism.

On the server, we don't care about all permutations. If the first token is `xF`

then that token is taken. There is no alternate reality where the first token is `5j`

. So there's a (0 / 3844) chance for the first token, and a (1 / 3844) chance for the second token, etc. An anology would be how many people does it take to find someone in a room with *your* birthday. About half. That said, it's always better to use a deterministic algorithm like the wonderful hashids project.

Tweet @aj0strow if I messed up somewhere.