# Number of Combinations for German Tax Id

I was asked by a colleague of mine how many combinations are possible for the German tax id. The German tax id is built entirely from eleven digits. Its specification is given in Pruefziffernberechnung, see Steueridentifikationsnummer in the German Wikipedia. Every person reported to live in Germany is assigned one tax id, even babies and children. 20 years after the death of a person the number can be recycled.

The rules for constructing a tax id are as follows:

1. last digit is check digit, and can be ignored for the rest of this discussion
2. first digit cannot be zero, but any digit from 1 to 9
3. one digit must appear exactly two times
4. the rest of the digits may only appear a single time

For example, 4895437120 and 5549267083 are valid tax ids, ignoring the 11th check digit. The first example tax id has 4 as repeating digit, the second has 5 as repeating digit. All other digits do not repeat.

Hint: As we have ten digits altogether, one of them being the repeating digit, so in effect we only have nine remaining digits to choose one of them to be the repeating digit. For example, 9×87054321, nine digits, excluding digit 6, we have nine digits to choose from, e.g., digit x=2. So, position 2 and position 9 now have both digit 2. For 98×7654321, we have only 8 possibilities, not nine, to choose our repeating digit, i.e., we can choose either the first digit (here 9), the fourth (here 7), the fifth (here 6), and so on. We cannot choose the second one (here 8), as this repeating digit was already covered in the case 9×87054321.

To deduce the number of possible combinations nine different cases have to be considered, depending on the repeating digit. First case:

1. 1st digit: 9 possibilities for the first digit (1-9)
2. 2nd digit: assume 2nd digit is our repeating digit, then we have 9 possibilities for this, see hint
3. 3rd digit: 9 possibilities (ten digits, one was already used for the 1st digit)
4. 4th digit: 8 possibilities
5. 5th digit: 7 possibilities
6. 6th digit: 6 possibilities
7. 7th digit: 5 possibilities
8. 8th digit: 4 possibilities
9. 9th digit: 3 possibilities
10. 10th digit: 2 possibilities

Add to this the second case:

1. 1st digit: 9 possibilities for the first digit (1-9)
2. 2nd digit: 9 possibilities (ten digits, one was already used for the 1st digit)
3. 3rd digit: assume 3rd digit is our repeating digit, then we have 8 possibilities for this, see hint
4. 4th digit: 8 possibilities
5. 5th digit: 7 possibilities
6. . . .
7. 10th digit: 2 possibilities

and so on up to the ninth case:

1. 1st digit: 9 possibilities for the first digit (1-9)
2. 2nd digit: 9 possibilities (ten digits, one was already used for the 1st digit)
3. . . .
4. 9th digit: 2 possibilities
5. 10th digit: assume 10th digit is our repeating digit, then we have 1 possibility for this, see hint

This is

9*(9)*9*8*…*2 + 9*9*(8)*8*…*2 + 9*9*8*(7)*…*2 + 9*9*8*7*…*(1)
= (9*9*8*…*2)*(9+8+…+2+1)
= 146.966.400

That number is strikingly low, compared to the roughly 82 million population in Germany, see German population.

The chance of getting a valid German tax id (without check digit) out of a complete random number (all digits with equal probability) is therefore 146.966.400 / 10^10, i.e., 1,47%.

As the derivation of the number of combinations might be a little entangled, it is good to know, whether there is an alternative way to prove that the number above is indeed correct, or at least plausible. For this I have written a short simulation which calculates the probability that a number is indeed a German tax id, see taxidsim.c. The output looks something like this:

``` ~/c/taxidsim -tn100000 ... *9 2 6 7 8 1 5 0 8 4 / 1110111121 *3 5 2 7 0 1 6 9 8 5 / 1111021111 *1 9 5 8 5 7 4 2 3 0 / 1111120111 *9 3 8 0 6 7 2 1 5 7 / 1111011211 Digit Occur. Perc. 0 100488 10.048800 1 100456 10.045600 2 100817 10.081700 3 100026 10.002600 4 99938 9.993800 5 99432 9.943200 6 99475 9.947500 7 99826 9.982600 8 99690 9.969000 9 99852 9.985200 tax ids: 1453 1.453000 ```

Enlarging n on the commandline gives 1.470%, which is quite near to our theoretical value.

taxidsim.c contains a number of unrolled loops, see, e.g., Dongarra/Hinds (1979), thus helping Out-of-order execution. The trick to detect tax ids is, first to count the number of occurences of x[i], i=0(1)9 in r[i], i=0(1)9, and then to count occurences in r[] in rs[]. The test is then
``` x[0] != 0 && rs[0] == 1 && rs[1] == 8 && rs[2] == 1 ```
i.e., first digit is non-zero, the number of no-occurrences is one, rs[0]=1, eight digits occur exactly one time, the double-occurence is 1, rs[2]=1.