List

My obsession with Cryptography and MCMC algorithms led me across several articles regarding applications of MCMC to decrypt different kinds of cyphers [1]. Among these, substitution cyphers are the simplest. A substitution cypher is basically every symbol in some arbitrary text (alphabets, numerals, punctuation marks, blank spaces) mapped bijectively over a set of arbitrary symbols. This set can be the same as the domain set or can be different. If the sets are same, the text would look scrambled. Depending on the cardinality C of the domain set, the actual key can be any of the C! combinations. Finding a key by brute force on a computer, would still require some sort of evaluation function to check whether the key is correct.

As you could have guessed, these challenges are insurmountable if one goes about doing it naively. Statistics comes to our rescue in almost a magical way. Here is a Shiny application I built, that demonstrates decryption of a demo text using a MCMC algorithm [1]. This app could also be used to decrypt your own text, however you would need to make sure the text is more than 1500 words to have sufficient sample size (Weak law of large numbers, blah blah…) and uses only letters, numbers and commonly used punctuations. Also decryption with reasonable accuracy takes 15K iterations at minimum, so you may need to be patient. To know more about how this algorithm works, keep reading.

Idea and framework

Let’s define the true key as a bijective function f : {code space} \rightarrow {usual alphabet} and the space of all possible keys as \mathcal{F} such that f \in \mathcal{F}. Here I use a very standard approach [1] to decrypting by using statistics of written English to guess at probable choices for f.

To get the statistics, I downloaded a pdf of “War and Peace” and recorded the first order transitions i.e. the proportion of consecutive text symbols from one character to the consecutive next one. This gives a matrix M(x, y) of
transitions for all possible pairs in the alphabet space. To define an evaluation criteria, I use plausibility of f being the right key as:
$$Pl(f) = \prod_{i\in \mathcal{S}}M (f(s_i), f(s_i+1))$$
where s_i runs over the space of consecutive symbols (\mathcal{S}) in the encrypted text. Functions f which have high values of  Pl(f)
are good candidates for decryption. Space \mathcal{F} can be searched for f's that maximizes Pl(f) iteratively by running the following Markov chain Monte Carlo (MCMC) algorithm:

Algorithm

  1. Start with a preliminary guess, say \(f\).
  2. Compute \(Pl(f)\).
  3. Change to \(f^*\) by making a random transposition of the values \(f\) assigns to two symbols.
  4. Compute \(Pl(f^∗)\); if this is larger than \(Pl(f)\), accept \(f^∗\).
  5. If not, flip a \(Pl(f^∗)/Pl(f)\) coin; if it comes up heads, accept \(f^∗\).
  6. If the coin toss comes up tails, stay at \(f\).
  7. Keep repeating steps 1 to 6 until \(Pl(f)\) stops increasing.

The algorithm tries to improve the current \(f\) by making random transpositions. The coin tosses allow it to go to less plausible \(f’s\), and keep it from getting stuck in local maxima.

References

[1] [pdf] Unknown bibtex entry with key []
[Bibtex]