Using Genetic Algorithms and Markov Chains to Crack Simple Substitution Ciphers
Authors: Don Miner and Matt Rodatus
Class: CMSC433H with Dr. Stephens
This algorithm uses a generic genetic algorithm with a fitness function that utilizes a hidden Markov model. The entities in the population are potential keys to crack the simple substitution. Strings of decoded characters that are more probable based on a trained model (trained off a large sample of text) yield higher fitness scores. In the end we discovered experimentally that using chain lengths of 2 through 5 is very good at cracking the codes and still remains fast.
Also, we discovered that using the Markov chains and a fitness bonus for getting common words (such as "and", "because", etc.) worked better in some situations. However, in some situations the algorithm got stuck on a local minimum because it incorrectly guessed some words early. We opted to purely using the Markov chains as it is more robust and elegant.
Properties we found experimentally to be good at finding solutions quickly:
- Use n-grams 2 through 5
- A population size of 200 keys
- The keys with fitnesses in the top 33% in the population survive to continue to the next generation
- An initial mutation rate of 33%, decreasing by 1%, compounded, each generation

A graph showing the increase of fitness per generation. Notice how the average value increases each generation and the standard deviation decreases (converging to a optimal fitness).
README.txt - Usage instructions
Example Output 1 - Sample cipher text, trained on Pride and Prejudice.
Example Output 2 - Same cipher, no spaces, trained on Pride and Prejudice with spaces removed
Complete source code is available if requested.
This program has been tested and appears to work as long as the trained text is similar to the plain text. For example, a Spanish song by Shakira was decrypted by training our application on Don Quixote (the original text written in Spanish).