Substitution Cipher — Decryption & Encryption

🇸🇰 Slovensky | 🇬🇧 English | 🇫🇷 Français

Background — What Is This About?

A classic high school exercise: given a ciphertext and a table of letter frequencies, can it be decrypted using only single-character frequencies — without bigrams or trigrams?

The short answer: not automatically. A frequency table alone gives only a rough initial guess. Many letters share similar frequencies (e.g. s, n, r, t, i in French all fall between 6–8 %), so the ordering is unreliable. What students actually did was combine the frequency hint with manual trial-and-error and linguistic intuition — guessing words from partial context. Bigrams and trigrams automate exactly that intuition.

This program replicates that process algorithmically, using frequency analysis combined with simulated annealing.

Key Insight — Data Quality Matters Most

The single most important factor for decryption quality is the size and accuracy of the n-gram frequency tables. A first version using only ~40 bigrams and ~50 trigrams produced garbled text; expanding the tables to ~250 bigrams and ~100 trigrams immediately produced near-perfect decryption — a far bigger improvement than any algorithmic tuning (more restarts, more iterations, trigram weights, etc.).

The scoring function can only distinguish good French from bad French if it has enough reference data. With too few n-grams, almost every letter pair receives the same penalty, and the algorithm cannot navigate toward the correct solution.

What Is a Substitution Cipher?

A substitution cipher uses a secret key — a table of 26 pairs mapping each plaintext letter to a ciphertext letter:

plaintext:  a b c d e f g ...
ciphertext: r x k o f p q ...

Every a becomes r, every b becomes x, etc. The result looks random, but it preserves the statistical properties of the original language — some letters still appear more often than others. That is exactly what allows breaking the cipher without knowing the key.

Project Files

File	Description
`decrypt_revised2.py`	Main decryption script
`encrypt.py`	Encryption script (monoalphabetic substitution)
`crypt.txt`	Encrypted input text
`french_character_frequencies.csv`	French single-letter frequencies (26 letters)
`french_bigram_frequencies.csv`	French bigram frequencies (~250 pairs)
`french_trigram_frequencies.csv`	French trigram frequencies (~100 triples)
`stat.csv`	Alternative frequency source (format: `char;frequency`)
`preklady/`	Output directory for results (created automatically)
`test/`	Test suite

Frequency tables sourced from sttmedia.com and expanded for better coverage.

How to Run

Decryption

python3 decrypt_revised2.py

All input files must be in the same directory as the script. The result is printed to the console and saved to preklady/preklad_Rev2_YYYYMMDD_HHMMSS.txt.

Encryption

encrypt.py encrypts text using the same cipher type (monoalphabetic substitution).

# Random key, interactive input
python3 encrypt.py

# Encrypt a file with a random key
python3 encrypt.py -i input.txt -o encrypted.txt

# Encrypt with a specific 26-letter key (alphabet permutation)
python3 encrypt.py -i input.txt -k qwertyuiopasdfghjklzxcvbnm

# Invert a decryption mapping from decrypt_revised2.py
python3 encrypt.py --from-decrypt-map '{"a": "q", "b": "w", ...}'

# Verify correctness (encrypts then decrypts and compares)
python3 encrypt.py -i input.txt --verify

The key is a 26-character string — a permutation of the alphabet where position i specifies the replacement for the i-th letter. The key is automatically saved alongside the output file (*_kluc.txt).

Algorithm — Step by Step

Step 0 — Caesar Detection (Pre-check)

try_caesar()

Before launching the expensive simulated annealing, the algorithm tests all 25 possible Caesar shifts. A Caesar cipher maps every letter by the same fixed offset (e.g. every letter shifted by 4). If the best Caesar shift produces a high score, the result is used directly and simulated annealing is skipped entirely.

For general substitution ciphers (where each letter has an independent mapping), the Caesar scores will be very low and the algorithm proceeds to the full optimisation.

Step 1 — Frequency Analysis (Initial Guess)

generate_frequency_mapping()

Letter frequencies in the ciphertext are computed and sorted from most to least frequent. The same is done for the French reference table. The two sorted lists are aligned — the most frequent cipher letter is mapped to the most frequent French letter, and so on.

cipher (by freq):  r  s  h  f  e  n  ...
French (ref):      e  a  s  t  i  r  ...
→ initial mapping: r→e, s→a, h→s, f→t, ...

This is only a rough estimate — letters with similar frequencies will often be misassigned. It serves as a starting point for the next stage. Letters absent from the ciphertext are filled in randomly.

Step 2 — Scoring

score_text()

To evaluate a mapping, the algorithm scores the decrypted text using log-probability of n-grams.

A bigram is a pair of adjacent letters (e.g. le, en, es). A trigram is a triple (e.g. ent, que, les). Each language has characteristic n-grams — in French, le is very common (~2.2 %), while xq is virtually absent.

For each bigram and trigram in the decrypted text, the algorithm looks up its reference frequency and takes the natural logarithm. All values are summed:

score = Σ w_bi · log(P(bigram)) + Σ w_tri · log(P(trigram))

Trigrams carry double weight (w_tri = 2.0) because they encode more structural information. Unknown n-grams receive a small penalty 1e-9 instead of zero (to avoid log(0)). A higher (less negative) score means the text more closely resembles natural French.

Step 3 — Simulated Annealing (Optimisation)

simulated_annealing()

The goal is to find the 26-letter mapping with the highest score. The search space has 26! ≈ 4 × 10²⁶ possibilities — exhaustive search is impossible.

The algorithm is inspired by the physical process of annealing metals: a metal is heated to a high temperature and slowly cooled. At high temperature, atoms move freely and can escape unfavourable arrangements; at low temperature, they settle into a stable (ideally globally optimal) state.

At each step:

Randomly swap two letters in the current mapping
Compute the score delta incrementally (only n-grams affected by the swap are recalculated)
If the new mapping is better — always accept it
If the new mapping is worse — accept it with probability exp(Δ/T)

The probability of accepting a worse solution depends on temperature T, which decreases exponentially:

T(i) = T_start · (T_end / T_start)^(i / iterations)

At high T, even much worse solutions are accepted — the algorithm explores broadly. At low T, it behaves like hill-climbing and refines the details. This is the key advantage over plain hill-climbing: the ability to escape local optima.

Incremental scoring: Instead of rescoring the entire text after every swap (~800 n-gram lookups), the algorithm only recalculates the n-grams at positions affected by the two swapped letters (~5–15 % of the text). This makes each iteration significantly faster.

Step 4 — Multiple Restarts

__main__

Simulated annealing is non-deterministic — each run may end at a different solution. The entire process is therefore repeated NUM_RESTARTS times, each with a fresh random starting mapping. Only the result with the highest score is kept.

Multiple restarts significantly increase the probability that at least one run finds the global (or near-global) optimum.

Step 5 — Saving the Result

The best decrypted text is:

printed to the console
saved to preklady/preklad_Rev2_YYYYMMDD_HHMMSS.txt

The timestamp in the filename ensures every run produces a unique file without overwriting previous results.

Function Reference

Function	Description
`load_char_stats(filepath)`	Loads letter frequencies from CSV; supports `;` and `,` separators
`load_ngram_stats(filepath)`	Loads bigram/trigram frequencies; skips accented characters
`get_cipher_text(filepath)`	Reads the ciphertext from file
`get_text_freqs(text)`	Computes percentage letter frequencies in a text
`decrypt_text(text, mapping)`	Applies a mapping to text, preserving case
`score_text(text, bigrams, trigrams)`	Scores text by log-probability of bigrams and trigrams
`try_caesar(cipher_text, bigrams, trigrams)`	Tests all 25 Caesar shifts and returns the best
`generate_frequency_mapping(...)`	Creates initial mapping from frequency analysis
`precompute_cipher(cipher_text)`	Converts text to integer index array for fast scoring
`compute_score_delta(...)`	Incrementally computes score change for a letter swap
`simulated_annealing(...)`	Optimises the mapping using simulated annealing with incremental scoring

Tunable Parameters

Parameter	Default	Description
`NUM_RESTARTS`	`50`	Number of independent runs
`ITERATIONS`	`100000`	Steps per annealing run
`T_start`	`5.0`	Initial annealing temperature
`T_end`	`0.0005`	Final annealing temperature
`w_bi`	`1.0`	Bigram weight in scoring
`w_tri`	`2.0`	Trigram weight in scoring
`CAESAR_THRESHOLD`	`-500`	Caesar score above which SA is skipped

Increasing NUM_RESTARTS or ITERATIONS improves result quality at the cost of runtime. However, the most impactful improvement is using comprehensive n-gram frequency tables — see Key Insight.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.idea		.idea
preklady		preklady
test		test
.gitignore		.gitignore
README.fr.md		README.fr.md
README.md		README.md
README.sk.md		README.sk.md
README1.md		README1.md
Riesenie.txt		Riesenie.txt
crypt.txt		crypt.txt
decrypt_revised1.py		decrypt_revised1.py
decrypt_revised2 copy.py		decrypt_revised2 copy.py
decrypt_revised2.py		decrypt_revised2.py
encrypt.py		encrypt.py
french_bigram_frequencies V 0.csv		french_bigram_frequencies V 0.csv
french_bigram_frequencies V 1.csv		french_bigram_frequencies V 1.csv
french_bigram_frequencies.csv		french_bigram_frequencies.csv
french_character_frequencies.csv		french_character_frequencies.csv
french_trigram_frequencies V 0.csv		french_trigram_frequencies V 0.csv
french_trigram_frequencies V 1.csv		french_trigram_frequencies V 1.csv
french_trigram_frequencies.csv		french_trigram_frequencies.csv
stat.csv		stat.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Substitution Cipher — Decryption & Encryption

Background — What Is This About?

Key Insight — Data Quality Matters Most

What Is a Substitution Cipher?

Project Files

How to Run

Decryption

Encryption

Algorithm — Step by Step

Step 0 — Caesar Detection (Pre-check)

Step 1 — Frequency Analysis (Initial Guess)

Step 2 — Scoring

Step 3 — Simulated Annealing (Optimisation)

Step 4 — Multiple Restarts

Step 5 — Saving the Result

Function Reference

Tunable Parameters

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Substitution Cipher — Decryption & Encryption

Background — What Is This About?

Key Insight — Data Quality Matters Most

What Is a Substitution Cipher?

Project Files

How to Run

Decryption

Encryption

Algorithm — Step by Step

Step 0 — Caesar Detection (Pre-check)

Step 1 — Frequency Analysis (Initial Guess)

Step 2 — Scoring

Step 3 — Simulated Annealing (Optimisation)

Step 4 — Multiple Restarts

Step 5 — Saving the Result

Function Reference

Tunable Parameters

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages