Decode a Caesar ciphertext with high probability

Question

Caesar ciphers

A Caesar cipher with shift=N is the process of replacing any alphabetic character in a string with the letter which is N positions ahead in the alphabet (wrapping back at the beginning).

This is the key for Caesar(shift=5) (supposing a single-case English alphabet):

 these:  ABCDEFGHIJKLMNOPQRSTUVWXYZ
 map to: FGHIJKLMNOPQRSTUVWXYZABCDE

And this is the result of applying it to "HELLO, WORLD!":

 "HELLO, WORLD!"
 "MJQQT, BTWQI!"
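A minimal Python sketch of the substitution described above, assuming that only ASCII letters are shifted and everything else passes through unchanged. The function name `caesar` is illustrative and not part of the challenge.

```python
def caesar(text: str, shift: int) -> str:
    """Shift each ASCII letter `shift` places forward, wrapping after z/Z."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr(base + (ord(ch) - base + shift) % 26))
        else:
            out.append(ch)  # spaces and punctuation pass through unchanged
    return "".join(out)


print(caesar("HELLO, WORLD!", 5))  # MJQQT, BTWQI!
```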

There have been other challenges ( like these ) that require cracking the Caesar cipher, using an extra piece of information beyond the ciphertext to mathematically figure out the shift.

This challenge

This challenge gives you no extra hint. It just asks to:

«Write a program or function that takes a short Caesar-encrypted text and finds with high probability the original plain English text. »

To avoid any doubts, I'm asking you to try to crack patterns in the English language (like for instance the high probability that the most abundant letter decodes to an "e").

Your function/program takes as input:

a string of characters from this ascii subset: " abcdefghijklmnopqrstuvwxyz()-,;: "'!? ", containing 5 to 10 words (i.e. bits separated by " ").

It should output with high accuracy:

the shift N (in the range 0..25 ) that was most likely used to obtain this string from an unencrypted sentence made of English words
OR the anti-shift M (in the range 0..25 ) that would be required to obtain an unencrypted sentence made of English words ( M = 26 - N except for N = 0 , for which M = 0 too)
OR the unencrypted sentence itself
( OR just the alphabetic characters of it)
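To illustrate how the shift N and the anti-shift M relate, here is a small sketch. The helper names `anti_shift` and `shift_text` are hypothetical, and the sample string is a battery row whose listed shift is N = 9.

```python
def anti_shift(n: int) -> int:
    """Return M such that shifting by N and then by M restores the text."""
    return (26 - n) % 26


def shift_text(text: str, k: int) -> str:
    """Shift lowercase letters forward by k; leave other characters alone."""
    return "".join(
        chr(97 + (ord(c) - 97 + k) % 26) if "a" <= c <= "z" else c
        for c in text
    )


cipher = "jubx rmnwcroh ngrbcrwp kruub cqjc fn"  # battery row with N = 9
print(anti_shift(9))                      # 17
print(shift_text(cipher, anti_shift(9)))  # also identify existing bills that we
```

The modulo takes care of the N = 0 special case, so no separate branch is needed.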

Scoring

This is both code-golf and test-battery , so you need to write short code (low #bytes ) that performs sufficiently well (high #correct answ. ) on a large number of test cases.

The score is computed as (these are equivalent):

$${\rm score} = \frac{{\rm \#bytes}}{\rm accuracy} = \frac{10000\cdot {{\rm \#bytes}}}{\rm \# correct\,\,answ.} = \frac{{\rm \#bytes}}{1-\frac{\rm \# errors}{10000}}$$

after having tested the code on a sample of 10'000 ciphertexts . Lowest score (per programming language) wins.

Accuracy must be at least 30% for a qualifying answer.
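The scoring rule can be sketched as a small function. This is only an illustration of the formula and the 30% threshold stated above; the name `score` and its parameters are assumptions.

```python
def score(n_bytes: int, n_correct: int, n_tests: int = 10_000) -> float:
    """Bytes divided by accuracy; lower is better."""
    # Integer comparison avoids rounding issues right at the 30% threshold.
    if n_correct * 10 < n_tests * 3:
        raise ValueError("accuracy below the 30% qualifying threshold")
    return n_bytes * n_tests / n_correct


print(score(100, 8000))  # 125.0
```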

The 10'000 test cases are here . Here is an excerpt:

 jubx rmnwcroh ngrbcrwp kruub cqjc fn                               9  17  also identify existing bills that we
 always have it compute all the posterior possibilities for all     0   0  always have it compute all the posterior possibilities for all
 fuhgleoh vrxufhv djuhh wkdw vxssob zloo eh yhub wljkw wkurxjkrxw   3  23  credible sources agree that supply will be very tight throughout
 zhuh qrw vhqvlwlyh wr wkh                                          3  23  were not sensitive to the
 wivv, jf zk'j rmrzcrscv kf repfev ivxriucvjj fw vtfefdzt jkrklj   17   9  free, so it's available to anyone regardless of economic status
 svvecdbkdsyx yp ryg dbisxq dy cryo-rybx sx k                      10  16  illustration of how trying to shoe-horn in a
 ivhlzivu kf gifultv, reu sp                                       17   9  required to produce, and by
 svwev nwz, eqbp bpm illml                                          8  18  known for, with the added

Use the first column as a sequence of inputs with which to test your program/function.

Aim at predicting correctly the output. The correct output is reported in the second, third and fourth columns in different valid formats. Be consistent with your output: always aim at outputting the shift, or the anti-shift (remember, this is 26-N modulo 26 ), or the plaintext.

(Notes: (1) the battery file is made of fixed-length columns, it's not separator-based; a CSV version is provided here that uses double quotes when necessary, and escapes double quotes with double-double quotes ( "" ); (2) the battery file is based on a corpus and may contain offensive words)

If your code has a very slow runtime, or for the purpose of showing proof of your score on services like AttemptThisOnline, you can use just a subset of the test battery as long as you pick from the head and not cherry pick. If possible, try to run the code locally on the whole battery or the largest head-subset you can handle, before declaring your score.
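A rough sketch of measuring accuracy on a head-subset of the CSV version follows. It assumes the CSV has no header row and keeps the column order of the excerpt (ciphertext, N, M, plaintext); both are assumptions. The function name `accuracy_on_head` and the placeholder solver are hypothetical.

```python
import csv
import io
from typing import Callable, Iterable, Sequence


def accuracy_on_head(
    rows: Iterable[Sequence[str]], solver: Callable[[str], int], limit: int
) -> float:
    """Fraction of the first `limit` rows for which solver(ciphertext) == N."""
    correct = total = 0
    for row in rows:
        if total == limit:
            break
        ciphertext, n = row[0], int(row[1])
        correct += solver(ciphertext) == n
        total += 1
    return correct / total if total else 0.0


# Two rows quoted from the excerpt, written in the assumed CSV layout.
sample = io.StringIO(
    'zhuh qrw vhqvlwlyh wr wkh,3,23,were not sensitive to the\n'
    '"svwev nwz, eqbp bpm illml",8,18,"known for, with the added"\n'
)
always_three = lambda text: 3  # placeholder solver that always answers 3
print(accuracy_on_head(csv.reader(sample), always_three, limit=2))  # 0.5
```

Reading rows in file order and stopping at `limit` keeps the subset a head of the battery rather than a hand-picked selection.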

Vyxal, 0 bytes, score 0/(402/10000) = 0. You might want to add a +1 to the bytes, or remove unchanged testcases entirely, since cat programs are usually very short — emanresu A, May 18 at 23:23
Do you happen to have this file in a format a bit easier to parse, like CSV? — Command Master, May 19 at 4:26
Taking the rotation with the maximum number of appearances of etoainsr works in 9657 of the tests — Command Master, May 19 at 4:45
Taking the most common letter (not space) and assuming it goes to e is enough to hit ~35%. — xnor, May 19 at 6:06
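The simplest heuristic mentioned in the comments (map the most common letter to e) might look like the sketch below. The function name is hypothetical, and ties between equally frequent letters are resolved arbitrarily by first appearance.

```python
from collections import Counter


def guess_by_top_letter(ciphertext: str) -> int:
    """Guess N by assuming the most frequent letter stands for 'e'."""
    letters = [c for c in ciphertext if "a" <= c <= "z"]
    if not letters:
        return 0
    top, _ = Counter(letters).most_common(1)[0]
    return (ord(top) - ord("e")) % 26


print(guess_by_top_letter("zhuh qrw vhqvlwlyh wr wkh"))  # 3
```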

Neil · Accepted Answer · 2024-05-19 20:06:28Z


Charcoal , 31 bytes, 93.66%, score 33.1

 ≔ＥβΣＥθ∧№βλ№etaonis§β⁻⌕βλκηＩ⌕η⌈η

Attempt This Online! Link is to verbose version of code. Outputs the shift N . Explanation: Finds the lowest N where the "cleartext" contains as many of the letters etaonis as possible. Removing a letter reduces the accuracy to 88.58%, while adding a letter only increases it to 94.65%; either way the resulting score is slightly higher (that is, worse). Even switching to calculating M reduces the accuracy to 93.42%!
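For readers who do not read Charcoal, the explanation above can be approximated in Python as follows. This is a sketch of the described idea, not a port of the golfed program, and the function name `guess_shift` is made up for illustration.

```python
def guess_shift(ciphertext: str, common: str = "etaonis") -> int:
    """Return the lowest N whose decoding holds the most `common` letters."""
    def hits(n: int) -> int:
        return sum(
            chr(97 + (ord(c) - 97 - n) % 26) in common
            for c in ciphertext
            if "a" <= c <= "z"
        )
    # max() keeps the first maximum, so ties resolve to the lowest N.
    return max(range(26), key=hits)


print(guess_shift("zhuh qrw vhqvlwlyh wr wkh"))  # 3
```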

There is actually a way to run a whole test suite from the command line but sadly I've never tried it myself, so for testing I actually wrote a longer version which reads in all 10,000 strings in turn.

My best accuracy using a variation of this method is 99.17% achieved by adding negative weighting for the letters xxzzjjkvp (yes that's double weighting for xzj ), plus disallowing all q s not followed by a u .
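The exact weights of the 99.17% variant are not given, so the sketch below assumes +1 for each letter in etaonis and -1 for each match against xxzzjjkvp (which makes x, z and j cost 2). It treats "disallowing" as scoring the candidate at negative infinity. All names are hypothetical.

```python
import re


def weighted_score(candidate: str) -> float:
    """Score a candidate plaintext; higher looks more like English."""
    if re.search(r"q(?!u)", candidate):
        return float("-inf")  # a q not followed by u disqualifies the candidate
    good = sum(candidate.count(c) for c in "etaonis")
    bad = sum(candidate.count(c) for c in "xxzzjjkvp")  # x, z, j count twice
    return good - bad  # assumed weights: +1 per common letter, -1 per rare one


def guess_shift_weighted(ciphertext: str) -> int:
    """Return the lowest N whose decoding has the best weighted score."""
    def decode(n: int) -> str:
        return "".join(
            chr(97 + (ord(c) - 97 - n) % 26) if "a" <= c <= "z" else c
            for c in ciphertext
        )
    return max(range(26), key=lambda n: weighted_score(decode(n)))


print(guess_shift_weighted("zhuh qrw vhqvlwlyh wr wkh"))  # 3
```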

answered May 19 at 20:06

Neil


Which ones does it get wrong with the improved 99.17% method?
– Simd
May 20 at 4:16
@Simd Mostly short ones that contain those rare consonants, such as "we could get these people", which it thinks is encoded.
– Neil
May 20 at 5:41
As many estonia n's as possible.
– Jonathan Allan
May 20 at 18:35
@JonathanAllan If only Charcoal had a) dictionary compression and b) it included Estonia...
– Neil
May 20 at 18:41


emanresu A · Accepted Answer · 2024-05-21 02:11:42Z

Vyxal , 10 bytes, 99.9%, score 10.01

 ‡ka*İ‡øDL∵

Try it Online!

Based on Jonathan Allan's idea of compressing the strings , this takes the rotation that's compressed the best in Vyxal's dictionary. Unlike Jelly, Vyxal has a string compression function øD built into the language.

 ‡---İ      # Collect all the unique results of
  ka*       # Ring translating the input by the lowercase alphabet
      ‡---∵ # Take the minimum by
       øDL  # Length of string when compressed with Vyxal's dictionary
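Vyxal's dictionary and compressor are not reproduced here. The following Python sketch only mimics the shape of the idea (choose the rotation with the smallest "compressed length") using a tiny made-up word list and a made-up cost function. `TOY_WORDS`, `toy_cost` and `best_rotation` are all hypothetical stand-ins.

```python
# Tiny illustrative word list; a real dictionary would be far larger.
TOY_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it", "that", "we", "were", "not"}


def toy_cost(text: str) -> int:
    """Stand-in for compressed length: 1 per known word, len + 1 otherwise."""
    return sum(1 if w in TOY_WORDS else len(w) + 1 for w in text.split())


def best_rotation(ciphertext: str) -> str:
    """Return the rotation of `ciphertext` with the smallest toy cost."""
    def rotate(k: int) -> str:
        return "".join(
            chr(97 + (ord(c) - 97 + k) % 26) if "a" <= c <= "z" else c
            for c in ciphertext
        )
    return min((rotate(k) for k in range(26)), key=toy_cost)


print(best_rotation("zhuh qrw wr wkh"))  # were not to the
```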

For the curious, the 10 failed testcases are:

 got 'vg gb enmr vg, ohg' expected 'it to raze it, but'
 got 'qh c "agu qt pq"' expected 'of a "yes or no"'
 got 'rpe xh bti pcs id' expected 'cap is met and to'
 got 'sio oj ni vy u' expected 'you up to be a'
 got 'ct gaiuuzwbu obr dwfo qm' expected 'of smuggling and pira cy'
 got 'ihy iz nby gyh ch nby' expected 'one of the men in the'
 got 'dnswpa, dnswfm, dnsxfnv, afek, vwfek, vgpens' expected 'schlep, schlub, schmuck, putz, klutz, kvetch'
 got "w'a cb am hift wt" expected "i'm on my turf if"
 got "h aoplm pz ohyzo, ildhyl" expected "a thief is harsh, beware"
 got "aol hesl vm aol dolls" expected "the axle of the wheel"

Most of these are a consequence of a) Vyxal not compressing two-letter words and b) Vyxal compressing a lot of common three-letter sequences.

Can confirm \$93.66\%\$ - the actual score may depend upon the order of the 26 translations as there are quite a few with multiple maxima, but you are using the same order as Neil. Have run a port to confirm that there are 634/10000 errors. — Jonathan Allan, May 20 at 20:52

Jonathan Allan · Accepted Answer · 2024-05-20 23:21:35Z


Jelly, 19 bytes, \$94.15\%\$; score \$=\frac{10000\times 19}{9415}\approx 20.18\$

 ØaṙJ,€¤y€ċⱮ“Ẉ²»S$ÞṪ

A monadic Link that accepts the encrypted text and yields its guess.

Try it online!

How?

Same approach as Neil's answer , except that:

the translation order is different
it uses antiheroes to (a) add the two next most common letters, r and h , and (b) double the importance of e over the others.
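The weighting described in the two points above can be sketched in Python. Counting each letter of `antiheroes` separately makes every e count twice. The translation order and tie-breaking of the Jelly link are not reproduced, and the function names are hypothetical.

```python
def antiheroes_score(candidate: str) -> int:
    """Count letters of `candidate`, once per appearance in "antiheroes"."""
    # "antiheroes" holds e twice, so every e in the candidate counts double.
    return sum(candidate.count(c) for c in "antiheroes")


def best_candidate(ciphertext: str) -> str:
    """Return the rotation of `ciphertext` with the highest score."""
    rotations = (
        "".join(
            chr(97 + (ord(c) - 97 + k) % 26) if "a" <= c <= "z" else c
            for c in ciphertext
        )
        for k in range(26)
    )
    return max(rotations, key=antiheroes_score)


print(best_candidate("zhuh qrw vhqvlwlyh wr wkh"))  # were not sensitive to the
```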

Here is some ungolfed Python code that gives 100% accuracy for the test battery (only first 500 shown due to time limit on TIO)

It works by finding the minimal length optimal compression of the \$26\$ transforms using Jelly's dictionary ( a Linux words file from Dennis' computer , split into short and long words). It has only been tweaked beyond looking for the shortest compressed string to avoid nine false results by disallowing eleven strings:

 " yt" " qi" " cn" " xc" " noy" " c'" " kc" " x " " paa" " ej " " wb"

edited May 20 at 23:21

answered May 20 at 21:09

Jonathan Allan


Crazy idea, but what if you concatenate the whole dictionary to get the true-ish distribution of letters?
– xnor
May 20 at 21:36
@xnor There are issues with that... The actual dictionary (and making a compressed string) is not available from within Jelly itself (unless maybe using some embedded Python which will be expensive). Jelly's dictionary has a lot of very non-English entries (so e.g. compressing each translation and seeing which is shorter is surprisingly bad! I tried that out in Python).
– Jonathan Allan
May 20 at 21:42
@xnor ^ I made a mistake with the compression test, it only makes 9 errors. Maybe one permutation of the 26 translations would make none?!
– Jonathan Allan
May 20 at 21:55
antiheroes also conveniently compresses to “Ẉ²» , which is shorter than any other relevant string I could find.
– Neil
May 23 at 0:11


Arnauld · Accepted Answer · 2024-05-20 09:50:42Z

JavaScript (ES6), 80 bytes / 0.9468 ≈ 84.50

Improved(*) by taking inspiration from Neil's approach .

Expects an array of ASCII codes.

 a=>(g=b=>i--?g(a.map(c=>n+=8920258>>(c-i)%26&c>>6,n=0)|n>b?(o=i,n):b):o)(0,i=26)

Try it online! (only the first 100 entries)

(*) Compared to my initial version, which was based on 2-character patterns.
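One way to read the constant in the 80-byte version: 8920258 is a bitmask with bit `code % 26` set for each letter of aeinorst, and `c>>6` is 1 for letters and 0 for spaces and punctuation. The Python sketch below follows that reading loosely. Its names and the handling of the no-hit case are assumptions, not part of the answer.

```python
from typing import Sequence

# Bit (ord(ch) % 26) is set for each letter of "aeinorst".
MASK = sum(1 << (ord(ch) % 26) for ch in "aeinorst")
assert MASK == 8920258


def guess_shift_bitmask(codes: Sequence[int]) -> int:
    """Return the shift whose decoding has the most letters flagged in MASK."""
    best_n, best_hits = 0, -1
    for n in range(25, -1, -1):  # count down from 25 to 0
        # (c >> 6) is 1 for lowercase letters, 0 for codes below 64.
        hits = sum(MASK >> ((c - n) % 26) & (c >> 6) for c in codes)
        if hits > best_hits:
            best_n, best_hits = n, hits
    return best_n


print(guess_shift_bitmask(list(b"zhuh qrw vhqvlwlyh wr wkh")))  # 3
```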

JavaScript (Node.js) , 48 bytes / 0.3269 ≈ 146.83

This is a short (and fast) one, showing that taking only the first two characters into account is enough to reach a success rate above 30%.

Expects an array of ASCII codes.

 ([x,y])=>(x+Buffer("BCFG?M;")[(y-x+78)%26%7])%26

Try it online! (only the first 100 entries)
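The 48-byte lookup can be unpacked as in this sketch. The gap between the first two character codes, reduced mod 26 and then mod 7, selects one byte of the table, and adding that byte to the first code gives the guessed shift mod 26. The Python names are hypothetical; the table bytes are copied from the answer.

```python
TABLE = b"BCFG?M;"  # the 7-byte lookup string from the answer


def guess_from_first_two(x: int, y: int) -> int:
    """Guess the shift from the ASCII codes of the first two characters."""
    index = (y - x + 78) % 26 % 7  # letter gap mod 26, folded into 7 slots
    return (x + TABLE[index]) % 26


print(guess_from_first_two(ord("w"), ord("k")))  # 3, as "wk" is "th" shifted by 3
```

Under this reading, each table byte effectively encodes a guess for the first plaintext letter, which is why texts starting with common pairs such as "th" decode correctly.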

65 bytes / 0.4683 ≈ 138.80

A better success rate and overall better score can be achieved with a longer lookup string. But this somewhat defeats the purpose of the above version, which was to have the smallest possible valid code.

 ([x,y])=>(x+Buffer("G8F:?M=4?6M?9;BF=G?4;BC49G")[(y-x+78)%26])%26

Try it online!

Nicola Sap · Accepted Answer · 2024-05-21 14:11:20Z


Python 3 , 131 bytes / 0.9464 = 138.41927303465764

-34 bytes thanks to Nicola Sap

-2 bytes thanks to ShadowRanger

 lambda x:max((''.join(chr(97+(ord(u)-97+i)%26)*u.isalpha()for u in x)for i in range(26)),key=lambda x:sum(map(x.count,'etoanirs')))

This answer is a lot worse than other answers because there's not a convenient way in Python to count multiple substrings, so I have to use a for loop to do so.

The lambda enumerates the offsets, scores each candidate string by the number of occurrences of the letters etoanirs in it, and returns the best-scoring decrypted string (alphabetic characters only).
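An ungolfed reading of the lambda, for clarity. The function name `decode_best` is made up; the behavior is intended to match the golfed code, including dropping non-alphabetic characters.

```python
def decode_best(ciphertext: str) -> str:
    """Return the letters-only rotation with the most 'etoanirs' letters."""
    candidates = []
    for i in range(26):
        # Shift letters forward by i and drop everything else.
        candidates.append(
            "".join(chr(97 + (ord(u) - 97 + i) % 26) for u in ciphertext if u.isalpha())
        )
    return max(candidates, key=lambda s: sum(map(s.count, "etoanirs")))


print(decode_best("zhuh qrw vhqvlwlyh wr wkh"))  # werenotsensitivetothe
```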

Tell me if you can improve this.

Try it online!

edited May 21 at 14:11

Nicola Sap


answered May 20 at 13:51

None1


Note that per the rules you don't need to output nonalpha characters, and your algorithm doesn't need them either. So the k lambda can just be (... for i in x if i.isalpha()) rather than (...if i.isalpha()else i for i in x) (saves 5)
– Nicola Sap
May 20 at 14:04
And saves an extra 2 if you index by the boolean: (...[:i.isalpha()]for i in x)
– Nicola Sap
May 20 at 14:11
Do you really need to define k as a lambda? It should work fine if the "".join() was placed directly at the point where you use it. At which point your other ( l ) lambda would also not need explicit naming (anonymous functions are permitted, and a `f=` header is generally acceptable for Python lambda answers)
– Nicola Sap
May 20 at 14:20
Last one: sum(map(x.count,'etoanirs')) should work. All in all, I think this algorithm codes in 133 bytes. I haven't checked its accuracy but it should be totally equivalent code.
– Nicola Sap
May 20 at 14:25
@NicolaSap: Even shorter, ...[:i.isalpha()] can just be i.isalpha()*... (or ...*i.isalpha() , doesn't matter here), when isalpha returns False , the string is multiplied by zero and eliminated, when True , multiplied by one and kept unchanged; costs 1 for * , rather than 3 for [:] .
– ShadowRanger
May 20 at 19:35
