Skip to main content
The National Cipher Challenge

Help with hillclimb

  • This topic has 5 replies, 3 voices, and was last updated 1 month ago by TheCipherCrackers.
Viewing 11 posts - 1 through 11 (of 11 total)
  • Author
    Posts
  • #98727
    TheCipherCrackers
    Participant

    We’re trying to build a hill climb algorithm to help us obtain the key for substitution ciphers. We’ve implemented the algorithm as defined in the boss guide but we’re running into an issue where the algorithm “losses its way” because the fitness of a plaintext deciphered with a key which has only one character different from the correct key is too far away (e.g. fitness = -33.18) from the fitness = -9.62 we see if we decrypt with the correct key. So the algorithm “doesn’t realise” it’s almost cracked the key and goes off exploring other random keys rather than homing in on the correct key. Are we missing something about when and where we can successfully use the hill climb technique? Fitnesses quoted are based on tetragrams.

    And if this is allowed, here’s a great web page where you can find the tetragram frequencies (http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/english-letter-frequencies/). We originally tried to calculate these ourselves but ran into issues with text encodings when we were trying to find appropriate English texts to feed in and we didn’t want to spend time working that issue out so we fell back to finding the tetragram frequencies online (I hope that’s not considered cheating!)

    It is a good resource and we think it is reasonable given the difficulty of constructing your own corpus, though we would definitely give additional credit for anyone who does manage that. Anyway, we have added a link to that page in our Official Boss Deciphering Tools page to make it clear that this is allowed for everyone.

    #98746
    Puzzling_Pelican
    Participant

    If you don’t mind me asking, how are you using the tetragram frequencies to score the decrypt as it is highly unusual for 1 character difference to change the score that significantly?

    Usually each tetragram has a “score” which is added to a decrypt’s total for every time it occurs in the text e.g. “CIPHER” has a score of score(“CIPH”) + score(“IPHE”) + score(“PHER”).

    The score for a single tetragram is generated based of how frequently the tetragram is expected to occur in English (or you can just be lazy and use the number of times it occurred in a large corpus e.g. the plaintexts of the past few years of cipher challenge).

    Alternatively, ensure the “solve substitution” function is working as intended as if it is wrong that might explain how a slight change in key results in a text with wildly varying scores.

    P.S. Harry, if you think this is too much help feel free to cut parts out however I believe writing code to solve the cipher is a lot of fun and it would be a shame to see someone give up.

    #98748
    TheCipherCrackers
    Participant

    It’s not one character difference in the output but rather one character change in the substitution cipher key causing a big difference in the fitness of the decrypts. We can see that the substitution decrypt is correct as we’ve verified it against the output we get using the BOSS tool.

    The issue seems to be that the hill climb tries to change only one character at a time and if it doesn’t find the right letter in (say) position 1 of the key on its first go round (whilst all the other letters in the randomly generated key guess are still wrong) then it doesn’t incrementally start moving towards the correct key. In particular we have seen that when the key guess has duplicate letters (which get removed when decrypting and so ‘shift’ a lot of the other letter mappings) we get very different decrypts from relatively similar keys, which seems to put the hill climb off the right scent! We’re using Challenge 4B to verify this is working so we know what the right key is and what the correct decrypt is.

    #98749
    _madness_
    Participant

    There is an article by Jakobsen in Cryptologia in 1995 about doing just this. He uses digrams, which is a bad choice with modern computers.

    As for my recommendations, you can try my book and work up through Unit 36. There is a link in there for a corpus that has only ASCII characters, and there are formulae for fitness and stuff. There is also a dark yellow book on substitution ciphers (link in the BOSS library) with pseudocode and everything.

    As for your post, a fitness of -33 sounds outrageous if you are using logs of frequencies. I suggest looking for coding bugs.

    Can you post the ciphertext you are attacking? And the two keys that give -33 and -9?

    #98751
    TheCipherCrackers
    Participant

    Maybe this is a clearer way of explaining the issue we’re seeing. Say we’ve guessed at a key length of 6.

    W = wrong key letter
    R = right key letter

    If we start the algorithm with a starting key of “WRRRRR” then the hill climb algorithm cycles through the alphabet and eventually finds the correct first letter of the key, sees the resulting fitness is -9.6 and we’re done.

    But…if we start with “RWRRRR” the algorithm starts by cycling through the alphabet for position 1 and finds that the fitness of the decrypt using “WWRRRR” is best seen and it then never manages to get back to locating “RRRRRR”

    We think we somehow need to let the algorithm randomly move than one letter if it gets into a rut to get it moving in the right direction again?

    #98752
    TheCipherCrackers
    Participant

    Ciphertext from challenge 4B
    Fitness using tetragrams and the log formula found in https://www.cipherchallenge.org/wp-content/uploads/2022/09/Two-attacks-on-tableau-based-ciphers.pdf

    Key 1 = CUSTOM – fitness = -9.62471350522921
    key 2 = CUSTOP – fitness = -33.18946108835086

    #98755
    _madness_
    Participant

    Well, 4B is a monoalphabetic substitution. Key-length is a feature of Vigenère cipher and similar. And the hill-climbing algorithm that you are vaguely describing sounds vaguely like it vaguely applies to Vigenère.

    #98756
    Puzzling_Pelican
    Participant

    Thank you for clarifying this. The shift in part of a substitution key would explain why this happens.

    My suggestion would be to start with a key of “ABCDEFGHIJKLMNOPQRSTUVWXYZ” and then keep changing / mutating the key by randomly switching 2 letters. If the decrypt is better, that is the new key and then keep repeating the process, otherwise discard the new key and keep using the previous one. If it gets stuck, you can add a 5% chance it picks a worse key if the score is within a threshold of around 85% of the best (you can play with these values until it works).

    This way the scoring function is able to clearly show when a key is close: in 2 decrypts, with 2 letters in the substitution key being switched, the scores should be very similar with one being slightly higher.

    Additionally, once this is working and you want to speed it up, you can give it a “boost” by not starting with “AB..” but using letter frequencies to guess a much better start key resulting in a smaller part of the hill left to climb (the fastest I have got my scripts is 70ms for 4B).

    Good luck!

    #98757
    TheCipherCrackers
    Participant

    Thinking about it we think the -33 is so large just because we made our epsilon very small and one or more tetragrams in the decrypt must be missing in the list of tetragrams we’ve used (probably correctly)

    #98758
    RickOShea
    Participant

    Just a thought, or more perhaps a question… For the more advanced challenges, where white space has been removed, The normal gram (bigram, trigram etc.) frequencies, from plain English text, will be skewed by the ‘false’ letter groups generated from adjacent words. Do you produce your own counts, from text with white space removed?

    My hill climbing method wasn’t very successful, but having previously used a Nelder–Mead downhill simplex method for minimum location in multidimensional space, we’d re-run the method with randomized starting conditions to weedle out local minima.

    Nelder Mead method

    #98784
    upsidedown
    Participant

    > Just a thought, or more perhaps a question… For the more advanced challenges, where white space has been removed, The normal gram (bigram, trigram etc.) frequencies, from plain English text, will be skewed by the ‘false’ letter groups generated from adjacent words. Do you produce your own counts, from text with white space removed?

    I would expect poor(er) performance if you are using (bi|tri)grams that are not skewed by the “false” letter groups. Plus, if you never use letter groups from adjacent words, you miss out all words that have fewer characters than your n-gram.

    I always match my corpus to whatever I expect the plaintext of the cipher I am attacking to be. So if there are spaces, or uppercase and lowercase letters, in the ciphertext, I will transform my corpus to include (only) those when computing log probabilities. For example:

    – In 4B, there is just one letter case and no meaningful spaces, so I use 26 letters for scoring.
    – In madness’ interrupted vigenere cipher, there are no spaces in the ciphertext but there are spaces in the plaintext, so I use 26 letters and a space to score plaintexts.
    – Sometimes you get uppercase/lowercase letters in the ciphertext which map to the same case in the plaintext, and you can distinguish between the cases when you score the plaintext (good for names & places from a custom corpus). The same applies any other information that’s left in the ciphertext, like apostrophes, commas, full stops, etc.

    You can also make the distribution of your corpus closer to the letter distribution of cipher challenge plaintexts by including the past plaintexts in your corpus.

Viewing 11 posts - 1 through 11 (of 11 total)
  • You must be logged in to reply to this topic.
Report a problem