

Curious here: is this Base64? And is what's behind it usually an image or text? And do you need to do OCR to get the characters?
Maybe for the text it could use a dictionary to sanity-check whether that zero is actually a letter O, and so on?
I'm curious to know what the challenge is and what your approach is.
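
To make the dictionary idea concrete, here's a rough sketch of what I have in mind. It's just my guess at an approach, not anything you've described: the confusion table, `candidates` and `resolve` are all made-up names, and it only helps when the OCR'd characters spell natural-language words (it wouldn't help on the raw Base64 characters themselves).

```python
from itertools import product

# Hypothetical confusion sets: for each character OCR tends to misread,
# the lowercase letters/digits it might really be. Purely illustrative.
CONFUSABLE = {
    "0": "0o",
    "o": "o0",
    "1": "1li",
    "l": "l1i",
    "i": "i1l",
}

def candidates(token):
    """Yield every spelling of token with confusable characters swapped."""
    options = [CONFUSABLE.get(ch, ch) for ch in token.lower()]
    return ("".join(combo) for combo in product(*options))

def resolve(token, dictionary):
    """Return the single dictionary word the token could be, else None."""
    matches = {c for c in candidates(token) if c in dictionary}
    # Only accept the correction when it is unambiguous.
    return matches.pop() if len(matches) == 1 else None

words = {"hello", "world"}  # stand-in for a real word list
print(resolve("w0rld", words))
print(resolve("he11o", words))
```

Note it lowercases everything and gives up (returns `None`) when zero or several dictionary words fit, so an ambiguous token is left for a human instead of being "corrected" wrongly.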






Ah yes, PDF is a clusterfuck where almost anything is valid, I think, so there's minimal redundancy to check against.
Text and image formats are way more lenient and are full of redundancy.