
How To Crack Captchas

June 5th, 2007

This page will teach you how to write a not-necessarily-very-good programme to beat some common captchas, but it will not provide any useful code to do so for you. It should give you an idea how to go about defeating captchas not listed here. But mostly, I hope it will be instructive for anyone who wants to write a less easily defeated captcha in the future, since apparently you’re all hopeless at it at the moment.

As everyone in the world knows by now, most websites and forums use “captchas” to try and stop computer programmes from posting fake comments containing adverts. “Captcha” stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. And as everyone in the world ought to have realised by now, they don’t work.

There exist a number of ways around them. The most cunning and most effective, although the most difficult to set up, is to build a pornographic website and get real humans to solve the captchas for you in exchange for naked pictures.

But mostly, they’re easy to get around because they’re shit. This, for example, is the default captcha that comes with the now obsolete phpbb2:

phpbb2 Captcha

That is very easy to solve. (It should perhaps be pointed out at this stage that my job is in large part to extract shy information from images.) As with all the algorithms I’ll show you, this is the first and simplest one I could come up with, and it’s only the start. In all cases I will extract a binary mask of the letters for transferal to a more general OCR system. Also in all cases, I will use Matlab 6 to perform the analysis.

Here is the code required to make this captcha machine readable:

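function bank=solvefuzzy(bank)

%bank is assumed to be an x-by-y-by-3-by-n stack of n RGB captchas,
%with pixel values scaled to the range 0..1.
[x y c n]=size(bank);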

%First, greyscale the lot by taking the red channel
bank=bank(:,:,1,:);

%Now blur it slightly.
for (i=1:n)
bank(:,:,1,i)=filter2(ones([3 3])/9, bank(:,:,1,i));
end

%Now threshold it.
bank=(bank<0.63);

%Now trim the borders.
bank(1:x,[1 y],:,:)=0;
bank([1 x],1:y,:,:)=0;

Here is the result of this algorithm on four example captchas:

phpbb2 cracked

Now that wasn’t hard. And I know that the letter shapes aren’t ideal, but it’s a very uniform font they use, and the letters aren’t rotated, so it’s as easy as pie to extract the characters from this mask.
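One way that extraction step might start, assuming the letters in the mask never touch one another: look for empty columns and cut the mask wherever one appears. The toy mask and the variable names below are made up for illustration, so treat this as a sketch rather than the routine actually used here.

```matlab
% Toy binary mask standing in for one thresholded captcha (1 = letter pixel).
% Three "letters" separated by empty columns; the values are invented.
mask = [0 1 1 0 0 1 0 0 1 1 1 0;
        0 1 0 0 0 1 0 0 1 0 1 0;
        0 1 1 0 0 1 0 0 1 1 1 0];

% A column belongs to a letter if any pixel in it is set.
occupied = any(mask, 1);

% Rising and falling edges of the occupied/empty profile mark letter bounds.
edges = diff([0 occupied 0]);
firstcol = find(edges == 1);
lastcol  = find(edges == -1) - 1;

% Cut each letter out into its own cell, ready for an OCR stage.
letters = cell(1, numel(firstcol));
for k = 1:numel(firstcol)
    letters{k} = mask(:, firstcol(k):lastcol(k));
end
```

Each cell of `letters` then holds one character, which can be compared against templates of the font. If the blur leaves two letters joined, this simple cut will return them as a single piece.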

Of course, cracking obsolete captchas isn’t terribly useful, or wouldn’t be if anyone bothered to update their forums, so here’s a captcha from phpbb3:

phpbb3 classy

The first thing you should notice about this is that it’s gorgeous. But let’s have a look at the code required to break it:

function bank=solveclassy(bank)

%First, greyscale the whole thing
bank=mean(bank,3);

%Now threshold it.
bank=(bank<0.55);

That’s two commands. Now, let’s check the results.

phpbb3 classy cracked

So that’s done it perfectly, with no real image processing at all. Of course, this is a little unfair to the new captcha: it is in reality much tougher than phpbb2’s, because it uses many fonts and angles and sizes. So some kind of cunning would be required to turn these shapes back into characters, but that’s really nothing OCR software doesn’t do as a matter of course. And if you were to take each character individually (even if they overlap, the colours in the original image would distinguish them) and perform, say, a Radon transform on them about their centroids, this would give you a distinctive pattern for each letter and number in each font. The point is that from here it is entirely possible to crack this captcha. Besides which, phpbb is open source, so training the process by Radon transforming all possible characters would be fairly simple.
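To give a rough idea of what such a signature looks like, here is a minimal sketch. In Matlab the Image Processing Toolbox function radon would be the natural tool; the version below is a coarse stand-in that only uses base functions, projecting the set pixels onto a line through the centroid at each angle and counting how many land in each one-pixel bin. The toy character, the angle step and all variable names are assumptions made for the example.

```matlab
% Toy binary character: a vertical bar in a 7x7 image (invented data).
mask = zeros(7);
mask(2:6, 4) = 1;

% Centroid of the set pixels, so the signature ignores where the letter sits.
[r, c] = find(mask);
cy = mean(r);
cx = mean(c);

angles = 0:10:170;                        % projection angles in degrees
maxr = ceil(sqrt(sum(size(mask).^2)));    % no pixel lies further out than this
signature = zeros(2*maxr + 1, numel(angles));

for k = 1:numel(angles)
    t = angles(k) * pi / 180;
    % Signed distance of every set pixel from the centroid along direction t.
    p = round((c - cx) * cos(t) + (r - cy) * sin(t));
    % Count the pixels landing in each distance bin: one projection.
    signature(:, k) = accumarray(p + maxr + 1, 1, [2*maxr + 1, 1]);
end
```

Each column of `signature` is one projection. Training would mean storing such a matrix for every character in every font and matching an unknown character against them, allowing for a shift along the angle axis to cope with rotation.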

But phpbb3 has another trick up its sleeve: a second type of captcha. This is the one used on www.phpbb3.com’s forum, so presumably they trust it:

phpbb3 funky

Aside from the fact that it looks like something from a bad Spectrum game, the first thing you should notice about this is that it has clearly been designed to be almost impossible to crack, by someone who knows nothing about cracking captchas. For example, the uniform background colour. Well over 90% of the image is a single colour and every pixel of every letter is that colour. Then, the individual letters are outlined in distinct and uniform colours, and then, as if that wasn’t helpful enough for crackers, the characters are made up of little square elements (which I will call ‘charels’), so we can reconstruct the characters down to the pixel, generally the right way up. It really couldn’t be more helpful if it tried.

So let’s see some code.

function result=solvefunky(bank)

[x y c n]=size(bank);
result=zeros([x,y,1,n]);

%First, determine the background colour.
%We assume the first colour with five continuous pixels of itself
%along the first row is background.

background=zeros(3,n);
for (i=1:n)
    colour=[0 0 0];
    j=1;
    count=0;
    %Stop at a run of five, or at the end of the row if there is none.
    while ((count<5) && (j<=y))
        if ((bank(1, j, 1, i)==colour(1)) && ...
            (bank(1, j, 2, i)==colour(2)) && ...
            (bank(1, j, 3, i)==colour(3)))
            count=count+1;
        else
            %A different colour: remember it and restart the run here.
            colour(:)=bank(1,j,:,i);
            count=1;
        end
        j=j+1;
    end
    background(:,i)=colour(:);

    %Next, find areas that are that colour
    backgroundareas(:,:,i)=((bank(:,:,1,i)==background(1,i)) & ...
        (bank(:,:,2,i)==background(2,i)) & ...
        (bank(:,:,3,i)==background(3,i)));

    %Now, find areas of that colour smaller than 15 pixels
    temp=bwlabel(backgroundareas(:,:,i), 4);
    small=zeros([x,y]);
    numberofregions=max(temp(:));
    for (region=1:numberofregions)
        pixels=sum(temp(:)==region);
        if (pixels<15)
            thisarea=temp==region;
            %This leaves a lot of bits which aren't real, but we know from
            %looking at the captcha that the letters are outlined in just
            %one colour, so let's eliminate anything that's got more than
            %one colour adjacent to it. (In fact, we allow one pixel of a
            %different colour as this works better.)

            adjacentpixels=(imdilate(thisarea, [0 1 0;1 1 1;0 1 0])&~thisarea);
            red=bank(:,:,1,i); green=bank(:,:,2,i); blue=bank(:,:,3,i);
            ar=red(adjacentpixels); ag=green(adjacentpixels); ab=blue(adjacentpixels);
            if ((sum((ar~=ar(1)))<2) && ...
                (sum((ag~=ag(1)))<2) && ...
                (sum((ab~=ab(1)))<2))
                small=small|thisarea;
            end
        end
    end
    result(:,:,1,i)=small(:,:);

end

That results in this rather pleasing image:

phpbb3 funky cracked

Now all I’ve done here is to find the pixels that are part of the larger “charels” which comprise the message. There’s still a lot of work to be done to find the message in ASCII format, but it can be done. You can separate the individual characters by recourse to the original image: each one is outlined in a distinctive colour, and if a colour is reused then it probably isn’t reused in adjacent characters, so a simple contiguity test will catch it. You can reorient and rescale each character by picking a charel arbitrarily and seeing where its neighbours lie relative to it, using the provided coloured outline as a guide and the ‘shadow’ colour to define the vertical and/or horizontal axes. This will allow you to build up a reoriented image of each letter, which can be easily checked against the known font to see which it most closely matches.

Those of you who know me should already have worked out that this took me less than one evening, including grabbing all the pictures and writing this entry. I wouldn’t bother if it was going to take longer. You know that. So if you employed a good programmer for a week to crack such a captcha you ought to be able to finish the job off. Then you’d have access to every phpbb3 forum out there.

Clearly there are false positives and things in these processed images: the bottom one in particular has a large false positive in the Z, and the last H has a bit missing where the L overlapped it. I don’t think either of these would actually affect a good OCR algorithm (given that said algorithm would have the font used built into it and have an ideally oriented and scaled image of the letters, albeit with the odd mistake), and even if it did, well, we cracked the other three. If we assume we can crack 75% of these captchas, then we can break into a forum which allows us 5 attempts (which is pretty standard) 99.9% of the time.
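The arithmetic behind that last figure, assuming each attempt is an independent 75% shot (a simplification, since a routine that fails on one image may well fail on similar ones):

```matlab
p = 0.75;        % assumed chance of cracking any single captcha
attempts = 5;    % tries the forum allows before locking us out

% The attack only fails if every one of the attempts fails.
pfail = (1 - p)^attempts;
psuccess = 1 - pfail;

fprintf('Chance of getting in within %d attempts: %.1f%%\n', attempts, 100*psuccess);
```

The chance of five straight failures is 0.25^5, a little under one in a thousand, which is where the 99.9% comes from.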

phpbb3 also allows the user an almost ludicrous number of options for their captcha. This is good, as it means that a cracker will have a harder time beating the captcha in the general case. But in the specific case of the default settings, which almost everyone will use, this won’t help at all.

So what’s the solution? Personally, I use a bespoke text-based captcha. Image-based ones are hard to programme, which isn’t a problem if you’re doing something like phpbb, because it has to be hard to crack (oh dear), and text-based ones really aren’t. Another problem with image-based solutions is that some devices or people can’t read them, so there usually has to be a fallback, and then you have two links, of which a cracker need only outsmart the weakest. (Sorry for the mixed metaphor there.) I think bespoke text-based is good because there’s no real motivation for a cracker to devote any time to cracking it, as they’ll only get access to my websites, and if they do I can very easily change it the following evening. But it couldn’t work for phpbb as you can’t make a bespoke captcha for every user.

Some captchas are obfuscated further than these. Sometimes this is a simple case of drawing lines over and through the text. This is pretty easy to beat — any good photo touching-up software has had this feature since the week after flatbed scanners were invented, and replicating it is not hard, even when the lines must be found automatically. A better solution is to deform the letters themselves, though this involves a very direct tradeoff: anything you do that makes letter shapes harder for a computer to identify will have the same effect for your legitimate users. Again, I would attack such a captcha by not attempting to restore the original image, but by developing an algorithm to characterise each… well, character based on its Euler number, the number of sharp corners in its outline and their relative locations, and maybe the Euler number of the shape you get if you dilate it a bit. I believe this could crack such a captcha with minimal training.
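As an illustration of the first of those features: the Euler number of a binary shape is its number of objects minus its number of holes, so an “O” scores 0, a “B” scores -1 and an “L” scores 1, however the letter has been bent. With the Image Processing Toolbox, bweuler would do the job; the sketch below instead counts 2x2 pixel patterns (the bit-quad formula for 8-connected shapes) so that it runs on base functions alone. The toy shapes and variable names are invented for the example.

```matlab
% Toy letter shapes (invented data): a ring like an "O", a two-hole "B",
% and a solid block with no holes.
O = [1 1 1;
     1 0 1;
     1 1 1];
B = [1 1 1;
     1 0 1;
     1 1 1;
     1 0 1;
     1 1 1];
shapes = {O, B, ones(3)};
eulernumbers = zeros(1, numel(shapes));

for k = 1:numel(shapes)
    bw = shapes{k};
    % Pad with background so windows along the border are counted too.
    padded = zeros(size(bw) + 2);
    padded(2:end-1, 2:end-1) = bw;

    % Number of set pixels in every 2x2 window.
    s = conv2(padded, ones(2), 'valid');
    % +2 or -2 exactly where the two set pixels sit on a diagonal.
    d = conv2(padded, [1 -1; -1 1], 'valid');

    q1 = sum(s(:) == 1);          % windows with one pixel set
    q3 = sum(s(:) == 3);          % windows with three pixels set
    qd = sum(abs(d(:)) == 2);     % windows holding a diagonal pair

    % Bit-quad formula for 8-connected objects: objects minus holes.
    eulernumbers(k) = (q1 - q3 - 2*qd) / 4;
end
```

On its own the Euler number only sorts characters into three or so groups, which is why it would be combined with the corner counts and the dilated-shape Euler number mentioned above.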

Theoretically, human authentication is the best way, but humans aren’t apparently very good at that. It’s not always apparent from a name and an email address if a user is a human or a spambot. My proposed solution is a deliberately impossible captcha: you find or create an image, possibly of random abstract ‘art’, or a landscape, or a sort of randomly generated Rorschach ink-blot test, and ask the user for a vague, one sentence description. Then a human would authenticate the user’s account by seeing if the user’s description of the image relates to that image in any way. It’d be a little subjective, but I really can’t see it being cracked, except perhaps by Derren Brown concocting a sentence that would appear to describe any image. And people would learn to spot that sentence. It would still be susceptible to the porn crack, but then everything is, and honestly I think it’d be fairly easy to tell which descriptions of Rorschach ink-blots had come from the minds of teenage boys looking for naked pictures with a pretty high degree of certainty.

Plus, I think it’d offer a fascinating glimpse into the psyche of all prospective users of your forum.

You can download all the above code, and some general making-it-work gubbins, in the Code Factory, but you’ll need Matlab to make it work, and you’ll need the Image Processing Toolbox to make it run. If anyone wants to extend the code, do feel free. Complete code is available if you want it, though: people sell it to prospective spammers.


24 Responses to “How To Crack Captchas”

  1. SupSuper Says:

    The first phpBB3 captcha and all the info in that article about the phpBB3 Admin Panel are from “Beta 1” and are completely obsolete since the final (well, RC1 at least) is completely different. The captcha used on http://www.phpbb.com is the final captcha, and the final Admin Panel has an option to add random “same color as the letters” lines to obfuscate it.


  2. Andrew Says:

    Yeah, I’ve not downloaded phpbb3 to check up on it, as I don’t really have any use for it any more, but I think those ‘random “same color as the letters” lines’ would do little to stop the routine outlined above. They might generate some false positive areas, but they’d mostly be the wrong size and shape so it wouldn’t take much morphology to identify and discard them.

    It’s a clever captcha in a sense: it looks so utterly unlike other captchas that most generic cracking programmes would probably not work. But it’s going to be so widely used by almost everyone who installs phpbb3 that it’ll be worth writing routines specifically to crack it.

    I think they’d be better off making it very easy for users to create and install their own captchas. Granted, a lot of users wouldn’t know how to do that, but if enough do it’ll become far less worthwhile cracking the default one, and at least it’ll mean some people get a secure forum.


  3. Yoji Says:

    The excellent article is made better with the little test on your own reply form.
    Simple and effective.


  4. me Says:

    Is there a package for C#, java that can do this matlab trick?


  5. Andrew Says:

    I’ve not found one. There are a few image processing libraries for C#, although most of them are pricey. Most of the functions I’ve called on this page are simple enough to write, although an efficient version of bwlabel would be a pain to code: you’d want to start with a flood fill and build it from there. Simple flood fill routines are very inefficient, though. A good one is a scanline fill, which is not too complex but usually very fast.

    I don’t know anything much about Java programming, though. I expect there’s something out there.


  6. nick Says:

    Hi,

    Do you crack yahoo captcha in any of your articles? Do you know where I can find it?

    Thanks,

    Nick


  7. Andrew Says:

    No. I never use Yahoo, so I’ve not even seen their captcha. The only ones I ever see these days are on Blogger.


  8. SupSuper Says:

    Apparently MediaWiki-sites now have a captcha for when someone tries to put external links in an entry.


  9. Jamie Says:

    There’s always KittenAuth.
    http://www.thepcspy.com/contact


  10. jimmy Says:

    To begin, I must say that the author fancies himself a bit more of an intellectual than he actually is. Allow me to elaborate on his opinions if I may.

    Computer Vision is (and has been for some time) quite advanced. Trust me when I say that if Computer Vision is powerful enough to race autonomous robots through unknown terrain, navigate cars through a city on their own, take facial fingerprints (not images, but measurements) of every single person entering the stadium for the Superbowl in 2001, it can be used to crack a stupid captcha image.

    The problem is that you are focusing on each individual captcha that you are cracking, and engineering a crack based on that particular image. So what? Any green-horn should be able to do that, and if they can’t should not even be trying. Why wouldn’t you push toward designing a program to read ANY image captcha that it encounters? You may have to put aside your trusty old Matlab for that one, and no, I won’t send you any code.

    The rant about an administrator approving each and every attempt to register a user is about as brute force of a solution as I can think of. There are effective captchas out there, some of which I have written that have never been cracked a single time and have been up on forums, guestbooks, blogs, etc… for years with thousands of visitors a day. Don’t claim that you have accomplished something because you have cracked phpBB’s image captcha, which, by the way, is not cracking anything at all, since it is open source to begin with.

    Maybe you should submit your solution “entering two letter q’s” to open source forums so they can benefit from this knowledge.

    If I was an asshole, I’d redirect the spam traffic that I get on all the websites I’ve written and maintain to here; they’d have you cracked in a day or so and fill you up so full of crap you’d actually have some content on the site.

    Sorry if I sound mean, I actually spit out my coffee when I read the letter ‘q’ thing because I was laughing so hard. You made my day man. Thanks.


  11. Andrew Says:

    No, that’s kind of my whole point: “any green-horn” (whatever that might be) should indeed be able to do all of the above, because it’s really quite easy. Computer Vision, and all its high-end algorithms, is all well and good and doing amazing things, but it’s well outwith the reach of the average person.

    A general algorithm would be more interesting, yes, but it would be a major investment of time and I’d be the wrong person to do it. My point was just that I could spend a week or so, crack phpBB’s captchas, and then spam all the phpBB forums in the world, which to my mind is a far greater weakness than a susceptibility to advanced computer vision algorithms, because as you pointed out there are so many people who could do what I just did. A few of them are bound to do it.

    On the subject of my own captcha, I think to be fair you’ve underestimated it. It actually was broken once, but not by “entering two letter q’s”. That’s just one of five or six questions that loop around the comments form, which are all simple tasks involving maths and/or moving letters around, and about a year ago, a spambot learned to add two numbers (which is a common question in captchas anyway). All I did was load up the captcha file and delete that question (leaving four or five others). That secured the site again without blocking real comments, and later that day I replaced it with a new question so no harm was done. Yes, the questions are weak but the system that surrounds them is much more robust. I could put in a conventional image-based captcha if I wanted, although that would reduce accessibility, so for the amount of traffic I have now it would probably do more harm than good.

    But that’s the only time I’ve had any spam (other than pingbacks, which are so problematic they’ve all but shut down Technorati) since I installed it, whereas phpBB’s captchas are broken daily even on low-traffic forums. Okay, so the smarter spammers you enjoy might learn to enter two “q’s” in a day, but I can remove that question that same day. I expect an automated system could remove it in a moment the first or second time WordPress detected a “spammy” comment getting through. Unless they can reverse-md5 long and meaningless strings, they’ve got nothing long-term but cracking individual questions as they appear.


  12. jimmy Says:

    Unfortunately, most boards, particularly those that are freely available, suffer from fundamental downfalls in their captcha methods.

    Firstly, you CANNOT allow the user to see the relationship between the captcha question or image and its solution. You’d be surprised at how often this is done. For example, the md5 hashing of the question that you provide on this site does just this (the md5 “answerhash” is embedded in the form code and thus visible to any HTML parser). Once one solution is calculated (which is easy in your case), it can be applied since they know the relationship between the hashing and the solution, even though it changes. Banning ip addresses wouldn’t work either because spam bots typically work in groups to avoid this, and you don’t want to ban a potential website visitor because they’ve slipped typing in the answer. Once spammers get it, they share the methods with other spammers so that everyone may enjoy the security hole.

    Secondly, you MUST generate a unique captcha every time. A question or image should never come up twice, no matter what. It is pretty clear why this is important. This is typically why text verification is rarely used, at least by itself.

    Thirdly, a good, foolproof, changing, and unique captcha image needs to be developed. For example, you may have an external chunk of php code using gd to generate an image, generate an id for this image, and put both the id and the solution in a database somewhere unbeknownst to the user. Then upon submission of something, check the id against the solution in the database, maybe have a script on the db return either true or false, and automatically and immediately delete that entry from the database.

    I’ve always been interested in seeing how easy captcha images are to crack, and maybe even writing some code myself to do so. You should put a small section up on your site where users can submit a link to a chunk of code somewhere that outputs an image, and see how long it takes to develop a crack for it (one that can be applied to any image that is generated, and computes the correct solution 100% of the time). I firmly believe that with the proper linking of the characters with lines, and the correct usage of colors, fonts, scaling, rotation, overlapping, etc…that captchas are extremely effective.

    In all fairness, you’ve attacked weak captchas in your post above. There are some good ones!


  13. Andrew Says:

    I have. What I did, really, was to see the captcha for phpBB3, think “that’s rubbish, I bet I can crack it in a day” and attack it to see if I could. To be honest, the captcha for phpBB is fairly irrelevant anyway, given how easy it is for even the dimmest script-kiddies to gain access to the admin panel and turn the index page into a billboard.

    The “answerhash” isn’t just the md5 of the answer, by the way. It’s salted with a site ID and some material unique to the page it’s on, so a stored answer would only work on one page. If I have to change it again I’ll add to that a question ID so that when a question is retired all hashes associated with it are retired as well. It’d mean checking five hashes instead of one, but that’s okay.

    With image-based ones, the gap between “what computers can’t do” and “what humans can do” is closing fast, and probably closed long ago if you consider humans with any kind of disabilities. I don’t even think captchas on individual sites are the right approach, long term. We need to stop the spam being sent in the first place — the web traffic it generates is problematic and expensive enough and the tests to block it are the antithesis of most modern interface design principles.

    (Personally, I think we need to start identifying the people whose computers are spam-sending zombies and taking away their broadband. They can’t be trusted with it.)


  14. Andrew Says:

    This is how all my opinions seem to work; I start out with a gut instinct, and tell people until I happen across one who knows what they’re talking about, and after a few exchanges I have a far better justified position.

    I’d like to think that means I’m open to correction, but I usually seem to end up with roughly the same opinion I had before but for better reasons so either my instincts are fantastic or I can justify any irrational prejudice. I don’t really know how to tell the two apart. I usually enjoy the process, though, so I don’t worry about it much.


  15. john Says:

    Well, I enjoyed your article quite a bit just for the record.


  16. jared Says:

    Captchas have to use non-linear transformation to become “hard”. Rotation, skew, etc. can all be solved by Principal Component Analysis.

    Check out this author’s run down of why linear transformation is not enough:

    http://churchturing.org/captcha-dist/captcha/final.medium.png
    http://churchturing.org/captcha-dist/


  17. Dan Says:

    Hi Dr. Taylor, A very interesting article. Having written my first captcha, I was searching for info on cracking them and found your site. My first idea was to place random objects such as guns, umbrellas, scarecrows etc at a random position, at a random angle with lots of random lines and dots placed over them. I then felt this would be too easily solved by histogram analysis. The idea I settled on was to use random moire clockfaces shown at different angles where the user has to convert the time from analog to digital. You can see a demo at http://evolveradio.com/clockedya (released under GPL)
    I would welcome any feedback on its effectiveness or any advice you may have to offer as to how I might improve it. At the moment, I am not limiting the number of tries – I guess I should as I reckon the odds of guessing the correct answer are 1 in 1320 (12 x 11 x 10), as the hands cannot appear in the same position as each other.
    Anyway, as I said, if you have a few minutes, I’d love to know how easy or difficult you think it would be for my captcha to be solved by a computer.


  18. Dan Says:

    Oh I forgot to say.. How about a simple crossword clue combined with an anagram captcha or maybe an odd one out puzzle of the type you find in IQ tests?

    Or. I may write this one next for the hell of it. You display random notes and coins in random position (some overlapping) – and ask for the total amount?

    Or… How about recognising celebrities?

    or… analysing the data given in other form fields?

    or… testing that the entered email address exists?

    It’s a fascinating subject – I believe the fatal flaw may be in the first two letters of the Captcha acronym – ie: Completely Automated. We are asking one computer to test another.


  19. Andrew Says:

    I can’t find a working demo of the clock captcha on the web. It sounds interesting, although I can’t see how hard it could be to automate the reading of an analogue clock. Finding the angles of straight lines, especially if they all start in the centre of the image, is pretty straightforward. The hardest part would be telling the hands apart, which didn’t ought to be too taxing depending on the style, but even if you didn’t bother, you’ve still cracked 1 in 6 of them by chance. That said, ‘moire clockface’ isn’t a term I’m familiar with, so I may have missed something there.

    I like it when there’s no cultural knowledge required: recognising celebrities would probably stop me from posting, and coins can be very unclear to foreigners (especially if some modern graphic designer has taken the numbers off them). Text-based ones like the normal ones have the advantage of requiring no knowledge beyond what you need to read and understand the site it’s on. A friend of mine ran a videogame website and its captcha rather cleverly asked you to identify game boxes. Obviously it only had a finite number of questions, which was a drawback, but for a small site that’s not a problem. (Although I’m certain that spambots have mastered the art of email confirmation by now.)

    It is, you’re right, a massive disadvantage to have a computer sat opposite the subject in the Turing test, but it shouldn’t be automatically fatal: we know computers can test things they can’t do. For example, hashing algorithms can’t be reversed, but can be tested against. We have to find something where the computer can work backwards from the answer (say by warping the letters and drawing lines on) to generate a question, but can’t work forwards from there to find the answer. The hard part is that it has to be something a human can do easily. I suspect the bigger problem is that computers are too clever and people too dim.


  20. Digi Says:

    How about cracking this captcha? :)

    http://www.wowanno.com/forums/ucp.php?mode=register

    I placed it there after a shitload of spam on that abandoned site.

    A message to any potential spammer out there that might use the help you provide: fuck you.


  21. Andrew Says:

    MATLAB again:
    imshow(imdilate(imerode(im(:,4:320)==im(:,1:317),ones(3)),ones(3)))

    That was actually easier than just reading it.


  22. Jernej Says:

    Hello,

    I was playing a little bit with your code, after all it looked like that… http://pastebin.com/m50d11fa2

    However, I use Octave instead of Matlab and I just hit a problem … the code at pastebin produces an error:

    http://pastebin.com/m5661fd57

    Now the problem I think is here:

    adjacentpixels=(imdilate(thisarea, [0 1 0;1 1 1;0 1 0])&~thisarea);

    but I have no idea how to fix that. Can you help or explain please?

    Oh yes and if I uncomment this:

    %class(thisarea)
    %class(temp)
    %class(region)

    I get:

    ans = logical
    ans = double
    ans = double

    Thanks!


  23. Andrew Says:

    I’ve never used Octave, but whenever I get a type error like this I normally solve it by adding double(…) or logical(…) or whatever around whatever variable or expression it’s whining about.


  24. Jernej Says:

    double() solved this, thanks! :)

