How To Crack Captchas
June 5th, 2007
This page will teach you how to write a not-necessarily-very-good programme to beat some common captchas, but it will not provide any useful code to do so for you. It should give you an idea how to go about defeating captchas not listed here. But mostly, I hope it will be instructive for anyone who wants to write a less easily defeated captcha in the future, since apparently you’re all hopeless at it at the moment.
As everyone in the world knows by now, most websites and forums use “captchas” to try and stop computer programmes from posting fake comments containing adverts. “Captcha” stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. And as everyone in the world ought to have realised by now, they don’t work.
There exist a number of ways around them. The most cunning and most effective, although the most difficult to set up, is to build a pornographic website and get real humans to solve the captchas for you in exchange for naked pictures.
But mostly, they’re easy to get around because they’re shit. This, for example, is the default captcha that comes with the now obsolete phpbb2:
That is very easy to solve. (It should perhaps be pointed out at this stage that my job is in large part to extract shy information from images.) As with all the algorithms I’ll show you, this is the first and simplest one I could come up with, and it’s only the start. In all cases I will extract a binary mask of the letters for transferal to a more general OCR system. Also in all cases, I will use Matlab 6 to perform the analysis.
Here is the code required to make this captcha machine readable:
function bank=solvefuzzy(bank)
[x y c n]=size(bank);
%First, greyscale the lot by taking the red channel
bank=bank(:,:,1,:);
%Now blur it slightly.
for (i=1:n)
    bank(:,:,1,i)=filter2(ones([3 3])/9, bank(:,:,1,i));
end
%Now threshold it.
bank=(bank<0.63);
%Now trim the borders.
bank(1:x,[1 y],:,:)=0;
bank([1 x],1:y,:,:)=0;
Here is the result of this algorithm on four example captchas:
Now that wasn’t hard. And I know that the letter shapes aren’t ideal, but it’s a very uniform font they use, and the letters aren’t rotated, so it’s as easy as pie to extract the characters from this mask.
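To give an idea of what that extraction might look like, here is a minimal sketch that cuts the mask into one sub-image per letter. It assumes the letters never touch, so that at least one empty column separates each pair. The function name splitcolumns is my own invention for this illustration and is not part of the downloadable code.

```matlab
function chars=splitcolumns(mask)
%mask: 2-D logical image, true wherever a letter pixel was found.
%Returns a cell array holding one cropped sub-image per run of
%occupied columns (an assumption: letters are separated by empty columns).
occupied=any(mask, 1);
%+1 marks the first column of a run, -1 marks one past its last column.
edges=diff([0 occupied 0]);
starts=find(edges==1);
stops=find(edges==-1)-1;
chars=cell(1, length(starts));
for k=1:length(starts)
    piece=mask(:, starts(k):stops(k));
    %Trim the empty rows above and below the letter.
    rows=find(any(piece, 2));
    chars{k}=piece(rows(1):rows(end), :);
end
```

You would call it on one captcha at a time, for instance chars=splitcolumns(bank(:,:,1,1)) for the first image in the bank. Each cell could then be compared against templates of the font.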
Of course, cracking obsolete captchas isn’t terribly useful, or wouldn’t be if anyone bothered to update their forums, so here’s a captcha from phpbb3:
The first thing you should notice about this is that it’s gorgeous. But let’s have a look at the code required to break it:
function bank=solveclassy(bank)
%First, greyscale the whole thing
bank=mean(bank,3);
%Now threshold it.
bank=(bank<0.55);
That’s two commands. Now, let’s check the results.
So that’s done it perfectly, with no real image processing at all. Of course, this is a little unfair to the new captcha: it is in reality much tougher than phpbb2’s, because it uses many fonts and angles and sizes. So some kind of cunning would be required to turn these shapes back into characters, but that’s really nothing OCR software doesn’t do as a matter of course. And if you were to take each character individually (even if they overlap, the colours in the original image would distinguish them) and perform, say, a Radon transform on them about their centroids, this would give you a distinctive pattern for each letter and number in each font. The point is that from here it is entirely possible to crack this captcha. Besides which, phpbb is open source, so training the process by Radon transforming all possible characters would be fairly simple.
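As a rough sketch of the Radon idea, the function below recentres a single character on its centroid and calls radon from the Image Processing Toolbox. The name radonsignature and the choice of angles are assumptions of mine, and I have not tried this against real phpbb3 output. Note that the signature length depends on the window size, so to compare characters of different sizes you would first need to scale them to a common window.

```matlab
function sig=radonsignature(mask, angles)
%mask: 2-D logical image containing one character.
%angles: projection angles in degrees, e.g. 0:10:170.
[r, c]=find(mask);
cy=round(mean(r));
cx=round(mean(c));
%Half-width of the smallest square window, centred on the centroid,
%that still contains every pixel of the character.
h=max([cy-min(r), max(r)-cy, cx-min(c), max(c)-cx]);
window=zeros(2*h+1);
window(sub2ind(size(window), r-cy+h+1, c-cx+h+1))=1;
%radon projects about the centre pixel of the window, which is now
%the (rounded) centroid of the character.
sig=radon(window, angles);
```

Each column of sig is the projection at one angle. Training would then amount to storing one such matrix per character per font and picking the closest match.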
But phpbb3 has another trick up its sleeve: a second type of captcha. This is the one used on www.phpbb3.com’s forum, so presumably they trust it:
Aside from the fact that it looks like something from a bad Spectrum game, the first thing you should notice about this is that it has clearly been designed to be almost impossible to crack, by someone who knows nothing about cracking captchas. For example, the uniform background colour. Well over 90% of the image is a single colour and every pixel of every letter is that colour. Then, the individual letters are outlined in distinct and uniform colours, and then, as if that wasn’t helpful enough for crackers, the characters are made up of little square elements (which I will call ‘charels’), so we can reconstruct the characters down to the pixel, generally the right way up. It really couldn’t be more helpful if it tried.
So let’s see some code.
function result=solvefunky(bank)
[x y c n]=size(bank);
result=zeros([x,y,1,n]);
%First, determine the background colour.
%We assume the first colour with five continuous pixels of itself
%along the first row is background.
background=zeros(3,n);
for (i=1:n)
    colour=[0 0 0];
    j=1;
    count=0;
    while ((count<5) && (j<=y))
        if ((bank(1, j, 1, i)==colour(1)) && ...
            (bank(1, j, 2, i)==colour(2)) && ...
            (bank(1, j, 3, i)==colour(3)))
            count=count+1;
        else
            %A new colour: restart the run at this pixel.
            colour(:)=bank(1,j,:,i);
            count=1;
        end
        j=j+1;
    end
    background(:,i)=colour(:);
    %Next, find areas that are that colour
    backgroundareas(:,:,i)=((bank(:,:,1,i)==background(1,i)) & ...
        (bank(:,:,2,i)==background(2,i)) & ...
        (bank(:,:,3,i)==background(3,i)));
    %Now, find areas of that colour smaller than 15 pixels
    temp=bwlabel(backgroundareas(:,:,i), 4);
    small=zeros([x,y]);
    numberofregions=max(temp(:));
    red=bank(:,:,1,i); green=bank(:,:,2,i); blue=bank(:,:,3,i);
    for (region=1:numberofregions)
        pixels=sum(temp(:)==region);
        if (pixels<15)
            thisarea=temp==region;
            %This leaves a lot of bits which aren't real, but we know from
            %looking at the captcha that the letters are outlined in just
            %one colour, so let's eliminate anything that's got more than
            %one colour adjacent to it. (In fact, we allow one pixel of a
            %different colour as this works better.)
            adjacentpixels=(imdilate(thisarea, [0 1 0;1 1 1;0 1 0])&~thisarea);
            ar=red(adjacentpixels); ag=green(adjacentpixels); ab=blue(adjacentpixels);
            if ((sum((ar~=ar(1)))<2) && ...
                (sum((ag~=ag(1)))<2) && ...
                (sum((ab~=ab(1)))<2))
                small=small|thisarea;
            end
        end
    end
    result(:,:,1,i)=small(:,:);
end
That results in this rather pleasing image:
Now all I’ve done here is to find the pixels that are part of the larger “charels” which comprise the message. There’s still a lot of work to be done to find the message in ASCII format, but it can be done. You can separate the individual characters by recourse to the original image: each one is outlined in a distinctive colour, and if a colour is reused then it probably isn’t reused in adjacent characters, so a simple contiguity test will catch it. You can reorient and rescale each character by picking a charel arbitrarily and seeing where its neighbours lie relative to it, using the provided coloured outline as a guide and the ‘shadow’ colour to define the vertical and/or horizontal axes. This will allow you to build up a reoriented image of each letter, which can be easily checked against the known font to see which it most closely matches.
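The first of those steps, separating characters by outline colour, might be sketched like this. It is an illustration under assumptions rather than tested cracking code: groupbyoutline is a name I have made up, it takes one colour image plus the charel mask produced above, and it treats every charel with the same outline colour as the same character. The contiguity test for reused colours is not included. It also uses mode, so it needs a more recent Matlab than version 6.

```matlab
function charid=groupbyoutline(img, charels)
%img: x-by-y-by-3 colour image of one captcha.
%charels: x-by-y logical mask of the charels found in that image.
%charid: same size as charels, 0 for non-charel pixels, otherwise
%a number shared by every charel with the same outline colour.
[labels, n]=bwlabel(charels, 4);
colours=zeros(n, 3);
for region=1:n
    thisarea=(labels==region);
    ring=imdilate(thisarea, [0 1 0;1 1 1;0 1 0]) & ~thisarea;
    for ch=1:3
        plane=img(:,:,ch);
        %The most common neighbouring value is taken as the outline
        %colour, which tolerates the odd stray pixel.
        colours(region, ch)=mode(plane(ring));
    end
end
[uniquecolours, firstindex, key]=unique(colours, 'rows');
charid=zeros(size(labels));
for region=1:n
    charid(labels==region)=key(region);
end
```

With the bank and result arrays from solvefunky, a call might look like charid=groupbyoutline(bank(:,:,:,1), result(:,:,1,1)>0).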
Those of you who know me should already have worked out that this took me less than one evening, including grabbing all the pictures and writing this entry. I wouldn’t bother if it was going to take longer. You know that. So if you employed a good programmer for a week to crack such a captcha you ought to be able to finish the job off. Then you’d have access to every phpbb3 forum out there.
Clearly there are false positives and things in these processed images: the bottom one in particular has a large false positive in the Z, and the last H has a bit missing where the L overlapped it. I don’t think either of these would actually affect a good OCR algorithm (given that said algorithm would have the font used built into it and have an ideally oriented and scaled image of the letters, albeit with the odd mistake), and even if it did, well, we cracked the other three. If we assume we can crack 75% of these captchas, then we can break into a forum which allows us 5 attempts (which is pretty standard) 99.9% of the time.
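For anyone who wants to check that figure, here is the arithmetic, assuming each of the five attempts succeeds independently with probability 0.75:

```latex
P(\text{at least one success in 5 attempts})
  = 1 - (1 - 0.75)^5
  = 1 - 0.25^5
  = 1 - \tfrac{1}{1024}
  \approx 0.9990
```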
phpbb3 also allows the user an almost ludicrous number of options for their captcha. This is good, as it means that a cracker will have a harder time beating the captcha in the general case. But in the specific case of the default settings, which almost everyone will use, this won’t help at all.
So what’s the solution? Personally, I use a bespoke text-based captcha. Image-based ones are hard to programme, which isn’t a problem if you’re doing something like phpbb, because it has to be hard to crack (oh dear), and text-based ones really aren’t. Another problem with image-based solutions is that some devices or people can’t read them, so there usually has to be a fallback, and then you have two links, of which a cracker need only outsmart the weakest. (Sorry for the mixed metaphor there.) I think bespoke text-based is good because there’s no real motivation for a cracker to devote any time to cracking it, as they’ll only get access to my websites, and if they do I can very easily change it the following evening. But it couldn’t work for phpbb as you can’t make a bespoke captcha for every user.
Some captchas are obfuscated further than these. Sometimes this is a simple case of drawing lines over and through the text. This is pretty easy to beat — any good photo touching-up software has had this feature since the week after flatbed scanners were invented, and replicating it is not hard, even when the lines must be found automatically. A better solution is to deform the letters themselves, though this involves a very direct tradeoff: anything you do that makes letter shapes harder for a computer to identify will have the same effect for your legitimate users. Again, I would attack such a captcha by not attempting to restore the original image, but by developing an algorithm to characterise each… well, character based on its Euler number, the number of sharp corners in its outline and their relative locations, and maybe the Euler number of the shape you get if you dilate it a bit. I believe this could crack such a captcha with minimal training.
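A minimal sketch of the two Euler-number features, using bweuler and imdilate from the Image Processing Toolbox. The function name eulerfeatures and the 3-by-3 dilation element are my own choices for illustration, and the corner-counting part is left out entirely.

```matlab
function features=eulerfeatures(mask)
%mask: 2-D logical image of one character.
%bweuler returns (number of objects) minus (number of holes), so an
%'O' gives 0, a 'B' gives -1 and an 'L' gives 1.
plain=bweuler(mask, 8);
%Dilating a bit closes small holes and gaps, which changes the Euler
%number for some shapes and not for others.
fattened=bweuler(imdilate(mask, ones(3)), 8);
features=[plain fattened];
```

On their own these two numbers only split the alphabet into a handful of groups, which is why the corner locations would be needed as well.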
Theoretically, human authentication is the best way, but humans aren’t apparently very good at that. It’s not always apparent from a name and an email address if a user is a human or a spambot. My proposed solution is a deliberately impossible captcha: you find or create an image, possibly of random abstract ‘art’, or a landscape, or a sort of randomly generated Rorschach ink-blot test, and ask the user for a vague, one sentence description. Then a human would authenticate the user’s account by seeing if the user’s description of the image relates to that image in any way. It’d be a little subjective, but I really can’t see it being cracked, except perhaps by Derren Brown concocting a sentence that would appear to describe any image. And people would learn to spot that sentence. It would still be susceptible to the porn crack, but then everything is, and honestly I think it’d be fairly easy to tell which descriptions of Rorschach ink-blots had come from the minds of teenage boys looking for naked pictures with a pretty high degree of certainty.
Plus, I think it’d offer a fascinating glimpse into the psyche of all prospective users of your forum.
You can download all the above code, and some general making-it-work gubbins, in the Code Factory, but you’ll need Matlab to make it work, and you’ll need the Image Processing Toolbox to make it run. If anyone wants to extend the code, do feel free. Complete code is available if you want it, though: people sell it to prospective spammers.
16 Responses to “How To Crack Captchas”
June 6th, 2007 at 01:11
The first phpBB3 captcha and all the info on that article about the phpBB3 Admin Panel are from “Beta 1” and are completely obsolete since the final (well, RC1 at least) is completely different. The captcha used on http://www.phpbb.com is the final captcha, and the final Admin Panel has an option to add random “same color as the letters” lines to obfuscate it.
June 6th, 2007 at 10:17
Yeah, I’ve not downloaded phpbb3 to check up on it, as I don’t really have any use for it any more, but I think those ‘random “same color as the letters” lines’ would do little to stop the routine outlined above. They might generate some false positive areas, but they’d mostly be the wrong size and shape so it wouldn’t take much morphology to identify and discard them.
It’s a clever captcha in a sense: it looks so utterly unlike other captchas that most generic cracking programmes would probably not work. But it’s going to be so widely used by almost everyone who installs phpbb3 that it’ll be worth writing routines specifically to crack it.
I think they’d be better off making it very easy for users to create and install their own captchas. Granted, a lot of users wouldn’t know how to do that, but if enough do it’ll become far less worthwhile cracking the default one, and at least it’ll mean some people get a secure forum.
June 28th, 2007 at 18:45
The excellent article is made better with the little test on your own reply form.
Simple and effective.
January 18th, 2008 at 09:49
Is there a package for C#, java that can do this matlab trick?
January 18th, 2008 at 11:32
I’ve not found one. There are a few image processing libraries for C#, although most of them are pricey. Most of the functions I’ve called on this page are simple enough to write, although an efficient version of bwlabel would be a pain to code: you’d want to start with a flood fill and build it from there. Simple flood fill routines are very inefficient, though. A good one is a scanline fill, which is not too complex but usually very fast.
I don’t know anything much about Java programming, though. I expect there’s something out there.
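For anyone porting this, here is roughly what the simple (and slow) flood-fill version of a 4-connected labelling routine looks like, written in plain Matlab so it translates easily. The name simplelabel is made up for this sketch. It is not how bwlabel is implemented and it is not the faster scanline fill.

```matlab
function [labels, n]=simplelabel(mask)
%mask: 2-D logical image. labels: same size, 0 for background and
%1..n for each 4-connected region of true pixels.
[x, y]=size(mask);
labels=zeros(x, y);
n=0;
%Each pixel is pushed at most once, so x*y slots are always enough.
stack=zeros(x*y, 1);
for start=1:x*y
    if (mask(start) && labels(start)==0)
        n=n+1;
        labels(start)=n;
        top=1;
        stack(1)=start;
        while (top>0)
            p=stack(top);
            top=top-1;
            [r, c]=ind2sub([x y], p);
            rr=[r-1 r+1 r r];
            cc=[c c c-1 c+1];
            for k=1:4
                if (rr(k)>=1 && rr(k)<=x && cc(k)>=1 && cc(k)<=y)
                    q=sub2ind([x y], rr(k), cc(k));
                    if (mask(q) && labels(q)==0)
                        labels(q)=n;
                        top=top+1;
                        stack(top)=q;
                    end
                end
            end
        end
    end
end
```

The explicit stack avoids deep recursion, which matters in languages with a small call stack.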
January 19th, 2008 at 02:58
Hi,
Do you crack yahoo captcha in any of your articles? Do you know where I can find it?
Thanks,
Nick
January 19th, 2008 at 03:03
No. I never use Yahoo, so I’ve not even seen their captcha. The only ones I ever see these days are on Blogger.
January 19th, 2008 at 19:41
Apparently MediaWiki-sites now have a captcha for when someone tries to put external links in an entry.
January 22nd, 2008 at 04:07
There’s always KittenAuth.
http://www.thepcspy.com/contact
January 23rd, 2008 at 07:17
To begin, I must say that the author fancies himself a bit more of an intellectual than he actually is. Allow me to elaborate on his opinions if I may.
Computer Vision is (and has been for some time) quite advanced. Trust me when I say that if Computer Vision is powerful enough to race autonomous robots through unknown terrain, navigate cars through a city on their own, take facial fingerprints (not images, but measurements) of every single person entering the stadium for the Superbowl in 2001, it can be used to crack a stupid captcha image.
The problem is that you are focusing on each individual captcha that you are cracking, and engineering a crack based on that particular image. So what? Any green-horn should be able to do that, and if they can’t should not even be trying. Why wouldn’t you push toward designing a program to read ANY image captcha that it encounters? You may have to put aside your trusty old Matlab for that one, and no, I won’t send you any code.
The rant about an administrator approving each and every attempt to register a user is about as brute force of a solution as I can think of. There are effective captchas out there, some of which I have written that have never been cracked a single time and have been up on forums, guestbooks, blogs, etc… for years with thousands of visitors a day. Don’t claim that you have accomplished something because you have cracked phpBB’s image captcha, which, by the way, is not cracking anything at all, since it is open source to begin with.
Maybe you should submit your solution “entering two letter q’s” to open source forums so they can benefit from this knowledge.
If I was an asshole, I’d redirect the spam traffic that I get on all the websites I’ve written and maintain to here; they’d have you cracked in a day or so and fill you up so full of crap you’d actually have some content on the site.
Sorry if I sound mean, I actually spit out my coffee when I read the letter ‘q’ thing because I was laughing so hard. You made my day man. Thanks.
January 23rd, 2008 at 13:36
No, that’s kind of my whole point: “any green-horn” (whatever that might be) should indeed be able to do all of the above, because it’s really quite easy. Computer Vision, and all its high-end algorithms, is all well and good and doing amazing things, but it’s well outwith the reach of the average person.
A general algorithm would be more interesting, yes, but it would be a major investment of time and I’d be the wrong person to do it. My point was just that I could spend a week or so, crack phpBB’s captchas, and then spam all the phpBB forums in the world, which to my mind is a far greater weakness than a susceptibility to advanced computer vision algorithms, because as you pointed out there are so many people who could do what I just did. A few of them are bound to do it.
On the subject of my own captcha, I think to be fair you’ve underestimated it. It actually was broken once, but not by “entering two letter q’s”. That’s just one of five or six questions that loop around the comments form, which are all simple tasks involving maths and/or moving letters around, and about a year ago, a spambot learned to add two numbers (which is a common question in captchas anyway). All I did was load up the captcha file and delete that question (leaving four or five others). That secured the site again without blocking real comments, and later that day I replaced it with a new question so no harm was done. Yes, the questions are weak but the system that surrounds them is much more robust. I could put in a conventional image-based captcha if I wanted, although that would reduce accessibility, so for the amount of traffic I have now it would probably do more harm than good.
But that’s the only time I’ve had any spam (other than pingbacks, which are so problematic they’ve all but shut down Technorati) since I installed it, whereas phpBB’s captchas are broken daily even on low-traffic forums. Okay, so the smarter spammers you enjoy might learn to enter two “q’s” in a day, but I can remove that question that same day. I expect an automated system could remove it in a moment the first or second time WordPress detected a “spammy” comment getting through. Unless they can reverse-md5 long and meaningless strings, they’ve got nothing long-term but cracking individual questions as they appear.
January 23rd, 2008 at 15:27
Unfortunately, most boards, particularly those that are freely available, suffer from fundamental downfalls in their captcha methods.
Firstly, you CANNOT allow the user to see the relationship between the captcha question or image and its solution. You’d be surprised at how often this is done. For example, the md5 hashing of the question that you provide on this site does just this (the md5 “answerhash” is embedded in the form code and thus visible to any HTML parser). Once one solution is calculated (which is easy in your case), it can be applied since they know the relationship between the hashing and the solution, even though it changes. Banning ip addresses wouldn’t work either because spam bots typically work in groups to avoid this, and you don’t want to ban a potential website visitor because they’ve slipped typing in the answer. Once spammers get it, they share the methods to other spammers so that everyone may enjoy the security hole.
Secondly, you MUST generate a unique captcha every time. A question or image should never come up twice, no matter what. It is pretty clear why this is important. This is typically why text verification is rarely used, at least by itself.
Thirdly, a good, foolproof, changing, and unique captcha image needs to be developed. For example, you may have an external chunk of php code using gd to generate an image, generate an id for this image, and put both the id and the solution in a database somewhere unbeknownst to the user. Then upon submission of something, check the id against the solution in the database, maybe have a script on the db return either true or false, and automatically and immediately delete that entry from the database.
I’ve always been interested in seeing how easy captcha images are to crack, and maybe even writing some code myself to do so. You should put a small section up on your site where users can submit a link to a chunk of code somewhere that outputs an image, and see how long it takes to develop a crack for it (one that can be applied to any image that is generated, and computes the correct solution 100% of the time). I firmly believe that with the proper linking of the characters with lines, and the correct usage of colors, fonts, scaling, rotation, overlapping, etc…that captchas are extremely effective.
In all fairness, you’ve attacked weak captchas in your post above. There are some good ones!
January 23rd, 2008 at 17:44
I have. What I did, really, was to see the captcha for phpBB3, think “that’s rubbish, I bet I can crack it in a day” and attack it to see if I could. To be honest, the captcha for phpBB is fairly irrelevant anyway, given how easy it is for even the dimmest script-kiddies to gain access to the admin panel and turn the index page into a billboard.
The “answerhash” isn’t just the md5 of the answer, by the way. It’s salted with a site ID and some material unique to the page it’s on, so a stored answer would only work on one page. If I have to change it again I’ll add to that a question ID so that when a question is retired all hashes associated with it are retired as well. It’d mean checking five hashes instead of one, but that’s okay.
With image-based ones, the gap between “what computers can’t do” and “what humans can do” is closing fast, and probably closed long ago if you consider humans with any kind of disabilities. I don’t even think captchas on individual sites are the right approach, long term. We need to stop the spam being sent in the first place — the web traffic it generates is problematic and expensive enough and the tests to block it are the antithesis of most modern interface design principles.
(Personally, I think we need to start identifying the people whose computers are spam-sending zombies and taking away their broadband. They can’t be trusted with it.)
January 23rd, 2008 at 17:56
This is how all my opinions seem to work; I start out with a gut instinct, and tell people until I happen across one who knows what they’re talking about, and after a few exchanges I have a far better justified position.
I’d like to think that means I’m open to correction, but I usually seem to end up with roughly the same opinion I had before but for better reasons so either my instincts are fantastic or I can justify any irrational prejudice. I don’t really know how to tell the two apart. I usually enjoy the process, though, so I don’t worry about it much.
February 29th, 2008 at 23:38
Well, I enjoyed your article quite a bit just for the record.
March 7th, 2008 at 03:28
Captchas have to use non-linear transformation to become “hard”. Rotation, skew, etc. can all be solved by Principal Component Analysis.
Check out this author’s run down of why linear transformation is not enough:
http://churchturing.org/captcha-dist/captcha/final.medium.png
http://churchturing.org/captcha-dist/