Investigation, Product news

Cracking CAPTCHAs for fun and profit

Through synthetic training sample dataset generation and ML training.

Preface

Cracking CAPTCHAs is already a well-documented and established process, which this article looks to expand on. We will approach this article with a general view of how we’ve cracked CAPTCHAs under undesirable conditions. This article is not meant to be a how-to or detailed guide to replicate our steps. However, it may give you some inspiration for your specific challenge.

We believe that the methods laid out in this article are novel and significantly improve the efficiency of automated CAPTCHA solving in contrast to traditional approaches, especially when considering a target CAPTCHA system with poor sample harvesting opportunities.

Ethics

We bypass human verification checks to maintain automatic information collection pipelines. The use of the methods we have developed only extends as far as what is required to automate our collection process. 

If a CAPTCHA or other human verification check system is poorly designed and not adequately rate limited, condition checked etc., bypassing it at scale may, in the worst of cases, amount to a DDoS (Distributed Denial of Service) attack. A correctly implemented human verification system should mitigate this even when the check itself is bypassed. At best, unethical manipulation of these verification systems can lead to spam posts/comments and otherwise undesirable automated “bot” interaction. We do not condone this type of use.

The Problem

There are several well-established methods to automate the solving of CAPTCHAs, depending on the complexity of the CAPTCHA. If we start at the easy end of the spectrum, we are presented with a fairly basic alphabetical CAPTCHA.

With a simple distortion background, one might choose to apply denoise filters or Gaussian blurring to the image to reduce or remove the “stars” or random dot pixels that are applied at random to its background.

This process gives us a less noisy picture. If the source sample is a colour image, we can further convert it to grayscale, which improves edge detection.

The image can then be processed through a standard OCR (Optical Character Recognition) library and, in our experience, this can result in a 0.1% failure rate, yielding excellent, stable solutions.
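To make the clean-up step concrete, here is a minimal sketch of grayscale conversion followed by a 3x3 median filter, written in plain Python so the logic is visible. The median filter is our stand-in for the "denoise filters" mentioned above (it is not Gaussian blurring), and the function names, luminance weights and tiny test image are illustrative assumptions rather than the pipeline we ran. In practice an image library would do this far faster, and the result would then be passed to an OCR library.

```python
from statistics import median


def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples into rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_rows
    ]


def median_filter(gray_rows):
    """Replace each pixel with the median of its (up to) 3x3 neighbourhood.

    An isolated "star" pixel differs from most of its neighbours, so the
    median removes it while leaving larger strokes mostly intact.
    """
    height, width = len(gray_rows), len(gray_rows[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray_rows[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


# Illustrative input: a 5x5 white image with one dark noise pixel in the middle.
WHITE, DARK = (255, 255, 255), (20, 20, 20)
noisy = [[WHITE] * 5 for _ in range(5)]
noisy[2][2] = DARK

cleaned = median_filter(to_grayscale(noisy))
```

After filtering, the single dark pixel is gone and every value in `cleaned` is white again.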

In some cases, a good test of CAPTCHA ease of solvability is to feed it to Google Translate as an image; have Google Translate attempt to read the text and translate the letters back into English. If it can, then you have a very good chance that rudimentary OCR libraries will also work for you.

But this article is not about the easy end of the challenge…

What we are dealing with is a CAPTCHA that is alphanumeric, mixes upper and lower case, places and rotates its characters randomly, and draws random disruption lines across the image and characters. Furthermore, and most importantly (a point we will discuss in more detail), the target source is a Tor Onion website that at the best of times loads slowly and at the worst of times is offline or responds with backend timeout errors.

The image complexity of the source CAPTCHA means it’s nearly impossible to read it effectively by OCR. This is made challenging by the disruption patterns of the background random line arrangement (an outward star pattern), and each of our characters is independently disrupted with seemingly random lines of various length and width. Combine all that with the offset angle of each character and it’s beyond what most OCR or OpenCV methods can handle.

Therefore, for more complex CAPTCHAs, image manipulation (removing noise, grey scaling etc.) is typically not sufficient. These challenges usually require machine learning to get a reasonable failure rate and sufficient solving speed.

The biggest factor in achieving a good model that will solve accurately is having a large enough sample base. In some cases, many thousands of samples are required for training. Certainly, when dealing with a CAPTCHA that may have upper, lowercase and numerical characters, with randomisation of all these points plus randomisation of disruption patterns or lines, the larger the sample set, the more accurate a model the training will produce.

So how do you get thousands of samples from a source that is slow to load and has poor availability, both conditions of the source being a Tor website? Harvesting samples this way would be far too inefficient and we can’t hang around! 

Even with a target source that responds reasonably quickly, has good availability, and can be harvested without aggressively hitting rate limits, who would want to sit there endlessly solving eight thousand captchas to feed to an optical character recognition model? 

I know that’s not going to be me! Sure, there are options to outsource these problems and crowdsource them, but those options take time and money, and are likely to introduce errors in our training sample data. None of these is desirable, so how do we get 100% accurate sample data cheaply, without human solving, without having to harvest the source, and in a way that can scale?

The Solution

The solution we came up with was first to not focus on the solving of the CAPTCHAs, or the training of our model, or anything that was a direct result or outcome of the end goal we are driving towards. Instead, we looked at how the CAPTCHAs are constructed; what do they look like and what are their elemental parts. 

We know harvesting is not an optimal option, so we have put that aside. Doing so leaves us with a handful of maybe 20 or so harvested solved CAPTCHA samples. Nowhere near enough to start training but it’s enough to start focusing on the sample set we have.

If we look at how the CAPTCHA is constructed and try to break its construction down piece by piece, in a way “reverse engineering” the construction of the CAPTCHA, we might either: 1) be able to generate our own `synthetic` CAPTCHAs on demand and at scale, all 100% accurately pre-solved, or 2) sufficiently understand the method of construction to identify the library or process with which the CAPTCHA is constructed and reimplement it for ourselves with the same 100% accurately pre-solved outcome.

In our case, we took the former option. Some time was spent trying to identify the particular CAPTCHA library but no exact match was found. In the interest of not burning too much time, or depending on external factors, we decided to attempt to create our own synthetic CAPTCHA generation process.

To create our CAPTCHAs, we used Pillow (a fork of PIL, the Python Imaging Library), an image manipulation library for Python that offers a wide range of features all well suited to the job at hand.

We start by defining a few values that we have observed to be fixed, such as a defined image size (in our case, 280 by 50 pixels) and use this to create a simple image. 

Then we define our letter set (a to z, A to Z, 0 to 9) as we know these to be fixed. 

Using `random.choice` we can pick the required number of characters. In our case, the CAPTCHA uses a fixed length of 6 characters.
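The character selection described above can be sketched in a few lines. The names `CHARSET`, `CAPTCHA_LENGTH` and `random_captcha_text` are ours for illustration, and passing in a seeded `random.Random` is simply a convenience for reproducible runs.

```python
import random
import string

# a to z, A to Z, 0 to 9, as observed in the source samples.
CHARSET = string.ascii_lowercase + string.ascii_uppercase + string.digits
CAPTCHA_LENGTH = 6  # the target CAPTCHA uses a fixed length of 6 characters


def random_captcha_text(rng=random):
    """Pick CAPTCHA_LENGTH characters independently and uniformly from CHARSET."""
    return "".join(rng.choice(CHARSET) for _ in range(CAPTCHA_LENGTH))


# A seeded generator makes a run repeatable; use the module default otherwise.
text = random_captcha_text(random.Random(0))
```

Because every character is drawn independently, the synthetic labels are spread evenly across all 62 characters.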

The text font is also important and from our source samples we see it is fixed: therefore we try to match the font type as closely as possible. Font size also remains constant. This will be important in ensuring that our training is as accurate as possible when our model is presented with real sample data.

To kick things off, the process carefully establishes the dimensions of the image canvas, akin to laying out a pristine piece of paper before beginning a drawing. Then, with a deft stroke, we construct a blank background canvas, pristine and white, awaiting the arrival of the CAPTCHA characters. But here’s where the true artistry takes centre stage: the process methodically layers complexity onto each character.

With each character in the CAPTCHA text, our process doesn’t simply slap it onto the canvas; instead, it treats each letter as an individual brushstroke, adding specific characteristics at every turn. We begin by precisely measuring the width and height of each character, ensuring that characters will not be chopped off the edges, correctly fit and fill the CAPTCHA, and that they resemble the source CAPTCHA text. Then, like with the source samples, we introduce randomness into the mix, spacing out the letters with varying degrees of separation, akin to scattering scrabble pieces.

We are also introducing a touch of chaos by randomly rotating each character, giving them a tilt that defies conventional alignment. This clever sleight of hand resembles the source samples accurately and adds to the difficulty level of solving this CAPTCHA. 

Yet the process doesn’t stop there. No, it goes above and beyond, adorning our canvas with a riotous display of crisscrossing lines, as if an abstract artist had gone wild with a brush. These random lines serve as a digital labyrinth, obscuring the text beneath a veil of confusion and intrigue.

We then add and overlay lines of random length and weight across each character, aligned to the character’s angle closely matching that of the source sample. 
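Pulling the steps above together, a generator along these lines can be sketched with Pillow. This is a simplified illustration under stated assumptions, not our production generator: the tile size, line counts, colours, rotation range and the use of Pillow's built-in default font are all placeholder values (we matched the real font and measured each character, which this sketch replaces with a fixed tile per character). Drawing the per-character lines on the tile before rotating it is one way to make those lines follow the character's angle.

```python
import random
import string

from PIL import Image, ImageDraw, ImageFont

WIDTH, HEIGHT = 280, 50   # fixed canvas size observed in the source samples
CHARSET = string.ascii_letters + string.digits
LENGTH = 6
TILE = 32                 # assumed size of the square tile each character is drawn on


def random_line(draw, rng, size, colour, max_width):
    """Draw one line between two random points of a (width, height) area."""
    width, height = size
    start = (rng.randint(0, width - 1), rng.randint(0, height - 1))
    end = (rng.randint(0, width - 1), rng.randint(0, height - 1))
    draw.line([start, end], fill=colour, width=rng.randint(1, max_width))


def make_captcha(rng):
    """Return (text, image) for one synthetic, pre-solved CAPTCHA."""
    text = "".join(rng.choice(CHARSET) for _ in range(LENGTH))
    canvas = Image.new("RGB", (WIDTH, HEIGHT), "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # placeholder; the real font should be matched

    # Background disruption lines (the count and weight here are guesses).
    for _ in range(rng.randint(8, 14)):
        random_line(draw, rng, (WIDTH, HEIGHT), (60, 60, 60), 2)

    slot = WIDTH // LENGTH  # horizontal space reserved for each character
    for index, char in enumerate(text):
        # Draw the character and its own lines on a transparent tile.
        tile = Image.new("RGBA", (TILE, TILE), (0, 0, 0, 0))
        tile_draw = ImageDraw.Draw(tile)
        tile_draw.text((TILE // 3, TILE // 4), char, font=font, fill=(0, 0, 0, 255))
        for _ in range(rng.randint(1, 3)):
            random_line(tile_draw, rng, (TILE, TILE), (60, 60, 60, 255), 2)

        # Rotate the tile, then paste it at a random offset inside its slot.
        rotated = tile.rotate(rng.uniform(-30, 30), expand=True)
        x = index * slot + rng.randint(0, max(0, slot - rotated.width))
        y = rng.randint(0, max(0, HEIGHT - rotated.height))
        canvas.paste(rotated, (x, y), rotated)  # the tile's alpha acts as the mask
    return text, canvas


text, image = make_captcha(random.Random(1))
```

Each of the constants is a knob to turn while comparing the output with real samples, one attribute at a time.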

Now that we have a way to populate our image canvas, we have a working framework with which we can iterate to get an output that resembles the source samples as closely as possible. 

For now, we generate a few hundred samples. Each image file is named after the randomly selected CAPTCHA text, which essentially gives us a sample set that has already been solved.

After that, we compared each iteration’s output closely to the source and made tweaks and adaptations. For each iteration of the CAPTCHA generator we looked closely at just one specific attribute to simplify the synthesis process. We adjusted the random scattered background lines, changing their length, width and count, then moved on to tweaking the letter placement and random angles to closely match the apparent pseudo-randomness of the sample data set.

Following sufficient tweaking and iterations, we are producing a CAPTCHA that, at least visually, matches our source samples very closely. It matches so closely that, if mixed with real samples, it’s difficult to distinguish. This is the ideal level of synthesis we are looking to achieve.
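As a small illustration of using the file name as the label, the sketch below writes placeholder files and recovers the labels from their names. The `_00001`-style index suffix is our assumption, added so that a repeated text, or two texts that differ only by case on a case-insensitive file system, cannot overwrite each other. The helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def sample_path(directory, label, index):
    """Build a file name that carries the label plus a unique index suffix."""
    return Path(directory) / f"{label}_{index:05d}.png"


def label_from_path(path):
    """Recover the CAPTCHA text from a file name built by sample_path."""
    # The character set contains no underscore, so splitting on the last one is safe.
    return Path(path).stem.rsplit("_", 1)[0]


with tempfile.TemporaryDirectory() as tmp:
    for i, label in enumerate(["aB3xY9", "Ab3xY9", "aB3xY9"]):
        # Empty bytes stand in for saving the generated image.
        sample_path(tmp, label, i).write_bytes(b"")
    recovered = sorted(label_from_path(p) for p in Path(tmp).iterdir())
```

All three samples survive, including the repeated text and the pair that differs only by case.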

Example synthetic captcha on the left, real on the right

Next steps

Now that we have a way to produce synthetic CAPTCHAs that very closely match our target, it’s time to produce a few thousand of them. This is easily and quickly done by specifying the total count in our process loop, and out pop 5,000 freshly generated, pre-solved CAPTCHAs, all nicely labelled and ready for shoving into our training process.

For model training, we’ve chosen to use the TensorFlow framework alongside the ONNX Runtime machine learning model accelerator. This combination worked well for us for both training accuracy and efficiency. All training was conducted with the use of a Nvidia GPU.

Following initial training, using just our best-produced synthetic CAPTCHA samples as our data set, we achieved a CER (Character Error Rate) of 3.26%. For a first run of a model trained against a purely synthetic data set, that was not too bad at all. But we knew we could do better.

Now that we had a model to work with, we could use it to start solving actual real target CAPTCHAs. This would allow us to generate a larger pool of real CAPTCHA samples, with a solve set, and mix those in with our synthetic set. We were looking to combine 5k synthetic CAPTCHAs with 1k real CAPTCHAs harvested using our newly trained, albeit unoptimised, model.

With a framework in place that would interface with the target website, collect CAPTCHAs, generate a text prediction, check that with the website and, if solved, store the solved and labelled CAPTCHA image, we generated about 1,000 samples over a short time.
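For readers unfamiliar with the metric, CER is commonly computed as the total character-level edit distance between predictions and true labels, divided by the total number of true characters. The sketch below follows that standard definition; it is an illustration with made-up sample pairs, and the training tooling may compute the figure slightly differently.

```python
def edit_distance(reference, prediction):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(prediction, start=1):
            cost = 0 if ref_char == pred_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a predicted character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(pairs):
    """pairs: iterable of (true_text, predicted_text)."""
    pairs = list(pairs)
    errors = sum(edit_distance(ref, pred) for ref, pred in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return errors / total


# Made-up examples: one exact solve, one dropped character, one substitution.
pairs = [("aB3xY9", "aB3xY9"), ("Qw7Lp2", "Qw7Lp"), ("k9Zt4M", "k9Zt4N")]
cer = character_error_rate(pairs)  # 2 errors over 18 characters
```

Note that CER counts characters, not whole CAPTCHAs: one wrong character in a 6-character CAPTCHA still fails the solve, so the per-CAPTCHA failure rate is higher than the CER.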
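The harvesting loop described above can be sketched as follows. The website itself acts as the labeller: a prediction is only stored if the site accepts it. The four callables (`fetch_captcha`, `predict`, `submit`, `store`) are hypothetical placeholders for the site-specific and model-specific parts, and the stubs at the bottom exist only to make the sketch runnable.

```python
def harvest(fetch_captcha, predict, submit, store, target_count, max_attempts):
    """Collect real CAPTCHAs whose predicted text the target site accepts."""
    stored = 0
    for _ in range(max_attempts):
        if stored >= target_count:
            break
        try:
            image = fetch_captcha()
        except TimeoutError:
            continue  # slow or offline source: skip this attempt and retry
        guess = predict(image)
        if submit(image, guess):  # the site confirms the solve
            store(image, guess)   # so the guess is a verified label
            stored += 1
    return stored


# Stubs for illustration only: the "image" is simply its true text.
truths = iter(["aB3xY9", "Qw7Lp2", "k9Zt4M", "Hh81sD"])
kept = []


def fake_fetch():
    return next(truths)


def fake_predict(image):
    return "Qw7Lp3" if image == "Qw7Lp2" else image  # one deliberate miss


def fake_submit(image, guess):
    return image == guess


count = harvest(fake_fetch, fake_predict, fake_submit,
                lambda image, guess: kept.append(guess),
                target_count=3, max_attempts=4)
```

Rejected guesses are simply dropped, which keeps the harvested labels accurate but means only CAPTCHAs the current model already solves make it into the set.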

Feeding this back into the mix of training model data we dropped the CER down to 2.77%.


We were confident that even at 2.77% the rate was better than a human could achieve, and we were also confident that our methodology was working.

Our remaining tasks were to iterate on the model once more, using this slightly more optimised model to generate a slightly larger set of labelled real CAPTCHAs.

We were able to go from the initial model, with a worse CER (orange line) to the best model (green line) in only a few training iterations.

The model training improvements are best shown in the graph below with each improvement yielding a lower CER, for longer (more stable) and at a sooner point in time. 

At that point we settled on a final model with a CER of 1.4%, opting for an optimal mix of real to synthetic CAPTCHAs.

Our final ML model diagram: 

Once the efficacy of this model was validated, it was simply a task of plugging it into the collection pipeline process and deploying it into our production collection system. The automated solver process has been running stably ever since, and most of the disruption we’ve observed has been solely due to the target source going offline and being unavailable.

Bias and Variance

A key consideration during the training process was to be aware of, and where possible mitigate, overfitting and overtraining of our model. Instead of the terms `overfitting` and `overtraining`, I like to use Bias and Variance as the two potential pitfalls of ML training, as they better explain the undesirable conditions that may occur. I won’t dive into too many details around these ML concepts, as to fully understand them you would probably need a PhD. The best way I can describe what my simple mind can understand is as follows.

Due to the nature of our novel (one might say clever) iterative process for training a CAPTCHA solver on a very small original source data set, we are potentially adding bias into our training process. For example, any data set solved by the first model is solved by a model with a predefined bias towards a particular set, style or character combination. This can result in a new data set that is biased towards what the previous model was good at solving, thereby amplifying the bias in our next model’s training.

This bias would result in a real world regression of CER as the model is unoptimised to solve a wider range of character combinations and randomisation characteristics. 
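One simple way to look for this kind of bias is to count how often each character appears in the labels of the model-harvested set and check for characters that never appear or that dominate. The sketch below is an illustrative check of our own devising, not a description of the tooling used during training; the function name and sample labels are made up.

```python
import string
from collections import Counter

CHARSET = string.ascii_letters + string.digits


def character_coverage(labels, top=5):
    """Return (characters never seen, the most frequent characters)."""
    counts = Counter(ch for label in labels for ch in label)
    never_seen = [ch for ch in CHARSET if counts[ch] == 0]
    return never_seen, counts.most_common(top)


# Made-up harvested labels in which "a" is heavily over-represented.
harvested = ["aaB3x9", "a7aYp2", "k9aat4"]
never_seen, most_common = character_coverage(harvested)
```

A long `never_seen` list, or a few characters dominating `most_common`, suggests the harvested set reflects what the previous model found easy rather than what the target actually serves.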

Our second pitfall, variance, sits at the other extreme: an overly varied training set, as opposed to an insufficiently varied one (which creeps into bias). Although we could train a single model to solve many different types of CAPTCHAs, beyond just this one example, by using a very varied data set, doing so without careful tuning could result in `overfitting` our data set, giving an unoptimised CER as the model would essentially be training on more noise than signal.

We therefore considered both Bias and Variance closely, ensuring a healthy mix of varied, correctly labelled real CAPTCHAs harvested from source and synthetically generated CAPTCHAs with a randomly distributed character set. An optimal CER band was then discovered through iterative A/B testing of the data set mix and training iterations until a stable plateau was identified.
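Building training sets at different real-to-synthetic ratios for comparison can be sketched as below. The 20% figure, the set sizes and the function name are illustrative assumptions only; this article does not state the final ratio we settled on.

```python
import random


def mix_datasets(real, synthetic, real_fraction, total, rng):
    """Sample a shuffled training set containing the requested share of real data."""
    n_real = round(total * real_fraction)
    n_synthetic = total - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough samples for the requested mix")
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed


# Stand-in samples: (origin, index) tuples instead of labelled images.
real = [("real", i) for i in range(1000)]
synthetic = [("synthetic", i) for i in range(5000)]

candidate = mix_datasets(real, synthetic, real_fraction=0.2, total=3000,
                         rng=random.Random(7))
real_share = sum(1 for origin, _ in candidate if origin == "real") / len(candidate)
```

Training one model per candidate mix and comparing their CER on the same held-out real samples is one way to run the comparison.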

Conclusion

We deploy a final model, incorporating a mix of synthetic and real CAPTCHAs, achieving a CER of 1.4%. The automated solver process seamlessly integrates into our production collection system, ensuring stability and efficiency.

By leveraging synthetic sample training data generation, we’ve advanced CAPTCHA cracking. Our approach offers an effective and efficient solution for CAPTCHA cracking without significant human involvement or effort, allowing for effective automated data collection.

With this capability, we are able to add value for our customers by automating collection from otherwise programmatically inaccessible sources, where we would otherwise have to have a human manually solve the CAPTCHA, access the page, insert any updates and then alert our customers. Automation is key to what we do, at speed and at scale, especially when dealing with many hundreds of collection sources as we do.

Photo by Kaffeebart on Unsplash.

Product news

Introducing the SOS Intelligence Source Library

Amir and Daniel

I’m delighted to announce that last week we launched our newest feature, the Source Library for paying customers. This has been in development for the past few months and the team has done an outstanding job getting this live. Thank you guys!

I sat down with Daniel, our Threat Intelligence Analyst and frequent guest on our webinars to run through the specifics.

You can see what we covered below:

  1. Introduction of the Source Library: this has been developed in the background over the last few months and the team has done an excellent job. Having our new developer, Srdjan, on board is already paying dividends.
  2. Purpose: The Source Library aims to provide customers with additional context and information about the sources being monitored, as well as specific alerts generated. This has been something that has been requested and gives the extra information which often helps with context and understanding of what is happening, or could happen.
  3. Strategic Decision: Integrating the Source Library into the platform was a strategic decision based on customer feedback and the direction of the platform. The 2024 roadmap is looking solid! We are always balancing the work required / difficulty and return.
  4. Collection Plan Management: The focus of the development was on managing the collection plan, which is crucial for the intelligence process, especially in content ingestion and matching.
  5. Features of the Source Library:
    • Provides a browsable view of all collection sources with status indicators making it easy to read.
    • Includes tags for categorizing sources based on topics – this is extremely useful for marking and returning to data.
    • Implements a risk scoring system for each source based on various factors, showing the high risk items.
    • Offers transparency and visibility to our customers.
  6. Continuous Development: The Source Library is considered a living thing and will be continuously updated and expanded as the platform evolves.
  7. Ransomware Data and Statistics: Customers can access ransomware statistics, filtering by industry vertical, group, and time period, to understand the frequency and distribution of ransomware attacks.
  8. Integration with Alerts: Each alert references a collection source, allowing users to quickly assess the risk level associated with the alert based on the source’s risk score.

I’d like to highlight the importance of listening to our customers. We pride ourselves on actively listening to feedback and requests. Whilst not all may be feasible, a lot are and we are focused on continuing to launch new features based on customer needs.

Thanks again to Daniel and Srdjan for the work on this!

If you have any questions about the Source Library or SOS Intelligence in general and how it can become part of your company’s cyber protection, please do get in touch.

Photo by Ryunosuke Kikuno on Unsplash

Product news

Business Update

We’ve had a lot going on since the start of the year and so I’ve recorded a short update for you. Click to watch and listen!

We are very thankful for all our customers, those who have been with us since we started and the new ones over the past months.

Product news

Join us for our next SOS Intelligence webinar on Understanding Third-Party Risk for Cybersecurity

I’m delighted to invite you to our next webinar on Wednesday 14th June at 11am for twenty minutes.

Understanding Third-Party Risk for Cybersecurity 

Who is this for?

  • Anyone in a business or organisation who has responsibility for online security.
  • CTOs or senior managers who want to understand the risks of third-party cyber breaches and how to monitor them.
  • MSSPs who would like to leverage our solution with their clients.

You will learn:

  • What are third-party cyber security risks and what are the common breaches + consequences
  • The role of cyber threat intelligence in third-party risk management
  • How SOS Intelligence will help you manage your risk and your third parties

We are recording the session so if you sign up and are not able to make it, you will be sent a replay.

Sign up takes seconds, just click the button below.

Product news

Supporting the Eastern Cyber Resilience Centre

We are delighted to announce that we are the newest Eastern Cyber Resilience Centre Community Ambassador.

The Eastern Cyber Resilience Centre (ECRC) supports and helps protect SMEs, supply chain businesses and third sector organisations in the East of England against cyber crime.

The ECRC began its journey in November 2020. Led by Policing and facilitated by Business Resilience International Management (BRIM), they have followed a structured modular programme based on a highly successful model that had previously been established for over 9 years in Scotland.

They work in structured partnership with regional Policing, Academia, Businesses, Third and Public Sector organisations through a variety of ways.

What is a Community Ambassador?

Community Ambassadors are local businesses who recognise that cyber resilience is essential for their own customers and supply chains and want to help the ECRC promote this message.

We fully support what the ECRC are doing and very much look forward to working closely with them in the future.

Product news

The new SOS Intelligence UI

I’m delighted to announce that our new UI is now live on the SOS Intelligence platform. This is something we have been working on for a good few months and is the culmination of customer feedback since launch.

Not only does it give a better experience visually, it’s more intuitive, easier to navigate and much simpler to use.

This is the first important step in a series of improvements across the platform. This development and investment in SOS Intelligence is part of our growth funding project, which we recently announced.

Our old UI, whilst ok, was not as good as it should be. Ever since launching SOS Intelligence it’s something that’s always caused me to wince slightly – the design and UI didn’t match the product.

Good software lives or dies by how easy it is to use and interact with and it sure helps to look nice too!

We’ve focused on improving the menus and navigation so that you can see exactly where you are and see how to get to the next thing. We’ve also made use of a full screen on desktop. Previously it felt cramped and we still had a lot of unused space. No more! We now have a well laid out screen which has easy-to-read visuals and the new colours.

Here is a walk through video showing the new UI:

You can see most of the new screens below with an explanation of what they are and what you can do:

Our new dashboard now gives you unparalleled information about your keyword alert performance. At a glance, view your most recent alerts, most popular collection type and keyword performance over time.
Dashboard

Our new alerts UI allows you to get the information you need fast. Highlighting of matched keywords enables you to zone in on exactly what’s been identified. View the full content for accurate context. Not only do we provide you with the full URL but also the full unredacted content.

Acknowledge the alert once you have completed your review. 

Provide feedback to us if the alert was useful or not, and you can provide a reason and commentary.

Alert management
Alerts

OSINT Search – You can view posts on a forum or any collection live, without having to have an account on that forum yourself. This is especially useful for closed forums. Narrow down your search with the Search by Date option or add a keyword if you are searching for something or someone specific.

OSINT Search

The new Dark Search – Use our Onion address search feature to search for just part of an onion address or URL – search for what you have or know and we will match the most relevant Onion service address.

Dark Search

Generate an on demand live screenshot of an onion website without having to use a Tor browser. Images on Onion sites are not rendered.

Dark Search

Search the dark web and retrieve thumbnails for Onion websites and text content, and generate on-demand screenshots for your search results. You can also customise your search by searching just the page titles, content, content & title or part of an onion address.

Dark Search

Last but not least, we have the user management:

User profile

It’s been a complex project, not only the design but also the integration into the code base and structure of the platform.

If you’d like to know more and let us show you how easy it is to use, then please book a demo call here. Thank you!

Product news

SOS Intelligence – Growth Fund grant from the NCSC For Startups programme

We are thrilled to announce that we have received a Growth Fund grant from the NCSC For Startups programme. This award will allow us to accelerate the development of our product and deliver both requested and innovative features to our clients. 

Amir Hadzipasic, CEO and Founder said:

“We are absolutely delighted to receive the grant from the NCSC Startups Programme. It’s going to make a significant difference for our development and timescales and we are grateful for the support. 

As Alumni of the programme, the continued mentorship and support helps significantly.”

Aamir Zaheer, Business Development Manager said:

“When speaking with existing clients and prospects, we also listen to their needs and suggestions. The Growth Fund grant allows us to accelerate our development to meet these needs and provide an affordable solution for businesses and organisations.

We recently announced a special plan for UK Charities, NHS Trusts and Schools, so we are very pleased for a strong start to 2023.”

Photo by micheile dot com on Unsplash

Product news

A Special Cyber Threat Intelligence Plan for UK Charities, NHS Trusts and Schools

We like brands, companies and organisations that do the right thing. They are for good. They want to help. Their product or service is helpful, is useful and goes some way to fight the bad in the world, and let’s face it, there is way too much of that right now.

So, we are also going to try and do the right thing. We are a startup, a fledgling business and one which has not got endless reserves and pots of cash. But, we strongly believe that by helping people we can develop a loyal customer in the future…

From today, if you are a UK charity, an NHS trust or a UK school, you can apply for a special account with SOS Intelligence, which gives you the first six months for free. An application takes seconds and, once approved, you can be up and running in minutes. We are offering this as we know it can make a huge difference to your cyber security, and we know that is more and more important.

Apply here.

What does this account include?

  • 10 Keyword Limit
  • 3 User Account Limit
  • Breach Monitoring, OSINT & Dark Web 
  • Excludes Domain Monitoring. 
  • Email Notification.

After the six months free time period, this will cost £200+VAT per month or £1,920+VAT with a 20% discount for 1 year.


We have seen time and time again that organisations who don’t act, even with intelligence we’ve come across ourselves, leave themselves open to tremendous risk.

Charities at increased risk

A new threat report published by the NCSC reveals why the charity sector is particularly vulnerable to cyber attacks, the methods used by criminals, and how charities can best defend themselves.

 “More charities are now offering online services and fundraising online, meaning reliable, trusted digital services are more important than ever. During the Ukraine crisis, we saw more criminals taking advantage of the generosity of the public, masquerading as charities for their own financial gain.”
Lindy Cameron, NCSC CEO

You can read their blog post here and download the report here.

Just one set of compromised credentials is all it takes. Imagine, if you will, knowing when a user has been compromised so that you can act and secure the account. Imagine seeing an alert, almost in real time, when some of your data has been posted on a dark web forum.

Intelligence means you can do something about it.

Please do share this far and wide – we want to help! 🙂

Apply here.



FAQs

  1. Who can apply? This is open for any UK charity, NHS trust or school. If you are a non-profit, don’t fit in these categories, but think you should be considered, you can fill out the form here and click no to the fit question – you will be prompted to enter more information and we will get back to you.
  2. How long is the free account for? It is for six months from the date of account sign up. When this period has finished, you will be charged on the card you used for sign up. The annual version gives you a 20% discount and is by far the most popular option.
  3. What if I don’t want to continue using SOS Intelligence? You will need to tell us prior to the end of the six months as otherwise you may be charged.
  4. Do you provide training? At present, we offer email support and screencasts to get you up and running.
  5. What is the process to apply? To apply, head on over to the application form here and we will be in touch as soon as possible. If successful you will receive an email with a link to sign up and a voucher code to use to give you the six month free access. 
  6. Do I need to add credit card details on sign up? Yes, we use Stripe for payment and this requires card details. However, you will not be initially charged as you will use a six month free voucher. At the end of the six months the plan will renew using the card details provided.
  7. What about domain / typo / squatting monitoring? This is not included on this plan but is on the Pro or Enterprise plans.
  8. What is typo-squatting? Typo-squatting is the act of registering domain names, i.e. Web Domains, that look similar to your legitimate domain name. Cyber Criminals may buy several domains across a number of different Top Level Domain Registrars. Typo-squatting could be used against you as a business to phish your employees or customers, or in order to commit fraud under your name or brand. The most common occurrence is 419 Advance Fee Fraud.

    SOS Intelligence monitors recently registered domain names from a large number of Top Level Domain Registrars and scans those against your domain type keywords.
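To give a feel for what scanning new registrations against a keyword can look like, here is a minimal sketch using Python's standard `difflib` similarity ratio. It is an illustration only and does not describe how the SOS Intelligence platform performs its matching; the function name, the 0.8 threshold and the example domains are all assumptions.

```python
from difflib import SequenceMatcher


def lookalike_domains(new_domains, keyword, legitimate=(), threshold=0.8):
    """Return newly registered domains whose first label resembles keyword."""
    known_good = {domain.lower() for domain in legitimate}
    hits = []
    for domain in new_domains:
        if domain.lower() in known_good:
            continue  # skip domains the organisation already owns
        name = domain.lower().split(".")[0]
        score = SequenceMatcher(None, name, keyword.lower()).ratio()
        if score >= threshold:
            hits.append(domain)
    return hits


# Made-up feed of newly registered domains, checked against the keyword "example".
feed = ["examp1e.com", "example.org", "weather.io", "example.com"]
suspects = lookalike_domains(feed, "example", legitimate=["example.com"])
```

Both the character-swapped name and the same name registered under a different top-level domain are flagged, while the unrelated domain and the legitimate one are ignored.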
Product news

Join us for our first SOS Intelligence webinar on December 8th at 11am

We are delighted to invite you to our first webinar. This is at 11am on Wednesday 8th December and will last around twenty minutes.

Hosted by myself, I’ll give you a short overview of the product and how it fits as an essential part of your business or organisation’s online security plus a demonstration of how easy it is to use the keyword alert feature.

Who is this for?

  • Anyone in a business or organisation who has responsibility for online security
  • CTOs who want to understand the risks of cyber breaches and how to monitor them
  • MSSPs who would like to leverage our solution with their clients

You will learn:

  • Why cyber threat intelligence and especially on the Dark Web is so vital
  • What SOS Intelligence does and what you can expect when using it
  • How it meets the needs of a modern business / organisation

All you need to do is click the button below. We look forward to seeing you!
