
How to Effectively Block Bad Bots in 2020


A bot is a software program that automatically interacts with a website (or elements of a website) over the internet. A bot can be either ‘good’ or ‘bad’: a good bot won’t disrupt the website visitor’s experience, and can in fact provide various benefits. A bad bot, on the other hand, can perform various malicious tasks including: 

  • Stealing users’ information and reusing it
  • Scraping the site’s content and data

Not only are bad bots dangerous for both the website owner and users; they can also skew your site metrics (for example, by inflating your traffic numbers) and burden your server with additional traffic.

In short, bad bots have direct impacts on any website, and website owners must minimize or even eliminate this bad bot traffic altogether. 

Potential Impacts of Bad Bots

We have touched on some common impacts of bad bots above. Here are some other ways bad bots can harm your site, and why you should care:

1. Affecting Your Link Profile

Attackers and bot operators may sell backlinks from your site to other parties by ‘injecting’ low-quality links to their clients’ websites into your content. In most cases this is done via your site’s comment section (so the effect on your rankings is usually limited), but it can hurt your visitors’ experience and might send your readers to scam websites.

2. Data Breach and Theft

One of the most common applications for bad bots is to steal sensitive information that users put into forms and comments. Attackers can then use this information to launch an even bigger attack or sell it to competitors. 

Newer bad bots can even harvest users’ financial data, such as credit card details. Stolen user information can significantly hurt a website’s reputation, so website owners should take extra caution to prevent this issue.

3. Content Scraping and Duplication

Bots can scrape your website and copy your content, then republish it elsewhere without your permission. This can create a duplicate-content issue if you haven’t set a canonical URL for your original content; if you are not careful, your site’s SERP ranking can drop and Google might even penalize your site.

4. Higher Advertising Costs

Bots can skew various metrics on your site and even click on your ads, which can significantly affect your advertising costs. For example, when traffic is inflated by bot activity, ad publishers can charge more for advertising space, and advertisers end up paying a premium for fake traffic. This can hurt your website in the long run.

How We Can Identify and Block Bad Bots

In general, there are several potential ways to block bad bots. However, it’s very important to properly identify these bad bots before blocking them, for two reasons:

  • As mentioned, we have to differentiate between traffic coming from good bots and traffic coming from bad bots. Allowing good bots like Google’s crawlers to keep working is very important, since they provide real benefits to our site.
  • We also need to make sure we are not blocking legitimate traffic from actual users for obvious reasons. 

So we need a system that can reliably recognize traffic coming from bad bots; only then can we block them effectively.

With that being said, there are several approaches we can implement: 

Manual Approach

This is the simplest and also the most cost-effective approach, but you’d need to spend time creating your own solution depending on the severity of your case. 

In general, there are three core steps involved in this approach:

  • Observe and identify bad bots
  • Log these bad bots
  • Block their activities using your web server directives

There are several techniques you can use to identify these bad bots, for example comparing crawler activity against your robots.txt directives to find bots that ignore them.

Your web server also keeps a record of every request to the website in its access log files. The location varies by platform, but it should be easy enough to find. You can then open the log files and review the incoming activity.
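
As a starting point, here is a minimal log-review sketch in Python. It assumes the common Apache/nginx “combined” log format and a hypothetical log path; adjust both for your own server.

```python
# Minimal sketch: tally requests per IP and user agent from an access log.
# Assumes the common Apache/nginx "combined" log format; the log path below
# is a placeholder.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path; yours may differ

# combined format: IP - user [time] "request" status bytes "referrer" "user-agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

ip_counts = Counter()
agent_counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if not match:
            continue
        ip_counts[match.group("ip")] += 1
        agent_counts[match.group("agent")] += 1

# The heaviest requesters are a good starting point for manual review.
print("Top IPs:", ip_counts.most_common(10))
print("Top user agents:", agent_counts.most_common(10))
```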

In the past, bots were usually identified by IP address: most bot operators relied on data center proxies, so blocking the proxy’s IP address was enough to block the bad bot. Today, however, bot operators can quite easily (and cheaply) rotate through thousands or even millions of IPs via various proxy services, so IP-based detection is, in most cases, no longer very effective.

Instead, we can create a honeypot (or trap) by adding a link to a section of your site that isn’t visible to human users. This area should be disallowed in your robots.txt file, so any visitor that requests the link is a bot ignoring your directives, and very likely one with malicious intent. You can then create a script that logs the details of that bot.
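
Here is one way such a trap could look, sketched with Python and Flask. The /trap route name, log file, and framework are illustrative assumptions rather than a prescribed setup; the matching robots.txt entry and the hidden link are shown as comments.

```python
# Minimal honeypot sketch using Flask. The /trap route, log file name and
# framework are illustrative choices, not requirements.
#
# In robots.txt, tell well-behaved crawlers to stay away:
#   User-agent: *
#   Disallow: /trap
#
# On your pages, add a link to /trap that humans never see, e.g.:
#   <a href="/trap" style="display:none" rel="nofollow">do not follow</a>

import datetime
from flask import Flask, request

app = Flask(__name__)

@app.route("/trap")
def trap():
    # Anything requesting this URL ignored robots.txt and followed an
    # invisible link, so log its details for later blocking.
    with open("honeypot.log", "a", encoding="utf-8") as log:
        log.write(
            f"{datetime.datetime.utcnow().isoformat()} "
            f"{request.remote_addr} "
            f"{request.headers.get('User-Agent', '-')}\n"
        )
    return "", 404  # give the bot nothing useful back

if __name__ == "__main__":
    app.run()
```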

In general, here are some common signs of bad bot activities that you can use to identify them: 

1. Very Long or Very Short Session Durations:

Human users typically take a broadly similar amount of time to consume the content on a given page, so sessions that are extremely long or extremely short compared to that baseline may indicate bot activity.

2. Higher Bounce Rate:

A spike in bounce rate (the percentage of visitors who view a page and leave without moving to another page or clicking anything) can be a sign of web scraping bots that are designed to scrape an entire website.

3. Higher Page Views:

A sudden spike in page views might indicate bad bot activity.

4. Spam Content:

Bad bots might fill forms with spam or junk content to flood your inbox or eat up your resources.

5. High Server Resource Usage:

When your site becomes slower, or when there’s a sudden spike in CPU or network usage, it is a relatively clear indication of bad bot presence.
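
To make these signals concrete, here is a rough Python heuristic that flags IP addresses whose request rate looks non-human. The 60-requests-per-minute threshold is an assumption for illustration only; tune it against your own baseline traffic.

```python
# Rough heuristic sketch: flag IPs whose request rate looks non-human.
# The threshold below is illustrative; tune it against your normal traffic.
from collections import defaultdict
from datetime import datetime, timedelta

REQUESTS_PER_MINUTE_LIMIT = 60  # assumed threshold for this example

def suspicious_ips(events):
    """events: iterable of (ip, datetime) pairs taken from your access log."""
    per_ip = defaultdict(list)
    for ip, ts in events:
        per_ip[ip].append(ts)

    flagged = set()
    for ip, times in per_ip.items():
        times.sort()
        # Sliding window: count how many requests fall in any 60-second span.
        start = 0
        for end in range(len(times)):
            while times[end] - times[start] > timedelta(seconds=60):
                start += 1
            if end - start + 1 > REQUESTS_PER_MINUTE_LIMIT:
                flagged.add(ip)
                break
    return flagged

# Example usage with two made-up events (documentation IP range):
sample = [("203.0.113.9", datetime(2020, 1, 1, 12, 0, 0)),
          ("203.0.113.9", datetime(2020, 1, 1, 12, 0, 1))]
print(suspicious_ips(sample))
```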

So, what should we do after we’ve identified these bots? We can use our web server’s configuration to block them via several possible approaches:

1. IP Blocking:

Useful when the IP address is known, but, as mentioned, fairly ineffective today since bot operators can simply rotate between many different addresses.

2. User-Agent:

Some bots can be identified by a unique user-agent string that can be distinguished from those of search engine crawlers and regular browsers.

3. Referrer:

Useful in cases where the referrer is a known source of bad bots and spam.

There are also other possible approaches, but the key is to find something unique about the bot’s requests and use that identifier to block it.
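
As an illustration, here is a sketch of that idea expressed as application-level filtering in Python with Flask. The block lists are made-up examples; in practice you would usually express the same rules in your web server configuration (for example, Apache or nginx directives) rather than in application code.

```python
# Sketch of request filtering by IP, user agent and referrer, written as a
# Flask before_request hook. All block-list values below are invented
# examples for illustration.
from flask import Flask, request, abort

app = Flask(__name__)

BLOCKED_IPS = {"198.51.100.23"}                   # example known-bad address
BLOCKED_AGENT_SUBSTRINGS = ("BadBot", "scrapy")   # example user-agent fragments
BLOCKED_REFERRERS = ("spam-example.invalid",)     # example spam referrer

@app.before_request
def block_bad_bots():
    agent = request.headers.get("User-Agent", "")
    referrer = request.headers.get("Referer", "")
    if (request.remote_addr in BLOCKED_IPS
            or any(s.lower() in agent.lower() for s in BLOCKED_AGENT_SUBSTRINGS)
            or any(r in referrer for r in BLOCKED_REFERRERS)):
        abort(403)  # refuse the request before it reaches any route

@app.route("/")
def index():
    return "Hello, human!"
```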

Place CAPTCHA On Sensitive Areas

CAPTCHA stands for “Completely Automated Public Turing Test to Tell Computers and Humans Apart”, and as the name suggests, it is used to filter out bots from human users. The main idea behind CAPTCHA is that it should be (very) easy for humans to solve, yet very difficult for bots.

However, having too many CAPTCHAs on your site can disrupt your site’s user experience (UX) and can also get in the way of good bots whose activities benefit your site.

With that being said, you might want to consider placing a CAPTCHA on: 

1. Signup/Login Pages:

Pretty obvious, since these are the main target of brute-force attacks by bad bots.

2. Comment Sections:

If you have a blog on your site, consider protecting your comment section with a CAPTCHA; it is a frequent target of bad bots that spam it with low-quality and even scam links.

3. Any Surveys/Polls/Forms:

This is important so bad bots can’t submit fake data that might skew your metrics and eat up your database’s resources.

Thankfully, nowadays it’s fairly easy and affordable to implement a CAPTCHA on your site. You can, for example, use Google’s reCAPTCHA, which is free and fairly reliable.
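
For example, with reCAPTCHA v2 the form posts a g-recaptcha-response token that your server verifies against Google’s siteverify endpoint. Below is a minimal Python sketch of that server-side check; the secret key is a placeholder, and the requests library is an assumed dependency.

```python
# Server-side verification sketch for Google reCAPTCHA v2. The secret key is
# a placeholder; the endpoint and parameter names come from Google's
# documented siteverify API. Requires the "requests" package.
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder: use your real secret key
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def captcha_passed(recaptcha_response, client_ip=None):
    """recaptcha_response: the g-recaptcha-response value posted by the form."""
    payload = {"secret": RECAPTCHA_SECRET, "response": recaptcha_response}
    if client_ip:
        payload["remoteip"] = client_ip
    result = requests.post(VERIFY_URL, data=payload, timeout=5).json()
    return bool(result.get("success"))

# In your signup, login or comment handler, reject the submission unless
# captcha_passed(...) returns True.
```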

Using Web Application Firewalls

A Web Application Firewall (WAF) is a firewall designed specifically for HTTP/S-based applications; it applies a specific set of rules to an HTTP conversation.

A WAF can be thought of as a reverse proxy: whereas a regular proxy typically protects the client’s device, a WAF protects the server. As the name suggests, a WAF is deployed to protect one or more specific web applications.

Although not specifically designed to protect sites against bots, a WAF can block bad bot activity based on various criteria like IP address, source location, and user agent. However, WAFs aren’t very effective against newer bots that mimic the behavior of real human users.

Using a Third-Party Bot Detection Solution to Protect Your Website

As we can see, blocking bad bots effectively requires an advanced approach: you want to accurately detect bad bots without affecting legitimate traffic or good bot activity (false positives), and you also want to avoid false negatives (bad bots that are recognized as legitimate users).

The most common approaches to identifying and blocking bots are blocking ranges of IP addresses, blocking geolocations/countries, WAFs, and web server configurations, as discussed above. However, these tend to be ineffective against newer, more advanced bots that can mimic human behaviors (like non-linear mouse movements). This is why behavior-based bot detection software like DataDome is needed to block these more sophisticated bad bots.

Today’s bad bots can rotate between thousands or even millions of IP addresses and can execute human-like behaviors, which makes them very difficult to identify. Many bots also adopt “behavior hijacking” techniques, changing their identifiable characteristics over time by mimicking human behaviors.

While conventional bot mitigation services analyze interactions with the website (like mouse movements and clicking patterns), bad bots can now imitate these human interactions to evade such detection measures. This is where behavior-based bot detection like DataDome can help.

Final Words

Bot detection is more difficult than ever: today’s bots can almost perfectly imitate human behavior and technologies, and they can rotate through huge numbers of IPs, making IP-based detection solutions obsolete.

Real-time behavioral analysis is arguably the only effective solution nowadays, and considering the potential impact of bad bots on your digital assets, it’s very important to figure out how you will block these bad bots effectively. 
