
How to bypass anti-scraping techniques in web scraping

Web scraping is a technique that enables quick, in-depth data retrieval. It helps people in all fields capture massive amounts of data and information from the internet.

As more and more people turn to web scraping to acquire data, tools like Octoparse are becoming popular, as they help people quickly turn web data into spreadsheets.

During this process, however, web scraping does put extra pressure on the target website. When a crawler runs unrestrained and sends an overwhelming number of requests to a website, the server could crash. As a result, many websites “protect” themselves with anti-scraping mechanisms to avoid being “attacked” by web-scraping programs.

Luckily, for those who use web scraping responsibly, there are solutions to bypass anti-scraping techniques and avoid being blocked by anti-scraping systems. In this article, we will look at some common anti-scraping mechanisms and discuss the corresponding solutions to tackle them.

1. Scraping speed matters

Most web scraping bots aim to fetch data as quickly as possible; however, this can easily expose you as a scraping bot, since no real human can surf the web that fast. Websites can track your access speed easily, and once the system finds you are going through the pages too fast, it will suspect you are not a human and block you by default.

Solution: We can set random time intervals between requests, i.e., we can either add “sleep” calls in the code when writing a script or set up a wait time when using Octoparse to build a crawler.
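
For the scripted route, here is a minimal sketch in Python using the requests library. The URLs and the 2–6 second delay range are placeholders chosen for illustration; tune them to the target site.

```python
import random
import time

import requests

# Placeholder URLs for illustration only
urls = [f"https://example.com/products?page={n}" for n in range(1, 6)]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Pause for a random 2-6 seconds so requests arrive at a human-like pace
    time.sleep(random.uniform(2, 6))
```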

2. Dealing with CAPTCHA

CAPTCHA is a challenge many websites use to tell human visitors apart from bots. Common types include:

Type 1: Click the CAPTCHA checkbox.

Type 2: Enter the CAPTCHA code shown in an image.

Type 3: Select the specified images from a set of given images.

Solution: With advances in image recognition technology, conventional CAPTCHAs can be cracked, though doing so can be costly. Tools like Octoparse provide cheaper alternatives, with somewhat compromised results.
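
If you script your own crawler, a simple safeguard is to detect when a CAPTCHA page has been returned and back off instead of hammering the site. The sketch below is only a rough heuristic: the “captcha” marker string and the URL are hypothetical and vary by site, and the back-off could be replaced by a hand-off to a solving service.

```python
import time

import requests

def fetch_with_captcha_check(url, max_retries=3):
    """Fetch a page, backing off when the response looks like a CAPTCHA challenge."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        # Checking for "captcha" in the body is a rough, site-specific heuristic
        if "captcha" not in response.text.lower():
            return response
        # Back off progressively before retrying (or hand off to a solving step here)
        time.sleep(30 * (attempt + 1))
    raise RuntimeError(f"Still blocked by CAPTCHA after {max_retries} attempts: {url}")

page = fetch_with_captcha_check("https://example.com/search?q=demo")
print(page.status_code)
```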

3. IP restriction

When a site detects a large number of requests coming from a single IP address, that IP address can easily be blocked. To avoid sending all of your requests through the same IP address, you can use proxy servers. A proxy server is a server (a computer system or an application) that acts as an intermediary for requests from clients seeking resources from other servers. It allows you to send requests to websites using the IP address you set up, masking your real IP address.

Of course, if you use only a single IP address set up in the proxy server, it is still easy to get blocked. You need to create a pool of IP addresses and use them randomly to route your requests through a series of different IP addresses.

Solution: Many services, such as VPNs, can help you get rotating IP addresses. Octoparse Cloud Service, for instance, is supported by hundreds of cloud servers, each with a unique IP address.

When an extraction task is set to be executed in the Cloud, requests are performed on the target website through various IPs, minimizing the chances of being traced. Octoparse local extraction allows users to set up proxies to avoid being blocked.
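
For a scripted crawler, a common pattern is to keep a pool of proxies and pick one at random for each request. The proxy addresses and URL below are hypothetical placeholders; substitute addresses from your own proxy provider.

```python
import random

import requests

# Hypothetical proxy pool; replace with addresses from your own proxy provider
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch_via_random_proxy(url):
    """Route each request through a randomly chosen proxy from the pool."""
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}
    return requests.get(url, proxies=proxies, timeout=10)

response = fetch_via_random_proxy("https://example.com/listings")
print(response.status_code)
```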

4. Scrape behind login

Logging in can be regarded as permission to gain access to more content on specific websites, such as Twitter, Facebook, and Instagram. Take Instagram as an example: without logging in, visitors can only see 20 comments under each post.

Solution: Octoparse works by imitating human browsing behaviors, so when login is required to access the data you need, you can easily incorporate the login steps, i.e., entering the username and password, as part of the workflow. More details can be found in Extract data behind a login.
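
In a hand-written crawler, the usual approach is to submit the login form once with a session object that keeps the cookies for later requests. The endpoint, form field names, and URLs below are hypothetical; inspect the real login form to find them, and note that many sites also require a CSRF token or similar extra fields.

```python
import requests

# Hypothetical login endpoint and form field names for illustration
LOGIN_URL = "https://example.com/login"
DATA_URL = "https://example.com/account/orders"

with requests.Session() as session:
    # The session stores the cookies set at login, so access persists across requests
    session.post(LOGIN_URL, data={"username": "my_user", "password": "my_password"})
    response = session.get(DATA_URL)
    print(response.status_code)
```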

5. JavaScript encryption

JavaScript encryption is used to keep content from being scraped. Crawlers written as simple scripts can easily be “tricked” by it.

Solution: JavaScript encryption undoubtedly makes it more difficult to scrape data sent via HTTP POST requests. With Octoparse, however, this can be handled easily, as Octoparse accesses the data from the target website directly in its built-in browser and then parses it automatically.
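
If you write your own crawler, one general way to handle content that is rendered or obfuscated by JavaScript is to let a real browser engine execute the page before extracting from it. The sketch below uses Selenium with headless Chrome; the URL is a placeholder, and this is a general-purpose technique, not a description of how Octoparse works internally.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/catalog")  # placeholder URL
    # page_source now holds the HTML after the site's JavaScript has executed
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()
```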

6. Be aware of honeypot traps

Honeypots are links that are invisible to normal visitors but are present in the HTML code, where web scrapers can find them. They act as traps that detect scrapers by directing them to blank pages. Once a visitor browses a honeypot page, the website can be relatively sure it is not a human visitor and starts throttling or blocking all requests from that client.

Solution: Octoparse uses XPath for precise capturing or clicking actions, so it avoids clicking the fake links (see how to use XPath to locate elements here).
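
If you script your own crawler, a simple safeguard is to filter out links a normal visitor could never see before following them. The sketch below uses requests and BeautifulSoup with a placeholder URL; it only checks the hidden attribute and inline styles, so it is a rough heuristic rather than a complete defense (honeypots can also be hidden via CSS classes or off-screen positioning).

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/directory", timeout=10)  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

visible_links = []
for link in soup.find_all("a", href=True):
    style = (link.get("style") or "").replace(" ", "").lower()
    # Skip links that are hidden from a normal visitor
    if link.get("hidden") is not None or "display:none" in style or "visibility:hidden" in style:
        continue
    visible_links.append(link["href"])

print(visible_links[:10])
```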

7. Pages with different layouts

To avoid being scraped easily, some websites are built with slightly different page layouts. For example, pages 1 to 10 of a directory listing may look slightly different from pages 11 to 20 of the same list.

Solution: There are two ways to solve this. For crawlers written as scripts, extra code is needed to handle each layout, as sketched below. For crawlers built with Octoparse, you can easily add a “Branch Judgment” to the workflow to tell the different layouts apart and then extract the data precisely.
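
For the scripted route, a common pattern is to try the selector for one layout and fall back to an alternate selector when it is missing. The selectors and URL below are hypothetical, purely for illustration.

```python
import requests
from bs4 import BeautifulSoup

def extract_title(html):
    """Try the primary layout first, then fall back to the alternate one."""
    soup = BeautifulSoup(html, "html.parser")
    # Layout A: title sits in an <h1 class="product-title"> (hypothetical selector)
    node = soup.select_one("h1.product-title")
    if node is None:
        # Layout B: some pages use a different structure (hypothetical selector)
        node = soup.select_one("div.item-header span.title")
    return node.get_text(strip=True) if node else None

html = requests.get("https://example.com/item/42", timeout=10).text  # placeholder URL
print(extract_title(html))
```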

I hope the tips above help you build your own solution or improve your current one. You’re welcome to share your ideas with us, or to let us know if you feel anything can be added to the list.
