To extract data from a webpage reliably, you need to make your scraper undetectable. The two main families of techniques for that are imitating a real browser and simulating human behavior. For example, a normal user wouldn’t make 100 requests to a website in one minute.
Be a Responsible Scraper: Make sure you check your target’s terms of service. Also, scrape during off-peak hours so you don’t degrade the site’s performance for other users. Use a legitimate proxy provider and give the system some breathing space between requests. Don’t be greedy for data: outline your requirements and collect only what you need.
Set Real Request Headers
Web browsers send many request headers that plain HTTP clients and libraries don’t.
Luckily, this is easy to fix. Go to Httpbin (https://httpbin.org/headers) in your browser to see the request headers it sends, then replicate them in your scraper.
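Here’s a minimal sketch with the requests library; the header values are examples and should mirror whatever your own browser reports:

```python
import requests

# Example browser-like headers; replace the values with what Httpbin
# reports for your own browser.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

# Sending the same headers from the scraper makes the request look browser-like.
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json())  # shows the headers the server actually received
```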
User-Agent Rotation
Vary the user-agent string with each request to mimic different browsers and devices. This can help prevent being flagged as a bot.
Here’s a minimal sketch of rotating user agents in Python with the requests library (the user-agent strings are example values):
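```python
import random
import requests

# Example pool of user-agent strings; keep this list current in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a different user agent for every request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = fetch("https://httpbin.org/headers")
print(response.json())
```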
Use Proxies
Rotate through a pool of proxies to mask your real IP address and spread requests across many different IPs, reducing the likelihood of being detected and blocked.
An essential distinction between proxies is that some use a data center IP while others rely on a residential IP.
Data center IPs are reliable but easy to identify and block. Residential IP proxies are harder to detect because their addresses belong to an Internet Service Provider (ISP) and could just as well be assigned to a real user.
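As a minimal sketch, assuming the requests library and a hypothetical list of proxy endpoints (replace the addresses with ones from your provider):

```python
import random
import requests

# Hypothetical proxy endpoints; substitute the addresses your provider gives you.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch_via_proxy(url):
    proxy = random.choice(PROXIES)
    # Route both HTTP and HTTPS traffic through the chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = fetch_via_proxy("https://httpbin.org/ip")
print(response.json())  # the exit IP the target website sees
```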
Free Proxies
Free, public proxies are easy to find, but they tend to be slow, unstable, and already blocked by many websites, so they’re a poor fit for anything beyond quick tests.
Premium Proxies
High-speed, reliable proxies with residential IPs are sometimes referred to as premium proxies. It’s common to use this type of proxy for production crawlers and scrapers.
Proxy Network Services
Consider utilizing proxy network services that offer advanced features such as automatic IP rotation, residential IP addresses, and built-in anti-blocking measures.
Use Headless Browsers
Use a headless browser, driven by a tool such as Selenium, to render JavaScript-heavy pages. Executing the page’s JavaScript the way a real browser does makes the scraping process less detectable.
However, even if you use an official browser in headless mode, you need to make its behavior look real. It’s common to add some special request headers to achieve that, like a User-Agent.
Selenium and other browser automation suites allow you to combine headless browsers with proxies. That will enable you to hide your IP and decrease the risk of being blocked.
Popular browser automation tools include Selenium, Puppeteer, and Playwright.
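Here’s a minimal Selenium sketch with headless Chrome; the User-Agent, proxy address, and target URL are example values:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
# Example User-Agent; mirror what a real browser sends.
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# Hypothetical proxy endpoint; replace with one from your provider.
options.add_argument("--proxy-server=http://203.0.113.10:8080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # hypothetical target
    print(driver.title)
finally:
    driver.quit()
```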
Outsmart Honeypot Traps
Some websites will set up honeypot traps. These are mechanisms designed to attract bots while being unnoticed by real users. They can confuse crawlers and scrapers by making them work with fake data.
Some of the most basic honeypot traps are links that exist in the website’s HTML code but are invisible to humans. Make your crawler or scraper detect links whose CSS properties render them invisible, and skip them.
Ideally, your scraper shouldn’t follow text links that have the same color as the background or are otherwise hidden from users on purpose. Below is a basic sketch, written in Python with Selenium, that keeps only the links the browser actually displays.
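```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")  # hypothetical target

    safe_links = []
    for link in driver.find_elements(By.TAG_NAME, "a"):
        # is_displayed() is False for elements hidden with display: none,
        # visibility: hidden, zero size, and similar tricks. Links colored
        # to blend into the background need an extra check of their computed CSS.
        if link.is_displayed():
            safe_links.append(link.get_attribute("href"))

    print(f"Found {len(safe_links)} visible links")
finally:
    driver.quit()
```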
Some sites add elements that are invisible to human users and are meant to be read only by bots and scrapers. Program your crawler to skip elements that come with the display: none or visibility: hidden properties, since they’re often a trap.
Another fundamental way to avoid honeypot traps is to respect the robots.txt file. It’s written only for bots and contains instructions about which parts of a website can be crawled or scraped and which should be avoided.
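A quick sketch with Python’s built-in urllib.robotparser; the site URL and bot name are placeholders:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical target
rp.read()

url = "https://example.com/products/page-1"
if rp.can_fetch("MyScraperBot", url):
    print("Allowed to fetch:", url)
else:
    print("Disallowed by robots.txt:", url)

# Honor a crawl delay if the site declares one (None if it doesn't).
print("Crawl delay:", rp.crawl_delay("MyScraperBot"))
```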
Avoid Fingerprinting
If you change a lot of parameters in your requests but your scraper still gets blocked, you might’ve been fingerprinted: the anti-bot system has found a way to identify you and block your activity. To reduce the chances of that happening:
- Don’t make the requests at the same time every day. Instead, send them at random times.
- Change IPs often.
- Forge and rotate TLS fingerprints. You can learn more about this in our article on bypassing Cloudflare.
- Use different request headers, including other User-Agents.
- Configure your headless browser to use different screen sizes, resolutions, and installed fonts (see the sketch after this list).
- Use different headless browsers.
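As a small sketch of the last few points, assuming Selenium with headless Chrome and example pools of window sizes and user agents:

```python
import random

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Example values; extend these pools with realistic, internally consistent combinations.
WINDOW_SIZES = ["1920,1080", "1366,768", "1536,864"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

options = Options()
options.add_argument("--headless=new")
options.add_argument(f"--window-size={random.choice(WINDOW_SIZES)}")
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # hypothetical target
    # Confirm the viewport the page actually reports.
    print(driver.execute_script("return [window.innerWidth, window.innerHeight];"))
finally:
    driver.quit()
```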
Bypass Anti-bot Systems
If your target website uses Cloudflare, Akamai, or a similar anti-bot service, your requests have probably been blocked before you could scrape the URL. Bypassing these systems is challenging, but possible.
Widely used anti-bot providers include Akamai, Cloudflare, Shape, and PerimeterX.
Request Rate Limiting
Mimic human behavior by controlling the rate of requests made to the target website. Implement random delays between requests to avoid triggering rate limits or detection algorithms.
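A minimal sketch with the requests library, using random delays between requests (the URLs and delay range are example values):

```python
import random
import time

import requests

# Hypothetical list of pages to scrape.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Wait a random 2-8 seconds so the request pattern doesn't look machine-like.
    time.sleep(random.uniform(2, 8))
```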
Session Management
Emulate human browsing patterns by managing sessions and cookies, allowing you to maintain state across multiple requests and appear more like a genuine user.
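A short sketch with requests.Session, which carries cookies and connection state across requests the way a browser does within one visit (the URLs and header value are examples):

```python
import requests

session = requests.Session()
# Example User-Agent; use a full, realistic string in practice.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    )
})

# Hypothetical flow: the first page sets cookies that later pages expect.
session.get("https://example.com/")
response = session.get("https://example.com/products")

print(session.cookies.get_dict())  # cookies carried over automatically
print(response.status_code)
```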
Stop Repeated Failed Attempts
It’s best to detect and log failed attempts and get notified when they happen, so you can suspend scraping before the blocks escalate.
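A minimal sketch of that idea: count consecutive blocked responses and stop once a threshold is hit (the URLs, status codes, and threshold are example choices; wire in your own alerting where indicated):

```python
import logging
import sys

import requests

logging.basicConfig(level=logging.INFO)
MAX_CONSECUTIVE_FAILURES = 3  # example threshold

failures = 0
urls = [f"https://example.com/page/{i}" for i in range(1, 20)]  # hypothetical pages

for url in urls:
    response = requests.get(url)
    if response.status_code in (403, 429):  # typical "blocked" responses
        failures += 1
        logging.warning("Blocked response %s for %s (%s in a row)",
                        response.status_code, url, failures)
        if failures >= MAX_CONSECUTIVE_FAILURES:
            # Hook in your alerting (email, Slack, etc.) here, then stop.
            logging.error("Too many failed attempts; suspending the scraper.")
            sys.exit(1)
    else:
        failures = 0  # reset the counter on success
```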