Making your Scrapy spider undetectable by applying basic statistics
Scrapy can either call a website at roughly fixed intervals or use the AutoThrottle extension to delay the next request. AutoThrottle measures the website's latency and uses it to send requests as often as the server can comfortably handle. Neither approach hides the fact that a bot is at work: from the server's perspective, the intervals between requests are almost constant. We can, however, make the traffic look more like it was generated by a human.
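For reference, both built-in pacing modes are controlled from the project settings. The setting names below are Scrapy's own; the values are only illustrative assumptions, not recommendations.

```python
# settings.py (illustrative values)

# Mode 1: a fixed base delay between requests to the same site, in seconds.
DOWNLOAD_DELAY = 2.0
# By default Scrapy multiplies the base delay by a uniform random factor
# between 0.5 and 1.5, so the gaps stay within a narrow band.
RANDOMIZE_DOWNLOAD_DELAY = True

# Mode 2: AutoThrottle adjusts the delay based on the measured latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0         # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site
```

Even with the built-in randomization, the gaps are spread uniformly around the base delay, which is a different shape from the gaps produced by independent human visitors.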
Many real-life processes follow an exponential distribution. It can be used to predict the time between phone calls in a call center or requests to a website.
The distribution applies to situations in which event occurrences are independent, happen at a constant average rate and occur continuously.
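A quick way to see those conditions at work is to scatter independent events over a time window and look at the gaps between them. This is only a sketch using the standard library; the number of events and the window length are assumptions chosen to give an average rate of 0.2 events per second.

```python
import random


def gaps_between_random_events(n_events, duration, seed=0):
    """Scatter events uniformly at random over [0, duration] and return the gaps."""
    rng = random.Random(seed)
    times = sorted(rng.uniform(0.0, duration) for _ in range(n_events))
    return [later - earlier for earlier, later in zip(times, times[1:])]


# 100,000 events in 500,000 seconds: an average rate of 0.2 events per second.
gaps = gaps_between_random_events(n_events=100_000, duration=500_000.0)
mean_gap = sum(gaps) / len(gaps)

# For exponentially distributed gaps, about 63% (1 - 1/e) are shorter than the mean.
share_below_mean = sum(gap < mean_gap for gap in gaps) / len(gaps)
print(f"mean gap: {mean_gap:.2f} s, share below the mean: {share_below_mean:.3f}")
```

The events know nothing about each other, yet the gaps between them come out approximately exponential: many short gaps and a few very long ones.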
Obviously, sometimes there are exceptions. Websites tend to be more popular during some hours of a day. Phone calls may occur more often in the afternoon, but it does not mean that exponential distribution does not work. It does, but its parameters change throughout the day.
It also applies only to normal situations. If we ran a tech support call center and received the first call reporting a malfunction, we would know that many more calls would follow soon. Most likely they would be exponentially distributed too, but with dramatically different parameters.
Let’s look at the probability density function of the exponential distribution. What we see is the long tail, which means that long delays are less likely to happen. We also notice that the density is highest at zero, so very short delays are the most likely ones.
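The curve is f(x) = lambda * exp(-lambda * x) for x >= 0. The sketch below evaluates the density and the tail probability at a few points using only the standard library; the rate of 1.0 is an arbitrary assumption. Keep in mind that the density is a height, not a probability: a delay of exactly zero has probability zero, short delays are simply the most common ones.

```python
import math


def exponential_pdf(x, rate):
    """Density of the exponential distribution at x (zero for negative x)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def probability_delay_exceeds(t, rate):
    """P(delay > t), the tail of the exponential distribution."""
    return math.exp(-rate * t) if t >= 0 else 1.0


rate = 1.0  # assumed rate: one event per unit of time
for t in (0.0, 1.0, 2.0, 5.0):
    density = exponential_pdf(t, rate)
    tail = probability_delay_exceeds(t, rate)
    print(f"t = {t}: density = {density:.4f}, P(delay > t) = {tail:.4f}")
```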
Define the distribution
To generate delays according to a given exponential distribution, we are going to use SciPy. Remember that the distribution has one parameter, lambda, but the SciPy implementation is parameterized with loc and scale instead. So what are we going to do?
According to the documentation, we can set the lambda parameter by passing the scale equal to 1 / lambda.
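The reason this works is the general loc/scale rule described in the SciPy documentation: pdf(x, loc, scale) equals pdf((x - loc) / scale) / scale, where the standardized exponential density is exp(-x). Here is a pure-Python sketch of that rule (the function names are mine, and 0.2 is just a sample rate) showing that scale = 1 / lambda reproduces lambda * exp(-lambda * x):

```python
import math


def standard_expon_pdf(y):
    """Standardized exponential density exp(-y), defined for y >= 0."""
    return math.exp(-y) if y >= 0 else 0.0


def shifted_scaled_pdf(x, loc=0.0, scale=1.0):
    """Apply the loc/scale rule: pdf(x, loc, scale) = pdf((x - loc) / scale) / scale."""
    return standard_expon_pdf((x - loc) / scale) / scale


lambda_param = 0.2  # sample rate used only for this check
for x in (0.0, 1.0, 5.0, 20.0):
    via_scale = shifted_scaled_pdf(x, loc=0.0, scale=1 / lambda_param)
    via_rate = lambda_param * math.exp(-lambda_param * x)
    assert math.isclose(via_scale, via_rate)
```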
What does the lambda mean?
It is the rate of the event occurrences. It tells us how often on average the event occurs in a given unit of time. We can estimate the rate by looking at real occurrences of the event and fitting the data to a distribution, just like I did in my Monte Carlo simulation blog post.
In this example, I am going to set the rate to 0.2 events per second, which means one request every 5 seconds on average. It is going to be a slow scraper, but the goal was to make it undetectable, not efficient.
```python
from scipy.stats import expon

lambda_param = 0.2
distribution = expon(scale=1 / lambda_param)
```
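As a cross-check that does not depend on SciPy, the standard library has `random.expovariate`, which takes the rate directly instead of a scale. This is a side sketch rather than a replacement for the code above; the seed and the sample size are arbitrary assumptions.

```python
import random

lambda_param = 0.2             # events per second
mean_delay = 1 / lambda_param  # 5 seconds between events on average

rng = random.Random(42)        # seeded only to make the sketch repeatable
delays = [rng.expovariate(lambda_param) for _ in range(100_000)]
sample_mean = sum(delays) / len(delays)

print(f"expected mean: {mean_delay:.1f} s, sample mean: {sample_mean:.2f} s")
```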
Let’s check whether it worked. First, I am going to plot the probability density function of my distribution:
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 20, 2000)
_, ax = plt.subplots(1, 1)
ax.plot(x, distribution.pdf(x), '-', lw=2)
plt.title('Exponential distribution with lambda = 0.2')
plt.show()
```
Looks good. Let’s generate 10000 random values using that distribution and plot a histogram:
```python
import pandas as pd

generated_values = distribution.rvs(10000)

pd.DataFrame(
    generated_values,
    columns=["Exponential distribution, rate = 0.2, 10000 random values"]
).hist(bins=50)
```
Make it work in Scrapy
To add such delays to Scrapy, we must modify the AutoThrottle implementation. First, let’s copy the source code of the extension from GitHub.
Now, I am going to change the implementation a little bit. We need the lambda parameter which should be passed as a configuration variable, so let’s add the following code to the
```python
from scipy.stats import expon

lambda_param = crawler.settings.getfloat("AUTOTHROTTLE_LAMBDA")
self.random_delay_distribution = expon(scale=1 / lambda_param)
```
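The snippet reads a custom setting, AUTOTHROTTLE_LAMBDA, so the project settings have to define it, and Scrapy has to load the modified class instead of the built-in extension. A minimal sketch of such settings follows; the module path `myproject.extensions.RandomDelayAutoThrottle` is a hypothetical name for wherever the copied class lives.

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_LAMBDA = 0.2  # custom setting read by the modified extension, events per second

EXTENSIONS = {
    # Disable the built-in extension so that only the modified copy adjusts the delay.
    "scrapy.extensions.throttle.AutoThrottle": None,
    # Hypothetical import path of the copied and modified class.
    "myproject.extensions.RandomDelayAutoThrottle": 0,
}
```

If AUTOTHROTTLE_LAMBDA is missing, `getfloat` falls back to 0.0 and `1 / lambda_param` raises a ZeroDivisionError, so defining the setting is not optional.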
I am not going to replace the existing implementation with a completely new one. Instead, I will generate a random delay after every request and add it to the delay calculated by the AutoThrottle extension. After my changes, the code of the _adjust_delay function looks like this (note that I removed the existing comments to make it shorter):
```python
def _adjust_delay(self, slot, latency, response):
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0

    # Draw one random delay from the distribution built from AUTOTHROTTLE_LAMBDA.
    random_delay = self.random_delay_distribution.rvs()
    # The rate is per second and Scrapy's download delays are in seconds too,
    # so the value is added without any unit conversion.
    new_delay = new_delay + random_delay

    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)

    if response.status != 200 and new_delay <= slot.delay:
        return
    slot.delay = new_delay
```
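To get a feeling for what this policy does without running a crawl, the same arithmetic can be pulled into a plain function and fed with simulated values. This is only a sketch under my own assumptions: every quantity is in seconds, the latency is a constant 0.3 s, the bounds are 0 and 60 s, and the function name and parameters are hypothetical, not part of Scrapy.

```python
import random


def adjust_delay(current_delay, latency, random_delay, status=200,
                 target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the next delay in seconds, following the same steps as _adjust_delay."""
    target_delay = latency / target_concurrency
    new_delay = (current_delay + target_delay) / 2.0
    new_delay += random_delay
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(min_delay, new_delay), max_delay)
    if status != 200 and new_delay <= current_delay:
        return current_delay  # keep the old delay, as the extension does
    return new_delay


rng = random.Random(7)  # seeded only to make the sketch repeatable
delay = 5.0             # assumed starting delay in seconds
history = []
for _ in range(1000):
    delay = adjust_delay(delay, latency=0.3, random_delay=rng.expovariate(0.2))
    history.append(delay)

mean_history = sum(history) / len(history)
print(f"min {min(history):.2f} s, mean {mean_history:.2f} s, max {max(history):.2f} s")
```

Because every new delay averages in the previous one, which already contains a random component, part of each random draw carries over to the following requests. The resulting gaps are therefore smoother and longer on average than the raw exponential draws.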
The other aspects of hiding a scraper
Making the delays more human-like won’t help at all if the spider sends a bizarre User-Agent header or sends all requests from the same IP address. There is more to hiding a scraper than just delays.
Law and ethics
Many websites have conditions of use which explicitly forbid using scrapers. Remember to always read and obey their conditions of use! If the website offers an API, always use it to get the data instead of scraping the content.
If there is no API and they don’t allow using scrapers, you can always ask for the data you need. You will be surprised how often you get what you wanted.