# How to use Scrapy to follow links on the scraped pages

In the previous blog post, I described the basics of Scrapy usage. Of course, a web spider which does not follow links is not very useful, so in this blog post, I am going to describe how to handle links.

# The wrong way

Let’s begin with doing it in the wrong way. I am going to parse the content of the page and follow all links.

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class Scraper(scrapy.Spider):
    name = "Scraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(Scraper)
process.start()
```


I ran that Scraper and my browser froze. I had to kill the Jupyter Notebook process. What happened?

I parsed the “Web scraping” page and followed all links in its content. Then I parsed all pages linked from the Web scraping page and followed their links. I kept following the links on every page. It is probably possible to reach every Wikipedia page if you keep opening all links on every page you see.

# Limit the number of links to follow

I don’t want to download the whole of Wikipedia. Obviously, I can filter the output of `response.css('div.mw-parser-output > p > a')` and follow only some of the links. I can even keep a counter of visited pages and stop following new links when I reach a threshold.

All such solutions require writing some code. I want to use something which is built into Scrapy.

How can I stop Scrapy from doing it without writing much code? The simplest solution is the DEPTH_LIMIT setting. This time Scrapy is going to follow links only from the first page and ignore the rest.

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithLimit(scrapy.Spider):
    name = "ScraperWithLimit"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}
```


What do we see in the log output? Among other things, "Ignoring link (depth > 1)" messages for every link the spider refused to follow.


What is the order of the requests? Is it breadth-first or depth-first?

According to the documentation, the spider requests the pages in the depth-first order.

We can change that with the DEPTH_PRIORITY setting, which is extremely unintuitive. In short: DEPTH_PRIORITY: 0 is the default (depth-first), DEPTH_PRIORITY: 1 gives breadth-first crawling (together with FIFO scheduler queues), and DEPTH_PRIORITY: -1 stays depth-first.
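Per the Scrapy documentation, a positive DEPTH_PRIORITY alone is not enough for breadth-first order; the scheduler queues have to be switched to FIFO as well. A settings sketch (drop it into a spider's `custom_settings` or the project settings):

```python
# Breadth-first crawl configuration: positive depth priority
# plus FIFO scheduler queues, as described in the Scrapy docs.
BFO_SETTINGS = {
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
}
```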

Wait a second. What if the page links to itself or if a page A links to B, page B links to C, and page C links back to page A? The spider will never finish!

Well, it is not that bad. Scrapy will not follow loops. There is the DUPEFILTER_CLASS configuration parameter, which by default uses scrapy.dupefilters.RFPDupeFilter to deduplicate requests.

We can disable deduplication by replacing it with scrapy.dupefilters.BaseDupeFilter, but most likely we will end up with a Spider requesting pages in an infinite loop.
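For reference, disabling deduplication is a one-line settings change (shown here as a spider's `custom_settings` fragment; usually a bad idea for the reason above):

```python
# Swap the default RFPDupeFilter for BaseDupeFilter,
# which filters nothing -- every request gets crawled.
custom_settings = {
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
}
```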

Fortunately, we can keep using the RFPDupeFilter, and if we ever want to visit a page more than once, we can set dont_filter=True on that particular request.

Such an approach works when we construct a new `scrapy.Request` ourselves instead of using the `follow` helper:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithDuplicateRequests(scrapy.Spider):
    name = "ScraperWithDuplicateRequests"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a::attr(href)').extract():
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}
```

