In the previous blog post, I described the basics of Scrapy usage. Of course, a web spider which does not follow links is not very useful, so in this blog post, I am going to describe how to handle links.

Table of Contents

  1. The wrong way
  2. Limit the number of links to follow
  3. Breadth-first vs. depth-first
  4. Link loops and deduplication

The wrong way

Let’s begin by doing it the wrong way. I am going to parse the content of the page and follow all the links.

import scrapy
from scrapy.crawler import CrawlerProcess

class Scraper(scrapy.Spider):
    name = "Scraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(Scraper)
process.start()

I ran that Scraper, and my browser froze. I had to kill the Jupyter Notebook process. What happened?

I parsed the “Web scraping” page and followed all the links in its content. Then I parsed every page linked from the “Web scraping” page and followed their links too, and so on for every new page. It is probably possible to reach every Wikipedia page if you keep opening all the links on every page you see.

Limit the number of links to follow

I don’t want to download the whole of Wikipedia. Obviously, I can filter the output of response.css('div.mw-parser-output > p > a') and follow only some of the links. I can even keep a counter of visited pages and stop following new links once I reach a threshold, as sketched below.
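For illustration, a hand-rolled counter could look roughly like this (a minimal sketch; CountingScraper, page_limit, and the threshold of 10 are made-up names and values, not anything built into Scrapy):

import scrapy

class CountingScraper(scrapy.Spider):
    name = "CountingScraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]
    page_limit = 10  # arbitrary threshold chosen for this sketch
    page_count = 0

    def parse(self, response):
        # count every parsed page and stop following new links once the threshold is reached
        self.page_count += 1
        if self.page_count < self.page_limit:
            for next_page in response.css('div.mw-parser-output > p > a'):
                yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}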

All such solutions require writing some code. I want to use something that is built into Scrapy.

How can I stop Scrapy from doing this without writing much code? The simplest solution is to use the DEPTH_LIMIT setting. With DEPTH_LIMIT set to 1, Scrapy follows only the links found on the start page and ignores the links on the pages it reaches from there.

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithLimit(scrapy.Spider):
    name = "ScraperWithLimit"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}
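
As with the first spider, we can run it with a CrawlerProcess:

process = CrawlerProcess()
process.crawl(ScraperWithLimit)
process.start()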

What do we see in the log output? Things like:

DEBUG: Ignoring link (depth > 1): https://en.wikipedia.org/wiki/Thread_safety.

Great. Now we don’t follow the links indefinitely.

Breadth-first vs. depth-first

What is the order of requests? Is it breadth-first or depth-first?

According to the documentation, the spider requests the pages in depth-first order.

We can change that with the DEPTH_PRIORITY setting, which is quite unintuitive. In short: DEPTH_PRIORITY: 0 is the default (depth-first), DEPTH_PRIORITY: 1 switches to breadth-first, and DEPTH_PRIORITY: -1 explicitly requests depth-first.
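
A breadth-first variant of the spider above could look roughly like this (a minimal sketch; the BreadthFirstScraper name is made up, and the two scheduler queue settings are the ones the Scrapy FAQ recommends for a strictly breadth-first crawl):

class BreadthFirstScraper(ScraperWithLimit):
    # reuse the spider defined above and change only the crawl order
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,  # positive value prefers breadth-first
        # FIFO scheduler queues, as suggested by the Scrapy FAQ for breadth-first crawls
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }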

Link loops and deduplication

Wait a second. What if a page links to itself, or page A links to page B, page B links to page C, and page C links back to page A? The spider will never finish!

Well, it is not that bad. Scrapy does not follow such loops. There is the DUPEFILTER_CLASS configuration parameter, which by default uses scrapy.dupefilters.RFPDupeFilter to deduplicate requests.

We can disable deduplication by replacing it with scrapy.dupefilters.BaseDupeFilter, but most likely we will end up with a spider requesting pages in an infinite loop.
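
A minimal sketch of such a (usually ill-advised) configuration, reusing the spider from above (the ScraperWithoutDedup name is made up):

class ScraperWithoutDedup(ScraperWithLimit):
    custom_settings = {
        'DEPTH_LIMIT': 1,
        # BaseDupeFilter does not filter anything, so duplicate requests are allowed
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }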

Fortunately, we can keep using the RFPDupeFilter, and if we ever want to visit a page more than once, we can set the dont_filter flag of the request.

This approach works when we create the request ourselves instead of using the follow helper:

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithDuplicateRequests(scrapy.Spider):
    name = "ScraperWithDuplicateRequests"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        # extract the href attribute of every matching link
        for next_page in response.css('div.mw-parser-output > p > a::attr(href)').extract():
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}