In the previous blog post, I described the basics of Scrapy usage. Of course, a web spider which does not follow links is not very useful, so in this blog post, I am going to describe how to handle links.

Table of Contents

  1. The wrong way
  2. Limit the number of links to follow
  3. Breadth-first vs. depth-first
  4. Link loops and deduplication

The wrong way

Let’s begin by doing it the wrong way. I am going to parse the content of the page and follow all the links.

import scrapy
from scrapy.crawler import CrawlerProcess

class Scraper(scrapy.Spider):
    name = "Scraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(Scraper)
process.start()

I ran that Scraper, and my browser froze. I had to kill the Jupyter Notebook process. What happened?

I parsed the “Web scraping” page and followed all the links in its content. Then I parsed every page linked from the “Web scraping” page and followed their links too, and so on for every new page. It is probably possible to reach every Wikipedia page if you keep opening all the links on every page you see.

Limit the number of links to follow

I don’t want to download the whole of Wikipedia. Obviously, I can filter the output of response.css('div.mw-parser-output > p > a') and follow only some of the links. I can even keep a counter of visited pages and stop following new links once I reach a threshold, as sketched below.
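For illustration, a hand-rolled counter could look roughly like this (a minimal sketch; CountingScraper, page_limit, and the threshold of 10 are made-up names and values, not anything built into Scrapy):

import scrapy

class CountingScraper(scrapy.Spider):
    name = "CountingScraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]
    page_limit = 10  # arbitrary threshold chosen for this sketch
    page_count = 0

    def parse(self, response):
        # count every parsed page and stop following new links once the threshold is reached
        self.page_count += 1
        if self.page_count < self.page_limit:
            for next_page in response.css('div.mw-parser-output > p > a'):
                yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}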

All such solutions require writing some code. I want to use something that is built into Scrapy.

How can I stop Scrapy from doing this without writing much code? The simplest solution is to use the DEPTH_LIMIT setting. With DEPTH_LIMIT set to 1, Scrapy follows only the links found on the start page and ignores the links on the pages it reaches from there.

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithLimit(scrapy.Spider):
    name = "ScraperWithLimit"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}
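
As with the first spider, we can run it with a CrawlerProcess:

process = CrawlerProcess()
process.crawl(ScraperWithLimit)
process.start()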

What do we see in the log output? Things like:

DEBUG: Ignoring link (depth > 1): https://en.wikipedia.org/wiki/Thread_safety.

Great. Now we don’t follow the links indefinitely.

Breadth-first vs. depth-first

What is the order of requests? Is it breadth-first or depth-first?

According to the documentation, the spider requests the pages in depth-first order.

We can change that with the DEPTH_PRIORITY setting, which is quite unintuitive. In short: DEPTH_PRIORITY: 0 is the default (depth-first), DEPTH_PRIORITY: 1 switches to breadth-first, and DEPTH_PRIORITY: -1 explicitly requests depth-first.
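
A breadth-first variant of the spider above could look roughly like this (a minimal sketch; the BreadthFirstScraper name is made up, and the two scheduler queue settings are the ones the Scrapy FAQ recommends for a strictly breadth-first crawl):

class BreadthFirstScraper(ScraperWithLimit):
    # reuse the spider defined above and change only the crawl order
    custom_settings = {
        'DEPTH_LIMIT': 1,
        'DEPTH_PRIORITY': 1,  # positive value prefers breadth-first
        # FIFO scheduler queues, as suggested by the Scrapy FAQ for breadth-first crawls
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }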

Link loops and deduplication

Wait a second. What if a page links to itself, or page A links to page B, page B links to page C, and page C links back to page A? The spider will never finish!

Well, it is not that bad. Scrapy does not follow such loops. There is the DUPEFILTER_CLASS configuration parameter, which by default uses scrapy.dupefilters.RFPDupeFilter to deduplicate requests.

We can disable deduplication by replacing it with scrapy.dupefilters.BaseDupeFilter, but most likely we will end up with a spider requesting pages in an infinite loop.
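
A minimal sketch of such a (usually ill-advised) configuration, reusing the spider from above (the ScraperWithoutDedup name is made up):

class ScraperWithoutDedup(ScraperWithLimit):
    custom_settings = {
        'DEPTH_LIMIT': 1,
        # BaseDupeFilter does not filter anything, so duplicate requests are allowed
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }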

Fortunately, we can keep using the RFPDupeFilter, and if we ever want to visit a page more than once, we can set the dont_filter flag of the request.

This approach works when we create the request ourselves instead of using the follow helper:

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithDuplicateRequests(scrapy.Spider):
    name = "ScraperWithDuplicateRequests"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        # extract the href attribute of every matching link
        for next_page in response.css('div.mw-parser-output > p > a::attr(href)').extract():
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}