How to use Scrapy to follow links on the scraped pages

In the previous blog post, I described the basics of Scrapy usage. Of course, a web spider which does not follow links is not very useful, so in this blog post, I am going to describe how to handle links.

The wrong way

Let’s begin by doing it the wrong way: I am going to parse the content of the page and follow every link it contains.

import scrapy
from scrapy.crawler import CrawlerProcess

class Scraper(scrapy.Spider):
    name = "Scraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]
    
    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)
        
        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(Scraper)
process.start()

I ran that scraper, and my browser froze. I had to kill the Jupyter Notebook process. What happened?

I parsed the “Web scraping” page and followed all links in its content. Then I parsed all pages linked from the Web scraping page and followed their links. I kept following the links on every page. It is probably possible to reach every Wikipedia page if you keep opening all links on every page you see.

Limit the number of links to follow

I don’t want to download all of Wikipedia. Obviously, I can filter the output of response.css('div.mw-parser-output > p > a') and follow only some of the links. I can even keep a counter of visited pages and stop following new links when I reach a threshold.
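For illustration, such a hand-rolled counter could look something like the sketch below. MAX_PAGES and page_count are names I made up for this example, and this is not the approach I recommend; it just shows how much bookkeeping we would have to do ourselves.

import scrapy

class CountingScraper(scrapy.Spider):
    name = "CountingScraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    MAX_PAGES = 50  # arbitrary threshold chosen for this example

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_count = 0  # how many follow-up requests we have scheduled so far

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            if self.page_count >= self.MAX_PAGES:
                break  # stop scheduling new requests once the threshold is reached
            self.page_count += 1
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}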

All such solutions require writing some code. I want to use something that is built into Scrapy.

How can I stop Scrapy from doing that without writing much code? The simplest solution is to use the DEPTH_LIMIT setting. This time Scrapy is going to follow only the links found on the first page and ignore the links on deeper pages.

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithLimit(scrapy.Spider):
    name = "ScraperWithLimit"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        # follow links found on the start page only; anything deeper gets ignored
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(ScraperWithLimit)
process.start()

What do we see in the log output? Things like:

DEBUG: Ignoring link (depth > 1): https://en.wikipedia.org/wiki/Thread_safety.

Great. Now we don’t follow the links indefinitely.

Breadth-first vs. depth-first

What is the order of the requests? Is it breadth-first or depth-first?

According to the documentation, the spider requests the pages in depth-first order.

We can change that with the DEPTH_PRIORITY setting, which is rather unintuitive. In short: DEPTH_PRIORITY: 0 is the default (depth-first), DEPTH_PRIORITY: 1 switches to breadth-first, and DEPTH_PRIORITY: -1 keeps depth-first.
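As a sketch, switching the ScraperWithLimit spider from above to breadth-first order could look like this (BreadthFirstScraper is a name I made up). If I read the Scrapy documentation correctly, a true breadth-first crawl also requires swapping the scheduler queues for their FIFO variants; treat the exact class paths as something to verify for the Scrapy version you use.

class BreadthFirstScraper(ScraperWithLimit):
    name = "BreadthFirstScraper"

    custom_settings = {
        'DEPTH_LIMIT': 1,
        # a positive DEPTH_PRIORITY lowers the priority of deeper requests
        'DEPTH_PRIORITY': 1,
        # FIFO queues, so requests are processed in the order they were scheduled
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }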

Link loops and deduplication

Wait a second. What if a page links to itself, or page A links to page B, page B links to page C, and page C links back to page A? The spider will never finish!

Well, it is not that bad. Scrapy will not follow loops. There is the DUPEFILTER_CLASS configuration parameter, which by default points to scrapy.dupefilters.RFPDupeFilter and deduplicates requests.

We can disable deduplication by replacing it with scrapy.dupefilters.BaseDupeFilter, but most likely we will end up with a spider requesting pages in an infinite loop.
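For completeness, disabling the deduplication is just a settings change; a fragment like this would go into the spider's custom_settings (again, not something I recommend):

custom_settings = {
    # BaseDupeFilter never marks a request as already seen,
    # so the spider is free to request the same URL over and over
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
}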

Fortunately, we can keep using the RFPDupeFilter, and if we ever want to visit a page more than once, we can set the dont_filter flag on the request.

Such an approach works when we construct a new Request ourselves instead of using the follow function:

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithDuplicateRequests(scrapy.Spider):
    name = "ScraperWithDuplicateRequests"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a::attr(href)').extract():
            if next_page is not None:
                next_page = response.urljoin(next_page)
                # dont_filter=True tells the dupefilter to let this request through
                # even if the same URL has been requested before
                yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(ScraperWithDuplicateRequests)
process.start()
