How to use Scrapy to follow links on the scraped pages

In the previous blog post, I described the basics of Scrapy usage. Of course, a web spider which does not follow links is not very useful, so in this blog post, I am going to describe how to handle links.

The wrong way

Let’s begin by doing it the wrong way. I am going to parse the content of the page and follow every link.

import scrapy
from scrapy.crawler import CrawlerProcess

class Scraper(scrapy.Spider):
    name = "Scraper"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]
    
    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)
        
        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

process = CrawlerProcess()
process.crawl(Scraper)
process.start()

I ran that scraper and my browser froze. I had to kill the Jupyter Notebook process. What happened?

I parsed the “Web scraping” page and followed all links in its content. Then I parsed all pages linked from the Web scraping page and followed their links. I kept following the links on every page. It is probably possible to reach every Wikipedia page if you keep opening all links on every page you see.

Limit the number of links to follow

I don’t want to download the whole of Wikipedia. Obviously, I can filter the output of response.css('div.mw-parser-output > p > a') and follow only some of the links. I can even keep a counter of pages and stop following new links when I reach a threshold.
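As a rough illustration of the manual approach, here is a minimal sketch of a link filter with a cap. The helper name select_links, the max_links parameter, and the filtering rules (same site, /wiki/ paths only, no namespaced pages such as File:) are my assumptions, not something Scrapy provides.

```python
from urllib.parse import urldefrag, urljoin, urlparse

def select_links(base_url, hrefs, max_links=5):
    """Return at most max_links absolute, de-duplicated article URLs."""
    site = urlparse(base_url).netloc
    selected = []
    for href in hrefs:
        if len(selected) >= max_links:
            break  # threshold reached, ignore the remaining links
        # Resolve relative links and drop the "#fragment" part.
        url = urldefrag(urljoin(base_url, href)).url
        parsed = urlparse(url)
        if parsed.netloc != site:
            continue  # external link
        if not parsed.path.startswith('/wiki/') or ':' in parsed.path:
            continue  # not an article (for example "File:" or "Help:" pages)
        if url not in selected:
            selected.append(url)
    return selected

# Inside a spider's parse method this could be used roughly as follows:
#   hrefs = response.css('div.mw-parser-output > p > a::attr(href)').extract()
#   for url in select_links(response.url, hrefs):
#       yield response.follow(url, self.parse)
```

Note that this caps the links taken from a single page only; it does not limit how deep the crawl goes.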

All such solutions require writing some code. I want to use something which is built into Scrapy.

How can I stop Scrapy from following every link without writing much code? The simplest solution is to use the DEPTH_LIMIT setting. This time, Scrapy is going to follow only the links found on the first page and ignore the links on the pages it reaches from there.

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithLimit(scrapy.Spider):
    name = "ScraperWithLimit"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]
    
    custom_settings = {
        'DEPTH_LIMIT': 1
    }
    
    def parse(self, response):
        for next_page in response.css('div.mw-parser-output > p > a'):
            yield response.follow(next_page, self.parse)
        
        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}

What do we see in the log output? Things like:

DEBUG: Ignoring link (depth > 1): https://en.wikipedia.org/wiki/Thread_safety.

Great. Now we don’t follow the links indefinitely.

Breadth-first vs. depth-first

In what order are the requests made? Is it breadth-first or depth-first?

According to the documentation, the spider requests pages in depth-first order by default.
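To make the difference concrete, here is a toy simulation of the two orders. It is only a sketch: the LINKS graph and the crawl_order function are made up for illustration, and a real Scrapy crawl also involves request priorities and concurrent downloads, so its order is less tidy than this.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [], 'E': [], 'F': [],
}

def crawl_order(start, lifo):
    """Return the order in which pages would be requested."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        # LIFO (stack) gives depth-first, FIFO (queue) gives breadth-first.
        page = pending.pop() if lifo else pending.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order

print(crawl_order('A', lifo=True))   # ['A', 'C', 'F', 'B', 'E', 'D']
print(crawl_order('A', lifo=False))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

With the stack, the crawl goes all the way down to F before it touches B. With the queue, it finishes every page at one depth before moving to the next.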

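We can change that with the DEPTH_PRIORITY setting, which is rather unintuitive. In short: DEPTH_PRIORITY: 0 is the default (no adjustment, so depth-first), a positive value such as DEPTH_PRIORITY: 1 lowers the priority of deeper requests (breadth-first), and a negative value such as DEPTH_PRIORITY: -1 raises it (depth-first). According to the Scrapy FAQ, crawling in true breadth-first order also requires switching the scheduler queues (SCHEDULER_DISK_QUEUE and SCHEDULER_MEMORY_QUEUE) to their FIFO variants.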
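For reference, this is the combination of settings that, as far as I can tell from the Scrapy FAQ, switches a crawl to breadth-first order. The names BREADTH_FIRST_SETTINGS and custom_settings_with_bfo are mine; the DEPTH_LIMIT value is the one used by the spiders in this post.

```python
# Settings for breadth-first crawling, as described in the Scrapy FAQ.
BREADTH_FIRST_SETTINGS = {
    'DEPTH_PRIORITY': 1,
    'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
    'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
}

# Merged with the depth limit; this dict could be assigned to a spider's
# custom_settings attribute.
custom_settings_with_bfo = {'DEPTH_LIMIT': 1, **BREADTH_FIRST_SETTINGS}
```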

Link loops and deduplication

Wait a second. What if the page links to itself or if a page A links to B, page B links to C, and page C links back to page A? The spider will never finish!

Well, it is not that bad. Scrapy will not follow loops. The DUPEFILTER_CLASS setting defaults to scrapy.dupefilters.RFPDupeFilter, which deduplicates requests.

We can disable deduplication by replacing it with scrapy.dupefilters.BaseDupeFilter, but most likely we will end up with a spider requesting pages in an infinite loop.
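The idea behind such a filter can be sketched in a few lines. This is a simplified model, not Scrapy's implementation: the SeenFilter class, the LINKS graph, and the crawl function are invented for illustration, and the real filter fingerprints more than just the URL.

```python
import hashlib
from urllib.parse import urldefrag

# Hypothetical loop: A links to B, B links to C, C links back to A.
LINKS = {'A': ['B'], 'B': ['C'], 'C': ['A']}

class SeenFilter:
    """Remember a fingerprint of every URL that was already scheduled."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        # Hash the URL without its fragment; return True for duplicates.
        fp = hashlib.sha1(urldefrag(url).url.encode('utf-8')).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False

def crawl(start):
    dupefilter = SeenFilter()
    pending, visited = [], []

    def schedule(url):
        # Duplicates are silently dropped, which is what breaks the loop.
        if not dupefilter.request_seen(url):
            pending.append(url)

    schedule(start)
    while pending:
        page = pending.pop()
        visited.append(page)
        for link in LINKS[page]:
            schedule(link)
    return visited

print(crawl('A'))  # ['A', 'B', 'C']
```

The link from C back to A is dropped because A was already seen, so the crawl stops after three pages.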

Fortunately, we can keep using the RFPDupeFilter, and if we ever want to visit a page more than once, we can pass dont_filter=True when creating the request.

The example below builds the request explicitly instead of using the “follow” function (response.follow accepts the same dont_filter argument):

import scrapy
from scrapy.crawler import CrawlerProcess

class ScraperWithDuplicateRequests(scrapy.Spider):
    name = "ScraperWithDuplicateRequests"
    start_urls = [
        'https://en.wikipedia.org/wiki/Web_scraping',
    ]
    
    custom_settings = {
        'DEPTH_LIMIT': 1
    }
    
    def parse(self, response):
        # extract() returns every matching href. extract_first() would return
        # a single string, and the loop would iterate over its characters.
        for next_page in response.css('div.mw-parser-output > p > a::attr(href)').extract():
            # Turn the relative href into an absolute URL.
            next_page = response.urljoin(next_page)
            # dont_filter=True lets the request bypass the duplicate filter.
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)
                
        for quote in response.css('div.mw-parser-output > p'):
            yield {'quote': quote.extract()}


Bartosz Mikulski
Bartosz Mikulski * data scientist / software/data engineer * conference speaker * organizer of School of A.I. meetups in Poznań * co-founder of Software Craftsmanship Poznan & Poznan Scala User Group