
Scraping the web with Scrapy and Jenkins - Part 3

This article is the third part of my web scraping series. It focuses on creating a Scrapy Spider that scrapes Apple's website for the latest iPhone models and their prices. I will explain how to create a new Scrapy project and generate a Spider. The article also covers fetching the pages for the different iPhone models from the US, Swedish, and Hungarian Apple sites.

🎒 Resources

In the last part of this tutorial, we explored the Scrapy Shell and learned how to write CSS selectors to extract data from a website. Now we will create a Scrapy Spider based on the knowledge we have gained so far.

Creating a Scrapy Spider - A Naïve Approach

Let’s start by creating a new Scrapy Spider. These spiders are not catching flies, but rather crawling websites and extracting valuable data. Our objective is to scrape Apple’s website for the latest iPhone models and their prices so we can compare them across different countries, focusing on the US, Swedish, and Hungarian Apple websites.

Initialize a Git Repository & Create a New Spider

Since a dedicated Jenkins Build Agent will do the heavy lifting, it would be nice to keep things tidy and place all the Spider-related files in a separate Git repository. Let’s create a new repository and initialize our Scrapy project inside it.

Create a new directory for the project, navigate into it, and initialize a new Git repository:

mkdir iphone_price_bot
cd iphone_price_bot
git init

Now, let’s create a new Scrapy project inside this directory:

scrapy startproject iphone_price_bot

New Scrapy project 'iphone_price_bot', using template directory '/opt/homebrew/lib/python3.11/site-packages/scrapy/templates/project', created in:
    /Users/tamas/Desktop/iphone_price_bot/iphone_price_bot

You can start your first spider with:
    cd iphone_price_bot
    scrapy genspider example example.com

This will create a new Scrapy project with the following structure:

└── iphone_price_bot
    ├── iphone_price_bot
    │   ├── __init__.py
    │   ├── items.py
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── settings.py
    │   └── spiders
    │       └── __init__.py
    └── scrapy.cfg

Nice, we have a new Scrapy project! Now, let’s create a new Spider inside the spiders directory. We will name it apple_website_spider.py:

cd iphone_price_bot
scrapy genspider apple_website_spider https://www.apple.com/shop/buy-iphone

If you open the newly created apple_website_spider.py file, you will see something like this:

import scrapy


class AppleWebsiteSpiderSpider(scrapy.Spider):
    name = "apple_website_spider"
    allowed_domains = ["www.apple.com"]
    start_urls = ["https://www.apple.com/shop/buy-iphone"]

    def parse(self, response):
        pass

Not especially useful yet, is it? Let’s make our Spider more useful - shall we?

Fetch the Pages of iPhone Models

iPhone Models Photo by Tron Le on Unsplash

Let’s begin by fetching all the pages for the different iPhone models from the US, Swedish, and Hungarian Apple sites. To achieve this, we need to extend the start_urls list with all the URLs we intend to scrape. Without this, our Spider would only target the US website.

class AppleWebsiteSpider(scrapy.Spider):
    name = "apple_website_spider"
    allowed_domains = ["www.apple.com"]
    start_urls = [
        "https://www.apple.com/shop/buy-iphone",
        "https://www.apple.com/hu/shop/buy-iphone",
        "https://www.apple.com/se/shop/buy-iphone",
    ]
    phone_urls = []

    # ...

I’ve also initialized an empty list called phone_urls. My intention is to store the URLs of the iPhone model pages into this list. We will use these URLs to fetch the details of the iPhone models.

Finally, I set the allowed_domains attribute to ["www.apple.com"] to prevent the Spider from crawling other websites - let’s stick to Apple’s website for this job.

The provided parse method is the entry point of the Spider. It is called with the response of the URLs specified in the start_urls list. We will use this method to extract the URLs of the iPhone models to dig deeper and extract the interesting data of a given iPhone model.

def parse(self, response):
    self.phone_urls = response.css(
        "div.rf-hcard.rf-hcard-40 > a::attr(href)"
    ).extract()

    for url in self.phone_urls:
        yield scrapy.Request(url, self.parse_phone)

Remember the CSS selector we used in the Scrapy Shell to extract the URLs of the iPhone models? We use that same selector here. The response.css("div.rf-hcard.rf-hcard-40 > a::attr(href)").extract() line selects all the a tags that are direct children of a div tag with the classes rf-hcard and rf-hcard-40, and extracts the href attribute of each of them. In the for loop, we iterate over the extracted URLs and schedule a request for each one, with the parse_phone method as the callback that will extract the details of the iPhone models.

Extract the Details of the iPhone Models

Now, let’s implement the parse_phone method to extract the details of a given iPhone model. We will extract the name of the iPhone model and its price from the page.

def parse_phone(self, response):
    phones = response.css("div.details")
    model = response.css("h1.fwl::text").extract_first()

    if model:
        model = model.replace("Buy ", "")
        model = model.replace("vásárlása", "")
        model = model.replace("Köp", "")
        model = clean_value(value=model)

    for phone in phones:
        price = phone.css(".current_price::text").extract_first()
        capacity = phone.css("span.dimensionCapacity::text").extract_first()
        capacity += phone.css(
            "span.dimensionCapacity > small::text"
        ).extract_first()

        if price:
            yield {
                "country": get_country_code(response.url),
                "model": model,
                "price": clean_value(price),
                "capacity": capacity,
                "color": phone.css("span.dimensionColor::text").extract_first(),
            }

Let’s break down this quite verbose method into more manageable chunks. First, we put all the div tags with the class details into the phones variable. We will eat them one bite after another. What are these funny words like ‘vásárlása’ and ‘Köp’? These are the Hungarian and Swedish translations of ‘Buy’, respectively. It’s important to remove the ‘Buy’ verb from the model’s name, not only in the English version of the website but across all language variants. Next, we use the clean_value function to drop any unnecessary characters from the extracted values, ensuring clarity and consistency.
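To make the replace chain concrete, here is a tiny self-contained sketch. The strip_buy_verb helper is a hypothetical name I use only for illustration - the Spider inlines these calls directly in parse_phone:

```python
import re


def clean_value(value):
    # Same cleanup as in the Spider: trim, drop all spaces, drop non-ASCII.
    value = value.strip()
    value = value.replace(" ", "")
    return re.sub(r"[^\x00-\x7F]+", "", value)


def strip_buy_verb(model):
    # Hypothetical helper mirroring the replace chain in parse_phone.
    for verb in ("Buy ", "vásárlása", "Köp"):
        model = model.replace(verb, "")
    return clean_value(model)


print(strip_buy_verb("Buy iPhone SE"))        # iPhoneSE
print(strip_buy_verb("iPhone SE vásárlása"))  # iPhoneSE
print(strip_buy_verb("Köp iPhone SE"))        # iPhoneSE
```

All three language variants collapse to the same normalized model name, which is exactly what we need for cross-country comparison.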

Now we can iterate over the phones collection and extract the price, capacity, and color of the iPhone models. We use the get_country_code function to extract the country code from the URL of the page. Remember that we don’t have prices for the carrier-locked iPhones? Consequently, we must ignore those entries in the model list.

Almost forgot to include the clean_value and get_country_code functions. Here they are:

def get_country_code(url):
    country = url.split("/")[3]
    return "us" if country == "shop" else country


def clean_value(value):
    value = value.strip()
    value = value.replace(" ", "")
    value = re.sub(r"[^\x00-\x7F]+", "", value)
    return value

Nothing fancy. The get_country_code function does exactly what its name suggests: it extracts the country code from the URL of the page. For the US, there is no country code in the URL, so we identify it by the term ‘shop’ instead. The clean_value function removes leading and trailing whitespace, all remaining spaces, and any non-ASCII characters from the value.
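A quick way to convince yourself these helpers behave as described - a self-contained copy of both functions plus a few sample calls on the URLs and a price string like those we will scrape:

```python
import re


def get_country_code(url):
    # The fourth path segment is either a country code or "shop" (US).
    country = url.split("/")[3]
    return "us" if country == "shop" else country


def clean_value(value):
    value = value.strip()
    value = value.replace(" ", "")
    return re.sub(r"[^\x00-\x7F]+", "", value)


print(get_country_code("https://www.apple.com/shop/buy-iphone"))     # us
print(get_country_code("https://www.apple.com/hu/shop/buy-iphone"))  # hu
print(get_country_code("https://www.apple.com/se/shop/buy-iphone"))  # se
print(clean_value(" 244 990 Ft "))                                   # 244990Ft
```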

Let’s see what our Spider looks like now:

import scrapy
import re

from iphone_price_bot.items import IphonePricesItem


def get_country_code(url):
    country = url.split("/")[3]
    return "us" if country == "shop" else country


def clean_value(value):
    value = value.strip()
    value = value.replace(" ", "")
    value = re.sub(r"[^\x00-\x7F]+", "", value)
    return value


class AppleWebsiteSpider(scrapy.Spider):
    name = "apple_website_spider"
    allowed_domains = ["www.apple.com"]
    start_urls = [
        "https://www.apple.com/shop/buy-iphone",
        "https://www.apple.com/hu/shop/buy-iphone",
        "https://www.apple.com/se/shop/buy-iphone",
    ]
    phone_urls = []

    def parse(self, response):
        self.phone_urls = response.css(
            "div.rf-hcard.rf-hcard-40 > a::attr(href)"
        ).extract()

        for url in self.phone_urls:
            yield scrapy.Request(url, self.parse_phone)

    def parse_phone(self, response):
        phones = response.css("div.details")
        model = response.css("h1.fwl::text").extract_first()

        if model:
            model = model.replace("Buy ", "")
            model = model.replace("vásárlása", "")
            model = model.replace("Köp", "")
            model = clean_value(value=model)

        for phone in phones:
            price = phone.css(".current_price::text").extract_first()
            capacity = phone.css("span.dimensionCapacity::text").extract_first()
            capacity += phone.css(
                "span.dimensionCapacity > small::text"
            ).extract_first()

            if price:
                yield {
                    "country": get_country_code(response.url),
                    "model": model,
                    "price": clean_value(price),
                    "capacity": capacity,
                    "color": phone.css("span.dimensionColor::text").extract_first(),
                }

Testing the Spider 🕷️

It’s time to run the script and verify if it can successfully extract the details of the iPhone models from the Apple website:

cd iphone_price_bot/iphone_price_bot
scrapy crawl apple_website_spider -o iphone_prices.csv 

This command will execute the apple_website_spider Spider and save the extracted data into a CSV file named iphone_prices.csv. If everything goes well, you should see the extracted data in the CSV file.

Let’s peek into the CSV file:

country,model,price,capacity,color
us,iPhoneSE,$579.00,256GB,Midnight
us,iPhoneSE,$429.00,64GB,Starlight
hu,iPhoneSE,244990Ft,128 GB,(PRODUCT)RED
hu,iPhoneSE,219990Ft,64 GB,éjfekete
se,iPhone14,"15095,00kr",512 GB,midnatt
se,iPhone14,"15095,00kr",512 GB,blå
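The CSV is easy to post-process with the standard library. Here is a quick sketch that groups models by country; the sample rows are hard-coded copies of lines from the crawl output so the snippet runs without the crawl itself:

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the crawl output, hard-coded for a self-contained demo.
sample = """country,model,price,capacity,color
us,iPhoneSE,$579.00,256GB,Midnight
hu,iPhoneSE,244990Ft,128 GB,(PRODUCT)RED
se,iPhone14,"15095,00kr",512 GB,midnatt
"""

by_country = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample)):
    by_country[row["country"]].append(row["model"])

print(dict(by_country))  # {'us': ['iPhoneSE'], 'hu': ['iPhoneSE'], 'se': ['iPhone14']}
```

To read the real file instead, swap io.StringIO(sample) for open("iphone_prices.csv", newline="", encoding="utf-8").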

If you ask me, it looks promising - but far from perfect. We have some issues with the extracted data. First of all, the color names are not consistent across the different countries. The capacity is not in the same format either. Lastly, the prices are formatted in different ways. Wouldn’t it be nice to have all the prices in the same currency - say, USD - so we could compare them conveniently?
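As a teaser for that cleanup, a price-normalization step could look something like the sketch below. The exchange rates here are made-up placeholders; a real pipeline would fetch current rates from an FX API:

```python
import re

# Made-up placeholder rates - fetch real ones from an FX API in practice.
USD_RATES = {"us": 1.0, "hu": 0.0028, "se": 0.095}


def price_to_usd(country, price):
    # Keep only digits and separators, then treat a decimal comma as a dot.
    digits = re.sub(r"[^\d.,]", "", price)
    if "," in digits:
        # "15095,00" style: drop thousands dots, comma becomes decimal point.
        digits = digits.replace(".", "").replace(",", ".")
    return round(float(digits) * USD_RATES[country], 2)


print(price_to_usd("us", "$579.00"))
print(price_to_usd("hu", "244990Ft"))
print(price_to_usd("se", "15095,00kr"))
```

With every price in one currency, comparing the same model across countries becomes a simple sort.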

In the upcoming final installment of this series, we will address these issues by refactoring our Spider into the structure of a real-world Scrapy project. Finally, we will incorporate it into our Jenkins pipeline.

Stay tuned! 👋
