
Scraping the Web with Scrapy and Jenkins - Part 4

In this final part of my Web Scraping series, we will refactor the Spider and fix the issues we encountered in the previous parts. We will also deploy and run our Scrapy Spider on Jenkins.

🎒 Resources

👈 Previously, we created a Scrapy Spider that scrapes the Apple website for iPhone models and their prices. It still has some issues we need to fix before we can call it a production-ready web scraping solution.

Let’s fix things step by step, shall we?

📐 Improving the Spider

We’ll begin by separating the helper functions from the main Spider code and placing them into a new file. This step will make our Spider more organized and easier to understand. Plus, it allows us to use these functions in other Spiders in the future. Please create a new Python file called spider_utils.py in the iphone_price_bot folder.

Next, we’ll standardize the colour names. Our goal is to have consistent colour names in English, regardless of the country. So, we’ll convert all colour names from Hungarian and Swedish to English.

def translate_color_hu(color_hu):
    """
    Translate iPhone color names from Hungarian to English.

    Parameters:
        color_hu (str): The color name in Hungarian.

    Returns:
        str: The color name in English. If the color name is not found in the
        dictionary, the original color name is returned.
    """
    return {
        "zöld": "Green",
        "kék": "Blue",
        "sárga": "Yellow",
        "fekete": "Black",
        "rózsaszín": "Pink",
        "éjfekete": "Midnight",
        "lila": "Purple",
        "kék titán": "Blue Titanium",
        "fehér titán": "White Titanium",
        "fekete titán": "Black Titanium",
        "natúr titán": "Natural Titanium",
        "csillagfény": "Starlight",
    }.get(color_hu, color_hu)

The translate_color_hu function is designed to convert Hungarian colour names into their English counterparts. If a colour name isn’t in our list, it keeps the name as is. This tool lets us match colours between Hungarian and American iPhone models without needing to translate each colour name by hand.
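
A quick sanity check of that fallback behaviour (the unknown colour here is just an illustrative value):

print(translate_color_hu("kék titán"))  # -> "Blue Titanium"
print(translate_color_hu("piros"))      # not in the dictionary -> "piros"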

As you may have guessed, we need a similar function for the Swedish colour names. This function is called translate_color_se and it is almost identical to translate_color_hu, so I will only show its signature and the translation dictionary.

def translate_color_se(color_se):
    return {
        "grön": "Green",
        "blå": "Blue",
        "gul": "Yellow",
        "svart": "Black",
        "rosa": "Pink",
        "midnatt": "Midnight",
        "lila": "Purple",
        "blått titan": "Blue Titanium",
        "vit titan": "White Titanium",
        "svart titan": "Black Titanium",
        "naturlig titan": "Natural Titanium",
        "stjärnljus": "Starlight",
        "stjärnglans": "Starlight",
    }.get(color_se, color_se)

Nice, we have the colour translation functions. Now, let’s organize our code by moving the country code extraction function into the spider_utils.py file. This function, named get_country_code, is similar to what we’ve seen before but now includes an extra part called loader_context.

This function will be used in the Item Loader to extract the country code from the URL of the page. The loader_context parameter is a dictionary that holds the context of the Item Loader. Don’t worry, we’ll go over Item Loaders in more detail soon.

def get_country_code(url, loader_context):
    """
    Get the country code from the URL.

    Parameters:
        url (str): The URL of the website.
        loader_context (dict): The context of the loader.

    Returns:
        str: The country code.
    """
    country = loader_context["response_url"].split("/")[3]
    return "us" if country == "shop" else country

I hope you still remember why we have this "shop" string in the URL. Yeah, it indicates that we’re on the US Apple website.
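
To make the behaviour concrete, here is a quick sketch with two illustrative URLs (the exact store paths may differ):

ctx = {"response_url": "https://www.apple.com/shop/buy-iphone/iphone-15"}
print(get_country_code("", loader_context=ctx))  # -> "us"

ctx = {"response_url": "https://www.apple.com/hu/shop/buy-iphone/iphone-15"}
print(get_country_code("", loader_context=ctx))  # -> "hu"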

Next, we’ll move our clean_value function to the spider_utils.py file. This function strips leading and trailing whitespace, removes any non-ASCII characters from the value, and replaces commas with dots so the price can later be parsed as a float.

import re


def clean_value(value):
    """
    Clean the value from unnecessary characters.
    Also, replace the comma with a dot to help make it a float number.

    Parameters:
        value (str): The value to clean.

    Returns:
        str: The cleaned value.
    """
    value = value.strip()
    value = value.replace(" ", "")
    value = value.replace(",", ".")
    value = re.sub(r"[^\x00-\x7F]+", "", value)
    return value
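
For instance, with an illustrative raw price string that contains a non-breaking space before the currency:

print(clean_value("  1 199,99\u00a0Ft "))  # -> "1199.99Ft"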

Next, we’ll streamline the process we’ve been using to remove the word ‘Buy’ and its equivalents in Swedish and Hungarian from the model names. We’ve created a function named clean_model for this purpose.

def clean_model(value):
    """
    Clean the model value by removing unnecessary words and characters.

    Parameters:
        value (str): The model value to clean.

    Returns:
        str: The cleaned model value.
    """
    value = value.replace("Buy ", "")
    value = value.replace("vásárlása", "")
    value = value.replace("Köp", "")
    value = value.strip()
    value = clean_value(value=value)
    return value

As you can see, we also stripped the model name of any unwanted clutter. One more small function ensures the capacity unit is free from unwanted characters (spaces and other blanks) by keeping only its first two characters, such as ‘GB’ or ‘TB’.

def clean_capacity_unit(value):
    """
    Clean the capacity unit value by removing unnecessary characters.

    Parameters:
        value (str): The capacity unit value to clean.

    Returns:
        str: The cleaned capacity unit value.
    """
    return clean_value(value=str(value[:2]))
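
For example, if the site renders the unit with a trailing non-breaking space:

print(clean_capacity_unit("GB\u00a0"))  # keeps the first two characters -> "GB"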

Pretty straightforward, right? Let’s move on and streamline our crawling logic with Item Pipelines.

🔁 Item Pipelines

Pipelines? You might be wondering if we’re discussing iPhone model scraping or constructing CI/CD pipelines here. Interestingly, the notion of processes executed in sequence isn’t limited to the realm of DevOps. It’s a good practice to delegate the tasks of manipulating and processing scraped data to distinct classes. When we call these classes ‘pipelines’, it leads us to the concept of Scrapy Item Pipelines.

🔍 Anatomy of a Scrapy Pipeline

A Scrapy pipeline is a Python class that processes items returned by the Spider. It can perform various tasks, such as cleaning, validating, and storing the scraped data. The pipeline class must implement the process_item method, which receives the item and the Spider instance as arguments. The method should return the processed item or raise a DropItem exception to discard the item. You might have already figured out that we are going to use this exception to filter out the models without prices.
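
Here is a minimal sketch of that contract (the field name is just an example):

from scrapy.exceptions import DropItem


class MinimalPipeline:
    def process_item(self, item, spider):
        if not item.get("some_required_field"):
            raise DropItem("missing required field")  # discard this item
        return item  # hand the item over to the next pipeline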

Pipeline for Price Data Processing

Let’s start with delegating the task of processing price data to a dedicated pipeline. Add the following class to the pipelines.py script:

from scrapy.exceptions import DropItem


class IphonePriceSpiderPipeline:
    """
    Pipeline for processing iPhone price items.

    This pipeline checks if the item has a "price" field. If it does, it adds the "capacity_unit" to the "capacity" field,
    removes the "capacity_unit" field, and returns the modified item. If the item does not have a "price" field, it raises
    a DropItem exception to discard the item.

    Args:
        item (dict): The item to be processed.

    Returns:
        dict: The processed item.

    Raises:
        DropItem: If the item does not have a "price" field.
    """

    def process_item(self, item, spider):
        if "price" in item:
            item["capacity"] = item["capacity"] + item["capacity_unit"]
            del item["capacity_unit"]
            return item
        else:
            raise DropItem("Item has no price")

This pipeline checks if the item has a "price" field. If it does, it melds the "capacity_unit" with the "capacity" field, then drops the "capacity_unit" field itself, finally returns the modified item. If the item does not have a "price" field, it raises a DropItem exception to discard the item.
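
A quick illustration, with a plain dict standing in for the item:

pipeline = IphonePriceSpiderPipeline()
item = {"price": "999", "capacity": "256", "capacity_unit": "GB"}
print(pipeline.process_item(item, spider=None))
# -> {'price': '999', 'capacity': '256GB'}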

Pipeline for Colour Translation

Yes indeed, we are going to delegate the translation logic into a separate pipeline. Add the following class to the pipelines.py script:

from iphone_price_bot.spider_utils import translate_color_hu, translate_color_se


class IphoneColorConverterPipeline:
    """
    Pipeline for converting iPhone color based on country.

    This pipeline converts the color field of the item based on the country field. If the country is "hu", it uses the
    translate_color_hu function to translate the color. If the country is "se", it uses the translate_color_se function.
    The converted color is then assigned back to the color field of the item.

    Args:
        item (dict): The item to be processed.

    Returns:
        dict: The processed item.
    """

    def convert_color(self, item):
        if item["country"] == "hu":
            item["color"] = translate_color_hu(item["color"])
        elif item["country"] == "se":
            item["color"] = translate_color_se(item["color"])
        return item["color"]

    def process_item(self, item, spider):
        item["color"] = self.convert_color(item=item)
        return item

Are you satisfied with the convert_color method? Neither am I. Feel free to refactor it to your liking. Now, let’s see what is still missing from our pipelines. Oh, I know! The price conversion logic. Let’s add a new pipeline for this task:

class IphonePriceConvertPricePipeline:
    """
    Pipeline for converting iPhone price to USD based on country.

    This pipeline cleans the currency value by removing commas, dollar signs, and country-specific currency symbols.
    It then converts the cleaned price to USD based on the country field of the item. The converted price is assigned
    back to the price field of the item.

    Args:
        item (dict): The item to be processed.

    Returns:
        dict: The processed item.
    """

    def clean_currency(self, value):
        value = (
            value.lower()
            .replace(",", ".")
            .replace("$", "")
            .replace("ft", "")
            .replace("kr", "")
        )

        if value.count(".") > 1:
            value = value.replace(".", "", value.count(".") - 1)

        return value

    """
    Convert the price to USD based on the country field of the item.

    Args:
        price (str): The price to be converted.
        country (str): The country of the item.

    Returns:
        float: The converted price in USD.
    """

    def convert_to_usd(self, price, country):
        price = float(price)
        return {
            "us": price,  # 1 USD = 1 USD :)
            "hu": price * 0.0034,  # 1 HUF = 0.0034 USD
            "se": price * 0.11,  # 1 SEK = 0.11 USD
        }.get(country, "N/A")

    def process_item(self, item, spider):
        item["price"] = self.convert_to_usd(
            price=self.clean_currency(value=item["price"]), country=item["country"]
        )

        return item

Okay, that’s a bunch of code for a pipeline. Let’s break it down. The clean_currency method removes commas, currency symbols, and other unwanted characters from the price value. The convert_to_usd method converts the cleaned price to USD based on the country field of the item. The converted price is then assigned back to the price field of the item.
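
Tracing an illustrative Hungarian price through the two methods (by this point, the input processors have already collapsed "429 990 Ft" into "429990Ft"):

pipeline = IphonePriceConvertPricePipeline()
cleaned = pipeline.clean_currency("429990Ft")  # -> "429990"
print(pipeline.convert_to_usd(cleaned, "hu"))  # -> 1461.966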

Summary of the Pipelines

So far so good, we have three pipeline classes - let’s visualize them:

Scrapy Pipeline

The Spider sends the scraped, “raw” data to the Item Pipelines.

  • The IphonePriceSpiderPipeline processes the data by merging the capacity unit with the capacity field. If the item does not have a price field, it raises a DropItem exception to discard the given item.
  • The IphonePriceConvertPricePipeline converts the price field to USD based on the country field.
  • Finally, the IphoneColorConverterPipeline translates the color field to English, again based on the country field.

Enable the Pipelines

There’s still more to do! For Scrapy to utilize the pipelines we’ve crafted, we must activate them. This is done in the settings.py file. Please insert these lines into that file to get the pipelines up and running:

ITEM_PIPELINES = {
    "iphone_price_bot.pipelines.IphonePriceSpiderPipeline": 300,
    "iphone_price_bot.pipelines.IphonePriceConvertPricePipeline": 400,
    "iphone_price_bot.pipelines.IphoneColorConverterPipeline": 500,
}

The numbers assigned to each pipeline in the dictionary determine their execution sequence. A lower number means the pipeline runs sooner. You’re not bound to use the exact numbers I’ve chosen; adjust them as you see fit. It’s wise to space them out, perhaps in increments of 100, to allow painless insertion of new pipelines later on.

⬇️ Item Loaders

Item Loaders are a powerful Scrapy feature that simplifies the data extraction workflow. They allow you to define reusable input and output processors for your items. You can define these processors in a separate class and reuse them across different Spiders. Let’s see how we can use Item Loaders to fetch and properly clean the iPhone model data from Apple’s website.

Define the Item Loader

For our current project, we’ll integrate an Item Loader directly within the Spider. Please put the following class into your apple_website_spider.py script:

    class IphoneItemLoader(ItemLoader):
        default_output_processor = TakeFirst()

Now, that’s not much, is it? The IphoneItemLoader class is a subclass of the ItemLoader class provided by Scrapy. We set the default_output_processor attribute to TakeFirst() - let me clarify the terms input and output processors a bit later. Trust me, it will make sense soon.
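
For completeness, apple_website_spider.py needs the corresponding imports. On recent Scrapy releases, the processors live in the separate itemloaders package (older versions expose them under scrapy.loader.processors):

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst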

We are going to use this loader to extract the iPhone model data from the website. To fetch the URLs of the iPhone models, we still extract them using a CSS selector:

"div.rf-hcard.rf-hcard-40 > a::attr(href)"

Before we can use the IphoneItemLoader class, we need to make some changes to the parse_phone method:

    def parse_phone(self, response):
        phones = response.css("div.details")
        model = response.css("h1.fwl::text").extract_first()

        for phone in phones:
            phone_loader = self.IphoneItemLoader(
                item=IphonePricesItem(), selector=phone
            )
            phone_loader.context["response_url"] = response.url
            phone_loader.add_value("country", "")
            phone_loader.add_css("price", ".current_price::text")
            phone_loader.add_value("model", model)
            phone_loader.add_css("capacity", "span.dimensionCapacity::text")
            phone_loader.add_css(
                "capacity_unit", "span.dimensionCapacity > small::text"
            )
            phone_loader.add_css("color", "span.dimensionColor::text")
            yield phone_loader.load_item()

First, we create an instance of the IphoneItemLoader class, but what is this IphonePricesItem? It’s the item class we have defined in the items.py file.

For a refresher, here is the original version of this class, as we defined it in the items.py file:

class IphonePricesItem(scrapy.Item):
    country = scrapy.Field()
    model = scrapy.Field()
    capacity = scrapy.Field()
    color = scrapy.Field()
    price = scrapy.Field()

Let’s move the cleaning logic from the Spider to the Item. Every Item Field can have exactly one input processor and one output processor. We can define these processors in the items.py file (or set a default one in the Item Loader, if that is more convenient - as we did above).

Here is the updated version of the IphonePricesItem class:

import scrapy

# On older Scrapy versions, import MapCompose from scrapy.loader.processors instead.
from itemloaders.processors import MapCompose

from iphone_price_bot import spider_utils


class IphonePricesItem(scrapy.Item):
    country = scrapy.Field(input_processor=MapCompose(spider_utils.get_country_code))
    model = scrapy.Field(input_processor=MapCompose(spider_utils.clean_model))
    capacity = scrapy.Field(input_processor=MapCompose(spider_utils.clean_value))
    capacity_unit = scrapy.Field()
    color = scrapy.Field()
    price = scrapy.Field(input_processor=MapCompose(spider_utils.clean_value))

The MapCompose processor applies the given function to each value collected for the field. In this case, we use the functions we defined in the spider_utils.py file to clean the country, model, capacity, and price fields. As I promised earlier, let’s clarify input and output processors.

Input and Output Processors

Input processors are functions that modify the data given to the Item Loader through methods like add_value, add_css, or add_xpath.

Once all the data is gathered and adjusted, we use output processors to finalize it. This is done by using the load_item method from the Item Loader. Recall how we set up a default output processor for the IphoneItemLoader class? It was the TakeFirst() processor, which selects the first valid value from the collected data.
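
Here is a tiny, self-contained sketch of the two processor kinds working together, using the itemloaders package that ships with Scrapy (the field name and values are made up):

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst


class DemoLoader(ItemLoader):
    default_output_processor = TakeFirst()


loader = DemoLoader(item={})
# The input processor (MapCompose(str.strip)) runs on each value as it is added...
loader.add_value("model", ["  iPhone 15  ", "  iPhone 15 Pro  "], MapCompose(str.strip))
# ...and the output processor (TakeFirst) picks the final value at load time.
print(loader.load_item())  # -> {'model': 'iPhone 15'}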

Item Loader Context

If you look closely at the parse_phone method, you will see that we set the response_url key in the phone_loader.context dictionary.

What is this dictionary anyway? If we need to pass more data to input and output processors, we can use the context dictionary. In our case, we pass the URL of the response to the get_country_code function. Any time we need to access any information in addition to the input value, we can use the context dictionary.

Quite handy! My dear vigilant readers might remember that we have used the loader_context parameter in the get_country_code function. This is where it comes into play.
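
A minimal demonstration of that mechanism, outside of any Spider (the URL is just an example):

from itemloaders import ItemLoader
from itemloaders.processors import MapCompose


def annotate(value, loader_context):
    # itemloaders passes the loader's context to any processor function that
    # declares a loader_context parameter - exactly how get_country_code
    # receives the response URL.
    return f"{value} (from {loader_context['response_url']})"


loader = ItemLoader(item={})
loader.context["response_url"] = "https://www.apple.com/shop/buy-iphone"
loader.add_value("country", "us", MapCompose(annotate))
print(loader.load_item())
# -> {'country': ['us (from https://www.apple.com/shop/buy-iphone)']}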

As a final tweak, let’s specify the default output format for the scraped data. Add the following lines to the settings.py file:

FEED_FORMAT = "csv"
FEED_URI = "iphone_prices.csv"
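
A side note: on Scrapy 2.1 and newer, these two settings are deprecated in favour of the combined FEEDS setting; the equivalent would be:

FEEDS = {
    "iphone_prices.csv": {"format": "csv"},
}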

This is not required, but it makes our life easier with Jenkins. Nice, we have refactored the Spider and added Item Loaders to streamline the data extraction process. Let’s move on and deploy our Scrapy Spider on Jenkins.

🎩 Mr Jenkins, Let’s Crawl!

Now that we have all the improvements checked into the sample repository, we can improve our Jenkins pipeline to run the actual Scrapy Spider. Here is the updated Jenkinsfile:

pipeline {
    agent {
        docker { image 'tcsizmadia/scrapy-jenkins-agent' }
    }
    
    parameters {
        booleanParam(name: 'DRY_RUN', defaultValue: false, description: 'No scraping will happen')
        booleanParam(name: 'SLACK_SEND', defaultValue: false, description: 'Send notification to Slack')
    }
    
    stages {
        stage('Checkout') {
            steps {
                checkout([
                    $class: 'GitSCM', 
                    branches: [[name: '*/main']], 
                    doGenerateSubmoduleConfigurations: false, 
                    extensions: [], 
                    submoduleCfg: [], 
                    userRemoteConfigs: [[url: 'https://github.com/tcsizmadia/iphone-price-scraper.git']]
                ])
            }
        }
        stage('Pre-Flight') {
            steps {
                // Check Python version
                sh 'python3 --version'
                // Check Scrapy version
                sh 'scrapy version'
            }
        }
        stage('Scrape Website') {
            steps {
                script {
                    if (params.DRY_RUN) {
                        echo 'Dry run, no scraping will happen'
                    } else {
                        sh 'cd iphone_price_bot && scrapy crawl apple_website_spider'
                    }
                }
            }
        }
        
        stage('Archive') {
            when {
                expression {
                    currentBuild.currentResult == 'SUCCESS'
                }
            }
            steps {
                // Archive the scraped data
                archiveArtifacts artifacts: 'iphone_price_bot/iphone_prices.csv', fingerprint: true
            }
        }
        
        stage('Notify') {
            when {
                expression {
                    currentBuild.currentResult == 'SUCCESS'
                }
            }
            steps {
                script {
                    if (params.SLACK_SEND) {
                        // Send notification to Slack
                        slackSend channel: '#sandbox', color: 'good', message: 'Scraping completed!'
                    }
                }
            }
        }
    }
    post {
        always {
            archiveArtifacts artifacts: 'iphone_price_bot/iphone_prices.csv', fingerprint: true
        }
    }
}

Although the Jenkinsfile is quite similar to the previous one, there are some notable changes.

Checkout Stage

Since my sample repository is public on GitHub, we can clone it so the Jenkins agent can access the Spider code.

Pre-Flight Stage

No changes here, we still check the Python and Scrapy versions.

Scrape Website Stage

The most significant change is the shell command: sh 'cd iphone_price_bot && scrapy crawl apple_website_spider'. This command navigates to the iphone_price_bot directory and runs the Scrapy Spider. If the DRY_RUN parameter is set to true, the Spider will not run.

Archive Stage

If everything goes well, we will have the outcome of the scraping job in the iphone_prices.csv file. We archive this file to keep track of the scraped data. This way, we can easily access the scraped data from the Jenkins UI.

Let’s run the job and see what happens!

Jenkins Job

The job has run successfully, and we can see the iphone_prices.csv file in the Jenkins UI. We can download it and inspect the scraped data. Open it in your favourite spreadsheet editor and do some analysis!

Where Can I Buy the Cheapest iPhone 14?

I wanted to find out where I could buy a blue iPhone 14 with 256 GB storage without spending too much money. 💰

iPhone 14 (Source: Apple)

So, I opened the iphone_prices.csv file and filtered the data to ‘iPhone 14’ only, with the colour ‘Blue’ and the capacity of ‘256 GB’.

Here is what I’ve found:

iPhone 14 Prices

It seems that the iPhone 14 with 256 GB storage and blue colour is the cheapest in the US, and unfortunately, it costs much more in Hungary. 😭

🔬 Conclusion

In this series, we have learned how to scrape the web with Scrapy and make the process more efficient and streamlined with Jenkins. We have refactored the Spider to make it scalable and more maintainable. We have used Item Loaders to extract and clean the data more effectively. We have also used Item Pipelines to process the scraped data and make it production-ready. Finally, we have deployed the Scrapy Spider on Jenkins and automated the scraping process.

As we wrap up, I hope this series has been both informative and enjoyable. 🙏 It’s clear that what we’ve explored is merely the beginning — only the tip of the iceberg. With the insights and techniques you’ve acquired, I’m confident you’ll craft more sophisticated Spiders, extract really valuable data, and derive meaningful conclusions to guide your decisions. 👓
