🎒 Resources
- GitHub Repository for Jenkins CI/CD Pipeline
- GitHub Repository for the iPhone Prices Scrapy Spider
- First Part of the Series: Introduction
- Second Part of the Series: Playing with Scrapy Shell
In the previous part of this tutorial, we explored the Scrapy Shell and learned how to write CSS selectors to extract data from a website. Now we will create a Scrapy Spider based on the knowledge we have gained so far.
Creating a Scrapy Spider - A Naïve Approach
Let’s start by creating a new Scrapy Spider. These spiders don’t catch flies; they crawl websites and extract valuable data. Our objective is to scrape Apple’s website for the latest iPhone models and their prices and compare them across different countries, focusing on the US, Swedish, and Hungarian Apple sites.
Initialize a Git Repository & Create a New Spider
Since a dedicated Jenkins Build Agent will do the heavy lifting, it would be nice to keep things tidy and place all the Spider-related files in a separate Git repository. Let’s create a new repository and initialize our Scrapy project inside it.
Create a new directory for the project, navigate into it, and initialize a new Git repository:
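Something along these lines will do (the directory name `iphone-prices` is just a placeholder - pick whatever fits your setup):

```shell
# Create a project directory and turn it into a Git repository
mkdir iphone-prices
cd iphone-prices
git init
```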
Now, let’s create a new Scrapy project inside this directory:
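Assuming Scrapy is already installed (`pip install scrapy`), a command like this generates the project right in the repository root - the project name `iphone_prices` is a placeholder:

```shell
# The trailing dot tells Scrapy to generate the project
# in the current directory instead of a new subdirectory
scrapy startproject iphone_prices .
```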
This will create a new Scrapy project with the following structure:
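Assuming a project module named `iphone_prices` (a placeholder name), Scrapy’s standard layout looks like this:

```
.
├── scrapy.cfg            # deploy configuration
└── iphone_prices/        # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # our Spiders will live here
        └── __init__.py
```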
Nice, we have a new Scrapy project! Now, let’s create a new Spider inside the `spiders` directory. We will name it `apple_website_spider.py`:
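Scrapy’s `genspider` command scaffolds the Spider for us; run it from the project root:

```shell
# Creates spiders/apple_website_spider.py, scoped to www.apple.com
scrapy genspider apple_website_spider www.apple.com
```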
If you open the newly created `apple_website_spider.py` file, you will see something like this:
It is not especially useful yet, is it? Let’s make our Spider more useful - shall we?
Fetch the Pages of iPhone Models
Let’s begin by fetching all the pages for the different iPhone models in the US, Swedish, and Hungarian Apple sites.
To achieve this, we need to create a list named `start_urls` to specify the URLs we intend to scrape. Without this, our Spider would only target the US website.
I’ve also initialized an empty list called `phone_urls`. My intention is to store the URLs of the iPhone model pages in this list. We will use these URLs to fetch the details of the iPhone models.
Finally, I set the `allowed_domains` attribute to `["www.apple.com"]` to prevent the Spider from crawling other websites - let’s stick to Apple’s website for this job.
The provided `parse` method is the entry point of the Spider. It is called with the responses of the URLs specified in the `start_urls` list. We will use this method to extract the URLs of the iPhone model pages, so we can dig deeper and extract the interesting data of a given iPhone model.
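Based on that description, the `parse` method can be sketched like this, shown as it would appear inside the Spider class (storing the URLs before following them is an assumption on my part):

```python
    def parse(self, response):
        # The same CSS selector we built in the Scrapy Shell: grab the
        # href of every model card link on the buy-iphone page
        urls = response.css("div.rf-hcard.rf-hcard-40 > a::attr(href)").extract()
        for url in urls:
            self.phone_urls.append(url)
            # Follow each model page and hand it to parse_phone
            yield response.follow(url, callback=self.parse_phone)
```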
Remember the CSS selector we used in the Scrapy Shell to extract the URLs of the iPhone models? We use that very same selector here. The `response.css("div.rf-hcard.rf-hcard-40 > a::attr(href)").extract()` line selects all the `a` tags that are children of a `div` tag with the classes `rf-hcard` and `rf-hcard-40`, and extracts the `href` attribute of these `a` tags. In the `for` loop, we iterate over the extracted URLs and call the `parse_phone` method to extract the details of the iPhone models.
Extract the Details of the iPhone Models
Now, let’s implement the `parse_phone` method to extract the details of a given iPhone model. We will extract the name of the iPhone model and its price from the page.
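A sketch of what `parse_phone` might look like. The `div.details` container and the ‘Buy’ word cleanup come from the explanation below, but the selectors for name, price, capacity, and color are illustrative assumptions, and `clean_value`/`get_country_code` are small helpers defined later in this post:

```python
    def parse_phone(self, response):
        country = get_country_code(response.url)
        # Every purchase option on the model page sits in a <div class="details">
        phones = response.css("div.details")
        # 'Buy' in English, Hungarian, and Swedish - stripped from the model name
        buy_words = ("Buy", "vásárlása", "Köp")
        for phone in phones:
            # span.price / span.capacity / span.color below are
            # placeholder selectors, not Apple's real markup
            name = clean_value(phone.css("h2::text").get(default=""))
            for word in buy_words:
                name = name.replace(word, "").strip()
            price = clean_value(phone.css("span.price::text").get(default=""))
            if not price:
                # Carrier-locked iPhones have no listed price - skip them
                continue
            yield {
                "country": country,
                "model": name,
                "capacity": clean_value(phone.css("span.capacity::text").get(default="")),
                "color": clean_value(phone.css("span.color::text").get(default="")),
                "price": price,
            }
```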
Let’s break down this quite verbose method into more manageable chunks. First, we put all the `div` tags with the class `details` into the `phones` variable. We will eat them one bite at a time.
What are these funny words like ‘vásárlása’ and ‘Köp’? These are the Hungarian and Swedish translations of ‘Buy’, respectively. It’s important to remove the ‘Buy’ verb from the model’s name, not only in the English version of the website but across all language variants. Next, we use the `clean_value` function to drop any unnecessary characters from the extracted values, ensuring clarity and consistency.
Now we can iterate over the `phones` collection and extract the price, capacity, and color of the iPhone models. We use the `get_country_code` function to extract the country code from the URL of the page. Remember that we don’t have prices for the carrier-locked iPhones? Consequently, we must ignore those entries in the model list.
I almost forgot to include the `clean_value` and `get_country_code` functions. Here they are:
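A minimal version of the two helpers, matching the description that follows (the exact implementation in the repository may differ):

```python
def get_country_code(url):
    """Extract the country code from an Apple store URL."""
    # apple.com/se/shop/... -> 'se'; the US store has no country
    # segment, its path starts with 'shop' instead
    segment = url.split("apple.com/")[1].split("/")[0]
    return "us" if segment == "shop" else segment


def clean_value(value):
    """Strip non-ASCII characters and surrounding whitespace."""
    # Drop non-ASCII noise first (e.g. the non-breaking spaces Apple
    # uses as thousands separators), then trim the whitespace
    return value.encode("ascii", errors="ignore").decode().strip()
```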
Nothing fancy. The `get_country_code` function does exactly what its name suggests: it extracts the country code from the URL of the page. For the US, we identify the country by the term ‘shop’ instead of a specific country code. The `clean_value` function removes the leading and trailing whitespace and any non-ASCII characters from the value.
Let’s see what our Spider looks like now. You can find the complete Spider, with all the pieces above put together, in the iPhone Prices Scrapy Spider repository linked in the Resources section at the top.
Testing the Spider 🕷️
It’s time to run the Spider and verify that it can successfully extract the details of the iPhone models from the Apple website:
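Scrapy’s built-in feed export makes this a one-liner; run it from the project root:

```shell
# Run the Spider and write the scraped items to a CSV file
scrapy crawl apple_website_spider -o iphone_prices.csv
```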
This command will execute the `apple_website_spider` Spider and save the extracted data into a CSV file named `iphone_prices.csv`. If everything goes well, you should see the extracted data in the CSV file.
Let’s peek into the CSV file:
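For instance, with the `head` command:

```shell
# Show the header row and the first few scraped records
head iphone_prices.csv
```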
If you ask me, it looks promising - but far from perfect. We have some issues with the extracted data. First of all, the colors are not consistent across the different countries. The capacities are not in the same format either. Lastly, the prices are formatted in different ways. Wouldn’t it be nice to have all the prices in the same currency - say, USD? Then we could compare them conveniently.
In the upcoming final instalment of this series, we will address these issues by refactoring our Spider into the structure of a real-world Scrapy project. Finally, we will incorporate it into our Jenkins pipeline.
Stay tuned! 👋