How To Scrape News from Reuters with Python & Scrapy

A step-by-step guide to mining news data

Kamen Zhekov
10 min read · May 22, 2023

Welcome to this practical guide on extracting news articles from the Reuters website using Python and Scrapy.

In the age of data-driven decision-making, having the ability to harvest news data can empower us in numerous ways, from understanding trending topics to conducting sentiment analysis and enhancing data journalism. Throughout this article, we’ll unlock the potential of web scraping, transforming raw data into meaningful insights.

Whether you’re a seasoned programmer or a coding novice, this step-by-step guide is designed to equip you with the skills needed to navigate the world of web scraping. Let’s dive right in!

An Introduction to Scrapy

Imagine having your own relentless digital agent, capable of navigating the vast labyrinth of the internet, accurately extracting and curating the precise data you require. Meet Scrapy, the powerful, open-source web crawling framework that makes this fantasy a reality.

At its core, Scrapy is a Python library specifically engineered for web scraping, a method tailored for swift and efficient extraction of large quantities of data from websites.

This tool opens up a world of possibilities for data analysis, trend prediction, and much more. Let’s delve deeper into its functionality and how it can be used to harness the true power of data.

First, a few notions:

  1. Spiders: Scrapy uses scripts known as ‘spiders’ to define how a website (or a group of websites) should be crawled and what information should be extracted. These spiders are like the blueprint for your scraping operation, outlining where to dig — the URLs, and what to look for — the data.
  2. Requests and Responses: The spiders send requests to specific URLs, which return responses containing the HTML of the webpage. In many scraping cases, we would use an intermediary such as ScrapeOps or ScrapingBee, which takes care of a lot of the issues you could face, such as getting blocked or running into Cloudflare protection.
  3. Selectors: Once a response is received, Scrapy uses selectors to extract the desired data. These are CSS or XPath expressions that pinpoint the exact HTML elements holding the information we want — a map leading straight to the treasure.
  4. Items: The extracted data is then stored in containers called ‘items’. Each item is a clean, organized collection of data ready for analysis — the polished gems from our data mining!

With these components, Scrapy is more than capable of tackling large-scale web scraping projects. Its ability to handle both the breadth (multiple websites) and depth (detailed data extraction) of web scraping tasks makes it a top choice for data miners globally. And today, we’re going to harness this power to dig into the treasure trove that is Reuters. Let’s get started!
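Before we turn to Reuters specifically, here is a minimal, self-contained sketch tying these four notions together. It crawls the public quotes.toscrape.com practice site (a common Scrapy playground, not part of our Reuters project): the spider sends requests, receives responses, applies CSS selectors, and yields items.

import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal spider illustrating spiders, requests, selectors and items."""
    name = "quotes"
    # The URLs Scrapy will send requests to
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The response holds the page HTML; selectors extract the data we want
        for quote in response.css("div.quote"):
            # Each yielded dict is an 'item' — a clean, organized record
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }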

Scraping Process for Reuters

The Scraping Targets

For our data mining journey, we’ve set our sights on a goldmine of information — the Reuters website. This article could easily be adapted to other news websites that function in a similar way, such as CNN or the New York Times, but we’ll use Reuters for this article. Specifically, we’re interested in scraping two types of pages from the website: the search results and individual articles.

Scraping Search Results

Our first spider, defined in the reuters.py file, describes how to extract information from Reuters’ search results. Given a specific search query — the same words you would use when searching on their website — the spider creates the URL for the search and gets a list of articles with their titles, dates, and topics. When run with Scrapy, it will output a CSV with a list of those articles, and that list will be used to scrape them one by one.

Reuters Search Results for “Ukraine War”

Scraping Specific Articles

Our second crawler, defined in reuters_article.py, is more specialized, built to dive into the depths of individual articles and extract specific data such as the article title, date, author, and the text from all paragraphs.

Reuters Article About The Ukraine War

With the combined powers of these two spiders, we’re primed and ready to mine the treasure troves of Reuters for every nugget of information we can find. The first one gets all the relevant articles for a search query, and the second one scrapes those articles one by one. So hold onto your hats, it’s time to get digging!

Setting up our tools

We will start our journey with a bit of exploration of the Reuters website. We will look for the elements that are of interest to us and save them in JSON configuration files that we will name selectors_reuters.json and selectors_reuters_article.json. These JSON files will then be loaded in our scrapers in order to extract the relevant information.

Here’s a brief glance at selectors_reuters.json:

{
  "titles": "a[data-testid=\"Heading\"]::text",
  "urls": "a[data-testid=\"Heading\"]::attr(href)",
  "dates": "time[data-testid=\"Body\"]::attr(datetime)",
  "topics": "span[data-testid=\"Text\"]::text"
}

And selectors_reuters_article.json:

{
  "title": "h1[data-testid=\"Heading\"]::text",
  "date": "time > span:nth-child(2)::text",
  "author": "div[class*=\"article-header__author\"]::text",
  "paragraphs": "p[data-testid*=\"paragraph-\"]::text"
}

Inside these JSON files, we have a series of CSS selectors. They target particular HTML elements on a webpage that hold the valuable data we seek. Take a look at the picture below for a more visual example.

Example of a CSS selector for an “input” element with class “form-control”

Each CSS selector captures a specific HTML element and its associated data. For example, in the image above, input.form-control is a selector for every input tag that has a form-control class.

In our JSON files, another example is "a[data-testid=\"Heading\"]::text", which targets the text within anchor tags <a> that have a data-testid attribute of "Heading".
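If you want to check how such a selector behaves before wiring it into a spider, you can try it out on a small HTML snippet with Scrapy’s Selector class. The fragment below is a made-up example for illustration, not real Reuters markup:

from scrapy.selector import Selector

# A made-up fragment mimicking the kind of markup our selectors target
html = '''
<div>
  <a data-testid="Heading" href="/world/example-article/">Example headline</a>
  <time data-testid="Body" datetime="2023-05-22T10:00:00Z">May 22, 2023</time>
</div>
'''

sel = Selector(text=html)
print(sel.css('a[data-testid="Heading"]::text').get())            # Example headline
print(sel.css('a[data-testid="Heading"]::attr(href)').get())      # /world/example-article/
print(sel.css('time[data-testid="Body"]::attr(datetime)').get())  # 2023-05-22T10:00:00Z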

I will not dive too deep into how to use CSS selectors here, but there are many resources out there that explain them very well, so check those out if you are interested.

Now, let’s dive into the code to see how these selectors work with our spiders to gather data!

Step 1: Defining the Spider & Loading the Selectors

Let’s start by defining the skeleton of the Scrapy spiders and reading the configuration defined above. The file structure for our project is going to look like this:

News Scraping File Structure

The config folder holds our selector configuration, the spiders folder holds the different spiders we are going to define in a minute, and the remaining files hold helper functions (postprocessing.py and preprocessing.py) or the Scrapy-specific settings.py, which contains all the project settings, such as activating pipelines, middlewares, etc.

When first starting work on a Scrapy spider, we at least need to define the start_requests and the parse functions — the first for defining which URLs will get scraped, and the second to define what information to extract after we receive the response.

In both Python spiders, we open our corresponding JSON file and use json.load() to transform the CSS selectors into a Python dictionary. Our search spider skeleton will then look like this:

import json

import scrapy
from scrapy.http import TextResponse


class ReutersSpider(scrapy.Spider):
    """
    A spider that crawls Reuters's list of article search results.
    """
    name = 'reuters'
    # Load the CSS selectors from the JSON configuration file
    with open("config/selectors_reuters.json") as selector_file:
        selectors = json.load(selector_file)

    def start_requests(self):
        pass

    def parse(self, response: TextResponse, **kwargs):
        pass

And the article spider skeleton looks like this (note the list_of_articles attribute, which points to the output file of the search spider):

# (same imports as reuters.py: json, scrapy and TextResponse)
class ReutersArticleSpider(scrapy.Spider):
    """
    A spider that crawls a list of articles individually, based on the Reuters
    search crawler. It does not require JS to render the article pages.
    """
    name = 'reuters_article'
    list_of_articles = "output/reuters.csv"
    with open("config/selectors_reuters_article.json") as selector_file:
        selectors = json.load(selector_file)

    def start_requests(self):
        pass

    def parse(self, response: TextResponse, **kwargs):
        pass

Now that our selectors and spiders are armed and ready for action, we just need to define the scraping logic.

Step 2: Crafting the Requests

The start_requests method in each spider determines the starting point of our treasure hunt. It is the method Scrapy calls to generate the initial requests — the URLs it will use as scraping targets. First, we need to define a few helper functions for our scrapers: one for creating the search URL, one for creating the ScrapeOps proxy URL, one for getting all URLs from a list of dictionaries, and one for reading the CSV output file of the search spider into a list of dictionaries.

A small note — you could skip the ScrapeOps proxy (although they have a generous free tier) if you feel like facing the challenges of scraping on your own, but bear in mind that you might run into issues that are not described in this article.

import csv
import os


def create_reuters_search_url(query):
    # Build the Reuters search URL for a given query (spaces become '+')
    return f"https://www.reuters.com/site-search/?query={query.replace(' ', '+')}"


def create_scrapeops_url(url, js=False, wait=False):
    # Wrap a target URL in the ScrapeOps proxy (requires the SCRAPEOPS_API_KEY environment variable)
    key = os.getenv("SCRAPEOPS_API_KEY")
    scraping_url = f"https://proxy.scrapeops.io/v1/?api_key={key}&url={url}"
    if js:
        scraping_url += "&render_js=true"
    if wait:
        scraping_url += f"&wait_for={wait}"
    return scraping_url


def get_urls_from_dict(list_of_dicts):
    # Collect the "url" values from a list of dictionaries
    urls = []
    for dict_with_url in list_of_dicts:
        if dict_with_url.get("url"):
            urls.append(dict_with_url.get("url"))
    return urls


def read_csv(csv_path: str):
    # Read a CSV file into a list of dictionaries, one per row
    items = []
    with open(csv_path) as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            items.append(row)
    return items
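For instance, calling the first helper with our query produces the same address you would see in the browser’s address bar when searching on Reuters:

print(create_reuters_search_url("ukraine war"))
# https://www.reuters.com/site-search/?query=ukraine+war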

In the search spider defined in reuters.py, we generate a search URL with a specific search query by using the first two helper functions we defined above. Here, we want to search for the keywords “ukraine war”:

def start_requests(self):
    start_urls = [
        create_scrapeops_url(create_reuters_search_url("ukraine war"), wait="time")
    ]
    for url in start_urls:
        yield scrapy.Request(url=url, callback=self.parse)

In reuters_article.py, we read the CSV output file of the search spider, get all the article URLs from it, and crawl the articles one by one using the remaining helper functions together with create_scrapeops_url.

def start_requests(self):
    start_urls = get_urls_from_dict(read_csv(self.list_of_articles))
    for url in start_urls:
        yield scrapy.Request(url=create_scrapeops_url(url), callback=self.parse)

Step 3: Parsing the Response

Now that we’ve reached the destination, it’s time to find the treasure. Our spiders utilize the selectors to mine data from the HTML response. Specifically, the parse method will use the CSS selectors that we loaded from our configuration file to extract the important elements from the HTML response one by one.

In the search spider defined in reuters.py, we will extract the titles, URLs, dates, and article topics using CSS selectors and create a JSON list of article items that contain that data:

def parse(self, response: TextResponse, **kwargs):
    titles = [clean(title, remove_special=False) for title in response.css(self.selectors["titles"]).getall()]
    urls = ["https://www.reuters.com" + url for url in response.css(self.selectors["urls"]).getall()]
    dates = [clean(description) for description in response.css(self.selectors["dates"]).getall()]
    topics = [clean(metadata) for metadata in response.css(self.selectors["topics"]).getall()]

    articles = []
    for title, url, date, topic in zip(titles, urls, dates, topics):
        article = {
            "title": title,
            "url": url,
            "date": date,
            "topic": topic,
        }
        articles.append(article)

    return articles

In the specific article spider defined in reuters_article.py, we will extract a single article’s title, text, publication date, and author, and create a JSON item that contains that data:

def parse(self, response: TextResponse, **kwargs):
    title = clean(response.css(self.selectors["title"]).get(), remove_special=False)
    paragraphs = response.css(self.selectors["paragraphs"]).getall()
    text = " ".join([clean(paragraph) for paragraph in paragraphs])
    date = clean(response.css(self.selectors["date"]).get())
    author = clean(response.css(self.selectors["author"]).get())
    url = get_source_url_from_scraping_url(response.request.url)

    return {
        "title": title,
        "url": url,
        "date": date,
        "author": author,
        "text": text,
    }
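The clean() and get_source_url_from_scraping_url() helpers live in the preprocessing.py and postprocessing.py files mentioned earlier. Their exact implementation isn’t shown in this article, but a minimal sketch of what they might look like — normalizing whitespace and recovering the original URL from the ScrapeOps proxy URL — could be:

import re
from urllib.parse import parse_qs, urlparse


def clean(text, remove_special=True):
    # Hypothetical sketch: normalize whitespace and optionally strip special characters
    if text is None:
        return ""
    text = " ".join(text.split())
    if remove_special:
        text = re.sub(r"[^A-Za-z0-9 .,:;'!?()-]", "", text)
    return text


def get_source_url_from_scraping_url(scraping_url):
    # Hypothetical sketch: recover the original "url" parameter from the proxy URL
    query = parse_qs(urlparse(scraping_url).query)
    return query.get("url", [scraping_url])[0]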

Step 4: Unleashing the Spiders

Once our Spiders are coded and ready to explore the wilderness of the web, it’s time to set them free. In this section, we’ll learn how to do just that using Scrapy’s command-line interface (CLI).

To run our spiders, we’ll use the following command:

scrapy crawl reuters -o output/reuters.csv && scrapy crawl reuters_article -o output/reuters_article.csv

Here’s a quick rundown of what’s happening in this command (the && simply chains the two crawls, so the second spider only starts once the first has finished writing its CSV):

  1. scrapy crawl reuters: This part of the command tells Scrapy to start our first spider named 'reuters'. The crawl command is the green light for our spider to start its journey into the depths of the Reuters search results.
  2. -o output/reuters.csv: This is where we specify the output for the data our spider collects. The -o option tells Scrapy to output the data in a specific format and location. In this case, we're storing the scraped data as a CSV file in the 'output' directory with the filename 'reuters.csv'.
  3. scrapy crawl reuters_article: And this will launch our second spider, 'reuters_article'. This specialist dives into the individual articles from the search results extracted by the first spider and extracts the article-specific details.
  4. -o output/reuters_article.csv: Similar to before, this part of the command specifies where the 'reuters_article' spider should store its collected data. Again, we're choosing a CSV file in the 'output' directory, this time with the filename 'reuters_article.csv'.

And voila! With a simple command, we’ve launched two sophisticated web-crawlers on their data-hunting expedition. As they navigate through the web, they’ll gather valuable data and organize it neatly in CSV files, ready for your analysis. Happy crawling!
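As a quick follow-up, here is one way you might load the resulting article data for analysis. This uses pandas, which is not part of the Scrapy project itself and would need to be installed separately:

import pandas as pd

# Load the article-level output produced by the second spider
articles = pd.read_csv("output/reuters_article.csv")
print(articles[["title", "date", "author"]].head())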

Conclusion

And there we have it — a step-by-step guide on building your own web-crawling expedition with Scrapy to mine the vast data landscapes of Reuters. From understanding the power of Scrapy and CSS selectors to unleashing the spiders into the wilderness of the web, we’ve traversed quite a journey. It’s amazing what a blend of Python and the Scrapy library can achieve in terms of data collection and processing.

But our adventure doesn’t stop here. The true power of web scraping lies in its versatility and scalability. With the basic understanding you now possess, you can tailor your spiders to explore other websites, dive into different article types, or dig for other pieces of information.

So equip yourself with these newfound skills, and continue venturing into the rich, untapped mines of web data.

But remember, just like any adventure, always be respectful to the native land (or in this case, the website’s robots.txt file). Only scrape data you have permission to access. Happy data hunting!

