
Tutorial: Processing an RSS feed with Python and storing the results in S3 using boto3

In this article, we will create a simple ETL (Extract, Transform, Load) script that extracts headlines and links from a Dutch news RSS feed. The script then generates a semicolon-separated CSV file and automatically uploads it to an S3 bucket using the boto3 library.

For this tutorial it is required that:

  • You have Python, pip and boto3 installed. I used Python 3.8.2 and pip 19.2.3 for this tutorial;
  • You have the AWS CLI installed and configured correctly (see the quick credentials check after this list);
  • You have already created an S3 bucket.
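
If you want to verify that boto3 actually picks up your AWS credentials before running the script, a quick optional check like the one below can help. It is not part of the ETL script itself; it simply calls the STS GetCallerIdentity API and prints the account and ARN of the configured identity.

import boto3

# Optional sanity check: confirm boto3 can find valid AWS credentials.
sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(identity['Account'], identity['Arn'])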

We will create the script step by step.

First we install beautifulsoup4 with pip, the package installer for Python.

pip install beautifulsoup4

To support parsing XML files we also install the lxml package.

pip install lxml

We also install the requests package, which we will use in combination with BeautifulSoup to fetch the feed.

pip install requests
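
To confirm that the three packages were installed correctly, you can optionally run a quick import check in a Python shell (just a sanity check, not part of the script):

# If these imports succeed, the packages are available
import bs4
import lxml
import requests

print('beautifulsoup4', bs4.__version__)
print('requests', requests.__version__)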

Now we can start coding. So start an IDE or your favorite text editor.

First we need some imports for logging, creating CSV files, communicating with AWS and making HTTP requests. Finally we also import BeautifulSoup from the bs4 package, which we will use for parsing the RSS feed.

import logging
import csv
import boto3
import requests
from bs4 import BeautifulSoup

Since we want to do some logging and also want to make sure the log messages are flushed in time, we configure this in our Python script:

# So later on we can do some manual flushing when doing logging
if len(logging.getLogger().handlers) > 0:
    logging.getLogger().setLevel(logging.INFO)
else:
    logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

We now have a logger that we can use to log messages. Next we will create a function that:

  • Connects to a known RSS feed with tech news and obtains its contents;
  • Parses the RSS items using BeautifulSoup and stores only the title and link of each item in a list of items to be returned.

The listing of this can be seen below:

def get_data_from_rss():
    URL = "https://feeds.nos.nl/nosnieuwstech"
    # These headers were obtained by doing a request to the site and using the inspector of the browser (Firefox in this case)
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '7200',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:106.0) Gecko/20100101 Firefox/106.0'
    }
    # Pass the headers as a keyword argument so requests sends them as HTTP headers
    request = requests.get(URL, headers=headers)
    xmlSoup = BeautifulSoup(request.content, features='xml')
    items = xmlSoup.find_all('item')
    rss_data_items = []
    for item in items:
        title = [title.text for title in item.select("title")]
        link = [link.text for link in item.select("link")]
        rss_item = {}
        rss_item['title'] = title[0]
        rss_item['link'] = link[0]
        rss_data_items.append(rss_item)
    return rss_data_items
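
To get a feel for what the function returns, you can call it directly from a temporary test line; each entry in the returned list is a dictionary with a 'title' and a 'link' key (the values shown in the comment are placeholders, not real feed content):

rss_items = get_data_from_rss()
print(len(rss_items))
print(rss_items[0])  # e.g. {'title': '<headline>', 'link': '<article url>'}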

Now that we have a list of parsed items, where each item contains a title and a link field, we will create a function that takes such a list and creates a CSV file from it:

def create_csv_file(rss_items):
    filename = 'news_and_links.csv'
    with open(filename, 'w', newline='') as f:
        w = csv.DictWriter(f, delimiter=";", fieldnames=['title', 'link'])
        w.writeheader()
        for rss_item in rss_items:
            w.writerow(rss_item)
    return filename
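
If you want to inspect the result, the first line of the generated file is the header row title;link, followed by one semicolon-separated line per RSS item:

with open('news_and_links.csv') as f:
    print(f.readline().strip())  # prints the header row: title;link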

This function creates a CSV file where the fields are separated using semicolons as delimiter. It returns the filename of the file that was created. Next we will create a function that, given a filename, uploads that file to a hardcoded bucket.

def upload_file_to_s3(filename):
    # Create an S3 client. We assume that you have boto3 and the AWS CLI correctly installed and configured
    s3client = boto3.client('s3')
    # When uploading the file to the bucket we let the object name
    # in this case be the same as the filename to upload
    s3client.upload_file(filename, "BUCKET_NAME", filename)
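
If you want to verify that the object actually arrived in the bucket, a quick optional check with head_object can be used (replace BUCKET_NAME with your own bucket name, just like in the function above); the call raises a ClientError if the object does not exist:

s3client = boto3.client('s3')
# head_object returns the object's metadata without downloading it
response = s3client.head_object(Bucket="BUCKET_NAME", Key="news_and_links.csv")
print(response['ContentLength'])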

Finally, to make it all work, we create a main function with some extra logging and make sure it is invoked automatically:

def main():
    logger.info('Start ETL script\n\n')
    logger.handlers[0].flush()
    rss_items = get_data_from_rss()
    filename = create_csv_file(rss_items)
    logger.info('\n\nFile ' + filename + ' created. Starting upload to S3')

    # We assume that you already have set up a bucket in AWS and have
    # your current AWS CLI config configured correctly for using boto3
    upload_file_to_s3(filename)
    logger.handlers[0].flush()
    logger.info('\n\nFile ' + filename + ' uploaded to S3')

if __name__ == '__main__':
    main()
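
To run the whole ETL script, save it to a file (the name is up to you; rss_to_s3.py is just an example) and execute it with Python:

python rss_to_s3.py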

You can download the full listing of the script here. Or reach out to us and we will help you out!

Like this article and want to stay updated on more news and events?
Then sign up for our newsletter!
