
Python Web Scraping: A Beginner's Guide

Romano Marinić, Junior Software Developer
09.11.2023.

Web scraping is a powerful tool for extracting information from websites. It's used in a variety of fields, from data science to digital marketing. Python, with its rich ecosystem of libraries, is one of the most popular languages for web scraping. In this article, we’ll guide you through the basics of web scraping using Python with a practical example.

What is Web Scraping?

Web scraping is the process of programmatically retrieving information from websites. This is done by making HTTP requests to the website and then parsing the HTML of the webpage to extract the data you need.

Why Python?

Python is a favorite for web scraping because of its ease of use and the powerful libraries it offers, such as Beautiful Soup and Scrapy. These libraries provide tools for navigating and searching the document parse tree, which is crucial in web scraping.

A Simple Example: Scraping a Recipe Website

For our example, we'll scrape a recipe from a cooking website: bestrecipes.com.au. Let’s say we want to scrape the ingredients and cooking instructions for a chocolate cake recipe.

Step 1: Inspecting the Web Page

Before writing any code, we need to understand the structure of the website. Using a browser, we navigate to the chocolate cake recipe page and inspect the HTML structure. We notice that each ingredient is listed within a <div> tag with the class ingredient-description, and each instruction step is in a <div> tag with the class recipe-method-step-content.

Step 2: Making an HTTP Request

Python’s requests library allows us to make HTTP requests. Here's how you’d make a request to our example website:

import requests

url = 'https://www.bestrecipes.com.au/recipes/best-chocolate-cake-recipe/b02q0fm7'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
else:
    raise Exception("Failed to retrieve the webpage")
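
Note that some sites respond differently to requests that don't appear to come from a browser, and a hung connection will otherwise block the script indefinitely. A common adjustment, sketched below with a placeholder User-Agent string, is to pass a timeout and a browser-like header:

headers = {'User-Agent': 'Mozilla/5.0 (compatible; recipe-scraper-tutorial)'}
response = requests.get(url, headers=headers, timeout=10)  # Give up after 10 seconds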

Step 3: Parsing the HTML Content

Parsing the HTML content is akin to translating a foreign language. We take the HTML, which is structured for browsers to display content, and translate it into a format that our Python code can understand and navigate. Beautiful Soup is the translator in this scenario.

from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

Once we have a parse tree, we can navigate it using various Beautiful Soup methods and attributes. Here are some of the most common:

  • find_all(): Searches for all instances of a tag that match the given criteria.
  • find(): Finds the first instance of a tag that matches the given criteria.
  • children and descendants: These attributes can be used to navigate through a tag's children or through all of its descendants.
  • parent and parents: To move up the tree, you can use these attributes.
  • next_sibling and previous_sibling: These attributes let you navigate between tags at the same level of the parse tree.

Each of these navigational tools can be used to zero in on the parts of the HTML document that contain the data you're interested in.
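
To make these concrete, here is a small self-contained sketch that runs on a hand-written HTML snippet (modeled on the structure we assumed above) rather than the live site:

from bs4 import BeautifulSoup

html = """
<div class="recipe">
  <h2>Chocolate Cake</h2>
  <div class="ingredient-description">200g dark chocolate</div>
  <div class="ingredient-description">3 eggs</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match; find_all() returns every match
title = soup.find('h2')
ingredients = soup.find_all('div', class_='ingredient-description')
print(title.get_text())                     # Chocolate Cake
print([i.get_text() for i in ingredients])  # ['200g dark chocolate', '3 eggs']

# parent moves up the tree
print(title.parent['class'])  # ['recipe']

# next_sibling includes whitespace text nodes, so find_next_sibling() is often handier
print(title.find_next_sibling('div').get_text())  # 200g dark chocolate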

Step 4: Extracting the Data

Extracting data is a pivotal step in the web scraping process, and it’s where the power of Beautiful Soup shines. Once we've navigated through the HTML structure, we need to extract the relevant information. Here's a more detailed breakdown:

# Extracting ingredients
ingredients = []
for item in soup.find_all('div', class_='ingredient-description'):
    ingredients.append(item.get_text().strip())

# Extracting instructions
instructions = []
for step in soup.find_all('div', class_='recipe-method-step-content'):
    instructions.append(step.get_text().strip())
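
Class names like these are tied to the site's current layout and can change without notice, so it's worth checking that the selectors actually matched something before going further. A minimal sanity check, assuming the variables from the snippet above:

if not ingredients or not instructions:
    raise Exception("Selectors matched nothing - the page layout may have changed")

print(f"Found {len(ingredients)} ingredients and {len(instructions)} steps")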

Step 5: Saving the Data

Once we've extracted the data, we need to store it in a format that's useful for us. JSON is a common format because it's both human-readable and machine-readable. However, sometimes you might want to save your data in a CSV file, which can be easily imported into a spreadsheet for analysis. Here’s how you can expand this step:

import json
import csv

recipe = {
    'title': 'Chocolate Cake',
    'ingredients': ingredients,
    'instructions': instructions
}

# Save to a JSON file
with open('chocolate_cake_recipe.json', 'w') as jsonfile:
    json.dump(recipe, jsonfile, indent=4)

# Save to a CSV file
with open('chocolate_cake_recipe.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['title', 'ingredients', 'instructions']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({
        'title': recipe['title'],
        'ingredients': '|'.join(recipe['ingredients']),  # Using '|' as a separator for ingredients
        'instructions': '\n'.join(recipe['instructions'])  # Joining instructions with a newline character
    })

We’ve included options to save the data as both a JSON and a CSV file. The JSON format preserves the structure of our data, while the CSV format is more suitable for tabular data and can be easily opened with spreadsheet software. We've also demonstrated how to handle lists when writing to a CSV file—joining list items into a single string, with a chosen separator.
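
As a quick sanity check, you can read the JSON file back and confirm the structure survived the round trip:

import json

with open('chocolate_cake_recipe.json') as jsonfile:
    saved = json.load(jsonfile)

print(saved['title'])  # Chocolate Cake
print(len(saved['ingredients']), 'ingredients,', len(saved['instructions']), 'steps')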

Further Expansion

If the scraping task grows in complexity or needs to be performed regularly, you might consider setting up a more robust system. This could involve:

  • Error Handling: Implementing try-except blocks to gracefully handle any errors that occur during the scraping process (a minimal sketch combining this with logging follows this list).
  • Logging: Keeping logs of your scraping sessions can help with debugging and tracking the scraping history.
  • Database Storage: For large or regularly updated datasets, storing the scraped data in a database might be more efficient.
  • Automated Running: Using tools like cron jobs (on Linux) or Task Scheduler (on Windows) to run your scraping script at regular intervals.
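
As a sketch of the first two points, here is one way to wrap the request in a try-except block and record the outcome with Python's built-in logging module (the log file name is just a placeholder):

import logging
import requests

logging.basicConfig(filename='scraper.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

url = 'https://www.bestrecipes.com.au/recipes/best-chocolate-cake-recipe/b02q0fm7'

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raises an HTTPError for 4xx/5xx responses
    logging.info('Fetched %s (%d bytes)', url, len(response.content))
except requests.RequestException as exc:
    # RequestException covers connection errors, timeouts, and HTTP errors alike
    logging.error('Failed to fetch %s: %s', url, exc)

A script like this can then be scheduled with cron or Task Scheduler to run unattended, with the log file serving as the scraping history.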

Ethics and Legality

It’s crucial to note that web scraping can be legally and ethically questionable. Always check the website’s robots.txt file and terms of service to ensure you’re not violating any rules. Moreover, scraping should be done responsibly to avoid overloading the website’s server.
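
Python's standard library can help with the robots.txt check: urllib.robotparser fetches and parses the file and tells you whether a given URL may be crawled. A minimal sketch, using '*' to mean any user agent:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url('https://www.bestrecipes.com.au/robots.txt')
robots.read()  # Fetch and parse the site's robots.txt

url = 'https://www.bestrecipes.com.au/recipes/best-chocolate-cake-recipe/b02q0fm7'
if robots.can_fetch('*', url):
    print('robots.txt allows fetching this page')
else:
    print('robots.txt disallows fetching this page')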

Conclusion

Mastering Python web scraping unlocks a treasure trove of data. It lets you harvest information from across the internet and transform it into actionable insights and valuable datasets, and Python, with its rich ecosystem of libraries and tools, is exceptionally well suited to the task.

Our walkthrough, from inspecting the webpage and parsing the HTML to extracting the data and saving it in a structured format, covers the core of the web scraping workflow. It also serves as a foundation on which you can build more complex and sophisticated scrapers.

In conclusion, Python web scraping is not just about writing code; it's about building a bridge between the wealth of data available online and the potential it holds. So equip yourself with Python, respect the rules of the digital world, and step into the future of data.
