Day 3 of AI and Automation

Web Scraping with Python (BeautifulSoup)

Hey Guys!!

Welcome to Day 3 of the AI and Automation challenge!

If you missed the first two posts, they are linked below!!

Day 1

Day 2

With the world going digital and so much of its data stored online, we need a way to access that data if we want to perform tasks like market research. That's where a technique called web scraping comes in.

Web scraping is the technique of automatically collecting data from websites. In this comprehensive tutorial, we'll cover the basics of web scraping and detail how to use the popular Python library Beautiful Soup step-by-step.

Overview of Web Scraping

The main components of a web scraper are:

  • Making requests to download web page content

  • Parsing the HTML content to identify relevant data

  • Extracting the data from the HTML

  • Structuring and storing the scraped data

Python has several libraries to help with these tasks. We'll focus on Beautiful Soup, which simplifies parsing and searching HTML documents.

Let’s build a web scraper with the Beautiful Soup library!!

First, install Beautiful Soup:

pip install beautifulsoup4  

This gives us the power to parse HTML and navigate it with Python.

Let's grab the homepage HTML from Wikipedia:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, "html.parser")

Now we can find and extract different elements:

Links

links = soup.find_all("a")

for link in links:
    # Use .get() because some <a> tags (e.g. named anchors) have no href
    print(link.get("href"))

This prints all the link URLs.
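Note that many of those hrefs are relative paths like `/wiki/Portal:Arts` rather than full URLs. A quick sketch of resolving them against the page URL with the standard library's `urljoin` (the href here is just an illustrative example):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Main_Page"

# Resolve a relative href against the page it was scraped from
full_url = urljoin(base, "/wiki/Portal:Arts")
print(full_url)  # https://en.wikipedia.org/wiki/Portal:Arts
```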

Images

images = soup.find_all("img")

for img in images:
    print(img.get("src"))

This prints the src attribute of each image, which is usually a URL pointing to where the image is hosted.

Text 

text = soup.get_text() 
print(text)

This prints all visible text on the page.

We can also isolate specific text elements:

headlines = soup.find_all("span", class_="mw-headline")

for headline in headlines:
    print(headline.text)

This prints just the Wikipedia section headings.

As you can see, Beautiful Soup makes it really easy to dive into HTML documents and extract the bits we want!!!

Now there are a few more things to keep in mind before you go off to build your own web scraper.

Properly Storing Scraped Data in CSVs

When scraping large datasets, it's important to properly store the extracted data. One popular format is CSV files. CSVs allow storage in a spreadsheet-like tabular format, which can be easily exported and opened in other programs.

To save scraped data as CSVs in Python, we can use the csv module. For example:

import csv 

# scraped_items is assumed to be a list of objects with .name and .url attributes
with open('scraped_data.csv', 'w', newline='') as csv_file:
    fieldnames = ['name', 'url']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    for item in scraped_items:
        writer.writerow({'name': item.name, 'url': item.url})

This allows writing each scraped item to a new row in the CSV, with columns defined by fieldnames. CSV files can be opened in any spreadsheet program for analysis. They also work nicely with databases and other data pipelines.

CSV isn't the only option - JSON, Excel, databases, and other formats may also be appropriate depending on use case. But CSV offers a simple way to export structured scraped data.

Cleaning Scraped Data

Real-world data scraped from websites is often messy. Here are some techniques to clean it up:

  • Strip HTML tags - Convert tags to text with BeautifulSoup or regex

  • Handle missing data - Set default values for blank entries

  • Remove duplicates - Deduplicate since many sites have redundancies

  • Normalize text - Lowercase, trim whitespace, etc. for consistency

  • Validate/filter records - Discard or fix entries that don't match expected schema

  • Convert data types - Force strings to ints, floats, dates as needed
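Several of the techniques above can be shown in a few lines. Here is a small sketch using made-up scraped rows (the field names and values are hypothetical) that normalizes text, fills in missing data, converts types, and deduplicates:

```python
# Hypothetical scraped rows: messy names, one duplicate, one missing price
rows = [
    {"name": "  Widget ", "price": "19.99"},
    {"name": "widget",    "price": "19.99"},  # duplicate after normalization
    {"name": "Gadget",    "price": ""},       # missing price
]

cleaned, seen = [], set()
for row in rows:
    name = row["name"].strip().lower()                    # normalize text
    price = float(row["price"]) if row["price"] else 0.0  # convert type, default missing
    if name in seen:                                      # remove duplicates
        continue
    seen.add(name)
    cleaned.append({"name": name, "price": price})

print(cleaned)
# [{'name': 'widget', 'price': 19.99}, {'name': 'gadget', 'price': 0.0}]
```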

Here is an example of how to strip HTML tags from scraped data using BeautifulSoup in Python:

from bs4 import BeautifulSoup

html = """<p>This is a <b>paragraph</b> with <i>some</i> HTML tags</p>"""

soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

print(text)

This would print:

This is a paragraph with some HTML tags

The key is using BeautifulSoup's get_text() method. This will extract just the visible text from the HTML, stripping out any tags.

A few points:

  • get_text() returns a single string - pass a separator and strip=True (e.g. get_text(" ", strip=True)) to condense whitespace

  • To preserve structure, iterate over elements and call get_text() on each

  • You can also call stripped_strings to get a generator of text chunks

  • To remove specific tags only, use decompose() on those tag objects
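The last two points can be sketched on a toy snippet of HTML, removing an unwanted tag with decompose() and then collecting the clean text chunks with stripped_strings:

```python
from bs4 import BeautifulSoup

html = "<div><script>var x = 1;</script><p>Hello</p> <p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# decompose() removes a tag (and its contents) from the tree entirely
for tag in soup.find_all("script"):
    tag.decompose()

# stripped_strings yields each text chunk with surrounding whitespace removed
print(list(soup.stripped_strings))  # ['Hello', 'World']
```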

Regular expressions can also help clean up and normalize text:

import re

text = re.sub(r'\s+', ' ', text) # Condense whitespace

Some other regex examples:

text = re.sub(r'<.*?>', '', text) # Strip all tags
text = re.sub(r'[\r\n\t]', '', text) # Remove newlines, tabs

With a bit of data wrangling, we can get scraped content ready for analysis and usage in other applications. Always beware of incorrect assumptions and over-cleaning though!

Ethics of Web Scraping

As scrapers, we should be mindful of ethical data collection. Here are some principles to follow:

  • Obey robots.txt rules - Don't scrape sites that forbid it

  • Don't overload sites - Use throttling/delays to avoid crashing servers

  • Credit sources - Make sure to cite where data came from

  • Don't violate Terms of Service - Stay away from data you shouldn't access

  • Use data responsibly - Don't be evil or careless with what you collect

It's also courteous to identify yourself in user agent strings so sites can contact you if needed. Whenever possible, try reaching out to site owners before large scrapes.
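A minimal sketch of putting a few of these principles into practice with the standard library's robots.txt parser: the bot name and contact address below are hypothetical placeholders, and the robots.txt rules are parsed inline here rather than fetched from a real site.

```python
import time
import urllib.robotparser

# Parse example robots.txt rules (normally fetched from https://site/robots.txt)
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A descriptive user agent lets site owners identify and contact you
assumed_agent = "ExampleScraper/1.0 (you@example.com)"

print(rp.can_fetch(assumed_agent, "https://example.com/public/page"))   # True
print(rp.can_fetch(assumed_agent, "https://example.com/private/page"))  # False

headers = {"User-Agent": assumed_agent}  # pass via requests.get(url, headers=headers)
time.sleep(1)  # pause between requests so you don't overload the server
```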

Remember that "just because you can scrape it doesn't mean you should." Be responsible!

Always remember, with great power comes great responsibility!!

Typharius
The Aivelution