Day 3 of AI and Automation
Web Scraping with Python (BeautifulSoup)
Hey Guys!!
Welcome to Day 3 of the AI and Automation challenge!
If you missed the first two posts, they are linked below!!
Day 1
Day 2
With the entire world going digital and so much of the world's data stored online, we need a way to access that data for tasks like market research. So, we use a technique called web scraping to get the data we want.
Web scraping is the technique of automatically collecting data from websites. In this comprehensive tutorial, we'll cover the basics of web scraping and detail how to use the popular Python library Beautiful Soup step-by-step.
Overview of Web Scraping
The main components of a web scraper are:
Making requests to download web page content
Parsing the HTML content to identify relevant data
Extracting the data from the HTML
Structuring and storing the scraped data
Python has several libraries to help with these tasks. We'll focus on Beautiful Soup, which simplifies parsing and searching HTML documents.
Let’s build a web scraper with the Beautiful Soup library!!
First, install Beautiful Soup:
pip install beautifulsoup4

This gives us the power to parse HTML and navigate it with Python. (We'll also use the requests library to download pages; if you don't have it yet, install it with pip install requests.)
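To make sure everything installed correctly, here's a quick sanity check on a tiny made-up HTML snippet:

from bs4 import BeautifulSoup

# Parse a tiny HTML snippet to confirm the install works
soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
print(soup.p.get_text())  # prints: Hello, world!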
Let's grab the homepage HTML from Wikipedia:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(page.content, "html.parser")

Now we can find and extract different elements:
Links
links = soup.find_all("a")
for link in links:
    print(link.get("href"))  # .get() avoids a KeyError if an <a> tag has no href

This prints all the link URLs.
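One thing to note: many of these hrefs will be relative paths like /wiki/Python rather than full URLs. If you want absolute URLs, the standard library's urljoin can resolve them; here's a small sketch of that idea:

from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/wiki/Main_Page"
for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # skip anchor tags that have no href attribute
        print(urljoin(base_url, href))  # turns relative paths into absolute URLs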
Images
images = soup.find_all("img")
for img in images:
    print(img.get("src"))

This prints the src of each image, which is usually a URL pointing to wherever the image is hosted.
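If you actually want to download one of those images, you can fetch the resolved URL with requests. A minimal sketch (the filename first_image.jpg is just a placeholder; the real format depends on the image):

import requests
from urllib.parse import urljoin

img = soup.find("img")
if img and img.get("src"):
    # Wikipedia srcs are often protocol-relative (//upload.wikimedia.org/...), so resolve them first
    img_url = urljoin("https://en.wikipedia.org/wiki/Main_Page", img["src"])
    with open("first_image.jpg", "wb") as f:  # placeholder filename
        f.write(requests.get(img_url).content)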
Text
text = soup.get_text()
print(text)

This prints all visible text on the page.
We can also isolate specific text elements:
headlines = soup.find_all("span", class_="mw-headline")
for headline in headlines:
    print(headline.text)

This prints just the Wikipedia section headings.
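As an aside, Beautiful Soup also supports CSS selectors via select(), which can express the same query more compactly:

# Same query as above, written as a CSS selector
for headline in soup.select("span.mw-headline"):
    print(headline.get_text())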
As you can see, Beautiful Soup makes it really easy to dive into HTML documents and extract the bits we want!!!
Now there are a few more things to keep in mind before you go off to build your own web scraper.
Properly Storing Scraped Data in CSVs
When scraping large datasets, it's important to properly store the extracted data. One popular format is CSV files. CSVs allow storage in a spreadsheet-like tabular format, which can be easily exported and opened in other programs.
To save scraped data as CSVs in Python, we can use the csv module. For example:
import csv
with open('scraped_data.csv', 'w', newline='') as csv_file:  # newline='' avoids blank rows on Windows
    fieldnames = ['name', 'url']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    for item in scraped_items:
        writer.writerow({'name': item.name, 'url': item.url})

This writes each scraped item to a new row in the CSV, with columns defined by fieldnames. (Here scraped_items stands in for whatever list of objects your scraper produced.) CSV files can be opened in any spreadsheet program for analysis. They also work nicely with databases and other data pipelines.
CSV isn't the only option - JSON, Excel, databases, and other formats may also be appropriate depending on use case. But CSV offers a simple way to export structured scraped data.
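For example, the same hypothetical scraped_items list from the CSV snippet could be exported as JSON with the standard library's json module:

import json

# Assumes the same hypothetical scraped_items list used in the CSV example
records = [{'name': item.name, 'url': item.url} for item in scraped_items]

with open('scraped_data.json', 'w') as json_file:
    json.dump(records, json_file, indent=2)  # indent=2 makes the file human-readable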
Cleaning Scraped Data
Real-world data scraped from websites is often messy. Here are some techniques to clean it up:
Strip HTML tags - Convert tags to text with BeautifulSoup or regex
Handle missing data - Set default values for blank entries
Remove duplicates - Deduplicate since many sites have redundancies
Normalize text - Lowercase, trim whitespace, etc. for consistency
Validate/filter records - Discard or fix entries that don't match expected schema
Convert data types - Force strings to ints, floats, dates as needed
Here is an example of how to strip HTML tags from scraped data using BeautifulSoup in Python:
from bs4 import BeautifulSoup
html = """<p>This is a <b>paragraph</b> with <i>some</i> HTML tags</p>"""
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()
print(text)

This would print:

This is a paragraph with some HTML tags

The key is using BeautifulSoup's get_text() method. This will extract just the visible text from the HTML, stripping out any tags.
A few points:

get_text() returns one big string, including whatever newlines and whitespace the page contained
To preserve structure, iterate over elements and call get_text() on each
You can also call stripped_strings to get a generator of text chunks
To remove specific tags only, call decompose() on those tag objects
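Here's a small sketch showing the last two of those in action, on a made-up snippet of HTML:

from bs4 import BeautifulSoup

html = "<p>Keep this <script>alert('but not this')</script> and <b>this</b> too</p>"
soup = BeautifulSoup(html, "html.parser")

# decompose() removes a tag and everything inside it from the tree
for script in soup.find_all("script"):
    script.decompose()

# stripped_strings yields each text chunk with surrounding whitespace removed
for chunk in soup.stripped_strings:
    print(chunk)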
Regular expressions can also help clean up and normalize text:
import re
text = re.sub(r'\s+', ' ', text)  # Condense whitespace

Some other regex examples:

text = re.sub(r'<.*?>', '', text)  # Strip all tags
text = re.sub(r'[\r\n\t]', '', text)  # Remove newlines, tabs

With a bit of data wrangling, we can get scraped content ready for analysis and usage in other applications. Always beware of incorrect assumptions and over-cleaning though!
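To tie several of those cleaning steps together, here's a minimal sketch over a made-up list of scraped rows (the data and field names are purely illustrative):

# Hypothetical raw rows: (name, price) pairs straight from a scraper
raw_rows = [("  Widget A ", "19.99"), ("widget a", "19.99"), ("Widget B", "N/A")]

cleaned = []
seen = set()
for name, price in raw_rows:
    name = name.strip().lower()   # normalize text
    if name in seen:              # remove duplicates
        continue
    seen.add(name)
    try:
        price = float(price)      # convert data types
    except ValueError:
        price = None              # default value for missing/bad data
    cleaned.append({"name": name, "price": price})

print(cleaned)  # [{'name': 'widget a', 'price': 19.99}, {'name': 'widget b', 'price': None}]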
Ethics of Web Scraping
As scrapers, we should be mindful of ethical data collection. Here are some principles to follow:
Obey robots.txt rules - Don't scrape sites that forbid it
Don't overload sites - Use throttling/delays to avoid crashing servers
Credit sources - Make sure to cite where data came from
Don't violate Terms of Service - Stay away from data you shouldn't access
Use data responsibly - Don't be evil or careless with what you collect
It's also courteous to identify yourself in user agent strings so sites can contact you if needed. Whenever possible, try reaching out to site owners before large scrapes.
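Putting a few of those principles into code, here's a sketch of a polite scraping loop: it checks robots.txt with the standard library's urllib.robotparser, sends a descriptive User-Agent (the bot name and contact address below are made up), and sleeps between requests:

import time
import requests
from urllib.robotparser import RobotFileParser

# Check robots.txt before fetching anything
robots = RobotFileParser("https://en.wikipedia.org/robots.txt")
robots.read()

# A descriptive User-Agent so site owners can identify (and contact) you
headers = {"User-Agent": "MyScraperBot/1.0 (contact: you@example.com)"}

urls = ["https://en.wikipedia.org/wiki/Main_Page"]  # whatever pages you plan to scrape
for url in urls:
    if robots.can_fetch(headers["User-Agent"], url):
        page = requests.get(url, headers=headers)
        # ... parse page.content with Beautiful Soup here ...
    time.sleep(1)  # throttle: one request per second is a common courtesy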
Remember that "just because you can scrape it doesn't mean you should." Be responsible!
Always remember, with great power comes great responsibility!!
Typharius
The Aivelution