Web scraping is the process of extracting data from websites. It’s a powerful tool for gathering information that isn’t readily available in a structured format. Whether you’re tracking stock prices, collecting news articles, or monitoring product prices, web scraping allows you to automate data collection. However, gathering data is only the first step. Once you have the data, you need to store it in a way that makes it easy to access and analyze. This is where SQLite, a lightweight and easy-to-use database, comes into play.
In this article, we’ll explore how to scrape data from a fictitious website and store it in an SQLite database using Python.
Why Use Web Scraping?
Web scraping is valuable when you need to collect data that isn’t available via an API. For example, job listings, product prices, or even weather data can often be found on websites but aren’t always accessible through traditional data sources. By automating the process of data collection, you can save time and ensure that your data is always up to date.
Tools for Web Scraping
To scrape data, we’ll use the requests library to fetch the web page content and BeautifulSoup from the bs4 package to parse the HTML. After scraping, we’ll use SQLite to store the data in a structured format. These tools are easy to use and work well together.
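To see how the two libraries fit together before we build the full scraper, here is a minimal sketch that fetches a page and prints its title. The URL is just a placeholder; any reachable page works.

import requests
from bs4 import BeautifulSoup

# Fetch the page (the timeout guards against a hung connection)
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # Raise an error for 4xx/5xx responses

# Parse the HTML and print the text of the <title> tag
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)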
Real-Life Example: Scraping Job Listings
Suppose you want to track job openings for data science positions on a job board website. You can scrape the job listings, extract relevant details, and store them in an SQLite database for later use.
1. Set Up the Project
First, create a new Python file named job_scraper.py. This file will contain all the code needed to scrape the job listings and save them to a database.
touch job_scraper.py
2. Install Required Libraries
Before you start coding, make sure you have the necessary libraries installed. You can install them using pip (sqlite3 ships with Python’s standard library, so it doesn’t need to be installed separately):
pip install requests beautifulsoup4
3. Write the Scraping Code
Now, open job_scraper.py in your preferred text editor and write the following code:
import requests
from bs4 import BeautifulSoup
import sqlite3
import time

# URL of the job listings page
URL = "https://example.com/jobs/data-scientist"

# SQLite database file
DB_FILE = "jobs.db"

def create_database():
    """Create the job_listings table if it doesn't already exist."""
    conn = sqlite3.connect(DB_FILE)
    c = conn.cursor()
    # The UNIQUE constraint on url lets us skip listings we've already stored
    c.execute('''CREATE TABLE IF NOT EXISTS job_listings
                 (id INTEGER PRIMARY KEY, title TEXT, company TEXT,
                  location TEXT, date_posted TEXT, url TEXT UNIQUE)''')
    conn.commit()
    conn.close()

def scrape_jobs():
    """Fetch the listings page and extract one tuple per job posting."""
    response = requests.get(URL, timeout=30)
    response.raise_for_status()  # Stop early on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')

    jobs = []
    listings = soup.find_all('div', class_='job-listing')
    for listing in listings:
        title = listing.find('h2', class_='job-title').text.strip()
        company = listing.find('div', class_='company').text.strip()
        location = listing.find('div', class_='location').text.strip()
        date_posted = listing.find('div', class_='date-posted').text.strip()
        job_url = listing.find('a', class_='apply-link')['href']
        jobs.append((title, company, location, date_posted, job_url))
    return jobs

def save_jobs_to_db(jobs):
    """Insert the scraped jobs, skipping any URL already in the table."""
    conn = sqlite3.connect(DB_FILE)
    c = conn.cursor()
    c.executemany('''INSERT OR IGNORE INTO job_listings
                     (title, company, location, date_posted, url)
                     VALUES (?, ?, ?, ?, ?)''', jobs)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    create_database()
    while True:
        jobs = scrape_jobs()
        save_jobs_to_db(jobs)
        print(f"{len(jobs)} jobs scraped and saved to database.")
        time.sleep(3600)  # Wait for 1 hour before scraping again
How the Code Works
- URL: Replace "https://example.com/jobs/data-scientist" with the actual URL of the job listings page you want to scrape.
- create_database() Function: This function sets up a SQLite database and a table for storing job listings. If the database and table don’t exist, it creates them. The UNIQUE constraint on the url column prevents the same listing from being stored twice.
- scrape_jobs() Function: This function fetches the job listings page, parses the HTML, and extracts the job title, company, location, date posted, and the URL for applying. It returns a list of tuples containing this information. The class names it looks for (job-listing, job-title, and so on) must match the markup of the page you’re scraping; see the parsing sketch after this list.
- save_jobs_to_db() Function: This function takes the list of job listings and inserts them into the SQLite database. INSERT OR IGNORE skips any listing whose URL is already stored, so re-running the scraper doesn’t create duplicate rows.
- Main Execution: The script sets up the database, scrapes job listings, and saves them to the database. It repeats this process every hour, ensuring that the database is always updated with the latest job openings.
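Since example.com doesn’t serve real job listings, here is a small self-contained sketch of the HTML structure scrape_jobs() expects and how BeautifulSoup pulls the fields out of it. The markup is invented for illustration; a real job board will use different class names, which is why you need to inspect the page and adjust the selectors.

from bs4 import BeautifulSoup

# Invented markup mirroring the structure scrape_jobs() assumes
html = """
<div class="job-listing">
  <h2 class="job-title">Data Scientist</h2>
  <div class="company">Acme Corp</div>
  <div class="location">Remote</div>
  <div class="date-posted">2024-01-15</div>
  <a class="apply-link" href="https://example.com/jobs/123">Apply</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
listing = soup.find("div", class_="job-listing")
print(listing.find("h2", class_="job-title").text.strip())  # Data Scientist
print(listing.find("a", class_="apply-link")["href"])       # https://example.com/jobs/123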
Running the Script
To run the script, open your terminal, navigate to the directory where job_scraper.py
is located, and execute it:
python job_scraper.py
The script will now run continuously, scraping the job listings page every hour and storing the data in the SQLite database. This setup ensures that you always have the latest job listings available for analysis.
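Once the database has some rows in it, you can read them back with a few lines of Python. Here is a minimal sketch that prints the most recently stored listings; the column names match the table created above.

import sqlite3

conn = sqlite3.connect("jobs.db")
c = conn.cursor()

# Fetch the ten most recently inserted listings
c.execute("SELECT title, company, location FROM job_listings ORDER BY id DESC LIMIT 10")
for title, company, location in c.fetchall():
    print(f"{title} at {company} ({location})")

conn.close()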
Thank you for reading this article. We hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials/articles, please feel free to leave a comment. Your feedback and suggestions are always welcome!
You can find the same tutorial on Medium.com.