edX Institution Course Scraper - Web Scraping with Selenium

Overview

The edX Institution Course Scraper is a Python web scraping tool that extracts institution and course information from the edX platform. The script navigates edX’s “Schools & Partners” page, collects the profile link for every listed institution, and records the title of the first course each organization offers.

🚀 View Source Code - Explore the complete implementation with documentation!

✨ Key Features

The scraper combines the power of Selenium WebDriver with BeautifulSoup to handle dynamic content and extract valuable educational data:

  • Dynamic Content Scraping: Uses Selenium WebDriver to interact with JavaScript-loaded content that traditional HTTP requests cannot access
  • Institution Link Extraction: Automatically discovers and follows profile URLs for all schools and partners listed on edX
  • First Course Identification: Navigates to each institution’s page and extracts the title of their prominently displayed first course
  • Incremental CSV Output: Appends scraped data to a CSV file after processing each organization, providing resilience against interruptions
  • Robust Element Selection: Employs multiple CSS selectors and fallback mechanisms to reliably locate course titles across varying page structures
  • Headless Browser Operation: Runs efficiently in headless mode for automated data collection

🛠️ Technology Stack

  • Python 3.x: Core programming language
  • Selenium WebDriver: For handling dynamic web content and browser automation
  • BeautifulSoup4: For HTML parsing and data extraction
  • Pandas: For CSV data manipulation and output formatting
  • Requests: For HTTP request handling
  • Chrome/ChromeDriver: Browser automation engine

📋 Prerequisites

Before running the scraper, ensure you have the following components installed:

System Requirements

  • Python 3.x with pip package installer
  • Google Chrome browser (or alternative browser with Selenium support)
  • ChromeDriver matching your Chrome version

ChromeDriver Setup Guide

  1. Download ChromeDriver: Visit the ChromeDriver Downloads page
  2. Version Matching: Download ChromeDriver version matching your Chrome browser (check chrome://version)
  3. Installation Options:
    • Recommended: Place chromedriver in a PATH directory (e.g., /usr/local/bin on macOS/Linux)
    • Alternative: Specify the full executable path in the script’s initialize_driver function

Python Dependencies Installation

pip install selenium pandas requests beautifulsoup4

🚀 Usage Instructions

Running the Scraper

Execute the script from your terminal:

python edx_course_scrapper.py

Script Execution Flow

  1. Initialization: Sets up headless Chrome WebDriver with optimized options
  2. Institution Discovery: Navigates to edX Schools & Partners page and extracts all institution profile links
  3. Course Extraction: Visits each institution’s page and identifies their first offered course
  4. Data Storage: Incrementally saves results to edx_institution_courses.csv
  5. Progress Reporting: Provides real-time console updates on processing status
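The flow above can be sketched as a single loop. This is a runnable sketch, not the script itself: the two helper functions are stubs standing in for the Selenium-driven discovery and extraction steps, and their names are assumptions.

```python
import csv
import io

def get_institution_links():
    # Step 2 stub: (institution name, profile URL) pairs
    return [("ACCA", "https://www.edx.org/school/acca"),
            ("MIT", "https://www.edx.org/school/mitx")]

def get_first_course(url):
    # Step 3 stub: first course title for a given institution page
    courses = {"https://www.edx.org/school/acca": "Financial Accounting",
               "https://www.edx.org/school/mitx":
                   "Introduction to Computer Science and Programming in Python"}
    return courses[url]

def run_scraper(out):
    writer = csv.writer(out)
    writer.writerow(["Institution", "First Course Offered"])
    for name, url in get_institution_links():
        course = get_first_course(url)
        writer.writerow([name, course])     # Step 4: append one row per institution
        print(f"Scraped {name}: {course}")  # Step 5: progress report

run_scraper(io.StringIO())
```

Writing one row per institution (rather than collecting everything and saving at the end) is what gives the scraper its resilience: an interruption loses at most the institution currently being processed.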

📊 Output Format

The scraper generates a CSV file (edx_institution_courses.csv) with the following structure:

| Institution | First Course Offered |
| --- | --- |
| ACCA | Financial Accounting |
| Harvard University | CS50’s Introduction to Computer Science |
| MIT | Introduction to Computer Science and Programming in Python |

Sample Output Data

Institution,First Course Offered
ACCA,Financial Accounting
Harvard University,CS50's Introduction to Computer Science
MIT,Introduction to Computer Science and Programming in Python
Stanford University,Machine Learning
University of California Berkeley,Data Science
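Since the output is plain CSV, it loads directly into pandas for downstream analysis. The snippet below parses an inline sample matching the scraper’s schema; in practice you would point `read_csv` at `edx_institution_courses.csv`.

```python
from io import StringIO

import pandas as pd

# Sample rows matching the scraper's output schema.
sample = StringIO(
    "Institution,First Course Offered\n"
    "ACCA,Financial Accounting\n"
    "MIT,Introduction to Computer Science and Programming in Python\n"
)
# In practice: df = pd.read_csv('edx_institution_courses.csv')
df = pd.read_csv(sample)
print(df["Institution"].tolist())  # ['ACCA', 'MIT']
```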

🔧 Implementation Highlights

Dynamic Content Handling

The scraper uses Selenium WebDriver to handle JavaScript-rendered content that traditional web scraping tools cannot access:

from selenium import webdriver

def initialize_driver():
    """Create a headless Chrome driver with container-friendly options."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')           # no visible browser window
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')         # required in some sandboxed/container environments
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver
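Once the driver has rendered a page, its HTML can be handed to BeautifulSoup for link extraction. The snippet below parses a static fragment standing in for `driver.page_source`; the `/school/` anchor structure is an assumption about edX’s markup, not verified against the live site.

```python
from bs4 import BeautifulSoup

# Static fragment standing in for driver.page_source (illustrative markup).
html = '''
<div class="partners">
  <a href="/school/harvardx">Harvard University</a>
  <a href="/school/mitx">MIT</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# Collect (absolute URL, institution name) pairs for every school link.
links = [("https://www.edx.org" + a["href"], a.get_text(strip=True))
         for a in soup.select('a[href^="/school/"]')]
print(links[0])  # ('https://www.edx.org/school/harvardx', 'Harvard University')
```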

Robust Element Selection

Multiple CSS selectors ensure reliable course title extraction across different page layouts:

# Primary selectors for course titles
selectors = [
    'h3[data-testid="course-title-popover-trigger"]',
    '.course-title',
    'h3.course-name',
    '[class*="course"][class*="title"]'
]

Error Handling and Resilience

Comprehensive exception handling ensures the scraper continues operation even when individual pages fail:

for selector in selectors:
    try:
        course_title_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return course_title_element.text.strip()
    except (TimeoutException, NoSuchElementException):
        continue  # Fall back to the next selector

⚠️ Important Considerations

Web Scraping Ethics

  • Respect robots.txt: Always check and comply with website scraping policies
  • Rate Limiting: The script includes appropriate delays between requests to avoid overwhelming servers
  • Terms of Service: This tool is designed for educational and research purposes
  • Responsible Usage: Avoid excessive requests that could impact website performance
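A polite, randomized delay between page visits is simple to build in. This is a minimal sketch; the function name and the 2–5 second interval are illustrative choices, not taken from the script.

```python
import random
import time

def polite_get(driver, url, min_delay=2.0, max_delay=5.0):
    """Fetch a page, then pause a randomized interval so requests aren't bursty."""
    driver.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
```

Randomizing the pause (rather than sleeping a fixed interval) makes the request pattern less mechanical and spreads load more evenly across the run.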

Maintenance Requirements

  • Website Updates: HTML structure changes may require selector updates
  • ChromeDriver Compatibility: Keep ChromeDriver updated with Chrome browser versions
  • Dependency Management: Regularly update Python packages for security and compatibility

💡 Future Enhancement Opportunities

Advanced Features

  • Enhanced Logging: Implement comprehensive logging system for debugging and monitoring
  • Parallel Processing: Add concurrent processing for faster large-scale scraping
  • Proxy Rotation: Include proxy support for extensive data collection
  • Configuration Management: Externalize selectors and parameters to configuration files
  • Interactive Interface: Add command-line interface for user-specified parameters

Data Enhancement

  • Course Details: Extract additional course metadata (duration, difficulty, enrollment)
  • Institution Analytics: Collect institution statistics and course counts
  • Historical Tracking: Implement periodic scraping for trend analysis
  • Export Formats: Support multiple output formats (JSON, Excel, XML)

🔗 Source Code Access

Interactive Code Viewer

Explore the complete source code with syntax highlighting and documentation:

📄 edx_course_scrapper.py

Complete Python scraper implementation with documentation


🎯 Use Cases

Educational Research

  • Course Catalog Analysis: Study course offerings across different institutions
  • Educational Trend Tracking: Monitor changes in course availability over time
  • Institutional Comparison: Compare course portfolios between universities

Data Science Projects

  • Educational Data Mining: Extract patterns from online education platforms
  • Market Research: Analyze online education landscape and trends
  • Academic Analytics: Study relationships between institutions and course offerings

Automation and Monitoring

  • Course Availability Alerts: Monitor new course launches from preferred institutions
  • Educational Content Aggregation: Build comprehensive educational resource databases
  • Competitive Analysis: Track course offerings for educational platform comparison

🚀 Get Started

Ready to explore educational data extraction? View the complete source code and start building your own educational data analysis tools!

This project demonstrates practical web scraping techniques, dynamic content handling, and automated data extraction - perfect for developers interested in educational technology and data science applications.

This post is licensed under CC BY 4.0 by the author.