edX Institution Course Scraper - Web Scraping with Selenium

Overview

The edX Institution Course Scraper is a Python web scraping tool that extracts institution and course information from the edX platform. The script navigates edX’s “Schools & Partners” page, collects the profile link for every listed institution, and records the title of the first course each organization offers.

🚀 View Source Code - Explore the complete implementation with documentation!

✨ Key Features

The scraper combines the power of Selenium WebDriver with BeautifulSoup to handle dynamic content and extract valuable educational data:

  • Dynamic Content Scraping: Uses Selenium WebDriver to interact with JavaScript-loaded content that traditional HTTP requests cannot access
  • Institution Link Extraction: Automatically discovers and follows profile URLs for all schools and partners listed on edX
  • First Course Identification: Navigates to each institution’s page and extracts the title of their prominently displayed first course
  • Incremental CSV Output: Appends scraped data to a CSV file after processing each organization, providing resilience against interruptions
  • Robust Element Selection: Employs multiple CSS selectors and fallback mechanisms to reliably locate course titles across varying page structures
  • Headless Browser Operation: Runs efficiently in headless mode for automated data collection

🛠️ Technology Stack

  • Python 3.x: Core programming language
  • Selenium WebDriver: For handling dynamic web content and browser automation
  • BeautifulSoup4: For HTML parsing and data extraction
  • Pandas: For CSV data manipulation and output formatting
  • Requests: For HTTP request handling
  • Chrome/ChromeDriver: Browser automation engine

📋 Prerequisites

Before running the scraper, ensure you have the following components installed:

System Requirements

  • Python 3.x with pip package installer
  • Google Chrome browser (or alternative browser with Selenium support)
  • ChromeDriver matching your Chrome version

ChromeDriver Setup Guide

  1. Download ChromeDriver: Visit the ChromeDriver Downloads page
  2. Version Matching: Download ChromeDriver version matching your Chrome browser (check chrome://version)
  3. Installation Options:
    • Recommended: Place chromedriver in a PATH directory (e.g., /usr/local/bin on macOS/Linux)
    • Alternative: Specify the full executable path in the script’s initialize_driver function

Python Dependencies Installation

pip install selenium pandas requests beautifulsoup4

🚀 Usage Instructions

Running the Scraper

Execute the script from your terminal:

python edx_course_scrapper.py

Script Execution Flow

  1. Initialization: Sets up headless Chrome WebDriver with optimized options
  2. Institution Discovery: Navigates to edX Schools & Partners page and extracts all institution profile links
  3. Course Extraction: Visits each institution’s page and identifies their first offered course
  4. Data Storage: Incrementally saves results to edx_institution_courses.csv
  5. Progress Reporting: Provides real-time console updates on processing status
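The flow above can be sketched as a single loop. This is a runnable sketch, not the script itself: the two helper functions are stubs standing in for the Selenium-driven discovery and extraction steps, and their names are assumptions.

```python
import csv
import io

def get_institution_links():
    # Step 2 stub: (institution name, profile URL) pairs
    return [("ACCA", "https://www.edx.org/school/acca"),
            ("MIT", "https://www.edx.org/school/mitx")]

def get_first_course(url):
    # Step 3 stub: first course title for a given institution page
    courses = {"https://www.edx.org/school/acca": "Financial Accounting",
               "https://www.edx.org/school/mitx":
                   "Introduction to Computer Science and Programming in Python"}
    return courses[url]

def run_scraper(out):
    writer = csv.writer(out)
    writer.writerow(["Institution", "First Course Offered"])
    for name, url in get_institution_links():
        course = get_first_course(url)
        writer.writerow([name, course])     # Step 4: append one row per institution
        print(f"Scraped {name}: {course}")  # Step 5: progress report

run_scraper(io.StringIO())
```

Writing one row per institution (rather than collecting everything and saving at the end) is what gives the scraper its resilience: an interruption loses at most the institution currently being processed.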

📊 Output Format

The scraper generates a CSV file (edx_institution_courses.csv) with the following structure:

| Institution | First Course Offered |
| --- | --- |
| ACCA | Financial Accounting |
| Harvard University | CS50’s Introduction to Computer Science |
| MIT | Introduction to Computer Science and Programming in Python |

Sample Output Data

Institution,First Course Offered
ACCA,Financial Accounting
Harvard University,CS50's Introduction to Computer Science
MIT,Introduction to Computer Science and Programming in Python
Stanford University,Machine Learning
University of California Berkeley,Data Science
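Since the output is plain CSV, it loads directly into pandas for downstream analysis. The snippet below parses an inline sample matching the scraper’s schema; in practice you would point `read_csv` at `edx_institution_courses.csv`.

```python
from io import StringIO

import pandas as pd

# Sample rows matching the scraper's output schema.
sample = StringIO(
    "Institution,First Course Offered\n"
    "ACCA,Financial Accounting\n"
    "MIT,Introduction to Computer Science and Programming in Python\n"
)
# In practice: df = pd.read_csv('edx_institution_courses.csv')
df = pd.read_csv(sample)
print(df["Institution"].tolist())  # ['ACCA', 'MIT']
```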

🔧 Implementation Highlights

Dynamic Content Handling

The scraper uses Selenium WebDriver to handle JavaScript-rendered content that traditional web scraping tools cannot access:

from selenium import webdriver

def initialize_driver():
    """Create a headless Chrome driver with container-friendly options."""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')           # no visible browser window
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')         # required in some sandboxed/container environments
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver
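Once the driver has rendered a page, its HTML can be handed to BeautifulSoup for link extraction. The snippet below parses a static fragment standing in for `driver.page_source`; the `/school/` anchor structure is an assumption about edX’s markup, not verified against the live site.

```python
from bs4 import BeautifulSoup

# Static fragment standing in for driver.page_source (illustrative markup).
html = '''
<div class="partners">
  <a href="/school/harvardx">Harvard University</a>
  <a href="/school/mitx">MIT</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
# Collect (absolute URL, institution name) pairs for every school link.
links = [("https://www.edx.org" + a["href"], a.get_text(strip=True))
         for a in soup.select('a[href^="/school/"]')]
print(links[0])  # ('https://www.edx.org/school/harvardx', 'Harvard University')
```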

Robust Element Selection

Multiple CSS selectors ensure reliable course title extraction across different page layouts:

# Primary selectors for course titles
selectors = [
    'h3[data-testid="course-title-popover-trigger"]',
    '.course-title',
    'h3.course-name',
    '[class*="course"][class*="title"]'
]

Error Handling and Resilience

Comprehensive exception handling ensures the scraper continues operation even when individual pages fail:

for selector in selectors:
    try:
        course_title_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return course_title_element.text.strip()
    except (TimeoutException, NoSuchElementException):
        continue  # Fall back to the next selector

⚠️ Important Considerations

Web Scraping Ethics

  • Respect robots.txt: Always check and comply with website scraping policies
  • Rate Limiting: The script includes appropriate delays between requests to avoid overwhelming servers
  • Terms of Service: This tool is designed for educational and research purposes
  • Responsible Usage: Avoid excessive requests that could impact website performance
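A polite, randomized delay between page visits is simple to build in. This is a minimal sketch; the function name and the 2–5 second interval are illustrative choices, not taken from the script.

```python
import random
import time

def polite_get(driver, url, min_delay=2.0, max_delay=5.0):
    """Fetch a page, then pause a randomized interval so requests aren't bursty."""
    driver.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
```

Randomizing the pause (rather than sleeping a fixed interval) makes the request pattern less mechanical and spreads load more evenly across the run.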

Maintenance Requirements

  • Website Updates: HTML structure changes may require selector updates
  • ChromeDriver Compatibility: Keep ChromeDriver updated with Chrome browser versions
  • Dependency Management: Regularly update Python packages for security and compatibility

💡 Future Enhancement Opportunities

Advanced Features

  • Enhanced Logging: Implement comprehensive logging system for debugging and monitoring
  • Parallel Processing: Add concurrent processing for faster large-scale scraping
  • Proxy Rotation: Include proxy support for extensive data collection
  • Configuration Management: Externalize selectors and parameters to configuration files
  • Interactive Interface: Add command-line interface for user-specified parameters

Data Enhancement

  • Course Details: Extract additional course metadata (duration, difficulty, enrollment)
  • Institution Analytics: Collect institution statistics and course counts
  • Historical Tracking: Implement periodic scraping for trend analysis
  • Export Formats: Support multiple output formats (JSON, Excel, XML)

🔗 Source Code Access

Interactive Code Viewer

Explore the complete source code with syntax highlighting and documentation:

📄 edx_course_scrapper.py

Complete Python scraper implementation with documentation


🎯 Use Cases

Educational Research

  • Course Catalog Analysis: Study course offerings across different institutions
  • Educational Trend Tracking: Monitor changes in course availability over time
  • Institutional Comparison: Compare course portfolios between universities

Data Science Projects

  • Educational Data Mining: Extract patterns from online education platforms
  • Market Research: Analyze online education landscape and trends
  • Academic Analytics: Study relationships between institutions and course offerings

Automation and Monitoring

  • Course Availability Alerts: Monitor new course launches from preferred institutions
  • Educational Content Aggregation: Build comprehensive educational resource databases
  • Competitive Analysis: Track course offerings for educational platform comparison

🚀 Get Started

Ready to explore educational data extraction? View the complete source code and start building your own educational data analysis tools!

This project demonstrates practical web scraping techniques, dynamic content handling, and automated data extraction - perfect for developers interested in educational technology and data science applications.

This post is licensed under CC BY 4.0 by the author.