edX Institution Course Scraper

A Python web scraping tool that extracts course information from edX institution pages using Selenium WebDriver automation.

🚀 Project Overview

The scraper opens edX’s “Schools & Partners” page, collects the profile URL of every listed institution, and extracts the title of the first course shown on each institution’s page. It handles JavaScript-rendered content and keeps running when an individual page fails.
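
The discovery step can be sketched as follows. This is a minimal illustration, not the code in edx_course_scrapper.py: it uses the standard library's html.parser in place of BeautifulSoup so it runs without extra packages, and the names BASE_URL, PROFILE_PREFIX, ProfileLinkParser and extract_profile_urls, as well as the "/school/" path prefix, are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.edx.org"  # assumed site root
PROFILE_PREFIX = "/school/"       # assumed path prefix of institution profile links


class ProfileLinkParser(HTMLParser):
    """Collects unique absolute URLs of <a> tags whose href starts with PROFILE_PREFIX."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        if href.startswith(PROFILE_PREFIX):
            url = urljoin(BASE_URL, href)
            if url not in self.urls:  # keep first occurrence, preserve page order
                self.urls.append(url)


def extract_profile_urls(page_html):
    """Return institution profile URLs found in rendered HTML, in page order."""
    parser = ProfileLinkParser()
    parser.feed(page_html)
    return parser.urls
```

With Selenium, the rendered HTML would typically come from `driver.page_source` once the page has finished loading.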

✨ Key Features

  • 🔄 Dynamic Content Scraping: Uses Selenium WebDriver to handle JavaScript-loaded content
  • 🏫 Institution Discovery: Automatically extracts profile URLs for all edX schools and partners
  • 📚 Course Identification: Navigates to each institution’s page and finds their first course
  • 💾 Incremental CSV Output: Saves data progressively to prevent loss during long scraping sessions
  • 🛡️ Robust Element Selection: Multiple CSS selectors and fallback mechanisms for reliability
  • 🚀 Headless Operation: Runs efficiently in background without browser UI
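
Two of the features above, headless operation and fallback selectors, might look roughly like this. It is a sketch under assumptions: the selectors in COURSE_TITLE_SELECTORS are hypothetical placeholders (the real page structure may differ), and build_headless_driver and first_nonempty_text are illustrative names, not functions taken from the project.

```python
# Hypothetical CSS selectors, ordered from most to least specific.
COURSE_TITLE_SELECTORS = [
    "div.course-card h3",
    "a[href*='/course/'] h3",
    "a[href*='/learn/']",
]


def build_headless_driver():
    """Start Chrome without a visible window (needs Selenium 4 and a recent Chrome)."""
    # Imported here so the selector helper below can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")          # headless mode flag of recent Chrome versions
    options.add_argument("--window-size=1920,1080")  # give the page a desktop-sized layout
    return webdriver.Chrome(options=options)


def first_nonempty_text(find_all, selectors):
    """Try each selector in order and return the first non-blank element text.

    find_all is any callable that maps a CSS selector to a list of objects
    with a .text attribute. Returns None when no selector yields text.
    """
    for css in selectors:
        for element in find_all(css):
            text = (element.text or "").strip()
            if text:
                return text
    return None
```

With a real driver, `find_all` could be `lambda css: driver.find_elements(By.CSS_SELECTOR, css)`, with `By` imported from `selenium.webdriver.common.by`. Keeping the lookup behind a callable also lets the fallback logic be checked without launching a browser.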

🛠️ Technology Stack

  • Python 3.x - Core programming language
  • Selenium WebDriver - Browser automation and dynamic content handling
  • BeautifulSoup4 - HTML parsing and data extraction
  • Pandas - CSV data manipulation and export
  • ChromeDriver - Automated browser control

📁 Project Files

  • edx_course_scrapper.py - Main scraper implementation with comprehensive error handling
  • README.md - Complete setup guide and usage documentation
  • code-viewer.html - Interactive source code viewer with syntax highlighting
  • Documentation - Detailed technical specifications and examples

Quick Start

Prerequisites

pip install selenium pandas requests beautifulsoup4

Download ChromeDriver

Visit the ChromeDriver Downloads page and install the version that matches your installed Chrome browser.

Run the Scraper

python edx_course_scrapper.py

Sample Output

The script generates edx_institution_courses.csv:

| Institution | First Course Offered |
|---|---|
| Harvard University | CS50’s Introduction to Computer Science |
| MIT | Introduction to Computer Science and Programming in Python |
| Stanford University | Machine Learning |
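
One way to produce this file incrementally, so that rows already scraped survive a crash, is to append each result as soon as it is available. The sketch below uses the standard library's csv module rather than Pandas to stay dependency-free. The helper name append_row is an assumption for illustration and is not taken from the script.

```python
import csv
import os

OUTPUT_FILE = "edx_institution_courses.csv"
FIELDNAMES = ["Institution", "First Course Offered"]


def append_row(institution, course, path=OUTPUT_FILE):
    """Append one result row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        writer.writerow({"Institution": institution, "First Course Offered": course})
```

Because the file is opened and closed on every call, each row is on disk before the next institution page is requested.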

💻 Interactive Code Viewer

Explore the complete source code with syntax highlighting and easy copying:

📄 edx_course_scrapper.py

259 lines • Complete Python implementation with documentation

🎯 Use Cases

  • Educational Research: Analyze course offerings across institutions
  • 🔍 Market Analysis: Track trends in online education
  • 🏫 Institutional Comparison: Compare course portfolios between universities
  • 📊 Data Science Projects: Build educational datasets for analysis

🚀 Advanced Features

  • Error Recovery: Continues scraping even if individual pages fail
  • Rate Limiting: Respectful delays between requests
  • Multiple Selectors: Handles different page layouts automatically
  • Headless Mode: Efficient background operation
  • Progress Tracking: Real-time status updates during scraping
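
Error recovery, rate limiting and progress tracking can be combined in a single loop, roughly as sketched here. The function scrape_all, its parameters and the two-second default delay are assumptions for illustration. The actual script may structure this differently.

```python
import time


def scrape_all(urls, fetch_title, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch_title(url) for every URL and return a list of (url, title) pairs.

    A failing page is recorded with title None and the loop moves on.
    The sleep function is injectable so the delay can be skipped in tests.
    """
    results = []
    total = len(urls)
    for index, url in enumerate(urls, start=1):
        try:
            title = fetch_title(url)
            status = "ok"
        except Exception as exc:  # one bad page should not end the whole run
            title = None
            status = f"failed: {exc}"
        results.append((url, title))
        print(f"[{index}/{total}] {url} -> {status}")  # progress update
        if index < total:
            sleep(delay_seconds)  # pause between requests, not after the last one
    return results
```

Catching the broad `Exception` is a deliberate choice for a long-running batch job. A stricter version could catch only Selenium's timeout and missing-element errors and let anything else propagate.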

📖 Need Help?

Check the detailed README for complete setup instructions, troubleshooting tips, and advanced configuration options.