edX Institution Course Scraper - Web Scraping with Selenium
Overview
The edX Institution Course Scraper is a Python web scraping tool designed to extract comprehensive information from the edX platform. This script automatically navigates through edX’s “Schools & Partners” page, identifies all listed institutions, and extracts the title of the first course offered by each organization.
🚀 View Source Code - Explore the complete implementation with documentation!
✨ Key Features
The scraper combines the power of Selenium WebDriver with BeautifulSoup to handle dynamic content and extract valuable educational data:
- Dynamic Content Scraping: Uses Selenium WebDriver to interact with JavaScript-loaded content that traditional HTTP requests cannot access
- Institution Link Extraction: Automatically discovers and follows profile URLs for all schools and partners listed on edX
- First Course Identification: Navigates to each institution’s page and extracts the title of their prominently displayed first course
- Incremental CSV Output: Appends scraped data to a CSV file after processing each organization, providing resilience against interruptions
- Robust Element Selection: Employs multiple CSS selectors and fallback mechanisms to reliably locate course titles across varying page structures
- Headless Browser Operation: Runs efficiently in headless mode for automated data collection
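The incremental CSV output described above can be sketched with pandas' append mode. The column names match the output format documented below; the helper name is illustrative, not the script's actual function:

```python
import os

import pandas as pd


def append_result(csv_path, institution, course_title):
    """Append one scraped row, writing the header only if the file is new."""
    row = pd.DataFrame([{"Institution": institution,
                         "First Course Offered": course_title}])
    row.to_csv(csv_path, mode="a",
               header=not os.path.exists(csv_path), index=False)
```

Because each row is flushed to disk as soon as it is scraped, an interrupted run loses at most the institution currently being processed.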
🛠️ Technology Stack
- Python 3.x: Core programming language
- Selenium WebDriver: For handling dynamic web content and browser automation
- BeautifulSoup4: For HTML parsing and data extraction
- Pandas: For CSV data manipulation and output formatting
- Requests: For HTTP request handling
- Chrome/ChromeDriver: Browser automation engine
📋 Prerequisites
Before running the scraper, ensure you have the following components installed:
System Requirements
- Python 3.x with pip package installer
- Google Chrome browser (or alternative browser with Selenium support)
- ChromeDriver matching your Chrome version
ChromeDriver Setup Guide
- Download ChromeDriver: Visit the ChromeDriver Downloads page
- Version Matching: Download the ChromeDriver version that matches your Chrome browser (check chrome://version)
- Installation Options:
  - Recommended: Place chromedriver in a directory on your PATH (e.g., /usr/local/bin on macOS/Linux)
  - Alternative: Specify the full executable path in the script's initialize_driver function
Python Dependencies Installation
```bash
pip install selenium pandas requests beautifulsoup4
```
🚀 Usage Instructions
Running the Scraper
Execute the script from your terminal:
```bash
python edx_course_scrapper.py
```
Script Execution Flow
- Initialization: Sets up headless Chrome WebDriver with optimized options
- Institution Discovery: Navigates to edX Schools & Partners page and extracts all institution profile links
- Course Extraction: Visits each institution’s page and identifies their first offered course
- Data Storage: Incrementally saves results to edx_institution_courses.csv
- Progress Reporting: Provides real-time console updates on processing status
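Step 2 — turning the rendered page into a list of institution profile links — can be sketched by handing Selenium's page_source to BeautifulSoup. The /school/ URL prefix and the selector here are assumptions for illustration, not necessarily edX's current markup:

```python
from bs4 import BeautifulSoup


def extract_institution_links(page_source, base_url="https://www.edx.org"):
    """Return absolute, de-duplicated profile URLs found in the rendered HTML."""
    soup = BeautifulSoup(page_source, "html.parser")
    links = []
    for anchor in soup.select('a[href^="/school/"]'):
        url = base_url + anchor["href"]
        if url not in links:  # preserve page order, drop duplicates
            links.append(url)
    return links
```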
📊 Output Format
The scraper generates a CSV file (edx_institution_courses.csv) with the following structure:
| Institution | First Course Offered |
| --- | --- |
| ACCA | Financial Accounting |
| Harvard University | CS50’s Introduction to Computer Science |
| MIT | Introduction to Computer Science and Programming in Python |
Sample Output Data
```csv
Institution,First Course Offered
ACCA,Financial Accounting
Harvard University,CS50's Introduction to Computer Science
MIT,Introduction to Computer Science and Programming in Python
Stanford University,Machine Learning
University of California Berkeley,Data Science
```
🔧 Implementation Highlights
Dynamic Content Handling
The scraper uses Selenium WebDriver to handle JavaScript-rendered content that traditional web scraping tools cannot access:
```python
from selenium import webdriver

def initialize_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome(options=options)
    return driver
```
Robust Element Selection
Multiple CSS selectors ensure reliable course title extraction across different page layouts:
```python
# Primary selectors for course titles, tried in order
selectors = [
    'h3[data-testid="course-title-popover-trigger"]',
    '.course-title',
    'h3.course-name',
    '[class*="course"][class*="title"]',
]
```
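The last selector in the list is the broadest fallback: it matches any element whose class attribute contains both the substrings "course" and "title". A quick illustration with BeautifulSoup (the sample markup is invented for the demo):

```python
from bs4 import BeautifulSoup

html = '<h3 class="course-card-title">Introduction to Python</h3>'
soup = BeautifulSoup(html, "html.parser")
# CSS [attr*=value] is a substring match on the whole attribute value,
# so "course-card-title" satisfies both [class*="course"] and [class*="title"]
element = soup.select_one('[class*="course"][class*="title"]')
```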
Error Handling and Resilience
Comprehensive exception handling ensures the scraper continues operation even when individual pages fail:
```python
# Inside the course-title extraction function:
for selector in selectors:
    try:
        course_title_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return course_title_element.text.strip()
    except (TimeoutException, NoSuchElementException):
        continue  # Try the next selector
```
⚠️ Important Considerations
Web Scraping Ethics
- Respect robots.txt: Always check and comply with website scraping policies
- Rate Limiting: The script includes appropriate delays between requests to avoid overwhelming servers
- Terms of Service: This tool is designed for educational and research purposes
- Responsible Usage: Avoid excessive requests that could impact website performance
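Checking robots.txt can itself be automated with the standard library's urllib.robotparser — a minimal sketch, where the parsed rules are a made-up example rather than edX's actual policy:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, path):
    """Return True if `path` may be fetched under the given robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

In a live run this would be fed the body of the site's robots.txt before any scraping begins.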
Maintenance Requirements
- Website Updates: HTML structure changes may require selector updates
- ChromeDriver Compatibility: Keep ChromeDriver updated with Chrome browser versions
- Dependency Management: Regularly update Python packages for security and compatibility
💡 Future Enhancement Opportunities
Advanced Features
- Enhanced Logging: Implement comprehensive logging system for debugging and monitoring
- Parallel Processing: Add concurrent processing for faster large-scale scraping
- Proxy Rotation: Include proxy support for extensive data collection
- Configuration Management: Externalize selectors and parameters to configuration files
- Interactive Interface: Add command-line interface for user-specified parameters
Data Enhancement
- Course Details: Extract additional course metadata (duration, difficulty, enrollment)
- Institution Analytics: Collect institution statistics and course counts
- Historical Tracking: Implement periodic scraping for trend analysis
- Export Formats: Support multiple output formats (JSON, Excel, XML)
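The JSON export enhancement, for instance, is nearly a one-liner with pandas (file names here are illustrative):

```python
import pandas as pd


def export_json(csv_path, json_path):
    """Re-export the scraped CSV as a JSON array of records."""
    pd.read_csv(csv_path).to_json(json_path, orient="records", indent=2)
```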
🔗 Source Code Access
The complete scraper implementation, with inline documentation, is contained in a single script: edx_course_scrapper.py.
🎯 Use Cases
Educational Research
- Course Catalog Analysis: Study course offerings across different institutions
- Educational Trend Tracking: Monitor changes in course availability over time
- Institutional Comparison: Compare course portfolios between universities
Data Science Projects
- Educational Data Mining: Extract patterns from online education platforms
- Market Research: Analyze online education landscape and trends
- Academic Analytics: Study relationships between institutions and course offerings
Automation and Monitoring
- Course Availability Alerts: Monitor new course launches from preferred institutions
- Educational Content Aggregation: Build comprehensive educational resource databases
- Competitive Analysis: Track course offerings for educational platform comparison
🚀 Get Started
Ready to explore educational data extraction? View the complete source code and start building your own educational data analysis tools!
This project demonstrates practical web scraping techniques, dynamic content handling, and automated data extraction - perfect for developers interested in educational technology and data science applications.