edX Institution Course Scraper
A Python web scraping tool that extracts course information from edX institution pages using Selenium WebDriver automation.
🚀 Project Overview
This scraper opens edX’s “Schools & Partners” page, discovers all listed educational institutions, and extracts the title of the first course offered by each organization. It includes error handling and supports dynamically loaded content.
✨ Key Features
- 🔄 Dynamic Content Scraping: Uses Selenium WebDriver to handle JavaScript-loaded content
- 🏫 Institution Discovery: Automatically extracts profile URLs for all edX schools and partners
- 📚 Course Identification: Navigates to each institution’s page and finds their first course
- 💾 Incremental CSV Output: Saves data progressively to prevent loss during long scraping sessions
- 🛡️ Robust Element Selection: Multiple CSS selectors and fallback mechanisms for reliability
- 🚀 Headless Operation: Runs in the background without a browser UI
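The fallback mechanism described under Robust Element Selection might look roughly like the sketch below. It is not taken from the actual script: the selector strings and the helper name `first_text` are hypothetical, and the real page markup may differ. The helper only relies on the `find_elements(by, value)` method and the `.text` attribute that Selenium drivers and elements expose.

```python
# "css selector" is the string value of Selenium's By.CSS_SELECTOR, so this
# helper works with a real WebDriver without importing Selenium here.
CSS = "css selector"

# Hypothetical selectors, ordered from most to least specific.
COURSE_TITLE_SELECTORS = [
    "div.course-card h3",
    "a[href*='/course/'] h3",
    "h3",
]


def first_text(driver, selectors=COURSE_TITLE_SELECTORS):
    """Return the text of the first non-empty match, trying selectors in order."""
    for selector in selectors:
        for element in driver.find_elements(CSS, selector):
            text = element.text.strip()
            if text:
                return text
    return None  # no selector matched; the caller decides how to record this
```

Trying the most specific selector first keeps results precise on the common layout, while the broader selectors act as a safety net when a page is structured differently.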
🛠️ Technology Stack
- Python 3.x - Core programming language
- Selenium WebDriver - Browser automation and dynamic content handling
- BeautifulSoup4 - HTML parsing and data extraction
- Pandas - CSV data manipulation and export
- ChromeDriver - Automated browser control
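As a rough illustration of how Selenium and ChromeDriver fit together for headless scraping, a driver could be created along these lines. This is a minimal sketch assuming Selenium 4 and a recent Chrome; the function name `make_headless_driver` and the window size are illustrative choices, not details from the project's script.

```python
def make_headless_driver():
    """Create a headless Chrome driver (Selenium 4 style)."""
    # Imported lazily so this module can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")           # no visible browser window
    options.add_argument("--window-size=1920,1080")  # assumed viewport size
    return webdriver.Chrome(options=options)
```

A caller would typically wrap usage in `try`/`finally` and call `driver.quit()` at the end so that no background Chrome processes are left running.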
📁 Project Files
- edx_course_scrapper.py - Main scraper implementation with comprehensive error handling
- README.md - Complete setup guide and usage documentation
- code-viewer.html - Interactive source code viewer with syntax highlighting
- Documentation - Detailed technical specifications and examples
Quick Start
Prerequisites
pip install selenium pandas requests beautifulsoup4
Download ChromeDriver
Visit ChromeDriver Downloads and install the version matching your Chrome browser.
Run the Scraper
python edx_course_scrapper.py
Sample Output
The script generates edx_institution_courses.csv:

| Institution | First Course Offered |
| --- | --- |
| Harvard University | CS50’s Introduction to Computer Science |
| MIT | Introduction to Computer Science and Programming in Python |
| Stanford University | Machine Learning |
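To work with the generated file afterwards, something like the following could be used. It is a small sketch that assumes the CSV has a header row with the two column names shown in the sample output; the helper name `load_courses` is hypothetical, and the standard-library `csv` module is used here instead of Pandas to keep the example dependency-free.

```python
import csv


def load_courses(path="edx_institution_courses.csv"):
    """Return a list of (institution, first_course) tuples from the CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # assumes the first row is the header
        return [
            (row["Institution"], row["First Course Offered"])
            for row in reader
        ]
```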
💻 Interactive Code Viewer
Explore the complete source code with syntax highlighting and easy copying.
⬇️ Download Options
- Download Python Script - Direct file download
- 📋 Copy from Viewer - Use the copy button in the code viewer above
- 📁 View on GitHub - Browse project repository
🎯 Use Cases
- Educational Research: Analyze course offerings across institutions
- 🔍 Market Analysis: Track trends in online education
- 🏫 Institutional Comparison: Compare course portfolios between universities
- 📊 Data Science Projects: Build educational datasets for analysis
🚀 Advanced Features
- Error Recovery: Continues scraping even if individual pages fail
- Rate Limiting: Respectful delays between requests
- Multiple Selectors: Handles different page layouts automatically
- Headless Mode: Efficient background operation
- Progress Tracking: Real-time status updates during scraping
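Error recovery, rate limiting, progress tracking, and incremental saving can be combined in a single loop. The sketch below shows one possible shape under stated assumptions: `scrape_all` and the `fetch_first_course` callback are hypothetical names, the two-second delay is an illustrative default, and the standard-library `csv` module stands in for whatever the actual script uses to write its output.

```python
import csv
import os
import time


def scrape_all(institutions, fetch_first_course, out_path, delay_seconds=2.0):
    """Scrape each (name, url) pair, appending one CSV row per institution.

    Returns the number of institutions whose page could not be scraped.
    """
    institutions = list(institutions)
    # Write the header only when starting a new (or empty) output file.
    write_header = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    failures = 0
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["Institution", "First Course Offered"])
        for index, (name, url) in enumerate(institutions, start=1):
            try:
                course = fetch_first_course(url) or ""
                status = "ok"
            except Exception as exc:  # keep going if a single page fails
                course = ""
                status = f"failed ({exc})"
                failures += 1
            writer.writerow([name, course])
            f.flush()  # persist progress so a crash loses at most one row
            print(f"[{index}/{len(institutions)}] {name}: {status}")
            time.sleep(delay_seconds)  # polite pause between requests
    return failures
```

Because rows are appended and flushed one at a time, an interrupted run still leaves a usable partial CSV on disk.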
📖 Need Help?
Check the detailed README for complete setup instructions, troubleshooting tips, and advanced configuration options.