# edX Institution Course Scraper
This Python script scrapes information from the edX website, focusing on the institutions listed on its "Schools & Partners" page and extracting the title of the first course each one offers. It uses Selenium to handle dynamically loaded content, ensuring that data populated by JavaScript is captured correctly.
## ✨ Features
- Dynamic Content Scraping: Uses Selenium WebDriver to interact with the webpage and retrieve data loaded asynchronously via JavaScript.
- Institution Link Extraction: Automatically fetches the profile URLs for all listed schools and partners from the main edX page.
- First Course Identification: Navigates to each institution's page and extracts the title of the first prominently displayed course.
- Incremental CSV Output: Appends scraped data to a CSV file after processing each organization, making the process robust against interruptions for large datasets.
- Robust Selectors: Employs a combination of CSS selectors and fallback mechanisms to reliably locate course titles, even with varying page structures.
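The fallback-selector idea above can be sketched with BeautifulSoup (which the install step below includes). The selector strings here are illustrative placeholders, not the exact ones the script uses against the live edX markup:

```python
from bs4 import BeautifulSoup

# Candidate selectors, tried in order; the class names are illustrative
# assumptions, not the real edX page structure.
FALLBACK_SELECTORS = ("h3.course-title", "div.course-card h3", "h3")

def first_course_title(html, selectors=FALLBACK_SELECTORS):
    """Return the first non-empty course title matched by any selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for css in selectors:
        el = soup.select_one(css)
        if el and el.get_text(strip=True):
            return el.get_text(strip=True)
    return None
```

Trying selectors from most to least specific means a minor layout change degrades matching gracefully instead of breaking it outright.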
## ⚙️ Prerequisites
Before running this script, ensure you have the following installed:
- Python 3.x
- pip (Python package installer)
- Google Chrome browser (or any other browser supported by Selenium, with its corresponding WebDriver)
- ChromeDriver (or the WebDriver for your chosen browser).
### ChromeDriver Setup
- Download ChromeDriver: Visit the ChromeDriver Downloads page.
- Match Chrome Version: Download the ChromeDriver version that matches your installed Google Chrome browser version. You can check your Chrome version by visiting `chrome://version` in your browser.
- Place ChromeDriver:
  - Recommended: Place the `chromedriver` executable in a directory that is included in your system's `PATH` environment variable (e.g., `/usr/local/bin` on macOS/Linux, or a directory added to `PATH` on Windows).
  - Alternative: If you don't want to modify your `PATH`, specify the full path to the `chromedriver` executable in the `initialize_driver` function within the script (uncomment and modify the `Service(executable_path='...')` line).
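A minimal sketch of how `initialize_driver` can support both options, assuming Selenium 4. The `shutil.which` fallback is an illustrative addition, not necessarily how the script resolves the path:

```python
import shutil

def resolve_chromedriver(executable_path=None):
    """Use an explicit chromedriver path if given, else look it up on PATH."""
    return executable_path or shutil.which("chromedriver")

def initialize_driver(executable_path=None):
    # Imported lazily so the path helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = resolve_chromedriver(executable_path)
    # With no path at all, recent Selenium falls back to its own driver management.
    service = Service(executable_path=path) if path else Service()
    return webdriver.Chrome(service=service)
```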
## 🚀 Installation
- Clone this repository (or copy the script content) to your local machine.
- Navigate to the project directory in your terminal.
- Install the required Python libraries:
```bash
pip install selenium pandas requests beautifulsoup4
```
## 🏃‍♀️ Usage
To run the scraper, simply execute the Python script from your terminal:
```bash
python your_script_name.py
```
(Replace `your_script_name.py` with the actual name you save the script as, e.g., `edx_scraper.py`.)
The script will:
- Print status messages to the console as it fetches links and processes each institution.
- Create (or append to) a CSV file named `edx_institution_courses.csv` in the same directory where the script is run.
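The append-as-you-go output can be sketched with the standard `csv` module; the helper name `append_result` is illustrative, not necessarily what the script calls it:

```python
import csv
import os

HEADER = ("Institution", "First Course Offered")

def append_result(path, institution, course):
    """Append one scraped row, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(HEADER)
        # An empty cell marks institutions where no course was found.
        writer.writerow([institution, course or ""])
```

Because each row is flushed to disk as soon as its institution is processed, an interrupted run loses at most the organization currently in flight.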
## 📊 Output
The `edx_institution_courses.csv` file will contain two columns:
- Institution: The name of the edX school or partner.
- First Course Offered: The title of the first course found on that institution's edX profile page. If no course is found, the cell will be empty.
Example `edx_institution_courses.csv` content:

```csv
Institution,First Course Offered
ACCA,Financial Accounting
Harvard University,CS50's Introduction to Computer Science
MIT,Introduction to Computer Science and Programming in Python
...
```
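Since pandas is already installed, the output file is easy to inspect; empty "First Course Offered" cells load as `NaN`, which makes institutions with no detected course simple to count. A small sketch:

```python
import io
import pandas as pd

def load_results(source):
    """Load the output CSV; empty 'First Course Offered' cells become NaN."""
    return pd.read_csv(source)

# Works the same with a file path or an in-memory buffer:
sample = io.StringIO(
    "Institution,First Course Offered\n"
    "ACCA,Financial Accounting\n"
    "No-Course U,\n"
)
df = load_results(sample)
missing = df["First Course Offered"].isna().sum()  # institutions with no course found
```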
## ⚠️ Important Notes
- Web Scraping Ethics: Be mindful of `robots.txt` and the terms of service of the website you are scraping. This script is provided for educational purposes and personal use; excessive or abusive scraping can lead to your IP being blocked.
- Website Changes: Websites frequently update their HTML structure. If edX changes its page layout or CSS class names, the selectors used in this script (the `By.CSS_SELECTOR` expressions) may need to be updated to continue functioning correctly.
- Error Handling: The script includes basic error handling for network issues and missing elements, but complex website behaviors might require further refinement.
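The `robots.txt` note above can be made concrete with the standard library's `robotparser`. This sketch checks an already-fetched `robots.txt` body (the rules shown in the test are made up, not edX's actual policy):

```python
from urllib import robotparser

def is_allowed(robots_txt, url, user_agent="*"):
    """Check a fetched robots.txt body before scraping a given URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```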
## 💡 Future Enhancements
- More Robust Error Logging: Implement a dedicated logging system to record successes, failures, and specific errors for easier debugging.
- Parallel Processing: For faster scraping of a large number of institutions, consider using `concurrent.futures` to process institution pages in parallel (with appropriate delays to avoid overwhelming the server).
- Proxy Support: Add proxy rotation to avoid IP blocking for extensive scraping.
- Configuration File: Externalize URLs, selectors, and other parameters into a configuration file (e.g., YAML or JSON) for easier management.
- Interactive Input: Allow users to specify the output file name or other parameters at runtime.
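The parallel-processing idea above might look like this with `concurrent.futures`. Here `scrape_one` stands in for whatever per-institution function the script uses, and the fixed sleep is only a crude politeness delay, not a real rate limiter:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_in_parallel(urls, scrape_one, max_workers=4, delay=1.0):
    """Run scrape_one over many URLs concurrently, with a per-task delay."""
    def throttled(url):
        time.sleep(delay)  # simple throttle so workers don't hammer the server
        return scrape_one(url)

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(throttled, u): u for u in urls}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

Note that a thread pool of Selenium drivers is memory-hungry (one browser per worker), so a small `max_workers` is usually the right choice.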
## ⬇️ Download the Code
You can download the Python scraper script from my GitHub repository:
Download edx_scraper.py