
Web Scraping with Puppeteer
A Node.js script that scrapes course data from websites using Puppeteer and saves it to JSON format. Demonstrates headless browser automation for data extraction.
Timeline
1 week
Role
Backend Developer
Team
Solo
Status
Completed
Technology Stack
- Node.js
- Puppeteer
- Chromium
- JavaScript (ES6+)
Key Challenges
- Puppeteer Browser Automation
- DOM Element Selection
- Handling Dynamic Content
- Data Extraction and Parsing
- JSON File Generation
- Error Handling for Network Issues
Key Learnings
- Puppeteer API and Methods
- Web Scraping Techniques
- Headless Browser Automation
- CSS Selector Querying
- Node.js File System Operations
- Chromium DevTools Protocol
- Async/Await Patterns
Overview
A Node.js web scraping script that demonstrates browser automation using Puppeteer. The script launches a headless Chrome browser, navigates to target websites, extracts structured data using CSS selectors, and saves the results to JSON format. This project showcases practical web scraping techniques for data collection and automation.
Key Features
- Headless Browser: Launches Chrome/Chromium in the background without a GUI
- Automated Navigation: Programmatically visits target websites
- Data Extraction: Scrapes specific data using CSS selectors
- Structured Output: Saves extracted data to a courses.json file
- $$eval Method: Uses Puppeteer's powerful query methods for bulk extraction
- Console Output: Displays scraped data in terminal for verification
- Error Handling: Gracefully handles navigation and scraping errors
What This Project Does
Scraping Process
- Browser Launch: Opens a headless Chromium instance in the background
- Page Navigation: Navigates to the target website URL
- Data Extraction: Queries DOM elements using CSS selectors
- Data Parsing: Extracts text content, attributes, and URLs from elements
- JSON Generation: Structures data into JSON format
- File Writing: Saves the JSON data to courses.json
Extracted Data Structure
```js
{
  title: "Course Title",   // from .card-body h3
  level: "Beginner",       // from .card-body .level
  url: "https://...",      // from .card-footer a
  promo: "PROMO123"        // from .card-footer .promo-code .promo
}
```
Tech Stack
- Runtime: Node.js
- Browser Automation: Puppeteer
- Browser: Chromium (bundled with Puppeteer)
- File System: Node.js fs module
- Language: JavaScript (ES6+)
What is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API to control Chrome/Chromium over the DevTools Protocol. It runs in headless mode by default but can be configured for full (non-headless) mode.
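As a small illustration (these launch options are standard Puppeteer options, not settings taken from this project), switching between headless and headed mode is a single flag:

```js
const puppeteer = require('puppeteer');

(async () => {
  // Headless by default; set headless: false to watch the browser work.
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```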
Common Use Cases
- Generate screenshots and PDFs of pages
- Crawl SPAs and generate pre-rendered content (SSR)
- Automate form submission and UI testing
- Create automated testing environments
- Test Chrome Extensions
- Capture performance traces
- Web scraping and data extraction
Technical Highlights
Puppeteer Integration
Configured Puppeteer to launch headless Chrome browser with proper page navigation, waiting strategies, and graceful shutdown.
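A minimal sketch of that flow, assuming a generic target URL; the waitUntil value and try/finally cleanup are illustrative choices rather than the project's exact settings:

```js
const puppeteer = require('puppeteer');

async function scrape(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so dynamically loaded content is present.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // ... query the DOM here ...
  } finally {
    // Always release the Chromium process, even if scraping throws.
    await browser.close();
  }
}
```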
DOM Querying
Utilized $$eval() method for bulk element extraction, efficiently querying multiple elements and extracting data in a single operation.
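For example, $$eval runs its callback inside the page against every element matching the selector and returns plain data to Node in one call (the selector below is illustrative, and page comes from the launch step):

```js
// Collect the href of every link on the page in a single round trip.
const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
console.log(links);
```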
Data Extraction
Implemented CSS selector-based data extraction to target specific HTML elements and extract text content, attributes, and href values.
JSON File Generation
Used Node.js file system module to write structured data to JSON file with proper formatting and error handling.
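A sketch of that step using the promise-based fs API; the callback or synchronous variants would work just as well:

```js
const fs = require('fs/promises');

async function saveJson(data) {
  try {
    // Two-space indentation keeps the output human-readable.
    await fs.writeFile('courses.json', JSON.stringify(data, null, 2));
    console.log(`Saved ${data.length} records to courses.json`);
  } catch (err) {
    console.error('Failed to write courses.json:', err);
  }
}
```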
Async/Await Pattern
Leveraged async/await syntax for clean, readable asynchronous code handling browser operations and file I/O.
How It Works
1. Browser Launch
```js
const browser = await puppeteer.launch();
const page = await browser.newPage();
```
2. Navigation
```js
await page.goto('https://example.com');
```
3. Data Extraction
```js
const data = await page.$$eval('#courses .card', (cards) => {
  return cards.map((card) => ({
    title: card.querySelector('.card-body h3').textContent,
    level: card.querySelector('.card-body .level').textContent,
    // ... more fields
  }));
});
```
4. Save to JSON
```js
fs.writeFile('courses.json', JSON.stringify(data, null, 2), (err) => {
  if (err) throw err;
});
```
Installation & Usage
```bash
# Install dependencies (includes Chromium)
npm install

# Run the scraper
npm start
```
The script prints the scraped data to the console and saves it to courses.json.
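Putting the steps together, the whole script is roughly the following sketch; the target URL is a placeholder and the selectors mirror the data structure described above rather than the project's exact source:

```js
const puppeteer = require('puppeteer');
const fs = require('fs/promises');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/courses', { waitUntil: 'networkidle2' });

    // Extract one object per course card in a single page evaluation.
    const courses = await page.$$eval('#courses .card', (cards) =>
      cards.map((card) => ({
        title: card.querySelector('.card-body h3')?.textContent.trim(),
        level: card.querySelector('.card-body .level')?.textContent.trim(),
        url: card.querySelector('.card-footer a')?.href,
        promo: card.querySelector('.card-footer .promo-code .promo')?.textContent.trim(),
      }))
    );

    console.log(courses);
    await fs.writeFile('courses.json', JSON.stringify(courses, null, 2));
  } finally {
    await browser.close();
  }
})();
```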
Challenges & Solutions
Puppeteer Browser Automation
Learned Puppeteer API including browser launching, page navigation, waiting for selectors, and proper resource cleanup with browser.close().
DOM Element Selection
Mastered CSS selector syntax and Puppeteer's query methods ($, $$, $eval, $$eval) for efficient element targeting and data extraction.
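As a quick contrast between the two eval helpers (the selectors are hypothetical, and page comes from the launch step): $eval works on the first match, while $$eval receives all matches as an array.

```js
// First matching element only: read the page heading's text.
const heading = await page.$eval('h1', (el) => el.textContent);

// All matching elements: read the text of every course level badge.
const levels = await page.$$eval('.card-body .level', (els) =>
  els.map((el) => el.textContent.trim())
);
```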
Handling Dynamic Content
Implemented proper waiting strategies using waitForSelector() and waitForNavigation() to ensure content is loaded before scraping.
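A sketch of those waiting strategies, assuming the course-card selector used elsewhere in this write-up and an arbitrary 10-second timeout:

```js
// Block until the course cards exist in the DOM, or fail after 10 seconds.
await page.waitForSelector('#courses .card', { timeout: 10000 });

// When a click triggers a full page load, wait for navigation to finish
// before querying the new page (the link selector is hypothetical).
await Promise.all([page.waitForNavigation(), page.click('a.next-page')]);
```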
Data Extraction and Parsing
Built robust extraction logic that handles missing elements, null values, and inconsistent data structures with fallback values.
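For instance, a defensive mapping might look like the following; the fallback values are illustrative choices, not the script's actual defaults:

```js
const courses = await page.$$eval('#courses .card', (cards) =>
  cards.map((card) => {
    const titleEl = card.querySelector('.card-body h3');
    const promoEl = card.querySelector('.card-footer .promo-code .promo');
    return {
      // Fall back to sensible defaults when an element is missing.
      title: titleEl ? titleEl.textContent.trim() : 'Untitled',
      promo: promoEl ? promoEl.textContent.trim() : null,
    };
  })
);
```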
JSON File Generation
Used Node.js fs.writeFile() with JSON.stringify() to create properly formatted, human-readable JSON output with error handling.
Error Handling for Network Issues
Implemented try-catch blocks for network failures, timeout issues, and selector mismatches with informative error messages.
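A sketch of that pattern, assuming the browser and page objects from the launch step; the timeout and messages are examples rather than the script's exact values:

```js
try {
  await page.goto('https://example.com/courses', { timeout: 30000 });
  await page.waitForSelector('#courses .card');
} catch (err) {
  if (err.name === 'TimeoutError') {
    console.error('Navigation or selector timed out. Is the site reachable?');
  } else {
    console.error('Scraping failed:', err.message);
  }
  await browser.close();
  process.exit(1);
}
```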
Use Cases
- Data Collection: Gather product information, prices, reviews
- Market Research: Monitor competitor websites and pricing
- Content Aggregation: Collect articles, courses, or listings
- Testing: Automate UI testing and regression tests
- Monitoring: Track website changes and updates
- SEO Analysis: Extract meta tags, headings, and content
Problem Solved
Manually copying data from websites is time-consuming and error-prone, and many websites don't offer APIs or data exports. Web scraping automates this tedious process, extracting structured data from web pages into a usable JSON format that is ready for data analysis, content aggregation, and research.
What Makes It Unique
Unlike simple fetch-based scrapers, this uses Puppeteer with headless Chrome to render JavaScript-heavy pages accurately. The $$eval method allows efficient bulk extraction in a single DOM operation. The clean JSON output format makes data immediately usable in other applications.
Impact
- Automation Skills: Learned browser automation techniques applicable to testing, monitoring, and data collection across various projects
- Puppeteer Mastery: Gained deep understanding of headless browser APIs, useful for testing frameworks and automation tools
- Data Engineering: Developed skills in extracting, transforming, and structuring data from unstructured web sources
Future Enhancements
- Add support for pagination and infinite scroll
- Implement proxy rotation for large-scale scraping
- Add data validation and cleaning
- Create scheduling for periodic scraping
- Add database storage instead of JSON files
- Implement screenshot capture for visual verification
- Add support for authentication and cookies
- Create configuration file for multiple targets
