Web Scraping with Puppeteer

A Node.js script that scrapes course data from websites using Puppeteer and saves it as JSON. Demonstrates headless browser automation for data extraction.

Timeline

1 week

Role

Backend Developer

Team

Solo

Status

Completed

Technology Stack

Node.js
Puppeteer
JavaScript
Chrome DevTools
JSON

Key Challenges

  • Puppeteer Browser Automation
  • DOM Element Selection
  • Handling Dynamic Content
  • Data Extraction and Parsing
  • JSON File Generation
  • Error Handling for Network Issues

Key Learnings

  • Puppeteer API and Methods
  • Web Scraping Techniques
  • Headless Browser Automation
  • CSS Selector Querying
  • Node.js File System Operations
  • Chrome DevTools Protocol
  • Async/Await Patterns

Overview

A Node.js web scraping script that demonstrates browser automation using Puppeteer. The script launches a headless Chrome browser, navigates to target websites, extracts structured data using CSS selectors, and saves the results to a JSON file. This project showcases practical web scraping techniques for data collection and automation.

Key Features

  • Headless Browser: Launches Chrome/Chromium in background without GUI
  • Automated Navigation: Programmatically visits target websites
  • Data Extraction: Scrapes specific data using CSS selectors
  • Structured Output: Saves extracted data to courses.json file
  • $$eval Method: Uses Puppeteer's $$eval() to extract data from many elements in a single call
  • Console Output: Displays scraped data in terminal for verification
  • Error Handling: Gracefully handles navigation and scraping errors

What This Project Does

Scraping Process

  1. Browser Launch: Opens a headless Chromium instance in the background
  2. Page Navigation: Navigates to the target website URL
  3. Data Extraction: Queries DOM elements using CSS selectors
  4. Data Parsing: Extracts text content, attributes, and URLs from elements
  5. JSON Generation: Structures data into JSON format
  6. File Writing: Saves the JSON data to courses.json

Extracted Data Structure

{
  title: "Course Title",       // from .card-body h3
  level: "Beginner",           // from .card-body .level
  url: "https://...",          // from .card-footer a
  promo: "PROMO123"            // from .card-footer .promo-code .promo
}

Tech Stack

  • Runtime: Node.js
  • Browser Automation: Puppeteer
  • Browser: Chromium (bundled with Puppeteer)
  • File System: Node.js fs module
  • Language: JavaScript (ES6+)

What is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API to control Chrome/Chromium over the DevTools Protocol. It runs in headless mode by default but can be configured for full (non-headless) mode.
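
Headed mode is a one-line configuration change, which is handy when debugging a selector visually. A minimal sketch (the slowMo value here is arbitrary):

const puppeteer = require('puppeteer');

// Inside an async function: launch a visible browser and slow each
// Puppeteer action down by 100 ms so the steps are easy to watch.
const browser = await puppeteer.launch({
  headless: false,
  slowMo: 100,
});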

Common Use Cases

  • Generate screenshots and PDFs of pages (see the sketch after this list)
  • Crawl SPAs and generate pre-rendered content (SSR)
  • Automate form submission and UI testing
  • Create automated testing environments
  • Test Chrome Extensions
  • Capture performance traces
  • Web scraping and data extraction
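
The first use case above takes only a few lines. A minimal sketch (the file names are placeholders, and page.pdf() only works in headless mode):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Capture the viewport as a PNG and render the full page as an A4 PDF.
  await page.screenshot({ path: 'example.png' });
  await page.pdf({ path: 'example.pdf', format: 'A4' });

  await browser.close();
})();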

Technical Highlights

Puppeteer Integration

Configured Puppeteer to launch a headless Chrome browser with proper page navigation, waiting strategies, and graceful shutdown, as sketched below.
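
A sketch of that setup, assuming the Chromium bundled with Puppeteer ('networkidle2' is one of Puppeteer's standard waitUntil options):

// Inside an async function.
const browser = await puppeteer.launch();
try {
  const page = await browser.newPage();
  // 'networkidle2' resolves once no more than two network requests
  // have been in flight for 500 ms, a common "page is ready" heuristic.
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  // ... scrape here ...
} finally {
  // Close the browser even if scraping throws, so no Chromium
  // processes are left running.
  await browser.close();
}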

DOM Querying

Utilized $$eval() method for bulk element extraction, efficiently querying multiple elements and extracting data in a single operation.

Data Extraction

Implemented CSS selector-based data extraction to target specific HTML elements and extract text content, attributes, and href values.

JSON File Generation

Used Node.js file system module to write structured data to JSON file with proper formatting and error handling.

Async/Await Pattern

Leveraged async/await syntax for clean, readable asynchronous code handling browser operations and file I/O.

How It Works

1. Browser Launch

const browser = await puppeteer.launch();
const page = await browser.newPage();

2. Navigation

await page.goto('https://example.com');

3. Data Extraction

const data = await page.$$eval('#courses .card', (cards) => {
  return cards.map((card) => ({
    title: card.querySelector('.card-body h3').textContent,
    level: card.querySelector('.card-body .level').textContent,
    // ... more fields
  }));
});

4. Save to JSON

const fs = require('fs/promises');

// The promise-based API works cleanly with await; the callback-based
// fs.writeFile would throw without a callback argument.
await fs.writeFile('courses.json', JSON.stringify(data, null, 2));
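
Putting the four steps together, a complete minimal sketch (the URL is a placeholder; the selectors are the ones from the data structure above):

const puppeteer = require('puppeteer');
const fs = require('fs/promises');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com', { waitUntil: 'networkidle2' });

    // One object per course card, extracted in a single DOM pass.
    const data = await page.$$eval('#courses .card', (cards) =>
      cards.map((card) => ({
        title: card.querySelector('.card-body h3')?.textContent.trim(),
        level: card.querySelector('.card-body .level')?.textContent.trim(),
        url: card.querySelector('.card-footer a')?.href,
        promo: card.querySelector('.card-footer .promo-code .promo')?.textContent.trim(),
      }))
    );

    console.log(data); // print to the terminal for verification
    await fs.writeFile('courses.json', JSON.stringify(data, null, 2));
  } finally {
    await browser.close();
  }
})();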

Installation & Usage

# Install dependencies (includes Chromium)
npm install

# Run the scraper
npm start

The script prints the scraped data to the console and saves it to courses.json.

Challenges & Solutions

Puppeteer Browser Automation

Learned Puppeteer API including browser launching, page navigation, waiting for selectors, and proper resource cleanup with browser.close().

DOM Element Selection

Mastered CSS selector syntax and Puppeteer's query methods ($, $$, $eval, $$eval) for efficient element targeting and data extraction.
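
For reference, the four methods differ in what they return and how many elements they touch (a sketch using the same selectors as above):

// Inside an async function, with `page` already open.
const firstCard = await page.$('#courses .card');  // first match, as an ElementHandle
const allCards = await page.$$('#courses .card');  // every match, as ElementHandles

// $eval / $$eval run the callback inside the browser and return plain values.
const firstTitle = await page.$eval('#courses .card h3', (el) => el.textContent);
const allTitles = await page.$$eval('#courses .card h3', (els) =>
  els.map((el) => el.textContent)
);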

Handling Dynamic Content

Implemented proper waiting strategies using waitForSelector() and waitForNavigation() to ensure content is loaded before scraping.
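
Two common patterns, sketched with placeholder selectors:

// Block until the course cards exist in the DOM before querying them.
await page.waitForSelector('#courses .card');

// When a click triggers a page load, start waiting for the navigation
// before clicking so the event cannot be missed.
await Promise.all([
  page.waitForNavigation(),
  page.click('a.next-page'),
]);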

Data Extraction and Parsing

Built robust extraction logic that handles missing elements, null values, and inconsistent data structures with fallback values.
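
Inside the $$eval callback, optional chaining plus a default keeps one malformed card from crashing the whole run (the fallback values here are arbitrary):

const data = await page.$$eval('#courses .card', (cards) =>
  cards.map((card) => ({
    // If a selector matches nothing, fall back instead of throwing
    // "Cannot read properties of null".
    title: card.querySelector('.card-body h3')?.textContent.trim() ?? 'Untitled',
    promo: card.querySelector('.card-footer .promo-code .promo')?.textContent.trim() ?? null,
  }))
);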

JSON File Generation

Used Node.js fs.writeFile() with JSON.stringify() to create properly formatted, human-readable JSON output with error handling.
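
The write itself is a few lines; a try/catch turns disk or permission failures into a readable message:

const fs = require('fs/promises');

try {
  // null, 2 => pretty-print with two-space indentation.
  await fs.writeFile('courses.json', JSON.stringify(data, null, 2));
} catch (err) {
  console.error('Failed to write courses.json:', err.message);
}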

Error Handling for Network Issues

Implemented try-catch blocks for network failures, timeout issues, and selector mismatches with informative error messages.
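
A sketch of that pattern (the timeout values are arbitrary; 'TimeoutError' is the error name Puppeteer uses for expired waits):

try {
  await page.goto('https://example.com', { timeout: 30000 });
  await page.waitForSelector('#courses .card', { timeout: 5000 });
} catch (err) {
  if (err.name === 'TimeoutError') {
    console.error('Page or selector timed out: is the site up and the markup unchanged?');
  } else {
    console.error('Navigation failed:', err.message);
  }
}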

Use Cases

  • Data Collection: Gather product information, prices, reviews
  • Market Research: Monitor competitor websites and pricing
  • Content Aggregation: Collect articles, courses, or listings
  • Testing: Automate UI testing and regression tests
  • Monitoring: Track website changes and updates
  • SEO Analysis: Extract meta tags, headings, and content

Problem Solved

Manually copying data from websites is time-consuming and error-prone, and many websites don't offer APIs or data exports. Web scraping automates this tedious process, extracting structured data into a usable JSON format for data analysis, content aggregation, and research.

What Makes It Unique

Unlike simple fetch-based scrapers, this uses Puppeteer with headless Chrome to render JavaScript-heavy pages accurately. The $$eval method allows efficient bulk extraction in a single DOM operation. The clean JSON output format makes data immediately usable in other applications.

Impact

  • Automation Skills: Learned browser automation techniques applicable to testing, monitoring, and data collection across various projects
  • Puppeteer Mastery: Gained deep understanding of headless browser APIs, useful for testing frameworks and automation tools
  • Data Engineering: Developed skills in extracting, transforming, and structuring data from unstructured web sources

Future Enhancements

  • Add support for pagination and infinite scroll (one possible approach is sketched after this list)
  • Implement proxy rotation for large-scale scraping
  • Add data validation and cleaning
  • Create scheduling for periodic scraping
  • Add database storage instead of JSON files
  • Implement screenshot capture for visual verification
  • Add support for authentication and cookies
  • Create configuration file for multiple targets
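
For the first enhancement, one possible approach (entirely hypothetical: the next-page selector, the loop condition, and the extractCards callback would all depend on the target site):

// Keep scraping until there is no enabled "next page" link.
// extractCards stands for the self-contained card-mapping callback
// shown earlier; $$eval serializes it into the browser context.
const allCourses = [];
while (true) {
  allCourses.push(...await page.$$eval('#courses .card', extractCards));
  const next = await page.$('a.next-page:not(.disabled)');
  if (!next) break;
  await Promise.all([page.waitForNavigation(), next.click()]);
}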
