
Web Scraping with Puppeteer
A Node.js script that scrapes course data from websites using Puppeteer and saves it to JSON format. Demonstrates headless browser automation for data extraction.
Timeline
1 week
Role
Backend Developer
Team
Solo
Status
Completed
Technology Stack
- Node.js
- Puppeteer
- Chromium
- JavaScript (ES6+)
Key Challenges
- Puppeteer Browser Automation
- DOM Element Selection
- Handling Dynamic Content
- Data Extraction and Parsing
- JSON File Generation
- Error Handling for Network Issues
Key Learnings
- Puppeteer API and Methods
- Web Scraping Techniques
- Headless Browser Automation
- CSS Selector Querying
- Node.js File System Operations
- Chromium DevTools Protocol
- Async/Await Patterns
Overview
A Node.js web scraping script that demonstrates browser automation using Puppeteer. The script launches a headless Chrome browser, navigates to target websites, extracts structured data using CSS selectors, and saves the results to JSON format. This project showcases practical web scraping techniques for data collection and automation.
Key Features
- Headless Browser: Launches Chrome/Chromium in the background without a GUI
- Automated Navigation: Programmatically visits target websites
- Data Extraction: Scrapes specific data using CSS selectors
- Structured Output: Saves extracted data to a courses.json file
- $$eval Method: Uses Puppeteer's powerful query methods for bulk extraction
- Console Output: Displays scraped data in terminal for verification
- Error Handling: Gracefully handles navigation and scraping errors
What This Project Does
Scraping Process
- Browser Launch: Opens a headless Chromium instance in the background
- Page Navigation: Navigates to the target website URL
- Data Extraction: Queries DOM elements using CSS selectors
- Data Parsing: Extracts text content, attributes, and URLs from elements
- JSON Generation: Structures data into JSON format
- File Writing: Saves the JSON data to courses.json
Extracted Data Structure
```js
{
  title: "Course Title",   // from .card-body h3
  level: "Beginner",       // from .card-body .level
  url: "https://...",      // from .card-footer a
  promo: "PROMO123"        // from .card-footer .promo-code .promo
}
```
Tech Stack
- Runtime: Node.js
- Browser Automation: Puppeteer
- Browser: Chromium (bundled with Puppeteer)
- File System: Node.js fs module
- Language: JavaScript (ES6+)
What is Puppeteer?
Puppeteer is a Node.js library that provides a high-level API to control Chrome/Chromium over the DevTools Protocol. It runs in headless mode by default but can be configured for full (non-headless) mode.
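As a small illustration (these launch options are standard Puppeteer options, not settings taken from this project), switching between headless and headed mode is a single flag:

```js
const puppeteer = require('puppeteer');

(async () => {
  // Headless by default; set headless: false to watch the browser work.
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```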
Common Use Cases
- Generate screenshots and PDFs of pages
- Crawl SPAs and generate pre-rendered content (SSR)
- Automate form submission and UI testing
- Create automated testing environments
- Test Chrome Extensions
- Capture performance traces
- Web scraping and data extraction
Technical Highlights
Puppeteer Integration
Configured Puppeteer to launch headless Chrome browser with proper page navigation, waiting strategies, and graceful shutdown.
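A minimal sketch of that flow, assuming a generic target URL; the waitUntil value and try/finally cleanup are illustrative choices rather than the project's exact settings:

```js
const puppeteer = require('puppeteer');

async function scrape(url) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so dynamically loaded content is present.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // ... query the DOM here ...
  } finally {
    // Always release the Chromium process, even if scraping throws.
    await browser.close();
  }
}
```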
DOM Querying
Utilized $$eval() method for bulk element extraction, efficiently querying multiple elements and extracting data in a single operation.
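For example, $$eval runs its callback inside the page against every element matching the selector and returns plain data to Node in one call (the selector below is illustrative, and page comes from the launch step):

```js
// Collect the href of every link on the page in a single round trip.
const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
console.log(links);
```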
Data Extraction
Implemented CSS selector-based data extraction to target specific HTML elements and extract text content, attributes, and href values.
JSON File Generation
Used Node.js file system module to write structured data to JSON file with proper formatting and error handling.
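A sketch of that step using the promise-based fs API; the callback or synchronous variants would work just as well:

```js
const fs = require('fs/promises');

async function saveJson(data) {
  try {
    // Two-space indentation keeps the output human-readable.
    await fs.writeFile('courses.json', JSON.stringify(data, null, 2));
    console.log(`Saved ${data.length} records to courses.json`);
  } catch (err) {
    console.error('Failed to write courses.json:', err);
  }
}
```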
Async/Await Pattern
Leveraged async/await syntax for clean, readable asynchronous code handling browser operations and file I/O.
How It Works
1. Browser Launch
```js
const browser = await puppeteer.launch();
const page = await browser.newPage();
```
2. Navigation
```js
await page.goto('https://example.com');
```
3. Data Extraction
```js
const data = await page.$$eval('#courses .card', (cards) => {
  return cards.map((card) => ({
    title: card.querySelector('.card-body h3').textContent,
    level: card.querySelector('.card-body .level').textContent,
    // ... more fields
  }));
});
```
4. Save to JSON
```js
fs.writeFile('courses.json', JSON.stringify(data, null, 2), (err) => {
  if (err) throw err;
});
```
Installation & Usage
```bash
# Install dependencies (includes Chromium)
npm install

# Run the scraper
npm start
```
The script prints the scraped data to the console and saves it to courses.json.
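Putting the steps together, the whole script is roughly the following sketch; the target URL is a placeholder and the selectors mirror the data structure described above rather than the project's exact source:

```js
const puppeteer = require('puppeteer');
const fs = require('fs/promises');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://example.com/courses', { waitUntil: 'networkidle2' });

    // Extract one object per course card in a single page evaluation.
    const courses = await page.$$eval('#courses .card', (cards) =>
      cards.map((card) => ({
        title: card.querySelector('.card-body h3')?.textContent.trim(),
        level: card.querySelector('.card-body .level')?.textContent.trim(),
        url: card.querySelector('.card-footer a')?.href,
        promo: card.querySelector('.card-footer .promo-code .promo')?.textContent.trim(),
      }))
    );

    console.log(courses);
    await fs.writeFile('courses.json', JSON.stringify(courses, null, 2));
  } finally {
    await browser.close();
  }
})();
```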
Challenges & Solutions
Puppeteer Browser Automation
Learned Puppeteer API including browser launching, page navigation, waiting for selectors, and proper resource cleanup with browser.close().
DOM Element Selection
Mastered CSS selector syntax and Puppeteer's query methods ($, $$, $eval, $$eval) for efficient element targeting and data extraction.
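As a quick contrast between the two eval helpers (the selectors are hypothetical, and page comes from the launch step): $eval works on the first match, while $$eval receives all matches as an array.

```js
// First matching element only: read the page heading's text.
const heading = await page.$eval('h1', (el) => el.textContent);

// All matching elements: read the text of every course level badge.
const levels = await page.$$eval('.card-body .level', (els) =>
  els.map((el) => el.textContent.trim())
);
```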
Handling Dynamic Content
Implemented proper waiting strategies using waitForSelector() and waitForNavigation() to ensure content is loaded before scraping.
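A sketch of those waiting strategies, assuming the course-card selector used elsewhere in this write-up and an arbitrary 10-second timeout:

```js
// Block until the course cards exist in the DOM, or fail after 10 seconds.
await page.waitForSelector('#courses .card', { timeout: 10000 });

// When a click triggers a full page load, wait for navigation to finish
// before querying the new page (the link selector is hypothetical).
await Promise.all([page.waitForNavigation(), page.click('a.next-page')]);
```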
Data Extraction and Parsing
Built robust extraction logic that handles missing elements, null values, and inconsistent data structures with fallback values.
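For instance, a defensive mapping might look like the following; the fallback values are illustrative choices, not the script's actual defaults:

```js
const courses = await page.$$eval('#courses .card', (cards) =>
  cards.map((card) => {
    const titleEl = card.querySelector('.card-body h3');
    const promoEl = card.querySelector('.card-footer .promo-code .promo');
    return {
      // Fall back to sensible defaults when an element is missing.
      title: titleEl ? titleEl.textContent.trim() : 'Untitled',
      promo: promoEl ? promoEl.textContent.trim() : null,
    };
  })
);
```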
JSON File Generation
Used Node.js fs.writeFile() with JSON.stringify() to create properly formatted, human-readable JSON output with error handling.
Error Handling for Network Issues
Implemented try-catch blocks for network failures, timeout issues, and selector mismatches with informative error messages.
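A sketch of that pattern, assuming the browser and page objects from the launch step; the timeout and messages are examples rather than the script's exact values:

```js
try {
  await page.goto('https://example.com/courses', { timeout: 30000 });
  await page.waitForSelector('#courses .card');
} catch (err) {
  if (err.name === 'TimeoutError') {
    console.error('Navigation or selector timed out. Is the site reachable?');
  } else {
    console.error('Scraping failed:', err.message);
  }
  await browser.close();
  process.exit(1);
}
```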
Use Cases
- Data Collection: Gather product information, prices, reviews
- Market Research: Monitor competitor websites and pricing
- Content Aggregation: Collect articles, courses, or listings
- Testing: Automate UI testing and regression tests
- Monitoring: Track website changes and updates
- SEO Analysis: Extract meta tags, headings, and content
Problem Solved
Manually copying data from websites is time-consuming and error-prone, and many websites don't offer APIs or data exports. Web scraping automates this tedious process, extracting structured data from web pages into a usable JSON format that is ready for data analysis, content aggregation, and research.
What Makes It Unique
Unlike simple fetch-based scrapers, this uses Puppeteer with headless Chrome to render JavaScript-heavy pages accurately. The $$eval method allows efficient bulk extraction in a single DOM operation. The clean JSON output format makes data immediately usable in other applications.
Impact
- Automation Skills: Learned browser automation techniques applicable to testing, monitoring, and data collection across various projects
- Puppeteer Mastery: Gained deep understanding of headless browser APIs, useful for testing frameworks and automation tools
- Data Engineering: Developed skills in extracting, transforming, and structuring data from unstructured web sources
Future Enhancements
- Add support for pagination and infinite scroll
- Implement proxy rotation for large-scale scraping
- Add data validation and cleaning
- Create scheduling for periodic scraping
- Add database storage instead of JSON files
- Implement screenshot capture for visual verification
- Add support for authentication and cookies
- Create configuration file for multiple targets
