Web Scraping with TypeScript and Puppeteer

Web scraping is a powerful tool for extracting information from websites. Whether you need to collect data for a project, monitor prices, or gather competitive intelligence, scraping can save you a lot of time and effort. In this guide, I’ll walk you through how to use TypeScript and Puppeteer to scrape web pages after they have fully loaded and rendered JavaScript content.

Introduction to Web Scraping with Puppeteer

Web scraping involves fetching web pages and extracting useful information from them. However, many modern websites rely heavily on JavaScript for dynamic content loading, which means simple HTTP requests often won’t get you the data you need. This is where Puppeteer comes in. Puppeteer is a Node library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. With Puppeteer, we can simulate user interactions and scrape content after the JavaScript has fully executed.

Setting Up the Environment

Before diving into the code, make sure you have Node.js and npm installed on your machine. With these prerequisites out of the way, we can install the necessary packages for our project.
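
If you’re not sure whether both are installed, you can check their versions from the terminal:

ShellScript
node --version
npm --version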

Installing Dependencies

First, we’ll need to install Puppeteer, TypeScript, tsx, and the Node type definitions. Open your terminal and run the following command:

ShellScript
npm install puppeteer typescript @types/node tsx

This command installs Puppeteer for controlling the browser, TypeScript for type checking and transpiling our code, the Node type definitions, and tsx for running TypeScript files directly. Note that Puppeteer ships with its own type definitions, so the separate @types/puppeteer package is deprecated and not needed.

Creating the TypeScript Configuration

Next, we’ll set up a TypeScript configuration file. Create a tsconfig.json file in your project directory with the following content:

JSON
{
  "compilerOptions": {
    "target": "ES6",
    "module": "commonjs",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["src"]
}

This configuration ensures our TypeScript code is compiled to ES6-compatible JavaScript and places the output in the dist directory.
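
For reference, once we’ve written src/scrape.ts in the next section, the traditional workflow with this configuration would be to compile with tsc and run the emitted JavaScript with Node.js; later we’ll use tsx to skip the separate compile step:

ShellScript
npx tsc
node dist/scrape.js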

Writing the Scraping Code

With our environment set up, we can now write the code to scrape a webpage. We’ll create a TypeScript file named scrape.ts inside a src directory. This file will contain our scraping logic.

Launching a Headless Browser

The first step is to launch a headless browser using Puppeteer. We’ll create a function that takes a URL, navigates to it, waits for the page to fully load, and then extracts the desired content.

Here’s the code:

TypeScript
import puppeteer from 'puppeteer';

async function scrapeUrl(url: string) {
  // Launch a headless browser
  const browser = await puppeteer.launch();

  try {
    const page = await browser.newPage();

    // Navigate to the URL and wait until network activity has settled
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Wait for a specific element to be loaded
    // Replace 'body' with the actual selector of an element that indicates the page is fully loaded
    await page.waitForSelector('body');

    // Scrape the content
    const content = await page.evaluate(() => {
      // Replace 'body' with the actual element or content you want to scrape
      return document.querySelector('body')?.innerHTML || '';
    });

    return content;
  } finally {
    // Close the browser even if something above throws
    await browser.close();
  }
}

// Example usage
(async () => {
  const url = 'https://example.com';
  const content = await scrapeUrl(url);
  console.log(content);
})();

Running the TypeScript Code

Instead of compiling our TypeScript code and running it with Node.js, we can use tsx to streamline this process. tsx allows us to run TypeScript files directly. To execute our script, run the following command in your terminal:

ShellScript
npx tsx src/scrape.ts

This command will launch a headless browser, navigate to the specified URL, wait for the page to load, scrape the content, and then print it to the console.
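
Optionally, you can add a script entry to your package.json so the scraper can be run with npm run scrape (the script name scrape here is just a suggestion):

JSON
{
  "scripts": {
    "scrape": "tsx src/scrape.ts"
  }
}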

Handling Dynamic Content

Many websites load content dynamically using JavaScript frameworks like React, Angular, or Vue. In such cases, it’s crucial to wait for specific elements to load before attempting to scrape the data. Puppeteer’s waitForSelector method is perfect for this. You can specify any CSS selector that indicates the page has fully loaded.
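
waitForSelector also accepts options such as visible and timeout, which help on slow or flaky pages. Here’s a minimal sketch that would slot into the scrapeUrl function from earlier (the .headline selector is just a placeholder for whatever element you care about):

TypeScript
// Wait up to 10 seconds for the element to exist in the DOM *and* be visible;
// waitForSelector rejects with a TimeoutError if it never appears.
await page.waitForSelector('.headline', { visible: true, timeout: 10000 });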

Example: Scraping Dynamic Content

Let’s consider an example where we want to scrape the latest news headlines from a news website that loads content dynamically. We’ll modify our scrape.ts to handle this scenario:

TypeScript
import puppeteer from 'puppeteer';

async function scrapeUrl(url: string) {
  const browser = await puppeteer.launch();

  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Wait for the headlines to be loaded
    await page.waitForSelector('.headline');

    // Collect the text of every element with the 'headline' class
    const headlines = await page.evaluate(() => {
      const headlineElements = document.querySelectorAll('.headline');
      return Array.from(headlineElements).map(el => el.textContent?.trim() || '');
    });

    return headlines;
  } finally {
    // Close the browser even if something above throws
    await browser.close();
  }
}

(async () => {
  const url = 'https://newswebsite.com';
  const headlines = await scrapeUrl(url);
  console.log(headlines);
})();

In this example, we wait for elements with the class headline to load and then extract their text content.
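
If you want to keep the results rather than just print them, one simple option is to write them to disk with Node’s built-in fs/promises module. As a sketch, you could replace the example usage block at the end of scrape.ts with something like this (the headlines.json filename is arbitrary):

TypeScript
import { writeFile } from 'fs/promises';

(async () => {
  const url = 'https://newswebsite.com';
  const headlines = await scrapeUrl(url);

  // Persist the scraped headlines as pretty-printed JSON
  await writeFile('headlines.json', JSON.stringify(headlines, null, 2), 'utf8');
  console.log(`Saved ${headlines.length} headlines to headlines.json`);
})();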

Conclusion

Web scraping with TypeScript and Puppeteer opens up a world of possibilities for automating data collection from websites. By leveraging Puppeteer’s powerful browser automation capabilities, we can handle complex pages that rely on JavaScript for rendering content. Using TypeScript ensures our code is robust and maintainable, thanks to its strong typing and excellent tooling support.

From setting up the environment to handling dynamic content, we’ve covered the essential steps to get you started with web scraping. Experiment with different websites and selectors to scrape the data you need. The web is your oyster, and with Puppeteer and TypeScript, you have the tools to harvest its pearls efficiently and effectively.
