Web scraping is a powerful tool for extracting information from websites. Whether you need to collect data for a project, monitor prices, or gather competitive intelligence, scraping can save you a lot of time and effort. In this guide, I’ll walk you through how to use TypeScript and Puppeteer to scrape web pages after they have fully loaded and rendered JavaScript content.
Introduction to Web Scraping with Puppeteer
Web scraping involves fetching web pages and extracting useful information from them. However, many modern websites rely heavily on JavaScript for dynamic content loading, which means simple HTTP requests often won’t get you the data you need. This is where Puppeteer comes in. Puppeteer is a Node library that provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. With Puppeteer, we can simulate user interactions and scrape content after the JavaScript has fully executed.
Setting Up the Environment
Before diving into the code, make sure you have Node.js and npm installed on your machine. With these prerequisites out of the way, we can install the necessary packages for our project.
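If you’re not sure whether they’re installed, you can verify both from the terminal by checking their versions:

node -v
npm -v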
Installing Dependencies
First, we’ll need to install Puppeteer, TypeScript, tsx, and the Node.js type definitions. Open your terminal and run the following command:

npm install puppeteer typescript @types/node tsx

This command installs Puppeteer for controlling the browser, TypeScript for type checking and transpiling our code, tsx for running TypeScript files directly, and the Node.js type definitions. Note that Puppeteer ships with its own TypeScript types, so the deprecated @types/puppeteer stub is no longer needed.
Creating the TypeScript Configuration
Next, we’ll set up a TypeScript configuration file. Create a tsconfig.json file in your project directory with the following content:
{
  "compilerOptions": {
    "target": "ES6",
    "module": "commonjs",
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true
  },
  "include": ["src"]
}
This configuration ensures our TypeScript code is compiled to ES6-compatible JavaScript and places the output in the dist directory.
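We’ll use tsx to run our script directly in a later step, but with this configuration in place you can also compile manually; running the TypeScript compiler emits the JavaScript into dist:

npx tsc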
Writing the Scraping Code
With our environment set up, we can now write the code to scrape a webpage. We’ll create a TypeScript file named scrape.ts inside a src directory. This file will contain our scraping logic.
Launching a Headless Browser
The first step is to launch a headless browser using Puppeteer. We’ll create a function that takes a URL, navigates to it, waits for the page to fully load, and then extracts the desired content.
Here’s the code:
import puppeteer from 'puppeteer';

async function scrapeUrl(url: string) {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the URL
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for a specific element to be loaded
  // Replace 'body' with the actual selector of an element that indicates the page is fully loaded
  await page.waitForSelector('body');

  // Scrape the content
  const content = await page.evaluate(() => {
    // Replace 'body' with the actual element or content you want to scrape
    return document.querySelector('body')?.innerHTML || '';
  });

  // Close the browser
  await browser.close();

  return content;
}

// Example usage
(async () => {
  const url = 'https://example.com';
  const content = await scrapeUrl(url);
  console.log(content);
})();
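One caveat with the version above: if goto or waitForSelector throws, browser.close() is never reached and a stray Chromium process is left running. A minimal hardening sketch using try/finally (my own addition, not part of the original walkthrough):

import puppeteer from 'puppeteer';

async function scrapeUrlSafe(url: string) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    await page.waitForSelector('body');
    // Same extraction as before; adjust the selector to your target content
    return await page.evaluate(() => document.querySelector('body')?.innerHTML || '');
  } finally {
    // Runs whether scraping succeeded or threw, so the browser always shuts down
    await browser.close();
  }
}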
Running the TypeScript Code
Instead of compiling our TypeScript code and running it with Node.js, we can use tsx to streamline this process. tsx allows us to run TypeScript files directly. To execute our script, run the following command in your terminal:
npx tsx src/scrape.ts
This command will launch a headless browser, navigate to the specified URL, wait for the page to load, scrape the content, and then print it to the console.
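If you plan to run the scraper repeatedly, you can optionally add an npm script to your package.json so the command stays short (the script name scrape here is an arbitrary choice):

{
  "scripts": {
    "scrape": "tsx src/scrape.ts"
  }
}

You can then run the scraper with npm run scrape.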
Handling Dynamic Content
Many websites load content dynamically using JavaScript frameworks like React, Angular, or Vue. In such cases, it’s crucial to wait for specific elements to load before attempting to scrape the data. Puppeteer’s waitForSelector method is perfect for this. You can specify any CSS selector that indicates the page has fully loaded.
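waitForSelector also accepts an options object, and for cases where no single selector is enough, Puppeteer’s waitForFunction waits on an arbitrary predicate evaluated inside the page. A quick sketch (the .headline selector and the count of five are illustrative assumptions):

// Fail faster than the default 30-second timeout if the element never appears
await page.waitForSelector('.headline', { timeout: 10_000 });

// Wait until at least five headlines have been rendered
await page.waitForFunction(
  () => document.querySelectorAll('.headline').length >= 5
);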
Example: Scraping Dynamic Content
Let’s consider an example where we want to scrape the latest news headlines from a news website that loads content dynamically. We’ll modify our scrape.ts to handle this scenario:
import puppeteer from 'puppeteer';

async function scrapeUrl(url: string) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for the headlines to be loaded
  await page.waitForSelector('.headline');

  const headlines = await page.evaluate(() => {
    const headlineElements = document.querySelectorAll('.headline');
    return Array.from(headlineElements).map(el => el.textContent?.trim() || '');
  });

  await browser.close();

  return headlines;
}

(async () => {
  const url = 'https://newswebsite.com';
  const headlines = await scrapeUrl(url);
  console.log(headlines);
})();
In this example, we wait for elements with the class headline to load and then extract their text content.
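If you also want the link behind each headline, the same evaluate callback can return structured objects instead of plain strings. A sketch, assuming each headline is an anchor element carrying the headline class; adapt the selector to the real markup:

const headlines = await page.evaluate(() => {
  // Hypothetical markup: <a class="headline" href="...">...</a>
  const anchors = document.querySelectorAll<HTMLAnchorElement>('a.headline');
  return Array.from(anchors).map(a => ({
    title: a.textContent?.trim() || '',
    url: a.href,
  }));
});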
Conclusion
Web scraping with TypeScript and Puppeteer opens up a world of possibilities for automating data collection from websites. By leveraging Puppeteer’s powerful browser automation capabilities, we can handle complex pages that rely on JavaScript for rendering content. Using TypeScript ensures our code is robust and maintainable, thanks to its strong typing and excellent tooling support.
From setting up the environment to handling dynamic content, we’ve covered the essential steps to get you started with web scraping. Experiment with different websites and selectors to scrape the data you need. The web is your oyster, and with Puppeteer and TypeScript, you have the tools to harvest its pearls efficiently and effectively.