How to Scrape the Web using Node.js and Puppeteer

In this tutorial, we'll learn how to use Node.js and the Google-built Puppeteer library to control an instance of headless Chrome and automate tasks in it. More specifically, we will learn how to use it to scrape a webpage for the information we want.

Other use cases for Puppeteer include automating manual testing, capturing screenshots of a page, and testing Chrome extensions.

Prerequisites

  1. Basic JavaScript knowledge is needed. If you need a refresher, check out our class on JavaScript.
  2. Node and NPM installed. If you don't have them installed, follow our how to install Node guide.

Installing Puppeteer

Installing Puppeteer is as simple as running the npm install command:

BASH
npm install puppeteer

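Now, create a new file, index.js, and import Puppeteer at the top. Because this uses ES module import syntax, either set "type": "module" in your package.json or name the file index.mjs: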

JAVASCRIPT
import puppeteer from "puppeteer";

Now we can try launching a browser instance using Puppeteer's launch function:

JAVASCRIPT
import puppeteer from "puppeteer";

async function run() {
    const browser = await puppeteer.launch();
}

run();

Puppeteer is now running! Next, we need to tell it to do something. Let's have it open a new page.

JAVASCRIPT
import puppeteer from "puppeteer";

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
}

run();

Now with our new page, we can have it navigate to a site. Let's try this site, why not?

JAVASCRIPT
import puppeteer from "puppeteer";

const url = "https://sabe.io/";

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
}

run();

The Sabe Homepage

Thanks to the goto method, we are now on this site's homepage using a headless instance of Chrome!
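
The snippets above never shut the browser down, which usually leaves the Chrome process (and the Node script with it) running after the work is done. Here is a minimal sketch of one way to guarantee cleanup. Note that withBrowser and readTitle are hypothetical helper names, not part of Puppeteer; they only rely on the launch, newPage, goto, title, and close calls.

```javascript
// Hypothetical helper (not part of Puppeteer): launches a browser, runs a task
// with it, and always closes the browser, even if the task throws.
async function withBrowser(launch, task) {
    const browser = await launch();
    try {
        return await task(browser);
    } finally {
        await browser.close();
    }
}

// Example task: open a page, navigate to the URL, and return the page title.
async function readTitle(browser, url) {
    const page = await browser.newPage();
    await page.goto(url);
    return page.title();
}
```

With Puppeteer imported as in the earlier snippets, calling withBrowser(() => puppeteer.launch(), (browser) => readTitle(browser, "https://sabe.io/")) should resolve to the page title and close the browser afterwards.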

Scraping a Webpage

Now that we are at a webpage we want to scrape, we only need to run the evaluate function on the page and give it a callback function that performs the actions we want on that page.

JAVASCRIPT
import puppeteer from "puppeteer";

const url = "https://sabe.io/";

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    // This callback runs inside the page, not in Node.
    const result = await page.evaluate(() => {
        return document.querySelector("h1").innerText;
    });

    console.log(result);

    // Close the browser so the script can exit.
    await browser.close();
}

run();

The h1 tag on the Sabe Homepage

To keep things simple, this example just grabs the text of the h1 tag on the homepage. The callback runs inside the page itself, so we have access to the document object just like you would in the console of your actual browser. Because of this, we can use the same selectors we would use there.
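
One caveat: document.querySelector returns null when nothing matches, and reading innerText from null throws. As a rough sketch, the function below takes the selector as an argument (Puppeteer forwards extra arguments given to page.evaluate on to the callback) and returns null instead of failing. The names textOf and scrapeText are hypothetical, and trimming the whitespace is an addition of this sketch.

```javascript
// Hypothetical page function: it runs inside the browser, where `document` exists.
// It must be self-contained, because Puppeteer serializes it before sending it to the page.
function textOf(selector) {
    const element = document.querySelector(selector);
    // Return null rather than throwing when the selector matches nothing.
    return element ? element.innerText.trim() : null;
}

// Passes the selector to the page function as an extra page.evaluate argument.
async function scrapeText(page, selector) {
    return page.evaluate(textOf, selector);
}
```

In the script above, await scrapeText(page, "h1") could stand in for the inline page.evaluate call.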

After we get the text, we can simply return it so that it gets stored in the result variable and then we can log it to our console.

You should see this if all went well:

HTML
Become a better developer
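
Scraping usually means collecting more than one element. As a hedged sketch, the page's $$eval method runs a selector and hands every matching element to a callback inside the page. The names extractLinks and scrapeLinks are hypothetical, and dropping links with empty text is just a choice made for this illustration.

```javascript
// Hypothetical page function: turns the matched <a> elements into plain objects.
// Only serializable data (strings, numbers, arrays, plain objects) can be sent back to Node.
function extractLinks(anchors) {
    return anchors
        .map((a) => ({ text: a.textContent.trim(), href: a.href }))
        .filter((link) => link.text.length > 0);
}

// Collects every link on the page; assumes `page` has already navigated to a URL.
async function scrapeLinks(page) {
    return page.$$eval("a", extractLinks);
}
```

With the page from the script above, console.log(await scrapeLinks(page)) would print an array of objects, each with a text and an href property.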

Conclusion

Puppeteer's API is incredibly powerful, and that was truly just a small taste of what you can do with it. You can use it to fill out forms, automate complex tasks you would otherwise perform manually, render entire single-page applications, and of course, scrape data from websites. Do check out the resources below to learn more about this great tool!

Resources
