How to Scrape the Web using Node.js and Puppeteer

In this tutorial, we'll learn how to use Node.js and Puppeteer, a library built by Google, to control an instance of headless Chrome and make it perform tasks for us. More specifically, we'll learn how to use it to scrape a webpage for the information we want.

Other use cases for Puppeteer include automating manual testing, taking screenshots of pages, and testing Chrome Extensions.
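As a quick taste of that second use case: once Puppeteer is installed (we'll cover installation below), taking a screenshot only takes a few lines. A minimal sketch, with an output file name that's purely illustrative:

    const puppeteer = require("puppeteer");

    async function screenshot() {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto("https://sabe.io/");
        // Save a PNG of the current page to disk; the path is up to you.
        await page.screenshot({ path: "homepage.png" });
        await browser.close();
    }

    screenshot();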

Prerequisites

  1. Basic JavaScript knowledge is needed. If you need a refresher, check out our class on JavaScript.

  2. Familiarity with HTML and CSS helps but is not required, as all the markup and selectors you need are provided as-is.

Installing Node and NPM

To install the npm module we need for our scraper, we will first need Node.js, a JavaScript runtime.

  1. Visit the official Node.js website to get the installer.
  2. After it downloads, run the installer until the end.
  3. Restart your computer to ensure the changes can take effect.
The Node.js installer.

The Node.js installer should have also installed NPM for you. To confirm that both were installed properly, open Command Prompt if you're on Windows, or Terminal if you're on Mac or Linux.

To check that Node is installed:

    node -v

To check that NPM is installed:

    npm -v

If both of these commands return a version number, you're good to go.
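If you're following along, this is also a good time to create a folder for the project and initialize a package.json inside it. This isn't strictly required, but it's the usual convention for a Node project (the folder name here is just an example):

    mkdir scraper
    cd scraper
    npm init -y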

Installing Puppeteer

Installing Puppeteer is as simple as running the npm install command:

    npm install puppeteer

Now, create a new file, index.js, and require Puppeteer at the top:

    const puppeteer = require("puppeteer");

Now we can try launching a browser instance using Puppeteer's launch function:

    const puppeteer = require("puppeteer");

    async function run() {
        const browser = await puppeteer.launch();
    }

    run();

Puppeteer is now running! Next, we need to tell it to do something: let's have it open a new page.

    const puppeteer = require("puppeteer");

    async function run() {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
    }

    run();

With our new page, we can have it navigate to a site. Let's try this one, why not?

    const puppeteer = require("puppeteer");
    const url = "https://sabe.io/";

    async function run() {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);
    }

    run();

The Sabe Homepage

Thanks to the goto method, we are now on this site's homepage using a headless instance of Chrome!
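Two asides worth knowing at this point. First, by default, goto resolves once the page's load event fires; if a site fills in its content with JavaScript after that, you can ask Puppeteer to wait until network activity settles instead, using one of the built-in waitUntil values:

    // Resolve once no more than 2 network connections have been
    // active for at least 500 ms, useful for JavaScript-heavy pages.
    await page.goto(url, { waitUntil: "networkidle2" });

Second, because everything is headless, there is no window to watch. While developing, it can help to see what the browser is doing; launch accepts an option for that (the exact headless behavior can vary between Puppeteer versions):

    // Launch a visible browser window instead of a headless one.
    const browser = await puppeteer.launch({ headless: false });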

Scraping a Webpage

Now that we're on the webpage we want to scrape, we only need to run the evaluate function on the page and give it a callback that performs the actions we want on that page.

    const puppeteer = require("puppeteer");
    const url = "https://sabe.io/";

    async function run() {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);

        // This callback runs inside the page, not in our Node script.
        const result = await page.evaluate(() => {
            return document.querySelector("h1").innerText;
        });

        console.log(result);

        // Close the browser so the Node process can exit cleanly.
        await browser.close();
    }

    run();

The h1 tag on the Sabe Homepage

For the sake of demonstration, I've kept the example simple: we grab the text of the homepage's h1 tag. Because the callback runs against the page's document, we have access to the document object just like we would in the console of a real browser, so we can use all the same selectors we normally would.
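One catch: since the callback runs inside the page, variables from your Node script are not visible in it. To use one, pass it as an extra argument to evaluate; a small sketch:

    const selector = "h1"; // defined in our Node script, not in the page

    const result = await page.evaluate((sel) => {
        // sel is the value we passed in from our script.
        return document.querySelector(sel).innerText;
    }, selector);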

After we get the text, we return it from the callback so that it gets stored in the result variable, and then we log it to the console.
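We return a single string here, but anything serializable works, including arrays and objects. For instance, assuming the page has h2 tags, something like this would collect the text of all of them:

    const headings = await page.evaluate(() => {
        // Turn the NodeList into a real array so we can map over it.
        return Array.from(document.querySelectorAll("h2")).map((el) => el.innerText);
    });

    console.log(headings); // an array of strings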

Back to our h1 example: if all went well, you should see this output:

    Supercharged Web Development

Conclusion

Puppeteer's API is incredibly powerful, and this was truly just a small taste of what you can do with it. You can use it to fill out forms, automate complex manual tasks, render entire single-page applications, and, of course, scrape data from websites.
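As a small taste of the form automation mentioned above, typing into a field and clicking a button each take one call (the selectors here are made up for illustration; inspect your target page for real ones):

    // Hypothetical selectors, purely for illustration.
    await page.type("#search-input", "puppeteer");
    await page.click("#search-button");

Do check out the official Puppeteer documentation to learn more about this great tool!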
