In this tutorial, we'll learn how to use Node.js and the Google-built Puppeteer library to control an instance of headless Chrome and perform various tasks that we want it to. More specifically, we will learn how to use it to scrape a webpage for information that we want.
Other use cases for Puppeteer include automating manual testing, printing out screenshots of a page, and test Chrome Extensions.
Prerequisites
Basic JavaScript knowledge is needed. If you need a refresher, check out our class on JavaScript.
Thanks to the goto method, we are now on this site's homepage using a headless instance of Chrome!
Scraping a Webpage
Now that we are at a webpage we want to scrape, we only need to run the evaluate function on the page, and give it a callback function that performs that actions we want to perform on that page.
For the sake of example, I've chosen a simple example, simply getting back the h1 tag of the homepage. Because we are working with a document, we have access to the document object just like you would in the console of your actual browser. Because of this, we can use the same selectors we would otherwise.
After we get the text, we can simply return it so that it gets stored in the result variable and then we can log it to our console.
You should see this if all went well:
HTML
Become a better developer
Conclusion
Puppeteer's API is incredibly powerful and that was truly just a small taste at what you can do with it. You can use it to fully fill out forms, perform complex tasks manually, render entire single-page applications, and of course, scape data from websites. Do check out the below resources to learn more about this great tool!