How to Create a Website Scraper with Puppeteer and Firebase Functions

This tutorial explains how to create a web scraper with Puppeteer and deploy it to the web with Firebase functions.

Let’s build a simple scraper that downloads a web page and extracts content from it. For this example, we will use the New York Times home page as the source. The scraper extracts the top 10 news headlines and displays them on a web page. The scraping is done with the Puppeteer headless browser, and the web application is deployed to Firebase Functions.

1. Initialize a Firebase function

Assuming you have already created a Firebase project, you can initialize Firebase Functions in a local environment by running the following commands:

mkdir scraper
cd scraper
npx firebase init functions
cd functions
npm install puppeteer

Follow the prompts to initialize the project. The last command installs the Puppeteer package from npm so that the function can use the Puppeteer headless browser.

2. Create a Node.js app

Create a new pptr.js file in the functions folder to hold the code that scrapes the page content. The script downloads only the HTML of the page and blocks scripts, images, style sheets, videos and fonts to reduce the time it takes to load the page.

We use a CSS selector to pick out the headlines, which are wrapped in h3 tags. You can use Chrome DevTools to find the selector for the headlines.

const puppeteer = require('puppeteer');

const scrapeWebsite = async () => {
  let stories = [];
  const browser = await puppeteer.launch({
    headless: true,
    timeout: 20000,
    ignoreHTTPSErrors: true,
    slowMo: 0,
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--window-size=1280,720',
    ],
  });

  try {
    const page = await browser.newPage();

    await page.setViewport({ width: 1280, height: 720 });

    // Intercept requests so that non-essential resources can be blocked
    await page.setRequestInterception(true);

    page.on('request', (interceptedRequest) => {
      const blockResources = ['script', 'stylesheet', 'image', 'media', 'font'];
      if (blockResources.includes(interceptedRequest.resourceType())) {
        interceptedRequest.abort();
      } else {
        interceptedRequest.continue();
      }
    });

    // Identify as a desktop Chrome browser
    await page.setUserAgent(
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
    );

    await page.goto('https://www.nytimes.com/', {
      waitUntil: 'domcontentloaded',
    });

    const storySelector = 'section.story-wrapper h3';

    // Extract the text of the first 10 headlines that match the selector
    stories = await page.$$eval(storySelector, (headings) =>
      headings.slice(0, 10).map((heading, index) => `${index + 1}. ${heading.innerText}`)
    );
  } catch (error) {
    console.log(error);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
  return stories;
};

module.exports = scrapeWebsite;
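
The two decisions the script makes, which requests to block and how to number the headlines, can also be written as small pure functions that can be checked without launching Chrome. The following is only a sketch: shouldBlock and formatHeadlines are hypothetical names that do not appear in the original script, and trimming and skipping empty titles is an addition that the original code does not perform.

```javascript
// Hypothetical helpers for illustration; they are not part of Puppeteer or of pptr.js.

// Resource types that the scraper refuses to download (same list as in pptr.js).
const BLOCKED_RESOURCE_TYPES = new Set(['script', 'stylesheet', 'image', 'media', 'font']);

// Returns true when a request with this resource type should be aborted.
const shouldBlock = (resourceType) => BLOCKED_RESOURCE_TYPES.has(resourceType);

// Trims each title, drops empty ones, keeps the first `limit`, and numbers them from 1.
// Assumption: trimming and dropping empty titles goes beyond what the original script does.
const formatHeadlines = (titles, limit = 10) =>
  titles
    .map((title) => title.trim())
    .filter((title) => title.length > 0)
    .slice(0, limit)
    .map((title, index) => `${index + 1}. ${title}`);

console.log(shouldBlock('image'), shouldBlock('document'));
console.log(formatHeadlines(['  First story ', '', 'Second story']));
```

Inside the request handler, the check could then read shouldBlock(interceptedRequest.resourceType()), which keeps the list of blocked types in one place.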

3. Write the Firebase function

Inside the index.js file, import the scraper function and export it as a Firebase function. We also write a scheduled function that runs every day and calls the scraper function.

It is important to increase the function's memory and timeout limits because Chrome with Puppeteer is resource-heavy.


const functions = require('firebase-functions');
const scrapeWebsite = require('./pptr');

exports.scrape = functions
  .runWith({
    timeoutSeconds: 120,
    memory: '512MB', // raise to '1GB' or '2GB' if Chrome runs out of memory
  })
  .region('us-central1')
  .https.onRequest(async (req, res) => {
    const stories = await scrapeWebsite();
    // Send the headlines as HTML, one per line
    res.type('html').send(stories.join('<br>'));
  });

exports.scrapingSchedule = functions.pubsub
  .schedule('every day 09:00')
  .timeZone('America/New_York')
  .onRun(async (context) => {
    const stories = await scrapeWebsite();
    console.log('The NYT headlines are scraped every day at 9 AM Eastern Time', stories);
    return null;
  });
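
Because scrapeWebsite catches its own errors and returns an empty array, the HTTP function above answers with an empty page when scraping fails. One possible way to make failures visible is sketched below. createScrapeHandler and createFakeResponse are hypothetical helpers, the status codes and messages are assumptions, and the headlines are assumed to be joined with an HTML line break.

```javascript
// Hypothetical factory: wraps any async scrape function in an Express-style handler.
const createScrapeHandler = (scrapeFn) => async (req, res) => {
  try {
    const stories = await scrapeFn();
    if (stories.length === 0) {
      // scrapeWebsite() swallows its errors and returns [], so treat "no headlines" as a failure.
      res.status(502).send('No headlines could be scraped.');
      return;
    }
    res.status(200).type('html').send(stories.join('<br>'));
  } catch (error) {
    // Covers scrape functions that throw instead of returning an empty array.
    console.error(error);
    res.status(500).send('Scraping failed.');
  }
};

// Minimal stand-in for the Express response object, only for trying the handler locally.
const createFakeResponse = () => {
  const res = { statusCode: 200, contentType: null, body: null };
  res.status = (code) => { res.statusCode = code; return res; };
  res.type = (value) => { res.contentType = value; return res; };
  res.send = (body) => { res.body = body; return res; };
  return res;
};

// Try the handler with a stubbed scraper instead of a real browser.
const fakeRes = createFakeResponse();
createScrapeHandler(async () => ['1. First story', '2. Second story'])({}, fakeRes)
  .then(() => console.log(fakeRes.statusCode, fakeRes.body));
```

In index.js, the handler would be passed to the existing chain as .https.onRequest(createScrapeHandler(scrapeWebsite)).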

4. Deploy the function

If you want to test the function locally, run the npm run serve command and open the function endpoint on localhost in your browser. When you are ready to deploy the function to the cloud, run npm run deploy.
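
When the function is served locally, its endpoint normally follows the pattern host, port, project ID, region, function name. The sketch below builds that URL. It assumes the emulator listens on its default port 5001, and my-project-id is a placeholder for your own Firebase project ID; emulatorUrl is a hypothetical helper name.

```javascript
// Builds the local emulator URL for an HTTP function.
// Assumptions: the emulator uses its default port 5001 and the project ID is a placeholder.
const emulatorUrl = ({ projectId, region = 'us-central1', name, port = 5001 }) =>
  `http://localhost:${port}/${projectId}/${region}/${name}`;

const url = emulatorUrl({ projectId: 'my-project-id', name: 'scrape' });
console.log(url);

// With the emulator running (npm run serve), Node 18+ can request the endpoint
// through the built-in fetch:
// fetch(url).then((response) => response.text()).then(console.log);
```

The emulator also prints the exact endpoint in the terminal when it starts, so compare it with the URL built here if the port or region differs.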


5. Test the scheduled function

If you want to test the scheduled function locally, run npm run shell to open an interactive shell in which you can invoke functions manually with test data. Type the function name, scrapingSchedule(), and press Enter to see the function output.

