Skip to content

Run headless Chrome/Chromium on AWS Lambda (maybe Azure, & GCP later)

License

Notifications You must be signed in to change notification settings

mario589736/serverless-chrome

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

92 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

serverless-chrome

Serverless Chrome contains everything you need to get started running headless Chrome on AWS Lambda (possibly Azure and GCP Functions soon).

The aim of this project is to provide the scaffolding for using Headless Chrome during a serverless function invocation. Serverless Chrome takes care of building and bundling the Chrome binaries and making sure Chrome is running when your serverless function executes. In addition, this project also provides a few "example" handlers for common patterns (e.g. taking a screenshot of a page, printing to PDF, some scraping, etc.)

Why? Because it's neat. It also opens up interesting possibilities for using the Chrome DevTools Protocol in serverless architectures.

Breaking Changes coming up!: Active development is happening in the develop branch. v1.0 introduces a framework-agnostic package for running Chrome on AWS Lambda. Try the pre-release with yarn add @serverless-chrome/lambda. More info is available here. There's also a Serverless-framework plugin here.

Please be sure to raise PRs against the develop branch.

Contents

  1. What is it?
  2. Installation
  3. Setup
  4. Testing
  5. Configuration and Deployment
  6. Known Issues / Limitations
  7. Roadmap
  8. Troubleshooting
  9. Projects & Companies using serverless-chrome
  10. Change log
  11. Prior Art

Installation

Installation can be achieved with the following commands

git clone https://github.com/adieuadieu/serverless-chrome
cd serverless-chrome
yarn install

(It is possible to exchange yarn for npm if yarn is too hipster for you. No problem.)

Or, if you have serverless installed globally:

serverless install -u https://github.com/adieuadieu/serverless-chrome

Setup

Credentials

You must configure your AWS credentials either by defining AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environmental variables, or using an AWS profile. You can read more about this on the Serverless Credentials Guide.

In short, either:

export AWS_PROFILE=<your-profile-name>

or

export AWS_ACCESS_KEY_ID=<your-key-here>
export AWS_SECRET_ACCESS_KEY=<your-secret-key-here>

Testing

Test with yarn test or just yarn ava to skip the linter.

Setup and Deployment

yarn deploy

This package bundles a lambda-execution-environment-ready headless Chrome binary which allows you to deploy from any OS. The current build is:

  • Browser: HeadlessChrome/60.0.3095.0
  • Protocol-Version: 1.2
  • User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/60.0.3095.0 Safari/537.36
  • V8-Version: 6.0.184
  • WebKit-Version: 537.36 (@947514553066c623a85712d05c3a01bd1bcbbffc)

Configuration

You can override default configuration in the /config.js file generated at the root of the project after a yarn install. See the defaults in src/config.js for a full list of configuration options.

Example Handlers

Currently there are only two, very basic "proof of concept" type functions:

captureScreenshot: Capture Screenshot of a given URL

When you the serverless function, it creates a Lambda function which will take a screenshot of a URL it's provided. You can provide this URL to the Lambda function via the AWS API Gateway. After a successful deploy, an API endpoint will be provided. Use this URL to call the Lambda function with a url in the query string. E.g. https://XXXXXXX.execute-api.us-west-2.amazonaws.com/dev/chrome?url=https://google.com/

We're using API Gateway as our method to execute the function, but of course it's possible to use any other available triggers to kick things off be it an event from S3, SNS, DynamoDB, etc. TODO: explain how --^

/config.js

import captureScreenshot from './src/handlers/captureScreenshot'

export default {
  handler: captureScreenshot
}
printToPdf: Print a given URL to PDF

The printToPdf handler will create a PDF from a URL it's provided. You can provide this URL to the Lambda function via the AWS API Gateway. After a successful deploy, an API endpoint will be provided. Use this URL to call the Lambda function with a url in the query string. E.g. https://XXXXXXX.execute-api.us-west-2.amazonaws.com/dev/chrome?url=https://google.com/

We're using API Gateway as our method to execute the function, but of course it's possible to use any other available triggers to kick things off be it an event from S3, SNS, DynamoDB, etc. TODO: explain how --^

This handler also supports configuring the "paper" size, orientation, etc. You can pass any of the DevTools Protocol's Page.printToPdf() method's parameters. For example, for landscape oriented PDF add &landscape=true to the end of the URL. Be sure to remember to escape the value of url if it contains query parameters. E.g. https://XXXXXXX.execute-api.us-west-2.amazonaws.com/dev/chrome?url=https://google.com/&landscape=true

/config.js

import printToPdf from './src/handlers/printToPdf'

export default {
  handler: printToPdf
}

Custom Handlers

You can provide your own handler via the /config.js file created when you initialize the project with yarn install. The config accepts a handler property. Pass it a function which returns a Promise when complete. For example:

/config.js

export default {
  handler: async function(invocationEventData, executionContext) {
    const { queryStringParameters: { url } } = invocationEventData
    const stuff = await doSomethingWith(url)
    return stuff
  }
}

The first parameter, invocationEventData, is the event data with which the Lambda function is invoked. It's the first parameter provided by Lambda. The second, executionContext is the second parameter provided to the Lambda function which contains useful runtime information.

serverless-chrome calls the Lambda handlers callback() for you when your handler function completes. The result of your handler is passed to callback with callback(null, yourHandlerResult). If your handler throws an error, callback is called with callback(yourHandlerError).

For example, to create a handler which returns the version info of the Chrome DevTools Protocol, you could modify /config.js to:

import Cdp from 'chrome-remote-interface'

export default {
  async handler (event) {

    const versionInfo = await Cdp.Version()

    return {
      statusCode: 200,
      body: JSON.stringify({
        versionInfo,
      }),
      headers: {
        'Content-Type': 'application/json',
      },
    }
  },
}

To capture all of the Network Request events made when loading a URL, you could modify /config.js to something like:

import Cdp from 'chrome-remote-interface'
import { sleep } from './src/utils'

const LOAD_TIMEOUT = 1000 * 30

export default {
  async handler (event) {
    const requestsMade = []
    let loaded = false

    const loading = async (startTime = Date.now()) => {
      if (!loaded && Date.now() - startTime < LOAD_TIMEOUT) {
        await sleep(100)
        await loading(startTime)
      }
    }

    const [tab] = await Cdp.List()
    const client = await Cdp({ host: '127.0.0.1', target: tab })

    const { Network, Page } = client

    Network.requestWillBeSent(params => requestsMade.push(params))

    Page.loadEventFired(() => {
      loaded = true
    })

    // https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-enable
    await Network.enable()

    // https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-enable
    await Page.enable()

    // https://chromedevtools.github.io/devtools-protocol/tot/Page/#method-navigate
    await Page.navigate({ url: 'https://www.chromium.org/' })

    // wait until page is done loading, or timeout
    await loading()

    // It's important that we close the websocket connection,
    // or our Lambda function will not exit properly
    await client.close()

    return {
      statusCode: 200,
      body: JSON.stringify({
        requestsMade,
      }),
      headers: {
        'Content-Type': 'application/json',
      },
    }
  },
}

See src/handlers for more examples.

TODO: talk about CDP and chrome-remote-interface

Known Issues / Limitations

  1. hack to chrome code to disable /dev/shm.
  2. /tmp size on Lambda
  3. it might not be the most cost efficient to do this on Lambda vs. EC2

Roadmap

1.0

  1. Don't force the use of Serverless-framework. See Issue #4
    1. Refactor the headless Chrome bundle and Chrome spawning code into an npm package
    2. Create a Serverless plugin, using above npm package
  2. OMG OMG Get unit tests up to snuff!
  3. Example serverless services using headless-chrome
    1. Printing a URL to a PDF
    2. Loading a page and taking a screenshot, with options on viewport size and device settings
    3. DOM manipulation and scraping

Future

  1. Support for Google Cloud Functions
  2. Support for Azure Functions?
  3. Example handler with nightmarejs (if this is even possible?)

Troubleshooting

I keep getting a timeout error when deploying and it's really annoying.

Indeed, that is annoying. I've had the same problem, and so that's why it's now here in this troubleshooting section. This may be an issue in the underlying AWS SDK when using a slower Internet connection. Try changing the AWS_CLIENT_TIMEOUT environment variable to a higher value. For example, in your command prompt enter the following and try deploying again:

export AWS_CLIENT_TIMEOUT=3000000
Aaaaaarggghhhhhh!!!

Uuurrrggghhhhhh! Have you tried filing an Issue?

Projects & Companies using serverless-chrome

Tell us about your project on the Wiki!

Change log

See the CHANGELOG

Prior Art

This project was inspired in various ways by the following projects:

CircleCI Coveralls Codacy grade David David

About

Run headless Chrome/Chromium on AWS Lambda (maybe Azure, & GCP later)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • JavaScript 100.0%