Scraping

Scraping a web page involves converting it into the internal data structure that Alfa can work on. This process requires rendering the page in a browser. In automated tests this is usually done using browser automation and headless browsers.

The precise steps needed for scraping thus depend on the specific browser automation tool used. Several packages are provided for integrating with common browser automation tools.

We previously covered how to scrape with browser automation tools. Now let's go over alternative solutions if you do not have one of the supported frameworks already set up in your project.

Standalone scraper

The @siteimprove/alfa-scraper package provides a standalone solution that uses Puppeteer internally. It might be a better fit for projects that do not already have browser automation and end-to-end tests in place.

Install the @siteimprove/alfa-scraper package:

npm install --save-dev @siteimprove/alfa-scraper

Then, point the scraper at a live page. This may require setting up a local server for it or using Node's url.pathToFileURL to point at a local file:

import { Scraper } from "@siteimprove/alfa-scraper";

const alfaPage = await Scraper.of()
  .scrape("http://localhost:8080")
  .then((result) => result.getUnsafe("Could not scrape page"));
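
For the local-file case mentioned above, the following is a minimal sketch of building a file URL with Node's url.pathToFileURL. The helper name toFileUrl and the path fixtures/page.html are hypothetical, chosen only for illustration.

```typescript
import * as path from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical helper: turn a path to a local HTML file into a file:// URL string.
function toFileUrl(file: string): string {
  // Resolve against the current working directory so relative paths work too.
  return pathToFileURL(path.resolve(file)).href;
}

// "fixtures/page.html" is a made-up path used only for this example.
const target = toFileUrl("fixtures/page.html");
console.log(target);
```

The resulting string should be usable in place of "http://localhost:8080" in the scrape call above, assuming the scraper is allowed to open file URLs in your environment.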
  

Command line scraper

The @siteimprove/alfa-cli package provides a scraper that can be used from the command line, using Puppeteer internally. The resulting scrape must be saved to a local file and then loaded into a page. Therefore, it may not be the best option for automated tests but can be useful for quick iteration and experimentation.

Install the @siteimprove/alfa-cli and @siteimprove/alfa-web packages:

npm install --save-dev @siteimprove/alfa-cli @siteimprove/alfa-web

Scrape and save a page:

npx alfa scrape -o page.json http://localhost:8080

(Use alfa scrape --help to see more options.)

Then load the page into an Alfa object:

import fs from "node:fs";
import path from "node:path";
import { Page } from "@siteimprove/alfa-web";

const file = "page.json";
const alfaPage = Page.from(
  JSON.parse(fs.readFileSync(path.join(".", file), "utf-8"))
).getUnsafe("Could not parse the page");
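
When the saved scrape is produced by a separate command, it can help to fail with a clear message if the file is missing or truncated. The sketch below uses only Node's standard library; the helper names readScrape and parseScrape are hypothetical and not part of Alfa.

```typescript
import * as fs from "node:fs";

// Hypothetical helper: parse the text of a saved scrape, with a descriptive error.
// `label` is only used in the error message (typically the file name).
function parseScrape(text: string, label: string): unknown {
  try {
    return JSON.parse(text);
  } catch (error) {
    throw new Error(`Scrape "${label}" is not valid JSON: ${String(error)}`);
  }
}

// Hypothetical helper: read a saved scrape from disk and parse it.
function readScrape(file: string): unknown {
  let text: string;
  try {
    text = fs.readFileSync(file, "utf-8");
  } catch (error) {
    throw new Error(`Could not read scrape file "${file}": ${String(error)}`);
  }
  return parseScrape(text, file);
}
```

The value returned by readScrape("page.json") could then be handed to Page.from as in the snippet above; in TypeScript this likely needs a cast to the JSON type that Page.from expects.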

Generic integration

The @siteimprove/alfa-dom/native package provides a Native.fromNode function that can convert any document object into an Alfa document (and page). Notably, it can be used inside an actual browser, as part of a script or extension, or injected into a headless browser by whichever means the browser automation tool provides.