Scraping

Scraping is the act of turning a web page into the internal data structure that Alfa works on. It requires rendering the page in a browser; in automated tests, this is usually done with browser automation and a headless browser.

The precise steps needed for scraping thus depend on the browser automation tool in use. Several packages are provided for integrating with common browser automation tools.

We previously saw how to scrape with browser automation tools. We now go over alternatives for projects that do not already have one of the supported frameworks set up.

Standalone Scraper

The @siteimprove/alfa-scraper package provides a standalone scraper (using Puppeteer internally). It may be a better fit for projects that do not already have browser automation and end-to-end tests in place.

Install the @siteimprove/alfa-scraper package:

npm install --save-dev @siteimprove/alfa-scraper

Then, point the scraper at a live page. This may require serving the page from a local server, or using Node's url.pathToFileURL to build a file URL for a local file:

import { Scraper } from "@siteimprove/alfa-scraper";

const alfaPage = await Scraper.of()
  .scrape("http://localhost:8080")
  .then((result) => result.getUnsafe("Could not scrape page"));
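
The choice between a served page and a local file can be folded into a small helper. The following is a minimal sketch, not part of Alfa: toScrapeTarget is a hypothetical name, and it assumes that anything without an http, https, or file scheme is a path on the local disk. It only builds the URL string; the scraper call itself stays the same as above.

```typescript
import * as path from "node:path";
import { pathToFileURL } from "node:url";

/**
 * Hypothetical helper (not part of Alfa): turn either a web address or a
 * local file path into a URL string that can be handed to the scraper.
 */
function toScrapeTarget(target: string): string {
  // Addresses that already carry a supported scheme are passed through as-is.
  if (/^(https?|file):\/\//i.test(target)) {
    return target;
  }

  // Anything else is assumed to be a file path: resolve it against the
  // current working directory and convert it to a file:// URL.
  return pathToFileURL(path.resolve(target)).href;
}

// Intended use, mirroring the snippet above (the path is illustrative):
//   const alfaPage = await Scraper.of()
//     .scrape(toScrapeTarget("./fixtures/page.html"))
//     .then((result) => result.getUnsafe("Could not scrape page"));
```

Resolving the path before converting it keeps the resulting URL absolute, so the scrape does not depend on where the browser process happens to start.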
  

Command Line Scraper

The @siteimprove/alfa-cli package provides a scraper usable from the command line (using Puppeteer internally). The resulting scrape must be saved to a local file and then loaded into a page. It may therefore not be the best option for automated tests, but it can be useful for quick iteration and for experimenting with the process.

Install the @siteimprove/alfa-cli and @siteimprove/alfa-web packages:

npm install --save-dev @siteimprove/alfa-cli @siteimprove/alfa-web

Scrape and save a page:

npx alfa scrape -o page.json http://localhost:8080

(use alfa scrape --help for more options)

Then load the page into an Alfa object:

import fs from "node:fs";
import path from "node:path";
import { Page } from "@siteimprove/alfa-web";

const file = "page.json";
const alfaPage = Page.from(
  JSON.parse(fs.readFileSync(path.join(".", file), "utf-8"))
).getUnsafe("Could not parse the page");
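
Because the scrape goes through a file on disk, a missing or truncated file is a likely failure when iterating. The following is a minimal sketch of a more defensive read, assuming only Node's standard library; readScrapeJson is a hypothetical helper and not part of Alfa.

```typescript
import * as fs from "node:fs";

/**
 * Hypothetical helper (not part of Alfa): read a scrape saved by the CLI and
 * return the parsed JSON, failing with a descriptive message otherwise.
 */
function readScrapeJson(file: string): unknown {
  let text: string;

  // Report unreadable files (missing, wrong permissions) with the file name.
  try {
    text = fs.readFileSync(file, "utf-8");
  } catch (error) {
    throw new Error(`Could not read scrape file "${file}": ${String(error)}`);
  }

  // Report malformed content separately, so the two failures are easy to tell apart.
  try {
    return JSON.parse(text);
  } catch (error) {
    throw new Error(`Scrape file "${file}" is not valid JSON: ${String(error)}`);
  }
}
```

The returned value can then be passed to Page.from in place of the inline JSON.parse call; since the helper returns unknown, a type assertion to whatever Page.from expects may be needed.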

Generic Integration

The @siteimprove/alfa-dom/native module provides a Native.fromNode function that can be used to convert any DOM document object into an Alfa document (and page). Notably, it can be used inside an actual browser (e.g., as part of a script or extension) or injected into a headless browser by whatever means the browser automation tool provides.