Scraping
Scraping a web page is the act of turning a web page into the internal data structure that Alfa can work on. It requires rendering the page in a browser; in automated tests this is usually done using browser automation and headless browsers.
The precise steps needed for scraping thus depend on the browser automation tool used. Several packages are provided for integrating with common browser automation tools.
We previously saw how to scrape with browser automation tools. This section covers alternatives for projects that do not already have one of the supported frameworks set up.
Standalone Scraper
The @siteimprove/alfa-scraper package provides a standalone scraper (using Puppeteer internally). It may be a better fit for projects that do not already have browser automation and end-to-end tests in place.
Install the @siteimprove/alfa-scraper package:
npm install --save-dev @siteimprove/alfa-scraper
Then, point the scraper at a live page (this may require setting up a local server for it, or using Node's url.pathToFileURL to point at a local file):
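import { Scraper } from "@siteimprove/alfa-scraper";

// Create the scraper (backed by Puppeteer) before calling scrape on it.
const scraper = await Scraper.of();

// scrape resolves to a result; getUnsafe throws with this message on failure.
const alfaPage = (await scraper.scrape("http://localhost:8080"))
  .getUnsafe("Could not scrape page");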
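When the page is a local HTML file rather than one served over HTTP, the file URL mentioned above can be built with Node's standard library. This is a minimal sketch: the fixtures/page.html path is a hypothetical placeholder, and the block only builds the URL without calling the scraper.

```typescript
import path from "node:path";
import { pathToFileURL } from "node:url";

// Hypothetical location of the HTML file to scrape; adjust to your project.
const htmlFile = path.resolve("fixtures", "page.html");

// pathToFileURL turns an absolute file system path into a file: URL,
// percent-encoding characters that are not allowed in URLs.
const fileUrl = pathToFileURL(htmlFile);

// Print the URL so it can be checked before handing it to the scraper.
console.log(fileUrl.href);
```

The resulting fileUrl.href string can then be passed to scrape in place of "http://localhost:8080".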
Command Line Scraper
The @siteimprove/alfa-cli package provides a scraper usable from the command line (using Puppeteer internally). The resulting scrape must be saved to a local file and then loaded into a page. Thus, it may not be the best option for automated tests, but it can be useful for quick iteration and experimentation.
Install the @siteimprove/alfa-cli and @siteimprove/alfa-web packages:
npm install --save-dev @siteimprove/alfa-cli @siteimprove/alfa-web
Scrape and save a page:
npx alfa scrape -o page.json http://localhost:8080
(use alfa scrape --help for more options)
Then load the page into an Alfa object:
import fs from "node:fs";
import path from "node:path";
import { Page } from "@siteimprove/alfa-web";
const file = "page.json";
const alfaPage = Page.from(
  JSON.parse(fs.readFileSync(path.join(".", file), "utf-8"))
).getUnsafe("Could not parse the page");
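If the scrape file may be missing or truncated (for example, because the command line step failed), a small wrapper can report that more clearly before the content reaches Page.from. This is a sketch only: readScrape is a hypothetical helper, not part of Alfa, and it relies solely on Node's standard library.

```typescript
import fs from "node:fs";
import path from "node:path";

// Hypothetical helper (not part of Alfa): read a scrape saved by the CLI
// and return the parsed JSON, with descriptive errors for common failures.
function readScrape(file: string): unknown {
  const fullPath = path.resolve(file);

  // Fail early with a clear message when the scrape was never written.
  if (!fs.existsSync(fullPath)) {
    throw new Error(`Scrape file not found: ${fullPath}`);
  }

  const text = fs.readFileSync(fullPath, "utf-8");

  try {
    return JSON.parse(text);
  } catch (error) {
    // A truncated or hand-edited file typically ends up here.
    const reason = error instanceof Error ? error.message : String(error);
    throw new Error(`Scrape file is not valid JSON (${fullPath}): ${reason}`);
  }
}
```

The value returned by readScrape("page.json") can then be handed to Page.from as in the snippet above, after a cast to the JSON type that Page.from expects.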
Generic Integration
The @siteimprove/alfa-dom/native package provides a Native.fromNode function that can be used to convert any document object into an Alfa document (and page). It can, notably, be used inside an actual browser (e.g., as part of a script or extension) or injected into a headless browser by whichever means the browser automation tool provides for this.