# hyperlink

A command-line tool to find broken links in your static site.

* **Fast.** [docs.sentry.io](https://github.com/getsentry/sentry-docs) produces 1.1 GB of HTML files. `hyperlink` handles this amount of data in 4 seconds on a MacBook Pro 2018. See [Alternatives](#alternatives) for a performance comparison.
* **Pay for what you need.** By default, `hyperlink` checks for hard 404s in internal links only. Anything beyond that is opt-in. See [Options](#options) for a list of features to enable.
* **Maps errors back to source files.** If your static site was created from Markdown files, `hyperlink` can try to find the original broken link by fuzzy-matching the content around it. See the [`--sources` option](#options).
* Supports traversing file-system paths only, no arbitrary URLs. Hyperlink does not know how to make network calls, but it does have tools to [extract external links](#external-links).
* Does not honor `robots.txt`. A broken link is still broken for users even if it is not indexed by Google.
* Does not parse CSS files, as broken links in CSS have not been a practical concern for us. We are concerned about broken links in the page content, not the chrome around it.
* Only supports UTF-8 encoded HTML files.

## Installation and Usage

[Download the latest binary](https://github.com/untitaker/hyperlink/releases) and:

```bash
# Check a folder of HTML
./hyperlink public/

# Also validate anchors
./hyperlink public/ --check-anchors

# src/ is a folder of Markdown. Show original Markdown file paths in errors
./hyperlink public/ --sources src/
```

### GitHub action

```yaml
- uses: untitaker/hyperlink@0.1.43
  with:
    args: public/ --sources src/
```

### NPM

```bash
npm install -g @untitaker/hyperlink
hyperlink public/ --sources src/
```

### Docker

```bash
docker run -v $PWD:/check ghcr.io/untitaker/hyperlink:0.1.43 /check/public/ --sources /check/src/

# specific commit
docker run -v $PWD:/check ghcr.io/untitaker/hyperlink:sha-82ca78c /check/public/ --sources /check/src
```

[See all available tags](https://github.com/untitaker/hyperlink/pkgs/container/hyperlink)

### From source

```bash
cargo install --locked hyperlink  # latest stable release
cargo install --locked --git https://github.com/untitaker/hyperlink  # latest git SHA
```

## Options

When invoked without options, `hyperlink` only checks for 404s of internal links. However, it can do more.

* `-j/--jobs`: How many threads to spawn for parsing HTML. By default `hyperlink` will attempt to saturate your CPU.
* `--check-anchors`: Opt-in, check the validity of anchors on pages. Broken anchors are considered warnings, meaning that `hyperlink` will `exit 2` if there are *only* broken anchors but no hard 404s.
* `--sources`: A folder of Markdown files that were the input for the HTML `hyperlink` has to check. This is used to provide better error messages that point at the actual file to edit. `hyperlink` does very simple content-based matching to figure out which Markdown files may have been involved in the creation of an HTML file.

  Why not just crawl and validate links in Markdown at this point? Answer:

  * There are countless proprietary extensions to Markdown out there for creating intra-page links that are generally not supported by link-checking tools.
  * The structure of your Markdown content does not necessarily match the structure of your HTML (i.e. what the user actually sees).

  With this setup, `hyperlink` does not have to assume anything about your build pipeline.

* `--github-actions`: Emit [GitHub actions errors](https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-commands-for-github-actions#setting-an-error-message), i.e. add error messages inline to PR diffs. This is only useful with `--sources` set. If you are using `hyperlink` through the GitHub action, this option is already set; it is only needed if you download/build and run hyperlink yourself in CI.

## Exit codes

* `exit 1`: There have been errors (hard 404s)
* `exit 2`: There have been only warnings (broken anchors)
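A minimal sketch of a CI step that consumes these exit codes, failing the build on hard 404s while letting broken anchors through as warnings (the folder names `public/` and `src/` are illustrative):

```bash
#!/usr/bin/env bash
# Minimal sketch: fail CI on hard 404s (exit 1) but treat broken anchors
# (exit 2) as non-blocking warnings. Paths are illustrative.
status=0
./hyperlink public/ --check-anchors --sources src/ || status=$?

case "$status" in
  0) echo "all links ok" ;;
  2) echo "warning: broken anchors found; not failing the build" >&2 ;;
  *) echo "error: broken links found (exit $status)" >&2; exit "$status" ;;
esac
```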
## External links

Hyperlink does not know how to check external links, but it gives you some tools to extract them.

```bash
hyperlink dump-external-links build/
# http://example.com/myurl
# ...
```

This allows you to plug in your own logic that fits the requirements for your site (special handling for social networks, custom URI schemes, ...):

```bash
# filter for HTTP URLs and turn off all link-checking for our social media
# handles, as twitter.com is unreliable and we already know those links are correct.
hyperlink dump-external-links build/ | \
  rg '^https?://' | \
  rg -v '^https://twitter.com/untitaker' | \
  xargs -P20 -I{} bash -c 'curl -ILf "{}" &> /dev/null || (echo "{}" && exit 1)'
```

...and allows hyperlink to focus on its main job of traversing and parsing HTML.
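Before writing custom checking logic, it can help to see where outbound links actually point. A small sketch, built on the same standard tools as the example above, that tallies external links by host (the `build/` folder is illustrative):

```bash
# Sketch: count external links per host to decide which domains are worth
# checking or excluding. Assumes ripgrep (rg), sed, and coreutils.
hyperlink dump-external-links build/ \
  | rg '^https?://' \
  | sed -E 's#^https?://([^/]+).*#\1#' \
  | sort | uniq -c | sort -rn
```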
## Alternatives

*(Roughly ranked by performance, as determined by some unserious benchmark. This section contains partially dated measurements and is not continuously updated with regard to either performance or featureset.)*

None of the listed alternatives have an equivalent to `hyperlink`'s `--sources` and `--github-actions` features.

* [lychee](https://github.com/lycheeverse/lychee), like `hyperlink`, is a great choice for obscenely large static sites. Additionally it can check external/outbound links. An invocation of `lychee --offline public/` is more or less equivalent to `hyperlink public/`.
* [liche](https://github.com/raviqqe/liche) seems to be fairly fast, but is unmaintained.
* [htmltest](https://github.com/wjdp/htmltest) seems to be fairly fast as well, and is more of a general-purpose HTML linting tool.
* [muffet](https://github.com/raviqqe/muffet) seems to have similar performance to `htmltest`. We tested `muffet` with [`http-server`](https://www.npmjs.com/package/http-server) and webfsd without noticing a change in timings.
* [linkcheck](https://github.com/filiph/linkcheck) is faster than `linkchecker` but still quite slow on large sites. We tried `linkcheck` together with [`http-server`](https://www.npmjs.com/package/http-server) on localhost, although that does not seem to be the bottleneck at all.
* [wummel/linkchecker](https://wummel.github.io/linkchecker/) seems to be fairly feature-rich, but was a non-starter due to performance. The same applies to countless other link checkers we tried that are not mentioned here.

## Testimonials

> We use Hyperlink to check for dead links on
> [Graphviz's static-site user documentation](https://graphviz.org/), because:
>
> * Hyperlink is *blazingly* fast, checking 700 HTML pages in 220ms (default) and
>   850ms (with `--check-anchors`).
> * Hyperlink's single-binary release, with no library dependencies,
>   was trivial to integrate into our [continuous integration tests](https://gitlab.com/graphviz/graphviz.gitlab.io/-/blob/5dcfa637b7df17e3a1b821f3d7e9de8f5f82544b/.gitlab-ci.yml#L27).
> * High coverage: Hyperlink immediately spotted over a thousand broken page
>   links within both `<a href>` tags and HTML redirects, and a further 62 broken
>   anchor links with `--check-anchors`.
> * Hyperlink's design decision to crawl only static files (avoiding HTTP)
>   avoids test flakiness from network requests, allowing me to confidently
>   block merging if Hyperlink reports an error.
>
> In conclusion, Hyperlink fills the "static site continuous testing" niche
> really nicely.

-- Mark Hansen, Graphviz documentation maintainer

## License

Licensed under the MIT license, see [`./LICENSE`](./LICENSE).