---
title: Supply Chain Security for Version Control Systems
abbrev: Supply Chain Security for VCSs
docname: draft-nhw-openpgp-supply-chain-security-vcs-00
date: 2023-06-20
category: info
submissiontype: independent
ipr: trust200902
area: int
workgroup: openpgp
keyword: Internet-Draft
stand_alone: yes
pi: [toc, sortrefs, symrefs]
venue:
  group: "OpenPGP"
  type: "Working Group"
  mail: "openpgp@ietf.org"
  arch: "https://mailarchive.ietf.org/arch/browse/openpgp/"
  repo: "https://gitlab.com/sequoia-pgp/sequoia-git"
  latest: "https://sequoia-pgp.gitlab.io/sequoia-git/"
author:
  - ins: N.H. Walfield
    name: Neal H. Walfield
    org: Sequoia PGP
    email: neal@sequoia-pgp.org
  - ins: J. Winter
    name: Justus Winter
    org: Sequoia PGP
    email: justus@sequoia-pgp.org
normative:
  RFC2119:
  RFC4880:
  RFC8174:
  toml:
    author:
      - ins: T. Preston Werner
        name: Tom Preston-Werner
      - ins: P. Gedam
        name: Pradyun Gedam
    title: TOML v1.0.0
    date: 2021-01-12
    target: https://toml.io/en/v1.0.0
informative:
  event-stream:
    author:
      - ins: T. Hunter
        name: Thomas Hunter II
    title: "Compromised npm Package: event-stream"
    date: 2018-11-27
    target: https://medium.com/intrinsic-blog/compromised-npm-package-event-stream-d47d08605502
  dependency-confusion:
    author:
      - ins: A. Birsan
        name: Alex Birsan
    title: "Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies"
    date: 2021-02-09
    target: https://medium.com/@alex.birsan/dependency-confusion-4a5d60fec610
  reflections-on-trusting-trust: DOI.10.1145/358198.358210
  guix:
    author:
      - ins: L. Courtès
        name: Ludovic Courtès
    title: Building a Secure Software Supply Chain with GNU Guix
    date: 2022-06
    doi: 10.48550/arXiv.2206.14606
    target: https://arxiv.org/abs/2206.14606

--- abstract

In a software supply chain attack, an attacker injects malicious code into some software, which they then leverage to compromise systems that depend on that software.
A simple example of a supply chain attack is when SourceForge, a once popular open source software forge, injected advertising into the binaries that they delivered on behalf of the projects that they hosted. Software supply chain attacks are different from normal bugs in that the intent of the perpetrator is different: in the former case, malicious code is added with the intent to harm; in the latter, bugs are introduced inadvertently, or due to negligence.

Software supply chain security starts on a developer's machine. By signing a commit or a tag, a developer can assert that they wrote or approved the change. This allows users of a code base to determine whether a version has been approved, and by whom, and then make a policy decision based on that information. For instance, a packager may require that software releases be signed with a particular certificate.

Version control systems such as git have long included support for signed commits and tags. Most developers, however, don't sign their commits, and in the cases where they do, it is usually unclear what the semantics are. This document describes a set of semantics for signed commits and tags, and a framework to work with them in a version control system, in particular, in a git repository. The framework is designed to be self-contained. That is, given a repository, it is possible to add changes, or to authenticate a version, without consulting any third parties; all of the relevant information is stored in the repository itself.

By publishing this draft we hope to clarify and enrich the semantics of signing in version control system repositories, thereby enabling a new tooling ecosystem, which can strengthen software supply chain security.
--- middle

# Introduction

## Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 {{RFC2119}} {{RFC8174}} when, and only when, they appear in all capitals, as shown here.

## Terminology

- "Maintainer" is a software developer who is responsible for a software project in the sense that they act as a gatekeeper, and decide with the other maintainers what changes are acceptable, and should be added to the software.

- "Contributor" is someone who contributes changes to a software project. Unlike a maintainer, a contributor cannot add their changes to a project of their own accord.

- "Software supply chain" is the collection of software that something depends on. For instance, a software package depends on libraries, it is built by a compiler, it is distributed by a package registry, etc.

- "Software supply chain attack" is an attack in which an attacker compromises a software supply chain. For instance, a maintainer or a contributor may stealthily insert malicious code into a software project in order to compromise the security of a system that depends on that software.

- "Version control system" is a database which contains versions of a software project. Each version includes links to preceding versions.

- "git" is a popular version control system. Although git is distributed and does not rely on a central authority, it is often used with one to simplify collaboration. Examples of centralized authorities include gitea, GitHub, and GitLab.

- "Commit" is a version that is added to the version control system. In git, commits are identified by their message digest.

- "Branch" is a typically human-readable name given to a particular commit. When a commit is superseded, the branch is updated to point to the new commit. Repositories normally have at least one branch called "main" or "master" where most work is done.

- "Tag" is a name given to a particular commit. Tags are usually only added for significant versions like releases, and are normally not changed once published.

- "Change" is a commit or a tag.

- "Forge" is a service which hosts software repositories, and often provides additional services like a bug tracker. Examples of forges are Codeberg, GitHub, and GitLab.

- "Registry" or "Package Registry" is a service that provides an index of software packages. Maintainers register their software there under a well-known name. Build tools like `cargo` fetch dependencies by looking up the software by its name.

- "Authentication" is the process of determining whether something should be considered authentic.

- "Trust model" is a process for determining what evidence to consider, and how to weigh it, when doing authentication.

- "OpenPGP certificate" or just "certificate" is the data structure that section 11.2 of {{RFC4880}} defines as a "Transferable Public Key". A certificate is sometimes called a key, but this is confusing, because a certificate contains components that are also called keys.

- "Liveness" is a property of a certificate, a signature, etc. An object is considered live with respect to some reference time if, as of the reference time, its creation time is in the past, and it has not expired.

# Problem Statement

Consider the following scenario. Alice and Bob are developers. They are the primary maintainers of the Xyzzy project, which is a free and open source project. Although they do most of the work on the project, they also have occasional collaborators like Carol, and drive-by contributions from people like Dave. Paul packages their software for an operating system distribution. Ted from Ty Coon Corporation integrates it into his company's software. And Mallory is an adversary who is trying to subvert the project.
When someone updates their local copy of Xyzzy's source code repository, they want to authenticate any changes before they use them. That is, they want to know that each change was made or approved by someone whom they consider authorized to make that change. In the Xyzzy project, Alice is willing to rely on Bob to check in changes he makes, and to approve contributions from third parties, without auditing the code herself. But she doesn't want to rely on anyone else without checking their proposed changes manually. Bob feels the same way about Alice.

In version control systems like `git`, the metadata for a commit or tag includes `author` and `committer` fields. By themselves, these fields cannot be used to reliably determine who a change's author and committer are, because they are set by the committer and are unauthenticated. That is, Mallory could author a commit, set both of these fields to "Bob", and push the malicious commit. No one would be able to tell that it came from Mallory and not Bob.

There are two main ways to authenticate changes. First, changes to a repository or branch can be mediated by a trusted third party, which enforces a policy at the time a change is added to the repository. Second, individual changes can be signed, and a policy can be evaluated at any time. These two approaches can be mixed.

## Repositories Protected by a Trusted Third Party

When using a trusted third party, only certain users are allowed to change the repository. This is often realized using access control lists: the trusted third party has a list of users who are allowed to make certain types of modifications. Before the trusted third party allows a user to modify the repository, the user has to authenticate themselves. When they attempt to make a change, the trusted third party checks that they are authorized. If they are, the third party allows the modification. If not, it is rejected.
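As a non-normative illustration, the mediation flow can be sketched as follows. The credential store, the ACL structure, and the `try_push` function are hypothetical stand-ins; a real forge's internals look different:

```python
# Sketch of trusted-third-party mediation: first authenticate the user,
# then check an access control list before accepting the change.
# All names and data structures here are hypothetical stand-ins.

USERS = {"alice": "alice-token", "bob": "bob-token"}  # credential store
ACL = {"xyzzy": {"push": {"alice", "bob"}}}           # per-repository ACL

def try_push(repo, user, token, action="push"):
    """Return True if the modification is allowed, False if rejected."""
    if USERS.get(user) != token:  # authentication: verify the credential
        return False
    # authorization: check the ACL for this repository and action
    return user in ACL.get(repo, {}).get(action, set())
```

In this toy model, a push by Alice with her credential is accepted, while a push by Mallory, who is not on the ACL, is rejected.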
A user of this repository can now conclude that if they can authenticate the trusted third party, then the changes were approved.

A drawback of using a trusted third party is that it relies on centralized infrastructure. This means the only way for a user to determine whether a version of Xyzzy is authentic is to fetch it from the trusted third party; the repository is not self-authenticating. If the third party ever disappears, users will no longer be able to authenticate the project's source code.

Another disadvantage is that this approach doesn't expose the project's policy to its users. This means that both first parties like Alice and third parties like Paul are not able to audit the trusted third party. This is the case even if the set of users that are currently authorized to make changes is exposed via a separate API endpoint: because the set of authorized users changes with time, all updates to the ACLs would need to be exposed, along with information about which user authorized each change.

## Self-Authenticating Repositories

An alternative approach is to have authors and committers sign their changes. Users then check that the changes are signed correctly, and authenticate the signers. For instance, for the Xyzzy project, Paul might decide that Alice or Bob are allowed to make changes. So when Paul fetches changes, he checks whether Alice or Bob signed the new changes, and flags changes made by anyone else. If Alice and Bob later decide that Carol should also be allowed to directly commit her changes, Paul needs to update his policy. If Bob leaves the team, Paul needs to pay enough attention to notice, and then disallow changes made by Bob after a certain date. For projects that sign their commits today, this is more or less the status quo. Most users, however, do not want to maintain their own policy, and aren't even in a good position to do so.
Since users are willing to rely on the maintainers to make changes to the project, they can just as well delegate the policy to them. Now, a user like Paul just needs to designate an initial policy. If he knows when the policy changes, and can authenticate changes to the policy based on the existing policy, then he is able to authenticate any subsequent changes to the repository.

An easy way to manage the policy is to include it in the repository itself. Then changes to the policy can be authenticated in the same way as normal changes. This also makes the repository self-authenticating, because it is self-contained.

One issue is how users should handle forks of a project. A fork may occur due to a social or technical conflict, or because the project dies, and is later revived by a different party. In both cases, it may not be possible for there to be a clean hand-off to the new maintainer. That is, Alice or Bob may not be willing or able to change the policy file to allow Dave to seamlessly continue the development of Xyzzy. Forks are straightforward to handle, but require user intervention: from the system's perspective, Dave is not authorized, so his changes are rejected. And that's good, as Dave may be an attacker; the system can't tell. Users opt in to a fork by changing their trust root to designate a version in which Dave is authorized to make changes.

# Threat Model

Consider an attacker, Mallory, who is trying to compromise a user, Ursula, by injecting a vulnerability into the software supply chain of a piece of software, Super Frob, that she uses. There are several different ways that Mallory could accomplish this. These include:

- Mallory could pose as a contributor, and convince a developer to authorize a malicious change to one of Super Frob's dependencies, such as a library.

- Mallory could take over an abandoned package that Super Frob depends on, and publish a new version with malicious code.
- Mallory could use typosquatting to inject malicious software into Super Frob's supply chain, either opportunistically or through social engineering. For instance, Mallory could publish a library called `libevent`, which is a copy of `libevents` that includes a malicious change, and Super Frob accidentally includes `libevent` as a dependency instead of `libevents`.

- Mallory could publish a malicious package that has the same name as a package on another registry in order to confuse Super Frob's build tools. This type of attack is called a dependency confusion attack {{dependency-confusion}}. It can be launched when an organization uses both an internal registry and a public registry to find dependencies. As dependencies are often referenced by name, and that name does not include the registry, an attacker may trick the organization into using their malicious version of the package.

- Mallory could sneak a change into one of Super Frob's build dependencies, like the compiler. Whereas software maintainers have a large degree of control over their direct dependencies, they have more limited control over the tools downstream users use to build their software. In the extreme, a software project may include a copy of a dependency in their version control system, or depend on a specific version of a dependency by cryptographic hash, but only specify a standard that the compiler needs to support, like C99. This attack is best known from Ken Thompson's Reflections on Trusting Trust Turing Award lecture {{reflections-on-trusting-trust}}.

- Mallory could compromise the tools that a developer uses, e.g., by publishing a useful, but malicious, plug-in for an editor, which detects certain code patterns, and quietly modifies them to insert malicious code.

- Mallory could compromise the systems that the developers use, and modify their source code repositories. For instance, if Mallory gets access to a developer's machine, he could stealthily modify code before it is signed and committed. Or, he could exfiltrate the developer's signing key or login credentials, and imitate her. Similarly, if a software project uses a forge and Mallory is able to compromise the forge, he could modify the source code.

- Mallory could compromise Super Frob or one of its dependencies as it is being downloaded. For instance, if a package registry like `crates.io` depends on a content delivery network (CDN) to distribute packages, a compromised node in the CDN may return a modified version of the software to the user.

The setting is as follows. To protect herself from Mallory, Ursula has to make sure that versions of the software she obtains do not contain malicious code. Ursula cannot afford to audit every version of the software, but she is willing to rely on the maintainers of the project not to add malicious code, and to review contributions from third parties.

The framework presented in this specification allows Ursula to audit a dependency and its developers once, and then to delegate decisions about what code and dependencies to include to the developers. Assuming the developers are reliable, this can protect Ursula from attacks where Mallory is not explicitly authorized to make a change. For instance, if the developers of an abandoned software package do not authorize a new maintainer, Ursula will be warned when the package has a new maintainer, as she can no longer authenticate it. She can then reaudit it. Similarly, when the software is modified in transit by a machine in the middle, Ursula will not be able to authenticate it. This can also stop dependency confusion attacks, because the software cannot be authenticated. It won't, however, stop a downgrade attack, as older versions can still be authenticated.

This framework cannot protect Ursula from mistakes that she or a developer of the software that she depends on makes.
For instance, if Mallory is able to convince a developer to authorize a malicious change to their software, this framework considers the change to be legitimate. This framework can, however, facilitate forensic analysis in these cases by making it easier to identify changes approved by the same person (potentially across different projects), and thereby conduct a targeted audit.

# Authentication

This framework helps users authenticate three types of artifacts: commits, tags, and tarballs or other archives.

## Policy

Every commit has an associated policy. If a commit contains the file `openpgp-policy.toml` in its root directory, then that file describes the commit's policy. If the commit does not contain that file, the void policy is used. The void policy rejects everything.

`openpgp-policy.toml` is a TOML v1.0.0 file {{toml}}. Version 0 defines the following three top-level keys: `version`, `authorization`, and `commit_goodlist`. If a parser recognizes the version, but encounters keys that it does not know, then it must ignore the unknown keys. This allows a degree of forwards compatibility.

### version

The value of the `version` key is an integer and must be `0`:

```toml
version = 0
```

If the value of `version` is not recognized, the implementation SHOULD error out. It MAY instead treat the policy as the void policy.

### authorization

`authorization` is a table of authorization entries. Each key in the `authorization` table is a free-form identifier, which is chosen by the user of the system. The identifier SHOULD be a UTF-8 encoded, human-readable string that identifies an entity. Examples of identifiers are `alice`, `Bob `, and `Boty McBotface `.

The value of each authorization entry is another table. The table has the following entries:

- `keyring`
- `sign_commit`
- `sign_tag`
- `sign_archive`
- `audit`
- `add_user`
- `retire_user`

#### keyring

The value of `keyring` is a string. It contains one or more OpenPGP certificates. The OpenPGP certificates MUST be ASCII-armored.
An ASCII-armored block MAY contain more than one OpenPGP certificate, and the string MAY contain multiple ASCII-armored blocks. An implementation SHOULD ignore valid OpenPGP certificates that it does not support, and MAY emit a warning that a certificate or component is not supported. An implementation SHOULD return an error if it encounters something other than an OpenPGP certificate encoded with ASCII armor.

When adding a certificate, an implementation SHOULD only add components that are needed to validate the signatures. That is, an implementation SHOULD strip subkeys that are not signing capable, as well as third-party signatures. For components that are kept, an implementation SHOULD include all known self signatures, not just the newest self signature.

#### sign_commit

The value of `sign_commit` is a boolean. If `true`, then the entity is authorized to sign commits.

#### sign_tag

The value of `sign_tag` is a boolean. If `true`, then the entity is authorized to sign tags.

#### sign_archive

The value of `sign_archive` is a boolean. If `true`, then the entity is authorized to sign tarballs or other archives.

#### audit

The value of `audit` is a boolean. If `true`, then the entity is authorized to add commits to the top-level `commit_goodlist` array.

#### add_user

The value of `add_user` is a boolean. If `true`, then the entity is authorized to add new entities to the authorization table, and to grant them any capabilities that they themselves have.

#### retire_user

The value of `retire_user` is a boolean. If `true`, then the entity is authorized to retire capabilities from any entity. This includes capabilities that they themselves do not have.

#### Example

The following is an example of an authorization entry. The user has been granted all of the capabilities. The user is identified by two different OpenPGP certificates. The certificates are contained in two concatenated ASCII-armored blocks.

```toml
[authorization."Neal H. Walfield "]
sign_commit = true
sign_tag = true
sign_archive = true
add_user = true
retire_user = true
audit = true
keyring = """
-----BEGIN PGP PUBLIC KEY BLOCK-----
Comment: F717 3B3C 7C68 5CD9 ECC4 191B 74E4 45BA 0E15 C957
Comment: Neal H. Walfield (Code Signing Key)
Comment: Neal H. Walfield
Comment: Neal H. Walfield
Comment: Neal H. Walfield
Comment: Neal H. Walfield

xsEhBFUjmukBDqCpmVI7Ve+2xTFSTG+mXMFHml63/Yai2nqxBk9gBfQfRFIjMt74
=MESu
-----END PGP PUBLIC KEY BLOCK-----
"""
```

### commit_goodlist

The value of `commit_goodlist` is an array of strings, where each string contains a commit identifier. The commit identifier MUST be a full hash. It MUST NOT be a branch name, a tag name, or a truncated hash.

Commits listed in `commit_goodlist` are commits that have retroactively been marked as valid. This may be useful when a certificate's private key material has been compromised.

## Authenticating Commits

Each commit in a `git` repository is part of a directed acyclic graph (DAG), where a node is a commit, and a directed edge shows how two commits are related. Specifically, the head of a directed edge is a commit that is derived from the tail. Except for the root commits, each commit has one or more parents. A commit that has multiple parents is derived from multiple commits. Conceptually, it merges multiple paths, and as such is called a merge commit.

A commit is considered authenticated if at least one of its parent commits considers the commit to be authenticated. This rule is different from Guix's *authorization invariant*, as described in {{guix}}, which states that all parent commits must consider the commit to be authenticated. The semantics described here allow a developer to add commits from unauthorized third parties as-is using a merge commit. Under Guix's authorization invariant, the third party's commit would have to be re-signed, which loses the third party's signature, and consequently complicates forensic analysis.
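As a non-normative sketch, this rule can be expressed as a recursive walk over the commit graph. The graph representation, the commit identifiers, and the `parent_authenticates` predicate are hypothetical placeholders for the signature and policy checks that this section describes:

```python
# Sketch of the "at least one parent authenticates" rule.  `parents`
# maps a commit id to its parent ids; `parent_authenticates(p, c)`
# stands in for verifying c's signature against p's policy file.
# All commit ids and helpers here are hypothetical.

def is_authenticated(commit, trust_root, parents, parent_authenticates):
    """True if `commit` chains back to the user's trust root."""
    if commit == trust_root:
        return True  # the trust root is trusted by definition
    return any(
        parent_authenticates(p, commit)
        and is_authenticated(p, trust_root, parents, parent_authenticates)
        for p in parents.get(commit, ())
    )

# Toy history: merge commit "m" has parents "a" (signed by an authorized
# developer) and "x" (an unsigned third-party contribution).
parents = {"m": ["a", "x"], "a": ["r"], "x": []}
signed_ok = lambda p, c: (p, c) in {("r", "a"), ("a", "m")}
```

With this toy history and trust root `r`, `m` is authenticated via `a` alone, even though `x` does not authenticate it; under Guix's authorization invariant, `m` would be rejected.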
A commit's parent authenticates it as follows. First, the implementation looks up the signer's certificate in the parent commit's policy file. The implementation SHOULD then canonicalize the certificate so that the active self signatures are those that were active when the signature was made. A self signature is valid if it is not revoked, and not expired. A self signature is active if it is the most recent valid self signature prior to a reference time. That is, if a new commit was made on June 9, 2023, then each component's most recent self signature as of June 9, 2023, which is also not revoked, and not expired, is considered that component's active self signature.

If the canonicalized certificate is valid as of the signature's time, not expired as of that time, not soft revoked as of that time, not hard revoked at any time, and the signature is correct, then the signature is considered verified. The implementation MAY consider certificate updates from other sources. If it does, it SHOULD only consider hard revocations.

The implementation MUST then check that the type of change is authorized by the policy. The following capabilities allow the specified types of changes:

- `sign_commit`: Needed for any change.

- `add_user`: Needed to delegate a capability to another user. Updating `keyring` does not require this capability if a certificate is only updated, and not added.

- `retire_user`: Needed to rescind a capability from another user.

- `audit`: Needed to modify the `version` field, and the `commit_goodlist` list.

If the signature is considered verified, and the signer is authorized to make the type of change that was made, then the commit is considered authenticated.

If the commit is not considered authenticated, because the signer's certificate has been hard revoked, but the commit is included in a later commit's `commit_goodlist`, then the commit is considered to be authenticated.
A commit is considered to occur later if, when authenticating a range of commits, it is a direct descendant of the commit in question, and it is in the commit range. Consider three commits `a`, `b`, and `c`, where `a` is `b`'s parent, `b` is `c`'s parent, the certificate used to sign `b` has been hard revoked, and `c` includes `b` in its `commit_goodlist`. In this case, the hard revocation of the certificate used to sign `b` is ignored. All other criteria, including that the signature on `b` is valid, are still checked.

## Authenticating Tags

A tag is a special type of commit in `git`, which has no content, but assigns a name to a specific commit. A tag is usually used to mark release points.

A tag is authenticated in the same way as a commit, as described in the previous section, with the following exceptions. First, the tagged commit is considered a parent commit, and the tag is considered its child commit. Second, the entity that signed the tag needs the `sign_tag` capability, and only the `sign_tag` capability.

## Authenticating Archives

Archives like tarballs are often generated as part of a software's release process. These may be signed. To authenticate an archive with respect to a signature and a trust root, the trust root's policy is used to authenticate the archive's signature. The entity that signed the archive must have the `sign_archive` capability.

Unlike a commit, an archive does not have a pointer to the commit that it was derived from. Thus, if an archive is derived from commit `c`, it may be possible to authenticate commit `c`, as well as tags referring to commit `c`, using a given trust root, but not to authenticate an archive derived from commit `c` using the same trust root, because the policy changed in the meantime.

If the signature includes the notation `commit@notations.sequoia-pgp.org`, then the value of the notation is interpreted as the commit that the archive is derived from.
The value of the notation is a hexadecimal value corresponding to the commit's full hash. The commit identifier MUST NOT be a branch name, a tag name, or a truncated hash; truncated hashes MUST be considered erroneous.

Since archives are often verified outside of a repository, one or more repositories may be specified using the `repository@notations.sequoia-pgp.org` notation. In that case, each notation indicates a git repository. For example, the main repository of the reference implementation, `sq-git`, is `https://gitlab.com/sequoia-pgp/sequoia-git.git`. So, its archives SHOULD include the `repository@notations.sequoia-pgp.org` notation with `https://gitlab.com/sequoia-pgp/sequoia-git.git` as the value.

When `commit@notations.sequoia-pgp.org` is present in the signature, the implementation MUST use that commit's policy to authenticate the archive, and then authenticate that commit by chaining back to the trust root, as described above; in this case, it MUST NOT use the trust root's policy directly unless the specified commit is also the trust root.

# Reference Implementation

A Rust implementation of this specification is part of Sequoia. See https://gitlab.com/sequoia-pgp/sequoia-git for the source code.

# Security Concerns

## Malicious vs. Buggy Changes

The scheme presented here can help mitigate malicious attacks on a code base, but it does nothing to prevent design flaws or coding errors. That is, this scheme does not and cannot provide any protection from normal bugs.

## Trusted Developers

The protections outlined in this document are mainly designed to stop third parties from adding malicious code to a project. This system provides no protection from a developer who is authorized to make changes and turns out to be malicious. That said, because commits are signed, when malicious code is discovered, an audit is required to restore trust in the code base.
Using this system, it is easier to identify other code added by the same person, and to focus an audit on that code.

## Judging Code vs. Judging Humans

The approach described in this document relies on transitive trust. The basic idea is that if a user is willing to run a developer's code, then they can reasonably rely on that developer to modify the code, and to delegate that capability to a third party. Yet, writing and reviewing code is fundamentally different from evaluating another person's intents. This is demonstrated quite well by the events surrounding the popular `event-stream` npm package {{event-stream}}. In 2018, a new developer gained the trust of the package's maintainer by contributing a number of high-quality changes. The original developer eventually made the new developer the maintainer, and the new maintainer introduced malicious code to steal users' credentials.

## Operational Security

Signing commits relies on each developer having a long-term identity key, which they keep safe. If the key is compromised, the attacker is able to impersonate the developer. It is possible to limit the damage by revoking the compromised key, or by having another authorized user retire the developer's access.

In this regard, sigstore appears to be better, as it relies on ephemeral signing keys, which are issued by a central authority. However, in order to obtain a signing key, the user needs to log in. If they use a password, and an attacker gets access to the password, the attacker can impersonate the developer. If the developer uses a second factor like a hardware token, then they are again using private key cryptography, and may as well put their private keys on a hardware token, and forego the centralized infrastructure.

## Dependencies

This specification has concentrated on enabling a user of a software project to authenticate new versions. But most software has its own dependencies, and those also need to be authenticated.
A user could identify all software that they are willing to rely on, but this is more work than most users are willing and able to do. But, just as developers are usually in a better position to evaluate who should be allowed to contribute to their project, they are also in a better position to designate a trust root for their dependencies.

Enabling this functionality requires ecosystem-specific tooling. The developer needs to be able to specify a trust root for each dependency, and the build infrastructure needs to authenticate the dependencies. For instance, the Rust ecosystem uses Cargo for building and dependency management. Currently, to add `sequoia-openpgp` as a dependency to a project, a developer would modify their `Cargo.toml` file as follows:

```toml
[dependencies]
sequoia-openpgp = { version = "1" }
```

Instead, they would also specify a trust root, which they've presumably audited:

```toml
[dependencies]
sequoia-openpgp = { version = "1", trust-root = "HASH" }
```

When downloading the dependency, `cargo` would make sure that the dependency can be authenticated from the specified trust root, and if not, throw an error.

## Document History

This is a first draft that has not been published.

# Acknowledgments

My thanks go---in particular, but not only---to the Sequoia PGP team for many fruitful discussions.

Funding for this project was provided by the Sovereign Tech Fund.