Cryptography in MLA
MLA uses cryptographic primitives essentially for the purpose of the Encryption and Signature layers.
This document introduces the primitives used, the arguments for the choices made and some security considerations.
Keys used for encryption and signature are generated and used separately.
Signature
As described in FORMAT.md, an archive can be signed. Implementations must ensure users explicitly choose whether a signature is made and verified.
A PQ/T key consists of a pair of a post-quantum key and a traditional key. An archive is considered correctly signed for a PQ/T key if and only if it is correctly signed for its post-quantum part AND its traditional part.
Two signature methods are available and must be used together. The signature method input is called m. The SHA-512 hash h of m may be computed in a first step.
For method MLAEd25519SigMethod, signature_data is the Ed25519ph (as described in RFC 8032 1) signature of m (not h, even though it can be used for computing the result). The context given as parameter to Ed25519ph is the ASCII MLAEd25519SigMethod. Signature verification and key generation are done as described in RFC 8032. Key storage is described in KEY_FORMAT.md.
For method MLAMLDSA87SigMethod, signature_data is the ML-DSA-87 signature (as described in FIPS 204 2, not HashML-DSA) of h (not m this time) with the ASCII MLAMLDSA87SigMethod as context. Signature verification and key generation are done as described in FIPS 204. Key storage is described in KEY_FORMAT.md.
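The inputs handed to the two signature methods can be summarized with a minimal sketch. It only prepares the message and context for each method; the actual Ed25519ph and ML-DSA-87 operations are left to a cryptographic library. The function name `signature_inputs` and the returned dictionary layout are illustrative assumptions, not part of MLA.

```python
import hashlib

# Context strings given in the text, encoded as ASCII.
ED25519_CONTEXT = b"MLAEd25519SigMethod"
MLDSA87_CONTEXT = b"MLAMLDSA87SigMethod"


def signature_inputs(m: bytes) -> dict:
    """Return the (message, context) pair each method signs (hypothetical helper)."""
    h = hashlib.sha512(m).digest()  # h = SHA-512(m), 64 bytes
    return {
        # Ed25519ph takes m itself; the prehash is part of the scheme.
        "MLAEd25519SigMethod": (m, ED25519_CONTEXT),
        # ML-DSA-87 (not HashML-DSA) signs the hash h.
        "MLAMLDSA87SigMethod": (h, MLDSA87_CONTEXT),
    }
```

Both contexts are well under the 255-byte limit that RFC 8032 and FIPS 204 place on context strings.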
An archive can be signed with multiple signing keys. If a user provides a set of PQ/T keys for signature verification, implementations should give a way for the user to know if the archive is correctly signed for at least one key. Implementations may give a way for users to know if the archive is correctly signed for all keys. Users must explicitly know whether they are validating against at least one key or against all keys. Implementations may also give a way for users to know which PQ/T keys correspond to valid signatures, or how many do.
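The two verification policies can be illustrated with a small sketch. It assumes the per-key verification results are already available as booleans; `PqtCheck` and both function names are hypothetical and do not come from the MLA API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PqtCheck:
    """Verification outcome for one PQ/T key (hypothetical structure)."""
    traditional_ok: bool   # Ed25519ph signature verified
    post_quantum_ok: bool  # ML-DSA-87 signature verified

    @property
    def valid(self) -> bool:
        # A PQ/T key is satisfied only if BOTH parts verify.
        return self.traditional_ok and self.post_quantum_ok


def signed_for_at_least_one(checks: Sequence[PqtCheck]) -> bool:
    return any(c.valid for c in checks)


def signed_for_all(checks: Sequence[PqtCheck]) -> bool:
    # An empty key set is rejected here so that "all" never passes vacuously
    # (a design choice of this sketch).
    return len(checks) > 0 and all(c.valid for c in checks)
```

Keeping the two policies as separate functions makes the choice explicit at the call site, in line with the requirement that users know which one they are applying.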
Encryption high-level overview
Objectives
The purpose of the Encryption layer is to provide confidentiality and data integrity of the inner layer.
These objectives are obtained using:
- Authenticated encryption
- Asymmetric cryptography, for several recipients
This layer does not provide signature.
General design guidelines
- The size and the initial computation time used for the encryption needs are not a big issue, if kept reasonable. Indeed, in the author's understanding, MLA archives are usually several MB long and the computation time is primarily spent in compression/decompression and encryption/decryption of the data. As a result, some optimizations have not been performed, which helps keep the design auditable and conservative
- Only one encryption method and key type is available, to avoid confusion and potential corner-case errors
- When possible, use audited code and test vectors
Main bricks: Encryption
The data is encrypted using AES-256-GCM, an AEAD algorithm. To offer a seekable layer, data is encrypted using chunks of 128KB each, except for the last one. These encrypted chunks are all present with their associated tag. Tags are checked during decryption before returning data to the upper layer.
To prevent truncation attacks, another chunk is added at the end corresponding to the encryption of the ASCII string "FINALBLOCK" with "FINALAAD" as additional authenticated data. Any usage of the archive must check correct decryption (including tag verification) of this last block.
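The chunking described above can be sketched as follows. This only shows how the plaintext is cut and where the final marker goes, with no encryption performed. It assumes 128KB means 128 * 1024 bytes and that empty input yields no data chunk; the function names are illustrative.

```python
CHUNK_SIZE = 128 * 1024            # assumption: 128KB read as 128 KiB
FINAL_BLOCK_CONTENT = b"FINALBLOCK"
FINAL_BLOCK_AAD = b"FINALAAD"      # additional authenticated data of the final block only


def split_into_chunks(data: bytes) -> list:
    """Cut data into CHUNK_SIZE pieces; only the last one may be shorter."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def plaintext_blocks(data: bytes) -> list:
    """Plaintexts in the order they would be encrypted: data chunks, then the final marker."""
    return split_into_chunks(data) + [FINAL_BLOCK_CONTENT]
```

A reader that never reaches a correctly authenticated final block has no proof that the chunks it has seen are the complete archive.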
The key, the base nonce and the nonce derivation for each data chunk are computed following HPKE (RFC 9180) 3. HPKE is parameterized with:
- Mode: "Base" (no PSK, no sender authentication)
- KDF: HKDF-SHA512
- AEAD: AES-256-GCM
- KEM: Multi-Recipient Hybrid KEM, a custom KEM described later in this document
Thus, only one cryptographic suite is available for now. If this setting ends up broken by cryptanalysis, we will move users onward to the next MLA version, using appropriate cryptography. MLA therefore lacks cryptographic agility, a property that ANSSI encourages for post-quantum cryptography 4. Still, the use of HPKE improves this aspect of MLA 3.
Full details are available below.
Additionally, "key commitment" is included using a method described in 5 and detailed in 6.
Main bricks: Asymmetric encryption
Since the format v2, the Encrypt layer is using post-quantum cryptography (PQC) through a hybrid approach, to avoid "Harvest now, decrypt later" attacks.
The algorithms used are:
- X25519 for pre-quantum cryptography, using DHKEM (RFC 9180) 3
- FIPS 203 7 (CRYSTALS Kyber) MLKEM-1024 for post-quantum cryptography
The two keys are mixed together (see below) in a manner keeping the IND-CCA2 properties of the two algorithms.
Sending to multiple recipients is achieved using a two-step process:
- For each recipient, a per-recipient Hybrid KEM is done, leading to a per-recipient shared secret
- Each per-recipient shared secret is derived through HPKE to obtain a key and a nonce
- This per-recipient key and nonce are used to encrypt (and, on the recipient side, decrypt) a secret shared by all recipients
This final secret is the one later used as an input to the encryption layer. The whole process can be viewed as a KEM encapsulation for multiple recipients.
Encryption Details
The following sections describe the whole process for data encryption and seed derivation. They are meant to ease the understanding of the code and MLA format re-implementation.
The interested reader could also look at the Rust implementation in this repository for more details. The implementation also includes tests (including some test vectors) and comments.
Asymmetric encryption - Per-recipient KEM
Notations
- , , and : respectively the X25519 public key and secret key, and the MLKEM-1024 (FIPS 203 7) encapsulating key and decapsulating key
- and : key encapsulation methods with X25519, as defined in RFC 9180, section 4 3
- and : key encapsulation methods on MLKEM-1024, as defined in FIPS 203 7
- : a 32-byte secret, produced by a cryptographic RNG. Informally, this is the secret shared among recipients, encapsulated separately for each recipient
- : KeySchedule function from RFC 9180 3, instantiated with:
- Mode: "Base"
- KDF: HKDF-SHA-512
- AEAD: AES-256-GCM
- KEM: a custom KEM ID, numbered 0x1120
- : AES-256-GCM encryption, returning the encrypted data concatenated with the associated tag
- : AES-256-GCM decryption, returning the decrypted data after verifying the tag
- and : respectively produce a byte string encoding the data in argument, and produce the data from the byte string in argument
Process
To encrypt to a target recipient , knowing and :
- Compute shared secrets and ciphertexts for both KEMs:
- Combine the shared secrets (implemented in mla::crypto::hybrid::combine):
def combine(ss1, ss2, ct1, ct2):
    uniformly_random_ss1 = HKDF-SHA512-Extract(
        salt=0,
        ikm=ss1
    )
    key = HKDF(
        salt=uniformly_random_ss1,
        ikm=ss2,
        info=ct1 . ct2
    )
    return key
- Wrap the recipients' shared secret:
Informally, this process can be viewed as a per-recipient KEM taking a shared secret , the recipient public key (made of the elliptic curve and the PQC public keys) and returning a ciphertext .
To obtain the shared secret from for a recipient knowing and :
- Compute the recipient's shared secret:
- Try to decrypt the secret shared among recipients:
If the decryption is a success, returns . Otherwise, returns an error.
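For readers who want to experiment with the combine step above, here is a minimal runnable sketch using only the Python standard library. It makes several assumptions that the pseudocode leaves open: `salt=0` is read as a salt of HashLen zero bytes (the RFC 5869 default), `.` is read as byte concatenation, `HKDF` is read as HKDF-SHA512 Extract followed by Expand, and the output length is set to 32 bytes. The reference implementation remains mla::crypto::hybrid::combine.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha512().digest_size  # 64 bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-512."""
    return hmac.new(salt, ikm, hashlib.sha512).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-512."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]


def combine(ss1: bytes, ss2: bytes, ct1: bytes, ct2: bytes, length: int = 32) -> bytes:
    """Nested Dual-PRF combiner sketch; `length` is an assumption of this example."""
    # First PRF call: turn ss1 into a uniformly random salt.
    uniformly_random_ss1 = hkdf_extract(bytes(HASH_LEN), ss1)
    # Second call: HKDF keyed by that salt over ss2, bound to both ciphertexts.
    prk = hkdf_extract(uniformly_random_ss1, ss2)
    return hkdf_expand(prk, ct1 + ct2, length)
```

Because both ciphertexts enter the `info` field, changing either one changes the derived key.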
Arguments
- Using HPKE (RFC 9180 3) for both elliptic curve encryption (DHKEM) and post-quantum encryption (MLKEM) offers several benefits 8:
- Easier re-implementation of the MLA format, thanks to the availability of HPKE in cryptographic libraries
- An existing formal analysis 9
- Easier code and security auditing, thanks to the use of known bricks
- Availability of test vectors in the RFC, making the implementation more reliable
- If signature is added to MLA in a future version, it could also be integrated using HPKE
- To the knowledge of the author, no HPKE algorithm has been standardized for quantum hybridization, hence the custom algorithm
- FIPS 203 is used as, at the time of writing:
- The MLKEM-1024 mode is used for stronger security, and to limit the consequences of future advances 11 12. This is also the choice of other industry standards 13 14
- The shared secret from the two-KEM is produced using a "Nested Dual-PRF Combiner", proved in 15 (3.3):
- The use of a concatenation scheme including ciphertexts keeps IND-CCA2 if one of the two underlying schemes is IND-CCA2, as proved in 16 and explained in 17
- TLS 18 uses a similar scheme, and IKE 19 also uses a concatenation scheme
- This kind of scheme follows ANSSI recommendations 4
- HKDF can be considered as a Dual-PRF if both inputs are uniformly random 20. In MLA, the combine method is called with a shared secret from ML-KEM, and the resulting ECC key derivation; both are uniformly random
- To avoid potential mistakes in the future, or a misuse of this method, the "Nested Dual-PRF Combiner" is used instead of the "Dual-PRF Combiner" (also from 15). Indeed, this combiner forces the "salt" part of HKDF to be uniformly random using an additional PRF use, ensuring the following HKDF is indeed a Dual-PRF
Asymmetric encryption - Multi-Recipient Hybrid KEM
Intuition
A KEM, such as the one described above, returns a fresh and distinct secret for each recipient.
To obtain a "meta-KEM" working for multiple recipients, the strategy is to use the per-recipient KEM to encrypt a common secret.
This whole process can then be viewed as a KEM for multiple recipients, taking as input a list of public keys and returning a shared secret and a ciphertext made of the concatenation of each per-recipient ciphertext.
To avoid marking which per-recipient ciphertext corresponds to which recipient public key, the decapsulation process "brute-forces" each ciphertext for a given decapsulation key. If the decryption works (with the associated tag), the shared secret is returned.
Key commitment, to avoid a rather unlikely mismatch, is further ensured inside the Encrypt layer (see below).
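The "brute-force" decapsulation strategy can be sketched independently of the underlying cryptography. In the sketch below, `try_decapsulate` stands for the per-recipient decapsulation with the caller's private key and is assumed to return `None` when the AEAD tag does not verify; both names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def multi_recipient_decapsulate(
    per_recipient_cts: Sequence[bytes],
    try_decapsulate: Callable[[bytes], Optional[bytes]],
) -> bytes:
    """Try each per-recipient ciphertext in turn and return the first shared secret obtained."""
    for ct in per_recipient_cts:
        shared_secret = try_decapsulate(ct)
        if shared_secret is not None:
            return shared_secret
    raise ValueError("no per-recipient ciphertext could be decrypted with this key")
```

The ciphertexts carry no recipient identifier, so a reader learns only whether one of them opens with its own key.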
Process
The "Per-recipient KEM" process described above is noted:
- , taking a pair of public keys ( and ), a shared secret and returning a recipient ciphertext
- , taking a pair of private keys ( and ), a ciphertext and returning either a shared secret if the recipient is a legitimate recipient (if the AEAD decryption works), or an error otherwise
is a cryptographically secure RNG producing an n-byte secret.
To encapsulate to a list of recipients :
To decapsulate from a ciphertext , knowing a recipient private key :
Arguments
- The shared secret is cryptographically generated, so it can later be used as a shared secret in HPKE encryption
- This secret is unique per archive, as it is generated on archive creation. Even "converting" or "repairing" an archive with the mlar CLI forces a fresh secret: since no edit feature is implemented (even if it is doable), a new random symmetric key is used to encrypt the content
- Even if the AEAD decryption worked for a non-legitimate recipient, for instance following an intentional manipulation, the shared secret obtained will later be checked using Key commitment before decrypting actual data (see below)
- Optimizations would have been possible here, such as sharing a common ephemeral key for the DHKEM. But the size gain is not worth it given the ciphertext size of MLKEM, and it would move the implementation away from the DHKEM in RFC 9180
Encryption
Notation
The "Multi-Recipient Hybrid KEM" process described above is noted:
- , taking a list of public keys and returning a shared secret and a ciphertext
- , taking a pair of private keys ( and ), a ciphertext and returning either a shared secret if the recipient is a legitimate recipient (if the AEAD decryption works), or an error otherwise
KeyCommitmentChain is defined as the array of 64 bytes: -KEY COMMITMENT--KEY COMMITMENT--KEY COMMITMENT--KEY COMMITMENT-.
: KeySchedule function from RFC 9180 3, instantiated with:
- Mode: "Base"
- KDF: HKDF-SHA-512
- AEAD: AES-256-GCM
- KEM: a custom KEM ID, numbered 0x1020
: function from RFC 9180 3.
Process
To encrypt n-bytes data to a list of public keys :
- Compute a shared secret and the corresponding ciphertext:
- Derive the key and base nonce using HPKE
- Ensure key-commitment
- For each 128KB of data:
Note: starts at 0. is used because the sequence numbered 0 has already been used by the Key commitment.
- When the layer is finalized, the last chunk of data (with a length lower than or equal to 128KB) is encrypted the same way
- Finally, a final chunk with sequence number (where is the number of data chunks) and special content and additional authenticated data is appended:
The resulting layer is composed of:
- header:
- data:
Special care must be taken not to reuse a sequence number in implementations as this would be catastrophic given GCM properties. For chunks of data:
- sequence 0: key commitment
- sequence 1 to : data
- sequence : with only the 10 bytes "FINALBLOCK" as content
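The sequence number layout and the per-chunk nonce derivation can be sketched as follows. The nonce computation follows the RFC 9180 construction (base nonce XORed with the big-endian sequence number, 12 bytes for AES-256-GCM). The helper names, the 0-based chunk index and the symbol `n` for the number of data chunks are assumptions inferred from the surrounding text.

```python
NONCE_LEN = 12  # Nn for AES-256-GCM in RFC 9180

KEY_COMMITMENT_SEQ = 0  # sequence 0 is reserved for the key commitment


def compute_nonce(base_nonce: bytes, seq: int) -> bytes:
    """RFC 9180 style nonce: base_nonce XOR I2OSP(seq, Nn)."""
    if len(base_nonce) != NONCE_LEN:
        raise ValueError("base nonce must be 12 bytes")
    if not 0 <= seq < 2**64:
        raise OverflowError("sequence number must fit in a u64")
    seq_bytes = seq.to_bytes(NONCE_LEN, "big")
    return bytes(b ^ s for b, s in zip(base_nonce, seq_bytes))


def data_chunk_seq(chunk_index: int) -> int:
    """Data chunk i (starting at 0) uses sequence i + 1."""
    return chunk_index + 1


def final_block_seq(n_data_chunks: int) -> int:
    """The final block comes right after the last data chunk."""
    return n_data_chunks + 1
```

With this layout every block of an archive gets a distinct sequence number, and therefore a distinct nonce under the same key.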
To decrypt the data at position :
- Once for the whole session, get the cryptographic materials
- Once for the whole session, check the key commitment
- Retrieve the encrypted chunk of data
Where is the Euclidean division.
Then:
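For illustration, the lookup of the chunk holding a given plaintext position can be sketched with a Euclidean division. The sketch assumes a 128 * 1024 byte chunk size and the sequence layout given earlier (data chunk i uses sequence i + 1); `locate` is a hypothetical helper, and the on-disk offset of the encrypted chunk is deliberately left out since it depends on the layer header.

```python
CHUNK_SIZE = 128 * 1024  # assumption: 128KB read as 128 KiB


def locate(position: int) -> tuple:
    """Return (chunk_index, offset_in_chunk, sequence_number) for a plaintext position."""
    if position < 0:
        raise ValueError("position must be non-negative")
    chunk_index, offset_in_chunk = divmod(position, CHUNK_SIZE)
    sequence_number = chunk_index + 1  # sequence 0 is the key commitment
    return chunk_index, offset_in_chunk, sequence_number
```

The whole chunk must still be decrypted and its tag verified before any byte at `offset_in_chunk` is returned to the caller.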
Arguments
- Key commitment is always checked before returning clear-text data to the caller
- AEAD tag of a chunk is always checked before returning the corresponding clear-text data to the caller
- Arguments for HPKE use are very similar to the ones mentioned above. In particular, this is a standardized approach with existing analysis
- As there are two kinds of custom KEM used ("Per-recipient KEM" and "Hybrid KEM"), two distinct KEM IDs are used. In addition, two distinct MLA-specific info values are used to bind this derivation to MLA
- As described in 5 and 21, AES in GCM mode does not ensure "key commitment". This property is added in the layer using the "padding fix" scheme from 5 with the recommended 512-bit size for 256-bit security
- Key commitment is mainly used to ensure that two recipients will decrypt to the same plaintext if given the same ciphertext, i.e. an attacker modifying the header of an archive cannot provide two distinct plaintexts to two distinct recipients
- AES-GCM is used as an industry standard AEAD
- the base nonce, and therefore each nonce used, are unique per archive because they are generated from the archive-specific shared secret, limiting the nonce-reuse risk to standard acceptability 3
- no more than chunks will be produced, as the sequence's type used in MLA implementation is a u64 checked for overflow. As this is a widely accepted limit of AES-GCM, this value is also within the range provided by 3
- the tag size is 128 bits (the standard one), avoiding attacks described in 22
- 128KiB is lower than the maximum plaintext length for a single message in AES-GCM (64 GiB) 22
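The sizes quoted in these arguments can be checked with a few lines of arithmetic. The GCM bound used here is the one from NIST SP 800-38D (a plaintext of at most 2^39 - 256 bits per invocation); the constant names are illustrative.

```python
CHUNK_SIZE = 128 * 1024                           # 128 KiB per data chunk
GCM_MAX_PLAINTEXT_BYTES = (2**39 - 256) // 8      # just under 64 GiB
KEY_COMMITMENT_CHAIN = b"-KEY COMMITMENT-" * 4    # the 64-byte constant from the text

# The key commitment block is 512 bits long.
key_commitment_bits = len(KEY_COMMITMENT_CHAIN) * 8

# How many times a data chunk fits under the per-message GCM limit.
headroom = GCM_MAX_PLAINTEXT_BYTES // CHUNK_SIZE
```

Each chunk is several orders of magnitude below the per-message limit, so the chunk size is not a concern for GCM.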
Seed derivation
The asymmetric encryption in MLA, particularly the KEMs, provides deterministic APIs.
These APIs are usually fed with cryptographically generated data, except for the regression tests and the "seed derivation" feature in the mlar CLI.
This feature is meant to provide a way for clients to implement:
- A derivation tree
- Keeping the root secret in a safe place while being able to recompute the derived secrets
The derivation scheme is based on the same ideas as mla::crypto::hybrid::combine:
- A dual-PRF (HKDF-Extract with a uniform random salt 20) to extract entropy from the private key
- HKDF-Expand to derive along the given path component
From a private key ( and ), the secret is derived from the path component through:
To derive a key using a seed, a ChaCha20Rng is used.
If a seed is provided, the ChaCha20Rng is seeded with the first 32 bytes of . Otherwise, the seed comes from OS Cryptographic RNG sources.
A ChaCha20Rng is the ChaCha20 23 stream cipher fed with a seed as key and 8 null bytes as nonce.
The CSRNG is then provided to MLA deterministic APIs.
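The Extract-then-Expand shape of the derivation can be sketched as below. This is only an illustration of the idea. The salt value, the private key serialization and the 32-byte output length are placeholders chosen for this example, not the values used by MLA, and the step that feeds the result into a ChaCha20Rng is not shown.

```python
import hashlib
import hmac

# Placeholder salt: MLA uses its own fixed, uniformly random constant.
HYPOTHETICAL_DERIVE_SALT = hashlib.sha512(b"example salt, not MLA's").digest()


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-512."""
    return hmac.new(salt, ikm, hashlib.sha512).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-512."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_seed(private_key_bytes: bytes, path_component: bytes) -> bytes:
    """Derive a child seed from serialized private key material and one path component."""
    prk = hkdf_extract(HYPOTHETICAL_DERIVE_SALT, private_key_bytes)  # extract entropy
    return hkdf_expand(prk, path_component, 32)                      # bind to the path
```

Since the derivation is deterministic, anyone holding the root secret can recompute every derived secret from the path alone.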
Implementation specificities
External dependencies
Some of the external cryptographic libraries have been reviewed:
- RustCrypto AES-GCM, reviewed by NCC Group 24
- Dalek cryptography library, reviewed by Quarkslab 25
- rust-hpke library, reviewed in version 0.8 by Cloudflare 26
In addition to the review, rust-hpke is mainly based on RustCrypto, avoiding the need for additional newer dependencies.
The MLKEM implementation used is the one of RustCrypto, as MLA already depends on this project and the code quality and auditability are, in the author's understanding, rather good.
The generation uses OsRng from crate rand, which uses getrandom() from crate getrandom. getrandom provides implementations for many systems, listed here.
On Linux it uses the getrandom() syscall and falls back on /dev/urandom.
On Windows it uses the RtlGenRandom API (available since Windows XP/Windows Server 2003).
In order to be "better safe than sorry", a ChaCha20Rng is seeded from the bytes generated by OsRng in order to build a CSPRNG (Cryptographically Secure PseudoRandom Number Generator). This ChaCha20Rng provides the actual bytes used in key and nonce generation.
The authors decided to use elliptic curves over RSA, because:
- No production-ready Rust-based RSA libraries had been found at the date of writing
- A security-audited Rust library already exists for Curve25519
- Curve25519 is widely used and respects several criteria
- Common arguments, such as the ones of Trail of bits
AES-GCM is used because it is one of the most commonly used AEAD algorithms and using one avoids a whole class of attacks. In addition, it lets us rely on hardware acceleration (like AES-NI) to keep reasonable performance.
AES-GCM re-implementation
While the AES and GHash bricks come from RustCrypto, the GCM mode for AES-256 has been re-implemented in MLA.
Indeed, the repair mode must be able to only partially decrypt a data chunk, and decide whether the associated tag must be verified or not. This API is not provided by the RustCrypto project, for very understandable reasons.
To ensure the implementation follows the standard, it is tested against AES-256-GCM test vectors in MLA regression tests.
HPKE Key Schedule re-implementation
For several reasons described in the code, but mainly due to the availability of API, the possibility to add custom KEM IDs and the relatively few lines needed for re-implementation, the method has been re-implemented in MLA.
It still uses some bricks from rust-hpke, such as the KDF, and . It is tested against RFC 9180 3 test vectors in MLA regression tests.
MLKEM implementation without a review
Thanks to the hybrid approach, a flawed implementation of MLKEM would have limited consequences. It satisfies ANSSI guidelines for the transition first phase to PQC hybridization 4. For this reason, MLA is eligible for a security visa evaluation.
For now, it is therefore accepted by the author (as a trade-off) to use an MLKEM implementation without existing review, in order to bring a reasonable protection against "Harvest now, decrypt later" attacks as soon as possible.
If a reviewed implementation with acceptable dependency emerges in the future, it can be easily swapped in MLA. Thus, MLA would also satisfy the requirements to get a security visa evaluation in the second and third phases of these guidelines by including its PQC implementation.
Security considerations
Absence of signature
As there is no signature for now in MLA, an attacker knowing the recipient public key can always create a custom archive with arbitrary data.
For this reason, several known attacks are considered acceptable, such as:
- The bit indicating if the Encrypt layer is present is not protected in integrity
An attacker can remove it, making the reader treat the archive as if encryption was absent. The reader is responsible for checking the encryption bit if encryption was expected in the first place.
For instance, the mlar CLI will refuse to open an archive without the Encrypt bit unless --accept-unencrypted is provided on the command line.
- An attacker with the ability to modify a real archive in transit can replace what the reader will be able to read with arbitrary data
To perform this attack, the attacker will have to either remove the Encrypt bit or replace the key used for decryption with one they control.
The remaining encrypted data will then act as random values.
Still, the attacker could expect to gain enough privilege, like arbitrary code execution in the process, during the archive read. One can then try to reuse the provided key to decrypt, then act on the real data.
Limiting this attack is beyond the scope of this document. It mainly involves the security features of Rust, reviewed implementation, testing & fuzzing, zeroizing secrets when possible 27, etc.
- An attacker can truncate an archive and hope for repair
This attack is based on a trade-off: should the SafeReader try to get as many bytes as possible, or should it return only data that have been authenticated?
The choice has been made to leave the decision to the user of the library 28.
Other properties
- Plaintext length
The Encrypt layer does not hide the plaintext length.
Usually, this layer is used with the Compress layer. If an attacker knows the original file size, they might learn information about the original data entropy.
- Hidden recipient list
Only the owner of a recipient's private key can determine that they are a recipient of the archive. In other words, while the recipient list remains private, the total number of recipients is still visible.
This is an intentional privacy feature.