Multi Layer Archive (MLA)

MLA is an archive file format with the following features:

  • Support for hybrid traditional and post-quantum encryption with asymmetric keys (HPKE with AES256-GCM and a KEM based on a hybridization of X25519 and post-quantum ML-KEM 1024)
  • Support for hybrid traditional and post-quantum signing
  • Support for compression (based on rust-brotli)
  • Streamable archive creation:
    • An archive can be built even over a data-diode
    • An entry can be added through chunks of data, without initially knowing the final size
    • Entry chunks can be interleaved (one can add the beginning of an entry, start a second one, and then continue adding the first entry's parts)
  • Architecture agnostic and portable to some extent (written entirely in Rust)
  • Archive reading is seekable, even if compressed or encrypted. An entry can be accessed in the middle of the archive without reading from the beginning
  • If truncated, archives can be repaired to some extent. Two modes are available:
    • Authenticated repair (default): only authenticated (as in AEAD, there is no signature verification) encrypted chunks of data are retrieved
    • Unauthenticated repair: authenticated and unauthenticated encrypted chunks of data are retrieved. Use at your own risk.
  • Arguably less prone to bugs, especially while parsing an untrusted archive (Rust safety)

Repository

This repository contains:

  • mla: the Rust library implementing MLA reader and writer
  • mlar: a Rust CLI utility wrapping mla for common actions (create, list, extract...)
  • doc: advanced documentation related to MLA (e.g. format specification)
  • bindings: bindings for other languages
  • samples: test assets
  • mla-fuzz-afl: a Rust utility to fuzz mla
  • .github: Continuous Integration needs

Quick command-line usage

Here are some commands to use mlar in order to work with archives in MLA format.

# Generate MLA key pairs.
mlar keygen sender
mlar keygen receiver

# Create an archive with some files.
mlar create -k sender.mlapriv -p receiver.mlapub -o my_archive.mla /etc/./os-release /etc/security/../issue ../file.txt

# List the content of the archive.
# Note that order may vary, root dirs are stripped,
# paths are normalized and listing is encoded as described in
# `doc/ESCAPING.md`.
# This outputs:
# ``
# etc/issue
# etc/os%2drelease
# file.txt
# ``
mlar list -k receiver.mlapriv -p sender.mlapub -i my_archive.mla

# Extract the content of the archive into a new directory.
# In this example, this creates two files:
# extracted_content/etc/issue and extracted_content/etc/os-release
mlar extract -k receiver.mlapriv -p sender.mlapub -i my_archive.mla -o extracted_content

# Display the content of a file in the archive
mlar cat -k receiver.mlapriv -p sender.mlapub -i my_archive.mla etc/os-release

# Convert the archive to a long-term one, removing encryption and using the best
# and slower compression level
mlar convert -k receiver.mlapriv -p sender.mlapub -i my_archive.mla -o longterm.mla -l compress -q 11

# Create an archive with multiple recipients and without signature nor compression
mlar create -l encrypt -p archive.mlapub -p client1.mlapub -o my_archive.mla ...

# List an archive containing an entry with a name that cannot be interpreted as path.
# This outputs:
# `c%3a%2f%00%3b%e2%80%ae%0ac%0dd%1b%5b1%3b31ma%3cscript%3eevil%5c..%2f%d8%01%c2%85%e2%88%95`
# corresponding to an entry name containing: ASCII chars, c:, /, .., \,
# NUL, RTLO, newline, terminal escape sequence, carriage return,
# HTML, surrogate code unit, U+0085 weird newline, fake unicode slash.
# Please note that some of these characters may appear in a valid path.
mlar list -k test_mlakey_archive_v2_receiver.mlapriv -p test_mlakey_archive_v2_sender.mlapub -i archive_weird.mla --raw-escaped-names

# Get its content.
# This displays:
# `' OR 1=1`
mlar cat -k test_mlakey_archive_v2_receiver.mlapriv -p test_mlakey_archive_v2_sender.mlapub -i archive_weird.mla --raw-escaped-names c%3a%2f%00%3b%e2%80%ae%0ac%0dd%1b%5b1%3b31ma%3cscript%3eevil%5c..%2f%d8%01%c2%85%e2%88%95

# Create an archive of a web file and utf-8 string, without encryption and without signature
(curl https://raw.githubusercontent.com/ANSSI-FR/MLA/refs/heads/master/README.md; echo "SEP"; echo "All Hail MLA!") | mlar create -l -o my_archive.mla --separator "SEP" --filenames great_readme.md -

mlar can be obtained:

  • through Cargo: cargo install mlar
  • using the latest release for supported operating systems
    • The released binaries are built with opt-level = 3, enabling great performance

For even higher performance, you can build a native-optimized binary (not portable), for example on a Linux machine:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release --target x86_64-unknown-linux-musl

Note: Native builds are optimized for your machine's CPU and are not portable. Use them only when running on the same machine you build on.

API usage

See https://docs.rs/mla

Using MLA with other languages

Bindings are available for:

Security

  • Please keep in mind, it is generally not safe to extract in a place where at least one ancestor is writable by others (symbolic link attacks).
  • Even if encrypted with an authenticated cipher, if you receive an unsigned archive, it may have been crafted by anyone having your public key and thus can contain arbitrary data.
  • Read API documentation and mlar help before using their functionalities. They sometimes provide important security warnings. doc/ENTRY_NAME.md is also of particular interest.
  • mlar escapes entry names on output to avoid security issues.
  • Except for symbolic link attacks, mlar will not extract outside given output directory.

FAQ

Is MLAArchiveWriter Send?

By default, MLAArchiveWriter is not Send. If the inner writable type is also Send, one can enable the feature send for mla in Cargo.toml, such as:

[dependencies]
mla = { version = "...", default-features = false, features = ["send"]}

Was a new format really required?

As existing archive formats are numerous, probably not.

But to the best of the authors' knowledge, none of them supports the aforementioned features (though they are, of course, better suited to other purposes).

For instance (from the understanding of the author):

  • tar format needs to know the size of files before adding them, and is not seekable
  • zip format could lose information about files if the footer is removed
  • 7zip format requires rebuilding the entire archive when adding files to it (not streamable). It is also quite complex, and so harder to audit / trust when unpacking an unknown archive
  • journald format is not streamable. Also, one writer / multiple readers is not needed here, thus releasing some constraints the journald format has
  • any archive + age: age does not, as of MLA 2.0 release, support post-quantum encryption nor signatures.
  • Backup formats are generally written to avoid things such as duplication, hence their need to keep bigger structures in memory, or not being streamable

Tweaking these formats would likely have resulted in similar properties. The choice has been made to keep a better control over what the format is capable of, and to (try to) KISS.

Performance

One can evaluate the performance through the embedded benchmarks, based on Criterion.

Several scenarios are already embedded, such as:

  • File addition, with different size and layer configurations
  • File addition, varying the compression quality
  • File reading, with different size and layer configurations
  • Random file read, with different size and layer configurations
  • Linear archive extraction, with different size and layer configurations

On an "Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz":

$ cd mla/
$ cargo bench
...
multiple_layers_multiple_block_size/Layers ENCRYPT | COMPRESS | DEFAULT/1048576                                                                           
                        time:   [28.091 ms 28.259 ms 28.434 ms]
                        thrpt:  [35.170 MiB/s 35.388 MiB/s 35.598 MiB/s]
...
chunk_size_decompress_mutilfiles_random/Layers ENCRYPT | COMPRESS | DEFAULT/4194304                                                                          
                        time:   [126.46 ms 129.54 ms 133.42 ms]
                        thrpt:  [29.980 MiB/s 30.878 MiB/s 31.630 MiB/s]
...
linear_vs_normal_extract/LINEAR / Layers DEBUG | EMPTY/2097152                        
                        time:   [145.19 us 150.13 us 153.69 us]
                        thrpt:  [12.708 GiB/s 13.010 GiB/s 13.453 GiB/s]
...

Criterion.rs documentation explains how to get back HTML reports, compare results, etc.

The AES-NI extension is enabled in the compilation toolchain for the supported architectures, leading to a massive performance gain for the encryption layer, especially in reading operations. Because the crate aesni statically enables it, it might lead to errors if the user's architecture does not support it. It can be disabled at compilation time, or by commenting out the associated section in .cargo/config.

Contributing

We appreciate your help! To contribute, please read our contributing instructions.

MLA FORMAT

Relation between the MLA library version and the file format version:

| MLA version | Supported file format |
|---|---|
| 2.X | 2 |
| 1.X | 1 |

This document introduces the MLA file format in its current version, 2. For a more comprehensive introduction of the ideas behind it, please refer to README.md.

Types and their serialization format

  • Integers are unsigned and serialized as bytes in little endian. They are called u64 for 64 bits integers, u32 for 32 bits ones, u16 for 16 bits ones and u8 for 8 bits ones. Serialization length in bytes are: 8 for u64, 4 for u32, 2 for u16 and 1 for u8.
  • Vec<T> is a sequence of elements of type T. It is serialized with its length in number of elements (not necessarily bytes) as a u64 and the sequence of serialized elements of type T.
  • Opts represents MLA options. It is serialized with a tag of value 0 as a u8 if no option is present. Otherwise it is serialized with a tag of value 1 as a u8 followed by a yet unspecified Vec<u8>. Multiple fields in this file format are of type Opts for future-proofing reasons, but no option is defined at the moment. For future proofing, implementers of this file format version must still handle the tag of value 1 and read the Vec<u8> even if they do not use its values. Thus, if an option is specified in the future, pre-dating implementations will be able to work with new archives containing the optional value.
  • Tail<T> is a T followed by its tail_length length in bytes as a u64. This enables extracting the T when reading from the end. Note that a Tail<Vec<T>> contains two lengths which may differ in units and always differ in values as Tail's length includes Vec's serialization of its own length. For example, serialization of a Tail<Vec<u16>> containing 0 and 1, leads to 02 00 00 00 00 00 00 00 00 00 01 00 0c 00 00 00 00 00 00 00. As a second example, serialization of Tail<Vec<u8>> containing 0 and 1, leads to 02 00 00 00 00 00 00 00 00 01 0a 00 00 00 00 00 00 00.
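
The serialization rules above can be sketched in a few lines of Python. This is only an illustration of the rules as written in this section, not code from the mla crate; the helper names (u16, vec, tail, read_opts...) are made up for the example.

```python
import struct

# Little-endian unsigned integers, as described above.
def u8(n): return struct.pack("<B", n)
def u16(n): return struct.pack("<H", n)
def u64(n): return struct.pack("<Q", n)

def vec(items, encode):
    """Vec<T>: element count as a u64, then each serialized element."""
    return u64(len(items)) + b"".join(encode(item) for item in items)

def tail(serialized):
    """Tail<T>: the serialized T followed by its length in bytes as a u64."""
    return serialized + u64(len(serialized))

def read_opts(buf, pos=0):
    """Read an Opts at pos. Returns (raw option bytes or None, next position)."""
    tag = buf[pos]
    if tag == 0:
        return None, pos + 1
    if tag != 1:
        raise ValueError("unknown Opts tag")
    (count,) = struct.unpack_from("<Q", buf, pos + 1)
    start = pos + 9
    if start + count > len(buf):
        raise ValueError("truncated Opts")
    # The content is read (and skipped) even though no option is defined yet.
    return buf[start:start + count], start + count

print(tail(vec([0, 1], u16)).hex(" "))
print(tail(vec([0, 1], u8)).hex(" "))
```

The two printed lines reproduce the Tail<Vec<u16>> and Tail<Vec<u8>> examples given above.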

MLA Header

  • An MLA file begins with the mla_header_magic ASCII magic: "MLAFAAAA".
  • mla_header_magic is followed by the format_version format version number as a u32.
  • format_version is followed by a header_options field of type Opts.
  • header_options is followed by archive content archive_content, described below.
  • archive_content is followed by a footer_options field of type Tail<Opts> to enable determining archive_content's end when reading from the end of the MLA file.
  • footer_options is followed by the footer_magic ASCII magic "EMLAAAAA", terminating the archive.

archive_content consists of a serialized MLA entries layer documented below, transformed with zero or more layers, documented below too. A layer consists of a u64 layer magic followed by its data. Layer order plays an important security role, so the signature layer has to be above the encryption layer which has to be above the compression layer. This must be enforced by writers and readers. Readers should ensure users explicitly choose if they allow an archive without signature or without encryption.
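
As an illustration of the header and footer layout, the sketch below isolates archive_content from a complete file held in memory. It is a minimal reading of the rules above, not the mla reader: split_archive and skip_opts are hypothetical names, and a real reader would also check format_version and work on a stream.

```python
import struct

HEADER_MAGIC = b"MLAFAAAA"
FOOTER_MAGIC = b"EMLAAAAA"

def skip_opts(buf, pos):
    """Return the position just after the Opts starting at pos."""
    tag = buf[pos]
    if tag == 0:
        return pos + 1
    if tag != 1:
        raise ValueError("unknown Opts tag")
    (count,) = struct.unpack_from("<Q", buf, pos + 1)
    return pos + 9 + count

def split_archive(data):
    """Return (format_version, archive_content) for a whole MLA file."""
    if data[:8] != HEADER_MAGIC or data[-8:] != FOOTER_MAGIC:
        raise ValueError("bad magic")
    (version,) = struct.unpack_from("<I", data, 8)
    content_start = skip_opts(data, 12)
    # footer_options is a Tail<Opts>: its u64 length sits just before footer_magic.
    (tail_len,) = struct.unpack_from("<Q", data, len(data) - 16)
    content_end = len(data) - 16 - tail_len
    if content_end < content_start:
        raise ValueError("inconsistent footer length")
    return version, data[content_start:content_end]

# Smallest possible skeleton, with a placeholder instead of real layers.
sample = (HEADER_MAGIC + struct.pack("<I", 2) + b"\x00" + b"CONTENT"
          + b"\x00" + struct.pack("<Q", 1) + FOOTER_MAGIC)
print(split_archive(sample))
```

The bytes returned as archive_content would then start with the u64 magic of the outermost layer.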

Signature layer

  • The layer signature_layer_magic ASCII magic is "SIGMLAAA".
  • signature_layer_magic is followed by a signature_header_options of type Opts.
  • signature_header_options is followed by sig_inner_layer, consisting of the inner layer bytes.
  • sig_inner_layer is followed by signature_footer_options of type Tail<Opts>.
  • signature_footer_options is followed by signature_data serialized as a Tail<Vec<u8>> whose content is described below.

signature_data is a Vec<u8> whose content bytes consist of a sequence of SignatureDataWHdr. A SignatureDataWHdr is a signature_method_id u16, followed by a signature_data sequence of bytes depending on the signature_method_id value. For the moment, there are two valid signature_method_id values: 0 and 1. 0 maps to MLAEd25519SigMethod, 1 maps to MLAMLDSA87SigMethod. These methods are described in doc/CRYPTO.md. Their input starts at and includes mla_header_magic, and runs up to and including sig_inner_layer. In the current version, for the signature layer to be considered verified, the reader must verify at least one SignatureDataWHdr of each signature_method_id.

Encryption layer

The layer encryption_layer_magic ASCII magic is "ENCMLAAA". encryption_layer_magic is followed by encryption_header_options of type Opts. encryption_header_options is followed by an encryption_method_id u16, described below. encryption_method_id is followed by encryption_metadata, a sequence of bytes described below. encryption_metadata is followed by encrypted_inner_layer, a sequence of bytes described below. encrypted_inner_layer is followed by one end_of_encrypted_inner_layer_magic ASCII magic "ENCMLAAB". end_of_encrypted_inner_layer_magic is followed by one encryption_footer_options of type Tail<Opts>.

The only valid encryption_method_id for the moment is 0. It is the encryption method described in CRYPTO.md.

encryption_metadata depends on the previous encryption_method_id value. For encryption_method_id 0, encryption_metadata is a Vec<PerRecipientEncapsulatedKey> followed by a KeyCommitmentAndTag.

A PerRecipientEncapsulatedKey is an mlkem1024_encapsulated_s field followed by an ed25519_encapsulated_s field, followed by an m0_encrypted_ss field and a prkem_tag field. As described in more detail in CRYPTO.md, m0_encrypted_ss is the AES-256-GCM encrypted global_secret. The AES key used to encrypt this global_secret is recovered from mlkem1024_encapsulated_s and ed25519_encapsulated_s. prkem_tag is the GCM tag associated with m0_encrypted_ss.

mlkem1024_encapsulated_s is a sequence of 1568 bytes corresponding to the ciphertext output of ML-KEM.Encaps as described in FIPS 203. ed25519_encapsulated_s is a sequence of 32 bytes corresponding to the output of X25519 as described in RFC 7748. m0_encrypted_ss is a 32-byte sequence. prkem_tag is a 16-byte sequence.

KeyCommitmentAndTag is the key commitment described in CRYPTO.md. It is a 64-byte ciphertext followed by a 16-byte tag.

encrypted_inner_layer is the AES-256-GCM encrypted inner layer with the global_secret key. encrypted_inner_layer is a sequence of M0EncryptedChunk followed by one M0FinalEncryptedChunk. Each M0EncryptedChunk has an ASCII magic "M0ENCCNK" followed by a u64 chunk_number, followed by an encrypted_content (128*1024)-bytes field (last M0EncryptedChunk may be smaller) followed by a tag 16-bytes field. encrypted_content is the inner_layer encrypted chunk, and tag its GCM tag. M0FinalEncryptedChunk has an ASCII magic "M0FNLBLK" followed by a 10-bytes encrypted_content field followed by a 16-bytes tag. chunk_number is the number of the M0EncryptedChunk in the stream, starting at 1.

To protect from a truncation attack, before using an archive, it must be checked that the tag of the M0FinalEncryptedChunk is correct and that its decrypted encrypted_content is the ASCII FINALBLOCK.
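
The fixed field sizes above make the arithmetic of seeking straightforward. The sketch below works it out under two assumptions that this section does not state explicitly: AES-GCM ciphertext has the same length as its plaintext, and no empty M0EncryptedChunk is emitted when the inner layer length is a multiple of 128 KiB. Function names are illustrative only.

```python
CHUNK_PLAINTEXT = 128 * 1024          # encrypted_content size of a full chunk
CHUNK_OVERHEAD = 8 + 8 + 16           # "M0ENCCNK" + chunk_number u64 + GCM tag
CHUNK_ON_DISK = CHUNK_PLAINTEXT + CHUNK_OVERHEAD
FINAL_CHUNK = 8 + 10 + 16             # "M0FNLBLK" + encrypted "FINALBLOCK" + tag
PER_RECIPIENT_KEY = 1568 + 32 + 32 + 16  # PerRecipientEncapsulatedKey size

def locate(plain_offset):
    """Map an inner-layer offset to (chunk_number, chunk start, offset in chunk).

    The chunk start is relative to the beginning of encrypted_inner_layer.
    """
    index, within = divmod(plain_offset, CHUNK_PLAINTEXT)
    return index + 1, index * CHUNK_ON_DISK, within

def encrypted_inner_layer_size(plain_len):
    """Size of encrypted_inner_layer for plain_len bytes (see assumptions above)."""
    full, rest = divmod(plain_len, CHUNK_PLAINTEXT)
    size = full * CHUNK_ON_DISK
    if rest:
        size += CHUNK_OVERHEAD + rest
    return size + FINAL_CHUNK

print(PER_RECIPIENT_KEY)
print(locate(300000))
print(encrypted_inner_layer_size(300000))
```

For instance, byte 300000 of the inner layer falls in the chunk with chunk_number 3, since chunk numbering starts at 1.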

Compression layer

The layer compression_layer_magic ASCII magic is "COMLAAAA". compression_layer_magic is followed by compression_header_options of type Opts. compression_header_options is followed by compressed_data, a sequence of bytes explained below. compressed_data is followed by compression_footer_options of type Tail<Opts>. compression_footer_options is followed by sizes_info of type Tail<SizesInfo>, where SizesInfo is explained below.

The inner layer is split in 4 * 1024 * 1024-byte chunks, except for the last chunk which may be smaller. Each chunk is compressed with brotli. The resulting size of each compressed chunk is recorded in sizes_info. SizesInfo has a first field compressed_sizes, which is a Vec<u32> corresponding to an ordered list of compressed chunk sizes, and a second field last_block_uncompresed_size as a u32 indicating the uncompressed size of the last inner layer chunk.

compressed_data is the concatenation of each compressed chunk.

The compression layer footer information can be retrieved by first reading the value of sizes_info.tail_length at the end of the layer, then reading the preceding sizes_info.tail_length-bytes.
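
The following sketch shows how sizes_info could be read back from the end of the layer and used to find the compressed chunk holding a given uncompressed offset. It follows the description above only; the function names are hypothetical and the brotli decompression itself is left out.

```python
import struct

CHUNK = 4 * 1024 * 1024  # uncompressed chunk size

def serialize_sizes_info(compressed_sizes, last_uncompressed):
    """Tail<SizesInfo>: Vec<u32> sizes, u32 last size, then the u64 tail length."""
    body = struct.pack("<Q", len(compressed_sizes))
    body += b"".join(struct.pack("<I", s) for s in compressed_sizes)
    body += struct.pack("<I", last_uncompressed)
    return body + struct.pack("<Q", len(body))

def read_sizes_info(layer):
    """Parse the Tail<SizesInfo> found at the very end of layer."""
    (tail_len,) = struct.unpack_from("<Q", layer, len(layer) - 8)
    if tail_len > len(layer) - 8:
        raise ValueError("tail length larger than available data")
    body = layer[len(layer) - 8 - tail_len:len(layer) - 8]
    (count,) = struct.unpack_from("<Q", body, 0)
    sizes = list(struct.unpack_from(f"<{count}I", body, 8))
    (last,) = struct.unpack_from("<I", body, 8 + 4 * count)
    return sizes, last

def locate(sizes, uncompressed_offset):
    """Return (chunk index, chunk offset in compressed_data, offset in chunk)."""
    index, within = divmod(uncompressed_offset, CHUNK)
    return index, sum(sizes[:index]), within

layer_end = b"...compressed bytes..." + serialize_sizes_info([100, 250, 40], 17)
sizes, last = read_sizes_info(layer_end)
print(sizes, last, locate(sizes, 5 * 1024 * 1024))
```

Only the chunk found at the returned offset has to be decompressed to serve the read, which is what makes the layer seekable.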

MLA entries layer

The layer entries_layer_magic ASCII magic is "MLAENAAA". entries_layer_magic is followed by entries_header_options of type Opts. entries_header_options is followed by entries_data, a sequence of bytes described below. entries_data is followed by entries_footer of type Tail<EntriesFooter>, where EntriesFooter is described below. entries_footer is followed by entries_footer_options of type Tail<Opts>.

entries_data is a succession of ArchiveEntryBlock of different types. An ArchiveEntryBlock begins with an ASCII magic "MAEB" followed by an ArchiveEntryBlockType u8 determining the type of ArchiveEntryBlock:

  • 0x00 means EntryStart
  • 0x01 means EntryContentChunk
  • 0xFE means EndOfArchiveData
  • 0xFF means EndOfEntry

If the ArchiveEntryBlockType is EntryStart, it is followed by an ArchiveEntryId u64, an EntryName and an entry_start_options of type Opts. An EntryName is a Vec<u8> described in doc/ENTRY_NAME.md.

If the ArchiveEntryBlockType is EntryContentChunk, it is followed by an ArchiveEntryId, a content_options of type Opts and a Vec<u8> entry_content_data.

If the ArchiveEntryBlockType is EndOfEntry, it is followed by an ArchiveEntryId, an end_options of type Opts and a hash serialized as 32 u8.

If the ArchiveEntryBlockType is EndOfArchiveData, it is followed by nothing.
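
A minimal sketch of the four block types, assuming every Opts field is empty (tag 0) and that the hash is the SHA-256 of the entry content as stated in the Explanations section below. The helper names are illustrative and do not come from the mla crate.

```python
import hashlib
import struct

MAGIC = b"MAEB"
ENTRY_START, ENTRY_CONTENT_CHUNK = 0x00, 0x01
END_OF_ARCHIVE_DATA, END_OF_ENTRY = 0xFE, 0xFF
NO_OPTS = b"\x00"  # Opts with no option present

def u64(n):
    return struct.pack("<Q", n)

def vec_u8(data):
    """Vec<u8>: length as a u64 followed by the bytes."""
    return u64(len(data)) + data

def entry_start(entry_id, name):
    return MAGIC + bytes([ENTRY_START]) + u64(entry_id) + vec_u8(name) + NO_OPTS

def entry_content_chunk(entry_id, data):
    return MAGIC + bytes([ENTRY_CONTENT_CHUNK]) + u64(entry_id) + NO_OPTS + vec_u8(data)

def end_of_entry(entry_id, full_content):
    digest = hashlib.sha256(full_content).digest()  # 32 bytes
    return MAGIC + bytes([END_OF_ENTRY]) + u64(entry_id) + NO_OPTS + digest

def end_of_archive_data():
    return MAGIC + bytes([END_OF_ARCHIVE_DATA])

# One entry added in two chunks, then the end-of-data marker.
entries_data = (entry_start(1, b"a/b.txt")
                + entry_content_chunk(1, b"hel")
                + entry_content_chunk(1, b"lo")
                + end_of_entry(1, b"hello")
                + end_of_archive_data())
print(len(entries_data))
```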

EntriesFooter is a Vec<EntryNameInfoMapElt>. An EntryNameInfoMapElt is an EntryName followed by an entry_blocks_info which is a Vec<EntryBlockInfo> explained below. For reproducibility, the EntriesFooter Vec is sorted by entry name (lexicographically by byte values) before being serialized.

EntryBlockInfo has two fields: block_offset and block_size. The block_offset field is a u64 indicating at which offset from the beginning of the MLA entries layer an ArchiveEntryBlock can be found for the given EntryName. The block_size field is a u64 indicating the size in bytes of the block content (0 except for EntryContentChunk). If it is an EntryContentChunk with entry_content_data containing 1 byte, block_size is 1. All EntryBlockInfos for each entry are recorded in entry_blocks_info, in ascending order of offset.

Explanations

An archive entry entry_i in the archive always starts with an EntryStart, giving its name and unique ID i.

entry_i content is the concatenation of all EntryContentChunks entry_content_data fields with ArchiveEntryId value i.

Once the EndOfEntry for entry_i is reached, the entry is completely read. Its content SHA-256 hash can be verified with the EndOfEntry.hash.

Between the last EndOfEntry block and entries_footer, there is a single EndOfArchiveData block. It is used when trying to read a truncated archive, to correctly separate the actual archive data from the footer.

As blocks from different entries can be interleaved, the entry_blocks_info offsets for an entry are the offsets in entries_data of its blocks.

For instance, if the blocks are:

Off0: [EntryStart ID 1]
Off1: [EntryStart ID 2]
Off2: [EntryContentChunk ID 1]
Off3: [EntryContentChunk ID 1]
Off4: [EntryContentChunk ID 2]
Off5: [EndOfEntry ID 1]
...

The offsets for the entry with ID 1 will be Off0, Off2, Off3 and Off5.
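
The same bookkeeping can be sketched as follows. The serialized lengths and content sizes are made-up values for the example, and offsets are counted from the first block; an actual footer would measure them from the base defined for block_offset above.

```python
from collections import defaultdict

# (block type, ArchiveEntryId, serialized length, entry_content_data length)
blocks = [
    ("EntryStart", 1, 29, 0),
    ("EntryStart", 2, 25, 0),
    ("EntryContentChunk", 1, 24, 2),
    ("EntryContentChunk", 1, 27, 5),
    ("EntryContentChunk", 2, 23, 1),
    ("EndOfEntry", 1, 46, 0),
]

def collect_block_info(blocks):
    """Return {entry id: [(block_offset, block_size), ...]} in ascending offset."""
    info = defaultdict(list)
    offset = 0
    for kind, entry_id, serialized_len, content_len in blocks:
        # block_size is 0 except for EntryContentChunk
        block_size = content_len if kind == "EntryContentChunk" else 0
        info[entry_id].append((offset, block_size))
        offset += serialized_len
    return dict(info)

for entry_id, infos in collect_block_info(blocks).items():
    print(entry_id, infos)
```

Entry 1 gets four EntryBlockInfo records (the equivalents of Off0, Off2, Off3 and Off5), and summing its block_size values gives the entry length.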

Entry name documentation

An archive can store entries associated with a name. These entries may or may not represent OS filesystem files. And their name may or may not represent an OS file system path.

An entry name is a nonempty sequence of bytes (maximum length of 65536).

Please keep in mind that names, interpreted as paths or not, may contain arbitrary bytes like slash, backslash, .., C:\\{}...], newline, spaces, carriage return, terminal escape sequences, Unicode chars like U+0085 or RTLO, HTML, SQL, semicolons, homoglyphs, etc.

Interpretation of an entry name as an OS filesystem file path

If it is to be interpreted as a file path, the underlying bytes must consist of ASCII slash separated components and not begin with a slash. The rules for each component are:

  • must not be empty
  • must not contain any ASCII NUL byte
  • must not be ASCII dot
  • must not be two ASCII dots

If it is to be interpreted as a Windows file path, in addition to previous rules:

  • No byte should be an ASCII backslash (separators are represented by an ASCII slash).
  • The second byte of the whole path, if any, should not be an ASCII colon (:).
  • Every component must be encoded as UTF-8.

These rules are checked by the accompanying Rust implementation (EntryName::to_pathbuf).

Even if respecting these rules, the OS may see the resulting path as invalid.

Please keep in mind that two different names may map to the same path on the OS (e.g. Windows case insensitivity).

When given a path as input, before being converted to an entry name by EntryName::from_path and mlar, the path is normalized by keeping only Normal std::path::Components and popping the previous component, if any, when a .. is encountered.
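
The component rules listed earlier in this section can be checked with a short function. This is a sketch of the rules as written here, not a port of EntryName::to_pathbuf; the function name and its windows flag are hypothetical.

```python
def is_valid_path_name(name: bytes, windows: bool = False) -> bool:
    """Check whether an entry name may be interpreted as a file path."""
    if not name or len(name) > 65536:
        return False
    # A leading, trailing or doubled slash produces an empty component.
    for component in name.split(b"/"):
        if component in (b"", b".", b".."):
            return False
        if b"\x00" in component:
            return False
        if windows:
            try:
                component.decode("utf-8")
            except UnicodeDecodeError:
                return False
    if windows:
        if b"\\" in name:
            return False
        if name[1:2] == b":":
            return False
    return True

for candidate in (b"a/b.txt", b"/a", b"a/b/../d", b"a//b", b"m:abcd"):
    print(candidate, is_valid_path_name(candidate), is_valid_path_name(candidate, windows=True))
```

As noted above, a name passing these checks can still be rejected by the OS, so extraction code has to handle that failure as well.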

String representation of entry names

To prevent some security risks, proposed string representations of entry names are given with EntryName::to_pathbuf_escaped_string and EntryName::raw_content_to_escaped_string and are used by mlar.

Other representations may be preferred depending on their usage context.

The idea of this representation is that unwanted bytes are replaced with a percent and their hexadecimal representation. Details follow.

For an entry name interpreted as raw bytes, below generic escaping is applied with ASCII alphanumeric, dot, dash and underscore as preserved bytes. This is used by mlar list --raw-escaped-names.

For an entry name interpreted as a path, below generic escaping is applied with ASCII alphanumeric chars, ASCII dot and ASCII slash as preserved bytes. This is used by default by mlar list.

Generic escaping, implemented by helpers::mla_percent_escape

A bytes_to_preserve parameter tells which bytes are not escaped. For every input byte:

  • If listed in bytes_to_preserve then it will be output without transformation.
  • Else, it will be replaced by %xx where xx is its hexadecimal representation.

Generic unescaping, implemented by helpers::mla_percent_unescape

A bytes_to_allow parameter tells which bytes are not escaped. Unescaping fails if fed with anything other than bytes listed in bytes_to_allow and %xx where xx is the hexadecimal representation of a byte not listed in bytes_to_allow. Otherwise it reverses the process described in Generic escaping.
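
Both directions can be sketched as below. This is an illustration of the two descriptions above rather than the helpers::mla_percent_escape code: the Python names are made up, and accepting only lowercase hexadecimal digits when unescaping is an assumption drawn from the lowercase output shown in the examples.

```python
import string

ALNUM = string.ascii_letters + string.digits
PATH_PRESERVED = (ALNUM + "./").encode()   # entry name shown as a path
RAW_PRESERVED = (ALNUM + ".-_").encode()   # entry name shown as raw bytes
HEX_DIGITS = b"0123456789abcdef"

def percent_escape(data: bytes, preserve: bytes) -> str:
    """Keep bytes listed in preserve, replace every other byte with %xx."""
    return "".join(chr(b) if b in preserve else f"%{b:02x}" for b in data)

def percent_unescape(text: bytes, allow: bytes) -> bytes:
    """Reverse percent_escape, rejecting any non-canonical input."""
    out = bytearray()
    i = 0
    while i < len(text):
        byte = text[i]
        if byte in allow:
            out.append(byte)
            i += 1
            continue
        digits = text[i + 1:i + 3]
        if byte != ord("%") or len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
            raise ValueError("unexpected byte or malformed escape")
        value = int(digits.decode("ascii"), 16)
        if value in allow:
            raise ValueError("escape used for a byte that must stay as-is")
        out.append(value)
        i += 3
    return bytes(out)

print(percent_escape(b"etc/os-release", PATH_PRESERVED))
print(percent_escape(b"c:/\x00;", RAW_PRESERVED))
```

Rejecting escapes of allowed bytes means each name has exactly one accepted string form, so two different strings never unescape to the same entry name.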

Examples

For each following entry name found serialized in an archive, here is how they are represented as strings when interpreted as path:

  • empty bytes -> invalid (even interpreted as arbitrary bytes)
  • /a -> invalid path (root directory)
  • a/b/../d -> invalid path (path traversal)
  • a/b/.. -> invalid path (path traversal)
  • a//b -> invalid path (not normalized)
  • a/./b -> invalid path (not normalized)
  • ./b -> invalid path (not normalized)
  • a/. -> invalid path (not normalized)
  • aNULb (where NUL here represents an ASCII NUL byte) -> invalid path
  • m:abcd -> invalid path on Windows (: as second byte), m%3aabcd on UNIX-like
  • a\b (where \ represents an ASCII backslash, not an escaped b) -> invalid path on Windows (contains backslash), a%5cb on UNIX-like
  • a/b.txt -> a/b.txt
  • a/b!c -> a/b%21c

Cryptography in MLA

MLA uses cryptographic primitives essentially for the purpose of the Encryption and Signature layers.

This document introduces the primitives used, arguments for the choice made and some security considerations.

Keys used for encryption and signature are generated and used separately.

Signature

As described in FORMAT.md, an archive can be signed. Implementations must ensure users explicitly choose whether a signature is made and verified.

A PQ/T key consists of a pair of a post-quantum key and a traditional key. An archive is considered correctly signed for a PQ/T key if and only if it is correctly signed for its post-quantum part AND its traditional part.

Two signature methods are available and must be used together. Signature method input is called m. The SHA-512 hash h of m may be computed in a first step.

For method MLAEd25519SigMethod, signature_data is the Ed25519ph (as described in RFC 8032 [1]) signature of m (not h, even though h can be used for computing the result). The context given as parameter to Ed25519ph is the ASCII MLAEd25519SigMethod. Signature verification and key generation are done as described in RFC 8032. Key storage is described in KEY_FORMAT.md.

For method MLAMLDSA87SigMethod, signature_data is the ML-DSA-87 signature (as described in FIPS 204 [2], not HashML-DSA) of h (not m this time) with the ASCII MLAMLDSA87SigMethod as context. Signature verification and key generation are done as described in FIPS 204. Key storage is described in KEY_FORMAT.md.

An archive can be signed with multiple signing keys. If a user provides a set of PQ/T keys for signature verification, implementations should give a way for the user to know if the archive is correctly signed for at least one key. Implementations may give a way for users to know if the archive is correctly signed for all keys. Users must explicitly know if they are validating against at least one or all keys. Implementations may also give a way for users to know which PQ/T keys correspond to valid signatures or their number.
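
The acceptance rules above reduce to a small amount of boolean logic, sketched here. The verification of each Ed25519ph and ML-DSA-87 signature is assumed to have been done elsewhere; KeyCheck and the two policy functions are hypothetical names, not part of the mla API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeyCheck:
    """Outcome of verifying an archive against one PQ/T verification key."""
    ed25519_ok: bool   # traditional part (MLAEd25519SigMethod)
    mldsa87_ok: bool   # post-quantum part (MLAMLDSA87SigMethod)

    @property
    def valid(self) -> bool:
        # Correctly signed for a PQ/T key only if BOTH parts verify.
        return self.ed25519_ok and self.mldsa87_ok

def signed_by_at_least_one(checks) -> bool:
    return any(check.valid for check in checks)

def signed_by_all(checks) -> bool:
    checks = list(checks)
    return bool(checks) and all(check.valid for check in checks)

results = [KeyCheck(True, True), KeyCheck(True, False)]
print(signed_by_at_least_one(results), signed_by_all(results))
```

Keeping the two policies as separate functions mirrors the requirement that users explicitly know which one they are relying on.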

Encryption high-level overview

Objectives

The purpose of the Encryption layer is to provide confidentiality and data integrity of the inner layer.

These objectives are obtained using:

  • Authenticated encryption
  • Asymmetric cryptography, for several recipients

This layer does not provide signature.

General design guidelines

  1. The size and the initial computation time used for the encryption needs are not a big issue, if kept reasonable. Indeed, in the author's understanding, MLA archives are usually several MB long and the computation time is primarily spent in compression/decompression and encryption/decryption of the data

As a result, some optimizations have not been performed, which helps keep a hopefully auditable and conservative design.

  2. Only one encryption method and key type is available, to avoid confusion and potential corner-case errors

  3. When possible, use audited code and test vectors

Main bricks: Encryption

The data is encrypted using AES-256-GCM, an AEAD algorithm. To offer a seekable layer, data is encrypted using chunks of 128 KiB each, except for the last one. These encrypted chunks are all present with their associated tag. Tags are checked during decryption before returning data to the upper layer.

To prevent truncation attacks, another chunk is added at the end corresponding to the encryption of the ASCII string "FINALBLOCK" with "FINALAAD" as additional authenticated data. Any usage of the archive must check correct decryption (including tag verification) of this last block.

The key, the base nonce and the nonce derivation for each data chunk are computed following HPKE (RFC 9180) [3]. HPKE is parameterized with:

  • Mode: "Base" (no PSK, no sender authentication)
  • KDF: HKDF-SHA512
  • AEAD: AES-256-GCM
  • KEM: Multi-Recipient Hybrid KEM, a custom KEM described later in this document
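
For reference, the per-message nonce derivation defined by RFC 9180 (ComputeNonce) is a XOR between the base nonce and the big-endian sequence number. The sketch below shows that function only; how MLA maps a chunk to its sequence number is not described in this section, so the values used are purely illustrative.

```python
NONCE_LEN = 12  # Nn for AES-256-GCM in RFC 9180

def compute_nonce(base_nonce: bytes, seq: int) -> bytes:
    """RFC 9180 ComputeNonce: base_nonce XOR I2OSP(seq, Nn)."""
    if len(base_nonce) != NONCE_LEN:
        raise ValueError("base nonce must be 12 bytes")
    seq_bytes = seq.to_bytes(NONCE_LEN, "big")
    return bytes(a ^ b for a, b in zip(base_nonce, seq_bytes))

base = bytes(range(12))  # illustrative value, not a real base nonce
print(compute_nonce(base, 0).hex())
print(compute_nonce(base, 1).hex())
```

Because the nonce depends only on the base nonce and the sequence number, a reader can decrypt any chunk directly without processing the previous ones.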

Thus, only one cryptography suite is available for now. If this setting ends up broken by cryptanalysis, we will move users onward to the next MLA version, using appropriate cryptography. Therefore, MLA lacks cryptographic agility, a property encouraged by ANSSI regarding post-quantum cryptography [4]. Still, HPKE improves this aspect of MLA [3].

Full details are available below.

Additionally, "key commitment" is included using a method described in [5] and detailed in [6].

Main bricks: Asymmetric encryption

Since the format v2, the Encrypt layer uses post-quantum cryptography (PQC) through a hybrid approach, to avoid "Harvest now, decrypt later" attacks.

The algorithms used are:

  • X25519 for pre-quantum cryptography, using DHKEM (RFC 9180) [3]
  • FIPS 203 [7] (CRYSTALS-Kyber) MLKEM-1024 for post-quantum cryptography

The two keys are mixed together (see below) in a manner keeping the IND-CCA2 properties of the two algorithms.

Sending to multiple recipients is achieved using the following process:

  1. For each recipient, a per-recipient Hybrid KEM is done, leading to a per-recipient shared secret
  2. Each per-recipient shared secret is derived through HPKE to obtain a key and a nonce
  3. Each per-recipient key and nonce are used to encrypt (and, on the recipient side, decrypt) a secret shared by all recipients

This final secret is the one later used as an input to the encryption layer. The whole process can be viewed as a KEM encapsulation for multiple recipients.

Encryption Details

The following sections describe the whole process for data encryption and seed derivation. They are meant to ease the understanding of the code and MLA format re-implementation.

The interested reader could also look at the Rust implementation in this repository for more details. The implementation also includes tests (including some test vectors) and comments.

Asymmetric encryption - Per-recipient KEM

Notations
  • , , and : respectively the X25519 public key and secret key, and the MLKEM-1024 (FIPS 203 7) encapsulating key and decapsulating key
  • and : key encapsulation methods with X25519, as defined in RFC 9180, section 4 3
  • and : key encapsulation methods on MLKEM-1024, as defined in FIPS 203 7
  • : a 32-bytes secret, produced by a cryptographic RNG. Informally, this is the secret shared among recipients, encapsulated separately for each recipient
  • : KeySchedule function from RFC 9180 3, instanciated with:
    • Mode: "Base"
    • KDF: HKDF-SHA-512
    • AEAD: AES-256-GCM
    • KEM: a custom KEM ID, numbered 0x1120
  • : AES-256-GCM encryption, returning the encrypted data concatened with the associated tag
  • : AES-256-GCM decryption, returning the decrypted data after verifying the tag
  • and : respectively produce a byte string encoding the data in argument, and produce the data from the byte string in argument
Process

To encrypt to a target recipient , knowing and :

  1. Compute shared secrets and ciphertexts for both KEM:

  1. Combine the shared secrets (implemented in mla::crypto::hybrid::combine):
def combine(ss1, ss2, ct1, ct2):
    uniformly_random_ss1 = HKDF-SHA512-Extract(
        salt=0,
        ikm=ss1
    )
    key = HKDF(
        salt=uniformly_random_ss1,
        ikm=ss2,
        info=ct1 . ct2
    )
    return key

  1. Wrap the recipients' shared secret:

Informally, this process can be viewed as a per-recipient KEM taking a shared secret , the recipient public key (made of the elliptic curve and the PQC public keys) and returning a ciphertext .


To obtain the shared secret from for a recipient knowing and :

  1. Compute the recipient's shared secret:

  2. Try to decrypt the secret shared among recipients:

If the decryption succeeds, return . Otherwise, return an error.

Arguments
  • Using HPKE (RFC 9180 3) for both elliptic curve encryption (DHKEM) and post-quantum encryption (MLKEM) offers several benefits8:
    • Easier re-implementation of the MLA format, thanks to the availability of HPKE in cryptographic libraries
    • An existing formal analysis 9
    • Easier code and security auditing, thanks to the use of well-known building blocks
    • Availability of test vectors in the RFC, making the implementation more reliable
    • If signature is added to MLA in a future version, it could also be integrated using HPKE
  • To the knowledge of the author, no HPKE algorithm has been standardized for post-quantum hybridization, hence the custom algorithm
  • FIPS 203 is used as, at the time of writing:
    • It is the only KEM algorithm standardized by the NIST 10
    • It is in line with the French suggestions 4 for PQ cryptography
  • The MLKEM-1024 mode is used for stronger security, and to limit the consequences of future advances 11 12. This is also the choice of other industry standards 13 14
  • The shared secret from the two KEMs is produced using a "Nested Dual-PRF Combiner", proved in 15 (3.3):
    • Using a concatenation scheme that includes the ciphertexts preserves IND-CCA2 security if one of the two underlying schemes is IND-CCA2, as proved in 16 and explained in 17
    • TLS 18 uses a similar scheme, and IKE 19 also uses a concatenation scheme
    • This kind of scheme follows ANSSI recommendations 4
    • HKDF can be considered as a Dual-PRF if both inputs are uniformly random 20. In MLA, the combine method is called with a shared secret from ML-KEM, and the resulting ECC key derivation -- both are uniformly random
    • To avoid potential mistakes in the future, or a misuse of this method, the "Nested Dual-PRF Combiner" is used instead of the "Dual-PRF Combiner" (also from 15). Indeed, this combiner forces the "salt" part of HKDF to be uniformly random through an additional PRF call, ensuring the following HKDF is indeed a Dual-PRF

Asymmetric encryption - Multi-Recipient Hybrid KEM

Intuition

A KEM, such as the one described above, returns a fresh and distinct secret for each recipient.

To obtain a "meta-KEM" that works for multiple recipients, the strategy is to use the per-recipient KEM to encrypt a common secret.

This whole process can then be viewed as a multi-recipient KEM, taking as input a list of public keys and returning a shared secret and a ciphertext made of the concatenation of each per-recipient ciphertext.

To avoid marking which per-recipient ciphertext corresponds to which recipient public key, the decapsulation process "brute-forces" each ciphertext for a given decapsulation key. If the decryption works (with the associated tag), the shared secret is returned.

Key commitment, which guards against a rather unlikely mismatch, is further ensured inside the Encrypt layer (see below).
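
The trial-decapsulation idea can be sketched in Python as follows. This is only an illustration of the control flow: the per-recipient KEM is replaced by a toy symmetric wrap (a hash-derived pad plus an HMAC tag) because the standard library has neither X25519, ML-KEM nor AES-GCM. The names `toy_wrap`, `toy_unwrap`, `encapsulate` and `decapsulate` are hypothetical and do not correspond to MLA's API.

```python
import hashlib
import hmac
import os


def toy_wrap(recipient_key: bytes, shared_secret: bytes) -> bytes:
    """Toy stand-in for the per-recipient KEM (NOT the MLA construction)."""
    nonce = os.urandom(16)
    pad = hashlib.sha512(recipient_key + nonce).digest()[: len(shared_secret)]
    body = bytes(a ^ b for a, b in zip(shared_secret, pad))
    tag = hmac.new(recipient_key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def toy_unwrap(recipient_key: bytes, ciphertext: bytes):
    """Return the shared secret if the tag verifies, None otherwise."""
    nonce, body, tag = ciphertext[:16], ciphertext[16:-32], ciphertext[-32:]
    expected = hmac.new(recipient_key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    pad = hashlib.sha512(recipient_key + nonce).digest()[: len(body)]
    return bytes(a ^ b for a, b in zip(body, pad))


def encapsulate(recipient_keys):
    """One common secret, wrapped once per recipient, with no recipient label."""
    shared_secret = os.urandom(32)
    return shared_secret, [toy_wrap(key, shared_secret) for key in recipient_keys]


def decapsulate(recipient_key: bytes, ciphertexts):
    """Try every per-recipient ciphertext until one authenticates."""
    for ciphertext in ciphertexts:
        shared_secret = toy_unwrap(recipient_key, ciphertext)
        if shared_secret is not None:
            return shared_secret
    raise ValueError("no per-recipient ciphertext could be opened with this key")
```

The cost of hiding the recipient mapping is linear in the number of recipients: a legitimate recipient may have to attempt every entry before finding its own.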

Process

The "Per-recipient KEM" process described above is noted:

  • , taking a pair of public keys ( and ), a shared secret and returning a recipient ciphertext
  • , taking a pair of private keys ( and ), a ciphertext and returning either a shared secret if the recipient is a legitimate recipient (if the AEAD decryption works), or an error otherwise

is a cryptographically secure RNG producing an n-byte secret.

To encapsulate to a list of recipients :


To decapsulate from a ciphertext , knowing a recipient private key :









Arguments
  • The shared secret is cryptographically generated, so it can later be used as a shared secret in HPKE encryption
  • This secret is unique per archive, as it is generated on archive creation. Even "converting" or "repairing" an archive with the mlar CLI generates a fresh secret: no in-place edit feature is implemented, even though it would be doable. Hence, a new random symmetric key is used to encrypt the content when "converting" or "repairing" an archive.
  • Even if the AEAD decryption worked for a non-legitimate recipient, for instance following an intentional manipulation, the shared secret obtained will later be checked using Key commitment before decrypting actual data (see below)
  • Optimization would have been possible here, such as sharing a common ephemeral key for the DHKEM. But the size gain is small compared to the MLKEM ciphertext size, and it would move the implementation away from the DHKEM in RFC 9180

Encryption

Notation

The "Multi-Recipient Hybrid KEM" process described above is noted:

  • , taking a list of public keys and returning a shared secret and a ciphertext
  • , taking a pair of private keys ( and ), a ciphertext and returning either a shared secret if the recipient is a legitimate recipient (if the AEAD decryption works), or an error otherwise

KeyCommitmentChain is defined as the 64-byte array: -KEY COMMITMENT--KEY COMMITMENT--KEY COMMITMENT--KEY COMMITMENT-.

: KeySchedule function from RFC 9180 3, instantiated with:

  • Mode: "Base"
  • KDF: HKDF-SHA-512
  • AEAD: AES-256-GCM
  • KEM: a custom KEM ID, numbered 0x1020

: function from RFC 9180 3.

Process

To encrypt n bytes of data to a list of public keys :

  1. Compute a shared secret and the corresponding ciphertext:

  2. Derive the key and base nonce using HPKE

  3. Ensure key-commitment

  4. For each 128 KiB of data:

Note: starts at 0. is used because the sequence numbered 0 has already been used by the Key commitment.

  5. When the layer is finalized, the last chunk of data (with a length lower than or equal to 128 KiB) is encrypted the same way

  6. Finally, a final chunk with sequence number (where is the number of data chunks) and special content and additional authenticated data is appended:

The resulting layer is composed of:

  • header:
  • data:

Special care must be taken not to reuse a sequence number in implementations as this would be catastrophic given GCM properties. For chunks of data:

  • sequence 0: key commitment
  • sequence 1 to : data
  • sequence : with only the 10 bytes "FINALBLOCK" as content
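
The sequence-number discipline above can be sketched in Python. The nonce computation follows the ComputeNonce construction of RFC 9180 (section 5.2): the base nonce is XORed with the big-endian encoding of the sequence number. That MLA relies on exactly this function, that a payload of n data chunks uses sequences 1 to n followed by n + 1 for the final block, and the helper names `compute_nonce` and `sequence_plan` are assumptions made for this illustration.

```python
CHUNK_SIZE = 128 * 1024  # plaintext bytes per data chunk
NONCE_LEN = 12           # AES-GCM nonce length in bytes


def compute_nonce(base_nonce: bytes, seq: int) -> bytes:
    """RFC 9180 ComputeNonce: base_nonce XOR big-endian(seq)."""
    if len(base_nonce) != NONCE_LEN:
        raise ValueError("base nonce must be 12 bytes")
    if not 0 <= seq < 2**64:
        # The text states the sequence is a u64 checked for overflow
        raise OverflowError("sequence number does not fit in a u64")
    seq_bytes = seq.to_bytes(NONCE_LEN, "big")
    return bytes(a ^ b for a, b in zip(base_nonce, seq_bytes))


def sequence_plan(data_len: int):
    """List (sequence, role) pairs for a payload of data_len bytes."""
    n = -(-data_len // CHUNK_SIZE)  # number of data chunks (ceiling division)
    plan = [(0, "key commitment")]
    plan += [(seq, "data") for seq in range(1, n + 1)]
    plan.append((n + 1, "final block"))
    return plan
```

Because each role owns a disjoint range of sequence numbers and the XOR with a fixed base nonce is injective, no two chunks of the same archive share a nonce.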

To decrypt the data at position :

  1. Once for the whole session, get the cryptographic materials

  2. Once for the whole session, check the key commitment

  3. Retrieve the encrypted chunk of data

Where is the Euclidean division.

Then:

Arguments
  • Key commitment is always checked before returning clear-text data to the caller
  • AEAD tag of a chunk is always checked before returning the corresponding clear-text data to the caller
  • Arguments for HPKE use are very similar to the ones mentioned above. In particular, this is a standardized approach with existing analysis
  • As there are two kinds of custom KEM in use ("Per-recipient KEM" and "Hybrid KEM"), two distinct KEM IDs are used. In addition, two distinct MLA-specific info values are used to bind this derivation to MLA
  • As described in 5 and 21, AES in GCM mode does not ensure "key commitment". This property is added in the layer using the "padding fix" scheme from 5 with the recommended 512-bit size for 256-bit security
  • Key commitment is mainly used to ensure that two recipients will decrypt to the same plaintext if given the same ciphertext, i.e. an attacker modifying the header of an archive cannot provide two distinct plaintexts to two distinct recipients
  • AES-GCM is used as an industry standard AEAD
    • the base nonce, and therefore each nonce used, are unique per archive because they are generated from the archive-specific shared secret, keeping the nonce-reuse risk at the level accepted by the standard 3
    • no more than chunks will be produced, as the sequence's type used in MLA implementation is a u64 checked for overflow. As this is a widely accepted limit of AES-GCM, this value is also within the range provided by 3
    • the tag size is 128 bits (the standard one), avoiding attacks described in 22
    • 128KiB is lower than the maximum plaintext length for a single message in AES-GCM (64 GiB)22
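
The chunk arithmetic behind seekable decryption can be sketched as follows. A plaintext position is split by Euclidean division into a chunk index and an offset inside that chunk; the sequence number is the chunk index plus one, since sequence 0 is reserved for the key commitment. The `encrypted_offset` value assumes that each full chunk is stored back to back as ciphertext followed by its 16-byte tag, and is relative to the first data chunk; this layout and the helper name `locate` are assumptions of this sketch, not a statement about the on-disk format.

```python
CHUNK_SIZE = 128 * 1024  # plaintext bytes per chunk
TAG_SIZE = 16            # AES-GCM tag length in bytes


def locate(position: int):
    """Map a plaintext position to (sequence, offset_in_chunk, encrypted_offset)."""
    if position < 0:
        raise ValueError("position must be non-negative")
    # Euclidean division of the position by the chunk size
    chunk_index, offset_in_chunk = divmod(position, CHUNK_SIZE)
    # Sequence 0 is the key commitment, so data chunks start at sequence 1
    sequence = chunk_index + 1
    # Assumed layout: ciphertext || tag for each full chunk, stored contiguously
    encrypted_offset = chunk_index * (CHUNK_SIZE + TAG_SIZE)
    return sequence, offset_in_chunk, encrypted_offset
```

Only the chunk containing the requested position has to be read and authenticated, which is what makes random access possible without decrypting from the start.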

Seed derivation

The asymmetric encryption in MLA, particularly the KEMs, provides a deterministic API.

These APIs are usually fed with cryptographically generated data, except for the regression tests and the "seed derivation" feature in mlar CLI.

This feature is meant to let clients:

  • Build a derivation tree
  • Keep the root secret in a safe place, and still be able to recover the derived secrets

The derivation scheme is based on the same ideas as mla::crypto::hybrid::combine:

  1. A dual-PRF (HKDF-Extract with a uniform random salt 20) to extract entropy from the private key
  2. HKDF-Expand to derive along the given path component

From a private key ( and ), the secret is derived from the path component through:

To derive a key using a seed, a ChaCha20Rng is used. If a seed is provided, the ChaCha20Rng is seeded with the first 32 bytes of . Otherwise, the seed comes from OS Cryptographic RNG sources.

A ChaCha20Rng is the ChaCha20 23 stream cipher fed with a seed as key and 8 null bytes as nonce.

The CSRNG is then provided to MLA deterministic APIs.
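
To make the previous description concrete, here is a minimal Python sketch of a ChaCha20 keystream used as a deterministic generator: the 32-byte seed is the key, the nonce is 8 null bytes, and a 64-bit block counter starts at 0. The counter handling and the class name `SketchChaCha20Rng` are assumptions of this example; it is not claimed to be byte-for-byte identical to the Rust `ChaCha20Rng` for every call pattern.

```python
import os
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(value: int, shift: int) -> int:
    return ((value << shift) & MASK32) | (value >> (32 - shift))


def _quarter_round(s, a, b, c, d):
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = _rotl32(s[b] ^ s[c], 7)


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """One 64-byte ChaCha20 block with a 64-bit counter and a 64-bit nonce."""
    state = list(struct.unpack("<4I", b"expand 32-byte k"))
    state += struct.unpack("<8I", key)
    state += [counter & MASK32, (counter >> 32) & MASK32]
    state += struct.unpack("<2I", nonce)
    working = list(state)
    for _ in range(10):  # 10 double rounds = 20 rounds
        _quarter_round(working, 0, 4, 8, 12)   # column rounds
        _quarter_round(working, 1, 5, 9, 13)
        _quarter_round(working, 2, 6, 10, 14)
        _quarter_round(working, 3, 7, 11, 15)
        _quarter_round(working, 0, 5, 10, 15)  # diagonal rounds
        _quarter_round(working, 1, 6, 11, 12)
        _quarter_round(working, 2, 7, 8, 13)
        _quarter_round(working, 3, 4, 9, 14)
    return struct.pack("<16I", *((w + s) & MASK32 for w, s in zip(working, state)))


class SketchChaCha20Rng:
    """Deterministic byte generator: ChaCha20 keystream keyed by a 32-byte seed."""

    def __init__(self, seed=None):
        # Without a seed, fall back to the OS random source
        self.key = bytes(seed[:32]) if seed is not None else os.urandom(32)
        if len(self.key) != 32:
            raise ValueError("seed must provide at least 32 bytes")
        self.counter = 0
        self.buffer = b""

    def fill_bytes(self, n: int) -> bytes:
        while len(self.buffer) < n:
            self.buffer += chacha20_block(self.key, self.counter, bytes(8))
            self.counter += 1
        out, self.buffer = self.buffer[:n], self.buffer[n:]
        return out
```

Two generators built from the same seed return the same bytes, which is what allows a derived secret to reproduce the same key pair every time.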

Implementation specificities

External dependencies

Some of the external cryptographic libraries have been reviewed:

  • RustCrypto AES-GCM, reviewed by NCC Group 24
  • Dalek cryptography library, reviewed by Quarkslab 25
  • rust-hpke library, reviewed in version 0.8 by Cloudflare 26

In addition to the review, rust-hpke is mainly based on RustCrypto, avoiding the need for additional newer dependencies.

The MLKEM implementation used is the one from RustCrypto, as MLA already depends on this project and the code quality and auditability are, in the author's understanding, rather good.

The generation uses OsRng from crate rand, that uses getrandom() from crate getrandom. getrandom provides implementations for many systems, listed here. On Linux it uses the getrandom() syscall and falls back on /dev/urandom. On Windows it uses the RtlGenRandom API (available since Windows XP/Windows Server 2003).

In order to be "better safe than sorry", a ChaCha20Rng is seeded from the bytes generated by OsRng in order to build a CSPRNG (Cryptographically Secure PseudoRandom Number Generator). This ChaCha20Rng provides the actual bytes used in key and nonce generation.

The authors decided to use elliptic curves over RSA, because:

  • No production-ready Rust RSA library had been found at the time of writing
  • A security-audited Rust library already exists for Curve25519
  • Curve25519 is widely used and respects several criteria
  • Common arguments, such as the ones of Trail of Bits

AES-GCM is used because it is one of the most commonly used AEAD algorithms and using one avoids a whole class of attacks. In addition, it lets us rely on hardware acceleration (like AES-NI) to keep reasonable performance.

AES-GCM re-implementation

While the AES and GHash bricks come from RustCrypto, the GCM mode for AES-256 has been re-implemented in MLA.

Indeed, the repair mode must be able to only partially decrypt a data chunk, and decide whether the associated tag must be verified or not. This API is not provided by the RustCrypto project, for very understandable reasons.

To ensure the implementation follows the standard, it is tested against AES-256-GCM test vectors in MLA regression tests.

HPKE Key Schedule re-implementation

For several reasons described in the code, but mainly due to the availability of API, the possibility to add custom KEM ID and the relatively few lines needed for re-implementation, the method has been re-implemented in MLA.

It still uses some building blocks from rust-hpke, such as the KDF, and . It is tested against RFC 9180 3 test vectors in MLA regression tests.

MLKEM implementation without a review

Thanks to the hybrid approach, a flawed implementation of MLKEM would have limited consequences. It satisfies the ANSSI guidelines for the first phase of the transition to PQC hybridization 4. For this reason, MLA is eligible for a security visa evaluation.

For now, it is therefore accepted by the author (as a trade-off) to use an MLKEM implementation without an existing review, in order to bring a reasonable protection against "Harvest now, decrypt later" attacks as soon as possible.

If a reviewed implementation with acceptable dependency emerges in the future, it can be easily swapped in MLA. Thus, MLA would also satisfy the requirements to get a security visa evaluation in the second and third phases of these guidelines by including its PQC implementation.

Security considerations

Absence of signature

As there is no signature for now in MLA, an attacker knowing the recipient public key can always create a custom archive with arbitrary data.

For this reason, several known attacks are considered acceptable, such as:

  • The bit indicating if the Encrypt layer is present is not protected in integrity

An attacker can remove it, making the reader treat the archive as if encryption was absent. The reader is responsible for checking the encryption bit if encryption was expected in the first place.

For instance, the mlar CLI will refuse to open an archive without the Encrypt bit unless --accept-unencrypted is provided on the command line.

  • An attacker with the ability to modify a real archive in transit can replace what the reader will be able to read with arbitrary data

To perform this attack, the attacker will have to either remove the Encrypt bit or replace the key used for decryption with one she controls. The remaining encrypted data will then act as random values.

Still, the attacker could expect to gain enough privilege, like arbitrary code execution in the process, during the archive read. She can then try to reuse the provided key to decrypt the real data and act on it.

Limiting this attack is beyond the scope of this document. It mainly involves the security features of Rust, reviewed implementation, testing & fuzzing, zeroizing secrets when possible 27, etc.

  • An attacker can truncate an archive and hope for repair

This attack is based on a trade-off: should the SafeReader try to get as many bytes as possible, or should it return only data that have been authenticated?

The choice has been made to leave this decision to the user of the library 28.

Other properties

  • Plaintext length

The Encrypt layer does not hide the plaintext length.

Usually, this layer is used with the Compress layer. If an attacker knows the original file size, he might learn information about the original data entropy.

  • Hidden recipient list

Only the owner of a recipient's private key can determine that they are a recipient of the archive. Note that while the recipient list remains private, the total number of recipients is still visible.

This is an intentional privacy feature.


  1. RFC 8032 - Ed25519 Signature Algorithm

  2. NIST FIPS 204

  3. Hybrid Public Key Encryption, RFC 9180

  4. ANSSI Position Paper on Post-Quantum Cryptography

  5. How to Abuse and Fix Authenticated Encryption Without Key Commitment, Usenix'22

  6. MLA GitHub Issue #206

  7. FIPS 203 - MLKEM Standard

  8. MLA GitHub Issue #211

  9. A Formal Analysis of HPKE

  10. NIST PQC Standardization News

  11. Counting Correctly in MLKEM

  12. KyberSlash

  13. Signal PQXDH Specification

  14. Apple iMessage PQ3 Security Blog

  15. Dual-PRF Combiners

  16. Hybrid Key Exchange Security

  17. Hybrid Key Exchange Security (2024)

  18. TLS Hybrid Design Draft

  19. RFC 9370 - IKEv2 Post-quantum Hybrid Key Exchange

  20. On the Security of Dual-PRF Combiners

  21. Key Commitment in AEAD

  22. Authentication weaknesses in GCM

  23. RFC 8439 - ChaCha20 and Poly1305 for IETF Protocols

  24. NCC Group Review of RustCrypto AES-GCM

  25. Quarkslab Security Audit of Dalek Libraries

  26. Cloudflare on HPKE

  27. MLA GitHub Issue #46

  28. MLA GitHub Issue #167

Key derivation

This feature can help setup a hierarchical key infrastructure.

mlar provides a keyderive subcommand to deterministically derive sub-keys from a given key along a derivation path (a bit like BIP-32, except child public keys can't be derived from the parent one).

For instance, if one wants to derive the following scheme:

root_key
    ├──["App X"]── key_app_x
    │   └──["v1.2.3"]── key_app_x_v1.2.3
    └──["App Y"]── key_app_y

One can use the following commands:

# Create the root key (--seed can be used if this key must be created deterministically)
mlar keygen root_key
# Create App keys
mlar keyderive root_key key_app_x --path-component "App X"
mlar keyderive root_key key_app_y --path-component "App Y"
# Create the v1.2.3 key of App X
mlar keyderive key_app_x key_app_x_v1.2.3 --path-component "v1.2.3"

At this point, suppose an outage occurred and the derived keys have been lost.

One can recover all the keys from the root_key private key. For instance, to recover key_app_x_v1.2.3:

mlar keyderive root_key recovered_key --path-component "App X" --path-component "v1.2.3"

As such, if the App X owner only knows key_app_x, they can recover all of its subkeys, including key_app_x_v1.2.3 but excluding key_app_y.

WARNING: This scheme does not provide any revocation mechanism. If a parent key is compromised, all the keys in its sub-tree must be considered compromised (i.e. all past and future keys that can be obtained from it). The opposite is not true: a parent key remains safe if any of its child keys is compromised.
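
The tree structure above can be modelled with a short Python sketch. It follows the two-step idea described for seed derivation (HKDF-Extract, then HKDF-Expand with the path component as info), but simplifies it: each node is represented by a 32-byte secret rather than a real key pair, and the salt value `HYPOTHETICAL_SALT`, the 32-byte length and the function names are assumptions made for this illustration, not the constants used by mlar.

```python
import hashlib
import hmac

HASH_LEN = 64  # SHA-512 digest size in bytes
HYPOTHETICAL_SALT = b"hypothetical-derivation-salt"  # placeholder, not MLA's value


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-512."""
    return hmac.new(salt or bytes(HASH_LEN), ikm, hashlib.sha512).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-512."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_child(parent_secret: bytes, path_component: str) -> bytes:
    """Derive one level: extract entropy from the parent, expand along the component."""
    prk = hkdf_extract(HYPOTHETICAL_SALT, parent_secret)
    return hkdf_expand(prk, path_component.encode("utf-8"), 32)


def derive_path(root_secret: bytes, path_components) -> bytes:
    """Apply derive_child once per component, from the root down."""
    secret = root_secret
    for component in path_components:
        secret = derive_child(secret, component)
    return secret
```

In this model, deriving along `["App X", "v1.2.3"]` from the root gives the same result as deriving `"v1.2.3"` from the `"App X"` secret, while going from a child back to its parent would require inverting HKDF.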