Multi Layer Archive (MLA)

MLA is an archive file format with the following features:

Support for hybrid traditional and post-quantum encryption with asymmetric keys (HPKE with AES256-GCM and a KEM based on a hybridization of X25519 and post-quantum ML-KEM 1024)
Support for hybrid traditional and post-quantum signing
Support for compression (based on rust-brotli)
Streamable archive creation:
- An archive can be built even over a data-diode
- An entry can be added through chunks of data, without initially knowing the final size
- Entry chunks can be interleaved (one can add the beginning of an entry, start a second one, and then continue adding the first entry's parts)
Architecture agnostic and portable to some extent (written entirely in Rust)
Archive reading is seekable, even if compressed or encrypted. An entry can be accessed in the middle of the archive without reading from the beginning
If truncated, archives can be recovered to some extent. Two modes are available:
- Authenticated recover (default): only authenticated (as in AEAD, there is no signature verification) encrypted chunks of data are retrieved
- Unauthenticated recover: authenticated and unauthenticated encrypted chunks of data are retrieved. Use at your own risk.
Arguably less prone to bugs, especially while parsing an untrusted archive (Rust safety)

Repository

This repository contains:

mla: the Rust library implementing MLA reader and writer
mlar: a Rust CLI utility wrapping mla for common actions (create, list, extract...)
doc : documentation related to MLA (e.g. format specification, cryptography)
- MLA book
bindings : bindings for other languages
samples : test assets
mla-fuzz-afl : a Rust utility to fuzz mla
.github: Continuous Integration needs

Quick command-line usage

Here are some commands to use mlar in order to work with archives in MLA format.

# Generate MLA key pairs.
mlar keygen sender
mlar keygen receiver

# Create an archive with some files.
mlar create -k sender.mlapriv -p receiver.mlapub -o my_archive.mla /boot/./grub/locale/en@quot.mo /etc/security/../issue ../file.txt

# List the content of the archive.
# Note that order may vary, root directories are stripped,
# paths are normalized and listing is encoded as described in
# `doc/src/ENTRY_NAME.md` (hence the percent in output).
# This outputs:
# ``
# etc/issue
# boot/grub/locale/en%40quot.mo
# file.txt
# ``
mlar list -k receiver.mlapriv -p sender.mlapub -i my_archive.mla

# Extract the content of the archive into a new directory.
# In this example, this creates three files:
# extracted_content/etc/issue, extracted_content/boot/grub/locale/en@quot.mo
# and extracted_content/file.txt
mlar extract -k receiver.mlapriv -p sender.mlapub -i my_archive.mla -o extracted_content

# Display the content of a file in the archive
mlar cat -k receiver.mlapriv -p sender.mlapub -i my_archive.mla etc/issue

# Convert the archive into a long-term format, primarily for archival purposes.
# The operation below also removes encryption and applies
# the highest (but slowest) compression level.
mlar convert -k receiver.mlapriv -p sender.mlapub -i my_archive.mla -o longterm.mla -l compress -q 11

# Create an archive with multiple recipients and without signature or compression
mlar create -l encrypt -p archive.mlapub -p client1.mlapub -o my_archive.mla ...

# List an archive containing an entry with a name that cannot be interpreted as a path.
# This outputs:
# `c%3a%2f%00%3b%e2%80%ae%0ac%0dd%1b%5b1%3b31ma%3cscript%3eevil%5c..%2f%d8%01%c2%85%e2%88%95`
# corresponding to an entry name containing: ASCII chars, c:, /, .., \,
# NUL, RTLO, newline, terminal escape sequence, carriage return,
# HTML, surrogate code unit, U+0085 weird newline, fake unicode slash.
# Please note that some of these characters may appear in a valid path.
mlar list -k samples/test_mlakey_archive_v2_receiver.mlapriv -p samples/test_mlakey_archive_v2_sender.mlapub -i samples/archive_weird.mla --raw-escaped-names

# Get its content.
# This displays:
# `' OR 1=1`
mlar cat -k samples/test_mlakey_archive_v2_receiver.mlapriv -p samples/test_mlakey_archive_v2_sender.mlapub -i samples/archive_weird.mla --raw-escaped-names c%3a%2f%00%3b%e2%80%ae%0ac%0dd%1b%5b1%3b31ma%3cscript%3eevil%5c..%2f%d8%01%c2%85%e2%88%95

# Create an archive of a web file, without compression, without encryption and without signature
curl https://raw.githubusercontent.com/ANSSI-FR/MLA/refs/heads/main/LICENSE.md | mlar create -l -o my_archive.mla --stdin-data

# Create an archive of a web file and arbitrary byte string, without compression, without encryption and without signature (chosen separator should not be present in the two entries)
(curl https://raw.githubusercontent.com/ANSSI-FR/MLA/refs/heads/main/LICENSE.md; echo "SEPARATOR"; echo -n "All Hail MLA") | mlar create -l -o my_archive.mla --stdin-data --stdin-data-separator "SEPARATOR" --stdin-data-entry-names great_license.md,hello.txt

# Create an archive passing the file list on stdin (not data)
echo -n -e "/etc/issue\n/etc/os-release" | mlar create -l -o my_archive.mla --stdin-file-list

mlar can be obtained:

through Cargo: cargo install mlar
using the latest release for supported operating systems
- The released binaries are built with opt-level = 3, enabling great performance

For even higher performance, you can build a native-optimized binary (not portable), for example on a Linux machine:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release --target x86_64-unknown-linux-musl

Note: Native builds are optimized for your machine's CPU and are not portable. Use them only when running on the same machine you build on.

API usage

See https://docs.rs/mla

Using MLA with other languages

Bindings are available for:

C/C++
Python

Security

Please keep in mind, it is generally not safe to extract in a place where at least one ancestor is writable by others (symbolic link attacks).
Even if encrypted with an authenticated cipher, if you receive an unsigned archive, it may have been crafted by anyone having your public key and thus can contain arbitrary data.
Read API documentation and mlar help before using their functionalities. They sometimes provide important security warnings. doc/src/ENTRY_NAME.md is also of particular interest.
mlar escapes entry names on output to avoid security issues.
Except for symbolic link attacks, mlar will not extract outside given output directory.

FAQ

Is MLAArchiveWriter Send?

By default, MLAArchiveWriter is not Send. If the inner writable type is also Send, one can enable the feature send for mla in Cargo.toml, such as:

[dependencies]
mla = { version = "...", default-features = false, features = ["send"]}

Was a new format really required?

As existing archive formats are numerous, probably not.

But to the best of the authors' knowledge, none of them supports all of the aforementioned features (though they are, of course, better suited to other purposes).

For instance (to the authors' understanding):

tar format needs to know the size of files before adding them, and is not seekable
zip format could lose information about files if the footer is removed
7zip format requires rebuilding the entire archive when adding files to it (not streamable). It is also quite complex, and so harder to audit / trust when unpacking an unknown archive
journald format is not streamable. Also, one writer / multiple readers is not needed here, thus releasing some constraints the journald format has
any archive + age: age does not, as of MLA 2.0 release, support post-quantum encryption or signatures.
Backup formats are generally written to avoid things such as duplication, hence their need to keep bigger structures in memory, or not being streamable

Tweaking these formats would likely have resulted in similar properties. The choice has been made to keep a better control over what the format is capable of, and to (try to) KISS.

Performance

One can evaluate the performance through the embedded benchmarks, based on Criterion.

Several scenarios are already embedded, such as:

File addition, with different size and layer configurations
File addition, varying the compression quality
File reading, with different size and layer configurations
Random file read, with different size and layer configurations
Linear archive extraction, with different size and layer configurations

On an "Intel(R) Core(TM) i7-1255U CPU @ 2.60GHz":

$ cargo bench
...
multiple_layers_multiple_block_size/compression: true, encryption: true, signature: true/1048576
                        time:   [7.0850 ms 7.1179 ms 7.1586 ms]
                        thrpt:  [139.69 MiB/s 140.49 MiB/s 141.14 MiB/s]
...
chunk_size_decompress_multifiles_random/compression: true, encryption: true, signature: true/1048576
                        time:   [11.285 ms 11.494 ms 11.663 ms]
                        thrpt:  [85.745 MiB/s 87.005 MiB/s 88.616 MiB/s]
...
reader_multiple_layers_multiple_block_size_multifiles_linear/compression: true, encryption: true, signature: true/1048576
                        time:   [4.6197 ms 4.6383 ms 4.6604 ms]
                        thrpt:  [214.58 MiB/s 215.60 MiB/s 216.47 MiB/s]
...

Criterion.rs documentation explains how to get back HTML reports, compare results, etc.

As described in the aes crate documentation, this crate uses runtime detection on i686 and x86_64 targets to check if AES-NI is available. If AES-NI is not detected, it automatically falls back to a constant-time software implementation.

Contributing

We appreciate your help! To contribute, please read our contributing instructions.

MLA FORMAT

Relation between the MLA library version and the file format version:

MLA Version	Supported file format
2.X	2
1.X	1

This document introduces the MLA file format in its current version, 2. For a more comprehensive introduction of the ideas behind it, please refer to README.md.

Types and their serialization format

Integers are unsigned and serialized as bytes in little endian. They are called u64 for 64-bit integers, u32 for 32-bit ones, u16 for 16-bit ones and u8 for 8-bit ones. Serialization lengths in bytes are: 8 for u64, 4 for u32, 2 for u16 and 1 for u8.
Vec<T> is a sequence of elements of type T. It is serialized with its length in number of elements (not necessarily bytes) as a u64 and the sequence of serialized elements of type T.
Opts represents MLA options. It is serialized with a tag of value 0 as a u8 if no option is present. Otherwise it is serialized with a tag of value 1 as a u8 followed by a yet unspecified Vec<u8>. Multiple fields in this file format are of type Opts for future-proofing reasons, but no option is defined at the moment. For future proofing, implementers of this file format version must still handle the tag of value 1 and read the Vec<u8> even if they do not use its values. Thus, if an option is specified in the future, pre-dating implementations will be able to work with new archives containing the optional value.
Tail<T> is a T followed by its tail_length length in bytes as a u64. This enables extracting the T when reading from the end. Note that a Tail<Vec<T>> contains two lengths which may differ in units and always differ in values as Tail's length includes Vec's serialization of its own length. For example, serialization of a Tail<Vec<u16>> containing 0 and 1, leads to 02 00 00 00 00 00 00 00 00 00 01 00 0c 00 00 00 00 00 00 00. As a second example, serialization of Tail<Vec<u8>> containing 0 and 1, leads to 02 00 00 00 00 00 00 00 00 01 0a 00 00 00 00 00 00 00.
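The serialization rules above can be illustrated with a minimal Rust sketch. The function names (read_u64_le, serialize_vec_u16, serialize_tail, read_tail_from_end, skip_opts) are hypothetical and are not part of the mla crate API; the sketch only mirrors the rules stated in this section.

```rust
/// Read a little-endian u64 from the first 8 bytes of `bytes`.
fn read_u64_le(bytes: &[u8]) -> Option<u64> {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes.get(..8)?);
    Some(u64::from_le_bytes(raw))
}

/// Serialize a Vec<u16>: element count as a little-endian u64,
/// then each element in little endian.
fn serialize_vec_u16(items: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(items.len() as u64).to_le_bytes());
    for item in items {
        out.extend_from_slice(&item.to_le_bytes());
    }
    out
}

/// Wrap an already serialized T into a Tail<T>: the bytes of T
/// followed by their length in bytes as a little-endian u64.
fn serialize_tail(serialized_t: &[u8]) -> Vec<u8> {
    let mut out = serialized_t.to_vec();
    out.extend_from_slice(&(serialized_t.len() as u64).to_le_bytes());
    out
}

/// Extract the serialized T from a buffer that ends with a Tail<T>.
/// Returns None if the announced length does not fit in the buffer.
fn read_tail_from_end(buf: &[u8]) -> Option<&[u8]> {
    let len_start = buf.len().checked_sub(8)?;
    let tail_length = read_u64_le(&buf[len_start..])?;
    if tail_length > len_start as u64 {
        return None;
    }
    let t_start = len_start - tail_length as usize;
    Some(&buf[t_start..len_start])
}

/// Skip an Opts field at the start of `buf` and return the remaining bytes.
/// Tag 0 means no option; tag 1 is followed by a Vec<u8> that is read and ignored.
fn skip_opts(buf: &[u8]) -> Option<&[u8]> {
    match *buf.first()? {
        0 => buf.get(1..),
        1 => {
            let len = read_u64_le(buf.get(1..)?)?;
            let rest = buf.get(9..)?;
            if len > rest.len() as u64 {
                return None;
            }
            rest.get((len as usize)..)
        }
        _ => None,
    }
}
```

With these helpers, serialize_tail(&serialize_vec_u16(&[0, 1])) yields the 20 bytes of the first Tail example above.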

MLA Header

An MLA file begins with the mla_header_magic ASCII magic: "MLAFAAAA".
mla_header_magic is followed by the format_version format version number as a u32.
format_version is followed by a header_options field of type Opts.
header_options is followed by archive content archive_content, described below.
archive_content is followed by a footer_options field of type Tail<Opts> to enable determining archive_content's end when reading from the end of the MLA file.
footer_options is followed by the footer_magic ASCII magic "EMLAAAAA", terminating the archive.

archive_content consists of a serialized MLA entries layer documented below, transformed with zero or more layers, documented below too. A layer consists of a u64 layer magic followed by its data. Layer order plays an important security role, so the signature layer has to be above the encryption layer which has to be above the compression layer. This must be enforced by writers and readers. Readers should ensure users explicitly choose if they allow an archive without signature or without encryption.
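As an illustration of the ordering rule, the following sketch checks a list of layer magics read from the outermost layer to the innermost one. It assumes every layer except the entries layer is optional; layer_rank and layer_order_is_valid are hypothetical names, not functions of the mla crate.

```rust
/// Rank of each known layer magic, from outermost (0) to innermost (3).
fn layer_rank(magic: &[u8; 8]) -> Option<u8> {
    match magic {
        b"SIGMLAAA" => Some(0), // signature layer
        b"ENCMLAAA" => Some(1), // encryption layer
        b"COMLAAAA" => Some(2), // compression layer
        b"MLAENAAA" => Some(3), // MLA entries layer
        _ => None,
    }
}

/// Check that layers appear in strictly increasing rank and that the
/// innermost one is the entries layer. Unknown magics are rejected.
fn layer_order_is_valid(magics: &[[u8; 8]]) -> bool {
    let mut previous: Option<u8> = None;
    for magic in magics {
        let rank = match layer_rank(magic) {
            Some(rank) => rank,
            None => return false,
        };
        if let Some(previous_rank) = previous {
            if rank <= previous_rank {
                return false;
            }
        }
        previous = Some(rank);
    }
    previous == Some(3)
}
```

A reader built this way would still need a separate, explicit user choice before accepting an archive that lacks the signature or encryption layer.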

Signature layer

The layer signature_layer_magic ASCII magic is "SIGMLAAA".
signature_layer_magic is followed by a signature_header_options of type Opts.
signature_header_options is followed by sig_inner_layer, consisting of the inner layer bytes.
sig_inner_layer is followed by signature_footer_options of type Tail<Opts>.
signature_footer_options is followed by signature_data serialized as a Tail<Vec<u8>> whose content is described below.

signature_data is a Vec<u8> whose content bytes consist of a sequence of SignatureDataWHdr. A SignatureDataWHdr is a signature_method_id u16, followed by a signature_data sequence of bytes depending on the signature_method_id value. For the moment, there are two valid signature_method_id values: 0 and 1. 0 maps to MLAEd25519SigMethod, 1 maps to MLAMLDSA87SigMethod. These methods are described in doc/CRYPTO.md. Their input starts at and includes mla_header_magic, and runs up to and including sig_inner_layer. In the current version, for the signature layer to be considered verified, the reader must successfully verify at least one SignatureDataWHdr of each signature_method_id.
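The final rule can be sketched as a small decision function. It assumes the cryptographic verification of each SignatureDataWHdr has already been done elsewhere and is summarized as a (signature_method_id, verified) pair; signature_layer_verified is a hypothetical name.

```rust
/// Decide whether the signature layer counts as verified: at least one
/// successfully verified SignatureDataWHdr is required for each method id.
/// Each tuple is (signature_method_id, verification succeeded).
fn signature_layer_verified(results: &[(u16, bool)]) -> bool {
    // 0 is MLAEd25519SigMethod, 1 is MLAMLDSA87SigMethod.
    const REQUIRED_METHOD_IDS: [u16; 2] = [0, 1];
    REQUIRED_METHOD_IDS.iter().all(|required| {
        results
            .iter()
            .any(|&(method_id, verified)| method_id == *required && verified)
    })
}
```

Two valid Ed25519 signatures without a valid ML-DSA-87 one are therefore not enough.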

Encryption layer

The layer encryption_layer_magic ASCII magic is "ENCMLAAA".
encryption_layer_magic is followed by encryption_header_options of type Opts.
encryption_header_options is followed by an encryption_method_id u16, described below.
encryption_method_id is followed by encryption_metadata, a sequence of bytes described below.
encryption_metadata is followed by encrypted_inner_layer, a sequence of bytes described below.
encrypted_inner_layer is followed by one end_of_encrypted_inner_layer_magic ASCII magic "ENCMLAAB".
end_of_encrypted_inner_layer_magic is followed by one encryption_footer_options of type Tail<Opts>.

The only valid encryption_method_id for the moment is 0. It is the encryption method described in CRYPTO.md.

encryption_metadata depends on the previous encryption_method_id value. For encryption_method_id 0, encryption_metadata is a Vec<PerRecipientEncapsulatedKey> followed by a KeyCommitmentAndTag.

A PerRecipientEncapsulatedKey is an mlkem1024_encapsulated_s field followed by an ed25519_encapsulated_s field, followed by an m0_encrypted_ss field and a prkem_tag field. As described in more detail in CRYPTO.md, m0_encrypted_ss is the AES-256-GCM encrypted global_secret. The AES key used to encrypt this global_secret is recovered from mlkem1024_encapsulated_s and ed25519_encapsulated_s. prkem_tag is the GCM tag associated with m0_encrypted_ss.

mlkem1024_encapsulated_s is a sequence of 1568 bytes corresponding to the ciphertext output of ML-KEM.Encaps as described in FIPS 203. ed25519_encapsulated_s is a sequence of 32 bytes corresponding to the output of X25519 as described in RFC 7748. m0_encrypted_ss is a 32-byte sequence. prkem_tag is a 16-byte sequence.

KeyCommitmentAndTag is the key commitment described in CRYPTO.md. It is a 64-byte ciphertext followed by a 16-byte tag.
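The field sizes above add up as follows. This sketch assumes the fields of a PerRecipientEncapsulatedKey are serialized back to back without extra framing, which is how this section reads; the constant and function names are hypothetical.

```rust
const MLKEM1024_ENCAPSULATED_LEN: u64 = 1568; // mlkem1024_encapsulated_s
const X25519_ENCAPSULATED_LEN: u64 = 32; // ed25519_encapsulated_s
const M0_ENCRYPTED_SS_LEN: u64 = 32; // m0_encrypted_ss
const PRKEM_TAG_LEN: u64 = 16; // prkem_tag

/// Size in bytes of one PerRecipientEncapsulatedKey.
const PER_RECIPIENT_LEN: u64 = MLKEM1024_ENCAPSULATED_LEN
    + X25519_ENCAPSULATED_LEN
    + M0_ENCRYPTED_SS_LEN
    + PRKEM_TAG_LEN;

/// Size in bytes of KeyCommitmentAndTag: 64-byte ciphertext and 16-byte tag.
const KEY_COMMITMENT_AND_TAG_LEN: u64 = 64 + 16;

/// Size of encryption_metadata for encryption_method_id 0: the u64 length
/// of the Vec, one PerRecipientEncapsulatedKey per recipient, then the
/// KeyCommitmentAndTag. Returns None on arithmetic overflow.
fn encryption_metadata_len(recipients: u64) -> Option<u64> {
    recipients
        .checked_mul(PER_RECIPIENT_LEN)?
        .checked_add(8 + KEY_COMMITMENT_AND_TAG_LEN)
}
```

Under this assumption each additional recipient costs 1648 bytes of metadata.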

encrypted_inner_layer is the AES-256-GCM encrypted inner layer with the global_secret key. encrypted_inner_layer is a sequence of M0EncryptedChunk followed by one M0FinalEncryptedChunk. Each M0EncryptedChunk has an ASCII magic "M0ENCCNK" followed by a u64 chunk_number, followed by an encrypted_content field of (128*1024) bytes (the last M0EncryptedChunk may be smaller), followed by a 16-byte tag field. encrypted_content is the inner_layer encrypted chunk, and tag its GCM tag. M0FinalEncryptedChunk has an ASCII magic "M0FNLBLK" followed by a 10-byte encrypted_content field followed by a 16-byte tag. chunk_number is the number of the M0EncryptedChunk in the stream, starting at 1.

To protect from a truncation attack, before using an archive, it must be checked that the tag of the M0FinalEncryptedChunk is correct and that its decrypted encrypted_content is the ASCII FINALBLOCK.
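The fixed chunk layout is what makes the layer seekable. The sketch below maps an offset in the decrypted inner layer to the chunk holding it, assuming every chunk before the last one is full-sized as stated above. Offsets are relative to the start of encrypted_inner_layer, and all names are hypothetical.

```rust
/// Plaintext bytes carried by a full M0EncryptedChunk.
const CHUNK_PLAINTEXT_LEN: u64 = 128 * 1024;
/// Per-chunk overhead: 8-byte magic, u64 chunk_number and 16-byte GCM tag.
const CHUNK_OVERHEAD: u64 = 8 + 8 + 16;
/// Serialized size of a full M0EncryptedChunk.
const FULL_CHUNK_LEN: u64 = CHUNK_PLAINTEXT_LEN + CHUNK_OVERHEAD;

#[derive(Debug, PartialEq)]
struct ChunkLocation {
    /// chunk_number as stored in the chunk, starting at 1.
    chunk_number: u64,
    /// Offset of the chunk magic from the start of encrypted_inner_layer.
    chunk_start: u64,
    /// Offset of the wanted byte inside the decrypted chunk content.
    offset_in_chunk: u64,
}

/// Locate the chunk that holds a given offset of the decrypted inner layer.
fn locate_plaintext_offset(plaintext_offset: u64) -> Option<ChunkLocation> {
    let index = plaintext_offset / CHUNK_PLAINTEXT_LEN;
    Some(ChunkLocation {
        chunk_number: index.checked_add(1)?,
        chunk_start: index.checked_mul(FULL_CHUNK_LEN)?,
        offset_in_chunk: plaintext_offset % CHUNK_PLAINTEXT_LEN,
    })
}

/// Check the decrypted content of the M0FinalEncryptedChunk. The GCM tag
/// must have been verified by the decryption step before calling this.
fn is_valid_final_content(decrypted: &[u8]) -> bool {
    decrypted == &b"FINALBLOCK"[..]
}
```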

Compression layer

The layer compression_layer_magic ASCII magic is "COMLAAAA".
compression_layer_magic is followed by compression_header_options of type Opts.
compression_header_options is followed by compressed_data, a sequence of bytes explained below.
compressed_data is followed by compression_footer_options of type Tail<Opts>.
compression_footer_options is followed by sizes_info of type Tail<SizesInfo>, where SizesInfo is explained below.

The inner layer is split into chunks of 4 * 1024 * 1024 bytes, except for the last chunk which may be smaller. Each chunk is compressed with brotli. The resulting size of each compressed chunk is recorded in sizes_info. SizesInfo has a first field compressed_sizes, which is a Vec<u32> corresponding to an ordered list of compressed chunk sizes, and a second field last_block_uncompresed_size as a u32 indicating the uncompressed size of the last inner layer chunk.

compressed_data is the concatenation of each compressed chunk.

The compression layer footer information can be retrieved by first reading the value of sizes_info.tail_length at the end of the layer, then reading the preceding sizes_info.tail_length bytes.
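Once SizesInfo is known, a reader can find which compressed chunk holds a given uncompressed offset without decompressing earlier chunks. A minimal sketch of that lookup follows; locate_chunk is a hypothetical name, and the returned compressed offset is relative to the start of compressed_data.

```rust
/// Uncompressed size of every chunk except possibly the last one.
const UNCOMPRESSED_CHUNK_LEN: u64 = 4 * 1024 * 1024;

/// Map an offset in the uncompressed inner layer to
/// (chunk index, start of that chunk in compressed_data, offset inside
/// the uncompressed chunk). Returns None if the offset is past the end.
fn locate_chunk(
    compressed_sizes: &[u32],
    last_block_uncompressed_size: u32,
    uncompressed_offset: u64,
) -> Option<(usize, u64, u64)> {
    let index = uncompressed_offset / UNCOMPRESSED_CHUNK_LEN;
    let offset_in_chunk = uncompressed_offset % UNCOMPRESSED_CHUNK_LEN;
    if index >= compressed_sizes.len() as u64 {
        return None;
    }
    let index = index as usize;
    let is_last = index + 1 == compressed_sizes.len();
    if is_last && offset_in_chunk >= u64::from(last_block_uncompressed_size) {
        return None;
    }
    // The compressed chunk starts after all the previous compressed chunks.
    let compressed_start: u64 = compressed_sizes[..index]
        .iter()
        .map(|&size| u64::from(size))
        .sum();
    Some((index, compressed_start, offset_in_chunk))
}
```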

MLA entries layer

The layer entries_layer_magic ASCII magic is "MLAENAAA".
entries_layer_magic is followed by entries_header_options of type Opts.
entries_header_options is followed by entries_data, a sequence of bytes described below.
entries_data is followed by entries_footer of type Tail<EntriesIndex>, where EntriesIndex is described below.
entries_footer is followed by entries_footer_options of type Tail<Opts>.

entries_data is a succession of ArchiveEntryBlock of different types. An ArchiveEntryBlock begins with an ASCII magic "MAEB" followed by an ArchiveEntryBlockType u8 determining the type of ArchiveEntryBlock:

0x00 means EntryStart
0x01 means EntryContentChunk
0xFE means EndOfArchiveData
0xFF means EndOfEntry

If the ArchiveEntryBlockType is EntryStart, it is followed by an ArchiveEntryId u64, an EntryName and an entry_start_options of type Opts. An EntryName is a Vec<u8> described in doc/ENTRY_NAME.md.

If the ArchiveEntryBlockType is EntryContentChunk, it is followed by an ArchiveEntryId, a content_options of type Opts and a Vec<u8> entry_content_data.

If the ArchiveEntryBlockType is EndOfEntry, it is followed by an ArchiveEntryId, an end_options of type Opts and a hash serialized as 32 u8.

If the ArchiveEntryBlockType is EndOfArchiveData, it is followed by nothing.
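The block header can be decoded as in the following sketch, which only handles the magic and the type byte and leaves the type-specific fields to the caller. The enum mirrors the names used above; parse_block_type is a hypothetical helper, not the mla crate API.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum ArchiveEntryBlockType {
    EntryStart,
    EntryContentChunk,
    EndOfArchiveData,
    EndOfEntry,
}

/// Parse the "MAEB" magic and the block type byte at the start of `buf`.
/// Returns the block type and the bytes following the header, or None
/// if the magic is wrong, the type is unknown or the buffer is too short.
fn parse_block_type(buf: &[u8]) -> Option<(ArchiveEntryBlockType, &[u8])> {
    if !buf.starts_with(b"MAEB") {
        return None;
    }
    let block_type = match *buf.get(4)? {
        0x00 => ArchiveEntryBlockType::EntryStart,
        0x01 => ArchiveEntryBlockType::EntryContentChunk,
        0xFE => ArchiveEntryBlockType::EndOfArchiveData,
        0xFF => ArchiveEntryBlockType::EndOfEntry,
        _ => return None,
    };
    Some((block_type, &buf[5..]))
}
```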

EntriesIndex is a 0x00 byte in case of an archive not storing any index. Otherwise, it is a 0x01 byte followed by a Vec<EntryNameInfoMapElt>. An EntryNameInfoMapElt is an EntryName followed by an entry_blocks_info, which is a Vec<EntryBlockInfo> explained below. For reproducibility, the EntriesIndex Vec is sorted by entry name (lexicographically by byte values) before being serialized.

EntryBlockInfo has two fields: block_offset and block_size. The block_offset field is a u64 indicating at which offset from the beginning of the MLA entries layer an ArchiveEntryBlock can be found for the given EntryName. The block_size field is a u64 indicating the size in bytes of the block content (0 except for EntryContentChunk). If it is an EntryContentChunk with entry_content_data containing 1 byte, block_size is 1. All EntryBlockInfos for each entry are recorded in entry_blocks_info and they are so in ascending order of offset.

Explanations

An archive entry entry_i in the archive always starts with an EntryStart, giving its name and unique ID i.

entry_i content is the concatenation of all EntryContentChunks entry_content_data fields with ArchiveEntryId value i.

Once the EndOfEntry for entry_i is reached, the entry is completely read. Its content SHA-256 hash can be verified with the EndOfEntry.hash.

Between the last EndOfEntry block and entries_footer, there is a single EndOfArchiveData block. It is used when trying to read a truncated archive, to correctly separate the actual archive data from the footer.

As blocks from different entries can be interleaved, the entry_blocks_info offsets for an entry are the offsets in entries_data of its blocks.

For instance, if the blocks are:

Off0: [EntryStart ID 1]
Off1: [EntryStart ID 2]
Off2: [EntryContentChunk ID 1]
Off3: [EntryContentChunk ID 1]
Off4: [EntryContentChunk ID 2]
Off5: [EndOfEntry ID 1]
...

The offsets for the entry with ID 1 will be Off0, Off2, Off3 and Off5.
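The same grouping can be sketched in code. Here the offsets are stand-in integers (0 for Off0, 1 for Off1, and so on) rather than real byte offsets, and the Block enum and offsets_by_entry function are hypothetical.

```rust
use std::collections::BTreeMap;

/// A block reduced to its type and its ArchiveEntryId.
#[derive(Clone, Copy)]
enum Block {
    EntryStart(u64),
    EntryContentChunk(u64),
    EndOfEntry(u64),
}

/// Group block offsets by entry ID, keeping them in ascending order
/// (the input is assumed to be in archive order).
fn offsets_by_entry(blocks: &[(u64, Block)]) -> BTreeMap<u64, Vec<u64>> {
    let mut offsets = BTreeMap::new();
    for &(offset, block) in blocks {
        let id = match block {
            Block::EntryStart(id) | Block::EntryContentChunk(id) | Block::EndOfEntry(id) => id,
        };
        offsets.entry(id).or_insert_with(Vec::new).push(offset);
    }
    offsets
}
```

For the six blocks listed above, the entry with ID 2 gets the offsets Off1 and Off4.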

Entry name documentation

An archive can store entries associated with a name. These entries may or may not represent OS filesystem files, and their names may or may not represent OS filesystem paths.

An entry name is a nonempty sequence of bytes (maximum length of 65536).

Please keep in mind that names, interpreted as paths or not, may contain arbitrary bytes such as slashes, backslashes, .., C:\\{}...], newlines, spaces, carriage returns, terminal escape sequences, Unicode chars such as U+0085 or RTLO, HTML, SQL, semicolons, homoglyphs, etc.

Interpretation of an entry name as an OS filesystem file path

If it is to be interpreted as a file path, the underlying bytes must consist of ASCII slash separated components and not begin with a slash. The rules for each component are:

must not be empty
must not contain any ASCII NUL byte
must not be ASCII dot
must not be two ASCII dots

If it is to be interpreted as a Windows file path, in addition to previous rules:

No byte should be an ASCII backslash (separators are represented by an ASCII slash).
Byte values strictly below 32 (non-printable control characters) are forbidden. Additionally, the following ASCII values are forbidden: 34 ("), 42 (*), 58 (:), 60 (<), 62 (>), 63 (?), and 124 (|).
Every component must be encoded as UTF-8.

These rules are checked by the accompanying Rust implementation (EntryName::to_pathbuf).
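For illustration only, the rules above can be written as two predicates over the raw name bytes. This is a sketch of the stated rules, not the code of EntryName::to_pathbuf; both function names are hypothetical.

```rust
/// Generic rules: slash separated components, no leading slash (it would
/// produce an empty first component), and no component that is empty,
/// contains NUL, or equals "." or "..".
fn is_valid_path_name(name: &[u8]) -> bool {
    !name.is_empty()
        && name.split(|&byte| byte == b'/').all(|component| {
            !component.is_empty()
                && !component.contains(&0)
                && component != &b"."[..]
                && component != &b".."[..]
        })
}

/// Additional Windows rules: no backslash, no byte below 32, none of
/// the listed ASCII characters, and every component valid UTF-8.
fn is_valid_windows_path_name(name: &[u8]) -> bool {
    const FORBIDDEN: &[u8] = b"\"*:<>?|\\";
    is_valid_path_name(name)
        && name
            .iter()
            .all(|byte| *byte >= 32 && !FORBIDDEN.contains(byte))
        && name
            .split(|&byte| byte == b'/')
            .all(|component| std::str::from_utf8(component).is_ok())
}
```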

Even if respecting these rules, the OS may see the resulting path as invalid.

Please keep in mind that two different names may map to the same path on an OS (e.g. Windows case insensitivity).

In the provided Rust implementation, when a path is given as input to EntryName::from_path or mlar, it is normalized before being converted to an entry name: only Normal std::path::Components are kept, and the previous component, if any, is popped when a .. is encountered.
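A minimal sketch of such a normalization with the standard library is shown below. It follows the description above but is not the code of EntryName::from_path; the function name normalize is hypothetical.

```rust
use std::path::{Component, Path, PathBuf};

/// Keep only Normal components and pop the previous one on "..".
/// Root, prefix and "." components are dropped.
fn normalize(path: &Path) -> PathBuf {
    let mut normalized = PathBuf::new();
    for component in path.components() {
        match component {
            Component::Normal(part) => normalized.push(part),
            Component::ParentDir => {
                // pop() is a no-op when there is nothing left to remove.
                normalized.pop();
            }
            _ => {}
        }
    }
    normalized
}
```

Applied to /etc/security/../issue this gives etc/issue, and ../file.txt becomes file.txt, which matches the listing shown in the quick command-line usage.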

String representation of entry names

To prevent some security risks, proposed string representations of entry names are given with EntryName::to_pathbuf_escaped_string and EntryName::raw_content_to_escaped_string and are used by mlar.

Other representations may be preferred depending on their usage context.

The idea of this representation is that unwanted bytes are replaced with a percent and their hexadecimal representation. Details follow.

For an entry name interpreted as raw bytes, the generic escaping described below is applied with ASCII alphanumeric chars, dot, dash and underscore as preserved bytes. This is used by mlar list --raw-escaped-names.

For an entry name interpreted as a path, the generic escaping described below is applied with ASCII alphanumeric chars, dot, dash, underscore and slash as preserved bytes. This is used by default by mlar list.

Generic escaping, implemented by `helpers::mla_percent_escape`

A bytes_to_preserve parameter tells which bytes are not escaped. For every input byte:

If listed in bytes_to_preserve then it will be output without transformation.
Else, it will be replaced by %xx where xx is its hexadecimal representation.
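The escaping rule can be sketched as follows. This is not the code of helpers::mla_percent_escape; the names percent_escape and path_preserved_bytes are hypothetical, and lowercase hexadecimal digits are used to match the outputs shown in this documentation.

```rust
/// Escape every byte not listed in `bytes_to_preserve` as %xx,
/// where xx is the two-digit lowercase hexadecimal value of the byte.
fn percent_escape(input: &[u8], bytes_to_preserve: &[u8]) -> Vec<u8> {
    let mut escaped = Vec::with_capacity(input.len());
    for &byte in input {
        if bytes_to_preserve.contains(&byte) {
            escaped.push(byte);
        } else {
            escaped.extend_from_slice(format!("%{:02x}", byte).as_bytes());
        }
    }
    escaped
}

/// Preserved bytes for a name interpreted as a path:
/// ASCII alphanumeric chars, dot, dash, underscore and slash.
fn path_preserved_bytes() -> Vec<u8> {
    (0u8..=127)
        .filter(|byte| byte.is_ascii_alphanumeric() || b"./-_".contains(byte))
        .collect()
}
```

Since the percent sign is not in the preserved set, it is escaped as %25, which keeps the output unambiguous.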

Generic unescaping, implemented by `helpers::mla_percent_unescape`

A bytes_to_allow parameter tells which bytes are not escaped. Unescaping fails if fed with anything other than bytes listed in bytes_to_allow and %xx sequences where xx is the hexadecimal representation of a byte not listed in bytes_to_allow. Otherwise, it reverses the process described in Generic escaping.
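A possible shape for the reverse operation is sketched below. It is not the code of helpers::mla_percent_unescape: the function names are hypothetical, and accepting only lowercase hexadecimal digits is an assumption of this sketch, chosen so that each name has a single escaped form.

```rust
/// Value of one lowercase hexadecimal digit (assumption: uppercase is rejected).
fn hex_value(digit: u8) -> Option<u8> {
    match digit {
        b'0'..=b'9' => Some(digit - b'0'),
        b'a'..=b'f' => Some(digit - b'a' + 10),
        _ => None,
    }
}

/// Reverse the generic escaping. Fails (None) on a byte that is neither
/// allowed nor part of a %xx sequence, on a truncated or malformed %xx,
/// and on a %xx that encodes a byte listed in `bytes_to_allow`.
fn percent_unescape(input: &[u8], bytes_to_allow: &[u8]) -> Option<Vec<u8>> {
    let mut unescaped = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        if bytes_to_allow.contains(&byte) {
            unescaped.push(byte);
            i += 1;
        } else if byte == b'%' {
            let high = hex_value(*input.get(i + 1)?)?;
            let low = hex_value(*input.get(i + 2)?)?;
            let decoded = high * 16 + low;
            if bytes_to_allow.contains(&decoded) {
                return None;
            }
            unescaped.push(decoded);
            i += 3;
        } else {
            return None;
        }
    }
    Some(unescaped)
}
```

Rejecting %61 when a is allowed is what prevents two different escaped strings from decoding to the same name.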

Examples

For each of the following entry names found serialized in an archive, here is how it is represented as a string when interpreted as a path:

empty bytes -> invalid (even interpreted as arbitrary bytes)
/a -> invalid path (root directory)
a/b/../d -> invalid path (path traversal)
a/b/.. -> invalid path (path traversal)
a//b -> invalid path (not normalized)
a/./b -> invalid path (not normalized)
./b -> invalid path (not normalized)
a/. -> invalid path (not normalized)
aNULb (where NUL here represents an ASCII NUL byte) -> invalid path
m:abcd -> invalid path on Windows (: as second byte), m%3aabcd on UNIX-like
a\b (where \ represents an ASCII backslash, not an escaped b) -> invalid path on Windows (contains backslash), a%5cb on UNIX-like
a/b.txt -> a/b.txt
a/b!c -> a/b%21c

MLA key file format

MLA can use cryptography for signature and/or encryption. Thus it needs to operate with keys. An implementation can get access to these keys from a serialized format described here.

The string || denotes concatenation.

Private key file format

A private key file is an ASCII file, which may use mlapriv as file extension. The file (or whatever serialization medium) content is PrivFormatHeader||<CR><LF>||PrivEncHdr||B64Priv4Enc||<CR><LF>||PrivSigHdr||B64Priv4Sig||<CR><LF>||B64PrivOpts||<CR><LF>||PrivFormatFooter||<CR><LF> where <CR> is ASCII carriage return, <LF> is ASCII line feed, and PrivFormatHeader, PrivEncHdr, B64Priv4Enc, PrivSigHdr, B64Priv4Sig, B64PrivOpts and PrivFormatFooter are described below.

PrivFormatHeader is the ASCII string DO NOT SEND THIS TO ANYONE - MLA PRIVATE KEY FILE V1.
PrivEncHdr is the ASCII string "MLA PRIVATE DECRYPTION KEY " (note the trailing space).
PrivSigHdr is the ASCII string "MLA PRIVATE SIGNING KEY " (note the trailing space).
B64Priv4Enc is the base64 encoding (RFC 4648) of EncMethodId||PrivEncOpts||X25519PrivKey||MLKEM1024PrivKey where EncMethodId, PrivEncOpts, X25519PrivKey and MLKEM1024PrivKey are described below.
B64Priv4Sig is the base64 encoding of SigMethodId||PrivSigOpts||Ed25519PrivKey||MLDSA87PrivKey where SigMethodId, PrivSigOpts, Ed25519PrivKey and MLDSA87PrivKey are described below.
PrivFormatFooter is the ASCII string END OF MLA PRIVATE KEY FILE
The only valid EncMethodId for the moment is the ASCII mla-kem-private-x25519-mlkem1024.
The only valid SigMethodId for the moment is the ASCII mla-signature-private-ed25519-mldsa87.
X25519PrivKey is an X25519 private key as specified in RFC 7748.
MLKEM1024PrivKey is an ML-KEM-1024 private key seed (d,z) as specified in FIPS 203 algorithm 16. d and z are concatenated in this order.
Ed25519PrivKey is an Ed25519 private key as specified in RFC 8032.
MLDSA87PrivKey is an ML-DSA-87 private key seed xi as specified in FIPS 204 algorithm 6.

For PrivEncOpts and PrivSigOpts, refer to the generic explanation of KeyOpts below.

B64PrivOpts is a base64 encoded KeyOpts.

Public key file format

A public key file is an ASCII file, which may use mlapub as file extension. The file (or whatever serialization medium) content is PubFormatHeader||<CR><LF>||PubEncHdr||B64Pub4Enc||<CR><LF>||PubSigHdr||B64Pub4Sig||<CR><LF>||B64PubOpts||<CR><LF>||PubFormatFooter||<CR><LF> where <CR> is ASCII carriage return, <LF> is ASCII line feed, and PubFormatHeader, PubEncHdr, B64Pub4Enc, PubSigHdr, B64Pub4Sig, B64PubOpts and PubFormatFooter are described below.

PubFormatHeader is the ASCII string MLA PUBLIC KEY FILE V1.
PubEncHdr is the ASCII string "MLA PUBLIC ENCRYPTION KEY " (note the trailing space).
PubSigHdr is the ASCII string "MLA PUBLIC SIGNATURE VERIFICATION KEY " (note the trailing space).
B64Pub4Enc is the base64 encoding (RFC 4648) of EncMethodId||PubEncOpts||X25519PubKey||MLKEM1024PubKey where EncMethodId, PubEncOpts, X25519PubKey and MLKEM1024PubKey are described below.
B64Pub4Sig is the base64 encoding of SigMethodId||PubSigOpts||Ed25519PubKey||MLDSA87PubKey where SigMethodId, PubSigOpts, Ed25519PubKey and MLDSA87PubKey are described below.
PubFormatFooter is the ASCII string END OF MLA PUBLIC KEY FILE
The only valid EncMethodId for the moment is the ASCII mla-kem-public-x25519-mlkem1024.
The only valid SigMethodId for the moment is the ASCII mla-signature-verification-public-ed25519-mldsa87.
X25519PubKey is an X25519 public key as specified in RFC 7748.
MLKEM1024PubKey is an ML-KEM-1024 public key as specified in FIPS 203.
Ed25519PubKey is an Ed25519 public key as specified in RFC 8032.
MLDSA87PubKey is an ML-DSA-87 public key as specified in FIPS 204.

For PubEncOpts and PubSigOpts, refer to the generic explanation of KeyOpts below.

B64PubOpts is a base64 encoded KeyOpts.

Options

KeyOpts fields are option fields for future-proofing the format in case of later non-breaking optional additions to the key file format. A KeyOpts is a length-value field where length is the length in bytes of value, serialized as a 4-byte little-endian integer. Possible values are left unspecified for the moment, but implementations, particularly for public keys, should read length bytes correctly in case some options are specified later.
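The length-value encoding can be sketched as below, working on raw bytes (that is, after base64 decoding where applicable). The function names are hypothetical and not part of the mla crate API.

```rust
/// Serialize a KeyOpts: 4-byte little-endian length, then the value.
/// Returns None if the value is too long for a 4-byte length.
fn serialize_key_opts(value: &[u8]) -> Option<Vec<u8>> {
    if value.len() as u64 > u64::from(u32::MAX) {
        return None;
    }
    let mut out = (value.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(value);
    Some(out)
}

/// Split a buffer starting with a KeyOpts into (value, remaining bytes).
/// The value is returned even though no option is defined yet, so that
/// a caller can skip over it correctly.
fn split_key_opts(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(buf.get(..4)?);
    let length = u32::from_le_bytes(raw) as usize;
    let rest = &buf[4..];
    if length > rest.len() {
        return None;
    }
    Some(rest.split_at(length))
}
```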

Cryptography in MLA

MLA uses cryptographic primitives essentially for the purpose of the Encryption and Signature layers.

This document introduces the primitives used, arguments for the choice made and some security considerations.

Keys used for encryption and signature are generated and used separately.

Signature

As described in FORMAT.md, an archive can be signed. Implementations must ensure users explicitly choose whether a signature is made and verified.

A PQ/T key consists of a pair of a post-quantum key and a traditional key. An archive is considered correctly signed for a PQ/T key if and only if it is correctly signed for its post-quantum part AND its traditional part.

Two signature methods are available and must be used together. Signature method input is called m. The SHA-512 hash h of m may be computed in a first step.

For method MLAEd25519SigMethod, signature_data is the Ed25519ph (as described in RFC 8032 ¹) signature of m (not h even though it can be used for computing the result). The context given as parameter to Ed25519ph is the ASCII MLAEd25519SigMethod. Signature verification and key generation are done as described in RFC 8032. Key storage is described in KEY_FORMAT.md.

For method MLAMLDSA87SigMethod, signature_data is the ML-DSA-87 signature (as described in FIPS 204 ², not HashML-DSA) of h (not m this time) with the ASCII MLAMLDSA87SigMethod as context. Signature verification and key generation are done as described in FIPS 204. Key storage is described in KEY_FORMAT.md.

An archive can be signed with multiple signing keys. If a user provides a set of PQ/T keys for signature verification, implementations should give a way for the user to know if the archive is correctly signed for at least one key. Implementations may give a way for users to know if the archive is correctly signed for all keys. Users must explicitly know if they are validating against at least one or all keys. Implementations may also give a way for users to know which PQ/T keys correspond to valid signatures or their number.

Encryption high-level overview

Objectives

The purpose of the Encryption layer is to provide confidentiality and data integrity of the inner layer.

These objectives are obtained using:

Authenticated encryption
Asymmetric cryptography, for several recipients

This layer does not provide signature.

General design guidelines

The size and the initial computation time used for the encryption needs are not a big issue, if kept reasonable. Indeed, to the authors' understanding, MLA archives are usually several MB long and the computation time is primarily spent in compression/decompression and encryption/decryption of the data.

As a result, some optimizations have not been performed, which helps keep a hopefully auditable and conservative design.

Only one encryption method and key type is available, to avoid confusion and potential corner-case errors
When possible, use audited code and test vectors

Main bricks: Encryption

The data is encrypted using AES-256-GCM, an AEAD algorithm. To offer a seekable layer, data is encrypted in chunks of 128 KiB each, except for the last one, which may be shorter. Each encrypted chunk is stored with its associated tag. Tags are checked during decryption before returning data to the upper layer.

To prevent truncation attacks, another chunk is added at the end corresponding to the encryption of the ASCII string "FINALBLOCK" with "FINALAAD" as additional authenticated data. Any usage of the archive must check correct decryption (including tag verification) of this last block.

The key, the base nonce and the nonce derivation for each data chunk are computed following HPKE (RFC 9180) ³. HPKE is parameterized with:

Mode: "Base" (no PSK, no sender authentication)
KDF: HKDF-SHA512
AEAD: AES-256-GCM
KEM: Multi-Recipient Hybrid KEM, a custom KEM described later in this document

Thus, only one cryptography suite is available for now. If this suite ends up broken by cryptanalysis, users will be moved to the next MLA version, using appropriate cryptography. Therefore, MLA lacks cryptographic agility, a property that ANSSI encourages for post-quantum cryptography ⁴. Still, HPKE improves this aspect of MLA ³.

Full details are available below.

Additionally, "key commitment" is included using a method described in ⁵ and detailed in ⁶.

Main bricks: Asymmetric encryption

Since format v2, the Encrypt layer uses post-quantum cryptography (PQC) through a hybrid approach, to avoid "Harvest now, decrypt later" attacks.

The algorithms used are:

X25519 for pre-quantum cryptography, using DHKEM (RFC 9180) ³
FIPS 203⁷ (CRYSTALS Kyber) MLKEM-1024 for post-quantum cryptography

The two keys are mixed together (see below) in a manner keeping the IND-CCA2 properties of the two algorithms.

Sending to multiple recipients is achieved using the following process:

For each recipient, a per-recipient Hybrid KEM is performed, leading to a per-recipient shared secret
Each per-recipient shared secret is derived through HPKE to obtain a key and a nonce
Each per-recipient key and nonce is used to encrypt (and, on the recipient side, decrypt) a secret shared by all recipients

This final secret is the one later used as an input to the encryption layer. The whole process can be viewed as a KEM encapsulation for multiple recipients.

Encryption Details

The following sections describe the whole process for data encryption and seed derivation. They are meant to ease the understanding of the code and MLA format re-implementation.

The interested reader could also look at the Rust implementation in this repository for more details. The implementation also includes tests (including some test vectors) and comments.

Asymmetric encryption - Per-recipient KEM

Notations

$pk_{ecc}^{i}$ , $sk_{ecc}^{i}$ , $pk_{mlkem}^{i}$ and $sk_{mlkem}^{i}$ : respectively the X25519 public key and secret key, and the MLKEM-1024 (FIPS 203 ⁷) encapsulating key and decapsulating key
$DHKEM.Encapsulate$ and $DHKEM.Decapsulate$ : key encapsulation methods with X25519, as defined in RFC 9180, section 4 ³
$MLKEM.Encapsulate$ and $MLKEM.Decapsulate$ : key encapsulation methods on MLKEM-1024, as defined in FIPS 203 ⁷
$ss_{recipients}$ : a 32-byte secret, produced by a cryptographic RNG. Informally, this is the secret shared among recipients, encapsulated separately for each recipient
$KeySchedule_{recipient}$ : KeySchedule function from RFC 9180 ³, instantiated with:
- Mode: "Base"
- KDF: HKDF-SHA-512
- AEAD: AES-256-GCM
- KEM: a custom KEM ID, numbered 0x1120
$Encrypt_{AES256GCM}$ : AES-256-GCM encryption, returning the encrypted data concatenated with the associated tag
$Decrypt_{AES256GCM}$ : AES-256-GCM decryption, returning the decrypted data after verifying the tag
$Serialize$ and $Deserialize$ : respectively produce a byte string encoding the data in argument, and produce the data from the byte string in argument

Process

To encrypt to a target recipient $i$ , knowing $pk_{ecc}^{i}$ and $pk_{mlkem}^{i}$ :

Compute shared secrets and ciphertexts for both KEMs:

$(ss_{ecc}^{i}, ct_{ecc}^{i}) = DHKEM.Encapsulate(pk_{ecc}^{i})$
$(ss_{mlkem}^{i}, ct_{mlkem}^{i}) = MLKEM.Encapsulate(pk_{mlkem}^{i})$

Combine the shared secrets (implemented in mla::crypto::hybrid::combine):

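def combine(ss1, ss2, ct1, ct2):
    # Make the first shared secret uniformly random
    uniformly_random_ss1 = HKDF-SHA512-Extract(
        salt=0,
        ikm=ss1
    )
    # Mix in the second shared secret and bind both ciphertexts
    key = HKDF-SHA512(
        salt=uniformly_random_ss1,
        ikm=ss2,
        info=ct1 . ct2
    )
    return key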

$ss_{recipient}^{i} = combine(ss_{ecc}^{i}, ss_{mlkem}^{i}, ct_{ecc}^{i}, ct_{mlkem}^{i})$

Wrap the recipients' shared secret:

$(key^{i}, nonce^{i}) = KeySchedule_{recipient}(shared\_secret = ss_{recipient}^{i}, info = \text{"MLA Recipient"})$
$ct_{wrap}^{i} = Encrypt_{AES256GCM}(key = key^{i}, nonce = nonce^{i}, data = ss_{recipients})$
$ct_{recipient}^{i} = Serialize(ct_{wrap}^{i}, ct_{ecc}^{i}, ct_{mlkem}^{i})$

Informally, this process can be viewed as a per-recipient KEM taking a shared secret $ss_{recipients}$ , the recipient public key (made of the elliptic curve and the PQC public keys) and returning a ciphertext $ct_{recipient}^{i}$ .
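The combine step of this process can be sketched in Python with the standard library only. This is an illustrative reading of the pseudocode, not the code of mla::crypto::hybrid::combine. Two points are assumptions: `salt=0` is taken to mean an all-zero salt (for HMAC, any all-zero key no longer than the hash block gives the same result as an empty one), and the output length is set to 32 bytes. The helper names `hkdf_extract` and `hkdf_expand` are introduced here for illustration and follow RFC 5869.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha512().digest_size  # 64 bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-512: PRK = HMAC(salt, ikm)."""
    return hmac.new(salt, ikm, hashlib.sha512).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869) with SHA-512."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]


def combine(ss1: bytes, ss2: bytes, ct1: bytes, ct2: bytes, length: int = 32) -> bytes:
    """Nested Dual-PRF combiner; the 32-byte output length is an assumption."""
    # First PRF call: make ss1 uniformly random so it can serve as a salt.
    uniformly_random_ss1 = hkdf_extract(b"\x00" * HASH_LEN, ss1)
    # Second HKDF: mix in ss2 and bind both ciphertexts through `info`.
    prk = hkdf_extract(uniformly_random_ss1, ss2)
    return hkdf_expand(prk, ct1 + ct2, length)
```

Changing either shared secret or either ciphertext changes the resulting key, which is what binds the derived secret to the exact pair of encapsulations.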

To obtain the shared secret from $ct_{recipient}^{i}$ for a recipient $i$ knowing $sk_{ecc}^{i}$ and $sk_{mlkem}^{i}$ :

Compute the recipient's shared secret:

$(ct_{wrap}^{i}, ct_{ecc}^{i}, ct_{mlkem}^{i}) = Deserialize(ct_{recipient}^{i})$
$ss_{ecc}^{i} = DHKEM.Decapsulate(sk_{ecc}^{i}, ct_{ecc}^{i})$
$ss_{mlkem}^{i} = MLKEM.Decapsulate(sk_{mlkem}^{i}, ct_{mlkem}^{i})$
$ss_{recipient}^{i} = combine(ss_{ecc}^{i}, ss_{mlkem}^{i}, ct_{ecc}^{i}, ct_{mlkem}^{i})$

Try to decrypt the secret shared among recipients:

$(key^{i}, nonce^{i}) = KeySchedule_{recipient}(shared\_secret = ss_{recipient}^{i}, info = \text{"MLA Recipient"})$
$ss_{recipients} = Decrypt_{AES256GCM}(key = key^{i}, nonce = nonce^{i}, data = ct_{wrap}^{i})$

If the decryption succeeds, return $ss_{recipients}$ . Otherwise, return an error.

Arguments

Using HPKE (RFC 9180 ³) for both elliptic curve encryption (DHKEM) and post-quantum encryption (MLKEM) offers several benefits⁸:
- Easier re-implementation of the format MLA, thanks to the availability of HPKE in cryptographic libraries
- An existing formal analysis ⁹
- Easier code and security auditing, thanks to the use of known bricks
- Availability of test vectors in the RFC, making the implementation more reliable
To the knowledge of the author, no HPKE algorithm has been standardized for quantum hybridation, hence the custom algorithm
FIPS 203 is used as, at the time of writing:
- It is the only KEM algorithm standardized by the NIST ¹⁰
- It is in line with the French suggestions ⁴ for PQ cryptography
The MLKEM-1024 mode is used for stronger security, and to limit the consequences of future advances ¹¹ ¹². This is also the choice of other industry standards ¹³ ¹⁴
The shared secret from the two KEMs is produced using a "Nested Dual-PRF Combiner", proved in ¹⁵ (3.3):
- The use of a concatenation scheme including the ciphertexts keeps IND-CCA2 if one of the two underlying schemes is IND-CCA2, as proved in ¹⁶ and explained in ¹⁷
- TLS ¹⁸ uses a similar scheme, and IKE ¹⁹ also uses a concatenation scheme
- This kind of scheme follows ANSSI recommendations ⁴
- HKDF can be considered as a Dual-PRF if both inputs are uniformly random ²⁰. In MLA, the combine method is called with a shared secret from ML-KEM and the result of the ECC key derivation; both are uniformly random
- To avoid a potential mistake in the future, or a misuse of this method, the "Nested Dual-PRF Combiner" is used instead of the "Dual-PRF Combiner" (also from ¹⁵). Indeed, this combiner forces the "salt" part of HKDF to be uniformly random using an additional PRF call, ensuring the following HKDF is indeed a Dual-PRF

Asymmetric encryption - Multi-Recipient Hybrid KEM

Intuition

A KEM, such as the one described above, returns a fresh and distinct secret for each recipient.

To obtain a "meta-KEM" working for multiple recipients, the strategy is to use the per-recipient KEM to encrypt a common secret.

This whole process can then be viewed as a multi-recipient KEM, taking as input a list of public keys and returning a shared secret and a ciphertext made of the concatenation of each per-recipient ciphertext.

To avoid marking which per-recipient ciphertext corresponds to which recipient public key, the decapsulation process "brute-forces" each ciphertext for a given decapsulation key. If the decryption works (with the associated tag), the shared secret is returned.

Key commitment, which guards against a rather unlikely mismatch, is further ensured inside the Encrypt layer (see below).

Process

The "Per-recipient KEM" process described above is noted:

$PerRecipientKEM.Encapsulate$ , taking a pair of public keys ( $pk_{ecc}^{i}$ and $pk_{mlkem}^{i}$ ), a shared secret $ss_{recipients}$ and returning a recipient ciphertext $ct_{recipient}^{i}$
$PerRecipientKEM.Decapsulate$ , taking a pair of private keys ( $sk_{ecc}^{i}$ and $sk_{mlkem}^{i}$ ), a per-recipient ciphertext $ct_{k}$ and returning either a shared secret $ss_{recipients}$ if the recipient $i$ is a legitimate recipient (if the AEAD decryption works), or an error otherwise

$CSPRNG(n)$ is a cryptographically secure RNG producing an n-byte secret.

To encapsulate to a list of recipients $[(pk_{ecc}^{0}, pk_{mlkem}^{0}), ..., (pk_{ecc}^{n-1}, pk_{mlkem}^{n-1})]$ :

$def\ HybridKEM.Encapsulate([(pk_{ecc}^{0}, pk_{mlkem}^{0}), ..., (pk_{ecc}^{n-1}, pk_{mlkem}^{n-1})])$
$\quad ss_{recipients} = CSPRNG(32)$
$\quad ct_{recipient}^{0} = PerRecipientKEM.Encapsulate((pk_{ecc}^{0}, pk_{mlkem}^{0}), ss_{recipients})$
$\quad \dots$
$\quad ct_{recipient}^{n-1} = PerRecipientKEM.Encapsulate((pk_{ecc}^{n-1}, pk_{mlkem}^{n-1}), ss_{recipients})$
$\quad ct_{recipients} = Serialize(ct_{recipient}^{0}, \dots, ct_{recipient}^{n-1})$
$\quad return\ ss_{recipients}, ct_{recipients}$

To decapsulate from a ciphertext $ct_{recipients}$ , knowing a recipient private key $(sk_{ecc}^{i}, sk_{mlkem}^{i})$ :

$def\ HybridKEM.Decapsulate((sk_{ecc}^{i}, sk_{mlkem}^{i}), ct_{recipients})$
$\quad foreach\ ct_{k}\ in\ Deserialize(ct_{recipients})$
$\quad \quad try:$
$\quad \quad \quad ss_{recipients} = PerRecipientKEM.Decapsulate((sk_{ecc}^{i}, sk_{mlkem}^{i}), ct_{k})$
$\quad \quad success:$
$\quad \quad \quad return\ ss_{recipients}$
$\quad \quad error:$
$\quad \quad \quad continue$
$\quad throw\ KeyNotFoundError$
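The trial-decapsulation loop can be sketched in Python as below. Only the loop itself mirrors the process described here. The per-recipient KEM is passed in as a callable, and `toy_encapsulate` / `toy_decapsulate` are hypothetical, insecure stand-ins (an HMAC tag in front of the clear secret) whose only purpose is to make the example runnable without an AEAD or KEM library.

```python
import hashlib
import hmac
from typing import Callable, Iterable


class KeyNotFoundError(Exception):
    """Raised when no per-recipient ciphertext opens with the given key."""


def hybrid_kem_decapsulate(
    private_key: bytes,
    per_recipient_cts: Iterable[bytes],
    decapsulate: Callable[[bytes, bytes], bytes],
) -> bytes:
    """Try every per-recipient ciphertext until one authenticates."""
    for ct in per_recipient_cts:
        try:
            return decapsulate(private_key, ct)
        except ValueError:
            continue  # not for this key (or tampered): try the next one
    raise KeyNotFoundError("no ciphertext matches this private key")


# Toy stand-in for PerRecipientKEM. NOT secure and NOT confidential:
# a truncated HMAC tag plays the role of the AEAD tag check.
def toy_encapsulate(key: bytes, secret: bytes) -> bytes:
    return hmac.new(key, secret, hashlib.sha256).digest()[:16] + secret


def toy_decapsulate(key: bytes, ct: bytes) -> bytes:
    tag, secret = ct[:16], ct[16:]
    expected = hmac.new(key, secret, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(tag, expected):
        raise ValueError("tag mismatch")
    return secret
```

The cost of the loop grows linearly with the number of recipients, which is the price paid for not labelling each ciphertext with its recipient.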

Arguments

The shared secret is generated by a cryptographically secure RNG, so it can later be used as a shared secret in HPKE encryption
This secret is unique per archive, as it is generated on archive creation. Even converting an archive (convert) or cleaning a truncated one (clean-truncate) with the mlar CLI forces a fresh secret: no edit feature is implemented, even if it is doable, so these commands write a new archive. Hence, a new random symmetric key is used to encrypt the content when "converting" or "recovering" an archive.
Even if the AEAD decryption worked for a non-legitimate recipient, for instance following an intentional manipulation, the shared secret obtained will later be checked using Key commitment before decrypting actual data (see below)
Optimizations would have been possible here, such as sharing a common ephemeral key for the DHKEM. But the size gain is negligible compared to the MLKEM ciphertext size, and it would move the implementation away from the DHKEM in RFC 9180

Encryption

Notation

The "Multi-Recipient Hybrid KEM" process described above is noted:

$MultiRecipientHybridKEM.Encapsulate$ , taking a list of public keys $[(pk_{ecc}^{0}, pk_{mlkem}^{0}), ..., (pk_{ecc}^{n-1}, pk_{mlkem}^{n-1})]$ and returning a shared secret $ss_{recipients}$ and a ciphertext $ct_{recipients}$
$MultiRecipientHybridKEM.Decapsulate$ , taking a pair of private keys ( $sk_{ecc}^{i}$ and $sk_{mlkem}^{i}$ ), a ciphertext $ct_{recipients}$ and returning either a shared secret $ss_{recipients}$ if the recipient $i$ is a legitimate recipient (if the AEAD decryption works), or an error otherwise

KeyCommitmentChain is defined as the 64-byte array: -KEY COMMITMENT--KEY COMMITMENT--KEY COMMITMENT--KEY COMMITMENT-.

$KeySchedule_{hybrid}$ : KeySchedule function from RFC 9180 ³, instantiated with:

Mode: "Base"
KDF: HKDF-SHA-512
AEAD: AES-256-GCM
KEM: a custom KEM ID, numbered 0x1020

$ComputeNonce$ : function from RFC 9180 ³.

Process

To encrypt n bytes of data to a list of public keys $[(pk_{ecc}^{0}, pk_{mlkem}^{0}), ..., (pk_{ecc}^{n-1}, pk_{mlkem}^{n-1})]$ :

Compute a shared secret and the corresponding ciphertext:

$ss_{recipients}, ct_{recipients} = MultiRecipientHybridKEM.Encapsulate([(pk_{ecc}^{0}, pk_{mlkem}^{0}), ..., (pk_{ecc}^{n-1}, pk_{mlkem}^{n-1})])$

Derive the key and base nonce using HPKE

$(key, base\_nonce) = KeySchedule_{hybrid}(shared\_secret = ss_{recipients}, info = \text{"MLA Encrypt Layer"})$

Ensure key-commitment

$keycommit = Encrypt_{AES256GCM}(key = key, nonce = ComputeNonce(base\_nonce, 0), data = KeyCommitmentChain)$

For each 128 KiB $chunk_{j}$ of data:

$enc_{j} = Encrypt_{AES256GCM}(key = key, nonce = ComputeNonce(base\_nonce, j + 1), data = chunk_{j})$

Note: $j$ starts at 0. $j + 1$ is used because the sequence numbered 0 has already been used by the Key commitment.

When the layer is finalized, the last chunk of data (with a length less than or equal to 128 KiB) is encrypted the same way
Finally, a final chunk with sequence number $n + 1$ (where $n$ is the number of data chunks) and special content and additional authenticated data is appended:

$final\_chunk = Encrypt_{AES256GCM}(key = key, nonce = ComputeNonce(base\_nonce, n + 1), data = \text{"FINALBLOCK"}, aad = \text{"FINALAAD"})$

The resulting layer is composed of:

header: $ct_{recipients}$
data: $keycommit\ .\ enc_{0}\ .\ \dots\ .\ enc_{n-1}\ .\ final\_chunk$ (for $n$ data chunks)

Special care must be taken not to reuse a sequence number in implementations as this would be catastrophic given GCM properties. For $n$ chunks of data:

sequence 0: key commitment
sequence 1 to $n$ : data
sequence $n + 1$ : $final\_chunk$ with only the 10 bytes "FINALBLOCK" as content
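The sequence-number allocation above, together with the $ComputeNonce$ function of RFC 9180 (the base nonce XORed with the big-endian encoding of the sequence number), can be sketched as follows. The function names are illustrative and not taken from the MLA code base, the 12-byte nonce size is the RFC 9180 value for AES-256-GCM, and the explicit range check is an assumption that mimics the overflow-checked u64.

```python
KEY_COMMITMENT_CHAIN = b"-KEY COMMITMENT-" * 4  # the 64-byte commitment plaintext
NONCE_SIZE = 12  # Nn for AES-256-GCM in RFC 9180


def compute_nonce(base_nonce: bytes, seq: int) -> bytes:
    """RFC 9180 ComputeNonce: base_nonce XOR I2OSP(seq, Nn)."""
    if len(base_nonce) != NONCE_SIZE:
        raise ValueError("base nonce must be 12 bytes")
    if not 0 <= seq < 2**64:
        raise OverflowError("sequence number must fit in a u64")
    seq_bytes = seq.to_bytes(NONCE_SIZE, "big")
    return bytes(a ^ b for a, b in zip(base_nonce, seq_bytes))


def sequence_plan(n_chunks: int) -> list:
    """List (sequence number, role) pairs for an archive of n data chunks."""
    plan = [(0, "key commitment")]
    plan += [(j + 1, f"data chunk {j}") for j in range(n_chunks)]
    plan.append((n_chunks + 1, "final chunk"))
    return plan
```

Because XOR with a fixed base nonce is a bijection, distinct sequence numbers always give distinct nonces, so the only way to reuse a nonce under a given key is to reuse a sequence number.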

To decrypt the data at position $pos$ :

Once for the whole session, get the cryptographic materials

$ss_{recipients} = MultiRecipientHybridKEM.Decapsulate((sk_{ecc}^{i}, sk_{mlkem}^{i}), ct_{recipients})$
$(key, base\_nonce) = KeySchedule_{hybrid}(shared\_secret = ss_{recipients}, info = \text{"MLA Encrypt Layer"})$

Once for the whole session, check the key commitment

$commit = Decrypt_{AES256GCM}(key = key, nonce = ComputeNonce(base\_nonce, 0), data = keycommit)$

$assert\ commit = KeyCommitmentChain$

Retrieve the encrypted chunk of data

$start = pos - sizeof(keycommit)$
$j = pos \div 128KiB$

Where $\div$ is the Euclidean division.

Then: $chunk_{j} = Decrypt_{AES256GCM}(key = key, nonce = ComputeNonce(base\_nonce, j + 1), data = enc_{j})$
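As a small worked example of the chunk lookup, the Euclidean division can be written with Python's `divmod`. This sketch assumes that the position counts plaintext bytes of the inner layer. The in-chunk offset it returns is an addition for illustration (the text above only defines $j$), and the name `locate` is hypothetical.

```python
CHUNK_SIZE = 128 * 1024  # 128 KiB of plaintext per data chunk


def locate(pos: int) -> tuple:
    """Map a plaintext position to (chunk index j, nonce sequence, offset in chunk)."""
    if pos < 0:
        raise ValueError("position must be non-negative")
    j, offset = divmod(pos, CHUNK_SIZE)
    # Sequence 0 is reserved for the key commitment, so data chunk j uses j + 1.
    return j, j + 1, offset
```

For instance, position 131072 (exactly 128 KiB) falls at offset 0 of chunk 1, whose nonce is derived from sequence number 2.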

Arguments

Key commitment is always checked before returning clear-text data to the caller
AEAD tag of a chunk is always checked before returning the corresponding clear-text data to the caller
Arguments for HPKE use are very similar to the ones mentioned above. In particular, this is a standardized approach with existing analysis
As there are two kinds of custom KEM used ("Per-recipient KEM" and "Hybrid KEM"), two distinct KEM IDs are used. In addition, two distinct MLA-specific info strings are used to bind these derivations to MLA
As described in ⁵ and ²¹, AES in GCM mode does not ensure "key commitment". This property is added in the layer using the "padding fix" scheme from ⁵ with the recommended 512-bit size for 256-bit security
Key commitment is mainly used to ensure that two recipients will decrypt to the same plaintext if given the same ciphertext, i.e. an attacker modifying the header of an archive cannot provide two distinct plaintexts to two distinct recipients
AES-GCM is used as an industry standard AEAD
- the base nonce, and therefore each nonce used, is unique per archive because it is generated from the archive-specific shared secret, limiting the nonce-reuse risk to standard acceptability ³
- no more than $2^{64}$ chunks will be produced, as the sequence number type used in the MLA implementation is a u64 checked for overflow. This is a widely accepted limit for AES-GCM, and this value is also within the range provided by ³
- the tag size is 128-bits (standard one), avoiding attacks described in ²²
- 128KiB is lower than the maximum plaintext length for a single message in AES-GCM (64 GiB)²²

Seed derivation

The asymmetric encryption in MLA, particularly the KEMs, provides deterministic APIs.

These APIs are usually fed with cryptographically generated data, except for the regression tests and the "seed derivation" feature in mlar CLI.

This feature is meant to provide a way for clients to:

Implement a derivation tree
Keep the root secret in a safe place, and be able to recover the derived secrets

The derivation scheme is based on the same ideas as mla::crypto::hybrid::combine:

A dual-PRF (HKDF-Extract with a uniform random salt ²⁰) to extract entropy from the private key
HKDF-Expand to derive along the given path component

From a private key ( $sk_{ecc}^{i}$ and $sk_{mlkem}^{i}$ ), the secret is derived from the path component $pc$ through:

$ecc\_rnd = HKDF.Extract_{SHA512}(salt = 0, ikm = sk_{ecc}^{i})$
$seed = HKDF_{SHA512}(salt = ecc\_rnd, ikm = sk_{mlkem}^{i}, info = \text{"PATH DERIVATION"}\ .\ pc)$

To derive a key using a seed, a ChaCha20Rng is used. If a seed is provided, the ChaCha20Rng is seeded with the first 32 bytes of $SHA512(seed)$ . Otherwise, the seed comes from OS Cryptographic RNG sources.

A ChaCha20Rng is the ChaCha20²³ stream cipher fed with a seed as key and 8 null bytes as nonce.

The CSPRNG is then provided to MLA deterministic APIs.
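The seed derivation can be sketched with the Python standard library as below. It is a reading of the two formulas above rather than the mlar implementation. The byte encoding of the private keys and the 32-byte seed length are assumptions, `salt = 0` is taken as an all-zero salt, and the function names are hypothetical. The ChaCha20 generator itself is left out, since it is not available in the standard library.

```python
import hashlib
import hmac


def hkdf_sha512(salt: bytes, ikm: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-512: Extract, then Expand."""
    if length > 255 * 64:
        raise ValueError("requested output is too long for HKDF")
    prk = hmac.new(salt, ikm, hashlib.sha512).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_seed(sk_ecc: bytes, sk_mlkem: bytes, path_component: bytes,
                length: int = 32) -> bytes:
    """Derive a child seed; the 32-byte length is an assumption."""
    # Dual-PRF step: extract a uniformly random value from the ECC private key.
    ecc_rnd = hmac.new(b"\x00" * 64, sk_ecc, hashlib.sha512).digest()
    # Derive along the path component, mixing in the ML-KEM private key.
    return hkdf_sha512(ecc_rnd, sk_mlkem, b"PATH DERIVATION" + path_component, length)


def rng_seed_from(seed: bytes) -> bytes:
    """First 32 bytes of SHA-512(seed), used to seed the ChaCha20-based RNG."""
    return hashlib.sha512(seed).digest()[:32]
```

The same private key and path component always give the same seed, which is what makes lost derived keys recoverable from the parent key.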

Implementation specificities

External dependencies

MLA relies on several external cryptographic libraries for its primitives. Below is a summary of each dependency, its role, review status, and documentation coverage:

Symmetric Encryption

RustCrypto AES-GCM (aes-gcm crate)
- Role: AES-256-GCM authenticated encryption for archive data.
- Review ²⁴.
- Documentation: Well documented and widely used in the Rust ecosystem.
- Note: GCM mode is re-implemented in MLA for chunked/partial decryption.

Traditional Signature

ed25519-dalek
- Role: Ed25519ph signatures (RFC 8032) for the traditional part of hybrid signatures.
- Review ²⁵.
- Documentation: Well documented, widely used, and considered production-grade.

Post-Quantum Signature

ml-dsa (ml-dsa crate)
- Role: ML-DSA-87 signatures (FIPS 204) for the post-quantum part of hybrid signatures.
- Review: No formal third-party audit known at time of writing; code is open-source and used in research and reference implementations.
- Documentation: Documented in FIPS 204 and crate docs, but less mature than ed25519-dalek.
- Note: MLA uses ML-DSA-87, not HashML-DSA.

Hybrid Signature Logic

Custom implementation in MLA (hybrid_signature.rs, mlakey.rs)
- Role: Combines Ed25519ph and ML-DSA-87 for hybrid PQ/T signatures.
- Review: No third-party audit. Covered by internal tests and verified against official test vectors.

Asymmetric Encryption (Hybrid KEM)

RustCrypto MLKEM (ml-kem crate)
- Role: MLKEM-1024 (Kyber) for post-quantum key encapsulation.
- Review: No formal third-party audit; code quality and auditability considered good by the author.
- Documentation: FIPS 203 and crate docs; less mature than AES-GCM or Dalek.
curve25519-dalek / x25519-dalek
- Role: X25519 for pre-quantum key encapsulation (DHKEM).
- Review: curve25519-dalek is widely used and reviewed; respects SafeCurves criteria.
- Documentation: Well documented.
rust-hpke
- Role: HPKE primitives (KDF, LabeledExtract, LabeledExpand) and test vectors.
- Review ²⁶ (version 0.8).
- Documentation: Good, but custom KEM IDs and key schedule logic are re-implemented in MLA.

Random Number Generation

rand / getrandom / OsRng
- Role: Cryptographically secure random number generation for keys and nonces.
- Review: Well documented and widely used; uses OS sources (getrandom() syscall, /dev/urandom, RtlGenRandom).
- Documentation: getrandom crate docs.
- Note: On Linux, uses the getrandom() syscall and falls back on /dev/urandom. On Windows, uses the RtlGenRandom API.
rand_chacha / ChaCha20Rng
- Role: CSPRNG for deterministic seed derivation and key generation.
- Review: Well documented and widely used.
- Note: A ChaCha20Rng is seeded from bytes generated by OsRng to build a CSPRNG. This provides the actual bytes used in key and nonce generation.

Other

hkdf
- Role: HKDF-SHA512 for key derivation and hybrid KEM combiners.
- Review: Part of RustCrypto; well documented.
zeroize
- Role: Securely zeroes sensitive data in memory.
- Review: Well documented and widely used.

Design Choices and Rationale

Elliptic curve cryptography (Curve25519/X25519) is used over RSA due to the lack of ready-for-production Rust-based RSA libraries, the availability of audited Curve25519 libraries, and its widespread use and security properties (SafeCurves, Trail of Bits arguments).
AES-GCM is chosen for authenticated encryption due to its common use, hardware acceleration support (e.g., AES-NI), and avoidance of a class of attacks.
MLA’s hybrid approach (combining traditional and post-quantum primitives) limits the impact of potential flaws in less mature or unaudited libraries (e.g., MLKEM, ML-DSA).
MLA includes regression tests and test vectors for all critical cryptographic operations.
Custom cryptographic logic (e.g., hybrid KEM, hybrid signature, chunked AES-GCM) is described in this document and tested against standards.

Summary Table

Dependency	Purpose	Review Status	Documentation
aes-gcm	AES-256-GCM encryption	NCC Group	Excellent
ed25519-dalek	Ed25519ph signature	Quarkslab	Excellent
ml-dsa	ML-DSA-87 signature	None (open-source, FIPS)	Good
ml-kem	MLKEM-1024 (Kyber) KEM	None (open-source, FIPS)	Good
curve25519-dalek	X25519 KEM	Community, SafeCurves	Excellent
rust-hpke	HPKE primitives	Cloudflare	Good
rand/getrandom	OS CSPRNG	Community	Excellent
rand_chacha	ChaCha20Rng CSPRNG	Community	Excellent
hkdf	Key derivation	Community	Excellent
zeroize	Memory zeroization	Community	Excellent

AES-GCM re-implementation

While the AES and GHash bricks come from RustCrypto, the GCM mode for AES-256 has been re-implemented in MLA.

Indeed, the recover mode must be able to only partially decrypt a data chunk, and decide whether the associated tag must be verified or not. This API is not provided by the RustCrypto project, for very understandable reasons.

To ensure the implementation follows the standard, it is tested against AES-256-GCM test vectors in MLA regression tests.

HPKE Key Schedule re-implementation

For several reasons described in the code, but mainly due to the availability of API, the possibility to add custom KEM IDs and the relatively few lines needed for re-implementation, the $KeySchedule$ method has been re-implemented in MLA.

It still uses some bricks from rust-hpke, such as the KDF, $LabeledExtract$ and $LabeledExpand$ . It is tested against RFC 9180 ³ test vectors in MLA regression tests.

MLKEM implementation without a review

Thanks to the hybrid approach, a flawed implementation of MLKEM would have limited consequences. It satisfies ANSSI guidelines for the transition first phase to PQC hybridization ⁴. For this reason, MLA is eligible for a security visa evaluation.

For now, it is therefore accepted by the author (as a trade-off) to use an MLKEM implementation without existing review, in order to bring a reasonable protection against "Harvest now, decrypt later" attacks as soon as possible.

If a reviewed implementation with acceptable dependency emerges in the future, it can be easily swapped in MLA. Thus, MLA would also satisfy the requirements to get a security visa evaluation in the second and third phases of these guidelines by including its PQC implementation.

Security considerations

Plaintext length

The Encrypt layer does not hide the plaintext length.

Usually, this layer is used with the Compress layer. If an attacker knows the original file size, they might learn information about the original data entropy.

Hidden recipient list

Only the owner of a recipient's private key can determine that they are a recipient of the archive. In other words, while the recipient list remains private, the total number of recipients is still visible.

This is an intentional privacy feature.

Key derivation

This feature can help setup a hierarchical key infrastructure.

mlar provides a subcommand keyderive to deterministically derive sub-keys from a given key along a derivation path (a bit like BIP-32, except children public keys can't be derived from the parent one).

For instance, if one wants to derive the following scheme:

root_key
    ├──["App X"]── key_app_x
    │   └──["v1.2.3"]── key_app_x_v1.2.3
    └──["App Y"]── key_app_y

One can use the following commands:

# Create the root key (--seed can be used if this key must be created deterministically)
mlar keygen root_key
# Create App keys
mlar keyderive root_key key_app_x --path-component "App X"
mlar keyderive root_key key_app_y --path-component "App Y"
# Create the v1.2.3 key of App X
mlar keyderive key_app_x key_app_x_v1.2.3 --path-component "v1.2.3"

At this point, let's consider an outage happened and keys have been lost.

One can recover all the keys from the root_key private key. For instance, to recover key_app_x_v1.2.3:

mlar keyderive root_key recovered_key --path-component "App X" --path-component "v1.2.3"

As such, if the App X owner only knows key_app_x, they can recover all of its subkeys, including key_app_x_v1.2.3 but excluding key_app_y.

WARNING: This scheme does not provide any revocation mechanism. If a parent key is compromised, all keys in its sub-tree must be considered compromised (i.e. all past and future keys that can be obtained from it). The opposite is not true: a parent key remains safe if any of its child keys is compromised.
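The tree property used above (deriving along several path components is the same as deriving one component at a time) can be illustrated with a toy model. The one-step function below is a plain HMAC-SHA512 with a made-up label. It is a hypothetical stand-in chosen to keep the example self-contained, not the derivation performed by mlar keyderive.

```python
import hashlib
import hmac
from functools import reduce


def derive_child(parent_secret: bytes, component: str) -> bytes:
    """Toy one-way derivation step (NOT MLA's scheme): HMAC-SHA512 keyed by the parent."""
    return hmac.new(parent_secret, b"toy-derive:" + component.encode(),
                    hashlib.sha512).digest()


def derive_path(root_secret: bytes, components: list) -> bytes:
    """Apply derive_child once per path component, from the root downwards."""
    return reduce(derive_child, components, root_secret)
```

In this model, whoever holds the secret for "App X" can recompute everything below it, while moving back up to the root would require inverting the one-way step.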

Fuzzing

A fuzzing scenario made with afl.rs is available in mla-fuzz-afl. The scenario is capable of:

Creating archives with interleaved files, and different layers enabled
Reading them to check their content
Repairing the archive without truncation, and verifying it
Altering the archive raw data, and ensuring reading it does not panic (but only fail)
Repairing the altered archive, and ensuring the recovery doesn't fail (only reports detected errors)

To launch it:

produce initial samples by uncommenting produce_samples() in mla-fuzz-afl/src/main.rs

cd mla-fuzz-afl
# ... uncomment `produce_samples()` ...
mkdir in
mkdir out
cargo run

build and launch AFL

cargo afl build
cargo afl fuzz -i in -o out ../target/debug/mla-fuzz-afl

If you have found crashes, try to replay them with either:

Peruvian rabbit mode of AFL: cargo afl run -i - -o out -C ../target/debug/mla-fuzz-afl
Direct replay: ../target/debug/mla-fuzz-afl < out/crashes/crash_id
Debugging: uncomment the "Replay sample" part of mla-fuzz-afl/src/main.rs, and add dbg!() when it's needed

:warning: The stability is quite low, likely due to the process used for the scenario (deserialization from the data provided by AFL) and variability of inner algorithms, such as brotli. Crashes, if any, might not be reproducible or due to the mla-fuzz-afl inner working, which is a bit complex (and therefore likely buggy). One can comment irrelevant parts in mla-fuzz-afl/src/main.rs to ensure a better experience.
