This section gives an overview of the architecture of Krill, which is important to keep in mind when deploying the application in your infrastructure. It explains how and where data is stored, how to make your setup redundant, and how to save and restore backups.
Krill does NOT support clustering at this time. You can achieve high availability by doing a fail-over to a standby inactive installation using the same data and configuration. However, you cannot have multiple active instances. This feature is on our long term roadmap.
Used Disk Space
Krill stores all of its data under the DATA_DIR. For users who will operate a CA under an RIR / NIR parent the following sub-directories are relevant:
- data_dir/ssl: the HTTPS key and certificate used by Krill
- data_dir/cas: the history of your CA(s) in raw JSON format
- data_dir/pubd: if used, the history of your Publication Server
Note that old versions of Krill also used the directories
data_dir/rfc6492 for storing all
protocol messages exchanged between your CAs and their parent
and repository. If they are still present on your system, you
can safely remove them and save space, potentially quite a bit.
Krill offers the option to archive old, less relevant, historical information
related to publication. You can enable this by setting the option
archive_threshold_days in your configuration file. If set, Krill will move
all publication events older than the specified number of days to a subdirectory
archived under the relevant data directory, i.e.
data_dir/pubd/0/archived if you are using the Krill Publication Server and
data_dir/cas/<your-ca-name>/archived for each of your CAs.
You can set up a cron job to delete these events once and for all, but we recommend that you save them in long-term storage if you can. The reason is that if (and only if) you have this data, you will be able to rebuild the complete Krill state based on its audit log of events, and irrefutably prove that no changes were made to Krill other than the changes recorded in the audit trail. We have no tooling for this yet, but we have an issue on our backlog.
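As a minimal sketch of the long-term storage option, the script below packs each archived subdirectory into a compressed tarball instead of deleting it. The helper name pack_archived and the destination directory are hypothetical and not part of Krill; only the archived directory name comes from the text above.

```python
import tarfile
import time
from pathlib import Path


def pack_archived(data_dir: Path, dest_dir: Path) -> list[Path]:
    """Pack every non-empty 'archived' directory under data_dir into a tarball.

    Hypothetical helper, not part of Krill. Nothing is deleted here; remove
    the originals yourself once the tarballs have been verified.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    created = []
    for archived in sorted(data_dir.rglob("archived")):
        if not archived.is_dir() or not any(archived.iterdir()):
            continue
        # Name the tarball after the path below data_dir, e.g. cas_my-ca_archived.
        label = "_".join(archived.relative_to(data_dir).parts)
        target = dest_dir / f"{label}-{stamp}.tar.gz"
        with tarfile.open(target, "w:gz") as tar:
            tar.add(archived, arcname=label)
        created.append(target)
    return created
```

Run from a cron job, this keeps the audit trail intact while still allowing the archived directories to be emptied afterwards.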
Saving State Changes
You can skip this section if you’re not interested in the gory details. However, understanding this section will help to explain how backup and restore work in Krill, and why a standby fail-over node can be used while multiple active nodes would require changes to Krill’s locking and storage mechanism.
State changes in Krill are tracked using events. Krill CA(s) and Publication Servers are versioned. They can only be changed by applying an event for a specific version. An event just contains the data that needs to be changed. Crucially, events cannot cause any side effects. As such, the overall state can always be reconstituted by applying all past events. This concept is called event-sourcing, and in this context the CAs and Publication Servers are so-called aggregates.
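The idea can be illustrated with a small, self-contained sketch. The Event and Aggregate classes below are hypothetical and far simpler than Krill's real types; they only show that an event applies to one specific version and that replaying all events reproduces the same state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    version: int  # the aggregate version this event applies to
    change: dict  # only the data that changes; applying it has no side effects


@dataclass
class Aggregate:
    version: int = 0
    state: dict = field(default_factory=dict)

    def apply(self, event: Event) -> None:
        # An event can only be applied to the exact version it was made for.
        if event.version != self.version:
            raise ValueError(
                f"event is for version {event.version}, aggregate is at {self.version}"
            )
        self.state.update(event.change)
        self.version += 1


def rebuild(events: list[Event]) -> Aggregate:
    """Reconstitute an aggregate by replaying all past events in order."""
    aggregate = Aggregate()
    for event in events:
        aggregate.apply(event)
    return aggregate
```

Because applying an event is deterministic, calling rebuild twice on the same list of events yields the same state and version.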
Events are not applied directly. Rather, users of Krill and background jobs will send their intent to make a change through the API, which then translates this into a so-called command. Krill will then lock the target aggregate and send the command to it. This locking mechanism is not aware of any clustering, and it’s a primary reason why Krill cannot run as an active-active cluster yet.
Upon receiving a command the aggregate (your CA etc.) will do some work. In some cases a command can have a side effect. For example, it may instruct your CA to create a new key pair after receiving entitlements from its parent. The key pair is random, so applying the command again would result in a different key pair. Remember that commands are not re-applied to aggregates; only their resulting events are. Thus, in this example, the command produces an event that contains the resulting key pair.
After receiving the command, the aggregate will return one of the following:
- An error
Usually this means that the command is not applicable to the aggregate state. For example, you may have tried to remove a ROA which does not exist.
When Krill encounters such an error, it will store the command with some meta-information like the time the command was issued, and a summary of the error, so that it can be seen in the history. It will then unlock the aggregate, so that the next command can be sent to it.
- No error, zero events
In this case the command turned out to be a no-op, and Krill just unlocks the aggregate. The command sequence counter is not updated, and the command is not saved. This is used as a feature whenever the ‘republish’ background job kicks in. A ‘republish’ command is sent, but it will only have an actual effect if there was a need to republish — e.g. a manifest would need to be re-issued before it would expire.
- One or more events
In this case there is a desired state change in a Krill aggregate. Krill will now apply and persist the changes in the following order:
Each event is stored. If an event already exists for a version, then the update is aborted. Because Krill cannot run as a cluster, and it uses locking to ensure that updates are done in sequence, this will only fail on the first event if a user tried to issue concurrent updates to the same CA.
On every fifth event a snapshot of the state is saved to a new file. If this is successful then the old snapshot (if there is one) is renamed and kept as a backup snapshot. The new snapshot is then renamed to the ‘current’ snapshot.
When all events are saved, the command is saved, enumerating all resulting events and including meta-information such as the time that the command was executed. Once multiple users are supported, this will also include who made the change.
Finally the version information file for the aggregate is updated to indicate its current version, and command sequence counter.
Krill will crash, by design, if there is any failure in saving any of the above files to disk. If Krill cannot persist its state it should not try to carry on. It could lead to discrepancies between in-memory and on-disk state that are impossible to fix. Therefore, crashing and forcing an operator to look at the system is the only sensible thing Krill can do. Fortunately, this should not happen unless there is a serious system failure.
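The three outcomes and the order of persistence described in this section can be sketched as follows. This is an in-memory illustration under stated assumptions, not Krill's implementation: the Store class, the process function and all field names are hypothetical, and plain dictionaries stand in for the files Krill writes to disk.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Store:
    """In-memory stand-in for one aggregate's files (hypothetical layout)."""
    events: dict = field(default_factory=dict)    # version -> event data
    commands: list = field(default_factory=list)  # saved commands, in order
    snapshot: Optional[tuple] = None              # (version, state)
    backup_snapshot: Optional[tuple] = None
    info: dict = field(default_factory=lambda: {"version": 0, "command_seq": 0})
    state: dict = field(default_factory=dict)
    lock: threading.Lock = field(default_factory=threading.Lock)


def process(store: Store, name: str, handler) -> str:
    """Send one command; handler returns a list of changes or raises ValueError."""
    with store.lock:  # one command at a time per aggregate
        try:
            changes = handler(store.state)
        except ValueError as err:
            # Error: keep the command with a summary of the error for the history.
            # The text does not say whether the sequence counter advances here;
            # this sketch leaves it unchanged.
            store.commands.append({"name": name, "error": str(err), "events": []})
            return "error"
        if not changes:
            # No-op: nothing is saved and the sequence counter is untouched.
            return "no-op"
        first = store.info["version"]
        # 1. Store each event; abort if an event already exists for a version.
        for offset, change in enumerate(changes):
            version = first + offset
            if version in store.events:
                raise RuntimeError(f"event for version {version} already exists")
            store.events[version] = change
            store.state.update(change)
            # 2. On every fifth event, keep the old snapshot as backup and
            #    make the new one current.
            if (version + 1) % 5 == 0:
                store.backup_snapshot = store.snapshot
                store.snapshot = (version + 1, dict(store.state))
        # 3. Save the command, listing the versions of all resulting events.
        versions = list(range(first, first + len(changes)))
        store.commands.append({"name": name, "error": None, "events": versions})
        # 4. Update the version information last.
        store.info["version"] = first + len(changes)
        store.info["command_seq"] += 1
        return "applied"
```

Writing the version information last is what makes an interrupted write detectable: anything stored beyond the recorded version belongs to an incomplete transaction.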
Loading State at Startup
Krill will rebuild its internal state whenever it starts. If it finds surplus
events or commands compared to the latest version information for any of the
aggregates, it will assume that they are present either because Krill stopped
in the middle of writing a transaction of changes to disk, or because your
backup was taken in the middle of a transaction. Such surplus files are backed
up to a subdirectory called surplus under the relevant data directory, i.e.
data_dir/pubd/0/surplus if you are using the Krill Publication Server and
data_dir/cas/<your-ca-name>/surplus for each of your CAs.
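A rough sketch of this clean-up step is shown below. It assumes hypothetical file names of the form event-<version>.json and a recorded version that equals the number of events already applied; Krill's actual file naming and checks may differ.

```python
import shutil
from pathlib import Path


def move_surplus(aggregate_dir: Path, recorded_version: int) -> list[str]:
    """Move event files at or beyond recorded_version into 'surplus'.

    Hypothetical helper: assumes files named 'event-<version>.json', which is
    an illustration and not Krill's real on-disk naming.
    """
    surplus_dir = aggregate_dir / "surplus"
    moved = []
    for path in sorted(aggregate_dir.glob("event-*.json")):
        version = int(path.stem.split("-", 1)[1])
        if version >= recorded_version:
            # Keep the file for inspection instead of deleting it.
            surplus_dir.mkdir(exist_ok=True)
            shutil.move(str(path), str(surplus_dir / path.name))
            moved.append(path.name)
    return moved
```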
Recover State at Startup
When Krill starts, it will try to go back to the last possible recoverable state if:
- it cannot rebuild its state at startup due to data corruption
- the environment variable:
- the configuration file contains always_recover_data = true
Under normal circumstances, i.e. when there is no data corruption, performing this recovery will not be necessary. It can also take significant time due to all the checks performed. So, we do not recommend forcing this.
Krill will try the following checks and recovery attempts:
Verify each recorded command and its effects (events) in their historical order.
If any command or event file is corrupt it will be moved to a subdirectory called corrupt under the relevant data directory, and all subsequent commands and events will be moved to a subdirectory called surplus under the relevant data directory.
Verify that each snapshot file can be parsed. If it can’t then this file is moved to the relevant
If a snapshot file could not be parsed, try to parse the backup snapshot. If this file can’t be parsed, move it to the relevant
Try to rebuild the state to the last recoverable state, i.e. the last known good event. Note that if this pre-dates the available snapshots, or if no snapshots are available, Krill will try to rebuild state by replaying all events. If you had enabled archiving of events, it will not be able to rebuild state.
If rebuilding state failed, Krill will now exit with an error.
Note that in case of data corruption Krill may be able to fall back to an earlier recoverable state, but this state may be far in the past. You should always verify your ROAs and/or delegations to child CAs in such cases.
Of course, it’s best to avoid data corruption in the first place. Please monitor available disk space, and make regular backups.
Backup / Restore
Backing up Krill is as simple as backing up its data directory. There is no need to stop Krill during the backup. To restore, put back your data directory and make sure that you refer to it in the configuration file that you use for your Krill instance. As described above, if Krill finds that the backup contains an incomplete transaction, it will just fall back to the state prior to it.
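A backup along these lines could look like the sketch below, which writes a timestamped tarball of the data directory and unpacks it again. The function names, the tarball name and the target directories are assumptions made for illustration; any tool that copies the directory works equally well.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir: Path, backup_dir: Path) -> Path:
    """Write a timestamped, compressed tarball of data_dir and return its path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"krill-data-{time.strftime('%Y%m%dT%H%M%S')}.tar.gz"
    with tarfile.open(target, "w:gz") as tar:
        # Store everything under a fixed top-level name so restores are predictable.
        tar.add(data_dir, arcname="data")
    return target


def restore_data_dir(backup: Path, restore_parent: Path) -> Path:
    """Unpack a backup below restore_parent and return the restored data directory."""
    restore_parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(backup) as tar:
        tar.extractall(restore_parent)
    return restore_parent / "data"
```

After a restore, point the data directory setting in your configuration file at the returned path. The tarball produced here is not encrypted.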
You may want to encrypt your backup, because the data_dir/ssl directory
contains your private keys in clear text. Encrypting your backup will help
protect these, but of course also implies that you can only restore if you
have the ability to decrypt it.
All Krill versions 0.4.1 and upwards can be automatically upgraded to the current version. Any required data migrations will be performed automatically. To do so we recommend that you:
back up your Krill data directories
install the new version of Krill
stop the running Krill instance
start Krill again, using the new binary, and the same configuration
If you want to test if data migrations will work correctly for your data, you can do the following:
copy your data directory to another system
set the env variable
create a configuration file, and set
start up Krill
Krill will then perform the data migrations, rebuild its state, and then exit before doing anything else.
Downgrading Krill data is not supported. So, downgrading can only be achieved by installing a previous version of Krill and restoring a backup from before your upgrade.
Proxy and HTTPS
Krill uses HTTPS and refuses to do plain HTTP. By default Krill will generate a
2048 bit RSA key and self-signed certificate in
/ssl in the data
directory when it is first started. Replacing the self-signed certificate with a
TLS certificate issued by a CA works, but has not been tested extensively. By
default Krill will only be available under
If you need to access the Krill UI or API (also used by the CLI) from another machine, you can use a proxy server such as NGINX or Apache to proxy requests to Krill. This proxy can then also use a proper HTTPS certificate and production-grade TLS support.
Proxy Krill UI
The Krill UI and assets are hosted directly under the base path
/. So, in
order to proxy to the Krill UI you should proxy ALL requests under
/ to the
Note that although the UI and API are protected by a token, you should consider further restrictions in your proxy setup, such as restrictions on source IP or adding your own authentication.
Proxy Krill as Parent
If you delegated resources to child CAs then you will need to ensure that these
children can reach your Krill. Child requests for resource certificates are
directed to the
/rfc6492 directory under the
service_uri that you
defined in your configuration file.
Note that, unlike for the UI, you should not add any additional authentication mechanisms to this location. RFC 6492 uses cryptographically signed messages sent over HTTP and is secure. However, verifying messages and signing responses can be computationally heavy, so if you know the source IP addresses of your child CAs, you may wish to restrict access based on this.
Proxy Krill as Publication Server
If you are running Krill as a Publication Server, refer to the Publication Server documentation for the specific setup required.
We recommend that you do not make Krill available to the public internet unless you really need remote access to the UI or API, or you are serving as parent CA or Publication Server for other CAs.