Wojciech Kozlowski e276a05a3f Latest updates for backup testing 2024-04-20 20:01:21 +02:00

The Ansible Edda

Ansible playbooks for provisioning The Nine Worlds.

Running the playbooks

The main entrypoint for The Nine Worlds is main.yml.

Keyring integration

Keyring integration requires python3-keyring to be installed.

To set the keyring password run:

./vault-keyring-client.py --set [--vault-id <vault-id>]

If --vault-id is not specified, the password will be stored under ansible.

To use the password from the keyring invoke playbooks with:

ansible-playbook --vault-id @vault-keyring-client.py ...

Production and testing

The inventory files are split into inventory/production and inventory/testing.

To run the main.yml playbook on production hosts:

ansible-playbook -i inventory/production main.yml

To run the main.yml playbook on testing hosts:

ansible-playbook -i inventory/testing main.yml

Playbooks

The Ansible Edda playbook is composed of smaller playbooks. To run a single playbook, invoke it directly from the playbooks directory. For example, to run the playbooks/system.yml playbook:

ansible-playbook playbooks/system.yml

Alternatively, run the main playbook and select it by tag:

ansible-playbook main.yml --tags "system"

Roles

Playbooks are composed of roles defined in the roles submodule and playbooks/roles.

To play a specific role, e.g., system/base/sshd in the playbook system, run:

ansible-playbook playbooks/system.yml --tags "system:base:sshd"

To play all roles from a specific group, e.g., system/base in the playbook system, run:

ansible-playbook playbooks/system.yml --tags "system:base"

Some roles, e.g., services/setup/user, have sub-tasks which can also be invoked individually. To find the relevant tag, see the role's tasks/main.yml.

In all cases, the roles can also be invoked from the main playbook:

ansible-playbook main.yml --tags "system:base:sshd"
ansible-playbook main.yml --tags "system:base"
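
The tags in the examples above mirror the role paths, with `/` replaced by `:`. A minimal Python sketch of that mapping, assuming the convention holds for every role (the helper names `role_tag` and `playbook_command` are hypothetical and not part of this repository):

```python
def role_tag(role_path: str) -> str:
    """Convert a role path such as 'system/base/sshd' into its tag form."""
    # Empty segments (leading or trailing slashes) are dropped.
    return ":".join(part for part in role_path.split("/") if part)


def playbook_command(role_path: str, playbook: str = "main.yml") -> list[str]:
    """Build the ansible-playbook argv that plays one role or role group."""
    return ["ansible-playbook", playbook, "--tags", role_tag(role_path)]
```

For example, `playbook_command("system/base", "playbooks/system.yml")` yields the argv for the role-group command shown earlier. Sub-task tags may not follow this pattern, so check the role's tasks/main.yml for those.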

Testing virtual machines

The script for starting, stopping, and reverting the testing virtual machines is located at scripts/testing/vmgr.py.

Managing backup buckets

The scripts/restic/restic.py script provides a wrapper around restic to manage the backup buckets. The script collects the credentials from the OS keyring and constructs the restic command with the correct endpoint. It allows the user to focus on the actual command to be executed rather than authentication and bucket URLs.
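
To illustrate what such a wrapper does, the sketch below builds a restic invocation for an S3-compatible bucket. It is not the code in scripts/restic/restic.py: the function name, the default endpoint, and the bucket name are assumptions, and the credentials are passed in as plain arguments instead of being read from the keyring. restic itself reads S3 credentials from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and the repository password from `RESTIC_PASSWORD`.

```python
import os


def build_restic_call(bucket, args, access_key, secret_key, password,
                      endpoint="s3.fr-par.scw.cloud"):
    """Return (argv, env) for running restic against an S3-compatible bucket.

    The endpoint default is an assumption; use the one matching your bucket.
    """
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env["AWS_ACCESS_KEY_ID"] = access_key
    env["AWS_SECRET_ACCESS_KEY"] = secret_key
    env["RESTIC_PASSWORD"] = password
    repo = f"s3:https://{endpoint}/{bucket}"
    argv = ["restic", "--repo", repo, *args]
    return argv, env
```

The result can be handed to `subprocess.run(argv, env=env, check=True)`, so the caller only supplies the restic subcommand, e.g. `["snapshots"]`.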

The scripts/restic/restic.py requires the following entries in the keyring:

  • scaleway: access_key (Scaleway project ID),
  • scaleway: secret_key (Scaleway secret key),
  • restic: password.

The easiest way to set these values is with Python's keyring.set_password.
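
A minimal sketch of doing so, assuming each entry above maps to a `(service, username)` pair in the keyring, e.g. service `scaleway` with username `access_key`. The helper `store_entries` is hypothetical; `keyring.set_password(service, username, password)` is the real call from the `keyring` package.

```python
import getpass

# Assumed layout: (service, username) pairs matching the list above.
ENTRIES = (
    ("scaleway", "access_key"),
    ("scaleway", "secret_key"),
    ("restic", "password"),
)


def store_entries(read_secret=getpass.getpass, set_password=None):
    """Prompt for each secret and store it in the OS keyring."""
    if set_password is None:
        import keyring  # third-party package, imported only when needed
        set_password = keyring.set_password
    for service, username in ENTRIES:
        secret = read_secret(f"{service}/{username}: ")
        set_password(service, username, secret)
```

Calling `store_entries()` with no arguments prompts for the three secrets without echoing them and writes them to the default keyring backend.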

Testing backups

Setting up baldur on yggdrasil

  1. Create the zvol rpool/var/lib/libvirt/images/baldur for the testing OS.
  2. Create the zvol hpool/baldur for the backup data under test. It should have a capacity that's larger than what yggdrasil estimates for rpool/var/lib/the-nine-worlds/data (excluding datasets that are not backed up to the cloud).
  3. Set refreserv=0 on the zvols to make snapshots take less space.
    • zfs set refreserv=0 rpool/var/lib/libvirt/images/baldur
    • zfs set refreserv=0 hpool/baldur
  4. Install the same OS that is running on yggdrasil, but with a desktop environment, on rpool/var/lib/libvirt/images/baldur with hpool/baldur mounted within at /var/lib/the-nine-worlds/data.
  5. Create non-root user wojtek with sudo privileges.
  6. Configure SSH from the workstation to use yggdrasil as a jump server.
  7. Use ZFS for snapshots/rollback of the zvols.
    • zfs snapshot rpool/var/lib/libvirt/images/baldur@start
    • zfs snapshot hpool/baldur@start
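
The zvol commands in steps 3 and 7 can be generated for both zvols at once. The sketch below only builds the argument lists and does not run them; the helper `zfs_commands` and its action names are hypothetical, while the zvol names and the `start` snapshot name come from the steps above.

```python
ZVOLS = ("rpool/var/lib/libvirt/images/baldur", "hpool/baldur")


def zfs_commands(action, snapshot="start", zvols=ZVOLS):
    """Return one zfs argv list per zvol for the given action."""
    if action == "unreserve":
        return [["zfs", "set", "refreserv=0", z] for z in zvols]
    if action in ("snapshot", "rollback"):
        # Plain `zfs rollback` only succeeds when the named snapshot is the
        # most recent one; later snapshots would have to be destroyed first.
        return [["zfs", action, f"{z}@{snapshot}"] for z in zvols]
    raise ValueError(f"unknown action: {action}")
```

Each returned list can be passed to `subprocess.run(..., check=True)` as root on yggdrasil, for instance to roll both zvols back to `@start` between test runs.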

Provision baldur

  1. Provision baldur by running
    ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/baldur.yml
    
  2. Update /etc/the-nine-worlds/resolv.conf to point at a public DNS resolver, e.g., 1.1.1.1. Name resolution failures can cause containers to fail.
  3. Restore all the backups by ssh'ing into baldur and running (as root):
    /usr/local/sbin/restic-batch --config-dir /etc/the-nine-worlds/restic-batch.d restore
    
  4. Once the restore has completed, chown -R <user>:<user> all the restored directories in /var/lib/the-nine-worlds/data. Restic restores the UID information of the host from which the backup was taken, which may not match that of the new target machine. Note that restic applies permissions and ownership as a second step, after all the content has been restored, so the files will list root as owner while the restore is in progress.
  5. Start all the pod services with:
    ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_start.yml
    
    Give them some time to download all the images and start.
  6. Once the CPU returns to idle, check the state of all the pod services and their veth interfaces. Restart any affected pod if necessary: some containers fail to start up if the database takes too long to come online.
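
The ownership fix in step 4 can also be scripted. The sketch below is a rough Python equivalent of `chown -R` for one directory tree; the function name `chown_tree` is hypothetical, and it expects numeric IDs (which `pwd.getpwnam(<user>)` can provide). It must run as root to change ownership to another user.

```python
import os


def chown_tree(root, uid, gid):
    """Set owner and group on root and everything below it.

    Symlinks are re-owned themselves and are not followed.
    Returns the number of paths changed.
    """
    os.chown(root, uid, gid, follow_symlinks=False)
    changed = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            os.chown(path, uid, gid, follow_symlinks=False)
            changed += 1
    return changed
```

Run it once per restored directory under /var/lib/the-nine-worlds/data, after the restore has fully finished.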

Testing the backups

  1. Stop all services on yggdrasil to prevent accidental connections to the live services, which would defeat the point of testing the backups.
  2. Log into baldur. Testing from a VM (as opposed to a regular workstation) is important to prevent live applications from accidentally connecting to baldur.
  3. Modify /etc/hosts in the VM to point at rproxy (e.g., 10.66.3.8) for all relevant domains.
  4. Test each service manually one by one. Use the Flagfox add-on to verify that you are indeed connecting to baldur.
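
The /etc/hosts entries from step 3 can be generated in one go. In this sketch the helper `hosts_lines` and the domain names are hypothetical placeholders; only the example address 10.66.3.8 comes from the step above.

```python
def hosts_lines(proxy_ip, domains):
    """Format one /etc/hosts line per domain, all pointing at the proxy."""
    return [f"{proxy_ip}\t{domain}" for domain in domains]


# Placeholder domains; substitute the ones actually served by rproxy.
EXAMPLE = hosts_lines("10.66.3.8", ["cloud.example.org", "music.example.org"])
```

Append the resulting lines to /etc/hosts inside the VM, and remove them again when cleaning up.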

Cleaning up

  1. Stop all the pod services with:
    ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_stop.yml
    
  2. Delete the VM and the two zvols:
    • rpool/var/lib/libvirt/images/baldur,
    • hpool/baldur.