# The Ansible Edda

Ansible playbooks for provisioning **The Nine Worlds**.

## Running the playbooks

The main entrypoint for **The Nine Worlds** is [`main.yml`](main.yml).

### Keyring integration

Keyring integration requires `python3-keyring` to be installed.

To set the keyring password, run:

``` sh
./vault-keyring-client.py --set [--vault-id <vault-id>]
```

If `--vault-id` is not specified, the password will be stored under `ansible`.

To use the password from the keyring, invoke playbooks with:

``` sh
ansible-playbook --vault-id @vault-keyring-client.py ...
```
### Production and testing

The inventory files are split into [`inventory/production`](inventory/production) and
[`inventory/testing`](inventory/testing).

To run the `main.yml` playbook on production hosts:

``` sh
ansible-playbook -i inventory/production main.yml
```

To run the `main.yml` playbook on testing hosts:

``` sh
ansible-playbook -i inventory/testing main.yml
```
### Playbooks

The Ansible Edda playbook is composed of smaller [`playbooks`](playbooks). To run a single playbook,
invoke it directly from the `playbooks` directory. For example, to run the
[`playbooks/system.yml`](playbooks/system.yml) playbook, run:

``` sh
ansible-playbook playbooks/system.yml
```

Alternatively, you can select it by its tag from the main playbook:

``` sh
ansible-playbook main.yml --tags "system"
```
### Roles

Playbooks are composed of roles defined in the
[`roles`](http://git.thenineworlds.net/the-nine-worlds/ansible-roles) submodule and
[`playbooks/roles`](playbooks/roles).

To play a specific role, e.g., `system/base/sshd` in the playbook `system`, run:

``` sh
ansible-playbook playbooks/system.yml --tags "system:base:sshd"
```

To play all roles from a specific group, e.g., `system/base` in the playbook `system`, run:

``` sh
ansible-playbook playbooks/system.yml --tags "system:base"
```

Some roles, e.g., `services/setup/user`, have sub-tasks which can also be invoked individually. To
find the relevant tag, see the role's `tasks/main.yml`.

In all cases, the roles can also be invoked from the main playbook:

``` sh
ansible-playbook main.yml --tags "system:base:sshd"
ansible-playbook main.yml --tags "system:base"
```

## Testing virtual machines

The script for starting, stopping, and reverting the testing virtual machines is located in
[`scripts/testing/vmgr.py`](scripts/testing/vmgr.py).
## Managing backup buckets

The [`scripts/restic/restic.py`](scripts/restic/restic.py) script provides a wrapper around restic
to manage the backup buckets. The script collects the credentials from the OS keyring and constructs
the restic command with the correct endpoint. It allows the user to focus on the actual command to
be executed rather than authentication and bucket URLs.
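
As a rough illustration of the idea, and not the actual contents of `restic.py`, a wrapper of this kind might look like the sketch below. It assumes the keyring entries this README lists for the script, maps them onto restic's standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `RESTIC_PASSWORD` environment variables, and expects the caller to supply the repository URL. The function name `build_restic_invocation` and its parameters are hypothetical.

```python
import os


def build_restic_invocation(get_password, repository, restic_args, base_env=None):
    """Return (argv, env) for running restic against `repository`.

    `get_password(service, username)` is expected to behave like
    `keyring.get_password`; it is passed in so that this sketch has no
    hard dependency on the keyring package.
    """
    env = dict(os.environ if base_env is None else base_env)
    # restic's S3 backend reads these standard environment variables.
    env["AWS_ACCESS_KEY_ID"] = get_password("scaleway", "access_key")
    env["AWS_SECRET_ACCESS_KEY"] = get_password("scaleway", "secret_key")
    env["RESTIC_PASSWORD"] = get_password("restic", "password")
    # Assumption: the caller supplies the full repository URL for the bucket.
    argv = ["restic", "--repo", repository, *restic_args]
    return argv, env
```

With `python3-keyring` installed, passing `keyring.get_password` as the first argument and handing the result to `subprocess.run(argv, env=env)` would run the chosen restic subcommand, e.g., `snapshots`, without typing any credentials.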

The `scripts/restic/restic.py` script requires the following entries in the keyring:

- `scaleway`: `access_key` (Scaleway project ID),
- `scaleway`: `secret_key` (Scaleway secret key),
- `restic`: `password`.

The easiest way to set these values is with Python's `keyring.set_password`.
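
A minimal sketch of doing so is shown below. It assumes that the first name in each entry is the keyring service and the second is the username, i.e., the first two arguments of `keyring.set_password(service, username, password)`. The helper `store_entries` is hypothetical and takes the setter as an argument so that it can be tried without touching a real keyring.

```python
import getpass

# Service/username pairs required by scripts/restic/restic.py.
# Assumption: they map directly onto keyring's (service, username) arguments.
KEYRING_ENTRIES = [
    ("scaleway", "access_key"),
    ("scaleway", "secret_key"),
    ("restic", "password"),
]


def store_entries(set_password, prompt=getpass.getpass):
    """Prompt for each secret without echoing it and store it via `set_password`."""
    for service, username in KEYRING_ENTRIES:
        secret = prompt(f"{service} / {username}: ")
        set_password(service, username, secret)
```

In an interactive Python session with `python3-keyring` installed, `import keyring` followed by `store_entries(keyring.set_password)` would prompt for the three secrets in turn.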

## Testing backups

### Setting up baldur on yggdrasil

1. Create the zvol `rpool/var/lib/libvirt/images/baldur` for the testing OS.
2. Create the zvol `hpool/baldur` for the backup data under test. It should have a capacity that's
   larger than what `yggdrasil` estimates for `rpool/var/lib/the-nine-worlds/data` (excluding
   datasets that are not backed up to the cloud).
3. Set `refreserv=0` on the zvols to make snapshots take less space.
   - `zfs set refreserv=0 rpool/var/lib/libvirt/images/baldur`
   - `zfs set refreserv=0 hpool/baldur`
4. Install the same OS that is running on `yggdrasil`, but with a desktop environment, on
   `rpool/var/lib/libvirt/images/baldur` with `hpool/baldur` mounted inside it at
   `/var/lib/the-nine-worlds/data`.
5. Create non-root user `wojtek` with `sudo` privileges.
6. Configure SSH from the workstation to use `yggdrasil` as a jump server.
7. Use ZFS for snapshots/rollback of the zvols.
   - `zfs snapshot rpool/var/lib/libvirt/images/baldur@start`
   - `zfs snapshot hpool/baldur@start`
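
The last step lists only the snapshot commands. The sketch below generates both the snapshot and the matching rollback command lines for the two zvols. The helper `zfs_commands` is hypothetical, and the use of `zfs rollback -r` (which also destroys any snapshots newer than the target) is an assumption about how one might return to the `start` state. Nothing is executed; the commands are only printed.

```python
# The two zvols created in the steps above.
ZVOLS = [
    "rpool/var/lib/libvirt/images/baldur",
    "hpool/baldur",
]


def zfs_commands(action, snapshot="start"):
    """Return one zfs argv list per zvol for `action` ('snapshot' or 'rollback')."""
    if action == "snapshot":
        prefix = ["zfs", "snapshot"]
    elif action == "rollback":
        # Assumption: -r is wanted, so snapshots newer than the target are destroyed.
        prefix = ["zfs", "rollback", "-r"]
    else:
        raise ValueError(f"unsupported action: {action}")
    return [prefix + [f"{zvol}@{snapshot}"] for zvol in ZVOLS]


# Print the rollback commands so they can be reviewed before being run as root.
for argv in zfs_commands("rollback"):
    print(" ".join(argv))
```

Each returned list can be passed to `subprocess.run` on `yggdrasil` if running the commands from Python is preferred over copying them into a shell.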
### Provision baldur

1. Provision `baldur` by running:

   ```sh
   ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/baldur.yml
   ```

2. Update `/etc/the-nine-worlds/resolv.conf` to point at a public DNS resolver, e.g., `1.1.1.1`.
   Name resolution failures can cause containers to fail.
3. Restore all the backups by SSHing into `baldur` and running (as root):

   ```sh
   /usr/local/sbin/restic-batch --config-dir /etc/the-nine-worlds/restic-batch.d restore
   ```

4. Once the restore has completed, `chown -R <user>:<user>` all the restored directories in
   `/var/lib/the-nine-worlds/data`. Restic restores the UID information of the host from which the
   backup was performed, which may not match that of the new target machine. Note that permissions
   and ownership are restored as a second step once all the content is restored. Therefore, the
   files will list `root` as owner during the restoration.
5. Start all the pod services with:

   ```sh
   ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_start.yml
   ```

   Give them some time to download all the images and start.
6. Once the CPU returns to idle, check the state of all the pod services and their `veth`
   interfaces. If necessary, restart the affected pod; some containers fail to start up if the
   database takes too long to come online.

### Testing the backups

1. Stop all services on `yggdrasil` to prevent accidental connections to the live services, which
   would defeat the point of testing backups.
2. Log into `baldur`. Testing from a VM (as opposed to a regular workstation) is important to
   prevent live applications from accidentally connecting to `baldur`.
3. Modify `/etc/hosts` in the VM to point at `rproxy` (e.g., `10.66.3.8`) for all relevant domains.
4. Test each service manually one by one. Use the Flagfox add-on to verify that you are indeed
   connecting to `baldur`.
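
For the `/etc/hosts` step, the lines to append can be generated rather than typed by hand. The sketch below is only an illustration: the helper `hosts_entries` and the domain names are hypothetical, and `10.66.3.8` is the example address from the step above.

```python
def hosts_entries(address, domains):
    """Return /etc/hosts lines that map every domain to `address`."""
    return [f"{address}\t{domain}" for domain in domains]


# Hypothetical domains; replace them with the ones actually served through rproxy.
EXAMPLE_DOMAINS = ["cloud.example.org", "git.example.org"]

# Print the lines so they can be reviewed and appended to /etc/hosts in the VM.
print("\n".join(hosts_entries("10.66.3.8", EXAMPLE_DOMAINS)))
```

Printing the lines instead of writing the file directly keeps the edit to `/etc/hosts` a deliberate manual step.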

### Cleaning up

1. Stop all the pod services with:

   ```sh
   ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_stop.yml
   ```

2. Delete the VM and the two zvols:
   - `rpool/var/lib/libvirt/images/baldur`,
   - `hpool/baldur`.