Remove backup testing how-to (moved to notes)

Wojciech Kozlowski 2024-12-30 18:56:09 +01:00
parent c782f0f74e
commit 758128a436


@@ -98,69 +98,3 @@ The `scripts/restic/restic.py` requires the following entries in the keyring:
- `restic`: `password`.
The easiest way to set these values is with Python's `keyring.set_password`.
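A minimal sketch of setting one such entry from the shell. It assumes that each list item maps to
`keyring.set_password(service, entry, secret)`, with `restic` as the service name and `password`
as the entry name; the helper name `set_keyring_entry` is hypothetical. The secret is read with
`getpass` so that it does not end up in the shell history.

```sh
# Hypothetical helper: store one secret in the system keyring.
# Assumption: the first argument is the keyring service name and the second
# is the entry (username) name, e.g. "restic" and "password".
set_keyring_entry() {
    service="$1"
    entry="$2"
    # With `python3 -c`, sys.argv[0] is "-c" and the arguments follow it.
    python3 -c '
import getpass
import sys

import keyring

service, entry = sys.argv[1], sys.argv[2]
secret = getpass.getpass(f"{service}/{entry}: ")
keyring.set_password(service, entry, secret)
' "$service" "$entry"
}

# Usage (prompts for the secret):
#   set_keyring_entry restic password
```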
## Testing backups
### Setting up baldur on yggdrasil
1. Create the zvol `rpool/var/lib/libvirt/images/baldur` for the testing OS.
2. Create the zvol `hpool/baldur` for the backup data under test. It should have a capacity that's
larger than what `yggdrasil` estimates for `rpool/var/lib/the-nine-worlds/data` (excluding
datasets that are not backed up to the cloud).
3. Set `refreserv=0` on the zvols to make snapshots take less space.
- `zfs set refreserv=0 rpool/var/lib/libvirt/images/baldur`
- `zfs set refreserv=0 hpool/baldur`
4. Install the same OS that is running on `yggdrasil`, but with a DE, on
`rpool/var/lib/libvirt/images/baldur`, with `hpool/baldur` mounted inside it at
`/var/lib/the-nine-worlds/data`.
5. Create non-root user `wojtek` with `sudo` privileges.
6. Configure SSH from the workstation to use `yggdrasil` as a jump server.
7. Use ZFS for snapshots/rollback of the zvols.
- `zfs snapshot rpool/var/lib/libvirt/images/baldur@start`
- `zfs snapshot hpool/baldur@start`
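The ZFS parts of the steps above (1 to 3 and 7) could be scripted roughly as follows. This is a
sketch, not the procedure actually used: the zvol sizes are placeholders because none are given
here, and the `run`, `setup_zvols`, `snapshot_zvols`, and `rollback_zvols` helpers are
hypothetical names. Commands are only printed unless `DRY_RUN=0` is set.

```sh
# Print commands by default; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Placeholder sizes (assumptions): size the data zvol from the estimate
# for rpool/var/lib/the-nine-worlds/data as described in step 2.
OS_ZVOL_SIZE="${OS_ZVOL_SIZE:-32G}"
DATA_ZVOL_SIZE="${DATA_ZVOL_SIZE:-500G}"

OS_ZVOL=rpool/var/lib/libvirt/images/baldur
DATA_ZVOL=hpool/baldur

# Steps 1-3: create both zvols and drop their reservation.
setup_zvols() {
    run zfs create -V "$OS_ZVOL_SIZE" "$OS_ZVOL"
    run zfs create -V "$DATA_ZVOL_SIZE" "$DATA_ZVOL"
    for vol in "$OS_ZVOL" "$DATA_ZVOL"; do
        run zfs set refreserv=0 "$vol"
    done
}

# Step 7: snapshot both zvols under the given tag, e.g. "start".
snapshot_zvols() {
    tag="$1"
    for vol in "$OS_ZVOL" "$DATA_ZVOL"; do
        run zfs snapshot "$vol@$tag"
    done
}

# Step 7: roll both zvols back to the given tag. Without -r this only
# works when the tag is the most recent snapshot of each zvol.
rollback_zvols() {
    tag="$1"
    for vol in "$OS_ZVOL" "$DATA_ZVOL"; do
        run zfs rollback "$vol@$tag"
    done
}
```

Running `setup_zvols` and `snapshot_zvols start` in dry-run mode first makes it easy to review the
exact commands before repeating them with `DRY_RUN=0`.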
### Provisioning baldur
1. Provision `baldur` by running
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/baldur.yml
```
2. Update `/etc/the-nine-worlds/resolv.conf` to point at a public DNS resolver, e.g., `1.1.1.1`.
Name resolution failures can cause containers to fail.
3. Restore all the backups by ssh'ing into `baldur` and running (as root):
```sh
/usr/local/sbin/restic-batch --config-dir /etc/the-nine-worlds/restic-batch.d restore
```
4. Once the restore has completed, `chown -R <user>:<user>` all the restored directories in
`/var/lib/the-nine-worlds/data`. Restic restores the UID information of the host from which the
backup was taken, which may not match that of the new target machine. Note that permissions
and ownership are restored in a second pass, after all the content has been restored, so the
files will list `root` as owner during the restoration.
5. Start all the pod services with:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_start.yml
```
Give them some time to download all the images and start.
6. Once the CPU returns to idle, check the state of all the pod services and their `veth`
interfaces. If necessary, restart the affected pod: some containers fail to start up if the
database takes too long to come online.
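The last step could be approximated with a small script run on `baldur`. This is a sketch under
assumptions: "idle" is taken to mean a 1-minute load average below `0.5`, the pods are assumed to
run as systemd units (so that failures show up in `systemctl --failed`), and the helper names
`load_below`, `wait_for_idle`, and `check_pods` are hypothetical.

```sh
# Succeed when load < threshold, compared numerically by awk.
load_below() {
    load="$1"
    threshold="$2"
    awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l < t) }'
}

# Poll the 1-minute load average (first field of /proc/loadavg) until it
# drops below the threshold. The default of 0.5 is an assumption.
wait_for_idle() {
    threshold="${1:-0.5}"
    while :; do
        read -r load _ < /proc/loadavg
        if load_below "$load" "$threshold"; then
            break
        fi
        sleep 10
    done
}

# List units that failed to start and the state of every veth interface.
check_pods() {
    systemctl --failed --no-pager
    ip -brief link show type veth
}

# Usage:
#   wait_for_idle 0.5 && check_pods
```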
### Testing the backups
1. Stop all services on `yggdrasil` to prevent accidental connections to the live services, which
would defeat the point of testing backups.
2. Log into `baldur`. Testing from a VM (as opposed to a regular workstation) is important to
prevent live applications from accidentally connecting to `baldur`.
3. Modify `/etc/hosts` in the VM to point at `rproxy` (e.g., `10.66.3.8`) for all relevant domains.
4. Test each service manually one by one. Use the Flagfox add-on to verify that you are indeed
connecting to `baldur`.
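For step 3, the `/etc/hosts` lines can be generated rather than typed by hand. The sketch below
uses a hypothetical helper `hosts_entries`; the domain names in the usage comment are placeholders,
since the relevant domains are not listed here.

```sh
# Print one /etc/hosts line per domain, all pointing at the same address.
hosts_entries() {
    ip="$1"
    shift
    for domain in "$@"; do
        printf '%s\t%s\n' "$ip" "$domain"
    done
}

# Usage inside the VM (placeholder domains):
#   hosts_entries 10.66.3.8 cloud.example.org git.example.org | sudo tee -a /etc/hosts
#   getent hosts cloud.example.org   # should now print 10.66.3.8
```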
### Cleaning up
1. Stop all the pod services with:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_stop.yml
```
2. Delete the VM and the two zvols:
- `rpool/var/lib/libvirt/images/baldur`,
- `hpool/baldur`.
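Step 2 might look like the sketch below on `yggdrasil`. It assumes the VM is a libvirt domain named
`baldur` (the name is inferred from the zvol path, not confirmed here), and `cleanup_commands` is a
hypothetical helper. It only prints the commands, so they can be reviewed before being piped to a
shell. `zfs destroy -r` also removes the snapshots of each zvol.

```sh
# Assumption: the libvirt domain is named like the VM.
VM_NAME="${VM_NAME:-baldur}"

# Print the destructive commands instead of running them.
cleanup_commands() {
    cat <<EOF
virsh destroy $VM_NAME
virsh undefine $VM_NAME
zfs destroy -r rpool/var/lib/libvirt/images/baldur
zfs destroy -r hpool/baldur
EOF
}

# Usage: review the output first, then execute it as root.
#   cleanup_commands
#   cleanup_commands | sudo sh
```

`virsh destroy` reports an error if the domain is already shut off; the remaining commands still
run because the piped shell does not stop on errors.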