Add notes on testing backups to README

Wojciech Kozlowski 2023-02-13 21:59:24 +01:00
parent d87423c244
commit faa68b0585


@ -48,8 +48,8 @@ ansible-playbook main.yml -i testing
### Testing virtual machines
Scripts for starting, stopping, and reverting the testing virtual machines are located in
`scripts/testing`.
The script for starting, stopping, and reverting the testing virtual machines is located in
`scripts/testing/vmgr.py`.
### Playbooks
@ -101,3 +101,46 @@ Or from the main playbook:
``` sh
ansible-playbook main.yml --tags "system:base:sshd"
```
## Testing backups
Before testing the backups, you may want to shut `yggdrasil` down for extra confidence that it is
not accessed or modified during the process. It is easy to reach `yggdrasil` by accident if
`/etc/hosts` in the test VM has not been modified, a step that is easy to forget.
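One way to guard against that mistake is to check, from inside the test VM, which address a hosts file assigns to a given domain. The following is a minimal sketch, not part of the repository: the `hosts_addr` helper, the domain `cloud.example.org`, and the address `203.0.113.10` are all hypothetical placeholders.

```sh
# Print the first address that a hosts-format file maps a name to.
# Prints nothing if the name is not listed.
#   $1: hostname to look up
#   $2: hosts file to read (defaults to /etc/hosts)
hosts_addr() {
    awk -v name="$1" '
        { sub(/#.*/, "") }   # strip comments before matching
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Example (hypothetical domain and address; substitute your own):
# [ "$(hosts_addr cloud.example.org)" = "203.0.113.10" ] || echo "not pinned to baldur"
```

This only inspects the file; it does not prove that a browser or other client actually resolves the name that way, so it complements rather than replaces a check in the client itself.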
1. Create `baldur` by running:
```sh
python scripts/scaleway/baldur.py create --volume-size <size-in-GB>
```
Pick a volume size that's larger than what `yggdrasil` estimates for
`rpool/var/lib/yggdrasil/data`.
2. Provision `baldur` by running:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/baldur.yml
```
3. Restore all the backups by connecting to `baldur` over SSH and running (as root):
```sh
/usr/local/sbin/restic-batch --config-dir /etc/restic-batch.d restore
```
4. Start all the pod services with:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_start.yml
```
Give them some time to download all the images and start.
5. Once the CPU returns to idle, check the state of all the pod services and their `veth`
interfaces. If necessary, restart any affected pods. Sometimes they fail to start (presumably due
to issues related to limited CPU and RAM).
6. Boot into a test VM, ideally one installed onto a virtual disk, since the live system might not
have enough space. A VM is used to make sure that none of the services on the host workstation
connect to `baldur` by accident.
7. Modify `/etc/hosts` in the VM to point at `baldur` for all relevant domains.
8. Test each service manually one by one. Use the Flagfox add-on to verify that you are indeed
connecting to `baldur`.
9. Stop all the pod services with:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_stop.yml
```
10. Destroy `baldur` by running:
```sh
python scripts/scaleway/baldur.py delete
```
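For the `veth` check in step 5, a small filter over the output of `ip -brief link show type veth` can help spot interfaces that did not come up. This is only a sketch: the `veth_not_up` helper is hypothetical, and it assumes that a healthy `veth` interface reports the state `UP` in the second column of the brief output.

```sh
# Read `ip -brief link` output on stdin and print the names of
# interfaces whose state (second column) is anything other than UP.
veth_not_up() {
    awk '$2 != "UP" { print $1 }'
}

# Example, run as root on baldur:
# ip -brief link show type veth | veth_not_up
```

An empty result suggests every `veth` interface is up. Any name that is printed is a hint about which pod to inspect, not proof that the pod has failed.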