Update README

Wojciech Kozlowski 2023-11-04 22:44:07 +01:00
parent 6954490bf4
commit 97ea02c904
2 changed files with 75 additions and 283 deletions

README.md

@@ -1,18 +1,10 @@
# The Ansible Edda
Ansible playbooks for provisioning **The Nine Worlds**.
## Secrets vault
- Encrypt with: `ansible-vault encrypt vault.yml`
- Decrypt with: `ansible-vault decrypt secrets.yml`
- Encrypt all `vault.yml` in a directory with: `ansible-vault encrypt directory/**/vault.yml`
- Decrypt all `vault.yml` in a directory with: `ansible-vault decrypt directory/**/vault.yml`
## Running the playbooks
- Run a playbook with `ansible-playbook --vault-id @prompt playbook.yml`
## The Nine Worlds
The main entrypoint for **The Nine Worlds** is [`main.yml`](main.yml).
### Keyring integration
@@ -38,19 +30,14 @@ The inventory files are split into [`production`](production) and [`testing`](testing).
To run the `main.yml` playbook on production hosts:
``` sh
ansible-playbook main.yml -i inventory/production
```
To run the `main.yml` playbook on testing hosts:
``` sh
ansible-playbook main.yml -i inventory/testing
```
### Playbooks
The Ansible Edda playbook is composed of smaller [`playbooks`](playbooks). To run a single playbook,
@@ -69,156 +56,107 @@ ansible-playbook main.yml --tags "system"
### Roles
Playbooks are composed of roles defined in the `roles` submodule, [`roles`](roles), and the
`playbooks/roles` directory, [`playbooks/roles`](playbooks/roles).
To play a specific role, e.g., `system/base/sshd` in the playbook `system`, run:
``` sh
ansible-playbook playbooks/system.yml --tags "system:base:sshd"
```
To play all roles from a specific group, e.g., `system/base` in the playbook `system`, run:
``` sh
ansible-playbook playbooks/system.yml --tags "system:base"
```
Some roles, e.g., `services/setup/user`, have sub-tasks which can also be invoked individually. To
find the relevant tag, see the role's `main.yml`.
In all cases, the roles can also be invoked from the main playbook:
``` sh
ansible-playbook main.yml --tags "system:base:sshd"
ansible-playbook main.yml --tags "system:base"
```
## Testing virtual machines
The script for starting, stopping, and reverting the testing virtual machines is located at
`scripts/testing/vmgr.py`.
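A hypothetical invocation (the subcommand names are an assumption, not taken from the script;
check its `--help` for the real interface):
``` sh
# Hypothetical subcommands for managing the testing VMs.
python scripts/testing/vmgr.py start
python scripts/testing/vmgr.py revert
python scripts/testing/vmgr.py stop
```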
## Managing backup buckets
The `scripts/restic/restic.py` script provides a wrapper around restic to manage the backup buckets.
The script collects the credentials from the OS keyring and constructs the restic command with the
correct endpoint. It allows the user to focus on the actual command to be executed rather than
authentication and bucket URLs.
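Assuming the wrapper passes the remaining arguments through to `restic` (an assumption; see the
script for its exact interface), usage would look like:
``` sh
# List snapshots in the backup bucket; credentials and the endpoint are
# injected by the wrapper.
python scripts/restic/restic.py snapshots
```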
The `scripts/restic/restic.py` script requires the following entries in the keyring:
- `scaleway`: `access_key` (Scaleway project ID),
- `scaleway`: `secret_key` (Scaleway secret key),
- `restic`: `password`.
The easiest way to set these values is with Python's `keyring.set_password`.
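For example, as one-liners (a minimal sketch; the values are placeholders):
``` sh
# Store the credentials in the OS keyring.
python -c 'import keyring; keyring.set_password("scaleway", "access_key", "<project-id>")'
python -c 'import keyring; keyring.set_password("scaleway", "secret_key", "<secret-key>")'
python -c 'import keyring; keyring.set_password("restic", "password", "<restic-password>")'
```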
## Testing backups
Before testing the backups, you may want to shut `yggdrasil` down for extra confidence that it is
not being accessed/modified during this process. It is easy to access `yggdrasil` by accident if
`/etc/hosts` is not modified in the test VM, something that is easy to forget.
### Baldur on Scaleway
1. Create `baldur` by running:
```sh
python scripts/scaleway/baldur.py create --volume-size <size-in-GB>
```
Pick a volume size that's larger than what `yggdrasil` estimates for
`rpool/var/lib/yggdrasil/data` (see the sketch after this list for checking the estimate).
2. When done, destroy `baldur` by running:
```sh
python scripts/scaleway/baldur.py delete
```
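To check the estimate referenced in step 1, something like the following can be run on
`yggdrasil` beforehand (a sketch; `zfs list -o space` shows the dataset's space accounting):
``` sh
zfs list -o space rpool/var/lib/yggdrasil/data
```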
### Baldur on Yggdrasil
1. Create the zvol `rpool/var/lib/libvirt/images/baldur` for the testing OS (see the sketch after
this list).
2. Create the zvol `hpool/baldur` for the backup data under test. It should have a capacity that's
larger than what `yggdrasil` estimates for `rpool/var/lib/the-nine-worlds/data` (excluding
datasets that are not backed up to the cloud).
3. Set `refreserv=0` on the zvols to make snapshots take less space.
- `zfs set refreserv=0 rpool/var/lib/libvirt/images/baldur`
- `zfs set refreserv=0 hpool/baldur`
4. Install the same OS that is running on `yggdrasil`, but with a DE, on
`rpool/var/lib/libvirt/images/baldur` with `hpool/baldur` mounted within at
`/var/lib/the-nine-worlds/data`.
5. Create non-root user `wojtek` with `sudo` privileges.
6. Configure SSH from the workstation to use `yggdrasil` as a jump server (see the sketch after
this list).
7. Use ZFS for snapshots/rollback of the zvols.
- `zfs snapshot rpool/var/lib/libvirt/images/baldur@start`
- `zfs snapshot hpool/baldur@start`
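A minimal sketch for steps 1, 2, and 6; the sizes, the sparse-volume flag, and the SSH host and
user names are illustrative assumptions, not values prescribed by this repository:
``` sh
# Steps 1-2: create the zvols (sizes are placeholders; size hpool/baldur
# according to the estimate described above).
zfs create -V 50G rpool/var/lib/libvirt/images/baldur
zfs create -s -V 2T hpool/baldur

# Step 6: reach the VM through yggdrasil with a ProxyJump entry.
cat >> ~/.ssh/config <<'EOF'
Host baldur
    ProxyJump yggdrasil
    User wojtek
EOF
```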
### Provision baldur
1. Provision `baldur` by running:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/baldur.yml
```
2. Update `/etc/the-nine-worlds/resolv.conf` to point at a public DNS resolver, e.g., `1.1.1.1`.
Name resolution failures can cause containers to fail.
3. Restore all the backups by ssh'ing into `baldur` and running (as root):
```sh
/usr/local/sbin/restic-batch --config-dir /etc/the-nine-worlds/restic-batch.d restore
```
4. Once restore has completed, `chown -R <user>:<user>` all the restored directories in
`/var/lib/the-nine-worlds/data` (see the sketch after this list). Restic restores the UID
information of the host from which the backup was performed, which may not match that of the new
target machine. Note that permissions and ownership are restored as a second step, once all the
content is restored, so the files will list `root` as the owner while the restoration is in
progress.
5. Start all the pod services with:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_start.yml
```
Give them some time to download all the images and start.
6. Once the CPU returns to idling, check the state of all the pod services and their `veth`
interfaces (see the sketch after this list). If necessary, restart the affected pod; some
containers fail to start if the database takes too long to come online.
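A sketch for steps 4 and 6; the user name and the exact checks are assumptions rather than
commands prescribed by this repository:
``` sh
# Step 4: hand the restored data back to the service user (name assumed).
chown -R wojtek:wojtek /var/lib/the-nine-worlds/data/*

# Step 6: generic checks for the veth interfaces and for failed services.
ip link show type veth
systemctl --failed
```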
### Testing the backups
1. Log into `baldur`. Testing from a VM (as opposed to a regular workstation) is important to
prevent live applications from accidentally connecting to `baldur`.
2. Modify `/etc/hosts` in the VM to point at `rproxy` (e.g., `10.66.3.8`) for all relevant
domains (see the sketch after this list).
3. Test each service manually one by one. Use the Flagfox add-on to verify that you are indeed
connecting to `baldur`.
- Some containers fail to start up if the database takes too long to come online. In that case,
restart the container.
- Some containers fail to start up if they cannot make DNS queries. Note that `192.168.0.0/16` is
blocked by firewall rules. If `/etc/the-nine-worlds/resolv.conf` points at a DNS resolver at
such an address, all DNS queries will fail. Simply update `resolv.conf` to point at, e.g.,
`1.1.1.1`.
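An illustrative way to handle step 2 in the VM (the domains are placeholders for the services'
real domains):
``` sh
# Point the service domains at rproxy.
cat >> /etc/hosts <<'EOF'
10.66.3.8  cloud.example.org
10.66.3.8  git.example.org
EOF
```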
### Cleaning up
1. Stop all the pod services with:
```sh
ansible-playbook --vault-id @vault-keyring-client.py -i inventory/baldur_production playbooks/services_stop.yml
```
2. Delete the VM and the two zvols:
   - `rpool/var/lib/libvirt/images/baldur`,
   - `hpool/baldur`.
## Music organisation
The `playbooks/music.yml` playbook sets up tools and configuration for organising music. The
process itself is manual, though. The steps for adding a new CD are described below. All steps are
to be executed as the `music` user.
### Note on tagging
* For live albums add "YYYY-MM-DD at Venue, City, Country" in the "Subtitle" tag.
* For remasters use original release tags and add "YYYY Remaster" in the "Subtitle" tag.
### Ripping a CD
1. Use a CD ripper and rip the CD to `/var/lib/yggdrasil/home/music/rip` using flac encoding.
2. Samba has been set up to give Windows access to the above directory. Therefore, CD rippers
available only for Windows can also be used, e.g. dBpoweramp.
### Import new music
1. Run `beet import /var/lib/yggdrasil/home/music/rip`. This will move the music files to
`/var/lib/yggdrasil/data/music/collection`.
2. Run `beet convert -a <match>`, where `<match>` is used to narrow down to new music only (see
the example after this list). This will convert the flac files into mp3 files for sharing via
Nextcloud.
3. Run `nextcloud-upload /var/tmp/music/mp3/<artist>` for every artist to upload to Nextcloud.
4. Remove the `/var/tmp/music/mp3/<artist>` directory.
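An illustrative `<match>` for step 2, using beets' query syntax (the album name is a
placeholder):
``` sh
# Convert only the newly imported album.
beet convert -a "album:<album>"
```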
#### Collections
Every track has a `compilation` tag at track-level as well as at album-level (at least in Beets). To
label the album as a compilation for sorting purposes, run `beet modify -a <album> comp=True`.
### Archive music
#### From rip
1. Run `beet --config .config/beets/archive.yaml import --move /var/lib/yggdrasil/home/music/rip`.
This will move the music files to `/var/lib/yggdrasil/data/music/archive`.
#### From collection
1. Run `beet --config .config/beets/archive.yaml import
/var/lib/yggdrasil/data/music/collection/<artist>/<album>`. This will copy the music files to
`/var/lib/yggdrasil/data/music/archive`.
2. Run `beet remove -d -a "album:<album>"`. This will remove the music files from the collection.


@@ -1,146 +0,0 @@
import argparse

import keyring
import requests


class Scaleway:
    """Thin wrapper around the zone-scoped Scaleway Instance API."""

    API_ENDPOINT_BASE = "https://api.scaleway.com/instance/v1/zones"
    ZONES = [
        "fr-par-1", "fr-par-2", "fr-par-3",
        "nl-ams-1", "nl-ams-2",
        "pl-waw-1", "pl-waw-2",
    ]

    def __init__(self, project_id, secret_key):
        self.__zone = None
        self.__project_id = project_id
        self.__headers = {"X-Auth-Token": secret_key}

    @property
    def zone(self):
        return self.__zone

    @zone.setter
    def zone(self, zone):
        if zone not in Scaleway.ZONES:
            raise KeyError(f"{zone} is not a valid zone - must be one of {Scaleway.ZONES}")
        self.__zone = zone

    @property
    def project_id(self):
        return self.__project_id

    def __url(self, item, id):
        # All endpoints are zone-scoped, so a zone must be selected first.
        if self.__zone is None:
            raise RuntimeError("zone must be set before making any API requests")
        url = f"{Scaleway.API_ENDPOINT_BASE}/{self.__zone}"
        if id == "products":
            # Product catalogues live under /products/<item> rather than /<item>/<id>.
            return f"{url}/products/{item}"
        url = f"{url}/{item}"
        if id is not None:
            url = f"{url}/{id}"
        return url

    @staticmethod
    def __check_status(type, url, rsp):
        if (rsp.status_code // 100) != 2:
            raise RuntimeError(
                f"{type} {url} returned with status code {rsp.status_code}: {rsp.json()}")

    def get(self, item, id=None):
        url = self.__url(item, id)
        r = requests.get(url, headers=self.__headers)
        self.__check_status("GET", url, r)
        return r.json()[item]

    def get_by_name(self, item, name):
        # The API identifies resources by UUID; look one up by its name.
        items = self.get(item)
        return next((it for it in items if it["name"] == name), None)

    def __post(self, url, data):
        r = requests.post(url, headers=self.__headers, json=data)
        self.__check_status("POST", url, r)
        return r.json()

    def post(self, item, data):
        return self.__post(self.__url(item, None), data)

    def post_action(self, item, id, action, data):
        return self.__post(f"{self.__url(item, id)}/{action}", data)

    def delete(self, item, id):
        url = self.__url(item, id)
        r = requests.delete(url, headers=self.__headers)
        self.__check_status("DELETE", url, r)


def create_baldur(scaleway, args):
    volume_size = args.volume_size
    security_group = scaleway.get_by_name("security_groups", "baldur-security-group")
    image = scaleway.get_by_name("images", "Debian Bullseye")
    server_type = "PLAY2-PICO"
    if server_type not in scaleway.get("servers", id="products"):
        raise RuntimeError(f"{server_type} is not available in {scaleway.zone}")
    # Reserve a static public IP first so that it can be attached at creation.
    response = scaleway.post("ips", data={"project": scaleway.project_id})
    public_ip = response["ip"]
    baldur = {
        "name": "baldur",
        "dynamic_ip_required": False,
        "commercial_type": server_type,
        "image": image["id"],
        "volumes": {"0": {"size": int(volume_size * 1_000_000_000)}},
        "enable_ipv6": False,
        "public_ip": public_ip["id"],
        "project": scaleway.project_id,
        "security_group": security_group["id"],
    }
    response = scaleway.post("servers", data=baldur)
    server = response["server"]
    scaleway.post_action("servers", server["id"], "action", data={"action": "poweron"})
    print("Baldur instance created:")
    print(f"  block volume size: {server['volumes']['0']['size']//1_000_000_000} GB")
    print(f"  public ip address: {server['public_ip']['address']}")


def delete_baldur(scaleway, _):
    server = scaleway.get_by_name("servers", "baldur")
    if server is None:
        raise RuntimeError(f"Baldur instance was not found in {scaleway.zone}")
    ip = server["public_ip"]
    scaleway.post_action("servers", server["id"], "action", data={"action": "terminate"})
    scaleway.delete("ips", ip["id"])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Create or delete the Baldur instance")
    subparsers = parser.add_subparsers()
    create_parser = subparsers.add_parser("create")
    create_parser.add_argument("--volume-size", type=int, required=True,
                               help="Block volume size (in GB) to create")
    create_parser.set_defaults(func=create_baldur)
    delete_parser = subparsers.add_parser("delete")
    delete_parser.set_defaults(func=delete_baldur)
    args = parser.parse_args()
    # Credentials come from the OS keyring; see the README for the entries.
    scw_project_id = keyring.get_password("scaleway", "project_id")
    scw_secret_key = keyring.get_password("scaleway", "secret_key")
    scaleway = Scaleway(scw_project_id, scw_secret_key)
    scaleway.zone = "fr-par-2"
    args.func(scaleway, args)