Fault tolerance of NixOS configuration switching #52

Open
opened 2023-12-04 11:56:44 +02:00 by alexoundos · 0 comments
Collaborator

test suites

general stock configuration test

  1. [https] test basic connectivity (DNS and API responsiveness)
  2. test all enabled services
    1. [https] check systemd services "active" status for each
    2. test functionality
  3. test basic penetration protection
    1. no excess open ports
    2. TBD
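
The "no excess open ports" check above could be sketched as follows. This is a minimal illustration, not the project's test code: the allowlist `{22, 80, 443}`, the host name, and the helper names are all hypothetical.

```python
import socket
from typing import Iterable, List, Set


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat the port as closed.
        return False


def scan(host: str, candidates: Iterable[int]) -> List[int]:
    """Probe each candidate port and return the ones that accept connections."""
    return [port for port in candidates if is_port_open(host, port)]


def excess_ports(observed: Iterable[int], allowed: Set[int]) -> List[int]:
    """Return observed open ports that are not on the allowlist, sorted."""
    return sorted(set(observed) - allowed)
```

A test would then assert that `excess_ports(scan("server.example.org", range(1, 1025)), {22, 80, 443})` is empty; which ports are actually allowed depends on the enabled services.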

rollback status check after a rollback

  1. [ssh] check that current generation is 1 less than before the rollback
  2. [ssh] check that the versions of all available services correspond to the current generation
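
One way to implement the generation check is to compare the target of the system profile symlink before and after the rollback. The sketch below assumes the link target is read over SSH (for example with `readlink /nix/var/nix/profiles/system`) and has the usual `system-N-link` form; the function names are hypothetical.

```python
import re
from typing import Optional

# Matches profile link names such as "system-42-link".
_LINK_RE = re.compile(r"system-(\d+)-link$")


def generation_from_link(target: str) -> Optional[int]:
    """Extract the generation number from a profile link target, if present."""
    match = _LINK_RE.search(target)
    return int(match.group(1)) if match else None


def rolled_back_by_one(before: str, after: str) -> bool:
    """Return True if `after` is exactly one generation below `before`."""
    gen_before = generation_from_link(before)
    gen_after = generation_from_link(after)
    if gen_before is None or gen_after is None:
        return False
    return gen_after == gen_before - 1
```

Note that "1 less" only holds if no intermediate generation was deleted between the two readings.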

nixos-rebuild fault tolerance tests

The activation timeouts below are expected to be expressed as the systemd unit options JobTimeoutAction/JobRunningTimeoutSec for the nixos-rebuild services.

runSystemRebuild => Nix build error/timeout => report to the user

  1. [https] introduce a nonexistent path or an infinite computation by adding a trojan SP module
  2. [https] runSystemRebuild => expect nothing but a failure report
  3. [https] check report from API
  4. general test
  5. [https] rebootSystem
  6. [ssh] compare contents of /etc/nixos and /etc/nixos/selfprivacy/nixos-config-source

runSystemRebuild => bootloader update error => report to the user

  1. [ssh] prevent /boot/grub/grub.cfg from updating (how?)
  2. [https] runSystemRebuild => expect nothing but a failure report
  3. [https] check report from API

new generation activation failure caused by a random service when API is not updated => rollback

  1. [https] break random systemd service activation, adding a trojan SP module
  2. [https] runSystemRebuild => expect automatic rollback
  3. [https] check rollback status
  4. [https] check report from API
  5. general test
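
Rollback scenarios like this one require the test to wait until the system has settled before checking the rollback status and the report. A possible polling helper is sketched below; the name `wait_until` and its parameters are hypothetical, and treating an exception in the check as "not ready yet" is an assumption that suits an API which may be briefly unreachable during activation.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            # The service may be restarting; count this as "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

The check is always attempted at least once, so a timeout of zero still performs a single probe.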

new generation activation timeout caused by a random service when API is not updated => rollback

  1. [https] halt random systemd service activation, adding a trojan SP module
  2. [https] runSystemRebuild => expect automatic rollback
  3. [https] check rollback status
  4. [https] check report from API
  5. general test

new generation activation failure caused by a random service when API is updated => rollback

  1. [https] break random systemd service activation, adding a trojan SP module
  2. [https] runSystemUpgrade with upgrade URL with new API => expect automatic rollback
  3. [https] check rollback status (including checking that API version actually downgraded)
  4. [https] check report from API
  5. general test

new generation activation timeout caused by a random service when API is updated => rollback

  1. [https] halt random systemd service activation using a trojan SP module
  2. [https] runSystemUpgrade with upgrade URL with new API => expect automatic rollback
  3. [https] check rollback status (including checking that API version actually downgraded)
  4. [https] check report from API
  5. general test

new generation activation failure caused by API service failure => rollback

  1. [https] runSystemUpgrade with upgrade URL with new API systemd service broken => expect automatic rollback
  2. [https] check rollback status
  3. [https] check report from API
  4. general test

new generation activation timeout caused by API service => rollback

  1. [https] runSystemUpgrade with upgrade URL with new API systemd service halted => expect automatic rollback
  2. [https] check rollback status
  3. [https] check report from API
  4. general test

new generation activation failure caused by API selftest failure => rollback

  1. [https] runSystemUpgrade with upgrade URL with new API not listening => expect automatic rollback
  2. [https] check rollback status (including checking that API version actually downgraded)
  3. [https] check report from API
  4. general test

test runSystemRollback

  1. [https] runSystemRollback
  2. check rollback status
  3. [https] rebootSystem
  4. check rollback status
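
The [https] steps call API mutations such as `runSystemRollback` and `rebootSystem`. The sketch below only builds such a request without sending it. The endpoint URL, the bearer-token header, and the selected fields (`success`, `message`, `code`) are assumptions for illustration and may not match the actual API schema.

```python
import json
import urllib.request


def build_mutation_request(api_url: str, token: str, mutation_name: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request for a no-argument mutation."""
    # Hypothetical selection set; adjust to the real schema.
    query = "mutation { %s { success message code } }" % mutation_name
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
    )
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON response.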

API fault tolerance test: random crash => reload

  1. [https] runSystemUpgrade with upgrade URL with new API crashing every 10th minute => expect automatic reload
  2. [https] check report from API
  3. general test

test runSystemUpgrade to a major NixOS release with new API

  1. [https] runSystemUpgrade with specific upgrade URL
  2. [https] check that the API version has increased
  3. general test
  4. [https] rebootSystem
  5. [ssh] check that current generation is the new one
  6. general test

services longevity test

  1. general test
  2. sleep 10 minutes
  3. [https] runSystemUpgrade to a configuration with new API
  4. general test

test automatic updates?

TBD

top-level test suite

  1. nixos-infect with default configuration
  2. general test
  3. test runSystemUpgrade to a major NixOS release with new API
  4. test runSystemRollback
  5. nixos-rebuild fault tolerance tests
  6. [https] enable all services via API GraphQL
  7. services longevity test
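
The top-level suite is an ordered list of steps. A minimal runner could look like the sketch below; stopping at the first failure and marking later steps as skipped is an assumed design choice, not something this plan specifies, and the step names are placeholders.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_suite(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; after the first failure, mark the rest as skipped."""
    results: List[Tuple[str, str]] = []
    failed = False
    for name, step in steps:
        if failed:
            results.append((name, "skipped"))
            continue
        try:
            ok = bool(step())
        except Exception:
            # An unexpected error in a step counts as a failure of that step.
            ok = False
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results
```

Stopping early keeps later steps from running against a server left in an unknown state by a failed fault-tolerance test.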
alexoundos self-assigned this 2023-12-04 11:56:44 +02:00
Reference: SelfPrivacy/selfprivacy-nixos-config#52