Volume Update Merge (#576)
* fixed rule setting for security groups

* fixed bugs caused by multiple networks now being a list.

* trying to figure out why route applying only works once.

* Added more echos for better debugging.

* updated most tests

* fixed validate_configuration.py tests.

* Updated tests for startup.py

* fixed bug in terminate that caused assume_yes to work as assume_no

* updated terminate_cluster tests.

* fixed formatting improved pylint

* adapted tests

* updated return threading test

* updated provider_handler

* tests not finished yet

* Fixed server regex issue

* test list clusters updated

* fixed too open cluster_id regex

* added missing "to"

* fixed id_generation tests

* renamed configuration handler to please linter

* removed unnecessary tests and updated remaining

* fixed remaining "subnet list gets handled as a single subnet" bug and finalized multiple routes handling.

* updated tests not finished yet

* improved code style

* fixed tests further. One to fix left.

* fixed additional tests

* fixed all tests for ansible configurator

* fixed comment

* fixed multiple tests

* fixed a few tests

* Fixed create

* fixed some issues regarding

* fixing test_provider.py

* removed infrastructure_cloud.yml

* minor fixes

* fixed all tests

* removed print

* changed prints to log

* removed log

* fixed None bug where [] is expected when no sshPublicKeyFile is given.

* removed master from compute if use master as compute is false

* restructured role additional in order to make it easier to include. Added quotes for consistency.

* Updated all tests (#448)

* updated most tests

* fixed validate_configuration.py tests.

* Updated tests for startup.py

* fixed bug in terminate that caused assume_yes to work as assume_no

* updated terminate_cluster tests.

* fixed formatting improved pylint

* adapted tests

* updated return threading test

* updated provider_handler

* tests not finished yet

* Fixed server regex issue

* test list clusters updated

* fixed too open cluster_id regex

* added missing "to"

* fixed id_generation tests

* renamed configuration handler to please linter

* removed unnecessary tests and updated remaining

* updated tests not finished yet

* improved code style

* fixed tests further. One to fix left.

* fixed additional tests

* fixed all tests for ansible configurator

* fixed comment

* fixed multiple tests

* fixed a few tests

* Fixed create

* fixed some issues regarding

* fixing test_provider.py

* removed infrastructure_cloud.yml

* minor fixes

* fixed all tests

* removed print

* changed prints to log

* removed log

* Introduced yaml lock (#464)

* removed unnecessary close

* simplified update_hosts

* updated logging to separate folder and file based on creation date

* many small changes and introducing locks

* restructured log files again. Removed outdated key warnings from bibigrid.yml

* added a few logs

* further improved logging hierarchy

* Added specific folder places for temporary job storage. This might solve the "SlurmSpoolDir full" bug.

* Improved logging

* Tried to fix temps and tried update to 23.11 but has errors so commented that part out

* added initial space

* added existing worker deletion on worker startup if worker already exists as no worker would've been started if Slurm would've known about the existing worker. This is not the best solution. (#468)

* made waitForServices a cloud specific key (#465)

* Improved log messages in validate_configuration.py to make fixing your configuration easier when using a hybrid-/multi-cloud setup (#466)

* removed unnecessary line in provider.py and added cloud information to every log in validate_configuration.py for easier fixing.

* track resources for providers separately to make quota checking precise

* switched from low level cinder to high level block_storage.get_limits()

* added keyword for ssh_timeout and improved argument passing for ssh.

* Update issue templates

* fixed a missing LOG

* removed overwritten variable instantiation

* Update bug_report.md

* removed trailing whitespaces

* added comment about sshTimeout key

* Create dependabot.yml (#479)

* Code cleanup and minor improvement (#482)

* fixed :param and :return to @param and @return

* many spelling mistakes fixed

* added bibigrid_version to common configuration

* added timeout to common_configuration

* removed debug verbosity and improved log message wording

* fixed is_active structure

* fixed pip dependabot.yml

* added documentation. Changed timeout to 2**(2+attempts) to decrease number of unlikely to work attempts

* 474 allow non on demand/permanent workers (#487)

* added worker server start without anything else

* added host entry for permanent workers

* added state unknown for permanent nodes

* added on_demand key for groups and instances for ansible templating

* fixed wording

* temporary solution for custom execute list

* added documentation for onDemand

* added ansible.cfg replacement

* fixed path. Added ansible.cfg to the gitignore

* updated default creation and gitignore. Fixed non-vital bug that didn't reset hosts for new cluster start.

* Code cleanup (#490)

* fixed :param and :return to @param and @return

* many spelling mistakes fixed

* added bibigrid_version to common configuration

* attempted zabbix linting fix. Needs testing.

* fixed double import

* Slurm upgrade fixes (#473)

* removed slurm errors

* added bibilog to show output log of most recent worker start. Tried fixing the slurm23.11 bug.

* fixed a few vpnwkr -> vpngtw remnants. Excluded vpngtw from slurm setup

* improved comments regarding changes and versions

* removed cgroupautomount as it is defunct

* Moved explicit slurm start to avoid errors caused by resume and suspend programs not being copied to their final location yet

* added word for clarification

* Fixed non-fatal bug that led to non-zero exits on runs without any error.

* changed slurm apt package to slurm-bibigrid

* set version to 23.11.*

* added a few more checks to make sure everything is set up before installing packages

* Added configuration pinning

* changed ignore_error to failed_when false

* fixed or ignored lint fatals

* Update tests (#493)

* updated tests

* removed print

* updated tests

* updated tests

* fixed too loose condition

* updated tests

* added cloudScheduling and userRoles in bibigrid.yml

* added userRoles in documentation

* added varsFiles and comments

* added folder path in documentation

* fixed naming

* added that vars are optional

* polished userRoles documentation

* 439 additional ansible roles (#495)

* added roles structure

* updated roles_path

* fixed upper lower case

* improved customRole implementation

* minor fixes regarding role_paths

* improved variable naming of user_roles

* added documentation for other configurations

* added new feature keys

* fixed template files not being j2

* added helpful comments and removed no longer used roles/additional/

* userRoles crashes if no role set

* fixed ansible.cfg path '"'

* implemented partition system

* added keys customAnsibleCfg and customSlurmConf as keys that stop the automatic copying

* improved spacing

* added logging

* updated documentation

* updated tests. Improved formatting

* fix for service being too fast for startup

* fixed remote src

* changed RESUME to POWER_DOWN and removed delete call which is now handled via Slurm that calls terminate.sh (#503)

* Update check (#499)

* updated validate_configuration.py in order to provide schema validation. Moved cloud_identifier setting even closer to program start in order to be able to log better when performing other actions than create.

* small log change and fix of schema key vpnInstance

* updated tests

* removed no longer relevant test

* added schema validation tests

* fixed ftype. Errors with multiple volumes.

* made automount bound to defined mountPoints and therefore customizable

* added empty line and updated bibigrid.yml

* fixed nfsshare regex error and updated check to fit to the new name mountpoint pattern

* hotfix: folder creation now before accessing hosts.yml

* fixed tests

* moved dnsmasq installation in front of /etc/resolv removal

* fixed tests

* fixed nfs exports by removing unnecessary "/" at the beginning

* fixed master running slurmd but not being listed in slurm.conf. Now set to drained.

* improved logging

* increased timeout. Corrected comment in slurm.j2

* updated info regarding timeouts (changed from 4 to 5).

* added SuspendTimeout as optional to elastic_scheduling

* updated documentation

* permission fix

* fixes #394

* fixes #394 (also for hybrid cluster)

* increased ResumeTimeout by 5 minutes. yml to yaml

* changed all yml to yaml (as preferred by yaml)

* updated timeouts. updated tests

* fixes #394 - remove host from zabbix when terminated

* zabbix api no longer used when not set in configuration

* pleased linting by using false instead of no

* added logging of traceroute even if debug flag is not set when error is not known. Added a few other logs

* Update action 515 (#516)

* configuration update possible 515

* added experimental

* fixed indentation

* fixed missing newline at EOF. Summarized restarts.

* added check for running workers

* fixed multiple workers due to faulty update

* updated tests and removed done todos

* updated documentation

* removed print

* Added apt-reactivate-auto-update to reactivate updates at the end of the playbook run (#518)

* changed theia to 900. Added apt-reactivate-auto-update as new 999.

* added new line at end of file

* changed list representation

* added multiple configuration keys for boot volume handling

* updated documentation

* updated documentation for new volumes and for usually ignored keys

* updated and added tests

* Pleasing Dependabot

* Linting now uses python 3.10

* added early termination when configuration file not found

* added dontUploadCredentials documentation

* fixed broken links

* added dontUploadCredentials to schema validation

* fixed dontUploadCredentials ansible start bug

* prevented BiBiGrid from looking for other keys if created key doesn't work to spot key issues earlier

* prevented BiBiGrid from looking for other keys if created key doesn't work to spot key issues earlier

* updated requirements.txt

* restricted clouds.yaml access

* moved openstack credentials permission change to server only

* added '' to 3.10

* converted implicit to explicit octet notation

* added "" and fixed a few more implicit octets

* added ""

* added missing "

* added allow_agent=False to further prevent BiBiGrid from looking for keys

* removed hardcoded /vol/

* updated versions

* removed unnecessary comments and commented out Workflow execution

* 545 allow attached volumes (#562)

* renamed volumeSize to bootVolumeSize to avoid name issues

* added implementation for adding volumes to permanent workers (they are not deleted)

* implemented creating and terminating volumes without filesystem for permanent workers

* fully working for permanent workers. masterMount is broken now, but also replaced. Will be fixed.

* Added volume creation to create_server. Not yet working.

* hostvar for each host

* Fixed information handling and naming issues

* fixed host yaml creation

* removed unnecessary prints

* improved readability fixed minor bugs

* added volume deletion and set volume_api_version explicitly

* removed prints from test_provider.py

* improved readability greatly. Fixed overwriting host vars bug

* snapshot and existing volumes can now be attached to master and workers on startup

* snapshot and existing volumes can now be attached to master and workers on startup

* removed mountPoint from a log message in case no mount point is specified

* fixed lsblk not finding item.device due to race condition

* improved comments and naming

* removed server automount. This is now handled by a single automount task for both master and workers

* now allows starting new permanent volumes if a name is given. One could consider adding tmp to unnamed volumes for additional clarity

* fixed wrong function call

* renamed nfs_mount to nfs_shares

* added semipermanent as an option

* fixed wrong method of default values for Ansible

* started reworking

* added volumes and changed bootVolumes

* updated bibigrid.yaml and aligned naming of bootVolume and volume

* added newline at end of file

* removed superfluous provider parameter

* pleased linting

* removed argument from function call

* moved host vars creation, vars deletion, added comments

* largely reworked how volumes are attached to servers to be more explicit

* small naming fixes

* updated priority order of permanent and semiPermanent. Updated documentation to new explicit bool setup. Added type as a key.

* fixed bug regarding dontUploadCredentials

* updated schema validation

* Update linting.yml

* Update linting.yml

* Update linting.yml

* Update linting.yml

* added __init__.py where appropriate

* update bibigrid.yaml for more explicit volumes documentation

* volumes are now validated and fixed old state of masterInstance in validate_schema.py

* Update linting.yml

* Update linting.yml

* fixed longtime naming bug for unknown openstack exceptions

* saves more info in .mem file

* moved structure of tests and added a basic integration_test file that needs to be expanded and improved

* moved tests

* added "not ready yet"

* updated bootVolume documentation

* moved tests added __init__.py files for better discovery. minor fixes

* updated tests and comments

* updated tests and comments

* updated tests, code and comments for ansible_configuration

* updated tests for ansible_configurator

* fixed test_ansible_configurator.py

* fixed test_configuration_handler.py

* improved exception messages

* pleased ansible linter

* fixed terminate return values test

* improved naming

* added tests to make sure that server regex only deletes bibigrid servers with fitting cluster id and same for volumes

* pleased pylint

* fixed validation issue when using exists in master

* removed forgotten print

* fixed description bug

* final bugfixes

* pleased linter

* fixed too many positional arguments

---------

Co-authored-by: Jan Krueger <[email protected]>
XaverStiensmeier and jkrue authored Dec 2, 2024
1 parent 06d08e2 commit a6d6db7
Showing 53 changed files with 1,507 additions and 712 deletions.
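
The commit message mentions that the SSH retry timeout was changed to `2**(2+attempts)` to cut down on attempts that are unlikely to succeed. A minimal sketch of what that schedule looks like follows. The function name `ssh_retry_delays` is hypothetical, and the zero-based attempt counter is an assumption; the actual BiBiGrid implementation may count or apply the delay differently.

```python
def ssh_retry_delays(max_attempts: int) -> list[int]:
    """Return the wait in seconds before each retry, following 2**(2+attempt).

    Assumption: attempts are counted from 0, so the first wait is 2**2 = 4 seconds.
    """
    return [2 ** (2 + attempt) for attempt in range(max_attempts)]


if __name__ == "__main__":
    # With five attempts (the commented-out sshTimeout value shown in bibigrid.yaml),
    # the waits double each time instead of staying constant.
    print(ssh_retry_delays(5))
```

Because the wait doubles on every attempt, a small number of attempts already covers a long startup window, which fits the stated goal of reducing the number of attempts.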
6 changes: 3 additions & 3 deletions .github/workflows/linting.yml
@@ -5,10 +5,10 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
-      - name: Set up Python 3.10
+      - name: Set up Python 3.12.3
         uses: actions/setup-python@v4
         with:
-          python-version: '3.10'
+          python-version: '3.12.3'
       - name: Install dependencies
         run: |
           python -m pip install --upgrade pip
@@ -17,4 +17,4 @@ jobs:
       - name: ansible_lint
         run: ansible-lint resources/playbook/roles/bibigrid/tasks/main.yaml
       - name: pylint_lint
-        run: pylint bibigrid
+        run: pylint bibigrid
4 changes: 2 additions & 2 deletions .pylintrc
@@ -562,8 +562,8 @@ min-public-methods=2
 [EXCEPTIONS]

 # Exceptions that will emit a warning when caught.
-overgeneral-exceptions=BaseException,
-                       Exception
+overgeneral-exceptions=builtins.BaseException,
+                       builtins.Exception


 [STRING]
159 changes: 90 additions & 69 deletions bibigrid.yaml
@@ -1,105 +1,126 @@
# See https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/ (after update)
# See https://github.com/BiBiServ/bibigrid/blob/master/documentation/markdown/features/configuration.md
# First configuration also holds general cluster information and must include the master.
# All other configurations mustn't include another master, but exactly one vpngtw instead (keys like master).
# For an easy introduction see https://github.com/deNBI/bibigrid_clum
# For more detailed information see https://github.com/BiBiServ/bibigrid/blob/master/documentation/markdown/features/configuration.md

- infrastructure: openstack # former mode. Describes what cloud provider is used (others are not implemented yet)
cloud: openstack # name of clouds.yaml cloud-specification key (which is value to top level key clouds)
- # -- BEGIN: GENERAL CLUSTER INFORMATION --
# The following options configure cluster wide keys
# Modify these according to your requirements

# -- BEGIN: GENERAL CLUSTER INFORMATION --
# sshTimeout: 5 # number of attempts to connect to instances during startup with delay in between
# cloudScheduling:
# sshTimeout: 5 # like sshTimeout but during the on demand scheduling on the running cluster

## sshPublicKeyFiles listed here will be added to access the cluster. A temporary key is created by bibigrid itself.
#sshPublicKeyFiles:
# - [public key one]
## sshPublicKeyFiles listed here will be added to the master's authorized_keys. A temporary key is stored at ~/.config/bibigrid/keys
# sshPublicKeyFiles:
# - [public key one]

## Volumes and snapshots that will be mounted to master
#masterMounts: (optional) # WARNING: will overwrite unidentified filesystems
# - name: [volume name]
# mountPoint: [where to mount to] # (optional)
# masterMounts: DEPRECATED -- see `volumes` key for each instance instead

#nfsShares: /vol/spool/ is automatically created as a nfs
# - [nfsShare one]
# nfsShares: # list of nfs shares. /vol/spool/ is automatically created as an nfs if nfs is true
# - [nfsShare one]

# userRoles: # see ansible_hosts for all options
## Ansible Related
# userRoles: # see ansible_hosts for all 'hosts' options
# - hosts:
# - "master"
# roles: # roles placed in resources/playbook/roles_user
# - name: "resistance_nextflow"
# varsFiles: # (optional)
# - [...]

## Uncomment if you don't want to assign a public ip to the master; for internal cluster (Tuebingen).
## If you use a gateway or start a cluster from the cloud, your master does not need a public ip.
# useMasterWithPublicIp: False # defaults True if False no public-ip (floating-ip) will be allocated
# gateway: # if you want to use a gateway for create.
# ip: # IP of gateway to use
# portFunction: 30000 + oct4 # variables are called: oct1.oct2.oct3.oct4

# deleteTmpKeypairAfter: False
# dontUploadCredentials: False
## Only relevant for specific projects (e.g. SimpleVM)
# deleteTmpKeypairAfter: False # warning: if you don't pass a key via sshPublicKeyFiles you lose access!
# dontUploadCredentials: False # warning: enabling this prevents you from scheduling on demand!

## Additional Software
# zabbix: False
# nfs: False
# ide: False # installs a web ide on the master node. A nice way to view your cluster (like Visual Studio Code)

### Slurm Related
# elastic_scheduling: # for large or slow clusters increasing these timeouts might be necessary to avoid failures
# SuspendTimeout: 60 # after SuspendTimeout seconds, slurm allows to power up the node again
# ResumeTimeout: 1200 # if a node doesn't start in ResumeTimeout seconds, the start is considered failed.

# Other keys - these are default False
# Usually Ignored
##localFS: True
##localDNSlookup: True
# cloudScheduling:
# sshTimeout: 5 # like sshTimeout but during the on demand scheduling on the running cluster

#zabbix: True
#nfs: True
#ide: True # A nice way to view your cluster as if you were using Visual Studio Code
# useMasterAsCompute: True

useMasterAsCompute: True
# -- END: GENERAL CLUSTER INFORMATION --

# bootFromVolume: False
# terminateBootVolume: True
# volumeSize: 50
# waitForServices: # existing service name that runs after an instance is launched. BiBiGrid's playbook will wait until service is "stopped" to avoid issues
# -- BEGIN: MASTER CLOUD INFORMATION --
infrastructure: openstack # former mode. Describes what cloud provider is used (others are not implemented yet)
cloud: openstack # name of clouds.yaml cloud-specification key (which is value to top level key clouds)

# waitForServices: # list of existing service names that affect apt. BiBiGrid's playbook will wait until service is "stopped" to avoid issues
# - de.NBI_Bielefeld_environment.service # uncomment for cloud site Bielefeld

# master configuration
## master configuration
masterInstance:
type: # existing type/flavor on your cloud. See launch instance>flavor for options
image: # existing active image on your cloud. Consider using regex to prevent image updates from breaking your running cluster
type: # existing type/flavor from your cloud. See launch instance>flavor for options
image: # existing active image from your cloud. Consider using regex to prevent image updates from breaking your running cluster
# features: # list
# - feature1
# partitions: # list
# bootVolume: None
# bootFromVolume: True
# terminateBootVolume: True
# volumeSize: 50

# -- END: GENERAL CLUSTER INFORMATION --
# - partition1
# bootVolume: # optional
# name: # optional; if you want to boot from a specific volume
# terminate: True # whether the volume is terminated on server termination
# size: 50
# volumes: # optional
# - name: volumeName # empty for temporary volumes
# snapshot: snapshotName # optional; to create volume from a snapshot
# mountPoint: /vol/mountPath
# size: 50
# fstype: ext4 # must support chown
# type: # storage type; available values depend on your location; for Bielefeld CEPH_HDD, CEPH_NVME
## Select up to one of the following options; otherwise temporary is picked
# exists: False # if True looks for existing volume with exact name. count must be 1. Volume is never deleted.
# permanent: False # if True volume is never deleted; overwrites semiPermanent if both are given
# semiPermanent: False # if True volume is only deleted during cluster termination

# fallbackOnOtherImage: False # if True, most similar image by name will be picked. A regex can also be given instead.

# worker configuration
## worker configuration
# workerInstances:
# - type: # existing type/flavor on your cloud. See launch instance>flavor for options
# - type: # existing type/flavor from your cloud. See launch instance>flavor for options
# image: # same as master. Consider using regex to prevent image updates from breaking your running cluster
# count: # any number of workers you would like to create with set type, image combination
# count: 1 # number of workers you would like to create with set type, image combination
# # features: # list
# # partitions: # list
# # bootVolume: None
# # bootFromVolume: True
# # terminateBootVolume: True
# # volumeSize: 50

# Depends on cloud image
sshUser: # for example ubuntu

# Depends on cloud site and project
subnet: # existing subnet on your cloud. See https://openstack.cebitec.uni-bielefeld.de/project/networks/
# or network:

# Uncomment if no full DNS service for started instances is available.
# Currently, the case in Berlin, DKFZ, Heidelberg and Tuebingen.
#localDNSLookup: True

#features: # list

# elastic_scheduling: # for large or slow clusters increasing these timeouts might be necessary to avoid failures
# SuspendTimeout: 60 # after SuspendTimeout seconds, slurm allows to power up the node again
# ResumeTimeout: 1200 # if a node doesn't start in ResumeTimeout seconds, the start is considered failed.
# # partitions: # list of slurm features that all nodes of this group have
# # bootVolume: # optional
# # name: # optional; if you want to boot from a specific volume
# # terminate: True # whether the volume is terminated on server termination
# # size: 50
# # volumes: # optional
# # - name: volumeName # optional
# # snapshot: snapshotName # optional; to create volume from a snapshot
# # mountPoint: /vol/mountPath # optional; not mounted if no path is given
# # size: 50
# # fstype: ext4 # must support chown
# # type: # storage type; available values depend on your location; for Bielefeld CEPH_HDD, CEPH_NVME
# ## Select up to one of the following options; otherwise temporary is picked
# # exists: False # if True looks for existing volume with exact name. count must be 1. Volume is never deleted.
# # permanent: False # if True volume is never deleted; overwrites semiPermanent if both are given
# # semiPermanent: False # if True volume is only deleted during cluster termination

# Depends on image
sshUser: # for example 'ubuntu'

# Depends on project
subnet: # existing subnet from your cloud. See https://openstack.cebitec.uni-bielefeld.de/project/networks/
# network: # only if no subnet is given

# features: # list of slurm features that all nodes of this cloud have
# - feature1

# bootVolume: # optional (cloud wide)
# name: # optional; if you want to boot from a specific volume
# terminate: True # whether the volume is terminated on server termination
# size: 50

#- [next configurations]
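
The `volumes` comments in the bibigrid.yaml diff above describe three lifecycle flags (`exists`, `permanent`, `semiPermanent`) with a temporary fallback, and state that `permanent` overrides `semiPermanent` when both are given. The following is a minimal sketch of that resolution order, not BiBiGrid's actual code. The function `volume_lifecycle` is hypothetical, and checking `exists` before the other flags is an assumption, since the configuration only says to select up to one option.

```python
def volume_lifecycle(volume: dict) -> str:
    """Resolve which lifecycle a volume entry from bibigrid.yaml ends up with.

    Documented in the config comments: permanent overrides semiPermanent,
    and temporary is picked when no flag is set.
    Assumption: exists is checked first (the config does not state its priority).
    """
    if volume.get("exists"):
        return "exists"         # existing volume is looked up by exact name, never deleted
    if volume.get("permanent"):
        return "permanent"      # never deleted
    if volume.get("semiPermanent"):
        return "semiPermanent"  # only deleted during cluster termination
    return "temporary"          # fallback when no flag is set


if __name__ == "__main__":
    # Both flags set: permanent wins, as the config comment states.
    print(volume_lifecycle({"name": "volumeName", "permanent": True, "semiPermanent": True}))
    # No flag set: temporary is picked.
    print(volume_lifecycle({"mountPoint": "/vol/mountPath", "size": 50}))
```

Resolving the flags in one place keeps the "up to one option" rule explicit and makes the fallback to temporary visible rather than implicit.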
Empty file added bibigrid/__init__.py
Empty file.
Empty file added bibigrid/core/__init__.py
Empty file.
