Add validation check for out-of-range nodeport values #706

Open · skirui-source wants to merge 16 commits into main
Conversation

skirui-source (Collaborator)

Fixes #536

skirui-source marked this pull request as ready for review on May 3, 2023 00:44
@jacobtomlinson (Member) left a comment

Thanks @skirui-source. I don't think we need the outer try/except as it just re-raises the same exception. We can probably just let that bubble up to the user.

Could you please add a test to verify this functionality?
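For illustration, the pattern being flagged looks roughly like this (hypothetical sketch; the surrounding handler code isn't shown in this thread, and validate_ports is an illustrative name):

# The except clause only re-raises, so it adds nothing:
try:
    validate_ports(spec)  # hypothetical helper that raises ValueError
except ValueError:
    raise  # re-raised unchanged

# Equivalent, letting the exception bubble up to the user:
validate_ports(spec)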

dask_kubernetes/operator/controller/controller.py
Comment on lines 92 to 96
for port in spec.get("ports", []):
if port.get("nodePort", None) and (
port["nodePort"] < 30000 or port["nodePort"] > 32767
):
raise ValueError("NodePort out of range")
@jacobtomlinson (Member) commented on May 3, 2023

Thinking about this more deeply, I don't think we want to check this at this point. The custom resource has already been created, and we end up in an unschedulable state because the child resources cannot be created.

I think we want to add a validation hook to check this at resource creation time.

https://kopf.readthedocs.io/en/stable/admission/#validation-handlers

# Something like
@kopf.on.validate("daskcluster.kubernetes.dask.org")
async def daskcluster_validate_nodeport(spec, warnings, **kwargs):
    for port in spec.get("ports", []):
        if port.get("nodePort", None) and (
            port["nodePort"] < 30000 or port["nodePort"] > 32767
        ):
            raise kopf.AdmissionError("nodePort must be between 30000 and 32767.")
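
For the test, one option is to call the handler directly, so no admission server is needed. A sketch, assuming pytest-asyncio and the handler defined as above:

import pytest
import kopf

@pytest.mark.asyncio
async def test_nodeport_out_of_range_rejected():
    # Invoke the validation handler directly with an out-of-range
    # nodePort; a direct call does not require an admission server.
    spec = {"ports": [{"name": "http", "nodePort": 40000}]}
    with pytest.raises(kopf.AdmissionError):
        await daskcluster_validate_nodeport(spec=spec, warnings=[])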

@skirui-source (Collaborator, Author) commented on May 4, 2023

Not sure why the CI tests are getting cancelled, but I'm seeing this error locally and I'm unsure how to resolve it:

================================================================================== short test summary info ==================================================================================
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_operator_runs - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_operator_plugins - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_scalesimplecluster - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_simplecluster - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_job - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_failed_job - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_nodeport_valid - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_nodeport_out_of_range - Exception: Admission handlers exist, but no admission server/tunnel is configured in `settings.admission.server`. More: https://kopf.readthedocs.io/en/stable/admission/

@skirui-source (Collaborator, Author) commented on May 4, 2023

> Not sure why the CI tests are getting cancelled, but I'm seeing this error locally and I'm unsure how to resolve it:

In response to this, I followed these instructions to configure the admission tunnel/server by adding this line here:

  • settings.admission.server = kopf.WebhookServer()  # defaults to localhost, port 9443
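
For context, that line goes inside a kopf startup handler; a minimal sketch (the handler name is illustrative):

import kopf

@kopf.on.startup()
async def configure(settings: kopf.OperatorSettings, **kwargs):
    # Serve admission webhooks from a local HTTPS endpoint
    # (defaults to localhost, port 9443).
    settings.admission.server = kopf.WebhookServer()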

Then I re-ran the tests, but they are still failing:

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================================== short test summary info ==================================================================================
FAILED dask_kubernetes/operator/controller/tests/test_controller.py::test_nodeport_valid - asyncio.exceptions.TimeoutError: Dask Cluster resource not actioned after 60 seconds, is the Dask Operator running?

even though I can confirm that the Dask operator is running:

(kind-k8) skirui@dt08:~/dev/dask-kubernetes$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
dask-kubernetes-operator-1683188683-7d754f579d-7ztxs   1/1     Running   0          3h57m

@jacobtomlinson (Member)

Tests get cancelled if they hang for more than 45 mins.

It looks like you're on the right track with configuring the tunnel/server. I must admit this is more complicated than I expected. When the tests run, the controller runs outside of the Kubernetes cluster, so the Kubernetes API will not be able to communicate with the webhook. It looks like they have docs on using minikube or k3s, but not kind. It might be best to leave this one with me for now and switch to another issue; I'll have a think about this.

Also, you shouldn't have a controller pod running like that. The tests will start/stop controllers, so if you have a "production" one running too it's likely you will get conflicts. Do you have the operator helm chart installed? If so, you need to uninstall it for development.
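
For reference, the kopf admission docs describe a few server/tunnel helpers; a sketch of the options (assuming a recent kopf release; helper names are from those docs):

import kopf

@kopf.on.startup()
async def configure(settings: kopf.OperatorSettings, **kwargs):
    # Plain local HTTPS server; works only if the cluster's API server
    # can reach the host running the controller, which is not the case
    # here with the controller running outside a kind cluster.
    settings.admission.server = kopf.WebhookServer()
    # Minikube-specific helper (no kind-specific equivalent is documented):
    # settings.admission.server = kopf.WebhookMinikubeServer()
    # Or a tunnel that auto-exposes the local endpoint to the cluster:
    # settings.admission.server = kopf.WebhookAutoTunnel()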

@skirui-source (Collaborator, Author)

> Tests get cancelled if they hang for more than 45 mins.
>
> It looks like you're on the right track with configuring the tunnel/server. I must admit this is more complicated than I expected. When the tests run, the controller runs outside of the Kubernetes cluster, so the Kubernetes API will not be able to communicate with the webhook. It looks like they have docs on using minikube or k3s, but not kind. It might be best to leave this one with me for now and switch to another issue; I'll have a think about this.
>
> Also, you shouldn't have a controller pod running like that. The tests will start/stop controllers, so if you have a "production" one running too it's likely you will get conflicts. Do you have the operator helm chart installed? If so, you need to uninstall it for development.

I'm still confused about the difference between the operator, the controller, and the Kubernetes API server.
Also, if the tests automatically start/stop the controller, how can I check the status of said controller at any given time?

@jacobtomlinson (Member) commented on May 10, 2023

> I'm still confused about the difference between the operator, the controller, and the Kubernetes API server.

Operator is a design pattern where you extend the Kubernetes API with Custom Resource Definitions and a custom controller. You effectively treat Kubernetes like a database and have a controller reconcile the state of the world with what that database describes.

Custom Resource Definition is a description of a new object type for Kubernetes to track. It basically creates a new table in the Kubernetes database and describes the data structure for your custom resource type.

Controller is the process that runs on the Kubernetes cluster, listens for API events, and takes action (e.g. on a DaskCluster create event it sets up the cluster resources).

Kubernetes API server is the API that kubectl, kubernetes_asyncio, kubelet, etc. interact with to GET/PUT/PATCH/DELETE the custom resources. It is backed by a database (usually etcd).
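
To make the controller part concrete, a minimal kopf-style sketch (illustrative only, not the actual dask-kubernetes controller):

import kopf

@kopf.on.create("daskcluster.kubernetes.dask.org")
async def daskcluster_create(spec, name, namespace, logger, **kwargs):
    # React to a DaskCluster create event recorded by the API server
    # and reconcile: create the child resources (scheduler pod/service,
    # worker deployments) that the spec describes.
    logger.info(f"Creating resources for DaskCluster {namespace}/{name}")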

Successfully merging this pull request may close these issues.

• Operator fails if nodeport out of allowed range (#536)