Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommended instantiation with ActiveStatus leads to issues #22

Open
knkski opened this issue Oct 28, 2021 · 5 comments
Open

Recommended instantiation with ActiveStatus leads to issues #22

knkski opened this issue Oct 28, 2021 · 5 comments

Comments

@knkski
Copy link
Contributor

knkski commented Oct 28, 2021

We recommend using this library like this:

class Operator(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)

        try:
            self.interfaces = get_interfaces(self)
        except NoVersionsListed as err:
            self.model.unit.status = WaitingStatus(str(err))
            return
        except NoCompatibleVersions as err:
            self.model.unit.status = BlockedStatus(str(err))
            return
        else:
            self.model.unit.status = ActiveStatus()

That leads to an issue where ActiveStatus can overwrite a BlockedStatus. This can be reproduced with:

juju deploy ch:istio-gateway istio-ingressgateway-operator --config=kind=ingress --trust

Which resolves to revision 6 as of this writing. It should be blocking on needing a relation to istio-pilot, but that status gets overwritten. juju show-status-log shows this:

$ juju show-status-log istio-ingressgateway-operator/0
Time                   Type       Status     Message
28 Oct 2021 11:26:22Z  juju-unit  executing  running config-changed hook
28 Oct 2021 11:26:23Z  workload   active     
28 Oct 2021 11:26:23Z  workload   blocked    Waiting for istio-pilot relation
28 Oct 2021 11:26:23Z  juju-unit  executing  running start hook
28 Oct 2021 11:26:24Z  juju-unit  idle       
28 Oct 2021 13:55:32Z  workload   active     
28 Oct 2021 13:56:15Z  juju-unit  executing  running istio-pilot-relation-created hook
28 Oct 2021 13:56:16Z  juju-unit  idle       
28 Oct 2021 13:56:51Z  juju-unit  executing  running istio-pilot-relation-joined hook for istio-pilot/0
28 Oct 2021 13:56:51Z  workload   waiting    List of istio-pilot versions not found for apps: istio-pilot
28 Oct 2021 13:56:52Z  juju-unit  executing  running istio-pilot-relation-changed hook for istio-pilot/0
28 Oct 2021 13:56:52Z  workload   active     
28 Oct 2021 13:56:52Z  workload   waiting    Waiting for istio-pilot relation data
28 Oct 2021 13:56:52Z  juju-unit  executing  running istio-pilot-relation-changed hook
28 Oct 2021 13:56:53Z  workload   active     
28 Oct 2021 13:56:53Z  workload   waiting    Waiting for istio-pilot relation data
28 Oct 2021 13:56:53Z  juju-unit  idle       
28 Oct 2021 13:56:56Z  juju-unit  executing  running istio-pilot-relation-changed hook
28 Oct 2021 14:01:59Z  juju-unit  idle       
28 Oct 2021 14:20:48Z  workload   active

Deploying istio-gateway manually without the else: self.model.unit.status = ActiveStatus() clause worked initially, but broke after repeated relation joins and removals:

$ juju status
Model         Controller  Cloud/Region        Version  SLA          Timestamp
istio-system  uk8s        microk8s/localhost  2.9.17   unsupported  12:10:54-05:00

App                            Version  Status   Scale  Charm          Store  Channel  Rev  OS          Address         Message
istio-ingressgateway-operator           waiting      1  istio-gateway  local             0  kubernetes  10.152.183.164  installing agent

Unit                              Workload  Agent  Address      Ports  Message
istio-ingressgateway-operator/0*  waiting   idle   10.1.151.73         List of istio-pilot versions not found for apps: istio-pilot
$ juju show-status-log istio-ingressgateway-operator/0
Time                        Type       Status     Message
28 Oct 2021 09:52:07-05:00  juju-unit  executing  running istio-pilot-relation-joined hook for istio-pilot/1
28 Oct 2021 09:52:08-05:00  juju-unit  executing  running istio-pilot-relation-changed hook for istio-pilot/1
28 Oct 2021 09:52:08-05:00  workload   waiting    List of istio-pilot versions not found for apps: istio-pilot
28 Oct 2021 09:52:08-05:00  juju-unit  idle
28 Oct 2021 09:52:09-05:00  juju-unit  executing  running istio-pilot-relation-changed hook
28 Oct 2021 09:52:09-05:00  juju-unit  idle
28 Oct 2021 09:52:15-05:00  juju-unit  executing  running istio-pilot-relation-changed hook for istio-pilot/1
28 Oct 2021 09:52:15-05:00  workload   waiting    Waiting for istio-pilot relation data
28 Oct 2021 09:52:15-05:00  juju-unit  idle
28 Oct 2021 09:52:22-05:00  juju-unit  executing  running istio-pilot-relation-changed hook
28 Oct 2021 09:52:22-05:00  workload   active
28 Oct 2021 09:52:23-05:00  juju-unit  idle
28 Oct 2021 09:52:29-05:00  juju-unit  executing  running istio-pilot-relation-departed hook for istio-pilot/1
28 Oct 2021 09:52:29-05:00  juju-unit  executing  running istio-pilot-relation-broken hook
28 Oct 2021 09:52:30-05:00  juju-unit  idle
28 Oct 2021 09:54:04-05:00  juju-unit  executing  running istio-pilot-relation-created hook
28 Oct 2021 09:54:05-05:00  juju-unit  idle
28 Oct 2021 09:58:10-05:00  juju-unit  executing  running istio-pilot-relation-broken hook
28 Oct 2021 09:58:10-05:00  workload   waiting    List of istio-pilot versions not found for apps: istio-pilot
28 Oct 2021 09:58:11-05:00  juju-unit  idle
@knkski
Copy link
Contributor Author

knkski commented Oct 28, 2021

To spitball a fix, here's what we could do in the istio-gateway charm to fix both issues, and guarantee that a status won't be accidentally written over:

import logging
import subprocess

from jinja2 import Environment, FileSystemLoader
from ops.charm import CharmBase
from ops.framework import StoredState
from ops.main import main
from ops.model import ActiveStatus, BlockedStatus
from serialized_data_interface import NoCompatibleVersions, NoVersionsListed, get_interfaces


class Operator(CharmBase):
    storage = StoredState()

    def __init__(self, *args):
        super().__init__(*args)

        self.storage.set_default(blockers=set())

        if not self.unit.is_leader():
            self.storage.blockers.add("Waiting for leadership")
            self.update_status()
            return
        else:
            if "Waiting for leadership" in self.storage.blockers:
                self.storage.blockers.remove("Waiting for leadership")

        try:
            self.interfaces = get_interfaces(self)
        except NoVersionsListed:
            self.storage.blockers.add('NoVersionsListed')
            self.update_status()
            return
        except NoCompatibleVersions:
            self.storage.blockers.add('NoCompatibleVersions')
            self.update_status()
            return
        else:
            if 'NoVersionsListed' in self.storage.blockers:
                self.storage.blockers.remove('NoVersionsListed')
            if 'NoCompatibleVersions' in self.storage.blockers:
                self.storage.blockers.remove('NoCompatibleVersions')

        self.update_status()

        self.log = logging.getLogger(__name__)

        self.framework.observe(self.on.install, self.install)
        self.framework.observe(self.on["istio-pilot"].relation_changed, self.install)
        self.framework.observe(self.on.config_changed, self.install)

    def update_status(self):
        if self.storage.blockers:
            self.model.unit.status = BlockedStatus('; '.join(self.storage.blockers))
        else:
            self.model.unit.status = ActiveStatus()

    def install(self, event):
        """Install charm."""

        if self.model.config['kind'] not in ('ingress', 'egress'):
            self.storage.blockers.add('Config item `kind` must be set')
            self.update_status()
            return
        else:
            if 'Config item `kind` must be set' in self.storage.blockers:
                self.storage.blockers.remove('Config item `kind` must be set')

        if not self.model.relations['istio-pilot']:
            self.storage.blockers.add("Waiting for istio-pilot relation")
            self.update_status()
            return
        else:
            if "Waiting for istio-pilot relation" in self.storage.blockers:
                self.storage.blockers.remove("Waiting for istio-pilot relation")

        if not ((pilot := self.interfaces["istio-pilot"]) and pilot.get_data()):
            self.storage.blockers.add("Waiting for istio-pilot relation data")
            self.update_status()
            return
        else:
            if "Waiting for istio-pilot relation data" in self.storage.blockers:
                self.storage.blockers.remove("Waiting for istio-pilot relation data")

        pilot = list(pilot.get_data().values())[0]

        env = Environment(loader=FileSystemLoader('src'))
        template = env.get_template('manifest.yaml')
        rendered = template.render(
            kind=self.model.config['kind'],
            namespace=self.model.name,
            pilot_host=pilot['service-name'],
            pilot_port=pilot['service-port'],
        )

        subprocess.run(["./kubectl", "apply", "-f-"], input=rendered.encode('utf-8'), check=True)

        self.update_status()


if __name__ == "__main__":
    main(Operator)

@DomFleischmann
Copy link
Contributor

DomFleischmann commented Oct 29, 2021

Reading about the relation_changed hook in the docs makes me think if we might be setting juju statuses when we shouldn't. I'm specifically referring to this section about the relation_changed hook:

Callback methods bound to this event should be the only ones that rely on remote relation settings. They should not error if the settings are incomplete, since it can be guaranteed that when the remote unit or application changes its settings, the event will fire again.

Our main issue with CI, when removing the else: ActiveStatus() part is that we pass to WaitingStatus if the relation is not yet providing the relation data and never set the ActiveStatus once it does. What I'm wondering is if we should even bother with setting this WaitingStatus, as said in the docs, if the relation data changes the hook will fire again.

Now, of course, if the relation data is unavailable for a critical part of the charm (e.g. creating the resources) we should be aware of that and notify it. Therefore I think that the Waiting status should be set in the install method and we should consider defering that event so that once we get the updated relation data we also re run that part of the code which would also be in charge of setting the Active state once it has done what it needs to do.

@DomFleischmann
Copy link
Contributor

DomFleischmann commented Oct 29, 2021

To transform my wall of text into something understandable here is how the code would look like more or less. @knkski @ca-scribner Please let me know if I'm missing something.

class Operator(CharmBase):
    storage = StoredState()

    def __init__(self, *args):
        super().__init__(*args)

        self.log = logging.getLogger(__name__)

        if not self.unit.is_leader():
            self.storage.blockers.add("Waiting for leadership")
            return

        try:
            self.interfaces = get_interfaces(self)
        except NoVersionsListed:
            # Log that the data is not available yet, but don't set any status
            # In this part of the code it is not our problem yet.
            self.log.info("log something about it")
        except NoCompatibleVersions:
            # I understand this will only occur if something is actually wrong.
            self.model.unit.status = BlockedStatus('NoCompatible Versions')
            return


        self.framework.observe(self.on.install, self.install)
        self.framework.observe(self.on["istio-pilot"].relation_changed, self.install)
        self.framework.observe(self.on.config_changed, self.install)

    def install(self, event):
        """Install charm."""

        if self.model.config['kind'] not in ('ingress', 'egress'):
            self.model.unit.status = BlockedStatus('Config item `kind` must be set')
            # A event.defer() shouldn't be needed here as config_changed will also call this method.
            return

        if not ((pilot := self.interfaces["istio-pilot"]) and pilot.get_data()):
            # Actually set a Waiting Status as we can't proceed without it
            self.model.unit.status = WaitingStatus('Istio-pilot data not ready')
            # Revisit the install event once the data is available.
            event.defer()
            return

        pilot = list(pilot.get_data().values())[0]

        env = Environment(loader=FileSystemLoader('src'))
        template = env.get_template('manifest.yaml')
        rendered = template.render(
            kind=self.model.config['kind'],
            namespace=self.model.name,
            pilot_host=pilot['service-name'],
            pilot_port=pilot['service-port'],
        )

        subprocess.run(["./kubectl", "apply", "-f-"], input=rendered.encode('utf-8'), check=True)

        # Make sure we set the ActiveStatus once it actually is active and not anywhere else
        self.model.unit.status = ActiveStatus()


if __name__ == "__main__":
    main(Operator)

@ca-scribner
Copy link
Contributor

ca-scribner commented Oct 29, 2021

A minor edit to @DomFleischmann's suggestion, using constants for status to avoid easy typos. We could probably template things like this too if there's a need (eg: if there's a "relation X missing value 'somethingImportant'", we could have template_relation_value_missing="relation {name} missing ..." or something).

WAITERS = {
    "istio_pilot": "Istio-pilot data not ready"
}

BLOCKERS = {
    "no_interface_version": "No compatible versions",
    "no_kind": "Config item `kind` must be set",
    "not_leader": "Waiting for leadership",
}

ERRORS = {
    "something_could": "be here too",
}


class Operator(CharmBase):
    storage = StoredState()

    def __init__(self, *args):
        super().__init__(*args)

        self.storage.set_default(blockers=set())

        self.log = logging.getLogger(__name__)

        if not self.unit.is_leader():
            self.storage.blockers.add(BLOCKERS["not_leader"])
            return

        try:
            self.interfaces = get_interfaces(self)
        except NoVersionsListed:
            # Log that the data is not available yet, but don't set any status
            # In this part of the code it is not our problem yet.
            self.log.info("log something about it")
        except NoCompatibleVersions:
            # I understand this will only occur if something is actually wrong.
            self.model.unit.status = BlockedStatus(BLOCKERS["no_interface_version"])
            return


        self.framework.observe(self.on.install, self.install)
        self.framework.observe(self.on["istio-pilot"].relation_changed, self.install)
        self.framework.observe(self.on.config_changed, self.install)

    def install(self, event):
        """Install charm."""

        if self.model.config['kind'] not in ('ingress', 'egress'):
            self.model.unit.status = BlockedStatus(BLOCKERS["no_kind"])
            # A event.defer() shouldn't be needed here as config_changed will also call this method.
            return

        if not ((pilot := self.interfaces["istio-pilot"]) and pilot.get_data()):
            # Actually set a Waiting Status as we can't proceed without it
            self.model.unit.status = WaitingStatus(WAITERS["istio_pilot"])
            # Revisit the install event once the data is available.
            event.defer()
            return

        pilot = list(pilot.get_data().values())[0]

        env = Environment(loader=FileSystemLoader('src'))
        template = env.get_template('manifest.yaml')
        rendered = template.render(
            kind=self.model.config['kind'],
            namespace=self.model.name,
            pilot_host=pilot['service-name'],
            pilot_port=pilot['service-port'],
        )

        subprocess.run(["./kubectl", "apply", "-f-"], input=rendered.encode('utf-8'), check=True)

        # Make sure we set the ActiveStatus once it actually is active and not anywhere else
        self.model.unit.status = ActiveStatus()


if __name__ == "__main__":
    main(Operator)

@ca-scribner
Copy link
Contributor

I'm wondering if the install hook is the right hook for us to install on. It sounds odd, but at install we will never (or rarely) have the data we actually need to install. I think our install event should actually be on the relation_changed event only. But I'm still thinking that through and don't have a code suggestion yet :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants