Optimisation of constants for Python executable phenotypes #129

Open · LSivo opened this issue Jun 1, 2021 · 6 comments

LSivo commented Jun 1, 2021

Hello everyone!
I'm using PonyGE2 with pybnf grammars that contain a number of constants. I stumbled across "optimize_constants", which looks interesting, but I'm afraid it only works for "evaluable" phenotypes, not for "executable" ones.
Am I wrong? If I am, does anybody have more detailed information to share so I can make it work? If I'm not, is there any short-term plan to make optimisation of constants work with executable phenotypes too?
Thanks in advance!


jmmcd commented Jun 1, 2021

Hmm, it seems like it should be easy.

All of our supervised learning fitness functions assume that the phenotype is a single expression, and use eval. It's not just an issue with optimize_constants.

So, I guess you already have some custom code which uses exec, right? If you post it, I can suggest how to add optimize_constants.
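(To make the eval/exec distinction concrete, here is a minimal sketch with made-up phenotype strings, not output from any real grammar:)

import numpy as np

x = np.array([1.0, 2.0, 3.0])

# "Evaluable" phenotype: a single expression that eval reduces to a value.
expr_phenotype = "np.sin(x) + 0.5 * x"
y_eval = eval(expr_phenotype, {'np': np, 'x': x})

# "Executable" phenotype: a block of statements run by exec; the caller
# reads the answer back out of the namespace dict afterwards.
exec_phenotype = "tmp = np.sin(x)\nresult = tmp + 0.5 * x"
d = {'np': np, 'x': x}
exec(exec_phenotype, d)
y_exec = d['result']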


LSivo commented Jun 1, 2021

Thanks for your answer, @jmmcd. As you guessed, I have some grammars that produce executable Python code, and I'm using a custom fitness function based on the F1 score. Here it is:

from fitness.base_ff_classes.base_ff import base_ff
from utilities.fitness.get_data import get_data
from utilities.representation.python_filter import python_filter
from algorithm.parameters import params
import numpy as np


class handpd(base_ff):
    maximise = True

    def __init__(self):
        # Initialise base fitness function class.
        super().__init__()

        # Get training and test data
        self.training_in, self.training_exp, self.test_in, self.test_exp = \
            get_data(params['DATASET_TRAIN'], params['DATASET_TEST'])

        # Find number of variables.
        self.n_vars = np.shape(self.training_in)[0]

        # Regression/classification-style problems use training and test data.
        if params['DATASET_TEST']:
            self.training_test = True

    def evaluate(self, ind, **kwargs):
        dist = kwargs.get('dist', 'training')

        if dist == "training":
            # Set training datasets.
            x = self.training_in
            y = self.training_exp

        elif dist == "test":
            # Set test datasets.
            x = self.test_in
            y = self.test_exp

        else:
            raise ValueError("Unknown dist: " + dist)

        tp = 0
        fp = 0
        fn = 0

        # Phenotype source code and the namespace dict it will execute in.
        p, d = ind.phenotype, {'is_within': is_within}

        for i in range(x.shape[1]):
            # Pass one training instance at a time into the exec namespace.
            d['x'] = x[:, i]
            exec(p, d)
            y_p = d['result']
            assert np.isrealobj(y_p)
            if y_p == 1 and y[i] == 1:
                tp += 1
            elif y_p == 1 and y[i] == 0:
                fp += 1
            elif y_p == 0 and y[i] == 1:
                fn += 1

        f1 = (2 * tp) / ((2 * tp) + fp + fn)
        # print(tp, fp, fn, f1)
        return f1


def is_within(val, a, b):
    return min(a, b) <= val <= max(a, b)

Sorry, I didn't manage to render the code properly, so parts of it ended up outside the code section.
However, the problem is that c is unknown. Maybe I should pass it within the dictionary d, but I don't know how. I've seen that the supervised learning template has some code to handle the constant optimisation option, but I still don't understand how to manage the execution of the code... I think I'm messing up.


jmmcd commented Jun 2, 2021

The trick with formatting a large block of code is to use triple backticks before and after. I've fixed your comment. You can re-edit it to see the backtick syntax.

I was thinking of adding exec support to optimize_constants in a generic way, but I don't think I can do it in a way that supports your code. All of your custom code with d['x'] and tp, fp, fn would have to be replicated in the optimize_constants file, so it wouldn't be generic.

So, let's see if we can make this part more generic first.

for i in range(x.shape[1]):
    d['x'] = x[:, i]
    exec(p, d)
    y_p = d['result']

Here I think your idea is to pass one training instance into p at a time, right? I think it's the wrong way around -- normally in the Scikit-Learn convention, each row of the dataset is a single training instance. So should we have this instead?

for i in range(x.shape[0]):
    d['x'] = x[i, :]
    exec(p, d)
    y_p = d['result']

Second, can you look at your grammar and phenotypes and see whether they can run in a vectorised way? Our supervised learning code assumes that everything is vectorised. So, the result of this would be:

d['x'] = x
exec(p, d)
y_p = d['result']

One common stumbling block in vectorisation is when you need to run if-statements. The vectorised analogue of if is numpy.where: https://numpy.org/doc/stable/reference/generated/numpy.where.html. This is sometimes enough to turn an exec situation back into an eval situation.
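(A rough sketch of the idea, using a made-up rule rather than anything from your grammar: a per-instance if over one feature can usually be rewritten as a numpy.where over the whole column, so the phenotype stays a single expression.)

import numpy as np

x0 = np.array([0.2, 0.7, 1.5, 0.1])  # one feature column, all instances at once

# Per-instance version, which forces exec and a Python loop:
#     if is_within(x[0], 0.5, 1.0):
#         result = 1
#     else:
#         result = 0

# Vectorised version: a single expression that eval can reduce to an array.
result = np.where((x0 >= 0.5) & (x0 <= 1.0), 1, 0)
print(result)  # [0 1 0 0]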

Next, I see you want to get the tp, fp, and eventually f1 scores. But if we have vectors y and y_p, then we can calculate all of these in a vectorised way, rather than one at a time. Scikit-Learn gives code for f1.
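(For example, a small numpy sketch assuming y and y_p are 0/1 arrays of the same length; sklearn.metrics.f1_score gives the same number:)

import numpy as np
from sklearn.metrics import f1_score

y = np.array([1, 0, 1, 1, 0])    # true labels
y_p = np.array([1, 1, 0, 1, 0])  # predicted labels

# Confusion-matrix counts computed over the whole vector at once.
tp = np.sum((y_p == 1) & (y == 1))
fp = np.sum((y_p == 1) & (y == 0))
fn = np.sum((y_p == 0) & (y == 1))

f1 = (2 * tp) / ((2 * tp) + fp + fn)
assert np.isclose(f1, f1_score(y, y_p))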

And indeed, f1 is already available as a PonyGE error metric: https://github.com/PonyGE/PonyGE2/blob/master/src/utilities/fitness/error_metric.py.


LSivo commented Jun 2, 2021

Thank you, @jmmcd.
About the x.shape[0] comment, I agree with you: I usually treat each row of a dataset as a single training instance, but when I started using PonyGE2 I found that x = self.training_in returns a transposed version of the dataset, which is why I ended up using x.shape[1] and d['x'] = x[:, i].
About your second comment, I'll need some time to check whether I can rework my grammars to use a vectorised style. Thank you anyway for your help!


jmmcd commented Jun 2, 2021

About the x.shape[0] comment, I agree with you: I usually treat each row of a dataset as a single training instance, but when I started using PonyGE2 I found that x = self.training_in returns a transposed version of the dataset, which is why I ended up using x.shape[1] and d['x'] = x[:, i].

Yes, you're right. I've made a new issue #130 for that discussion. For this issue, let's continue as-is.


jmmcd commented Oct 17, 2021

I've made the change and closed #130. Happy to continue discussion here about vectorisation, or how to get things done using exec.
