sklearn-style interface for regression/classification #47

Open

jmmcd opened this issue Feb 9, 2017 · 6 comments

jmmcd (Collaborator) commented Feb 9, 2017

We should be able to provide a wrapper to allow this sklearn-style usage of our regression/classification:

import ponyge
reg = ponyge.ScikitLearnGERegressor(optimise_constants=True, generations=100)
reg.fit(X, y)
print(reg._formula) # prints out the individual
yhat_test = reg.predict(X_test)
print(y_test, yhat_test)
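
For reference, here is a minimal sketch of what such a wrapper shell could look like, following sklearn's estimator conventions. The class and attribute names are hypothetical, not existing PonyGE2 API:

# Hypothetical sketch only: this class does not exist in PonyGE2 yet.
from sklearn.base import BaseEstimator, RegressorMixin

class ScikitLearnGERegressor(BaseEstimator, RegressorMixin):
    def __init__(self, optimise_constants=False, generations=100):
        # sklearn convention: __init__ only stores hyperparameters, no work happens here
        self.optimise_constants = optimise_constants
        self.generations = generations

    def fit(self, X, y):
        # would configure and run a GE search on (X, y), storing the best individual
        self._formula = None  # best evolved expression, filled in after the run
        return self

    def predict(self, X):
        # would evaluate the stored individual on X and return predictions
        ...

Inheriting from BaseEstimator gives get_params/set_params for free, which is what makes tools like cross_val_score work.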
jmmcd (Collaborator, Author) commented Oct 26, 2021

The sklearn-style interface is just one example of a bigger picture: at the moment, PonyGE2 can only really be run via the command line.

A better approach might be to provide a GE class, or a GERegressor subclass, allowing users to manage runs from a notebook (eg #101), a bit like this:

for p in [0.001, 0.01]:
    reg = ponyge.GERegressor(pmut=p)
    reg.fit(X, y)
    print(reg.score(X, y))

I think we can't loop over hyperparameters like this, at the moment, because the hyperparameter handling is mixed with the command-line parsing. (Right?)

(If we could create a class as above, then the command-line interface would just be a script that handles CLI arguments and makes an appropriate call to this class.)
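
A side benefit: if the class followed sklearn's fit/score conventions, standard tooling could also handle hyperparameter sweeps. A hypothetical example, assuming a sklearn-compatible ponyge.GERegressor existed:

from sklearn.model_selection import GridSearchCV

# sweep over mutation probabilities with sklearn's own machinery
search = GridSearchCV(ponyge.GERegressor(), param_grid={'pmut': [0.001, 0.01]})
search.fit(X, y)
print(search.best_params_, search.best_score_)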

The biggest obstacle to this is our parameter handling. Parameters are held in the parameters module (not a class). It was designed this way, I think, so that we don't have to pass zillions of parameters into every function: we just get access to the parameters module via import. (That code also feels a bit spaghetti-like, as we import the module and then overwrite values in it.)

This branch takes a different approach:

https://github.com/aadeshnpn/PonyGE2

It unifies parameters, stats and trackers into a single class (if I understand right). Every function that needs it now takes an extra argument named parameter.

@aadeshnpn, are you still working with PonyGE2? If so, could you please show us how to start a run in your approach?

aadeshnpn (Contributor) commented Oct 27, 2021

I totally agree, @jmmcd. We need wrappers to allow sklearn-style usage of the whole library, so that it can be used with ease by many other researchers. I am still using my modified version of PonyGE2 in my research. If the PonyGE2 maintainers want, I can send a merge request with the changes I have made to unify the scripts. Then we can go through the changes and decide which should be kept.
To answer your question on how to use the modified PonyGE2, here is a simple example:

  1. Clone the repo https://github.com/aadeshnpn/PonyGE2
  2. Install the repo using pip install .
  3. Initialize the parameters
from ponyge.operators.initialisation import initialisation
from ponyge.fitness.evaluation import evaluate_fitness
from ponyge.operators.crossover import crossover
from ponyge.operators.mutation import mutation
from ponyge.operators.replacement import replacement
from ponyge.operators.selection import selection
from ponyge.algorithm.parameters import Parameters
parameter = Parameters()
parameter_list = ['--parameters', '..,regression.txt'] # path,filename (comma-separated)
parameter.params['RANDOM_SEED'] = 123
parameter.params['POPULATION_SIZE'] = 100
parameter.set_params(parameter_list)
individuals = initialisation(parameter, 1)
individuals = evaluate_fitness(individuals, parameter)
  4. Now you can use a loop to wrap the generational steps:
generations = parameter.params['GENERATIONS']  # or any other number of generations
for i in range(generations):
    parents = selection(parameter, individuals)
    cross_pop = crossover(parameter, parents)
    new_pop = mutation(parameter, cross_pop)
    new_pop = evaluate_fitness(new_pop, parameter)
    individuals = replacement(parameter, new_pop, individuals)
    individuals.sort(reverse=True)

(Edited by jmmcd to fix the code a little.)
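
(One more small addition: after the loop, the best individual can be read off the front of the sorted list. The attribute names below follow the Individual class in the main repo, so treat this as a sketch.)

best = individuals[0]  # list is sorted best-first after individuals.sort(reverse=True)
print(best.phenotype, best.fitness)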

jmmcd (Collaborator, Author) commented Oct 28, 2021

Excellent, thanks.

This overall approach makes sense to me.

The code doesn't quite work as-is. I edited the above to get it started. Then I ran into some old bugs (eg sklearn.classification.metrics) and places where the new parameter argument needs to be added (eg the super() call in regression.py). After hacking around these I hit another error which I'm not sure about:

    336             # Set GENOME_OPERATIONS automatically for faster linear operations.
--> 337             if self.params['CROSSOVER'].representation == "linear" and \
    338                     self.params['MUTATION'].representation == "linear":
    339                 self.params['GENOME_OPERATIONS'] = True

AttributeError: 'function' object has no attribute 'representation'

But maybe the most efficient use of time wouldn't be to track all these down and try to make a clean PR from a fork which has diverged. Instead, let's discuss the design we want and then if we decide to go ahead, implement it in a branch of the main repo (PonyGE/PonyGE2).

One thing to discuss is: does it make sense to have Stats and Trackers as members of the Parameters class? The naming is confusing. Instead, maybe we should have a State class, containing everything stateful, ie Stats, Trackers and Parameters?

class State:
    def __init__(self):
        self.params = {} # etc
        self.trackers = Trackers()
        self.stats = Stats()

Then every function would be like:

def crossover(state, parents):
    cross_pop = []
    while len(cross_pop) < state.params['GENERATION_SIZE']:
        ...

(and every function outside State is allowed to read/write State but should be pure otherwise).

The point of it all is that every run has a State object. (It could even be called Run instead of State.) So we can then have a GE class and instantiate multiple instances of it, each of which creates a State at startup, without them interfering with each other.
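
A rough sketch of that shape (all names hypothetical):

class GE:
    def __init__(self, fitness=None, **params):
        # each instance owns its own State, so two runs never share module-level globals
        self.state = State()
        self.state.params.update(params)
        self.state.params['FITNESS_FUNCTION'] = fitness

    def fit(self, X, y):
        # the evolutionary loop would pass self.state explicitly to every operator
        ...
        return self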

(By the way, there is a fork at https://github.com/p-pereira/evoltree which is worth a look. It's not a fork via GitHub, just via copy-paste. It has a nice .fit(X, y) interface: https://github.com/p-pereira/evoltree/blob/main/evoltree/evoltree.py.)

I can see quite a few things that could go wrong so I'll continue this brain-dump...

In a previous issue #83, there was discussion of how difficult it would be to have a GE constructor with all the arguments. Counterpoint: gplearn does ok, though it doesn't have as many: https://github.com/trevorstephens/gplearn. I think the best approach would probably be to choose default parameters that give a "vanilla" GE experience (no derivation trees or unusual operators), but NO DEFAULT FITNESS FUNCTION. The user has to specify it. But then in a GERegressor subclass, for example, we might have rmse as the default objective, but still have .fit(X, y) as the method of specifying the data.
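
To make that concrete, a hypothetical GERegressor built on the GE sketch above might look like this (the parameter names are illustrative only):

class GERegressor(GE):
    def __init__(self, **params):
        # rmse as the default objective for regression; GE itself has no default fitness
        params.setdefault('fitness', 'rmse')
        super().__init__(**params)

    def fit(self, X, y):
        # the data arrives here, sklearn-style, rather than via a parameters file
        self.state.params['DATASET'] = (X, y)  # illustrative key, not the real parameter name
        return super().fit(X, y)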

We've put some work into creating these nice parameters files for different example problems. We don't have to throw that away. We can easily have a method State.create_from_parameter_file(filename). And then the command-line interface ponyge --parameters filename just calls that method.
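
Extending the State sketch above, that method might look roughly like this (read_parameter_file is a placeholder for the existing parameter-file parsing code, not a real function name):

class State:
    ...

    @classmethod
    def create_from_parameter_file(cls, filename):
        state = cls()
        # reuse the existing parameter-file reader here (placeholder call)
        state.params.update(read_parameter_file(filename))
        return state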

There are several parts of the code I know nothing about, especially scripts, the parsing stuff, progsys. The best I'll be able to do here is to run ponyge --parameters filename for each example problem and check that it doesn't crash.

The import code that currently runs at startup is quite intricate and I would be worried about breaking things there. It was aimed at making it easy for users to add their own fitness functions, operators, etc, by putting them directly into the PonyGE src/ tree. With the new approach, which treats PonyGE as a library, ordinary users would instead write their new fitness function elsewhere (eg in a notebook) and pass it in, ge = GE(fitness=my_fitness_fn, population=100), without touching the src/ tree. But I think there could be users who (a) want to hack PonyGE itself and (b) want to use the command line, so we want to keep the convenience there. It would be nice if, at least, all that complexity became part of the command-line interface, not part of the library itself.
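
For example, a sketch (the fitness-function signature here is illustrative, not the current PonyGE2 interface):

# Defined by the user in their own notebook or script, never touching src/:
def my_fitness_fn(phenotype):
    # score the evolved phenotype however the user likes; smaller is better here
    return len(phenotype)

ge = GE(fitness=my_fitness_fn, population=100)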

aadeshnpn (Contributor) commented

Excellent points, @jmmcd. I like the idea of using a divergent branch that has some of the issues fixed, and then creating a PR from there. It would be really useful to discuss the overall framework design and the design patterns we think might be useful for the new PR.

I think I have a good understanding of the scripts and the parsing code in the library, and if we combine our efforts we can definitely build a nice object-oriented interface with backward compatibility with the command-line interface. I think it would be efficient for us to schedule a Zoom meeting with a proper agenda covering the changes we want. That should give us a good starting point.

jmmcd (Collaborator, Author) commented Nov 2, 2021

Thanks. Yes, a meeting would be helpful, especially if there are any other interested parties?

I'm not quite ready to schedule it as things are busy here, but hopefully within a week or two.

I am still re-learning some parts of the system. Just now I found the state.py module, which is currently used for saving a run and loading it again later. It includes stats and params. Maybe this should become the State class I mentioned earlier. https://github.com/PonyGE/PonyGE2/blob/master/src/utilities/algorithm/state.py

aadeshnpn (Contributor) commented

Thanks. A week or two should work for me, as I am working towards a paper deadline.
