
Transposed data for supervised learning #130

Closed
jmmcd opened this issue Jun 2, 2021 · 4 comments

Comments

@jmmcd (Collaborator) commented Jun 2, 2021

In #129 we are discussing the X dataset being transposed by PonyGE (relative to the Scikit-Learn convention).

I see that we do indeed transpose the data here:

    train_X = train_Xy[:, :-1].transpose()  # all columns but last

I think the motivation here is that we can write a grammar which will work correctly whether processing a single row or a whole dataset. E.g. in Vladislavleva4 we have x[0]|x[1]|x[2]|x[3]|x[4]. With transposed data, this works in both cases.
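To make the motivation concrete, here is a hedged sketch (toy data and the expression are illustrative, not from PonyGE itself) of why the transpose lets the same generated expression work on a whole dataset and on a single raw row:

```python
import numpy as np

# Toy data in the sklearn convention: rows are samples, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# PonyGE transposes, so x[0] becomes the whole first *feature* column.
Xt = X.transpose()

def f(x):
    # The kind of expression a grammar might generate from x[0]|x[1]
    return x[0] + x[1]

print(f(Xt))    # vectorised over all samples: [11. 22. 33.]
print(f(X[0]))  # the same code on one raw row: 11.0
```

The same `f` handles both cases only because, after the transpose, `x[0]` means "feature 0 across all samples" for the dataset and "feature 0 of this sample" for a single row.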

But it is different from the convention used by Scikit-Learn, Tensorflow, etc. Should we consider a change here?

@dvpfagan (Collaborator) commented Jun 4, 2021 via email

@jmmcd (Collaborator, Author) commented Jun 4, 2021

> What would be involved in changing to a scikit-learn style dataset, that would allow for support of the Vlad4 style grammars?

We are just using NumPy, not Pandas, so there is no .loc. I think we would be removing the transpose and changing the grammars to say x[:, 0] etc. And if someone wanted to run the function on a single row of data x, they'd have to reshape it with x.reshape((1, len(x))) or similar.
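As a hedged sketch of the proposed convention (toy data, and the grammar output shown is an assumption about what the changed grammars would emit), the expression would index columns and a single row would need reshaping first:

```python
import numpy as np

# Sklearn convention: rows are samples, columns are features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0]])

def f(x):
    # What a changed grammar might emit: x[:, 0] etc.
    return x[:, 0] * x[:, 1]

print(f(X))  # per-sample results: [10. 40.]

# A single row must be reshaped into a 1-sample 2-D array first.
row = X[0]                            # shape (2,)
print(f(row.reshape((1, len(row)))))  # [10.]
```

The reshape step is the cost of the change: without it, `row[:, 0]` on a 1-D row raises an IndexError.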

> What would this proposed change gain us over what we currently have?

Nothing! Well, it would stick to the convention, so it would possibly be easier for users writing custom code as in #129.

@dvpfagan (Collaborator) commented Jun 4, 2021 via email

@jmmcd (Collaborator, Author) commented Oct 17, 2021

I think we should go ahead with this. More generally, the sklearn standard would be good to align with (also eventually inheriting from RegressorMixin etc.). I'm planning to use PonyGE for some symbolic regression problems in the next few weeks, so I have some time to make the changes and mop up any problems.
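As a hedged sketch of what aligning with the sklearn estimator convention could eventually look like (all names here are hypothetical and evolution is stubbed out; a real version would subclass sklearn.base.BaseEstimator and RegressorMixin):

```python
import numpy as np

class PonyGERegressorSketch:
    """Hypothetical interface sketch, not the real PonyGE API.
    Follows the sklearn fit/predict convention where X has
    shape (n_samples, n_features)."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        assert X.shape[0] == y.shape[0]  # rows are samples
        # Evolution of a phenotype would happen here; as a stub,
        # just remember the mean of y.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        X = np.asarray(X)
        return np.full(X.shape[0], self.mean_)

est = PonyGERegressorSketch().fit([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0])
print(est.predict([[5.0, 6.0]]))  # one prediction per row
```

Dropping the transpose is a precondition for this shape of API, since sklearn's fit/predict contract assumes samples along axis 0.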
