Testing, bad results even on training sample after convergence #28

Open
Alexjap opened this issue Feb 15, 2017 · 15 comments

Alexjap commented Feb 15, 2017

Right now, following the instructions in the README:

The training procedure seems to converge (perplexity around 1 on the toy example), but when we test on the same data (the toy example itself) the results are quite bad. Is anyone experiencing this behavior as well? I tried to look into the bucketing part of the code; I'm not sure why the bucketing in evaluation and in training differ, but that doesn't seem to be the cause anyway (I tried with the same bucketing and still got bad results).
The versions of Keras and TensorFlow are the recommended ones (Keras 1.1.1 and TF 0.11.0).

Alexjap changed the title from "Testing on training sample, bad results" to "Testing, bad results even on training sample after convergence" on Feb 15, 2017
ddaue commented Feb 22, 2017

I solved it with the following modification in model.py, line 352, though I have no explanation of why it should be like that... I searched a lot...

#if not forward_only:
if True:
    # Workaround: always feed learning phase = 1 (training mode), even at test time.
    input_feed[K.learning_phase()] = 1
else:
    input_feed[K.learning_phase()] = 0

Alexjap commented Feb 23, 2017

Yeah, it would work on the training data, but I don't think it can be considered a fix. Setting the learning phase to 1 always means that we are in training mode, so any layer that behaves differently in train/test will be set to train even when we are testing.
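For reference, a minimal sketch of how the learning-phase flag is meant to be fed (assuming the Keras 1.x backend API; build_feed and forward_only are illustrative names, not necessarily the repo's own):

import keras.backend as K

# K.learning_phase() is a placeholder that layers such as Dropout and
# BatchNormalization read to choose between train and test behavior.
def build_feed(input_feed, forward_only):
    # forward_only=True means inference, so the learning phase should be 0;
    # feeding 1 unconditionally forces every such layer into training mode.
    input_feed[K.learning_phase()] = 0 if forward_only else 1
    return input_feed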

da03 (Owner) commented Feb 24, 2017

Yes. If you set that flag to 1 during the test phase, it basically means that when you receive a test batch you do the same thing as in training: subtracting some mean computed over the test batch. While that's not inconsistent between training and testing, doing it is kind of unfair, since presumably we should only use a test point's own information to classify it, without looking at statistics over a batch of test examples. Sorry, I'm busy with a deadline; I will look into the code later.

seed93 commented Feb 27, 2017

This is because of the difference in BatchNormalization behavior between training and testing.
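For context, a minimal numpy sketch of that difference (illustrative only, not code from this repo): in training mode BatchNormalization normalizes with the current batch's statistics, while in test mode it uses the running mean/variance accumulated during training.

import numpy as np

def batchnorm(x, running_mean, running_var, training, eps=1e-5):
    if training:
        # Training mode: normalize with the statistics of the current batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Test mode: normalize with the running statistics saved from training.
        mean, var = running_mean, running_var
    return (x - mean) / np.sqrt(var + eps)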

NourozR commented Mar 3, 2017

I trained the model down to step perplexity = 1.006652, error = 0.0082, then tried to test using the SVT and IIIT5K datasets. For both datasets I got 100% incorrect results, which is totally unexpected. I then used the provided pre-trained model, but still got the same results.

I use Keras 1.1.1 and TF 0.12.1. I used the distance package and tried other datasets as well. Any help? This is an important project for me, please help.

shrazo commented Mar 4, 2017

Remove the tf.gfile.Exists(ckpt.model_checkpoint_path) check from model.py.
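A sketch of what that change looks like (the usual TF 0.x restore pattern; the exact surrounding code in model.py may differ, and model_dir, saver, and sess stand for whatever the surrounding code defines). With the V2 checkpoint format used by newer TF versions, the checkpoint path is a prefix rather than a single file, so tf.gfile.Exists() can return False even though the checkpoint exists:

ckpt = tf.train.get_checkpoint_state(model_dir)
# Before: the extra Exists() check fails on split (V2) checkpoints,
# so testing runs with freshly initialized weights instead of the trained ones.
# if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
else:
    sess.run(tf.initialize_all_variables())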

raoweijin commented Mar 9, 2017

I am hitting the same issue as Alexjap. Has anyone found the root cause?
keras version: 1.1.1
tensorflow version: 0.12.1
Windows 10
I just created 3 pictures for 'a', 'b', 'c' and trained on them. The picture size is 31*31. I tested on the same 3 pictures; the results are bad, too.
If I modify the code as below, the test result is OK.
#if not forward_only:
if True:
    input_feed[K.learning_phase()] = 1
else:
    input_feed[K.learning_phase()] = 0

Train result:
2017-03-08 16:35:58,463 root INFO step_time: 1.881249, step_loss: 0.001654, step perplexity: 1.001656
2017-03-08 16:35:58,469 root INFO current_step: 198
2017-03-08 16:36:00,323 root INFO step_time: 1.854232, step_loss: 0.001635, step perplexity: 1.001637
2017-03-08 16:36:00,329 root INFO current_step: 199
2017-03-08 16:36:02,229 root INFO step_time: 1.900263, step_loss: 0.001617, step perplexity: 1.001618
2017-03-08 16:36:02,679 root INFO global step 200 step-time 1.91 loss 0.156341 perplexity 1.17
2017-03-08 16:36:02,679 root INFO Saving model, current_step: 200

Test result:
2017-03-08 16:37:50,221 root INFO Reading model parameters from ./results/model\translate.ckpt-200
2017-03-08 16:38:00,177 root INFO model is established and start to launch model
2017-03-08 16:38:00,178 root INFO start to test
2017-03-08 16:38:00,178 root INFO Compare word based on edit distance.
2017-03-08 16:38:00,844 root INFO step_time: 0.598397, loss: 1.272859, step perplexity: 3.571049
2017-03-08 16:38:00,847 root INFO 0.000000 out of 1 correct
2017-03-08 16:38:01,183 root INFO step_time: 0.335222, loss: 2.004660, step perplexity: 7.423572
2017-03-08 16:38:01,185 root INFO 0.000000 out of 2 correct
2017-03-08 16:38:01,494 root INFO step_time: 0.308204, loss: 1.537001, step perplexity: 4.650624
2017-03-08 16:38:01,496 root INFO 0.000000 out of 3 correct

Alexjap commented Mar 9, 2017

I think what seed93 said might make sense; maybe it is related to the batch normalisation behavior, but I didn't have time to test without it to see if things change.
In the CNN part of the model (the Keras code) we should try removing the batch normalisation layers and training again to see if things change. I currently don't have access to a proper machine to try this out and am a bit busy.
The test would be to comment out all the model.add(layers.BatchNormalization(axis=1)) calls in cnn.py, retrain, and see if testing on the training data is consistent (a sketch of that edit is below).
By removing the batch normalisation we should expect slower convergence during training, but it would be enough to check whether it's actually the BN that breaks the model at test time.
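A minimal sketch of that experiment, assuming cnn.py builds a Keras 1.x Sequential model (the layers shown are illustrative, not the repo's exact architecture):

from keras import layers, models

def build_cnn_without_bn(input_shape):
    model = models.Sequential()
    model.add(layers.Convolution2D(64, 3, 3, border_mode='same',
                                   input_shape=input_shape))
    # model.add(layers.BatchNormalization(axis=1))  # commented out for the test
    model.add(layers.Activation('relu'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    return model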

seed93 commented Mar 9, 2017

@Alexjap I used the pull request's code and found this bug. Change cnn_model = CNN(self.img_data, True) #(not self.forward_only)) to cnn_model = CNN(self.img_data, not self.forward_only)
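That is, roughly this change in model.py (a sketch; the surrounding code may differ slightly):

# Before: the CNN is always built in training mode, so BatchNormalization
# keeps using per-batch statistics even at test time.
# cnn_model = CNN(self.img_data, True)  # (not self.forward_only))

# After: pass the real phase; forward_only=True builds the CNN in test mode.
cnn_model = CNN(self.img_data, not self.forward_only)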

Alexjap commented Mar 9, 2017

@seed93 I quickly checked the code you mentioned. If we change the code like that, we set the CNN model to testing (frozen weights) when we are training and vice versa; it looks a bit strange to me.

NourozR commented Mar 14, 2017

I looked into the code and found a 'false' argument in model.py (lines 204-211) while debugging. The problem was actually that the system was unable to load the trained model. So I edited the code a little and found that the model I trained is now loaded and working. But the accuracy is still low (12-15% for both the SVT and IIIT5K test datasets). The problem is with these variables: "batchnormalization_3_running_mean:0 NOT trainable" and "batchnormalization_3_running_std:0 NOT trainable". This happened because the newer TF and Keras versions can't compute the mean and standard deviation from these two variables, and the same goes for the pre-trained model. And since the models are binary files, there is no room to change them.

Also, in the test phase, the system gives accurate results for the first input of a mini-batch but not for the rest of the data. This was strange to me.

NourozR commented Mar 14, 2017

@raoweijin, I faced the same problem and solved it with this: remove tf.gfile.Exists(ckpt.model_checkpoint_path) from model.py. @shraju024 is right.

jvpoulos commented Apr 7, 2017

Solved this problem with SivanKe#1, as suggested by seed93.

zj463261929 commented

@NourozR I am facing the same problem now. Can removing tf.gfile.Exists(ckpt.model_checkpoint_path) really solve it? That method just loads the model.

balajiwix commented

Hi Guys,

Please help me. While training the code with the test data, I only get "Generating first batch"; it never gets to showing the step time and step loss :(. I gave all the parameters mentioned in the training steps.

Epoch ........ 0
2018-05-20 08:21:01,333 root INFO Generating first batch)
Epoch ........ 1
2018-05-20 08:21:04,836 root INFO Generating first batch)
Epoch ........ 2
2018-05-20 08:21:08,310 root INFO Generating first batch)
Epoch ........ 3
2018-05-20 08:21:11,780 root INFO Generating first batch)
Epoch ........ 4
