Testing, bad results even on training sample after convergence #28

Open
Alexjap opened this issue Feb 15, 2017 · 15 comments

Alexjap commented Feb 15, 2017

Right now, following the instructions in the README:

The training procedure seems to converge (perplexity around 1 on the toy example), but when we test on the same data (the toy example itself) the results are quite bad. Is anyone experiencing this behavior as well? I tried to look into the bucketing part of the code; I'm not sure why the bucketing in evaluation and in training differ, but that doesn't seem to be the cause anyway (I tried with the same bucketing and still got bad results).
The versions of Keras and TensorFlow are the recommended ones (Keras 1.1.1 and TF 0.11.0).

Alexjap changed the title from "Testing on training sample, bad results" to "Testing, bad results even on training sample after convergence" on Feb 15, 2017
ddaue commented Feb 22, 2017

I solved it with the following modification in model.py, line 352, though I have no explanation of why it should be like that... I searched a lot...

#if not forward_only:
if True:
    # Workaround: always feed learning phase = 1 (training mode), even at test time.
    input_feed[K.learning_phase()] = 1
else:
    input_feed[K.learning_phase()] = 0

Alexjap commented Feb 23, 2017

Yeah, it would work on the training data, but I don't think it can be considered a fix. Setting the learning phase to 1 always means that we are in training mode, so any layer that behaves differently in train/test will be set to train even when we are testing.
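For reference, a minimal sketch of how the learning-phase flag is meant to be fed (assuming the Keras 1.x backend API; build_feed and forward_only are illustrative names, not necessarily the repo's own):

import keras.backend as K

# K.learning_phase() is a placeholder that layers such as Dropout and
# BatchNormalization read to choose between train and test behavior.
def build_feed(input_feed, forward_only):
    # forward_only=True means inference, so the learning phase should be 0;
    # feeding 1 unconditionally forces every such layer into training mode.
    input_feed[K.learning_phase()] = 0 if forward_only else 1
    return input_feed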

da03 (Owner) commented Feb 24, 2017

Yes. If you set that flag to 1 during the test phase, it basically means that when you receive a test batch you do the same thing as in training: subtracting some mean computed over the test batch. While that's not inconsistent between training and testing, doing it is kind of unfair, since presumably we should only use a test point's own information to classify it, without looking at statistics over a batch of test examples. Sorry, I'm busy with a deadline; I will look into the code later.

seed93 commented Feb 27, 2017

This is because of the difference in BatchNormalization behavior between training and testing.
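For context, a minimal numpy sketch of that difference (illustrative only, not code from this repo): in training mode BatchNormalization normalizes with the current batch's statistics, while in test mode it uses the running mean/variance accumulated during training.

import numpy as np

def batchnorm(x, running_mean, running_var, training, eps=1e-5):
    if training:
        # Training mode: normalize with the statistics of the current batch.
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        # Test mode: normalize with the running statistics saved from training.
        mean, var = running_mean, running_var
    return (x - mean) / np.sqrt(var + eps)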

NourozR commented Mar 3, 2017

I trained the model down to step perplexity = 1.006652, error = 0.0082, then tried to test using the SVT and IIIT5K datasets. For both datasets I got 100% incorrect results, which is totally unexpected. I then used the provided pre-trained model, but still got the same results.

I use Keras 1.1.1 and TF 0.12.1. I used the distance package and tried other datasets as well. Any help? This is an important project for me, please help.

shrazo commented Mar 4, 2017

Remove the tf.gfile.Exists(ckpt.model_checkpoint_path) check from model.py.
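A sketch of what that change looks like (the usual TF 0.x restore pattern; the exact surrounding code in model.py may differ, and model_dir, saver, and sess stand for whatever the surrounding code defines). With the V2 checkpoint format used by newer TF versions, the checkpoint path is a prefix rather than a single file, so tf.gfile.Exists() can return False even though the checkpoint exists:

ckpt = tf.train.get_checkpoint_state(model_dir)
# Before: the extra Exists() check fails on split (V2) checkpoints,
# so testing runs with freshly initialized weights instead of the trained ones.
# if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
else:
    sess.run(tf.initialize_all_variables())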

raoweijin commented Mar 9, 2017

I am hitting the same issue as Alexjap. Has anyone found the root cause?
keras version: 1.1.1
tensorflow version: 0.12.1
Windows 10
I just created 3 pictures for 'a', 'b', 'c' and trained on them. The picture size is 31*31. I tested on the same 3 pictures; the results are bad, too.
If I modify the code as below, the test result is OK.
#if not forward_only:
if True:
    input_feed[K.learning_phase()] = 1
else:
    input_feed[K.learning_phase()] = 0

Train result:
2017-03-08 16:35:58,463 root INFO step_time: 1.881249, step_loss: 0.001654, step perplexity: 1.001656
2017-03-08 16:35:58,469 root INFO current_step: 198
2017-03-08 16:36:00,323 root INFO step_time: 1.854232, step_loss: 0.001635, step perplexity: 1.001637
2017-03-08 16:36:00,329 root INFO current_step: 199
2017-03-08 16:36:02,229 root INFO step_time: 1.900263, step_loss: 0.001617, step perplexity: 1.001618
2017-03-08 16:36:02,679 root INFO global step 200 step-time 1.91 loss 0.156341 perplexity 1.17
2017-03-08 16:36:02,679 root INFO Saving model, current_step: 200

Test result:
2017-03-08 16:37:50,221 root INFO Reading model parameters from ./results/model\translate.ckpt-200
2017-03-08 16:38:00,177 root INFO model is established and start to launch model
2017-03-08 16:38:00,178 root INFO start to test
2017-03-08 16:38:00,178 root INFO Compare word based on edit distance.
2017-03-08 16:38:00,844 root INFO step_time: 0.598397, loss: 1.272859, step perplexity: 3.571049
2017-03-08 16:38:00,847 root INFO 0.000000 out of 1 correct
2017-03-08 16:38:01,183 root INFO step_time: 0.335222, loss: 2.004660, step perplexity: 7.423572
2017-03-08 16:38:01,185 root INFO 0.000000 out of 2 correct
2017-03-08 16:38:01,494 root INFO step_time: 0.308204, loss: 1.537001, step perplexity: 4.650624
2017-03-08 16:38:01,496 root INFO 0.000000 out of 3 correct

Alexjap commented Mar 9, 2017

I think what seed93 said might make sense; maybe it is related to the batch normalisation behavior, but I didn't have time to test without it to see if things change.
In the CNN part of the model (the Keras code) we should try removing the batch normalisation layers and training again to see if things change. I currently don't have access to a proper machine to try this out and am a bit busy.
The test would be to comment out all the model.add(layers.BatchNormalization(axis=1)) calls in cnn.py, retrain, and see if testing on the training data is consistent (a sketch of that edit is below).
By removing the batch normalisation we should expect slower convergence during training, but it would be enough to check whether it's actually the BN that breaks the model at test time.
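A minimal sketch of that experiment, assuming cnn.py builds a Keras 1.x Sequential model (the layers shown are illustrative, not the repo's exact architecture):

from keras import layers, models

def build_cnn_without_bn(input_shape):
    model = models.Sequential()
    model.add(layers.Convolution2D(64, 3, 3, border_mode='same',
                                   input_shape=input_shape))
    # model.add(layers.BatchNormalization(axis=1))  # commented out for the test
    model.add(layers.Activation('relu'))
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    return model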

seed93 commented Mar 9, 2017

@Alexjap I used the pull request's code and found this bug. Change cnn_model = CNN(self.img_data, True) #(not self.forward_only)) to cnn_model = CNN(self.img_data, not self.forward_only)
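That is, roughly this change in model.py (a sketch; the surrounding code may differ slightly):

# Before: the CNN is always built in training mode, so BatchNormalization
# keeps using per-batch statistics even at test time.
# cnn_model = CNN(self.img_data, True)  # (not self.forward_only))

# After: pass the real phase; forward_only=True builds the CNN in test mode.
cnn_model = CNN(self.img_data, not self.forward_only)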

Alexjap commented Mar 9, 2017

@seed93 I quickly checked the code you mentioned. If we change the code like that, we set the CNN model to testing (frozen weights) when we are training and vice versa; it looks a bit strange to me.

NourozR commented Mar 14, 2017

I looked into the code and found a 'false' argument in model.py (lines 204-211) while debugging. The problem was actually that the system was unable to load the trained model. So I edited the code a little and found that the model I trained is now loaded and working. But the accuracy is still low (12-15% for both the SVT and IIIT5K test datasets). The problem is with these variables: "batchnormalization_3_running_mean:0 NOT trainable" and "batchnormalization_3_running_std:0 NOT trainable". This happened because the newer TF and Keras versions can't compute the mean and standard deviation from these two variables, and the same goes for the pre-trained model. And since the models are binary files, there is no room to change them.

Also, in the test phase, the system gives accurate results for the first input of a mini-batch but not for the rest of the data. This was strange to me.

NourozR commented Mar 14, 2017

@raoweijin, I faced the same problem and solved it with this: remove tf.gfile.Exists(ckpt.model_checkpoint_path) from model.py. @shraju024 is right.

jvpoulos commented Apr 7, 2017

Solved this problem with SivanKe#1, as suggested by seed93.

zj463261929 commented

@NourozR I am facing the same problem now. Can removing tf.gfile.Exists(ckpt.model_checkpoint_path) really solve it? That method just loads the model.

balajiwix commented

Hi Guys,

Please help me. While training the code with the test data, I only get "Generating first batch"; it never gets to showing the step time and step loss :(. I gave all the parameters mentioned in the training steps.

Epoch ........ 0
2018-05-20 08:21:01,333 root INFO Generating first batch)
Epoch ........ 1
2018-05-20 08:21:04,836 root INFO Generating first batch)
Epoch ........ 2
2018-05-20 08:21:08,310 root INFO Generating first batch)
Epoch ........ 3
2018-05-20 08:21:11,780 root INFO Generating first batch)
Epoch ........ 4
