NaN gradient may be due to weight initialization #22
Hi Victor,

Thanks for taking an interest in my work. As for your first question: no, I haven't tried other initialization strategies, but I think your approach makes sense. Maybe you'd care to contribute to the repo?

For the second question: IIRC (it was a long time ago that I wrote this code), ivec and jvec are constructed from the preprocessed patient records, so there is no concept of a "patient" in the minibatch. There is just a bunch of random visits from the EHR.

Best
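To make the ivec/jvec remark concrete, here is a minimal sketch — my reading of the thread, not the repo's actual code — of gathering co-occurring code pairs per visit and pooling them across all visits, with no patient grouping anywhere:

```python
# Hypothetical reconstruction: ivec/jvec as flat lists of co-occurring
# code pairs, pooled across visits (no notion of which patient a visit
# belongs to, matching Ed's description of the minibatch).
def build_ij(visits):
    """visits: list of visits, each a list of medical-code ints."""
    ivec, jvec = [], []
    for visit in visits:
        for i in visit:
            for j in visit:
                if i != j:  # every ordered pair of distinct codes in a visit
                    ivec.append(i)
                    jvec.append(j)
    return ivec, jvec

# Two visits (possibly from different patients) yield 3*2 + 2*1 = 8 pairs.
ivec, jvec = build_ij([[1, 2, 3], [4, 5]])
```

Under this reading, any averaging done over the resulting vectors is an average over visit-level pairs in the batch, which is relevant to the averaging question raised in the issue.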
Hi Ed! Thanks for the reply! I really appreciate it! I am porting your code to TF2 and testing it; I will see if I can contribute it to the repo. I am also comparing the results against an implementation that follows your paper exactly. My data is larger (~2M patients, ~77k medical codes), and it seems to take 2.5 days to train 1 epoch on a single CPU...
Sounds interesting. Feel free to share any results from your experiments, so that others might gain new knowledge!
I got my 10 epochs of training done, and I found that 80% of the codes have all-0 embeddings (I am taking ...). Also, I found that porting the code loss to TF 2 has an issue when calculating the exponential terms: taking the exponential of the vector product requires the vectors to be sparse; otherwise the value becomes very large. So I switched to a TensorFlow function that prevents the overflow, and it's 3 times slower...
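The overflow described here is the classic `exp` blow-up on large inner products. A minimal NumPy sketch of the failure and of the standard log-sum-exp workaround (which TF2 ships as `tf.reduce_logsumexp`) — the scores below are made-up values standing in for code-embedding inner products:

```python
import numpy as np

def naive_log_sum_exp(x):
    # Direct computation: exp() overflows once entries exceed ~709 (float64).
    return np.log(np.sum(np.exp(x)))

def stable_log_sum_exp(x):
    # Log-sum-exp trick: pull the max out before exponentiating,
    # so the largest argument passed to exp() is exactly 0.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

scores = np.array([10.0, 800.0, 900.0])  # hypothetical large inner products
naive_log_sum_exp(scores)   # inf: exp(900) overflows
stable_log_sum_exp(scores)  # finite, ~900
```

This trades a couple of extra ops for numerical safety, which is consistent with the stable version being somewhat slower than the naive one.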
Hi Ed,

I saw in your code that the weights are initialized with a truncated normal distribution. When I ran it, this seemed to feed large values to `exp` in the medical-code-loss part, resulting in `inf` in the loss and `NaN` gradients. Also, because of such initial weights, the loss in general is pretty high, around several hundred, and the L2 loss in particular is around tens of thousands. I then changed the weight initialization to uniform over a small interval `[-0.1, 0.1]`, which seems to produce a loss of reasonable magnitude (under 10). I wonder if you still remember whether you tried other weight initializations and how they impacted the results.

Another question I have: in the paper, the loss is averaged over `T`. Is this `T` visits in the batch, or visits per patient? In your code, it seems your `ivec` and `jvec` are generated for the whole batch. So in the medical-code-loss calculation, is it averaging over all visits in a batch, instead of averaging per patient and then averaging over all patients in a batch?

Thanks!