The BERT-base model in https://arxiv.org/pdf/1810.04805.pdf uses a hidden size of 768, 12 layers, and 12 attention heads (which are also the defaults in bert.py), while the argparser defaults in __main__.py are 256 hidden features, 8 layers, and 8 heads. Would it make sense to align the example script with the paper? I spent quite a while puzzling over my low GPU utilization with the default configuration. Thanks!
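For a rough sense of the gap, here is a minimal sketch that compares the parameter counts of the two configurations using a generic `nn.TransformerEncoder` as a stand-in for the repo's BERT class (the vocabulary size of 30522 is the paper's WordPiece vocabulary; the 4x feed-forward width follows the paper; everything else here is illustrative, not the repo's actual code):

```python
import torch.nn as nn

def encoder_params(hidden: int, layers: int, heads: int,
                   vocab_size: int = 30522) -> int:
    """Parameter count of a BERT-like encoder with the given shape."""
    layer = nn.TransformerEncoderLayer(
        d_model=hidden, nhead=heads, dim_feedforward=4 * hidden)
    model = nn.Sequential(
        nn.Embedding(vocab_size, hidden),          # token embeddings
        nn.TransformerEncoder(layer, num_layers=layers))
    return sum(p.numel() for p in model.parameters())

print(f"argparser default (256/8/8):   {encoder_params(256, 8, 8):,}")
print(f"BERT-base paper   (768/12/12): {encoder_params(768, 12, 12):,}")
```

The 768/12/12 configuration comes out roughly 8x larger (on the order of BERT-base's ~110M parameters vs. ~14M for the default), which would explain why the small default barely keeps a GPU busy.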