calamari 2.1.* slower than 2.0.*, slower than 1.0.5 #290
I managed to run an isolated benchmark. Summary: Calamari v1.0.5 is the fastest one. No matter which TensorFlow version was used for 1.0.5, the performance was similar.

Hardware

Model Name: MacBook Pro

Benchmarks

Python 3.7.11

calamari v1.0.5, tensorflow 2.7.0
calamari v1.0.5, tensorflow 2.4.4
calamari v1.0.5, tensorflow 2.3.4
calamari v2.0.0, tensorflow 2.3.4, tfaip 1.0.1
calamari v2.0.1, tensorflow 2.3.4, tfaip 1.0.1
calamari v2.0.2, tensorflow 2.3.4, tfaip 1.0.1
calamari v2.1.0, tensorflow 2.4.4, tfaip 1.1.1

It took ages, with plenty of:

Skipping calamari versions 2.1.1, 2.1.2, 2.1.3 due to the same problem.
calamari v2.1.4, tensorflow 2.4.4, tfaip 1.2.5

For n = 1 it prints each time:

WARNING:tensorflow:11 out of the last 11 calls to <function Model.make_predict_function.<locals>.predict_function at 0x7ffbfbc8f560> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
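For context, this warning means TensorFlow is rebuilding (retracing) the compiled graph of a tf.function, typically because it keeps seeing new input shapes. Whether that is exactly what happens inside Calamari's prediction loop is not clear from the warning alone, but here is a minimal, self-contained sketch in plain TensorFlow (unrelated to Calamari's internals) of what triggers retracing and how a relaxed input signature avoids it:

```python
import tensorflow as tf

# A plain tf.function is traced once per distinct input shape; feeding several
# different shapes in a row causes repeated retracing and, eventually, the
# warning quoted above (cause (2) in the message).
@tf.function
def double(x):
    return x * 2

for n in range(1, 6):
    double(tf.zeros([n]))  # five different shapes -> five traces

# Declaring a signature with an unknown (None) dimension lets a single trace
# cover all 1-D float inputs, so no retracing occurs. Alternatively,
# @tf.function(experimental_relax_shapes=True), named in the warning, lets
# TensorFlow generalize shapes automatically.
@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
def double_relaxed(x):
    return x * 2

for n in range(1, 6):
    double_relaxed(tf.zeros([n]))  # single trace reused for every call
```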
Hi @wosiu, thanks for the benchmarks, that's really interesting! To avoid messing up the tensorflow imports, I'd suggest putting them in a subprocess:

```python
import argparse
import tempfile
import time
import numpy as np
from prettytable import PrettyTable
import multiprocessing


def benchmark_prediction(queue, model, batch_size, processes, n_examples, runs=10):
    # Import inside the worker so tensorflow is only loaded in the spawned process.
    from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

    params = PredictorParams(silent=True)
    predictor = Predictor.from_checkpoint(params, model)
    predictor.data.params.pre_proc.run_parallel = False
    predictor.data.params.post_proc.run_parallel = False
    data = (np.random.random((400, 48)) * 255).astype(np.uint8)
    print("Running with bs={}, proc={}, n={}".format(batch_size, processes, n_examples))
    start = time.time()
    for i in range(runs):
        list(predictor.predict_raw([data] * n_examples))
    end = time.time()
    queue.put((end - start) / runs)
    # return (end - start) / runs


def benchmark_subprocess(*args, **kwargs):
    """Run in a subprocess to avoid tf imports messing up performance."""
    c = multiprocessing.get_context("spawn")
    q = c.Queue()
    p = c.Process(target=benchmark_prediction, args=(q, *args), kwargs=kwargs)
    p.start()
    res = q.get()
    p.join()
    return res


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--working_dir", default=tempfile.gettempdir(), type=str)
    parser.add_argument("--model", required=True)
    parser.add_argument("--processes", default=multiprocessing.cpu_count(), type=int)
    args = parser.parse_args()

    batch_sizes = [1, 5, 10, 20, 50]
    tab = PrettyTable(["n"] + list(map(str, batch_sizes)))
    for n_examples in [1, 10, 100, 1000]:
        results = [benchmark_subprocess(args.model, bs, args.processes, n_examples) for bs in batch_sizes]
        tab.add_row([n_examples] + results)
    print(tab)


if __name__ == "__main__":
    main()
```

Here are my results on a Quad Core Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz:

py 3.9, tf 2.7, calamari 1.0.5 (newer python version, newer tensorflow, no subprocess for the benchmark)
py 3.8, tf 2.4.4, calamari 1.0.5
py 3.8, tf 2.4.4, calamari 2.1.4, pre/post_proc.run_parallel=True
py 3.8, tf 2.4.4, calamari 2.1.4, pre/post_proc.run_parallel=False
Conclusion?

The positive effects of increasing the batch size seem to be missing in C2. Parallel pre- and post-processing slows down prediction if we are running on a limited number of CPU cores and without a GPU. When running with the same Python and TensorFlow version, C2 seems to be faster when run on more than just half a page of line images, at least on my machine (could be different for a larger number of cores?).

@ChWick: tf 2.7 seems to be quite a bit faster than 2.4.4. Something prevents us from upgrading, what was that again?

The retracing warnings can be ignored, I suppose. TensorFlow insists on counting the occurrences of calling a prediction function and complaining if that happens too often.

Edit: py 3.9, tf 2.7, calamari 2.1.4, pre/post_proc.run_parallel=False
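For anyone running CPU-only on a machine with few cores, a minimal sketch of the run_parallel workaround discussed above, using the same attributes the benchmark script flips (the checkpoint path is a placeholder):

```python
from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

# Run the pre- and post-processing pipelines sequentially instead of in
# worker processes, which avoids process-pool overhead on few-core CPUs.
predictor = Predictor.from_checkpoint(PredictorParams(silent=True), "path/to/model.ckpt")  # placeholder path
predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False
```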
Some more benchmarks, this time with GPU: Quad Core Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, GeForce GTX 1050 Ti

py 3.9, tf 2.4.1, calamari 1.0.5
py 3.9, tf 2.4.1, calamari 2.1.4, pre/post_proc.run_parallel=True
py 3.9, tf 2.4.1, calamari 2.1.4, pre/post_proc.run_parallel=False
All in all it seems there is some big overhead introduced after 1.0.5, which is most impactful for small n.
That's true. It is optimized for the use case of large collections (books or groups of books) and GPU. Also, the main focus so far was on training time, much less on prediction.
After bumping to:
Calamari 2.1.4
TF 2.4.3
and setting

pre/post_proc.run_parallel = False
it's indeed faster compared to the previous 2.1.*; however, the newest Calamari is still 4-5 times slower than 1.0.5.
I did a quick benchmark:
1.0.5: 588ms
2.1.4: 2849ms
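For reference, a rough sketch of how a single per-call timing like the numbers above could be measured (the actual benchmark code is not shown in this comment; the checkpoint path is a placeholder and the line image is a dummy, as in the script further up):

```python
import time
import numpy as np
from calamari_ocr.ocr.predict.predictor import Predictor, PredictorParams

predictor = Predictor.from_checkpoint(PredictorParams(silent=True), "path/to/model.ckpt")  # placeholder path
predictor.data.params.pre_proc.run_parallel = False
predictor.data.params.post_proc.run_parallel = False

line = (np.random.random((400, 48)) * 255).astype(np.uint8)  # dummy 400x48 line image
start = time.perf_counter()
list(predictor.predict_raw([line]))  # single prediction call
print("{:.0f} ms".format((time.perf_counter() - start) * 1000))
```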