How does CaffeOnSpark exchange and synchronize each executor's parameters? #262
Comments
Assuming multiple GPUs per node and multiple nodes, there are two levels of exchange: across GPUs within each node, and across nodes. Both need to be done synchronously. No GPU is allowed to run the next batch unless everybody gets the updated weights.
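A minimal single-process sketch of that two-level synchronous scheme, assuming gradients are simply summed at each level (all names and the flat `std::vector` buffers are illustrative, not CaffeOnSpark's actual code):

```cpp
// Level 1: combine gradients of the GPUs inside one node.
// Level 2: combine across nodes; only then may any GPU start the next batch.
#include <cstdio>
#include <vector>

int main() {
    const int nodes = 2, gpus_per_node = 2, len = 4;

    // grad[node][gpu][i]: each GPU's local gradient after its mini-batch.
    std::vector<std::vector<std::vector<float>>> grad(
        nodes, std::vector<std::vector<float>>(gpus_per_node,
                                               std::vector<float>(len, 1.0f)));

    // Level 1: reduce across GPUs within each node.
    std::vector<std::vector<float>> node_grad(nodes, std::vector<float>(len, 0.0f));
    for (int n = 0; n < nodes; ++n)
        for (int g = 0; g < gpus_per_node; ++g)
            for (int i = 0; i < len; ++i) node_grad[n][i] += grad[n][g][i];

    // Level 2: reduce across nodes (a ring all-reduce in practice; a plain
    // sum here for brevity), so every node ends up with the global gradient.
    std::vector<float> global_grad(len, 0.0f);
    for (int n = 0; n < nodes; ++n)
        for (int i = 0; i < len; ++i) global_grad[i] += node_grad[n][i];

    // Broadcast back: every GPU gets identical updated values before the
    // next batch starts -- this is the synchronization barrier.
    for (int n = 0; n < nodes; ++n)
        for (int g = 0; g < gpus_per_node; ++g)
            grad[n][g] = global_grad;

    std::printf("global gradient[0] = %.0f (sum over %d GPUs)\n",
                global_grad[0], nodes * gpus_per_node);
    return 0;
}
```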
@junshi15 Thanks, but I have a question: why not use a parameter server, whose asynchronous updates could make training faster?
The sync version is simple to implement and verify. We do not need async training at this moment. In addition, we are limited by our resources. Your contribution is welcome.
@junshi15 @guyang88 Excuse me, I've been looking into this problem recently. In the source code (caffe-distri/src/main/cpp/util/socket_sync_cpu.cpp and rdma_sync.cpp), it seems that data is passed between nodes as slices rather than as the full weights or gradients. Is that so? I'm a little confused. Can you help me? Thank you!
@jacklonghui Regarding slicing, it is an efficient implementation of all-reduce. If all the clients sent their gradients to one node, that node would become a bottleneck. What's implemented in CaffeOnSpark is a ring algorithm, where each node sends and receives a portion of the entire gradient.
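To illustrate the slicing idea, here is a small single-process simulation of a ring all-reduce, where each "node" forwards one chunk per step and no node ever sends the full gradient to a central server. It is a sketch of the general technique only, not CaffeOnSpark's RDMA/socket implementation; node count, chunk layout, and buffer names are assumptions for the example:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int nodes = 4;          // executors arranged in a ring
    const int len   = 8;          // gradient length (divisible by nodes here)
    const int chunk = len / nodes;

    // Each node starts with its own local gradient (node n: all values n+1).
    std::vector<std::vector<float>> grad(nodes, std::vector<float>(len));
    for (int n = 0; n < nodes; ++n)
        for (int i = 0; i < len; ++i) grad[n][i] = float(n + 1);

    auto mod = [&](int x) { return (x % nodes + nodes) % nodes; };

    // Phase 1: reduce-scatter. In each of (nodes-1) steps, node n "sends" one
    // chunk to node (n+1) % nodes, which accumulates it into its own copy.
    for (int step = 0; step < nodes - 1; ++step) {
        std::vector<std::vector<float>> in(nodes, std::vector<float>(chunk));
        for (int n = 0; n < nodes; ++n) {
            int c = mod(n - step);                        // chunk to send
            for (int i = 0; i < chunk; ++i)
                in[(n + 1) % nodes][i] = grad[n][c * chunk + i];
        }
        for (int n = 0; n < nodes; ++n) {
            int c = mod(n - step - 1);                    // chunk just received
            for (int i = 0; i < chunk; ++i)
                grad[n][c * chunk + i] += in[n][i];
        }
    }

    // Phase 2: all-gather. Each node now owns one fully reduced chunk and
    // circulates it so every node ends up with the complete reduced gradient.
    for (int step = 0; step < nodes - 1; ++step) {
        std::vector<std::vector<float>> in(nodes, std::vector<float>(chunk));
        for (int n = 0; n < nodes; ++n) {
            int c = mod(n + 1 - step);                    // reduced chunk to forward
            for (int i = 0; i < chunk; ++i)
                in[(n + 1) % nodes][i] = grad[n][c * chunk + i];
        }
        for (int n = 0; n < nodes; ++n) {
            int c = mod(n - step);
            for (int i = 0; i < chunk; ++i)
                grad[n][c * chunk + i] = in[n][i];
        }
    }

    // Every node should now hold the same summed gradient: 1+2+3+4 = 10 everywhere.
    for (int n = 0; n < nodes; ++n) {
        std::printf("node %d:", n);
        for (int i = 0; i < len; ++i) std::printf(" %.0f", grad[n][i]);
        std::printf("\n");
    }
    return 0;
}
```

Note that per step each node only transfers one chunk (1/nodes of the gradient), which is why slices rather than full gradients appear in socket_sync_cpu.cpp and rdma_sync.cpp.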
@junshi15 OK, thank you! I got it.
@junshi15 OK, thank you! About the lines I quoted: does a "master node" exist for every node? If not, is there a single "master node" that collects and processes the gradients that everybody sends?
The line you quote is conceptually true. What's implemented here is different.
@junshi15 OK, thank you!
@anfeng @junshi15 How does CaffeOnSpark exchange and synchronize each executor's parameters?