Sheng Su 2015-10-19
Last week:
- I have found out why training with two GPUs works well while training with four GPUs does not. This is based on two facts:
- - 1. Mini-batch training: the gradient is summed over all frames in the batch.
- - 2. Mini-batch size: the baseline does not converge if the mini-batch size is set above 1024.
- Reason:
- - Let the mini-batch size be M. After N mini-batches we sum all the gradients from the 4 GPUs and update the net once (during those N mini-batches the net is not updated). This is equivalent to the baseline with a mini-batch size of M*N*4, much larger than the baseline's. However, if we update the net during the N mini-batches, the effective mini-batch size is, to some extent, reduced. That is why two GPUs work well and four GPUs do not (see the sketch below).
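The following is a minimal sketch of the two update schemes, not the actual training code: a toy least-squares model in NumPy, with hypothetical function names and sizes, showing why accumulating summed gradients over N mini-batches from 4 GPUs before a single update behaves like one baseline update with an effective mini-batch of M*N*4 frames, whereas updating during the N mini-batches keeps each step close to size M.

```python
import numpy as np

def summed_gradient(w, x, y):
    """Gradient summed (not averaged) over all frames in the mini-batch
    (fact 1).  Toy least-squares model, used only for illustration."""
    return x.T @ (x @ w - y)

def delayed_update(w, batches, lr):
    """4-GPU scheme: accumulate gradients over all given mini-batches,
    then apply a single update.  With 4 GPUs and N mini-batches of M frames
    each, this one step sees M * N * 4 frames, like a baseline step with
    that (much larger) mini-batch size."""
    g = np.zeros_like(w)
    for x, y in batches:
        g += summed_gradient(w, x, y)   # the net is NOT updated here
    return w - lr * g

def interleaved_update(w, batches, lr):
    """Updating during the N mini-batches: each step only sees one
    mini-batch, so the effective mini-batch size stays close to M."""
    for x, y in batches:
        w = w - lr * summed_gradient(w, x, y)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.zeros(5)
    # 4 GPUs * N=2 mini-batches of M=256 frames -> effective size 2048
    batches = [(rng.normal(size=(256, 5)), rng.normal(size=256))
               for _ in range(8)]
    w_delayed = delayed_update(w, batches, lr=1e-4)
    w_interleaved = interleaved_update(w, batches, lr=1e-4)
```

Both functions consume the same frames; the difference is only how often the net is updated, which is what changes the effective mini-batch size.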
This week:
- 1. Try net averaging (a minimal sketch is given after this list).
- 2. Learn NG-SGD (natural gradient SGD).
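For plan item 1, net averaging is taken here to mean averaging the per-GPU copies of the net parameters and broadcasting the result back; a minimal sketch under that assumption (function and variable names are hypothetical, parameters are treated as flat vectors):

```python
import numpy as np

def average_nets(worker_params):
    """Average the parameter vectors held by the individual GPUs and
    give every GPU the same averaged copy."""
    avg = np.mean(np.stack(worker_params, axis=0), axis=0)
    return [avg.copy() for _ in worker_params]

# Example: four GPUs, each with its own (diverged) parameter copy.
params = [np.random.randn(10) for _ in range(4)]
params = average_nets(params)
```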