Diana Arellano
2017-09-27 16:58:12 UTC
Hello everyone,
I have a piece of code running with Theano on one GPU, and I would like to
parallelize it and distribute it across four GPUs. Following the advice of
Fred Bastien, I decided on Platoon, as I am currently using just one node
(with four GPUs).
To get familiar with Platoon, I started by running the LSTM example in
<platoon>/example/lstm and found different behaviors that I hope you can
help me understand.
1) When running platoon-launcher lstm -D cuda0,
the behavior is as expected and the training ends without problems.
2) When running platoon-launcher with two or three GPUs (e.g.
platoon-launcher lstm -D cuda0 cuda1 cuda2),
the training takes an extremely long time to end. The error gets very
small, but I noticed that the stop condition seems to be reached only when
self.uidx hits self.max_mb. I didn't let it run until the end, but this is
what the controller outputs at uidx 41090 (see also the sketch of the stop
logic right after this log):
self.uidx: 41090
self.max_mb: 999000
self.bad_counter: 0
self.patience: 10
controller self.csocket.send_json:
harr: [ 0.4  0.44761905  0.44761905  0.44761905  0.44761905  0.44761905
0.4 0.44761905 0.44761905 0.4 0.4 0.4
0.44761905 0.44761905 0.43809524 0.21904762 0.22857143 0.16190476
0.3047619 0.07619048 0.0952381 0.1047619 0.06666667 0.08571429
0.02857143 0.02857143 0.01904762 0.01904762 0.01904762 0.01904762
0.01904762 0.08571429 0.00952381 0.01904762 0. 0.
0.00952381 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. ]
harr.min(): 0.0
len(self.history_errs): 111
Best error valid: 0.0 test: 0.248
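For what it's worth, this is how I read the stop logic from the log fields
above. The helper below is only my own sketch: the name check_stop and the
exact comparisons are assumptions, not the example's actual code. If it is
roughly right, then once the validation error reaches 0.0 the "new best"
branch keeps resetting bad_counter to 0, the early stop never triggers, and
the run can only end when self.uidx reaches self.max_mb = 999000, which
would explain the very long runtime in 2).

import numpy as np

# My own sketch of the stop logic, reconstructed from the log fields above
# (uidx, max_mb, patience, bad_counter, history_errs). The name check_stop
# and the exact comparisons are assumptions, not the example's actual code.
def check_stop(uidx, max_mb, history_errs, patience, bad_counter):
    """Return (stop, new_bad_counter) after one validation round."""
    harr = np.asarray(history_errs)
    valid_err = harr[-1]

    if valid_err <= harr.min():
        # New best validation error so far -> reset the bad-valid counter.
        bad_counter = 0
    elif len(harr) > patience and valid_err >= harr[:-patience].min():
        # No improvement over the best error from `patience` rounds ago.
        bad_counter += 1

    early_stop = bad_counter >= patience   # in 3) the worker stopped at 10 == 10
    hard_stop = uidx >= max_mb             # max_mb is 999000 in my runs
    return early_stop or hard_stop, bad_counter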
3) When running platoon-launcher with all four GPUs, one of the workers
stops after approx. 12 minutes with the following output:
Train cost: 10.3616333008
Syncing with global params
valid
('Valid ', 0.39047619047619042, 'Test ', 0.498)
stop
The controller then has this as its last output:
self.uidx: 8550
self.max_mb: 999000
self.bad_counter: 10
self.patience: 10
harr: [ 0.39047619  0.43809524  0.39047619  0.66666667  0.64761905  0.35238095
0.72380952 0.27619048 0.4 0.6 0.34285714 0.66666667
0.6 0.34285714 0.6952381 0.59047619 0.60952381 0.44761905
0.55238095 0.42857143 0.42857143 0.57142857 0.39047619]
len(self.history_errs): 23
but the remaining workers are neither killed nor informed that the training
is over, so those processes stay in what I assume is a "waiting" state, or
in some other state that I don't quite understand. The sketch below shows
the behavior I was expecting from the controller.
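To make it concrete, this is roughly what I expected to happen. The names
below (SketchController, handle_request, shutdown, 'done') are placeholders
of mine and not Platoon's actual API; the point is only that, once one
worker reports the stop condition, the others should also receive "stop" on
their next request instead of being left hanging.

# Not Platoon's real API -- just the behavior I expected, with my own
# placeholder names (SketchController, handle_request, shutdown, 'done').
class SketchController(object):
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.stop_requested = False
        self.stopped_workers = set()

    def handle_request(self, worker_id, req):
        """Answer a worker that asks what to do next."""
        if req == 'done':
            # One worker hit the stop condition (e.g. bad_counter == patience).
            self.stop_requested = True

        if self.stop_requested:
            self.stopped_workers.add(worker_id)
            if len(self.stopped_workers) == self.num_workers:
                self.shutdown()  # every worker has now been told to stop
            return 'stop'
        return 'train'

    def shutdown(self):
        print('all %d workers received stop, controller exiting'
              % self.num_workers)

Is something like this supposed to happen in the lstm example, or do I have
to terminate the remaining workers myself?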
Can you help me understand why the behavior differs between 2) and 3)?
Thanks a lot!