Adam Stooke
2018-01-19 20:42:16 UTC
Hi,
I am holding an array on the GPU (in a shared variable) and sampling
random minibatches from it, but there seems to be a call to HostFromGpu
for every index, which causes significant delay. Is there a way to avoid
this? Below is a minimal code example, plus the debugprint and profiling
output. The same thing happens if I use theano.map. The problem is much
worse in my actual code, which uses multiple levels of indexing: despite
the much larger data arrays, the time spent in the many HostFromGpu calls
dominates.
Code example:
import theano
import theano.tensor as T
import numpy as np
H = W = 3
N = 10
B = 3
src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
idxs = T.ivector('idxs')
selections = [src[idxs[i]] for i in range(B)]
new_dest = T.stack(selections)
updates = [(dest, new_dest)]
f = theano.function(inputs=[idxs], updates=updates)
np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
print(dest.get_value())
f(np_idxs)
print(dest.get_value())
theano.printing.debugprint(f)
for _ in range(10):
    f(np_idxs)
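For comparison, the pure-NumPy version of what I'm trying to do is a single fancy-index gather, sketched below. I'd expect the analogous Theano expression `new_dest = src[idxs]` (advanced indexing on the whole vector, rather than one scalar index per slice) to compile to a single GPU gather op and avoid the per-index transfers, though I haven't confirmed which op it actually produces:

```python
import numpy as np

H = W = 3
N = 10
B = 3
rng = np.random.RandomState(0)
src = rng.rand(N, H, W).astype(np.float32)
idxs = rng.randint(low=0, high=N, size=B).astype(np.int32)

# Single advanced-indexing gather: one operation selects all B slices.
dest_vectorized = src[idxs]  # shape (B, H, W)

# Equivalent to the per-index loop used in the Theano graph above.
dest_loop = np.stack([src[idxs[i]] for i in range(B)])

assert dest_vectorized.shape == (B, H, W)
assert np.array_equal(dest_vectorized, dest_loop)
```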
Debugprint (notice that each ScalarFromTensor is fed by its own
HostFromGpu, each with a unique ID):
GpuJoin [id A] ''   16
 |TensorConstant{0} [id B]
 |InplaceGpuDimShuffle{x,0,1} [id C] ''   15
 | |GpuSubtensor{int32} [id D] ''   14
 |   |src [id E]
 |   |ScalarFromTensor [id F] ''   13
 |     |HostFromGpu(gpuarray) [id G] ''   12
 |       |GpuSubtensor{int64} [id H] ''   11
 |         |GpuFromHost<None> [id I] ''   0
 |         | |idxs [id J]
 |         |Constant{0} [id K]
 |InplaceGpuDimShuffle{x,0,1} [id L] ''   10
 | |GpuSubtensor{int32} [id M] ''   9
 |   |src [id E]
 |   |ScalarFromTensor [id N] ''   8
 |     |HostFromGpu(gpuarray) [id O] ''   7
 |       |GpuSubtensor{int64} [id P] ''   6
 |         |GpuFromHost<None> [id I] ''   0
 |         |Constant{1} [id Q]
 |InplaceGpuDimShuffle{x,0,1} [id R] ''   5
   |GpuSubtensor{int32} [id S] ''   4
     |src [id E]
     |ScalarFromTensor [id T] ''   3
       |HostFromGpu(gpuarray) [id U] ''   2
         |GpuSubtensor{int64} [id V] ''   1
           |GpuFromHost<None> [id I] ''   0
           |Constant{2} [id W]
Theano profile (over 10 calls to the function; notice 10 calls to
GpuFromHost but 30 calls to HostFromGpu):
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.9%    38.9%       0.001s       5.27e-05s     C       10        1   theano.gpuarray.basic_ops.GpuJoin
  31.5%    70.4%       0.000s       1.42e-05s     C       30        3   theano.gpuarray.basic_ops.HostFromGpu
  15.0%    85.4%       0.000s       2.03e-05s     C       10        1   theano.gpuarray.basic_ops.GpuFromHost
   7.4%    92.8%       0.000s       1.67e-06s     C       60        6   theano.gpuarray.subtensor.GpuSubtensor
   6.0%    98.8%       0.000s       2.69e-06s     C       30        3   theano.gpuarray.elemwise.GpuDimShuffle
   1.2%   100.0%       0.000s       5.56e-07s     C       30        3   theano.tensor.basic.ScalarFromTensor
   ... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Appreciate any tips! Thanks!
Adam
--
---
You received this message because you are subscribed to the Google Groups "theano-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.