Discussion:
[theano-users] Avoiding HostFromGPU at every Index into Shared Variable?
Adam Stooke
2018-01-19 20:42:16 UTC
Hi,

I am holding an array on the GPU (in a shared variable), and I'm sampling
random minibatches from it, but it seems there is a call to HostFromGpu at
every index, which causes significant delay. Is there a way to avoid this?

Here is a minimal code example, plus the debug and profiling printouts.
The same thing happens if I use theano.map. The problem is much worse in
my actual code, which uses multiple levels of indexing--despite also using
much larger data arrays, the time in the many calls to HostFromGpu
dominates.


Code example:

import theano
import theano.tensor as T
import numpy as np

H = W = 3
N = 10
B = 3

src = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="src")
dest = theano.shared(np.zeros([B, H, W], dtype=np.float32), name="dest")
idxs = T.ivector('idxs')

selections = [src[idxs[i]] for i in range(B)]
new_dest = T.stack(selections)
updates = [(dest, new_dest)]
f = theano.function(inputs=[idxs], updates=updates)

np_idxs = np.random.randint(low=0, high=N, size=B).astype(np.int32)
print(dest.get_value())
f(np_idxs)
print(dest.get_value())

theano.printing.debugprint(f)
for _ in range(10):
    f(np_idxs)


Debugprint (notice the separate HostFromGpu, each with its own ID, leading up
to each ScalarFromTensor):

GpuJoin [id A] '' 16
|TensorConstant{0} [id B]
|InplaceGpuDimShuffle{x,0,1} [id C] '' 15
| |GpuSubtensor{int32} [id D] '' 14
| |src [id E]
| |ScalarFromTensor [id F] '' 13
| |HostFromGpu(gpuarray) [id G] '' 12
| |GpuSubtensor{int64} [id H] '' 11
| |GpuFromHost<None> [id I] '' 0
| | |idxs [id J]
| |Constant{0} [id K]
|InplaceGpuDimShuffle{x,0,1} [id L] '' 10
| |GpuSubtensor{int32} [id M] '' 9
| |src [id E]
| |ScalarFromTensor [id N] '' 8
| |HostFromGpu(gpuarray) [id O] '' 7
| |GpuSubtensor{int64} [id P] '' 6
| |GpuFromHost<None> [id I] '' 0
| |Constant{1} [id Q]
|InplaceGpuDimShuffle{x,0,1} [id R] '' 5
|GpuSubtensor{int32} [id S] '' 4
|src [id E]
|ScalarFromTensor [id T] '' 3
|HostFromGpu(gpuarray) [id U] '' 2
|GpuSubtensor{int64} [id V] '' 1
|GpuFromHost<None> [id I] '' 0
|Constant{2} [id W]



Theano profile (over 10 calls to the function--notice 10 calls to GpuFromHost
but 30 calls to HostFromGpu):

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  38.9%   38.9%      0.001s      5.27e-05s      C       10       1    theano.gpuarray.basic_ops.GpuJoin
  31.5%   70.4%      0.000s      1.42e-05s      C       30       3    theano.gpuarray.basic_ops.HostFromGpu
  15.0%   85.4%      0.000s      2.03e-05s      C       10       1    theano.gpuarray.basic_ops.GpuFromHost
   7.4%   92.8%      0.000s      1.67e-06s      C       60       6    theano.gpuarray.subtensor.GpuSubtensor
   6.0%   98.8%      0.000s      2.69e-06s      C       30       3    theano.gpuarray.elemwise.GpuDimShuffle
   1.2%  100.0%      0.000s      5.56e-07s      C       30       3    theano.tensor.basic.ScalarFromTensor
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)



Appreciate any tips! Thanks!
Adam
Adam Stooke
2018-01-24 00:19:12 UTC
I realize now that the above example might seem strange, in that I make the
"selections" an explicit list rather than feeding "idxs" directly into "src".
The reason is that I actually need to get a slice (of fixed size) at each
index. The script below contains the full problem, including three possible
solutions--
1) explicitly construct the list of slices,
2) use theano.map to get the slices,
3) build all the individual indexes corresponding to the slice elements,
gather those at once, and then reshape (each slice becomes its own unit of
data, separated along another dimension).

My observations in testing:
For small batch sizes, like 32, method 3 (idx) is fastest, followed by
method 1 (list). For large batch sizes, like 2048, method 2 (map) is
fastest, and method 1 (list) doesn't finish compiling, at least not after
several minutes. Still, a significant portion of the time in both methods 1
and 2 is spent in HostFromGpu, related to the indexes. The scan op appears
to run on the CPU. However, I think the efficiency of grabbing the full
slices, rather than each and every index, might be what leads to the better
performance at large batch size.

So the question stands: how can I collect indexes/slices from a shared
variable without a HostFromGpu happening for every index?

Please help! :)


import theano
import theano.tensor as T
import numpy as np
import time

E = 4
H = W = 200
N = 2000
B = 32  # 256, 2048
S = 4
LOOPS = 10

LIST = True
MAP = True
IDX = True

np_src = np.random.rand(E, N, H, W).astype(np.float32)
src = theano.shared(np_src, name="src")
np_dest_zeros = np.zeros([B, S, H, W], dtype=np.float32)
idxs_0 = T.lvector('idxs_0')
idxs_1 = T.lvector('idxs_1')

np_idxs_0 = np.random.randint(low=0, high=E, size=B)
np_idxs_1 = np.random.randint(low=0, high=N - S, size=B)  # .astype(np.int32)
np_answer = np.stack([np_src[e, i:i + S]
                      for e, i in zip(np_idxs_0, np_idxs_1)])


# Fixed list of states method ############
if LIST:
    dest_list = theano.shared(np.zeros([B, S, H, W], dtype=np.float32),
                              name="dest_list")
    selections_list = [src[idxs_0[i], idxs_1[i]:idxs_1[i] + S]
                       for i in range(B)]
    new_dest_list = T.stack(selections_list)
    updates_list = [(dest_list, new_dest_list)]
    f_list = theano.function(inputs=[idxs_0, idxs_1], updates=updates_list,
                             name="list")

    # print(dest_list.get_value())
    f_list(np_idxs_0, np_idxs_1)
    # print(dest_list.get_value())
    theano.printing.debugprint(f_list)
    # time.sleep(1)
    # t0_list = time.time()
    for _ in range(LOOPS):
        f_list(np_idxs_0, np_idxs_1)
        # x = dest_list.get_value()
    # t_list = time.time() - t0_list


# mapped list of states method ###########
if MAP:

    # s = theano.shared(S, name="S")
    # print("s.dtype: ", s.dtype, "s.get_value: ", s.get_value())
    dest_map = theano.shared(np_dest_zeros, name="dest_map")


    def get_state(idx_0, idx_1, data):
        # tried using a shared variable in place of "S" here--no effect
        return data[idx_0, idx_1:idx_1 + S]
        # return data[idx_0, slice(idx_1, idx_1 + S)]


    states_map, updates_map = theano.map(
        fn=get_state,
        sequences=[idxs_0, idxs_1],
        non_sequences=src,
    )
    new_dest_map = T.concatenate([states_map])
    updates_map = [(dest_map, new_dest_map)]
    f_map = theano.function(inputs=[idxs_0, idxs_1], updates=updates_map,
                            name="map")

    # print(dest_map.get_value())
    f_map(np_idxs_0, np_idxs_1)
    # print(dest_map.get_value())
    print("\n\n")
    theano.printing.debugprint(f_map)
    # time.sleep(1)
    # t0_map = time.time()
    for _ in range(LOOPS):
        f_map(np_idxs_0, np_idxs_1)
        # x = dest_map.get_value()
    # t_map = time.time() - t0_map


# full idx list reshaping method ########
if IDX:
    dest_idx = theano.shared(np_dest_zeros, name="dest_idx")

    step_idxs_col = T.reshape(idxs_1, (-1, 1))
    step_idxs_tile = T.tile(step_idxs_col, (1, S))
    step_idxs_rang = step_idxs_tile + T.arange(S)
    step_idxs_flat = step_idxs_rang.reshape([-1])
    env_idxs_repeat = T.repeat(idxs_0, S)

    selections_idx = src[env_idxs_repeat, step_idxs_flat]
    new_dest_idx = selections_idx.reshape([-1, S, H, W])
    updates_idx = [(dest_idx, new_dest_idx)]
    f_idx = theano.function(inputs=[idxs_0, idxs_1], updates=updates_idx,
                            name="idx")

    # print(dest_idx.get_value())
    f_idx(np_idxs_0, np_idxs_1)
    # print(dest_idx.get_value())
    print("\n\n")
    theano.printing.debugprint(f_idx)
    # time.sleep(1)
    # t0_idx = time.time()
    for _ in range(LOOPS):
        f_idx(np_idxs_0, np_idxs_1)
        # x = dest_idx.get_value()
    # t_idx = time.time() - t0_idx


###################################################
if LIST:
    print("Theano list values pass: ",
          np.allclose(np_answer, dest_list.get_value()))
    # print("list time: ", t_list)
if MAP:
    print("Theano map values pass: ",
          np.allclose(np_answer, dest_map.get_value()))
    # print("map time: ", t_map)
if IDX:
    print("Theano idx values pass: ",
          np.allclose(np_answer, dest_idx.get_value()))
    # print("idx time: ", t_idx)
Adam Stooke
2018-01-30 19:25:50 UTC
Nevermind.

The point of this setup was to speed up some computation by keeping raw data
on the GPU, then selecting minibatches from it with some reshaping and other
slight pre-processing (e.g. sometimes I need to set certain subtensors to 0)
into another shared variable, on which the function computes. In my case,
this did save the ~10% of time spent in GpuFromHost inside the function, and
the ~20% of overall time spent building the input array with numpy on the
CPU, but the total time ended up more than doubling, because the functions
that manipulate data on the GPU are very slow. Not sure if this has more to
do with pygpu or just with GPUs in general? I'm content with the CPU-memory
solution.
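
(A hedged sketch of the kind of GPU-side pre-processing mentioned above --
zeroing selected subtensors of a GPU-resident minibatch -- with hypothetical
shapes and names; this is not code from the original post.)

import numpy as np
import theano
import theano.tensor as T

# Hypothetical minibatch held on the GPU as a shared variable.
B, H, W = 32, 200, 200
batch = theano.shared(np.zeros((B, H, W), dtype=np.float32), name="batch")
zero_rows = T.lvector('zero_rows')  # which minibatch entries to blank out

# set_subtensor returns a copy of "batch" with the selected rows set to 0;
# assigning it back through an update keeps the whole operation on the GPU.
blanked = T.set_subtensor(batch[zero_rows], 0.)
f_blank = theano.function([zero_rows], updates=[(batch, blanked)])

f_blank(np.array([0, 3, 7]))  # zero out entries 0, 3, and 7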

In cases where a random minibatch can be gathered by simply indexing into a
batch, I have seen overall speed improvements by putting the batch onto the
GPU, with the same kind of data.
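
(For reference, a minimal sketch of that simpler case, with hypothetical
shapes and names rather than code from this thread: the index vector crosses
to the GPU in a single GpuFromHost, and the gather is one advanced-subtensor
op instead of per-index transfers.)

import numpy as np
import theano
import theano.tensor as T

# Hypothetical sizes, for illustration only.
N, H, W, B = 10000, 84, 84, 256
data = theano.shared(np.random.rand(N, H, W).astype(np.float32), name="data")
dest = theano.shared(np.zeros((B, H, W), dtype=np.float32), name="dest")
idxs = T.lvector('idxs')  # minibatch indices supplied from the host

# One advanced index with a single int vector: the indices go over in one
# transfer, and the gathered minibatch lands in the GPU-resident "dest".
f = theano.function([idxs], updates=[(dest, data[idxs])])

f(np.random.randint(0, N, size=B))  # dest now holds the sampled minibatch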

Would be interested to read about any of your experiences.

Thanks,
Adam
Frédéric Bastien
2018-02-07 21:36:50 UTC
On the GPU, not all indexing is fast. Slices are fast (just a view). But for
advanced indexing, only this form has been well optimized:

a_tensor[a_vector_of_int]

From memory, the vector_of_int can be on any of the dimensions, but it is
certainly supported on the first dimension.

We have code that supports more advanced indexing on the GPU, but it is
sometimes slower and sometimes faster, so it is not activated by default.
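
(A purely illustrative sketch, not from the original thread: the
slice-gathering problem above can be cast into exactly that single-vector
form by flattening the first two dimensions of src and combining the two
index vectors into one flat index. Shapes and names mirror the earlier
script, and the arithmetic assumes src has shape (E, N, H, W).)

import numpy as np
import theano
import theano.tensor as T

# Same shapes as the earlier script (sketch only).
E, N, H, W, S = 4, 2000, 200, 200, 4
src = theano.shared(np.random.rand(E, N, H, W).astype(np.float32), name="src")
idxs_0 = T.lvector('idxs_0')  # which entry along the first dimension
idxs_1 = T.lvector('idxs_1')  # start of each length-S slice along the second

# Build one flat index per gathered element, then use the single-vector form.
step_idxs = (idxs_1.reshape((-1, 1)) + T.arange(S)).reshape((-1,))
env_idxs = T.repeat(idxs_0, S)
flat_idxs = env_idxs * N + step_idxs                 # (i, j) -> i * N + j
selections = src.reshape((E * N, H, W))[flat_idxs]   # a_tensor[a_vector_of_int]
new_dest = selections.reshape((-1, S, H, W))
f = theano.function([idxs_0, idxs_1], new_dest)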

For the "other computation being slow". It will depend what is that
computation. Without seeing the profile of that part, I can't comment. But
we didn't spend a good amount of time optimizing those type of computation.
So I'm not suprised that there is case when the generated code isn't very
optimized.

Frédéric