Discussion: [theano-users] Why is this GpuFromHost call generated?
Frédéric Bastien
2017-08-09 18:36:15 UTC
My guess is that you are using the old GPU backend. Can you confirm that you use the Theano flag device=gpu, and also that you have float64 in the graph? The old backend doesn't support float64. I suggest that you install the just-released 0.10 beta and use the new backend with device=cuda.

Also, you can use the flag warn_float64=pdb to find where the float64s come from and make sure they are float32. This will be faster.
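
For reference, a minimal sketch of how these flags can be set and verified (the flags must be given via THEANO_FLAGS or .theanorc before Theano is imported; the expected values assume the suggestion above):

import theano

# Start the process with e.g.
#   THEANO_FLAGS='device=cuda,floatX=float32,warn_float64=pdb'
# then confirm what Theano actually picked up:
print(theano.config.device)        # expect 'cuda' (new back-end)
print(theano.config.floatX)        # expect 'float32'
print(theano.config.warn_float64)  # expect 'pdb'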

Fred
Post by Haining Yu
Hi,
I am running an RNN/GRU model on a fairly large dataset, with the goal of sequence prediction. When I profile my code, I find that one GpuFromHost call takes about 30% of the run time:

<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
30.2% 73.0% 462.776s 3.71e-01s 1248 221 GpuFromHost(Subtensor{:int64:}.0)
  input 0: dtype=float32, shape=(512, 1024, 2048), strides=(-4096, 4, 2097152)
  output 0: dtype=float32, shape=(512, 1024, 2048), strides=(2097152, 2048, 1)
theano.printing.debugprint shows that the call is generated in the gradient calculation; see the snippet below. There is also a HostFromGpu a couple of levels further down.
| | | | |GpuFromHost [id FN] '' 221
| | | | |Subtensor{:int64:} [id FO] '' 220
| | | | |Subtensor{::int64} [id FP] '' 219
| | | | | |InplaceDimShuffle{1,2,0} [id FQ] '' 218
| | | | | | |Reshape{3} [id FR] '' 217
| | | | | | |CrossentropyCategorical1HotGrad [id FS] '' 216
| | | | | | | |Elemwise{Second}[(0, 0)] [id FT] '' 215
| | | | | | | | |CrossentropyCategorical1Hot [id FU] '' 209
| | | | | | | | | |HostFromGpu [id FV] '' 206
I have heard about the cost of GpuFromHost (and its counterpart HostFromGpu) and had already moved almost all data to the GPU (via shared variables), so I don't understand why this call is needed. In particular:
1. If all my data are on the GPU and Theano is optimized for the GPU, why is the GpuFromHost generated at all?
2. Is the call generated because the tensor is too large? It moves 512 x 1024 x 2048 x 4 bytes, roughly 4.3 GB, but my Tesla K80 has 12 GB of memory, so on the surface the need to move data back to the host seems remote. Overall memory consumption looks fine in the profile.
3. Does the call have anything to do with CrossentropyCategorical1Hot? I assume CrossentropyCategorical1Hot has been optimized for the GPU, but the graph shows that a HostFromGpu is applied before CrossentropyCategorical1Hot. I am not sure whether CrossentropyCategorical1Hot has any memory-layout requirement (e.g., c-contiguity).
4. Should I try any GPU assertion to debug the root cause of the problem? (See the profiling sketch right after this list.)
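
For reference, a minimal, self-contained sketch of how a profile like the one above and a debugprint dump can be produced; the toy graph here is hypothetical and only stands in for the real RNN/GRU cost:

import numpy as np
import theano
import theano.tensor as T

# Toy graph standing in for the real model; only the profiling and
# printing calls matter here.
x = T.matrix('x')
w = theano.shared(np.ones((5, 3), dtype=theano.config.floatX), name='w')
cost = T.sum(T.dot(x, w))

f = theano.function([x], cost, profile=True)
f(np.ones((4, 5), dtype=theano.config.floatX))

# Per-Apply timing, including any GpuFromHost/HostFromGpu transfers:
f.profile.summary()

# Optimized graph; transfer ops mark where data crosses the CPU/GPU boundary:
theano.printing.debugprint(f)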
Any hint is appreciated.
Thank you,
Haining
Haining Yu
2017-08-09 18:47:48 UTC
Thank you Fred.

Yes, I am using device=gpu0. I will switch to the new backend and test again.

On float64, do you mean int64? If so, I am puzzled by that too. In my code I never explicitly cast to int64. Instead I use tensor.ivector() to index matrices and cast them explicitly to int32. For example:

x = T.ivector()

z = T.cast(y, dtype='int32')

Do you think these things cause the problem?

Thank you,
Haining

Frédéric Bastien
2017-08-09 21:38:11 UTC
Hi,

Do you use float? I meant float32. The old back-end only supports float32, so if you use float64 or int32, nothing will be computed on the GPU.

The new back-end supports many dtypes, including float64 and int*, so it should work better.

Note that if you do an operation between float32 and int32, the result is float64; these are the normal C/NumPy casting rules (float32 combined with int16 returns float32). So if you end up with float64, that is frequently the reason.
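
As an illustration of those casting rules, a small sketch that only inspects symbolic dtypes (nothing is compiled or run on the GPU):

import theano.tensor as T

f32 = T.fmatrix('f32')   # float32
i32 = T.ivector('i32')   # int32
i16 = T.wvector('i16')   # int16

print((f32 + i32).dtype)                      # float64 -> falls off the old GPU back-end
print((f32 + i16).dtype)                      # float32 -> stays on the GPU
print((f32 + T.cast(i32, 'float32')).dtype)   # float32 after an explicit cast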
Fred
Haining Yu
2017-08-10 13:00:52 UTC
I don't see any float64 in the debugprint output.

Inspecting the code, I am only using floatX, e.g.:
self.x = theano.shared(name='gx', value=x1.astype(theano.config.floatX))

I did cast various indices to int32, but in the profile they appear to be converted to int64.
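
One way to double-check that, independent of debugprint, is to scan the compiled function's optimized graph for offending dtypes; this is only a sketch, and report_dtypes and train_fn are hypothetical names:

def report_dtypes(fn):
    # Walk the optimized graph of a compiled Theano function and flag any
    # float64/int64 variables together with the op that produces or uses them.
    for node in fn.maker.fgraph.toposort():
        for var in node.inputs + node.outputs:
            if getattr(var, 'dtype', None) in ('float64', 'int64'):
                print(var.dtype, var, 'in', node.op)

# report_dtypes(train_fn)   # train_fn: the compiled training function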

Will make all the changes based on your suggestion and test one more time.

Thanks again.