Frédéric Bastien
2017-08-09 18:36:15 UTC
My guess is that you are using the old GPU backend. Can you confirm that you
use the Theano flag device=gpu, and that you also have float64 in the graph?
The old backend doesn't support float64. I suggest that you install the
just-released 0.10 beta and use the new backend with device=cuda.
Also, you can use the flag warn_float64=pdb to find where the float64s come
from and make sure they are float32. This will be faster.
Fred
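For reference, the flags Fred mentions can be passed through the THEANO_FLAGS environment variable for a single run, or stored in ~/.theanorc. A minimal sketch (assuming the new libgpuarray backend is installed; the script name train.py is a placeholder):

```
# One-off, on the command line:
THEANO_FLAGS='device=cuda,floatX=float32,warn_float64=pdb' python train.py

# Or persistently, in ~/.theanorc:
[global]
device = cuda
floatX = float32
warn_float64 = pdb
```

With warn_float64=pdb, Theano drops into the debugger at the point where a float64 variable is created in the graph, which makes it easy to find the offending op.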
Hi,
I am running an RNN/GRU model on a fairly large dataset for sequence
prediction. When I profile my code, I find that a single GpuFromHost apply
node dominates the runtime:
<% time> <sum %> <apply time> <time per call> <#call> <id> <Mflops> <Gflops/s> <Apply name>
30.2% 73.0% 462.776s 3.71e-01s 1248 221 GpuFromHost(Subtensor{:int64:}.0)
  input 0: dtype=float32, shape=(512, 1024, 2048), strides=(-4096, 4, 2097152)
  output 0: dtype=float32, shape=(512, 1024, 2048), strides=(2097152, 2048, 1)
theano.printing.debugprint shows that the call is generated in gradient
calculation; see snippet below. There is also a HostFromGpu a couple of
layers below.
| | | | |GpuFromHost [id FN] '' 221
| | | | |Subtensor{:int64:} [id FO] '' 220
| | | | |Subtensor{::int64} [id FP] '' 219
| | | | | |InplaceDimShuffle{1,2,0} [id FQ] '' 218
| | | | | | |Reshape{3} [id FR] '' 217
| | | | | | |CrossentropyCategorical1HotGrad [id FS] '' 216
| | | | | | | |Elemwise{Second}[(0, 0)] [id FT] '' 215
| | | | | | | | |CrossentropyCategorical1Hot [id FU] '' 209
| | | | | | | | | |HostFromGpu [id FV] '' 206
I have heard about the cost of using GpuFromHost (and its counterpart
HostFromGpu) and had moved almost all data to GPU (via shared variables).
So I don't understand why the call is needed. In particular:
1. If all my data are on the GPU and Theano is optimized for the GPU, why is
the GpuFromHost even generated?
2. Is the call generated because the tensor is too large? The call moves
512 x 1024 x 2048 x 4 bytes, about 4.3 GB. But my Tesla K80 should have 12 GB
of memory, so on the surface there should be no need to move it off the
device. Overall memory consumption looks fine in the profile.
3. Does the call have anything to do with CrossentropyCategorical1Hot? I
assume CrossentropyCategorical1Hot has been optimized for the GPU, but the
debugprint shows that a HostFromGpu is applied before
CrossentropyCategorical1Hot. I am not sure whether CrossentropyCategorical1Hot
has any memory-layout requirement (e.g., C-contiguity).
4. Should I try any GPU assertion to debug the root cause of the problem?
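As a sanity check on points 2 and 3, the size and layout of the offending tensor can be reproduced with plain NumPy (a sketch, not Theano itself). Note the negative stride in the profile's input, strides=(-4096, 4, 2097152): such a view is not C-contiguous, so any op that requires contiguous input forces a full copy.

```python
import numpy as np

# The transferred tensor: 512 x 1024 x 2048 float32 values.
nbytes = 512 * 1024 * 2048 * 4
print(nbytes / 1e9)  # ~4.29 GB per call

# A reversed view has a negative stride and is no longer
# C-contiguous, mimicking the input shown in the profile.
a = np.zeros((4, 8), dtype=np.float32)
view = a[::-1]                      # negative stride on axis 0
print(view.flags['C_CONTIGUOUS'])   # False

# Making it contiguous (as an op requiring C-contiguity must)
# triggers a full copy of the buffer.
contig = np.ascontiguousarray(view)
print(contig.flags['C_CONTIGUOUS'])  # True
```

This suggests the Subtensor/DimShuffle chain feeding GpuFromHost produces a non-contiguous view, so the transfer also pays for a contiguity copy on top of the host-device traffic.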
Any hint is appreciated.
Thank you,
Haining
--
---
You received this message because you are subscribed to the Google Groups
"theano-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.