Discussion:
[theano-users] Does theano compute parallel branches in parallel?
Sharapolas
2017-04-18 09:24:30 UTC
Permalink
I have a computation tree and am implementing leaf node evaluations. In a
Theano graph, do parallel branches get evaluated in parallel on the GPU?
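For example (a toy sketch, not my actual tree), the function below has two
branches that do not depend on each other:

    import theano
    import theano.tensor as T

    x = T.matrix('x')
    branch1 = T.dot(x, x.T)   # independent branch 1
    branch2 = T.exp(x).sum()  # independent branch 2
    f = theano.function([x], [branch1, branch2])  # could these run concurrently?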
Frédéric Bastien
2017-04-19 17:54:54 UTC
Permalink
Sadly, we don't do that anywhere currently.

Fred
Post by Sharapolas
I have a computation tree and am implementing leaf node evaluations. In a
Theano graph, do parallel branches get evaluated in parallel on the GPU?
Patric
2017-04-20 01:07:45 UTC
Permalink
Could you share your model with us? We'd like to take a look :)
Post by Sharapolas
I have a computation tree and am implementing leaf node evaluations. In a
Theano graph, do parallel branches get evaluated in parallel on the GPU?
Sharapolas
2017-04-20 09:43:18 UTC
Permalink
Guys, thanks for your feedback.

For the past week I have been trying to optimize my solver as much as
possible, and I optimized it so much that the CPU is now twice as fast as the
GPU :D I am extremely puzzled by this result and hope you could shed some
light on it.

Wider story:
In my initial version, I arranged the tensors such that I did not need
to do any slicing. Then I noticed that GPU load is directly proportional to
the size of the tensors being used, so I decided to use smaller tensors, lump
them together, and slice in the few cases where I need to. As a result the
GPU code turned out to be more than 4 times slower, while the CPU code almost
rivals my first GPU version. I tried different versions of indexing (e.g.
A[:,i], T.take(A, i, 1), T.split), but all resulted in similar timings.
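For reference, the indexing variants I tried look roughly like this (a
minimal standalone sketch on a hypothetical 3x4 matrix, not the actual solver
code):

    import numpy as np
    import theano
    import theano.tensor as T

    A = T.fmatrix('A')
    col_subscript = A[:, 1:2]                        # plain subscript slicing
    col_take = T.take(A, [1], axis=1)                # T.take along axis 1
    col_split = T.split(A, [1, 1, 2], 3, axis=1)[1]  # middle piece of a 3-way split

    f = theano.function([A], [col_subscript, col_take, col_split])
    x = np.arange(12, dtype='float32').reshape(3, 4)
    print(f(x))  # three identical (3, 1) column slices

All three return the same column; in the solver they all gave similar timings
anyway.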

Do you have suggestions for how I could speed up my GPU code? Otherwise, I
might as well just run on a multicore CPU and probably end up even faster
than the GPU :/


GPU version. Flags:
os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=gpu,allow_gc=False,lib.cnmem=0.3,profile=True'
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
Pickled version:
https://drive.google.com/open?id=0BwqtLV7TthvgUUZCTVJOajFxZGM
Graph:
https://drive.google.com/open?id=0BwqtLV7TthvgdjVWOWtCWGxQOVU
Profile:
Function profiling
==================
Time in 1000 calls to Function.__call__: 2.170000e+01s
Time in Function.fn.__call__: 2.166000e+01s (99.816%)
Time in thunks: 2.150321e+01s (99.093%)
Total compile time: 1.809000e+00s
Number of Apply nodes: 276
Theano Optimizer time: 1.099000e+00s
Theano validate time: 2.069981e-01s
Theano Linker time (includes C, CUDA code generation/compiling):
2.370000e-01s
Import time 3.000021e-03s
Node make_thunk time 2.260001e-01s
Node GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
0)](raw_p:cc/cc/cc/r1a, GpuJoin.0, GpuDimShuffle{0,x}.0,
CudaNdarrayConstant{[[ 0.]]}) time 3.000021e-03s
Node GpuSplit{2}(raw_p:cc/cc/cc/cr1a, TensorConstant{0},
TensorConstant{(2L,) of 1}) time 2.000093e-03s
Node GpuSplit{2}(raw_p:cc/cc/cc/cr1r0r0a, TensorConstant{0},
TensorConstant{(2L,) of 1}) time 2.000093e-03s
Node GpuGemm{inplace}(GpuDot22.0, TensorConstant{1.0},
convert2reduced_p=0_r=3, GpuElemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
time 2.000093e-03s
Node GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0},
TensorConstant{(4L,) of 1}) time 2.000093e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 101.753s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
<Class name>
38.0% 38.0% 8.176s 1.57e-04s C 52000 52
theano.sandbox.cuda.blas.GpuDot22
16.9% 54.9% 3.627s 4.37e-05s C 83000 83
theano.sandbox.cuda.basic_ops.GpuElemwise
14.7% 69.6% 3.169s 1.76e-04s Py 18000 18
theano.sandbox.cuda.basic_ops.GpuSplit
13.8% 83.4% 2.970s 1.65e-04s C 18000 18
theano.sandbox.cuda.basic_ops.GpuJoin
12.4% 95.9% 2.674s 1.57e-04s C 17000 17
theano.sandbox.cuda.blas.GpuGemm
3.5% 99.4% 0.751s 4.17e-05s C 18000 18
theano.sandbox.cuda.basic_ops.GpuCAReduce
0.6% 100.0% 0.137s 1.96e-06s C 70000 70
theano.sandbox.cuda.basic_ops.GpuDimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
name>
38.0% 38.0% 8.176s 1.57e-04s C 52000 52
GpuDot22
13.8% 51.8% 2.970s 1.65e-04s C 18000 18
GpuJoin
12.4% 64.3% 2.674s 1.57e-04s C 17000 17
GpuGemm{inplace}
7.7% 71.9% 1.649s 2.36e-04s Py 7000 7
GpuSplit{4}
6.1% 78.1% 1.317s 4.39e-05s C 30000 30
GpuElemwise{Mul}[(0, 1)]
5.4% 83.5% 1.167s 1.30e-04s Py 9000 9
GpuSplit{2}
3.6% 87.0% 0.766s 4.26e-05s C 18000 18
GpuElemwise{mul,no_inplace}
3.5% 90.6% 0.763s 4.24e-05s C 18000 18
GpuElemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
3.5% 94.1% 0.751s 4.17e-05s C 18000 18
GpuCAReduce{add}{0,1}
1.9% 95.9% 0.399s 4.99e-05s C 8000 8
GpuElemwise{Mul}[(0, 0)]
1.6% 97.6% 0.353s 1.76e-04s Py 2000 2
GpuSplit{3}
1.1% 98.7% 0.247s 4.12e-05s C 6000 6
GpuElemwise{Add}[(0, 2)]
0.6% 99.4% 0.133s 2.56e-06s C 52000 52
GpuDimShuffle{1,0}
0.4% 99.8% 0.094s 4.70e-05s C 2000 2
GpuElemwise{Add}[(0, 1)]
0.2% 100.0% 0.041s 4.10e-05s C 1000 1
GpuElemwise{Composite{(((i0 + i1) + i2) + i3)}}[(0, 0)]
0.0% 100.0% 0.004s 2.22e-07s C 18000 18
GpuDimShuffle{0,x}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
1.2% 1.2% 0.259s 2.59e-04s 1000 14
GpuSplit{4}(raw_p:cc/cc/cc/cr0r0, TensorConstant{0}, TensorConstant{(4L,)
of 1})
1.1% 2.3% 0.246s 2.46e-04s 1000 9
GpuSplit{4}(raw_p:cc/cc/cc/c, TensorConstant{0}, TensorConstant{(4L,) of 1})
1.1% 3.5% 0.245s 2.45e-04s 1000 236
GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
GpuElemwise{Add}[(0, 1)].0)
1.1% 4.6% 0.239s 2.39e-04s 1000 239
GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
GpuElemwise{Add}[(0, 2)].0)
1.1% 5.7% 0.233s 2.33e-04s 1000 8
GpuSplit{4}(raw_p:cc/cc/cc/cr1r0, TensorConstant{0}, TensorConstant{(4L,)
of 1})
1.1% 6.8% 0.232s 2.32e-04s 1000 5
GpuSplit{4}(raw_p:cc/cc/cc/r0, TensorConstant{0}, TensorConstant{(4L,) of
1})
1.1% 7.8% 0.228s 2.28e-04s 1000 0
GpuSplit{4}(raw_p:cc/cc/cc/r1, TensorConstant{0}, TensorConstant{(4L,) of
1})
1.1% 8.9% 0.227s 2.27e-04s 1000 2
GpuSplit{4}(raw_p:cc/cc/cc/r1r0r0, TensorConstant{0}, TensorConstant{(4L,)
of 1})
1.0% 9.9% 0.225s 2.25e-04s 1000 238
GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
GpuElemwise{Add}[(0, 2)].0)
1.0% 11.0% 0.224s 2.24e-04s 1000 4
GpuSplit{4}(raw_p:cc/cc/cc/r0r0r0, TensorConstant{0}, TensorConstant{(4L,)
of 1})
1.0% 12.0% 0.223s 2.23e-04s 1000 260
GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
GpuElemwise{Add}[(0, 2)].0)
1.0% 13.0% 0.221s 2.21e-04s 1000 271
GpuJoin(TensorConstant{1}, GpuElemwise{Composite{(((i0 + i1) + i2) +
i3)}}[(0, 0)].0, GpuGemm{inplace}.0, GpuElemwise{Add}[(0, 2)].0,
GpuElemwise{Add}[(0, 2)].0)
1.0% 14.0% 0.218s 2.18e-04s 1000 261
GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
GpuElemwise{Add}[(0, 2)].0)
0.9% 15.0% 0.203s 2.03e-04s 1000 237
GpuJoin(TensorConstant{1}, GpuDot22.0, GpuDot22.0, GpuGemm{inplace}.0,
GpuElemwise{Add}[(0, 1)].0)
0.9% 15.8% 0.184s 1.84e-04s 1000 146
GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
0.8% 16.7% 0.181s 1.81e-04s 1000 84
GpuDot22(ranges_r=3, GpuDimShuffle{1,0}.0)
0.8% 17.5% 0.179s 1.79e-04s 1000 134
GpuDot22(ranges_r=3, GpuElemwise{Mul}[(0, 1)].0)
0.8% 18.4% 0.179s 1.79e-04s 1000 16
GpuSplit{3}(raw_p:cc/cc/cc/cr0r0r0r0, TensorConstant{0},
TensorConstant{(3L,) of 1})
0.8% 19.2% 0.175s 1.75e-04s 1000 83
GpuDot22(convert2reduced_p=0_r=3, GpuDimShuffle{1,0}.0)
0.8% 20.0% 0.174s 1.74e-04s 1000 11
GpuSplit{3}(raw_p:cc/cc/cc/cr1r0r0r0, TensorConstant{0},
TensorConstant{(3L,) of 1})
... (remaining 256 Apply instances account for 80.03%(17.21s) of the
runtime)


Some info useful for gpu:

Spent 0.000s(0.00%) in cpu Op, 21.503s(100.00%) in gpu Op and
0.000s(0.00%) transfert Op

Theano function input that are float64
<fct name> <input name> <input type> <str input>

List of apply that don't have float64 as input but have float64 in
outputs
(Useful to know if we forgot some cast when using floatX=float32 or gpu
code)
<Apply> <Apply position> <fct name> <inputs type> <outputs type>

Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing
list).
Test them first, as they are not guaranteed to always
provide a speedup.
Sorry, no tip for today.

The CPU version. Flags:
os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,profile=True'
Graph:
https://drive.google.com/open?id=0BwqtLV7TthvgQ0RuLXRaZUw5VVk
Pickled function:
https://drive.google.com/open?id=0BwqtLV7TthvgY2pMZ3FVNG1sMlU
Profile:
Function profiling
==================
Time in 1000 calls to Function.__call__: 5.470006e+00s
Time in Function.fn.__call__: 5.422005e+00s (99.122%)
Time in thunks: 5.277404e+00s (96.479%)
Total compile time: 9.329998e-01s
Number of Apply nodes: 285
Theano Optimizer time: 7.650001e-01s
Theano validate time: 1.880007e-01s
Theano Linker time (includes C, CUDA code generation/compiling):
1.140001e-01s
Import time 0.000000e+00s
Node make_thunk time 1.020000e-01s
Node InplaceDimShuffle{x,0}(Sum{axis=[0], acc_dtype=float64}.0)
time 1.000166e-03s
Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
InplaceDimShuffle{1,0}.0) time 1.000166e-03s
Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
InplaceDimShuffle{1,0}.0) time 1.000166e-03s
Node Elemwise{Mul}[(0, 1)](InplaceDimShuffle{1,0}.0,
InplaceDimShuffle{1,0}.0) time 1.000166e-03s
Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
Elemwise{Mul}[(0, 1)].0, convert2reduced_p=1_r=3, TensorConstant{1.0}) time
1.000166e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 62.174s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
<Class name>
74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
theano.tensor.blas.Dot22
18.9% 93.2% 0.996s 5.86e-05s C 17000 17
theano.tensor.blas.Gemm
2.8% 95.9% 0.146s 1.59e-06s C 92000 92
theano.tensor.elemwise.Elemwise
1.6% 97.6% 0.085s 4.72e-06s C 18000 18
theano.tensor.elemwise.Sum
1.1% 98.7% 0.058s 3.22e-06s C 18000 18
theano.tensor.basic.Join
1.0% 99.7% 0.053s 2.94e-06s C 18000 18
theano.tensor.basic.Split
0.3% 100.0% 0.018s 2.57e-07s C 70000 70
theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
name>
74.3% 74.3% 3.921s 7.54e-05s Py 52000 52
Dot22
18.9% 93.2% 0.996s 5.86e-05s C 17000 17
Gemm{inplace}
1.6% 94.8% 0.085s 4.72e-06s C 18000 18
Sum{axis=[0], acc_dtype=float64}
1.4% 96.2% 0.076s 4.22e-06s C 18000 18
Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
1.1% 97.3% 0.058s 3.22e-06s C 18000 18
Join
0.7% 98.0% 0.038s 2.11e-06s C 18000 18
Elemwise{mul,no_inplace}
0.5% 98.5% 0.025s 3.56e-06s C 7000 7
Split{4}
0.4% 98.9% 0.021s 2.34e-06s C 9000 9
Split{2}
0.2% 99.2% 0.013s 2.50e-07s C 52000 52
InplaceDimShuffle{1,0}
0.2% 99.4% 0.012s 3.08e-07s C 39000 39
Elemwise{Mul}[(0, 1)]
0.2% 99.6% 0.011s 1.83e-06s C 6000 6
Elemwise{Add}[(0, 2)]
0.1% 99.7% 0.007s 3.51e-06s C 2000 2
Split{3}
0.1% 99.8% 0.005s 5.56e-07s C 9000 9
Elemwise{Mul}[(0, 0)]
0.1% 99.9% 0.005s 2.77e-07s C 18000 18
InplaceDimShuffle{x,0}
0.1% 100.0% 0.004s 2.00e-06s C 2000 2
Elemwise{Add}[(0, 1)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
2.0% 2.0% 0.106s 1.06e-04s 1000 110
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
2.0% 4.0% 0.104s 1.04e-04s 1000 107
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.8% 5.7% 0.093s 9.30e-05s 1000 188
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.8% 7.5% 0.093s 9.30e-05s 1000 78
Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
1.8% 9.3% 0.093s 9.29e-05s 1000 146
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 11.0% 0.092s 9.20e-05s 1000 135
Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
1.7% 12.8% 0.092s 9.20e-05s 1000 105
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 14.5% 0.092s 9.19e-05s 1000 164
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 16.2% 0.090s 9.03e-05s 1000 177
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 17.9% 0.090s 8.99e-05s 1000 178
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 19.6% 0.089s 8.90e-05s 1000 159
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 21.3% 0.089s 8.90e-05s 1000 168
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 23.0% 0.089s 8.90e-05s 1000 157
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.7% 24.6% 0.088s 8.80e-05s 1000 73
Dot22(InplaceDimShuffle{1,0}.0, ranges_r=3)
1.6% 26.3% 0.087s 8.71e-05s 1000 121
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.6% 27.9% 0.087s 8.70e-05s 1000 193
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.6% 29.6% 0.086s 8.60e-05s 1000 170
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.6% 31.2% 0.085s 8.50e-05s 1000 166
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.6% 32.8% 0.084s 8.40e-05s 1000 155
Dot22(Elemwise{Mul}[(0, 1)].0, ranges_r=3)
1.6% 34.3% 0.083s 8.30e-05s 1000 140
Dot22(Elemwise{Mul}[(0, 0)].0, ranges_r=3)
... (remaining 265 Apply instances account for 65.66%(3.46s) of the
runtime)

Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing
list).
Test them first, as they are not guaranteed to always
provide a speedup.
Sorry, no tip for today.
Post by Patric
Could you share your model with us? We'd like to take a look :)
Post by Sharapolas
I have a computation tree and am implementing leaf node evaluations. In a
Theano graph, do parallel branches get evaluated in parallel on the GPU?
Sharapolas
2017-04-20 09:49:09 UTC
Permalink
My system:
Windows 8.1 Enterprise x64
Anaconda Python 2.7.12 x64
Theano 0.9.0rc4.dev-44f7578c16e7b991c06e373d470d9889c2729844
Geforce GTX 1070
Patric
2017-04-21 05:14:59 UTC
Permalink
Thanks very much for the information.

From the profiling log, the CPU looks quite good here since there are lots of
data operations such as split and join, which are almost 100X faster on the CPU.

The topology of your model includes a huge number of small GEMMs and Elemwise
ops, so I think the large cache helps on the CPU side. And, as the thread
title suggests, parallel branches would be a very good idea for independent
compute flows.

Have you used Intel MKL as the backend for GEMM? It should show better
performance.
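A quick way to check which BLAS backend NumPy and Theano are linked against
(just a sketch; the exact output depends on your install):

    import numpy as np
    import theano

    np.__config__.show()                 # look for "mkl" in the BLAS sections
    print(theano.config.blas.ldflags)    # link flags Theano passes to the compiler
    print(theano.config.device, theano.config.floatX)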

Btw, I can't open the .p file; any suggestions?
Sharapolas
2017-04-21 07:04:02 UTC
Permalink
Dear Patric,

Thank you for your help and comments. Coincidentally, soon after posting I
came across MKL, and I find it pretty criminal that it's not the default in
Anaconda! :)

The CPU version is now either much faster (when I reduce the internal
matrices from 1000x1000 to 200x1000) or equal to my GPU version. So the CPU is
better able to exploit my fundamental optimizations of the problem itself. I'm
pretty curious how this would look on a server-type multi-core CPU.

Regarding parallel branches, even aside from my specific problem, I see more
and more papers coming out with multiple inputs, forks, and merges within
models. These structures would benefit greatly from parallel branches. Now,
thinking more about it, such parallelism could be achieved manually by
splitting the graph at nodes with many inputs. One would create shared
variables that link the sub-graphs with the trunk graph; then, because Theano
uses the GPU asynchronously, the branches could proceed concurrently.
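Roughly what I have in mind (a toy sketch with made-up shapes and names, not
my solver): compile each branch into its own function that writes its result
into a shared variable, and let the trunk function read those shared
variables:

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.fmatrix('x')

    # shared "link" variables connecting the branch sub-graphs to the trunk
    branch_a = theano.shared(np.zeros((4, 4), dtype='float32'), name='branch_a')
    branch_b = theano.shared(np.zeros((4, 4), dtype='float32'), name='branch_b')

    # one compiled function per branch, each storing its result in a shared variable
    f_branch_a = theano.function([x], [], updates=[(branch_a, T.dot(x, x.T))])
    f_branch_b = theano.function([x], [], updates=[(branch_b, T.maximum(x, 0).dot(x.T))])

    # trunk function that only reads the shared variables
    f_trunk = theano.function([], branch_a + branch_b)

    data = np.random.rand(4, 4).astype('float32')
    f_branch_a(data)   # with an asynchronous GPU backend these can return once queued
    f_branch_b(data)
    print(f_trunk())   # fetching the result forces synchronization

Whether the branch kernels actually overlap on the device of course depends
on how Theano queues the work; the sketch only shows the shared-variable
plumbing.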


New CPU profile:
Function profiling
==================
Message: D:\PK scripts\sgd_solver\utils\GameForest.py:137
Time in 1000 calls to Function.__call__: 9.868994e+00s
Time in Function.fn.__call__: 9.794995e+00s (99.250%)
Time in thunks: 9.372134e+00s (94.965%)
Total compile time: 8.510001e-01s
Number of Apply nodes: 276
Theano Optimizer time: 6.790001e-01s
Theano validate time: 1.619997e-01s
Theano Linker time (includes C, CUDA code generation/compiling):
1.150000e-01s
Import time 1.000166e-03s
Node make_thunk time 1.029999e-01s
Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
0)](raw_p:cc/cc/cc/cr0r0r0r0a, Join.0, InplaceDimShuffle{0,x}.0,
TensorConstant{(1L, 1L) of 0.0}) time 1.999855e-03s
Node Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
time 1.000166e-03s
Node Elemwise{Mul}[(0, 1)](Elemwise{Mul}[(0, 1)].0,
InplaceDimShuffle{1,0}.0) time 1.000166e-03s
Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
0)](raw_p:cc/cc/cc/cr1r0a, Join.0, InplaceDimShuffle{0,x}.0,
TensorConstant{(1L, 1L) of 0.0}) time 1.000166e-03s
Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
time 1.000166e-03s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 74.134s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
<Class name>
65.8% 65.8% 6.164s 1.19e-04s Py 52000 52
theano.tensor.blas.Dot22
25.3% 91.0% 2.369s 1.39e-04s C 17000 17
theano.tensor.blas.Gemm
3.5% 94.5% 0.325s 3.91e-06s C 83000 83
theano.tensor.elemwise.Elemwise
2.1% 96.6% 0.197s 1.09e-05s C 18000 18
theano.tensor.basic.Split
1.9% 98.5% 0.174s 9.65e-06s C 18000 18
theano.tensor.basic.Join
1.1% 99.6% 0.104s 5.77e-06s C 18000 18
theano.tensor.elemwise.Sum
0.4% 100.0% 0.040s 5.71e-07s C 70000 70
theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
name>
65.8% 65.8% 6.164s 1.19e-04s Py 52000 52
Dot22
25.3% 91.0% 2.369s 1.39e-04s C 17000 17
Gemm{inplace}
1.9% 92.9% 0.174s 9.65e-06s C 18000 18
Join
1.5% 94.4% 0.139s 7.72e-06s C 18000 18
Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
1.2% 95.6% 0.113s 1.25e-05s C 9000 9
Split{2}
1.1% 96.7% 0.104s 5.77e-06s C 18000 18
Sum{axis=[1], acc_dtype=float64}
0.8% 97.5% 0.077s 4.28e-06s C 18000 18
Elemwise{mul,no_inplace}
0.8% 98.3% 0.074s 1.06e-05s C 7000 7
Split{4}
0.4% 98.8% 0.042s 1.40e-06s C 30000 30
Elemwise{Mul}[(0, 1)]
0.3% 99.1% 0.030s 4.99e-06s C 6000 6
Elemwise{Add}[(0, 2)]
0.3% 99.3% 0.025s 4.80e-07s C 52000 52
InplaceDimShuffle{1,0}
0.2% 99.5% 0.018s 2.25e-06s C 8000 8
Elemwise{Mul}[(0, 0)]
0.2% 99.7% 0.015s 8.33e-07s C 18000 18
InplaceDimShuffle{0,x}
0.1% 99.8% 0.010s 4.99e-06s C 2000 2
Split{3}
0.1% 99.9% 0.010s 4.99e-06s C 2000 2
Elemwise{Add}[(0, 1)]
0.1% 100.0% 0.009s 8.99e-06s C 1000 1
Elemwise{Add}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
2.5% 2.5% 0.237s 2.37e-04s 1000 84
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
2.1% 4.6% 0.194s 1.94e-04s 1000 83
Dot22(ranges_r=3, InplaceDimShuffle{1,0}.0)
1.9% 6.5% 0.182s 1.82e-04s 1000 150
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
1.7% 8.2% 0.157s 1.57e-04s 1000 71
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
1.7% 9.9% 0.156s 1.56e-04s 1000 94
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
1.7% 11.6% 0.156s 1.56e-04s 1000 153
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
1.6% 13.2% 0.154s 1.54e-04s 1000 119
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
1.6% 14.8% 0.152s 1.52e-04s 1000 126
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
1.6% 16.4% 0.150s 1.50e-04s 1000 134
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
1.6% 18.0% 0.150s 1.50e-04s 1000 165
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
1.6% 19.6% 0.149s 1.49e-04s 1000 164
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
1.6% 21.2% 0.147s 1.47e-04s 1000 184
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.6% 22.7% 0.146s 1.46e-04s 1000 160
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 24.3% 0.145s 1.45e-04s 1000 113
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
1.5% 25.8% 0.145s 1.45e-04s 1000 85
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
1.5% 27.4% 0.142s 1.42e-04s 1000 172
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
1.5% 28.9% 0.142s 1.42e-04s 1000 188
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 30.4% 0.141s 1.41e-04s 1000 183
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 31.8% 0.137s 1.37e-04s 1000 193
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 33.3% 0.137s 1.37e-04s 1000 72
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
... (remaining 256 Apply instances account for 66.70%(6.25s) of the
runtime)

Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing
list).
Test them first, as they are not guaranteed to always
provide a speedup.
Sorry, no tip for today.
runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing
list).
Test them first, as they are not guaranteed to always
provide a speedup.
Sorry, no tip for today.
Sharapolas
2017-05-09 09:54:48 UTC
Permalink
I have investigated the Theano runtime and I think I could achieve what I want
with a custom linker. Before I do that, I would like to get your feedback.

As far as I understand, the Theano graph is traversed at runtime by a Linker:
the nodes are sorted into the order in which they should be executed, and then
their thunks are run one by one. A simple linker that I've found is:
class Loop(VM):
    """
    Unconditional start-to-finish program execution in Python.
    No garbage collection is allowed on intermediate results.
    """
    # Some other part of Theano query that information
    allow_gc = False

    def __call__(self):
        if self.time_thunks:
            for cont in self.pre_call_clear:
                cont[0] = None
            try:
                for i, (thunk, node) in enumerate(zip(self.thunks,
                                                      self.nodes)):
                    t0 = time.time()
                    thunk()
                    t1 = time.time()
                    self.call_counts[i] += 1
                    self.call_times[i] += t1 - t0
            except:
                link.raise_with_op(node, thunk)
        else:
            for cont in self.pre_call_clear:
                cont[0] = None
            try:
                for thunk, node in zip(self.thunks, self.nodes):
                    thunk()
            except:
                link.raise_with_op(node, thunk)

Here the thunks are processed sequentially. Now suppose all my thunks are
independent (say, many updates of many independent variables); then I could run
all of them in parallel. Even when some of the thunks depend on each other, I
could still run them in parallel, as long as I make sure that by the time a
thunk runs its inputs are ready. I imagine the latter could be done, but before
doing anything I would like to ask whether I understand the situation correctly.
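To make this concrete, here is a minimal sketch of the scheduling logic I have
in mind, written in plain Python and independent of Theano's actual VM/Linker
interface (`thunks` and `deps` are placeholders for the per-node thunks and
their dependency sets; error handling is omitted):

# Hypothetical sketch only -- this is not Theano's real VM/Linker API.
# `thunks` is a list of callables (one per node) and `deps[i]` is the set of
# indices of the thunks whose outputs thunk i needs.
import threading
from concurrent.futures import ThreadPoolExecutor


def run_thunks_dataflow(thunks, deps, max_workers=4):
    n = len(thunks)
    if n == 0:
        return
    remaining = [set(d) for d in deps]        # unmet dependencies per thunk
    dependants = [[] for _ in range(n)]       # reverse edges
    for i, ds in enumerate(deps):
        for d in ds:
            dependants[d].append(i)

    lock = threading.Lock()
    all_done = threading.Event()
    n_finished = [0]

    def run_one(pool, i):
        thunks[i]()                           # compute node i
        ready = []
        with lock:
            n_finished[0] += 1
            for j in dependants[i]:
                remaining[j].discard(i)
                if not remaining[j]:          # all inputs of j are now ready
                    ready.append(j)
            if n_finished[0] == n:
                all_done.set()
        for j in ready:
            pool.submit(run_one, pool, j)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for i, ds in enumerate(remaining):
            if not ds:                        # thunks with no inputs start first
                pool.submit(run_one, pool, i)
        all_done.wait()

Whether this overlaps any real work of course depends on the thunks releasing
the GIL (C thunks calling into BLAS, or GPU thunks that launch kernels, might;
pure-Python thunks would not).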
Post by Sharapolas
Dear Patric,
Thank you for your help and comments. Coincidentally, soon after posting I came
across MKL, and I find it pretty criminal that it's not the default in Anaconda! :)
The CPU version is now either much faster (when I reduce the internal matrices
from 1000x1000 to 200x1000) or on par with my GPU version, so the CPU is better
able to exploit my fundamental optimizations of the problem itself. I'm pretty
curious how this would look on a server-class multi-core CPU.
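For reference, this is roughly how I check that MKL is actually being picked up
and how Theano can be pointed at it explicitly; the -lmkl_rt form below is just
the simplest single-library case and depends on how MKL is installed:

# Check which BLAS NumPy was built against; look for 'mkl' in the output.
import numpy as np
np.__config__.show()

# Point Theano's GEMM at MKL explicitly (must be set before importing theano).
# The single-library '-lmkl_rt' form is an assumption about the MKL install.
import os
os.environ['THEANO_FLAGS'] = 'floatX=float32,device=cpu,blas.ldflags=-lmkl_rt'

import theano  # Theano will now link its BLAS ops using the blas.ldflags above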
Regarding parallel branches: even aside from my specific problem, I see more and
more papers coming out with multiple inputs, forks, and merges within models.
These structures would benefit greatly from parallel branches. Thinking more
about it, such parallelism could be achieved manually by splitting the graph at
nodes with many inputs: one would create shared variables that link the
sub-graphs to the trunk graph, and because Theano launches GPU work
asynchronously, the branches could overlap.
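A rough sketch of what I mean, with made-up branch expressions standing in for
the real model (branch1_out, branch2_out and the buffers below are placeholders,
not my actual graph): each branch becomes its own theano.function that writes
into a shared variable, and the trunk function reads those buffers afterwards.
Whether the kernels really overlap depends on the backend's synchronization
behaviour.

# Placeholder model: two independent branches linked to the trunk via shared
# variables.
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
branch1_out = T.dot(x, x.T)                      # "branch 1" of the graph
branch2_out = T.nnet.sigmoid(x).sum(axis=0)      # "branch 2" of the graph

# Shared variables acting as the link between the branch sub-graphs and the
# trunk graph.
buf1 = theano.shared(np.zeros((1, 1), dtype=theano.config.floatX))
buf2 = theano.shared(np.zeros((1,), dtype=theano.config.floatX))

f_branch1 = theano.function([x], updates=[(buf1, branch1_out)])
f_branch2 = theano.function([x], updates=[(buf2, branch2_out)])
f_trunk = theano.function([], buf1.sum() + buf2.sum())

x_val = np.random.rand(200, 1000).astype(theano.config.floatX)
f_branch1(x_val)    # kernels are launched asynchronously on the GPU,
f_branch2(x_val)    # so these two calls may overlap on the device
result = f_trunk()  # the trunk reads the shared buffers afterwards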
Function profiling
==================
Message: D:\PK scripts\sgd_solver\utils\GameForest.py:137
Time in 1000 calls to Function.__call__: 9.868994e+00s
Time in Function.fn.__call__: 9.794995e+00s (99.250%)
Time in thunks: 9.372134e+00s (94.965%)
Total compile time: 8.510001e-01s
Number of Apply nodes: 276
Theano Optimizer time: 6.790001e-01s
Theano validate time: 1.619997e-01s
Theano Linker time (includes C, CUDA code generation/compiling): 1.150000e-01s
Import time 1.000166e-03s
Node make_thunk time 1.029999e-01s
Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
0)](raw_p:cc/cc/cc/cr0r0r0r0a, Join.0, InplaceDimShuffle{0,x}.0,
TensorConstant{(1L, 1L) of 0.0}) time 1.999855e-03s
Node Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
time 1.000166e-03s
Node Elemwise{Mul}[(0, 1)](Elemwise{Mul}[(0, 1)].0,
InplaceDimShuffle{1,0}.0) time 1.000166e-03s
Node Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0,
0)](raw_p:cc/cc/cc/cr1r0a, Join.0, InplaceDimShuffle{0,x}.0,
TensorConstant{(1L, 1L) of 0.0}) time 1.000166e-03s
Node Gemm{inplace}(Dot22.0, TensorConstant{1.0},
convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
time 1.000166e-03s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 74.134s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply>
<Class name>
65.8% 65.8% 6.164s 1.19e-04s Py 52000 52
theano.tensor.blas.Dot22
25.3% 91.0% 2.369s 1.39e-04s C 17000 17
theano.tensor.blas.Gemm
3.5% 94.5% 0.325s 3.91e-06s C 83000 83
theano.tensor.elemwise.Elemwise
2.1% 96.6% 0.197s 1.09e-05s C 18000 18
theano.tensor.basic.Split
1.9% 98.5% 0.174s 9.65e-06s C 18000 18
theano.tensor.basic.Join
1.1% 99.6% 0.104s 5.77e-06s C 18000 18
theano.tensor.elemwise.Sum
0.4% 100.0% 0.040s 5.71e-07s C 70000 70
theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op
name>
65.8% 65.8% 6.164s 1.19e-04s Py 52000 52
Dot22
25.3% 91.0% 2.369s 1.39e-04s C 17000 17
Gemm{inplace}
1.9% 92.9% 0.174s 9.65e-06s C 18000 18
Join
1.5% 94.4% 0.139s 7.72e-06s C 18000 18
Elemwise{Composite{maximum(((i0 + i1) - i2), i3)}}[(0, 0)]
1.2% 95.6% 0.113s 1.25e-05s C 9000 9
Split{2}
1.1% 96.7% 0.104s 5.77e-06s C 18000 18
Sum{axis=[1], acc_dtype=float64}
0.8% 97.5% 0.077s 4.28e-06s C 18000 18
Elemwise{mul,no_inplace}
0.8% 98.3% 0.074s 1.06e-05s C 7000 7
Split{4}
0.4% 98.8% 0.042s 1.40e-06s C 30000 30
Elemwise{Mul}[(0, 1)]
0.3% 99.1% 0.030s 4.99e-06s C 6000 6
Elemwise{Add}[(0, 2)]
0.3% 99.3% 0.025s 4.80e-07s C 52000 52
InplaceDimShuffle{1,0}
0.2% 99.5% 0.018s 2.25e-06s C 8000 8
Elemwise{Mul}[(0, 0)]
0.2% 99.7% 0.015s 8.33e-07s C 18000 18
InplaceDimShuffle{0,x}
0.1% 99.8% 0.010s 4.99e-06s C 2000 2
Split{3}
0.1% 99.9% 0.010s 4.99e-06s C 2000 2
Elemwise{Add}[(0, 1)]
0.1% 100.0% 0.009s 8.99e-06s C 1000 1
Elemwise{Add}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
2.5% 2.5% 0.237s 2.37e-04s 1000 84
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
2.1% 4.6% 0.194s 1.94e-04s 1000 83
Dot22(ranges_r=3, InplaceDimShuffle{1,0}.0)
1.9% 6.5% 0.182s 1.82e-04s 1000 150
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
1.7% 8.2% 0.157s 1.57e-04s 1000 71
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
1.7% 9.9% 0.156s 1.56e-04s 1000 94
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
1.7% 11.6% 0.156s 1.56e-04s 1000 153
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 0)].0)
1.6% 13.2% 0.154s 1.54e-04s 1000 119
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
1.6% 14.8% 0.152s 1.52e-04s 1000 126
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
1.6% 16.4% 0.150s 1.50e-04s 1000 134
Dot22(convert2reduced_p=0_r=3, Elemwise{Mul}[(0, 1)].0)
1.6% 18.0% 0.150s 1.50e-04s 1000 165
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
1.6% 19.6% 0.149s 1.49e-04s 1000 164
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
1.6% 21.2% 0.147s 1.47e-04s 1000 184
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.6% 22.7% 0.146s 1.46e-04s 1000 160
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 24.3% 0.145s 1.45e-04s 1000 113
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
InplaceDimShuffle{1,0}.0, TensorConstant{1.0})
1.5% 25.8% 0.145s 1.45e-04s 1000 85
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
1.5% 27.4% 0.142s 1.42e-04s 1000 172
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 1)].0, TensorConstant{1.0})
1.5% 28.9% 0.142s 1.42e-04s 1000 188
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 30.4% 0.141s 1.41e-04s 1000 183
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 31.8% 0.137s 1.37e-04s 1000 193
Gemm{inplace}(Dot22.0, TensorConstant{1.0}, convert2reduced_p=0_r=3,
Elemwise{Mul}[(0, 0)].0, TensorConstant{1.0})
1.5% 33.3% 0.137s 1.37e-04s 1000 72
Dot22(convert2reduced_p=0_r=3, InplaceDimShuffle{1,0}.0)
... (remaining 256 Apply instances account for 66.70%(6.25s) of the
runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing
list).
Test them first, as they are not guaranteed to always
provide a speedup.
Sorry, no tip for today.
Post by Patric
Many thanks for the information.
From the profiling log, the CPU is doing quite well here, since there are lots
of data operations such as split and join, which are almost 100x faster on the
CPU. The topology of your model includes a huge number of small GEMM and
Elemwise ops, so I think the larger cache helps on the CPU side. And, as the
thread title says, parallel branches would be a very good idea for independent
compute flows.
Have you used Intel MKL as the backend for GEMM? It should show better
performance.
By the way, I can't open the .p file; any suggestions?
Sharapolas
2017-05-17 04:30:35 UTC
Permalink
Could anyone confirm whether this parallelization could be implemented by
running thunks in parallel? Would it transfer to the C and GPU back-ends?

Of course, this will require making sure that a thunk waits until the thunks it
depends on have completed their job, but I think that is somewhat simpler.
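For example, something along these lines (again only a hypothetical sketch, not
Theano's actual linker code; `thunks` and `deps` are placeholders as before):
group the thunks into waves with no dependencies inside a wave, run each wave
on a thread pool, and only start the next wave once the previous one finished.

# Hypothetical level-synchronous variant of the same idea.
from concurrent.futures import ThreadPoolExecutor


def run_thunks_in_waves(thunks, deps, max_workers=4):
    n = len(thunks)
    remaining = [set(d) for d in deps]
    done = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < n:
            wave = [i for i in range(n)
                    if i not in done and not (remaining[i] - done)]
            if not wave:
                raise RuntimeError("dependency cycle among thunks")
            # Members of a wave are mutually independent, so run them together;
            # consuming the map() iterator is the barrier for this wave.
            list(pool.map(lambda i: thunks[i](), wave))
            done.update(wave)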