Discussion:
[theano-users] Is there a way to speed up this operation in Theano
Šarūnas S.
2017-05-05 08:15:26 UTC
In my current Theano script the bottleneck is equivalent to the following
NumPy code:

import time

import numpy as np

# 3D example
axis = 0
prob = np.random.random( ( 1, 1000, 50 ) )
cases = np.random.random( ( 1000, 1000, 50 ) )

start = time.time( )
for i in xrange( 1000 ):
    result = ( cases * prob ).sum( axis=1-axis, keepdims=True )
print '3D naive method took {} seconds'.format( time.time() - start )
print result.shape
print

In the 2D case I had seen that replacing the elementwise multiply + sum
with a dot product gave me a 5x speedup. Are there any Theano matrix
operations that could help me out here?
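
For reference, a minimal sketch of the 2D trick mentioned above (an editorial illustration with made-up shapes, not code from the original post): for a ( 1, N ) row of probabilities and an ( M, N ) matrix of cases, the broadcast multiply-and-sum is exactly a matrix product, which BLAS handles far faster.

import numpy as np

prob_2d = np.random.random( ( 1, 1000 ) )     # ( 1, N )
cases_2d = np.random.random( ( 500, 1000 ) )  # ( M, N )

# naive: broadcast multiply, then reduce over N
naive = ( cases_2d * prob_2d ).sum( axis=1, keepdims=True )  # ( M, 1 )

# the same reduction written as a single BLAS-backed dot product
fast = np.dot( cases_2d, prob_2d.T )                         # ( M, 1 )

assert np.allclose( naive, fast )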
Šarūnas S.
2017-05-05 10:17:13 UTC
I was shown that in *numpy* I could speed it up in the following ways:

result = np.einsum('ijk,ijk->ik', prob, cases)[:,None,:]


result = np.matmul(prob.transpose(2,0,1), cases.T).T


Both give me the expected speedup in *numpy*, but neither is implemented in
*Theano*. Is there a way to do the same in *Theano* on the *GPU*?
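
A quick consistency check (an editorial sketch, using the shapes from the first post) confirms that both one-liners agree with the naive multiply-and-sum:

import numpy as np

prob = np.random.random( ( 1, 1000, 50 ) )
cases = np.random.random( ( 1000, 1000, 50 ) )

naive = ( cases * prob ).sum( axis=1, keepdims=True )  # ( 1000, 1, 50 )
via_einsum = np.einsum( 'ijk,ijk->ik', prob, cases )[ :, None, : ]
via_matmul = np.matmul( prob.transpose( 2, 0, 1 ), cases.T ).T

assert np.allclose( naive, via_einsum )
assert np.allclose( naive, via_matmul )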
Jesse Livezey
2017-05-05 16:23:12 UTC
I think tensordot should do what you want:
http://deeplearning.net/software/theano/library/tensor/basic.html#theano.tensor.tensordot

Something like:

result = T.tensordot(prob, cases, axes=1)
Šarūnas S.
2017-05-06 07:41:06 UTC
I have tried that, but to no avail. The problem is that I have to multiply
along two axes but sum over only one.
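
A sketch of the mismatch (an editorial illustration, not from the original posts): in einsum terms the operation is 'ijk,ijk->ik', where j is contracted while k is carried along as a batch axis. tensordot can only contract axes, never batch over them (and axes=1 as suggested would try to contract prob's last axis, size 50, against cases' first axis, size 1000). Contracting j alone computes every cross-term between the two k axes, of which only the diagonal is wanted:

import numpy as np

prob = np.random.random( ( 1, 1000, 50 ) )
cases = np.random.random( ( 1000, 1000, 50 ) )

# contract only j: the result keeps BOTH k axes, shape ( 1, 50, 1000, 50 )
full = np.tensordot( prob, cases, axes=[ [ 1 ], [ 1 ] ] )

# only the k == k' "diagonal" is needed; the other 49/50ths are wasted work
diag = full[ :, np.arange( 50 ), :, np.arange( 50 ) ]  # ( 50, 1, 1000 )
result = diag.transpose( 2, 1, 0 )                     # ( 1000, 1, 50 )

naive = ( cases * prob ).sum( axis=1, keepdims=True )
assert np.allclose( naive, result )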
Šarūnas S.
2017-05-08 08:30:32 UTC
Currently, I have 3 approaches that are portable to Theano:

import numpy as np

# 3D example
axis = 0
prob = np.random.random( ( 1, 1000, 50 ) )
cases = np.random.random( ( 1000, 1000, 50 ) )

# Elementwise + sum
result = ( cases * prob ).sum( axis=1-axis, keepdims=True )

# Loop version: one BLAS dot per slice of the last axis
result = np.zeros( ( 1000, 1, 50 ) )
for i in xrange( 50 ):
    result[ :, :, i ] = np.dot( cases[ :, :, i ], prob[ :, :, i ].T )

# Block diagonal sparse dot version
prob_big = np.zeros( ( 1, 1000, 50, 50 ) )
cases_big = np.zeros( ( 1000, 1000, 50, 50 ) )

for i in xrange( 50 ):
    prob_big[ :, :, i, i ] = prob[ :, :, i ]
    cases_big[ :, :, i, i ] = cases[ :, :, i ]

# contract j together with the block axes, then read off the diagonal
intermediate = np.tensordot( prob_big, cases_big, axes=[ [ 1, 3 ], [ 1, 2 ] ] )
result = np.zeros( ( 1000, 1, 50 ) )
for i in xrange( 50 ):
    result[ :, 0, i ] = intermediate[ 0, i, :, i ]

I think the one that structures this as a sparse block-diagonal matrix
would work best, since I've seen some support for block-sparse matrices.
However, it looks like I would still need some loop for blocksparse to
iterate over all the blocks. Is there a way to somehow do all the blocks
at once and collect the diagonal without using scan?
Jesse Livezey
2017-05-08 23:22:37 UTC
I see, you can use batched_dot for that. I wrote a gist which compares the
numpy matmul, theano batched_dot, and theano multiply-and-sum approaches:
https://gist.github.com/JesseLivezey/42cabcf87aa0033410f7520933942127

On the GPU, the multiply-and-sum seems to be fastest, but it will also use
more memory.
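
For concreteness, a minimal sketch of the batched_dot formulation (an editorial reconstruction; the gist above is the authoritative comparison). T.batched_dot batches over the leading axis, so k is moved to the front with dimshuffle:

import theano
import theano.tensor as T

prob = T.ftensor3( 'prob' )    # ( 1, 1000, 50 )
cases = T.ftensor3( 'cases' )  # ( 1000, 1000, 50 )

# batch over k: ( 50, 1, 1000 ) x ( 50, 1000, 1000 ) -> ( 50, 1, 1000 )
out = T.batched_dot( prob.dimshuffle( 2, 0, 1 ), cases.dimshuffle( 2, 1, 0 ) )
result = out.dimshuffle( 2, 1, 0 )  # back to ( 1000, 1, 50 )

f = theano.function( [ prob, cases ], result )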
Sharapolas
2017-05-09 08:14:28 UTC
Thanks Jesse,

After experimenting a lot I ended up wrapping *einsum* in a custom Op and
using it on the CPU. In my case the CPU + einsum version is faster than
GPU + multiply + sum.

This is the wrapper I'm using:

import os

os.environ['THEANO_FLAGS'] = 'mode=FAST_RUN,floatX=float32,device=cpu,openmp=True,openmp_elemwise_minsize=10'
os.environ['THEANO_FLAGS'] += ',allow_gc=False'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'

import numpy as np
import theano as th
import theano.tensor as T


class Einsum( th.Op ):
    # include the einsum signature in __props__ so that Ops built with
    # different signatures hash and compare as different
    __props__ = ( '_code', )

    def __init__( self, code ):
        self._code = code

    def make_node( self, *inputs ):
        inputs = [ T.as_tensor_variable( i ) for i in inputs ]

        # infer the output rank from the signature part after '->'
        # (assumes float32 inputs, matching floatX=float32 above)
        out_ndim = len( self._code[ self._code.rindex( '->' )+2: ] )
        if out_ndim == 0:
            outputs = [ T.fscalar( ) ]
        elif out_ndim == 1:
            outputs = [ T.fvector( ) ]
        elif out_ndim == 2:
            outputs = [ T.fmatrix( ) ]
        elif out_ndim == 3:
            outputs = [ T.ftensor3( ) ]
        else:
            raise NotImplementedError

        return th.Apply( self, inputs, outputs )

    def perform( self, node, inputs, output_storage ):
        # defer the actual computation to numpy's einsum
        output_storage[ 0 ][ 0 ] = np.einsum( self._code, *inputs )


def einsum( code, *inputs ):
    return Einsum( code )( *inputs )
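
A minimal usage sketch for the wrapper above (an editorial example; shapes follow the earlier posts):

prob = T.ftensor3( 'prob' )
cases = T.ftensor3( 'cases' )

result = einsum( 'ijk,ijk->ik', prob, cases )
f = th.function( [ prob, cases ], result )

out = f( np.random.random( ( 1, 1000, 50 ) ).astype( 'float32' ),
         np.random.random( ( 1000, 1000, 50 ) ).astype( 'float32' ) )
print out.shape  # ( 1000, 50 )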