[theano-users] Is there a way to speed this operation in theano
Šarūnas S.
2017-05-05 08:15:26 UTC
In my current theano script the bottleneck is equivalent to the following
numpy code:

import time
import numpy as np

# 3D example
axis = 0
prob = np.random.random( ( 1, 1000, 50 ) )
cases = np.random.random( ( 1000, 1000, 50 ) )

start = time.time( )
for i in range( 1000 ):
    # Multiply elementwise (prob broadcasts over axis 0), then sum over axis 1
    result = ( cases * prob ).sum( axis=1-axis, keepdims=True )
print( '3D naive method took {} seconds'.format( time.time() - start ) )
print( result.shape )

I have seen in the 2D case that replacing elementwise + sum with a dot
product gave me a 5x speedup. Are there any Theano matrix operations that
could help me out here?
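
For readers following along, here is a minimal NumPy sketch of the 2D case mentioned above. The shapes and the names `prob_2d`, `cases_2d`, `naive_2d` and `dot_2d` are assumptions chosen for illustration, not taken from the original script. It only shows that the two forms compute the same values; it does not measure the speedup.

```python
import numpy as np

rng = np.random.RandomState(0)
prob_2d = rng.random_sample((1, 1000))     # one row of weights over j
cases_2d = rng.random_sample((500, 1000))  # 500 rows indexed by i

# Elementwise multiply, then sum over j: builds a (500, 1000) temporary.
naive_2d = (cases_2d * prob_2d).sum(axis=1, keepdims=True)

# The same contraction written as a single matrix product.
dot_2d = np.dot(cases_2d, prob_2d.T)

print(naive_2d.shape, dot_2d.shape)
print(np.allclose(naive_2d, dot_2d))
```

The dot form avoids the full-size temporary array, which is one plausible reason for the speedup reported in the 2D case.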
You received this message because you are subscribed to the Google Groups "theano-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Šarūnas S.
2017-05-05 10:17:13 UTC
I was shown that in *numpy* I could speed it up in the following way:

result = np.einsum('ijk,ijk->ik', prob, cases)[:,None,:]

result = np.matmul(prob.transpose(2,0,1), cases.T).T

Both give me the expected speedup in *numpy*, but neither is implemented in
*Theano*. Is there a way to do the same in *Theano* on the *GPU*?
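
As a sanity check, the two expressions above can be compared against the original multiply and sum on small arrays. This is a sketch under assumed shapes (5, 7 and 3 stand in for 1000, 1000 and 50), and the names `reference`, `via_einsum` and `via_matmul` are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
prob = rng.random_sample((1, 7, 3))
cases = rng.random_sample((5, 7, 3))

# Original formulation: multiply elementwise, sum over axis 1.
reference = (cases * prob).sum(axis=1, keepdims=True)

# einsum: i and k are kept, j is summed; prob's length-1 axis is broadcast.
via_einsum = np.einsum('ijk,ijk->ik', prob, cases)[:, None, :]

# matmul: k becomes the leading (batch) axis of both operands.
via_matmul = np.matmul(prob.transpose(2, 0, 1), cases.T).T

print(reference.shape, via_einsum.shape, via_matmul.shape)
print(np.allclose(reference, via_einsum), np.allclose(reference, via_matmul))
```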
Jesse Livezey
2017-05-05 16:23:12 UTC
I think tensordot should do what you want, something like:
result = T.tensordot(prob, cases, axes=1)
Šarūnas S.
2017-05-06 07:41:06 UTC
I have tried that, but to no avail. The problem is that I have to multiply
on 2 axes, but sum only on 1.
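
To make the mismatch concrete, here is a small NumPy sketch (shapes reduced to 5, 7 and 3 for illustration; the names `wanted` and `contracted` are assumptions). The last axis is shared by both inputs and kept in the output, while tensordot treats every axis that is not summed as independent and so keeps one copy per input.

```python
import numpy as np

prob = np.ones((1, 7, 3))
cases = np.ones((5, 7, 3))

# Wanted: result[i, 0, k] = sum_j prob[0, j, k] * cases[i, j, k]
wanted = (cases * prob).sum(axis=1, keepdims=True)

# tensordot can sum over j, but it keeps a separate k axis for each input.
contracted = np.tensordot(prob, cases, axes=[[1], [1]])

print(wanted.shape)      # (5, 1, 3)
print(contracted.shape)  # (1, 3, 5, 3)
```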
Šarūnas S.
2017-05-08 08:30:32 UTC
Currently, I have 3 approaches that are portable to theano:

# 3D example
axis = 0
prob = np.random.random( ( 1, 1000, 50 ) )
cases = np.random.random( ( 1000, 1000, 50 ) )

# Elementwise + sum
for i in range( 100 ):
    result = ( cases * prob ).sum( axis=1-axis, keepdims=True )

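# Loop version: one matrix product per slice along the last axis
result = np.zeros( ( 1000, 1, 50 ) )
for i in range( 50 ):
    # ( 1000, 1000 ) dot ( 1000, 1 ) -> ( 1000, 1 ); sums over axis 1 of cases
    result[ :, :, i ] = np.dot( cases[ :, :, i ], prob[ :, :, i ].T )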

# Block diagonal sparse dot version
# (conceptual: a dense cases_big of this size needs about 20 GB as float64)
prob_big = np.zeros( ( 1, 1000, 50, 50 ) )
cases_big = np.zeros( ( 1000, 1000, 50, 50 ) )

for i in range( 50 ):
    prob_big[ :, :, i, i ] = prob[ :, :, i ]
    cases_big[ :, :, i, i ] = cases[ :, :, i ]

# Contract over axis 1 and over the shared block axis -> ( 1, 50, 1000, 50 )
intermediate = np.tensordot( prob_big, cases_big, axes=[ [ 1, 3 ], [ 1, 2 ] ] )
result = np.zeros( ( 1000, 1, 50 ) )
for i in range( 50 ):
    result[ :, :, i ] = intermediate[ :, i, :, i ].T

I think the one which would structure this as a sparse block diagonal
matrix would work best since I've seen some support for the block sparse
matrices. However, it looks like I would still need some loop for
blocksparse to iterate over all the blocks. Is there a way to somehow do
all the blocks at once and collect the diagonal without using scan?
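
In plain NumPy, the "all blocks at once, then collect the diagonal" idea can be sketched without an explicit loop as below. This is only an illustration on small assumed shapes (5, 7, 3), with illustrative names `full`, `diag` and `reference`; whether an equivalent diagonal extraction runs efficiently in Theano on the GPU is not checked here. Note that it computes every (k, m) pair and then discards the off-diagonal ones, so it does roughly 50 times more multiplications than needed at the original sizes.

```python
import numpy as np

rng = np.random.RandomState(0)
prob = rng.random_sample((1, 7, 3))
cases = rng.random_sample((5, 7, 3))

# full[k, i, m] = sum_j prob[0, j, k] * cases[i, j, m]
full = np.tensordot(prob[0], cases, axes=[[0], [1]])  # shape (3, 5, 3)

# Keep only the entries with k == m; the diagonal becomes the last axis.
diag = np.diagonal(full, axis1=0, axis2=2)            # shape (5, 3)
result = diag[:, None, :]                             # shape (5, 1, 3)

reference = (cases * prob).sum(axis=1, keepdims=True)
print(result.shape, np.allclose(result, reference))
```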
Jesse Livezey
2017-05-08 23:22:37 UTC
I see, you can use batched_dot for that. I wrote a gist which compares the
numpy matmul, theano batched_dot, and theano multiply and sum approaches.

On GPU, the multiply and sum seems to be fastest, but it will also use more
memory.
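
The gist itself is not reproduced in this thread. As a rough NumPy sketch of the batched layout (an assumption about how the data would be arranged, not the gist's code): batched_dot treats the first axis of both inputs as the batch axis, so the shared last axis has to be moved to the front first. Shapes are reduced to 5, 7 and 3, and `cases_b`, `prob_b` and `batched` are illustrative names.

```python
import numpy as np

rng = np.random.RandomState(0)
prob = rng.random_sample((1, 7, 3))
cases = rng.random_sample((5, 7, 3))

# Move k to the front so it acts as the batch axis.
cases_b = cases.transpose(2, 0, 1)   # (k, i, j) -> (3, 5, 7)
prob_b = prob.transpose(2, 1, 0)     # (k, j, 1) -> (3, 7, 1)

# One (i, j) x (j, 1) matrix product per k; np.matmul batches over axis 0.
batched = np.matmul(cases_b, prob_b)     # (3, 5, 1)
result = batched.transpose(1, 2, 0)      # back to (i, 1, k) -> (5, 1, 3)

reference = (cases * prob).sum(axis=1, keepdims=True)
print(result.shape, np.allclose(result, reference))
```

In Theano the same two transposes would presumably be written with dimshuffle before calling T.batched_dot, but that part is untested here.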
Šarūnas S.
2017-05-09 08:14:28 UTC
Thanks Jesse,

After experimenting a lot I ended up wrapping *einsum* and using it on
the CPU. In my case the CPU + einsum version is faster than GPU + multiply +
sum.

This is the wrapper I'm using:

import os
import numpy as np

os.environ['THEANO_FLAGS'] =
os.environ['THEANO_FLAGS'] += ',allow_gc=False,'
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'

import theano as th
import theano.tensor as T

class Einsum(th.Op):
    __props__ = ()

    # itypes = [ T.ftensor3, T.ftensor3 ]
    # otypes = [ T.fmatrix ]

    itypes = None
    otypes = None

    def make_node(self, *inputs ):
        # x = th.tensor.as_tensor_variable( inputs[ 0 ] )
        # Note: using x_.type() is dangerous, as it copies x's broadcasting
        # behaviour

        # The output rank is the number of index labels after '->'
        out_ndim = len( self._code[ self._code.rindex( '->' )+2: ] )

        outputs = []
        if out_ndim == 0:
            outputs.append( T.fscalar( ) )
        elif out_ndim == 1:
            outputs.append( T.fvector( ) )
        elif out_ndim == 2:
            outputs.append( T.fmatrix( ) )
        elif out_ndim == 3:
            outputs.append( T.ftensor3( ) )
        else:
            raise NotImplementedError

        return th.Apply(self, inputs, outputs )

    def __init__( self, code ):
        self._code = code

    def perform( self, node, inputs, output_storage ):
        # Runs on the CPU through numpy; expects exactly two inputs
        x = inputs[ 0 ]
        y = inputs[ 1 ]
        z = output_storage[ 0 ]
        z[0] = np.einsum( self._code, x, y )

def einsum( code, *inputs ):

    out = Einsum( code )( *inputs )

    return out
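
The way the wrapper picks its output type can be tried without Theano. The helper below is a hypothetical stand-in (the name `einsum_output_ndim` is not part of the wrapper or of any library) that applies the same rule as make_node, counting the labels after '->', and compares it with the rank numpy actually returns for the subscripts used earlier in this thread.

```python
import numpy as np

def einsum_output_ndim(code):
    """Count the output labels after '->' in an explicit einsum string."""
    if '->' not in code:
        raise ValueError('expected an explicit einsum string containing "->"')
    return len(code[code.rindex('->') + 2:].strip())

# Small float32 inputs, matching the float32 output types used by the wrapper.
a = np.ones((1, 7, 3), dtype=np.float32)
b = np.ones((5, 7, 3), dtype=np.float32)

code = 'ijk,ijk->ik'
out = np.einsum(code, a, b)

print(einsum_output_ndim(code), out.ndim, out.dtype)
```

Because the rule relies on the text after '->', the wrapper only works with explicit-mode subscripts; an implicit string such as 'ij,jk' makes rindex raise a ValueError.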