Discussion:
[theano-users] Significant increase in GPU memory consumption with new GPU backend
Fabian Stemmer
2017-06-22 06:38:26 UTC
Hi,

I recently tried to switch my CNN implementation to the new theano GPU
backend. To do so, I switched from "device=gpu" to "device=cuda" with
theano 0.9 and libgpuarray installed. My theano code then works with the new
backend without any further changes.
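
For reference, the only change was the device flag, along these lines ("train.py" just stands in for my actual training script):

# old backend
THEANO_FLAGS="device=gpu" python train.py
# new backend
THEANO_FLAGS="device=cuda" python train.py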

However, when I do this, I see my GPU memory consumption increase
drastically. When I use theano memory profiling, both GPU backends show the
same memory consumption, but when I use nvidia-smi to monitor memory usage
while the job is running, the old backend hovers somewhere around 400MB,
while the new backend uses 2GB for the same model size and data. When I try
to train larger models, the new GPU backend fails with memory errors for
much smaller models than the old backend. This is also true when I activate
memory pre-allocation.

I tried to remove parts of my model or exclude certain theano optimizations
(e.g. exclude conv_dnn to force theano to use a different convolution
algorithm) but nothing I changed in the model structure had an impact on
the discrepancy I see in memory usage.
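
(For completeness, excluding the cuDNN convolution ops was done via the optimizer_excluding flag, roughly like this; "train.py" is again just a placeholder:)

# force theano to pick a non-cuDNN convolution implementation
THEANO_FLAGS="device=cuda,optimizer_excluding=conv_dnn" python train.py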

I use CUDA 8.0 and cuDNN 5105 for these experiments. For the old backend I
see very similar behavior for both the 0.8.2 and 0.9.0 releases. For the
new backend I tested the 0.9.0 release as well as a recent github checkout
(commit c5cd87fa7895dc44c7acd54cb85e6d232b33bd3a) - both showed the same
memory increase.

I attached log files including my model's computational graph and
information on libraries, environment variables, etc. Please let me know if
I can supply any additional information to make it easier to look into
this. I tried to prepare a simple sample script to reproduce the behavior,
but was so far unable to do so.

Thanks
Fabian
Fabian Stemmer
2017-06-22 09:33:29 UTC
One addition:
The theano 0.9.0 setup used libgpuarray v0.6.2.
The theano 0.10.dev setup used libgpuarray v0.6.5 - I just updated to
v0.6.7 and tested again, but I still get ~2GB memory usage.
Frédéric Bastien
2017-06-22 13:22:56 UTC
Do you use the Theano flag: gpuarray.preallocate=1? When you tried the
preallocation, how did you use it?

It is mostly equivalent to lib.cnmem, but our default is different: by
default it gives more speed-up, but it can sometimes cause memory
fragmentation. The flag above avoids the fragmentation that can happen with
the default.
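
For example, something along these lines on the command line (the same value can also go in the [gpuarray] section of .theanorc):

# preallocate GPU memory up front (a value between 0 and 1 is a fraction of total memory)
THEANO_FLAGS="device=cuda,gpuarray.preallocate=1" python train.py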
Fabian Stemmer
2017-06-22 13:41:30 UTC
When I did use preallocation, I used lib.cnmem=1 for theano 0.8.2 and
gpuarray.preallocate=1 for theano 0.9.0 and 0.10.dev.
For most experiments (including those in the log files) I did not use
preallocation, because the only way I could see the difference in memory
usage was through nvidia-smi, which only shows the static pre-allocation
when it is used.
I believe the problem does not disappear with pre-allocation, since I see
my training crash for much smaller models with the new backend even then.
However, I cannot measure the effect of switching backends on GPU memory
when I use preallocation.
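
(For the record, I monitor the memory usage with a simple nvidia-smi polling loop while the job runs, roughly:)

# print the used GPU memory once per second
nvidia-smi --query-gpu=memory.used --format=csv -l 1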
Frédéric Bastien
2017-06-22 13:44:45 UTC
The equivalent to the old back-end setting for memory is:
gpuarray.preallocate=-1.

The new back-end by default caches all calls to cudaMalloc() to speed up
computation. This flag disables that cache, which is the same default
behavior as the old back-end.
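
So to get the old back-end's allocation behavior you would run with something like ("train.py" being your script):

# disable the cudaMalloc cache, as in the old back-end
THEANO_FLAGS="device=cuda,gpuarray.preallocate=-1" python train.py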
Fabian Stemmer
2017-07-10 06:42:38 UTC
Thanks, by setting gpuarray.preallocate=-1 I now get similar behavior for
the new backend as for the old.

Do I understand correctly that leaving preallocate at its default (new
backend) will not result in higher memory consumption, but merely doesn't
free memory once allocated, so that what I see in nvidia-smi is the maximum
memory consumption up to that point?

A related question: When I run with profile=True,profile_memory=True -
shouldn't the max GPU memory stat in the profiling correspond to what I see
in nvidia-smi when I run with preallocate on default?

Currently, I see ~400MB GPU memory usage in profiling and that's what I see
with preallocate=-1 too (although I can't guarantee there aren't higher
spikes that I don't see with nvidia-smi). When I leave preallocate at
default, I see GPU memory usage of ~2GB (but the profiling still reports only
400MB).
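
(For reference, the profiling numbers above come from running with the profiling flags enabled, e.g.:)

# the memory profile is printed when the process exits; "train.py" is a placeholder
THEANO_FLAGS="device=cuda,profile=True,profile_memory=True" python train.py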

Thanks
Fabian
Pascal Lamblin
2017-07-11 01:23:44 UTC
Post by Fabian Stemmer
Thanks, by setting gpuarray.preallocate=-1 I now get similar behavior for
the new backend as for the old.
Do I understand correctly, that leaving preallocate at default behavior
(new backend) will not result in higher memory consumption, but merely
doesn't free memory once allocated, so what I see in nvidia-smi is
max-memory consumption up to this point?
Not really, it can actually result in higher memory consumption due to the
way new memory blocks are allocated. For instance, in the worst case, if a
1 MB tensor gets allocated and deallocated, and then a 2 MB tensor is
requested, a new 2 MB block will be added to the pool; it will not be
mergeable with the first one, and even once both are freed, a 3 MB tensor
cannot be "split" across the two blocks. Due to that fragmentation effect,
allocating / deallocating 1 MB, then 2 MB, then 3 MB, etc., will end up
using 1 + 2 + 3 + ... MB in total on the GPU.
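
A rough way to picture this worst case (just an illustrative model of a cache that never merges or splits freed blocks, not the actual libgpuarray allocator):

def reserved_after(requests_mb):
    cached = []     # sizes (MB) of freed blocks kept in the cache
    reserved = 0    # total memory obtained from cudaMalloc so far
    for size in requests_mb:
        fit = next((b for b in cached if b >= size), None)
        if fit is not None:
            cached.remove(fit)   # reuse a cached block that is large enough...
            cached.append(fit)   # ...and return it to the cache afterwards
        else:
            reserved += size     # nothing fits, so a new block is allocated
            cached.append(size)  # and it stays in the cache once freed
    return reserved

print(reserved_after([1, 2, 3, 4]))  # -> 10 MB reserved, although at most 4 MB is live
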
Post by Fabian Stemmer
A related question: When I run with profile=True,profile_memory=True -
shouldn't the max GPU memory stat in the profiling correspond to what I see
in nvidia-smi when I run with preallocate on default?
Again, not really, due to that fragmentation effect.
Post by Fabian Stemmer
Currently, I see ~400MB GPU memory usage in profiling and that's what I
see with preallocate=-1 too (although I can't guarantee there aren't
higher spikes that I don't see with nvidia-smi). When I leave preallocate
at default, I see GPU memory usage ~2GB (but the profiling still reports
only 400MB).
Preallocating 400 or 500 MB may avoid fragmentation and bring the total
consumption peak closer to what is actually allocated to arrays.
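
If I remember the flag semantics correctly, a value greater than 1 is interpreted as megabytes, so that would be something like:

# preallocate roughly 500 MB up front (values > 1 are taken as megabytes)
THEANO_FLAGS="device=cuda,gpuarray.preallocate=500" python train.py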
Anton Murashov
2017-08-30 00:30:49 UTC
Hello all!

I have a very similar problem with the new gpuarray backend; it has the
following undesired behaviour:

(a) with preallocation turned ON (any value above and including zero) it
crashes with a cuMemAlloc error (OutOfMemory) on a problem of my size (smaller
problems work)
(b) with preallocation turned ON and a small problem being fitted,
interrupting the kernel and restarting results in a cuMemAlloc error
(OutOfMemory)
(c) with preallocation turned OFF (preallocation=-1) it does not even start
fitting, failing with a cuMemAlloc error (invalid argument, NOT OutOfMemory)

GpuArrayException: ('The following error happened while compiling the
node', forall_inplace,gpu,grad_of_scan_fn}(TensorConstant{1000},
GpuSubtensor{int64:int64:int64}.0, GpuElemwise{Composite{(i0 -
sqr(i1))}}[]<gpuarray>.0, GpuElemwise{tanh,no_inplace}.0,
InplaceGpuDimShuffle{0,2,1}.0, GpuAlloc<None>{memset_0=True}.0,
GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0,
GpuSubtensor{int64:int64:int64}.0, GpuAlloc<None>{memset_0=True}.0,
GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0,
TensorConstant{1000}, GpuSubtensor{::, int64:int64:}.0,
InplaceGpuDimShuffle{1,0}.0, GpuSubtensor{::, :int64:}.0, GpuSubtensor{::,
int64::}.0, InplaceGpuDimShuffle{1,0}.0, GpuSubtensor{::, int64:int64:}.0,
InplaceGpuDimShuffle{1,0}.0, InplaceGpuDimShuffle{1,0}.0,
GpuAlloc<None>{memset_0=True}.0), '\n', 'cuMemAlloc:
CUDA_ERROR_INVALID_VALUE: invalid argument')

Needless to say, on the old backend everything works fine, just 20% slower
(on problems which actually start fitting on both backends). I use the
versions currently supplied with Anaconda (theano-0.9, libgpuarray 0.6.9,
pygpu 0.6.9).
Frédéric Bastien
2017-08-30 14:59:11 UTC
Update to the Theano dev version. There are many updates that could help you.

If that doesn't fix your problem, open an issue on GitHub.

For preallocation, which flag do you use?
Anton Murashov
2017-08-30 15:20:26 UTC
Actually, initially I tried theano-0.10-dev-0b1 or something like that, which
appears to be the most recent dev version; I later re-installed theano-0.9,
which is part of the Anaconda package.

As for the preallocate flag, I tried the following options:

(a) 1 and 0 (big problems crash with OutOfMem; some problems work initially
but crash with OutOfMem if the fit is restarted after a kernel interrupt).

(b) -1 (model.fit crashes on a problem of any size, even those which work in
(a) initially, with an invalid argument error in cuMemAlloc) --> this one
appears to be an outright bug.

Should I open a GitHub ticket?
Frédéric Bastien
2017-08-30 15:52:14 UTC
What is the name of the flag you used? The name changed with the new
back-end.

Make sure to use the GitHub version, not a tagged version.

Frédéric
Anton Murashov
2017-08-30 17:23:09 UTC
Permalink
1. For example, in .theanorc:

[gpuarray]
preallocate=-1

In my Jupyter notebook I confirm this by printing
theano.config.gpuarray.preallocate, which shows it is -1.
On import of Theano, the warning that GPU memory caching is off is also
duly printed.

2. I got the Theano dev version by following this manual:

http://deeplearning.net/software/theano_versions/dev/install_ubuntu.html

git clone git://github.com/Theano/Theano.git
cd Theano
pip install -e .

Later I confirm the Theano version in Jupyter by printing
theano.__version__, which duly shows 0.10-dev-1b0 or something similar
(a quick check script for both the flag and the version is sketched below,
after point 3).

3. I tried getting a newer version of libgpuarray (0.7, also a development
one), but there is a check in Theano: it wants "1" as the major version of
the API, while in 0.7 it is already "2". So the latest version of
libgpuarray you can use (which, presumably, contains the bug we are
discussing here) is 0.6.9, which is already supplied with Anaconda as part
of pygpu-0.6.9. I tried compiling my own libgpuarray doing this:

git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray
git checkout tags/v0.6.5 -b v0.6.9

and then building it and making sure the relevant libs find their way into
the Python instance - with no difference in the result: still OutOfMem when
preallocate >= 0, and "invalid argument" when preallocate < 0 once model.fit
gets called.
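A minimal check along those lines, assuming the flags are set before Theano
is imported (the THEANO_FLAGS value below is only an example):

import os

# Theano reads THEANO_FLAGS at import time, so set it before the import
# (or put the same settings in .theanorc, as in point 1).
os.environ.setdefault("THEANO_FLAGS", "device=cuda,gpuarray.preallocate=-1")

import theano

print(theano.__version__)                  # the installed version string
print(theano.config.device)                # cuda
print(theano.config.gpuarray.preallocate)  # -1 here, i.e. caching disabled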

On Wed, Aug 30, 2017 at 5:52 PM, Frédéric Bastien <
Post by Frédéric Bastien
What is the name of the flag you used? The name changed with the new
back-end.
Make sure to use the GitHub version, not a tagged version.
Frédéric
Post by Anton Murashov
Actually, initially I tried theano-0.10-dev-0b1 or something like this,
which appears to be the most recent dev version; I later re-installed
theano-0.9, which is part of the Anaconda package.
As for the preallocate values I tried:
(a) 1 and 0: big problems crash with OutOfMem; some problems work
initially but crash with OutOfMem if fit is restarted after a kernel
interrupt.
(b) -1: model.fit crashes on a problem of any size (even those that work
initially in (a)) with an "invalid argument" error in cuMemAlloc --> this
one appears to be an outright bug.
Should I open a GitHub ticket?
Post by Frédéric Bastien
Update to the Theano dev version. There are many updates that could help
you. If that doesn't fix your problem, open an issue on GitHub.
For preallocation, which flag do you use?
Post by Anton Murashov
Hello all!
I have a very similar problem with the new gpuarray backend:
(a) with preallocation turned ON (any value greater than or equal to
zero) it crashes with a cuMemAlloc error (OutOfMemory) on a problem of my
size (smaller problems work);
(b) with preallocation turned ON and a small problem being fitted,
interrupting the kernel and restarting results in a cuMemAlloc error
(OutOfMemory);
(c) with preallocation turned OFF (preallocate=-1) it does not even
start fitting, failing with a cuMemAlloc error (invalid argument, NOT
OutOfMemory!):
GpuArrayException: ('The following error happened while compiling the
node', forall_inplace,gpu,grad_of_scan_fn}(TensorConstant{1000},
GpuSubtensor{int64:int64:int64}.0, GpuElemwise{Composite{(i0 -
sqr(i1))}}[]<gpuarray>.0, GpuElemwise{tanh,no_inplace}.0,
InplaceGpuDimShuffle{0,2,1}.0, GpuAlloc<None>{memset_0=True}.0,
GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0,
GpuSubtensor{int64:int64:int64}.0, GpuAlloc<None>{memset_0=True}.0,
GpuAlloc<None>{memset_0=True}.0, GpuAlloc<None>{memset_0=True}.0,
TensorConstant{1000}, GpuSubtensor{::, int64:int64:}.0,
InplaceGpuDimShuffle{1,0}.0, GpuSubtensor{::, :int64:}.0, GpuSubtensor{::,
int64::}.0, InplaceGpuDimShuffle{1,0}.0, GpuSubtensor{::, int64:int64:}.0,
InplaceGpuDimShuffle{1,0}.0, InplaceGpuDimShuffle{1,0}.0,
CUDA_ERROR_INVALID_VALUE: invalid argument')
Needless to say, on the old backend everything works fine, just 20% slower
(on problems which actually start fitting on both backends). I use the
versions currently supplied with Anaconda (theano-0.9, libgpuarray 0.6.9,
pygpu 0.6.9).
Post by Pascal Lamblin
Post by Fabian Stemmer
Thanks, by setting gpuarray.preallocate=-1 I now get similar behavior
for the new backend as for the old.
Do I understand correctly that leaving preallocate at the default
(new backend) will not result in higher memory consumption, but merely
not free memory once it has been allocated, so that what I see in
nvidia-smi is the maximum memory consumption up to that point?
Not really, it can actually result in higher memory consumption due to
the way new memory blocks are allocated. For instance, in the worst case,
if a 1 MB tensor gets allocated and deallocated and then a 2 MB tensor is
requested, a new 2 MB block will be added to the pool; however, it will
not be mergeable with the first one, so even once it is freed, a 3 MB
tensor cannot be "split" across the first two blocks. Due to that
fragmentation effect, allocating and deallocating 1 MB, then 2 MB, then
3 MB, etc., will end up using 1 + 2 + 3 + ... MB in total on the GPU.
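A toy model of that effect, assuming a pool that caches freed blocks but
never merges or splits them (an illustration only, not the actual
libgpuarray allocator):

def reserved_after_increasing_allocs(n_mb):
    """Allocate and free a 1 MB, 2 MB, ..., n_mb MB tensor in turn against
    a pool that reuses a cached block only if it is large enough and never
    merges free blocks."""
    free_blocks = []     # sizes (in MB) of cached, currently free blocks
    total_reserved = 0   # total MB ever requested from the driver
    for size in range(1, n_mb + 1):
        candidates = [b for b in free_blocks if b >= size]
        if candidates:
            block = min(candidates)   # reuse the smallest block that fits
            free_blocks.remove(block)
        else:
            block = size              # nothing fits: reserve a new block
            total_reserved += size
        free_blocks.append(block)     # tensor is freed, block returns to the pool
    return total_reserved

print(reserved_after_increasing_allocs(10))  # 55 MB reserved for a 10 MB peak working set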
Post by Fabian Stemmer
A related question: when I run with profile=True,profile_memory=True,
shouldn't the max GPU memory stat in the profiling correspond to what I
see in nvidia-smi when I run with preallocate at the default?
Again, not really, due to that fragmentation effect.
Post by Fabian Stemmer
Currently, I see ~400MB GPU memory usage in the profiling, and that's what
I see with preallocate=-1 too (although I can't guarantee there aren't
higher spikes that I don't see with nvidia-smi). When I leave preallocate
at the default, I see ~2GB of GPU memory usage (but the profiling still
reports only 400MB).
Preallocating 400 or 500 MB may avoid fragmentation and bring the
total consumption peak closer to what is actually allocated to arrays.
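If the flag works the way I understand it (a preallocate value greater
than 1 is taken as a number of megabytes rather than a fraction of GPU
memory - that reading is an assumption, not something stated above), the
suggestion would look like this in .theanorc:

[gpuarray]
preallocate=500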
Post by Fabian Stemmer
Thanks
Fabian
Post by Frédéric Bastien
gpuarray.preallocate=-1.
The new back-end by default caches all calls to cudaMalloc() to speed up
computation. This flag disables that cache, which is the same default as
the old back-end.
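For instance, on the command line (train_cnn.py is only a placeholder
script name):

THEANO_FLAGS=device=cuda,gpuarray.preallocate=-1 python train_cnn.py  # train_cnn.py is a placeholder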
--
_______________________________

Anton Murashov
Quantstellation.Centaurus
desk +44 748 1916031
--
---
You received this message because you are subscribed to the Google Groups "theano-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-users+***@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.