210000% more RAM usage than expected when reading dask arrays #4448
The next thing to do is probably to distill your situation into a minimal reproducible example. See
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
On Fri, Feb 1, 2019 at 6:48 AM, arkanoid87 wrote:
I'm a novice user, but I can hardly believe that this is the expected performance of Dask.
1. I've built my graph of operations with a persist at the end. The only node held in memory is the last one (I've called persist() on it).
![screen shot 2019-02-01 at 06 02 22](https://user-images.githubusercontent.com/1446586/52128929-88fecc00-2636-11e9-9dee-a74310411b01.png)
2. I successfully get the dask.array out of it:
result = predicted_y_img_rechunk.compute()
It is a 27915 x 22415 float16 image, ~1300MB according to the math:
result.shape, result.chunksize, result.dtype
((27915, 22415), (5000, 5000), dtype('float16'))
From this point I've been testing many different possibilities: to_hdf, to_zarr, to_npy, even a custom store() function, but all of them cause my RAM to fill up within seconds, up to 128GB, and crash the kernel.
The simplest example is a np.array conversion:
result_np = np.array(result)
My data would fit in RAM many times over, but I'm stuck with a dask.array that I cannot save to disk.
I am on a distributed environment with 2 workers. The client is one of the workers.
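For reference, the usual way to write a large dask array without materializing it is to store it chunk by chunk. A minimal sketch with da.store, using a preallocated NumPy array as a stand-in for an on-disk target such as a zarr array or np.memmap (shapes scaled down so the example runs quickly):

```python
import numpy as np
import dask.array as da

# A chunked, lazy array standing in for the 27915 x 22415 image
# (scaled down for the example).
darr = da.ones((1000, 800), chunks=(250, 200), dtype="float16")

# Preallocated target; in practice this would be a zarr array, an HDF5
# dataset, or an np.memmap, so each chunk goes straight to disk.
target = np.empty(darr.shape, dtype=darr.dtype)

# store() writes one chunk at a time; unlike result.compute() or
# np.array(result), the full array is never built in memory at once.
da.store(darr, target)

print(target.shape, target.dtype)
```

Here the target is in RAM only because it is a toy example; the point is that each task touches one chunk, so a disk-backed target bounds memory use by the chunk size.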
Please consider examples A and B (their code blocks were lost when this thread was exported).

A fits in RAM nicely:
CPU times: user 15.6 s, sys: 23.2 s, total: 38.9 s

B is a RAM bomb: the kernel crashed.

But with a smaller nrows = 100000, things are different:
A: exception not raised.
B: exception raised.

By changing all my code to pattern A I have successfully completed an iteration of my graph.

Questions: why does B blow up RAM, and why is the exception behaviour different? Thank you
Could it be that dd needs metadata and is calling the function to retrieve the first few results? Your function raises an error at that time.
In case the above was not clear, see http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_delayed, and try adding a meta argument to your call to dd.from_delayed.
Thanks.
Any idea about what's happening in the RAM-bomb case B?
Is returning a dask collection from a delayed function a supported pattern? I see that it causes unexpected RAM usage even when extracting a single element.
Note that it's odd to see people use dask.delayed on top of other dask collections. See
http://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections
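A small sketch of the contrast (a reconstruction, since the thread's original A/B examples were lost): a delayed function should return concrete NumPy or pandas objects, not another dask collection, because the returned collection is an unevaluated graph hiding inside a single task:

```python
import numpy as np
import dask.array as da
from dask import delayed

# Pattern A: the delayed function returns a concrete result.
@delayed
def good(n):
    return np.arange(n).sum()  # plain NumPy value

# Pattern B (the anti-pattern): the delayed function returns a dask
# collection, i.e. a whole lazy graph packed inside one task.
@delayed
def bad(n):
    return da.arange(n, chunks=2)

print(good(10).compute())  # a number
obj = bad(10).compute()    # NOT a number: the unevaluated collection
print(type(obj))
```

Computing the "bad" task only hands back the inner collection; evaluating it takes a second round of scheduling, which is where the surprising memory behaviour tends to come from.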
Thanks for the comments, but I need some more help applying this to my real problem. I am doing ML on large images (30k x 30k, float32) with geo/meteo data. I am lazily loading geo images into dask arrays:

@delayed
def load_raster(path):
    with rasterio.open(path, 'r') as raster:
        return raster.read(1).ravel()

img1 = load_raster('A.tif')
img2 = load_raster('B.tif')
img3 = load_raster('C.tif')
# actually 7 of them

I have to perform per-pixel prediction using different already-fitted scikit-learn models. In numpy I'd column-stack the images and call model.predict(X), but here I'd like to spread the work over the available computing power (2 workers, 64GB / 20 cores each). How can I stack the images efficiently? This stack operation seems to run on a single worker; is there a better solution?

X = da.stack([da.from_delayed(d, dtype=np.float32, shape=(30000 * 30000,))
              for d in [img1, img2, img3]], axis=1)

I'm also trying to work column-wise using dask Series and DataFrames instead of arrays:

@delayed
def load_raster(path):
    with rasterio.open(path, 'r') as raster:
        return pd.Series(raster.read(1).ravel())

img1 = dd.from_delayed(load_raster('A.tif'), meta=('A', np.float32))
img2 = dd.from_delayed(load_raster('B.tif'), meta=('B', np.float32))
img3 = dd.from_delayed(load_raster('C.tif'), meta=('C', np.float32))
X = dd.multi.concat([img1, img2, img3], axis=1).repartition(npartitions=40)  # is partitioning by number of cores a good thing?

To sum up, what is best given my context? Or should I use dask_ml.wrappers.Incremental instead? Thanks
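A runnable sketch of feeding such a stacked array to a fitted model block by block with map_blocks (the model here is a stand-in object, since the real scikit-learn models are not in the thread, and shapes are scaled down):

```python
import numpy as np
import dask.array as da

class DummyModel:
    """Stand-in for an already-fitted scikit-learn model (hypothetical)."""
    def predict(self, X):
        return X.sum(axis=1)

model = DummyModel()

# Three flattened "images" stacked as columns: shape (npixels, nbands).
imgs = [da.random.random((10_000,), chunks=2_500) for _ in range(3)]
X = da.stack(imgs, axis=1).rechunk((2_500, 3))

# map_blocks hands the model one (2500, 3) block at a time, so only a
# chunk's worth of pixels is in memory per task, and blocks run on
# whichever workers hold them.
y = X.map_blocks(lambda block: model.predict(block),
                 drop_axis=1, dtype="float64")

print(y.shape)
```

The rechunk keeps all bands in a single chunk along axis 1 so each block sees every feature column, which is what per-pixel prediction needs.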
If your images are tiled internally then I recommend using xarray's open_rasterio functionality. This will allow you to have chunks of data that are smaller in memory and help you pipeline things better. As it is, each of your chunks is in the 4GB range, so even one chunk in memory per core would already blow out your memory.
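The same idea can be sketched with plain dask, assuming a hypothetical per-tile loader (in practice, open_rasterio with a chunks= argument builds a chunked array like this for you):

```python
import numpy as np
import dask.array as da
from dask import delayed

TILE = 1_000

def load_tile(i, j):
    # Stand-in for reading one window of a tiled GeoTIFF with rasterio
    # (hypothetical loader for illustration).
    return np.full((TILE, TILE), i * 10 + j, dtype="float32")

rows, cols = 3, 4  # a 3000 x 4000 image assembled from 1000 x 1000 tiles

# One delayed task per tile: peak memory is bounded by the tiles in
# flight, never by the full image.
grid = [[da.from_delayed(delayed(load_tile)(i, j),
                         shape=(TILE, TILE), dtype="float32")
         for j in range(cols)]
        for i in range(rows)]
img = da.block(grid)

print(img.shape, img.chunksize)
```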
Ok, but xarray.open_rasterio returns a chunked dask array, so here I'm back to one of my first questions: can I return a dask collection from a delayed function?
Ok, but then there's a missing point. I have many large images on an NFS share between the workers. The images do not fit nicely in RAM and I have to stack them, so you told me to use open_rasterio to avoid the all-in-RAM step, and that is fine. But let's say I want to load them in parallel on the workers, since they have access to the filesystem. If I can't return a dask collection from a delayed function, how? Thank you
I don't follow.
Wouldn't that load the whole image into RAM anyway before dumping it to a dask array, just like using rasterio.open().load()?

I feel confused between numpy, pandas, dask and xarray. From what you are saying, it seems that inside delayed functions I get numpy/pandas as input and should return numpy/pandas collections as output, but now you are saying to use xarray.open_rasterio, which returns an xarray DataArray.

So the general question is: how can I ask a worker to lazily load data that would not fit into the worker's memory? Thank you
EDIT: I was pushing the model into the graph. Fixed.
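The fix itself isn't shown in the thread; a hedged sketch of one way to keep a large model out of every task's payload (the model class and helper below are hypothetical stand-ins):

```python
import numpy as np
import dask
from dask import delayed

class BigModel:
    """Stand-in for a large fitted model (hypothetical)."""
    def __init__(self):
        self.coef = 2.0
    def predict(self, X):
        return X * self.coef

# Anti-pattern: capturing a model object directly, e.g.
#   tasks = [delayed(model.predict)(x) for x in chunks]
# embeds the (large) model in the payload of every task.

# Better: make the model itself a single delayed node; every task then
# refers to that one node. On a distributed cluster,
# client.scatter(model) is the analogous move.
model_d = delayed(BigModel)()

def predict_chunk(m, x):
    return m.predict(x)

chunks = [np.arange(3), np.arange(3, 6)]
tasks = [delayed(predict_chunk)(model_d, x) for x in chunks]
results = dask.compute(*tasks)

print(results[0])
```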
No, xarray is lazy. It uses dask under the hood.
My first guess is that you have mismatched versions between the client and the workers. Try comparing them.
I am getting a similar error ("pickle.UnpicklingError: pickle data was truncated"), also when using RandomForestRegressor. Have you had any luck overcoming this issue?
Edit: I figured out that the issue is related to RAM per core. I solved it by altering nprocs and/or the spill rate.
I'm glad to see that you've resolved your issue. I'm going to go ahead and close this. Thanks for reporting things @arkanoid87
Original report:

I'm a novice user, but I can hardly believe that this is the expected performance of Dask.

Configuration:
2 x workers [128GB RAM each]; the client is on worker #1.

It is a 27915 x 22415 float16 image, ~1300MB according to the math.

From this point I've been testing many different possibilities: to_hdf, to_zarr, to_npy, even a custom store() function, but all of them cause my RAM to fill up within seconds, up to 128GB, and crash the kernel. The simplest example is a np.array conversion.

I've also tried to rechunk it so the chunk size equals the full shape, but no change in behaviour.

My data would fit in RAM many times over, but I'm stuck with a dask.array that I cannot save to disk.