Benchmarking a data-intensive operation#
We’re going to compute the zonal, time average of a product of two high-frequency fields from the CFSR dataset.
Specifically we will compute

\[
\left[\overline{vT}\right]
\]

for the month of January (one single year), where the square brackets denote a zonal average, the overbar a time average, \(v\) is the meridional wind, and \(T\) the temperature.
I want to compare various ways of accelerating the calculation.
Compute environment#
I logged into https://jupyterlab.arcc.albany.edu and spawned a server on the batch partition with 4 cores and 32 GB of memory. This resource is open to anyone with UAlbany login credentials. The Python kernel is daes_nov22.
Eager execution in memory#
First we do the “normal” style: open the datasets and execute the calculation immediately.
import xarray as xr
files = ['/network/daes/cfsr/data/2021/v.2021.0p5.anl.nc',
'/network/daes/cfsr/data/2021/t.2021.0p5.anl.nc']
v_ds = xr.open_dataset(files[0])
T_ds = xr.open_dataset(files[1])
v_ds
<xarray.Dataset>
Dimensions:  (time: 1460, lat: 361, lon: 720, lev: 32)
Coordinates:
  * time     (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00
  * lat      (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
  * lon      (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
  * lev      (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
Data variables:
    v        (time, lev, lat, lon) float32 ...
Attributes:
    description:    v 1000-10 hPa
    year:           2021
    source:         http://nomads.ncdc.noaa.gov/data.php?name=access#CFSR-data
    references:     Saha, et. al., (2010)
    created_by:     User: ab473731
    creation_date:  Sat Jan 2 06:00:41 UTC 2021
Computing the product \(vT\) over one month involves taking the product of two arrays of size 124 x 32 x 361 x 720 (time x lev x lat x lon).
The total number of grid points for the whole year is
v_ds.time.size * v_ds.lev.size * v_ds.lat.size * v_ds.lon.size
12143462400
The month of January alone accounts for about \(1\times 10^9\) points. Using 32-bit real numbers (4 bytes per number), this means we’re taking the product of two arrays of roughly 4 GB each.
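As a quick sanity check on that estimate, we can compute the footprint directly (a rough sketch; the 124 time steps come from 31 January days at 6-hourly resolution):

import numpy as np

n_points = 124 * 32 * 361 * 720                      # time x lev x lat x lon for January
bytes_per_field = n_points * np.dtype('float32').itemsize
print(f"{bytes_per_field / 1e9:.1f} GB per field")   # roughly 4.1 GB each for v and T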
Timing the eager execution#
%time (v_ds.v.groupby(v_ds.time.dt.month)[1] * T_ds.t.groupby(T_ds.time.dt.month)[1]).mean(dim=('lon','time'))
CPU times: user 38 s, sys: 6.03 s, total: 44 s
Wall time: 46.1 s
<xarray.DataArray (lev: 32, lat: 361)> array([[-5.6487140e-03, -4.9008570e+00, 9.0137497e+01, ..., -2.2839991e+01, -1.6109011e+01, -5.6033172e-03], [-5.1496424e-02, -4.9253478e+00, 8.9656609e+01, ..., -1.0113486e+01, -8.7076464e+00, -5.2921850e-02], [-7.3452897e-02, -4.9121408e+00, 8.9197853e+01, ..., 8.5336580e+00, -6.0262263e-01, -6.8927042e-02], ..., [-2.8627560e-01, -2.4832463e-01, -1.9483636e-01, ..., 1.4787066e+01, 8.3816442e+00, -2.6661530e-01], [ 2.6218587e-01, 6.6386479e-01, 9.0554982e-01, ..., 3.2305588e+01, 1.7912203e+01, 2.3270084e-01], [-7.3782736e-01, 1.6531569e-01, 1.1146791e+00, ..., 9.4755917e+00, 6.2672334e+00, -6.2244219e-01]], dtype=float32) Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
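The same eager calculation can also be written with a partial-string time selection instead of groupby. This is just an alternative sketch (not the timing reported above), and it should give the same January mean:

# Alternative eager formulation: select January explicitly, then average
vT_jan = v_ds.v.sel(time='2021-01') * T_ds.t.sel(time='2021-01')
result = vT_jan.mean(dim=('lon', 'time'))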
v_ds.close()
T_ds.close()
Lazy execution using Dask locally#
ds = xr.open_mfdataset(files, chunks={'time':30, 'lev': 1})
ds
<xarray.Dataset>
Dimensions:  (time: 1460, lat: 361, lon: 720, lev: 32)
Coordinates:
  * time     (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00
  * lat      (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
  * lon      (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
  * lev      (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
Data variables:
    t        (time, lev, lat, lon) float32 dask.array<chunksize=(30, 1, 361, 720), meta=np.ndarray>
    v        (time, lev, lat, lon) float32 dask.array<chunksize=(30, 1, 361, 720), meta=np.ndarray>
Attributes:
    description:    t 1000-10 hPa
    year:           2021
    source:         http://nomads.ncdc.noaa.gov/data.php?name=access#CFSR-data
    references:     Saha, et. al., (2010)
    created_by:     User: ab473731
    creation_date:  Sat Jan 2 06:00:24 UTC 2021
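Nothing has been read from disk yet. Before triggering any computation, we can inspect the chunk layout of the underlying dask arrays; this is just a sketch using standard dask array properties:

import numpy as np

print(ds.v.data.chunksize)      # (30, 1, 361, 720) with the chunks argument above
print(ds.v.data.npartitions)    # number of chunks (one task per chunk per operation)
print(np.prod(ds.v.data.chunksize) * 4 / 1e6, 'MB per float32 chunk')   # ~31 MB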
vT = ds.v * ds.t
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(1, 1, 361), chunktype=numpy.ndarray> Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 * month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 49.3 s, sys: 3.27 s, total: 52.5 s
Wall time: 43.1 s
<xarray.DataArray (lev: 32, lat: 361)> array([[-5.64870983e-03, -4.90085554e+00, 9.01375198e+01, ..., -2.28399925e+01, -1.61090107e+01, -5.60331950e-03], [-5.14964201e-02, -4.92534685e+00, 8.96566238e+01, ..., -1.01134815e+01, -8.70764828e+00, -5.29218502e-02], [-7.34528974e-02, -4.91214132e+00, 8.91978302e+01, ..., 8.53365612e+00, -6.02623105e-01, -6.89270422e-02], ..., [-2.86275655e-01, -2.48324722e-01, -1.94836378e-01, ..., 1.47870646e+01, 8.38164520e+00, -2.66615331e-01], [ 2.62185901e-01, 6.63864732e-01, 9.05549943e-01, ..., 3.23055954e+01, 1.79122047e+01, 2.32700869e-01], [-7.37827361e-01, 1.65315807e-01, 1.11467922e+00, ..., 9.47559261e+00, 6.26723337e+00, -6.22442186e-01]], dtype=float32) Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 month int64 1
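With the local (threaded) scheduler, Dask’s diagnostics can also show a progress bar while the task graph executes. A small optional sketch, equivalent to the timed .load() above:

from dask.diagnostics import ProgressBar

# Same computation as above, but with a live progress bar on the local scheduler
with ProgressBar():
    jan_mean = meanfield.sel(month=1).compute()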
ds.close()
Conclusion#
On this particular system, we get about the same performance either way. That gives the advantage to the lazy method, because the code is nicer to work with.
Lazy execution using Dask and a distributed cluster#
Now we’ll try the same thing but farm the computation out to a dask cluster.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
cluster = SLURMCluster(
    processes=4,          # by default Dask runs one Python process per job;
                          # optionally, that job can be split into multiple processes
    cores=80,             # size of a single job -- typically one full node of the HPC cluster
    memory="376.2GB",     # the max memory on the 80-core batch nodes, I believe
    walltime="01:00:00",
    queue="batch",
)
cluster.scale(1)
client = Client(cluster)
client
Client: Client-cc027e39-66f3-11ed-936c-80000086fe80
Connection method: Cluster object | Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://169.226.65.166:8787/status
SLURMCluster (9664ea60) -- Scheduler: tcp://169.226.65.166:33445 | Workers: 0 | Total threads: 0 | Total memory: 0 B
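SLURM may take a while to actually start the job, so the cluster reports zero workers at first. One way to avoid launching the computation too early is to block until the workers have registered; a sketch (the worker count of 4 matches processes=4 above):

# Wait until the SLURM job has started and its 4 worker processes have connected
client.wait_for_workers(n_workers=4)
print(len(client.scheduler_info()['workers']), 'workers ready')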
Effects of different chunking#
ds = xr.open_mfdataset(files, chunks={'time':30, 'lev': 1})
vT = ds.v * ds.t
vT
<xarray.DataArray (time: 1460, lev: 32, lat: 361, lon: 720)> dask.array<mul, shape=(1460, 32, 361, 720), dtype=float32, chunksize=(30, 1, 361, 720), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00 * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lon (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(1, 1, 361), chunktype=numpy.ndarray> Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 * month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 994 ms, sys: 224 ms, total: 1.22 s
Wall time: 18.4 s
<xarray.DataArray (lev: 32, lat: 361)> array([[-5.64870983e-03, -4.90085554e+00, 9.01375198e+01, ..., -2.28399925e+01, -1.61090107e+01, -5.60331950e-03], [-5.14964201e-02, -4.92534685e+00, 8.96566238e+01, ..., -1.01134815e+01, -8.70764828e+00, -5.29218502e-02], [-7.34528974e-02, -4.91214132e+00, 8.91978302e+01, ..., 8.53365612e+00, -6.02623105e-01, -6.89270422e-02], ..., [-2.86275655e-01, -2.48324722e-01, -1.94836378e-01, ..., 1.47870646e+01, 8.38164520e+00, -2.66615331e-01], [ 2.62185901e-01, 6.63864732e-01, 9.05549943e-01, ..., 3.23055954e+01, 1.79122047e+01, 2.32700869e-01], [-7.37827361e-01, 1.65315807e-01, 1.11467922e+00, ..., 9.47559261e+00, 6.26723337e+00, -6.22442186e-01]], dtype=float32) Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 month int64 1
ds.close()
Same thing, different chunking#
ds = xr.open_mfdataset(files, chunks={'time':10, 'lev': 32})
meanfield = (ds.v*ds.t).mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(1, 32, 361), chunktype=numpy.ndarray> Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 * month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 261 ms, sys: 62.7 ms, total: 324 ms
Wall time: 20.2 s
<xarray.DataArray (lev: 32, lat: 361)> array([[-5.6487117e-03, -4.9008560e+00, 9.0137527e+01, ..., -2.2839994e+01, -1.6109011e+01, -5.6033176e-03], [-5.1496424e-02, -4.9253478e+00, 8.9656616e+01, ..., -1.0113482e+01, -8.7076483e+00, -5.2921850e-02], [-7.3452897e-02, -4.9121413e+00, 8.9197830e+01, ..., 8.5336571e+00, -6.0262257e-01, -6.8927050e-02], ..., [-2.8627566e-01, -2.4832460e-01, -1.9483612e-01, ..., 1.4787064e+01, 8.3816442e+00, -2.6661530e-01], [ 2.6218590e-01, 6.6386473e-01, 9.0554994e-01, ..., 3.2305595e+01, 1.7912203e+01, 2.3270084e-01], [-7.3782736e-01, 1.6531578e-01, 1.1146792e+00, ..., 9.4755926e+00, 6.2672329e+00, -6.2244225e-01]], dtype=float32) Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 month int64 1
ds.close()
And yet another chunking#
ds = xr.open_mfdataset(files, chunks={'time':124, 'lev': 32})
vT = ds.v * ds.t
vT
<xarray.DataArray (time: 1460, lev: 32, lat: 361, lon: 720)> dask.array<mul, shape=(1460, 32, 361, 720), dtype=float32, chunksize=(124, 32, 361, 720), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00 * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lon (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
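A rough estimate of the per-chunk memory for this layout helps explain what happens next (simple arithmetic, not measured):

# Each chunk now spans 124 time steps and all 32 levels of the full global grid
chunk_bytes = 124 * 32 * 361 * 720 * 4            # float32
print(f"{chunk_bytes / 1e9:.1f} GB per chunk")    # ~4 GB per chunk

With only about a dozen chunks per variable for the whole year, there is little parallelism for the workers to exploit, and each task needs several gigabytes of memory.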
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(2, 32, 361), chunktype=numpy.ndarray> Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 * month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 475 ms, sys: 119 ms, total: 593 ms
Wall time: 45 s
<xarray.DataArray (lev: 32, lat: 361)> array([[-5.6487140e-03, -4.9008555e+00, 9.0137527e+01, ..., -2.2839994e+01, -1.6109011e+01, -5.6033167e-03], [-5.1496420e-02, -4.9253469e+00, 8.9656624e+01, ..., -1.0113482e+01, -8.7076483e+00, -5.2921850e-02], [-7.3452890e-02, -4.9121413e+00, 8.9197838e+01, ..., 8.5336561e+00, -6.0262299e-01, -6.8927050e-02], ..., [-2.8627566e-01, -2.4832465e-01, -1.9483635e-01, ..., 1.4787064e+01, 8.3816452e+00, -2.6661530e-01], [ 2.6218587e-01, 6.6386473e-01, 9.0554988e-01, ..., 3.2305595e+01, 1.7912203e+01, 2.3270085e-01], [-7.3782736e-01, 1.6531581e-01, 1.1146793e+00, ..., 9.4755907e+00, 6.2672334e+00, -6.2244219e-01]], dtype=float32) Coordinates: * lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0 * lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0 month int64 1
ds.close()
Results#
The smallest wall time for this calculation (about 18 s) was achieved using the distributed cluster with the chunking strategy chunks={'time': 30, 'lev': 1}.
For this one-month calculation, using the distributed cluster reduces our wall time from roughly 45 s down to under 20 s, a speed-up of about 2.5x.
Can we do better through more intelligent chunking and/or fine-tuning of the distributed cluster? Probably.
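One direction worth trying, sketched below but not run here, is adaptive scaling plus keeping the intermediate product in distributed memory; the chunk sizes and scaling bounds are guesses, not tested settings:

# Hypothetical follow-up experiment (not executed in this post)
ds = xr.open_mfdataset(files, chunks={'time': 124, 'lev': 1})   # chunk sizes are a guess
cluster.adapt(minimum=1, maximum=4)         # let Dask add/remove SLURM jobs as needed
vT = (ds.v * ds.t).persist()                # keep the product in worker memory
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()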
cluster.close()
2022-11-17 22:49:54,749 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client