Benchmarking a data-intensive operation (snow version)#
We’re going to compute the zonal, time average of a product of two high-frequency fields from the CFSR dataset.
Specifically we will compute

\[\overline{[vT]} = \frac{1}{N_t}\sum_{t}\left(\frac{1}{N_{lon}}\sum_{lon} v\,T\right)\]

for the month of January (one single year), where the overbar denotes the time mean and the square brackets denote the zonal (longitude) mean.
I want to compare various ways of accelerating the calculation.
Compute environment#
I logged into https://jupyterlab.arcc.albany.edu and spawned a server on snow with 8 cores and 64 GB of memory.
This resource is only open to snow group members.
The Python kernel is daes_nov22.
Eager execution in memory#
First we do the “normal” style: open the datasets and execute the calculation immediately in memory.
import xarray as xr
files = ['/network/daes/cfsr/data/2021/v.2021.0p5.anl.nc',
'/network/daes/cfsr/data/2021/t.2021.0p5.anl.nc']
v_ds = xr.open_dataset(files[0])
T_ds = xr.open_dataset(files[1])
v_ds
<xarray.Dataset>
Dimensions: (time: 1460, lat: 361, lon: 720, lev: 32)
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lon (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
Data variables:
v (time, lev, lat, lon) float32 ...
Attributes:
description: v 1000-10 hPa
year: 2021
source: http://nomads.ncdc.noaa.gov/data.php?name=access#CFSR-data
references: Saha, et. al., (2010)
created_by: User: ab473731
creation_date: Sat Jan 2 06:00:41 UTC 2021
Computing the product \(vT\) over one month involves taking the product of two arrays of size
124 × 32 × 361 × 720 (time × lev × lat × lon)
The total number of grid points for the whole year is
v_ds.time.size * v_ds.lev.size * v_ds.lat.size * v_ds.lon.size
12143462400
The month of January alone is about \(1\times 10^9\) points.
Using 32-bit real numbers (4 bytes per number), this means we’re taking the product of two arrays of around 4 GB each.
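These numbers are easy to check with a little arithmetic (the dimension sizes are taken from the dataset printout above):

```python
import numpy as np

# January at 6-hourly resolution: 31 days x 4 analyses/day = 124 time steps
times = np.arange('2021-01-01', '2021-02-01', 6, dtype='datetime64[h]')
n_time = times.size

n_lev, n_lat, n_lon = 32, 361, 720
points = n_time * n_lev * n_lat * n_lon
gigabytes = points * 4 / 1e9          # float32 = 4 bytes per value

print(n_time)                # 124
print(points)                # 1031362560, i.e. about 1e9 grid points
print(round(gigabytes, 1))   # 4.1 GB per array
```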
Timing the eager execution#
%time (v_ds.v.groupby(v_ds.time.dt.month)[1] * T_ds.t.groupby(T_ds.time.dt.month)[1]).mean(dim=('lon','time'))
CPU times: user 56.9 s, sys: 7.19 s, total: 1min 4s
Wall time: 1min 8s
<xarray.DataArray (lev: 32, lat: 361)>
array([[-5.6487140e-03, -4.9008570e+00, 9.0137497e+01, ...,
-2.2839991e+01, -1.6109011e+01, -5.6033172e-03],
[-5.1496424e-02, -4.9253478e+00, 8.9656609e+01, ...,
-1.0113486e+01, -8.7076464e+00, -5.2921850e-02],
[-7.3452897e-02, -4.9121408e+00, 8.9197853e+01, ...,
8.5336580e+00, -6.0262263e-01, -6.8927042e-02],
...,
[-2.8627560e-01, -2.4832463e-01, -1.9483636e-01, ...,
1.4787066e+01, 8.3816442e+00, -2.6661530e-01],
[ 2.6218587e-01, 6.6386479e-01, 9.0554982e-01, ...,
3.2305588e+01, 1.7912203e+01, 2.3270084e-01],
[-7.3782736e-01, 1.6531569e-01, 1.1146791e+00, ...,
9.4755917e+00, 6.2672334e+00, -6.2244219e-01]], dtype=float32)
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
v_ds.close()
T_ds.close()
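As an aside, the `groupby(...)[1]` indexing used above is one way to pull out January; selecting with a partial datetime string does the same thing. A minimal sketch on a small synthetic DataArray (the data here is made up for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: one value per day for 2021
time = np.arange('2021-01-01', '2022-01-01', dtype='datetime64[D]')
da = xr.DataArray(np.arange(time.size, dtype='float32'),
                  coords={'time': time}, dims='time')

jan_a = da.groupby(da.time.dt.month)[1]   # the pattern used above
jan_b = da.sel(time='2021-01')            # partial datetime-string indexing
print(jan_a.size, jan_b.size)             # 31 31
```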
Lazy execution using Dask locally#
ds = xr.open_mfdataset(files, chunks={'time':30, 'lev': 1})
ds
<xarray.Dataset>
Dimensions: (time: 1460, lat: 361, lon: 720, lev: 32)
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lon (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
Data variables:
t (time, lev, lat, lon) float32 dask.array<chunksize=(30, 1, 361, 720), meta=np.ndarray>
v (time, lev, lat, lon) float32 dask.array<chunksize=(30, 1, 361, 720), meta=np.ndarray>
Attributes:
description: t 1000-10 hPa
year: 2021
source: http://nomads.ncdc.noaa.gov/data.php?name=access#CFSR-data
references: Saha, et. al., (2010)
created_by: User: ab473731
creation_date: Sat Jan 2 06:00:24 UTC 2021
vT = ds.v * ds.t
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(1, 1, 361), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
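At this point `meanfield` is only a task graph: no data has been read and no arithmetic done. The computation is triggered by `.load()` (or `.compute()`). The same deferred behaviour can be sketched with plain Dask arrays on synthetic data:

```python
import dask.array as da

x = da.ones((1000, 500), chunks=(100, 500), dtype='float32')
y = (2 * x).mean(axis=1)   # builds a task graph; nothing is computed yet
result = y.compute()       # the chunked computation runs here
print(result.shape)        # (1000,)
print(float(result[0]))    # 2.0
```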
%time meanfield.sel(month=1).load()
CPU times: user 1min 12s, sys: 5.86 s, total: 1min 18s
Wall time: 1min 5s
<xarray.DataArray (lev: 32, lat: 361)>
array([[-5.64870983e-03, -4.90085554e+00, 9.01375198e+01, ...,
-2.28399925e+01, -1.61090107e+01, -5.60331950e-03],
[-5.14964201e-02, -4.92534685e+00, 8.96566238e+01, ...,
-1.01134815e+01, -8.70764828e+00, -5.29218502e-02],
[-7.34528974e-02, -4.91214132e+00, 8.91978302e+01, ...,
8.53365612e+00, -6.02623105e-01, -6.89270422e-02],
...,
[-2.86275655e-01, -2.48324722e-01, -1.94836378e-01, ...,
1.47870646e+01, 8.38164520e+00, -2.66615331e-01],
[ 2.62185901e-01, 6.63864732e-01, 9.05549943e-01, ...,
3.23055954e+01, 1.79122047e+01, 2.32700869e-01],
[-7.37827361e-01, 1.65315807e-01, 1.11467922e+00, ...,
9.47559261e+00, 6.26723337e+00, -6.22442186e-01]], dtype=float32)
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
month int64 1
ds.close()
Conclusion#
Similar to running on batch, we get about the same performance either way. That gives the advantage to the lazy method, since the code is nicer to work with.
Snow is slower than batch, probably just because the CPU hardware is significantly older.
Lazy execution using Dask and a distributed cluster#
Now we’ll try the same thing, but farm the computation out to a Dask distributed cluster.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
cluster = SLURMCluster(
    processes=8,        # by default Dask runs one Python process per job,
                        # but you can optionally cut that job into multiple processes
    cores=32,           # size of a single job -- typically one node of the HPC cluster
    memory="256GB",     # snow has 8 GB RAM per physical core, I believe
    walltime="01:00:00",
    queue="snow",
    interface="ib0",
)
cluster.scale(1)
client = Client(cluster)
client
Client
Client-974a185b-66f6-11ed-96c3-80000208fe80
Connection method: Cluster object | Cluster type: dask_jobqueue.SLURMCluster
Dashboard: http://10.77.8.107:8787/status

SLURMCluster (407d9434)
Scheduler: Scheduler-04bf2212-5777-4b86-9a7d-beee2a36c4b4 | Comm: tcp://10.77.8.107:40409
Started: Just now | Workers: 0 | Total threads: 0 | Total memory: 0 B
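Under the hood, `dask_jobqueue` submits an ordinary Slurm batch script for each job; `cluster.job_script()` prints the exact one. With the settings above it looks roughly like this (a sketch: paths, flag names, and memory formatting vary by dask-jobqueue version):

```bash
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p snow
#SBATCH -n 1
#SBATCH --cpus-per-task=32
#SBATCH --mem=239G
#SBATCH -t 01:00:00

# one job -> 8 worker processes x 4 threads each = the 32 requested cores,
# with the 256 GB split into ~32 GB per worker process
python -m distributed.cli.dask_worker tcp://10.77.8.107:40409 \
    --nthreads 4 --nworkers 8 --memory-limit 32GB \
    --nanny --death-timeout 60 --interface ib0
```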
Effects of different chunking#
ds = xr.open_mfdataset(files, chunks={'time':30, 'lev': 1})
vT = ds.v * ds.t
vT
<xarray.DataArray (time: 1460, lev: 32, lat: 361, lon: 720)> dask.array<mul, shape=(1460, 32, 361, 720), dtype=float32, chunksize=(30, 1, 361, 720), chunktype=numpy.ndarray>
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lon (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(1, 1, 361), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 2.5 s, sys: 455 ms, total: 2.96 s
Wall time: 41.9 s
<xarray.DataArray (lev: 32, lat: 361)>
array([[-5.64870983e-03, -4.90085554e+00, 9.01375198e+01, ...,
-2.28399925e+01, -1.61090107e+01, -5.60331950e-03],
[-5.14964201e-02, -4.92534685e+00, 8.96566238e+01, ...,
-1.01134815e+01, -8.70764828e+00, -5.29218502e-02],
[-7.34528974e-02, -4.91214132e+00, 8.91978302e+01, ...,
8.53365612e+00, -6.02623105e-01, -6.89270422e-02],
...,
[-2.86275655e-01, -2.48324722e-01, -1.94836378e-01, ...,
1.47870646e+01, 8.38164520e+00, -2.66615331e-01],
[ 2.62185901e-01, 6.63864732e-01, 9.05549943e-01, ...,
3.23055954e+01, 1.79122047e+01, 2.32700869e-01],
[-7.37827361e-01, 1.65315807e-01, 1.11467922e+00, ...,
9.47559261e+00, 6.26723337e+00, -6.22442186e-01]], dtype=float32)
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
month int64 1
ds.close()
Same thing, different chunking#
ds = xr.open_mfdataset(files, chunks={'time':10, 'lev': 32})
meanfield = (ds.v*ds.t).mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(1, 32, 361), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 812 ms, sys: 120 ms, total: 932 ms
Wall time: 29.9 s
<xarray.DataArray (lev: 32, lat: 361)>
array([[-5.6487117e-03, -4.9008560e+00, 9.0137527e+01, ...,
-2.2839994e+01, -1.6109011e+01, -5.6033176e-03],
[-5.1496424e-02, -4.9253478e+00, 8.9656616e+01, ...,
-1.0113482e+01, -8.7076483e+00, -5.2921850e-02],
[-7.3452897e-02, -4.9121413e+00, 8.9197830e+01, ...,
8.5336571e+00, -6.0262257e-01, -6.8927050e-02],
...,
[-2.8627566e-01, -2.4832460e-01, -1.9483612e-01, ...,
1.4787064e+01, 8.3816442e+00, -2.6661530e-01],
[ 2.6218590e-01, 6.6386473e-01, 9.0554994e-01, ...,
3.2305595e+01, 1.7912203e+01, 2.3270084e-01],
[-7.3782736e-01, 1.6531578e-01, 1.1146792e+00, ...,
9.4755926e+00, 6.2672329e+00, -6.2244225e-01]], dtype=float32)
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
month int64 1
ds.close()
And yet another chunking#
ds = xr.open_mfdataset(files, chunks={'time':124, 'lev': 32})
vT = ds.v * ds.t
vT
<xarray.DataArray (time: 1460, lev: 32, lat: 361, lon: 720)> dask.array<mul, shape=(1460, 32, 361, 720), dtype=float32, chunksize=(124, 32, 361, 720), chunktype=numpy.ndarray>
Coordinates:
* time (time) datetime64[ns] 2021-01-01 ... 2021-12-31T18:00:00
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lon (lon) float32 -180.0 -179.5 -179.0 -178.5 ... 178.5 179.0 179.5
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
meanfield = vT.mean(dim='lon').groupby(ds.time.dt.month).mean()
meanfield
<xarray.DataArray (month: 12, lev: 32, lat: 361)> dask.array<transpose, shape=(12, 32, 361), dtype=float32, chunksize=(2, 32, 361), chunktype=numpy.ndarray>
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
%time meanfield.sel(month=1).load()
CPU times: user 1.7 s, sys: 217 ms, total: 1.91 s
Wall time: 1min 13s
<xarray.DataArray (lev: 32, lat: 361)>
array([[-5.6487140e-03, -4.9008555e+00, 9.0137527e+01, ...,
-2.2839994e+01, -1.6109011e+01, -5.6033167e-03],
[-5.1496420e-02, -4.9253469e+00, 8.9656624e+01, ...,
-1.0113482e+01, -8.7076483e+00, -5.2921850e-02],
[-7.3452890e-02, -4.9121413e+00, 8.9197838e+01, ...,
8.5336561e+00, -6.0262299e-01, -6.8927050e-02],
...,
[-2.8627566e-01, -2.4832465e-01, -1.9483635e-01, ...,
1.4787064e+01, 8.3816452e+00, -2.6661530e-01],
[ 2.6218587e-01, 6.6386473e-01, 9.0554988e-01, ...,
3.2305595e+01, 1.7912203e+01, 2.3270085e-01],
[-7.3782736e-01, 1.6531581e-01, 1.1146793e+00, ...,
9.4755907e+00, 6.2672334e+00, -6.2244219e-01]], dtype=float32)
Coordinates:
* lat (lat) float32 -90.0 -89.5 -89.0 -88.5 -88.0 ... 88.5 89.0 89.5 90.0
* lev (lev) float32 1e+03 975.0 950.0 925.0 900.0 ... 50.0 30.0 20.0 10.0
month int64 1
ds.close()
Results#
As with the batch runs, the smallest wall time for this calculation was achieved using the distributed cluster with the chunking strategy chunks={'time':10, 'lev': 32}.
For this one-month calculation, using the distributed cluster reduces our wall time from about 65 s down to about 30 s.
That’s about a factor of two speedup.
Compared to batch, snow is about 1.5x slower (30 seconds vs 20 seconds).
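The per-chunk memory footprints go a long way toward explaining the ranking of the three chunking strategies (straightforward arithmetic from the dimension sizes; float32 = 4 bytes). Very small chunks mean many tasks and more scheduler overhead; very large chunks leave little room for parallelism across the 8 worker processes:

```python
def chunk_mb(time, lev, lat=361, lon=720, bytes_per=4):
    """Memory footprint of one chunk of a float32 variable, in MB."""
    return time * lev * lat * lon * bytes_per / 1e6

print(round(chunk_mb(30, 1), 1))    # 31.2  -> thousands of tiny chunks
print(round(chunk_mb(10, 32), 1))   # 332.7 -> fewer, comfortably sized chunks (fastest)
print(round(chunk_mb(124, 32)))     # 4125  -> ~4 GB chunks, only ~12 per variable (slowest)
```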
cluster.close()