Sub-daily ratio regionalization: nationwide validation

When a record is daily-only, Cascade Hydro estimates the shorter durations by scaling the 24-hour design depth by a sub-daily depth ratio (the Tier-3 path; see also the provenance). Which ratios to use is a regionalization problem. This page is the evidence behind that choice. It compares two methods across every documented ECCC station, scored against ECCC's own published IDF curves, so the decision rests on independent, nationwide, out-of-sample data rather than a single station or a synthetic test.

The two methods:

Ecozone median: the median ratio of all gauges in the project's terrestrial ecozone. Fixed regions, hard boundaries.
Region of influence (ROI): the inverse-distance-weighted mean of the \(k=15\) nearest gauges (haversine), after Burn (1990). No fixed boundaries.

The evidence on this page is why ROI is the shipped default and the ecozone median is the fallback (kept where it provably wins, and where ROI has no nearby gauge). Everything here is reproduced by running scripts/extract_idf_dataset.py and then scripts/compare_ratio_methods.py on the ECCC IDF product (v3.40); the raw files are not committed, but the figures and summary are.

How the ratios are computed

Three formulae produce a sub-daily ratio at an ungauged location. They are stated here with the constants and the code that implements them, so the numbers on this page are reproducible end to end.

1. The per-gauge ratio (what gets regionalized)

For each gauge, the quantity regionalized is the depth ratio at return period \(T\) (default \(T = 10\) yr): the \(D\)-minute design depth divided by the 24-hour design depth. Each depth is a Gumbel (EV-I) quantile fitted by the method of moments to that gauge's annual-maximum series (Gumbel 1958). Two standard results are used: the Gumbel method-of-moments estimators (scale \(\beta\), location \(\mu\)) and the Gumbel quantile function \(x_T = \mu + \beta\,y_T\). From the sample mean \(\bar{x}\) and standard deviation \(s\) of the maxima at duration \(D\),

\[ \beta = \frac{s\sqrt{6}}{\pi}, \qquad \mu = \bar{x} - \gamma\,\beta, \qquad x_T(D) = \mu + \beta\,y_T, \]

with the Euler-Mascheroni constant \(\gamma = 0.5772156649\) and the Gumbel reduced variate \(y_T = -\ln\!\big(-\ln(1 - 1/T)\big)\). The ratio is the anchored quotient

\[ r(D) = \frac{x_T(D)}{x_T(1440)}, \qquad r(1440) = 1 \text{ by construction}. \]

Code: gumbel_mom_ratios in scripts/compare_ratio_methods.py, mirroring the in-app fit in idf_analyzer.analysis.frequency. The shipped 686-gauge table (idf_analyzer.analysis.station_ratios) is exactly these vectors; section A below confirms they reproduce ECCC's published Table 2a ratios to a mean absolute error of 0.0005.
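The three expressions above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the project's gumbel_mom_ratios: the function names are hypothetical, \(s\) is taken to be the sample (n − 1) standard deviation, and the annual maxima are made-up numbers.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, as quoted in the text


def gumbel_reduced_variate(T):
    """Gumbel reduced variate y_T = -ln(-ln(1 - 1/T)) for return period T (years)."""
    return -math.log(-math.log(1.0 - 1.0 / T))


def gumbel_mom_quantile(maxima, T=10.0):
    """T-year Gumbel quantile with method-of-moments parameters.

    Assumption: s is the sample (n - 1) standard deviation.
    """
    beta = stdev(maxima) * math.sqrt(6.0) / math.pi  # scale
    mu = mean(maxima) - EULER_GAMMA * beta           # location
    return mu + beta * gumbel_reduced_variate(T)


def depth_ratios(maxima_by_duration, T=10.0, anchor=1440):
    """Ratio of each duration's T-year depth to the 24-hour (1440-minute) depth."""
    x_anchor = gumbel_mom_quantile(maxima_by_duration[anchor], T)
    return {d: gumbel_mom_quantile(m, T) / x_anchor
            for d, m in maxima_by_duration.items()}


# Made-up annual-maximum depths in mm, keyed by duration in minutes.
example = {
    60: [12.0, 18.5, 15.2, 22.1, 14.3, 19.8, 16.4, 25.0],
    1440: [38.0, 52.5, 44.1, 61.0, 40.2, 55.3, 47.9, 66.4],
}
ratios = depth_ratios(example)
```

Because the 1440-minute quantile is divided by itself, the anchor ratio comes out as exactly 1, which is the "by construction" property stated above.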

2. Region of influence (Burn 1990)

The region-of-influence idea (Burn 1990) is that every site gets its own pool of nearby gauges, weighted by proximity, rather than being assigned to a fixed region. Burn's scheme has three parts: a distance metric (his Eq. 1, a weighted Euclidean distance in an attribute space of basin descriptors), a threshold that defines the pool (his Eq. 2, \(I_i = \{ j : D_{ij} \le \theta_i \}\)), and a weighting function that down-weights the more distant members (his Eq. 3a, \(\eta_{ij} = f(D_{ij}, \Psi)\), realised as the decreasing power forms of his Eqs. 6 and 8). Cascade Hydro keeps Burn's structure but makes three choices suited to mapping a sub-daily ratio:

Distance is geographic, not attribute-space. We use the great-circle (haversine) distance between gauge coordinates, where Burn used Eq. 1 over basin attributes. Burn notes (his discussion) that for ungauged sites the metric must fall back to physical/locational features; geographic proximity is exactly that.
The pool is a fixed count, the \(k = 15\) nearest gauges, in place of Burn's distance threshold \(\theta_i\) (Eq. 2). Burn's own target pool size was also \(NST = 15\) (his "Parameters for ROI Options"), so \(k = 15\) matches his setting; a maximum-distance guard (below) plays the role of his threshold.
The weight is the inverse-distance form below, one member of Burn's general family (Eq. 3a) but not his Eq. 6 or Eq. 8 expression.

Burn then pooled probability-weighted moments and fitted a GEV growth curve (his Eqs. 18-24); Cascade Hydro instead pools the empirical depth-ratio vectors directly (the ratios derived in step 1 above), because the ratio is the quantity being regionalized. Distance is the haversine (great-circle) formula, \(R = 6371\) km:

\[ d = 2R\,\arcsin\!\sqrt{\sin^2\!\tfrac{\Delta\varphi}{2} + \cos\varphi_1 \cos\varphi_2 \sin^2\!\tfrac{\Delta\lambda}{2}} . \]

Each pooled gauge \(i\) is weighted by inverse distance and the weights are normalised, so the estimate is a convex combination (inverse-distance weighting) of the neighbours' ratios:

\[ w_i = \frac{1}{d_i + 1\,\text{km}}, \qquad \hat{r}(D) = \frac{\sum_{i=1}^{k} w_i\, r_i(D)}{\sum_{i=1}^{k} w_i} . \]

The \(+1\) km offset prevents a division by zero at a co-located gauge and lightly smooths the very-near field; nearer gauges dominate, distant ones still contribute. The anchor is preserved, \(\hat{r}(1440) = 1\). Two distance guards make the estimate honest about its support: ROI declines (and the resolver falls back to the ecozone table) when the nearest gauge is beyond \(d_{\text{max}} = 300\) km, and the result is flagged documented only when the nearest gauge is within 100 km, else planning-level. Code: _haversine / the IDW block in run() (study), and roi_ratios in idf_analyzer.analysis.idf_ratios (runtime), which share the formula and constants (\(k\), the weight, \(d_{\text{max}}\)).
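The distance, weighting, and guard rules above can be put together in a short sketch. It is not the runtime roi_ratios: the function names, the tuple layout of a gauge, and the treatment of the 100 km and 300 km limits as inclusive on the near side are assumptions made for illustration, and the gauge values are invented.

```python
import math

EARTH_RADIUS_KM = 6371.0
K_NEAREST = 15         # pool size k
D_MAX_KM = 300.0       # ROI declines beyond this nearest-gauge distance
DOCUMENTED_KM = 100.0  # "documented" only when the nearest gauge is this close


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def roi_estimate(lat, lon, gauges, k=K_NEAREST):
    """Inverse-distance-weighted mean of the k nearest gauges' ratio vectors.

    gauges: list of (lat, lon, ratios), ratios being equal-length lists.
    Returns (ratios, quality), or None when ROI declines.
    """
    ranked = sorted(
        ((haversine_km(lat, lon, g_lat, g_lon), r) for g_lat, g_lon, r in gauges),
        key=lambda pair: pair[0],
    )
    if not ranked or ranked[0][0] > D_MAX_KM:
        return None  # caller falls back to the ecozone table
    pool = ranked[:k]
    weights = [1.0 / (d + 1.0) for d, _ in pool]  # the +1 km offset
    total = sum(weights)
    n = len(pool[0][1])
    estimate = [sum(w * r[j] for w, (_, r) in zip(weights, pool)) / total
                for j in range(n)]
    quality = "documented" if ranked[0][0] <= DOCUMENTED_KM else "planning-level"
    return estimate, quality


# Invented gauges: (lat, lon, [r(60), r(360), r(1440)]).
gauges = [
    (49.0, -123.0, [0.10, 0.40, 1.0]),
    (49.5, -123.5, [0.12, 0.44, 1.0]),
    (50.0, -120.0, [0.20, 0.50, 1.0]),
]
near = roi_estimate(49.2, -123.1, gauges)
far = roi_estimate(70.0, -100.0, gauges)
```

With fewer than 15 gauges available the pool is simply all of them; a site more than 300 km from every gauge returns None, which stands in for the fallback to the ecozone table.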

3. Why pool by proximity, and how homogeneous the pools are

Pooling neighbours is justified because the sub-daily ratio is a smoothly varying regional quantity. The simple-scaling model of extreme rainfall (Menabde et al. 1999) makes this precise. Their hypothesis (Eq. 6) is that the annual-maximum intensity \(I_d\) scales with duration as \(I_d \stackrel{d}{=} (d/D)^{-\eta} I_D\), so same-return-period intensities obey \(i_d(T) = (d/D)^{-\eta}\, i_D(T)\) (their Eq. 18). Because depth is intensity times duration, the depth ratio Cascade Hydro uses is the same power law with exponent \(1-\eta\):

\[ r(D) = \frac{\text{depth}(D)}{\text{depth}(1440)} \approx \left(\frac{D}{1440}\right)^{\,1-\eta}, \]

and \(\eta\) is regional. Menabde et al. report intensity exponents \(\eta \approx 0.65\) (temperate Melbourne) and \(\eta \approx 0.76\) (convective Warmbaths). Backing out the implied \(\eta = 1 - (\text{depth exponent})\) from our shipped ratios gives \(\eta \approx 0.69\) for the convective Prairies, squarely in that range, and \(\eta \approx 0.50\) for the winter-frontal Pacific coast, below it (a flatter intensity-duration curve), consistent with the paper's point that \(\eta\) tracks the storm climate.

Two points from the paper bear on how the model is used here. First, a caveat: Menabde et al. validate simple scaling over 30 min to 24 h; our 5-, 10-, and 15-minute ratios lie below that range, which is one reason Cascade Hydro carries the empirical ratio vector rather than fitting a single \(\eta\) and extrapolating it. Second, the model's stated purpose (their abstract) is precisely the Tier-3 task: recover sub-daily amounts "directly from the analysis of daily data." The scaling law is therefore the physical reason a neighbour's ratio vector is informative, and the reason the curves in the figures below tighten smoothly toward the 24-hour anchor; it is the motivation for ROI, not the estimator Cascade Hydro runs.
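To make the power law concrete, the following sketch evaluates the depth ratio for a given \(\eta\) and inverts it to recover \(\eta\) from a single ratio. The helper names are hypothetical, and the single-ratio inversion is a simplification for illustration; it is not necessarily how the exponents quoted above were backed out.

```python
import math


def scaling_ratio(duration_min, eta, anchor_min=1440.0):
    """Depth ratio implied by simple scaling: (D / 1440) ** (1 - eta)."""
    return (duration_min / anchor_min) ** (1.0 - eta)


def implied_eta(duration_min, ratio, anchor_min=1440.0):
    """Invert the power law for eta from one sub-daily depth ratio."""
    return 1.0 - math.log(ratio) / math.log(duration_min / anchor_min)


# 1-hour : 24-hour depth ratio under the two exponents quoted in the text.
r_prairies = scaling_ratio(60, 0.69)
r_pacific = scaling_ratio(60, 0.50)
```

Under these exponents the 1-hour ratio is roughly 0.37 for \(\eta = 0.69\) and roughly 0.20 for \(\eta = 0.50\): the smaller \(\eta\) leaves a smaller share of the 24-hour depth in the first hour, which matches the low coastal ratios described for the ROI surface.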

How well-posed a pool is can be measured. The discordancy measure \(D_i\) of Hosking & Wallis (1997), applied here to the gauge ratio vectors \(u_i\) within a pool of \(N\) gauges with pool mean \(\bar{u}\), is

\[ D_i = \tfrac{1}{3}\,N\,(u_i - \bar{u})^{\mathsf T}\, S^{-1}\,(u_i - \bar{u}), \qquad S = \sum_{j=1}^{N} (u_j - \bar{u})(u_j - \bar{u})^{\mathsf T}, \]

which flags gauges whose ratio vector is unusual for their pool. By this family of diagnostics the ROI pools are measurably more homogeneous than the ecozone groupings (about 9% lower within-pool dispersion; development notes): the nearest 15 gauges are a tighter group than a whole ecozone, which is what the out-of-sample error below confirms.
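A minimal standard-library sketch of the discordancy formula follows. Two assumptions are worth stating: the pool values are invented, and the vectors hold three sub-daily ratios with the 24-hour anchor left out, because a component that equals 1 at every gauge has zero variance and would make \(S\) singular. The function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def discordancy(vectors):
    """D_i = (N / 3) (u_i - mean)^T S^{-1} (u_i - mean) for each gauge i."""
    N, p = len(vectors), len(vectors[0])
    mean_u = [sum(v[j] for v in vectors) / N for j in range(p)]
    dev = [[v[j] - mean_u[j] for j in range(p)] for v in vectors]
    # S is the sum of outer products of the deviations.
    S = [[sum(d[a] * d[b] for d in dev) for b in range(p)] for a in range(p)]
    # solve(S, d) gives S^{-1} d without forming the inverse explicitly.
    return [N / 3.0 * sum(di * xi for di, xi in zip(d, solve(S, d)))
            for d in dev]


# Invented pool: [r(5), r(60), r(360)] per gauge; the last gauge is unusual.
pool = [
    [0.10, 0.30, 0.55],
    [0.11, 0.32, 0.57],
    [0.09, 0.29, 0.56],
    [0.12, 0.31, 0.54],
    [0.10, 0.33, 0.58],
    [0.20, 0.36, 0.50],
]
D = discordancy(pool)
```

With three-component vectors the values of \(D_i\) sum to \(N\), so the pool average is 1 and a gauge is judged by how far it sits above that average.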

Two pictures of the same country

The methods differ in spatial structure before any error is measured. The ecozone method partitions Canada into 15 fixed regions and assigns every point the median of its region, so the ratio changes in steps at region borders. ROI has no borders: it averages the nearest gauges, producing a smooth field that follows the storm-climatology gradient. Only the ecozone method has boundaries to draw.

Ecozone regions vs the continuous ROI ratio surface — Left: the ecozone method's 15 fixed regions (hard boundaries), with the gauge network. Right: the region-of-influence 5-min : 24-h ratio surface from the shipped gauge table, smooth and boundary-free (low on the winter-wet Pacific coast, higher in the convective interior). White areas in the north are where ROI declines for want of a nearby gauge and the ecozone table is the fallback.

Ground truth: ECCC's published curves

The ratio used as truth for station \(s\) and duration \(D\) comes straight from ECCC's published Table 2a return-level depths:

\[ \text{ratio}_s(D) = \frac{\text{depth}_s(D,\,10\text{-yr})}{\text{depth}_s(1440,\,10\text{-yr})} . \]

Two analyses follow.

A. The shipped numbers are ECCC-equivalent

First, confirm the ratios Cascade Hydro derives (Gumbel by method of moments, from each station's Table 1 annual maxima, idf_analyzer.analysis.frequency) reproduce ECCC's published Table 2a ratios. Across all 686 stations with a full sub-daily curve:

\[ \overline{\text{MAE}}\bigl(\text{our ratios},\ \text{ECCC published}\bigr) = 0.0005 . \]

So the targets in the comparison below are, in effect, ECCC's own numbers, and the gauge table the app ships is faithful to them nationwide (consistent with the single-station, cell-by-cell check on the ECCC validation page).

B. ROI vs ecozone, leave-one-station-out

For each station, predict its sub-daily ratios from the other stations by each method (the station is excluded from its own ecozone median and ROI pool), and score against its published curve. Across 686 stations at the 10-year level:

Method	Mean MAE	Stations where it wins
Ecozone median (fallback)	0.0540	37%
Region of influence	0.0436	63%

ROI reduces mean absolute error by 19% (from 0.0540 to 0.0436). The 90% bootstrap CI on the absolute reduction, \([0.0084,\ 0.0124]\), excludes zero, so the gain is not noise. The improvement holds at the 2-year and 100-year levels too (--rp).
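The scoring can be sketched as follows: a leave-one-station-out ecozone median, the per-station MAE, and a station-level percentile bootstrap on the paired difference in error. The function names, the number of resamples, the percentile method, and the synthetic errors are all assumptions for illustration; the project's own procedure lives in scripts/compare_ratio_methods.py.

```python
import random
from statistics import mean, median


def mae(pred, truth):
    """Mean absolute error between two ratio vectors."""
    return mean(abs(p - t) for p, t in zip(pred, truth))


def loso_ecozone_median(stations, target):
    """Per-duration median of the other stations in the target's ecozone.

    stations: dict of station id -> (ecozone, ratios). The target is excluded.
    """
    zone = stations[target][0]
    others = [r for sid, (z, r) in stations.items()
              if sid != target and z == zone]
    if not others:
        return None
    return [median(col) for col in zip(*others)]


def bootstrap_ci_reduction(err_a, err_b, n_boot=2000, level=0.90, seed=0):
    """Percentile CI on mean(err_a - err_b), resampling stations in pairs."""
    rng = random.Random(seed)
    n = len(err_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean(err_a[i] - err_b[i] for i in idx))
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_boot)]
    hi = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Headline relative reduction, from the two mean MAEs in the table above.
relative_reduction = (0.0540 - 0.0436) / 0.0540

# Synthetic paired per-station errors, for illustration only.
err_ecozone = [0.050 + 0.001 * i for i in range(20)]
err_roi = [e - 0.010 for e in err_ecozone]
ci_lo, ci_hi = bootstrap_ci_reduction(err_ecozone, err_roi)
```

Resampling whole stations keeps each station's two errors paired, so the interval describes the reduction itself rather than the two means separately.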

Per-station error distribution, ROI vs ecozone — Cumulative distribution of per-station MAE against ECCC's published curves. The ROI curve sits left of the ecozone curve across almost the whole range: lower error for most stations.

Where ROI helps, and where it does not

The national average hides real, physically meaningful variation. ROI wins in 12 of 15 ecozones, and the pattern is interpretable.

Mean error by ecozone, ROI vs ecozone median — Mean ratio error by ecozone. ROI (blue) beats the ecozone median (orange) in most regions, with the largest gains in sparse, internally varied regions.

Ecozone	n	Ecozone MAE	ROI MAE	ROI error reduction
Taiga Shield	16	0.060	0.033	+45%
Boreal Cordillera	14	0.091	0.055	+40%
Boreal Shield	148	0.058	0.041	+30%
Atlantic Maritime	102	0.056	0.040	+30%
Southern Arctic	9	0.076	0.057	+24%
Prairies	83	0.054	0.045	+17%
Mixedwood Plains	129	0.044	0.038	+13%
Boreal Plains	41	0.053	0.046	+12%
Montane Cordillera	52	0.070	0.064	+10%
Northern Arctic	11	0.043	0.046	−8%
Hudson Plains	3	0.040	0.045	−14%
Pacific Maritime	66	0.039	0.047	−19%

(Arctic Cordillera and Taiga Cordillera have n = 2 and are omitted from the table; both favour ROI but the sample is too small to read.)

Why ROI wins most where it does. Sparse, heterogeneous regions (the Taiga Shield, the Boreal Cordillera, the Southern Arctic, the vast Boreal Shield) are exactly where a single ecozone median blends gauges with genuinely different storm regimes. Pooling the nearest gauges instead recovers the local behaviour. Atlantic Maritime is large and spans a real coastal-to-interior gradient that ROI tracks and the median smears.

Why ROI loses in Pacific Maritime. This is the most instructive result. Pacific Maritime is a tight, internally homogeneous winter-frontal regime: its ecozone median is already the lowest-error of any region (0.039). There the fixed median is hard to beat, and ROI can pull in nearby Montane Cordillera gauges from the other side of the Coast Mountains, diluting a coastal estimate with an interior one. The same effect makes the single-station check at Vancouver Harbour (a Pacific Maritime gauge) marginally favour the ecozone median on the validation page.

Why station-to-station variance is expected

Even the better method leaves residual error, and that is not a defect to engineer away. A sub-daily ratio is set by the local mix of convective and synoptic storms producing the annual maxima, which varies between neighbouring gauges, and it is estimated from a finite record, so it carries sampling noise. A regional estimate therefore has irreducible spread: it predicts the regime, not the individual gauge's sampling realisation. The map below shows this directly. ROI improves the estimate over most of the country (green), but the gauge-level scatter remains, and is largest where records are short or regimes transition.

Map of per-station ROI improvement over ecozone — Per-station improvement of ROI over the ecozone median (green = ROI closer to ECCC's published curve). Improvement dominates nationally; the coastal Pacific is the main region where the fixed median holds its own.

The improvement is also duration-dependent. The two methods differ most at the short, design-relevant sub-hourly durations and converge toward the 24-hour anchor.

Error by duration and the region-by-duration improvement heatmap — Mean error by duration. Both methods tighten toward the 24-hour anchor (ratio = 1 by construction); the gap is widest at the sub-hourly durations that drive small-area design.

Region-by-duration improvement heatmap — ROI improvement over the ecozone median by region and duration (green = ROI better). The Pacific Maritime row is the notable band of red.

What this means for the tool

ROI is the better regionalization on independent, nationwide evidence (19% lower mean error, better at 63% of stations, validated against ECCC's published curves), and its pools are also more internally homogeneous than the ecozones (Hosking & Wallis discordancy and within-pool dispersion; see the development notes). It is therefore the default for the sub-daily ratio resolver (idf_analyzer.analysis.idf_ratios.resolve_ratios), with two deliberate exceptions where the ecozone median is used instead:

Where the median provably wins. This comparison is captured as a table, ratio_method_eval.ECOZONE_PREFERRED, listing the ecozones whose median beat ROI out-of-sample here with a large enough sample. Only the Pacific Maritime coast qualifies: a tight, internally homogeneous regime where the fixed median is hard to beat and ROI can pull in interior gauges from the wrong side of the Coast Mountains. There the resolver keeps the ecozone median, and says so.
Where ROI has no nearby gauge. When the nearest gauge is more than 300 km away, the resolver declines ROI and falls back to the documented ecozone table, so sparse and ungauged sites are not estimated from distant gauges.

In every case the method actually used (region of influence with its gauge count, or the ecozone median with the reason) is reported back in the IDF result so the choice is visible to the user.
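The decision described above can be pictured as a small function. This is a hypothetical sketch, not the signature or body of resolve_ratios: the function name, the returned dictionary, the order of the two checks, and the literal contents of the preferred-ecozone set are assumptions drawn from the prose on this page.

```python
D_MAX_KM = 300.0  # maximum nearest-gauge distance for ROI, from the text

# Assumption: mirrors the described content of ECOZONE_PREFERRED.
ECOZONE_PREFERRED = frozenset({"Pacific Maritime"})


def choose_method(ecozone, nearest_gauge_km, n_pool=15):
    """Pick the regionalization method and record why it was chosen."""
    if ecozone in ECOZONE_PREFERRED:
        return {"method": "ecozone_median",
                "reason": "ecozone median beat ROI out-of-sample here"}
    if nearest_gauge_km is None or nearest_gauge_km > D_MAX_KM:
        return {"method": "ecozone_median",
                "reason": "no gauge within 300 km"}
    return {"method": "roi", "gauges": n_pool}


coastal = choose_method("Pacific Maritime", 12.0)
remote = choose_method("Northern Arctic", 450.0)
typical = choose_method("Prairies", 35.0)
```

Returning the reason alongside the method is what lets the choice be reported back in the IDF result.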

Method and reproducibility

Data: ECCC Engineering Climate Datasets, Short-Duration Rainfall IDF, v3.40. Tables 1 (annual maxima) and 2a (published depths) parsed per station; lat/lon from each file header; ecozone by point-in-polygon (idf_analyzer.analysis.region_lookup).
Scoring: leave-one-station-out, mean absolute error of the predicted sub-daily ratio vector against ECCC's published ratios, with a station-level bootstrap CI.
ROI: \(k=15\) nearest gauges, haversine distance, inverse-distance weighting.
Run: python scripts/extract_idf_dataset.py then python scripts/compare_ratio_methods.py.
References: Burn (1990) (region of influence); Menabde et al. (1999) (rainfall scaling); Hosking & Wallis (1997) (homogeneity); ECCC Engineering Climate Datasets and the National Ecological Framework on the References page.