Paired calibration of the Lume 1.2 sensor against two EPA-approved reference methods — Colilert (IDEXX defined-substrate, MPN) and membrane filtration (MF, CFU) — and the Aquagenx Compartment Bag Test (CBT, MPN) used in international monitoring, for E. coli and total coliform quantification.
Field Validation
Lab Validation
Live data from the mWater Lume 1.2 – 2026 Validation Data datagrid. Each water sample collection event is paired with a reference enumeration; the Method column distinguishes which reference was used — Colilert (IDEXX defined-substrate MPN), membrane filtration (MF, CFU), or compartment bag tests (CBT, MPN). The Use in Calibration column flags rows that are unusable because no /diagnostics record (water temperature, required by the CFU regression) was streaming within ±20 min of the sample. All date/time columns are displayed in UTC. Times are corrected from the mWater-stored value using the Timezone Entered column: mWater records times using the data-entry device’s local clock (Boulder, MDT = UTC−6); for samples collected in a different timezone the stored time is adjusted accordingly.
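The time handling described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the function names and the `utc_offset_hours` parameter are hypothetical, and it assumes stored times are naive datetimes on the data-entry device's clock (UTC−6 by default, per the text).

```python
from datetime import datetime, timedelta, timezone

# Matching window quoted in the text: a /diagnostics record must fall
# within +/-20 min of the sample for the row to be usable.
MATCH_WINDOW = timedelta(minutes=20)

def to_utc(stored_naive, utc_offset_hours=-6):
    """Convert a stored device-local time to an aware UTC datetime.

    utc_offset_hours is a hypothetical parameter; -6 corresponds to the
    Boulder MDT device clock mentioned above.
    """
    shifted = stored_naive - timedelta(hours=utc_offset_hours)
    return shifted.replace(tzinfo=timezone.utc)

def usable_for_calibration(sample_utc, diagnostics_utc):
    """True if any diagnostics timestamp lies within the matching window."""
    return any(abs(d - sample_utc) <= MATCH_WINDOW for d in diagnostics_utc)

# Illustrative values only (not taken from the datagrid).
sample = to_utc(datetime(2026, 1, 5, 10, 0))
diagnostics = [to_utc(datetime(2026, 1, 5, 10, 19))]
print(sample.isoformat(), usable_for_calibration(sample, diagnostics))
```

A sample entered in a different timezone would pass its own offset instead of the default; how the Timezone Entered column encodes that offset is not specified here.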
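All unique Colilert grab samples collected at Boulder Creek sites (n = 39, deduplicated). Values in CFU/100 mL. Threshold applied to each sample: 126 CFU/100 mL, the EPA recreational E. coli criterion (which EPA defines as a geometric mean rather than a single-sample limit).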
| Site | n | Min | Median | Max | ≥126 CFU | All values (CFU/100 mL) |
|---|---|---|---|---|---|---|
| BC-CU | 12 | 12 | 47 | 1986 | 2 (17%) | 12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986 |
| BC-55 | 13 | 6 | 53 | 866 | 4 (31%) | 6, 20, 23, 24, 26, 28, 53, 75, 131, 145, 166, 378, 866 |
| BC-30 | 3 | 36 | 105 | 517 | 1 (33%) | 36, 105, 517 |
| BC-Can | 8 | 2 | 16 | 30 | 0 (0%) | 2, 3, 5, 7, 16, 17, 27, 30 |
| BC-Eben | 3 | 6 | 28 | 30 | 0 (0%) | 6, 28, 30 |
| All BC | 39 | 2 | 30 | 1986 | 7 (18%) | median = 30 • mean = 171 • ≥126: 7 of 39 |
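As a sketch of how a per-site row can be derived from its listed values, the snippet below recomputes the BC-CU summary. The `summarize` helper is hypothetical; the values and the 126 CFU/100 mL threshold are taken from the table above.

```python
from statistics import median

THRESHOLD = 126  # CFU/100 mL, threshold quoted above

def summarize(values, threshold=THRESHOLD):
    """Return n, min, median, max and the count/percent at or above threshold."""
    n = len(values)
    exceed = sum(v >= threshold for v in values)
    return {
        "n": n,
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "exceed": exceed,
        "exceed_pct": round(100 * exceed / n),
    }

# BC-CU values as listed in the table.
bc_cu = [12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986]
print(summarize(bc_cu))
```

With an even number of samples, `statistics.median` averages the two middle values, which is how the BC-CU median of 47 arises from 44 and 50.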
Pooled OLS: log10(colilert) ~ barcode + mon2_val + temperature + tof_mean + mon2_val×temperature (led_power = 512, sipm_bias ≈ 3000, reference barcode: 50046). Left: all matched grabs (n = 34). Right: post burn-in only (n = 21, sensors 50052 and 50066 excluded). Source data: ⬇ field_matched_512.csv
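To make the model formula concrete, the sketch below expands one observation into the design-matrix row implied by the formula: an intercept, one 0/1 dummy per non-reference barcode, the three continuous terms, and the mon2_val×temperature interaction. The `design_row` helper and the numeric inputs are hypothetical; only the barcode identifiers come from the text.

```python
def design_row(barcode, mon2_val, temperature, tof_mean, barcodes, reference="50046"):
    """Build one design-matrix row for the pooled OLS formula.

    Columns: intercept, barcode dummies (reference barcode omitted),
    mon2_val, temperature, tof_mean, mon2_val * temperature.
    """
    dummies = [1.0 if barcode == b else 0.0 for b in barcodes if b != reference]
    return [1.0] + dummies + [mon2_val, temperature, tof_mean, mon2_val * temperature]

# Illustrative subset of sensors; readings are made-up numbers.
barcodes = ["50046", "50052", "50066"]
print(design_row("50052", 2.0, 10.0, 5.0, barcodes))
print(design_row("50046", 2.0, 10.0, 5.0, barcodes))
```

Because the reference barcode has all dummies at zero, each dummy coefficient reads as an intercept shift relative to sensor 50046, while the slopes are shared across sensors.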
Logistic regression: features mon2_val, temperature, tof_mean, mon2_val×temperature (z-scored per LOO fold) plus barcode fixed effects (reference: 50046). Left: all matched grabs (n = 34). Right: post burn-in only (n = 21). Metrics computed via LOO-CV. Source data: ⬇ field_matched_512.csv
Paired calibration of the Lume sensor against the Aquagenx Compartment Bag Test (CBT), an MPN-based reference method used in international water quality monitoring. Kigali field deployment data.
Jan 2–6, 2026 calibration sessions (paper training range). Fluorescence signal (mon2_val) at the paper operating point: led_power = 1024, sipm_bias = 3040. CBT calibration data is in the field-calibration table above.
Pooled OLS (Jan 2–6, n = 125): log₁₀(colilert) ~ barcode + signal × floor_temp × tof_mean. Barcode is a fixed-effect intercept shift (reference: 50030); slopes shared. Fit separately for each LED/bias operating point. In-sample R².
LED = 1024 · bias = 3040 — paper
LED = 512 · bias = 3000 — production
LED = 256 · bias = 3300 — original
Logistic regression trained on the full lab dataset (Jan 2–21 2026, n = 300 rows with valid production-combo readings), binary label: Colilert ≥ 126 CFU/100 mL. Features: mon2_val_512, floor_temp, tof_mean (all z-scored; no sensor fixed effect so the model is sensor-agnostic). Performance estimated by leave-one-out cross-validation (LOO-CV): each fold re-standardizes from the training set of n = 299 before predicting the held-out row. Note: the 9 positive examples all come from one contamination event (Jan 8, three consecutive time points). LOO-CV performance for the positive class may be over-optimistic due to temporal correlation between the 9 rows.
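The per-fold re-standardization can be sketched as below: for each held-out row, the mean and standard deviation are computed from the remaining rows only and then applied to both the training rows and the held-out row, so no information from the held-out row leaks into the scaling. Helper names are hypothetical, the classifier itself is omitted, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def zscore_params(rows):
    """Column-wise mean and population std of a list of feature rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return mus, sds

def apply_z(row, mus, sds):
    """Standardize one row with previously computed parameters."""
    return [(x - m) / s for x, m, s in zip(row, mus, sds)]

def loo_folds(rows):
    """Yield (standardized training rows, standardized held-out row) per fold."""
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        mus, sds = zscore_params(train)  # fit on the n - 1 training rows only
        yield [apply_z(r, mus, sds) for r in train], apply_z(held_out, mus, sds)

# Toy feature rows (made-up numbers), two features each.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
for train_z, test_z in loo_folds(rows):
    print(len(train_z), [round(v, 3) for v in test_z])
```

In a full pipeline, a model would be fit on `train_z` and used to predict `test_z` inside the loop; rows from the same contamination event would still be correlated across folds, which is the optimism caveat noted above.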
Logistic regression fit on the combined lab (Jan 2–21 2026, n = 300) and field (n = 30) datasets. Binary label: Colilert ≥ 126 CFU/100 mL (16 positives total: 9 lab, 7 field). Features: mon2_val, temperature, tof_mean (continuous, z-scored per LOO fold) plus a dummy variable for every sensor relative to reference 50030 (unscaled 0/1). All 9 sensors are included. Field sensors with very few samples (50052 n=2, 50062 n=1, 50066 n=1) have sparse dummy estimates; their LOO predictions fall back to the pooled signal when their dummy is unobserved in training.
How does the Lume compare against the two EPA-approved laboratory methods — Colilert (IDEXX) and membrane filtration (MF) — and against itself when retrained on a different reference? Each column analyzes paired samples across three frameworks: log-log regression (top), Bland-Altman agreement (middle), and categorical classification (bottom).
The dedicated method comparison study pairs Colilert (n = 2 replicates) with membrane filtration (n = 3 replicates) across 161 datetimes; 8 zero-valued pairs are excluded from the log-scale analysis, yielding 153 observations. The two EPA-approved methods show R² = 0.572 with a +0.35 log10 bias: MF systematically reads ~2.2× higher than Colilert. The 95% limits of agreement span [−0.64, +1.34], meaning a paired MF result can read up to ~22× higher or ~4.4× lower than its Colilert counterpart. Categorical accuracy is 0.66 (Cohen’s κ = 0.40), i.e. “fair” agreement. This inter-method disagreement sets the ceiling for what any sensor can be expected to achieve against either reference.
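The log10 bias and limits of agreement quoted above convert to fold differences by raising 10 to that value. The sketch below shows a generic Bland-Altman calculation on the log10 scale and that conversion; function names are hypothetical, the paired values are toy numbers, and the 1.96 multiplier assumes approximately normal differences.

```python
import math
from statistics import mean, stdev

def bland_altman_log10(method_a, method_b):
    """Bias and 95% limits of agreement of log10(b) - log10(a).

    Pairs containing a zero are dropped, since log10(0) is undefined.
    """
    diffs = [math.log10(b) - math.log10(a)
             for a, b in zip(method_a, method_b) if a > 0 and b > 0]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def fold(log10_diff):
    """Convert a log10 difference to a multiplicative fold difference."""
    return 10 ** log10_diff

# Toy paired counts (not study data); the zero pair is excluded.
a = [10, 10, 10, 0]
b = [100, 10, 1, 5]
print(bland_altman_log10(a, b))

# Fold equivalents of the figures quoted in the text.
print(round(fold(0.35), 1), round(fold(1.34)), round(fold(0.64), 1))
```

The last line reproduces the ~2.2× bias and the ~22× upper limit, and shows that the lower limit of −0.64 corresponds to a factor of about 4.4.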
The Colilert-trained Lume regression is evaluated against Colilert across all bench (n = 176) and field (n = 33) observations. The sensor achieves R² = 0.881, a bias of 0.00 log10, and tight limits of agreement [−0.42, +0.42] — Lume predictions stay within ~2.6× of the reference. Categorical accuracy is 0.89 with κ = 0.88, which is “almost perfect” agreement. Against its training reference, the Lume performs as well as or better than the two EPA methods perform against each other.
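For reference, Cohen's κ corrects raw categorical accuracy for the agreement expected by chance. A minimal sketch of the standard formula follows; the category labels are hypothetical and do not represent the bins used in this analysis.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    n = len(labels_a)
    # Observed agreement: fraction of pairs with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal frequencies, summed over categories.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1

# Toy example with made-up labels.
reference = ["low", "low", "high", "high"]
predicted = ["low", "high", "high", "high"]
print(cohens_kappa(reference, predicted))
```

In this toy case the accuracy is 0.75 but κ is 0.5, because half of the agreement would be expected by chance given the label frequencies.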
The same Colilert-trained Lume model is now evaluated against membrane filtration — a reference method it was never trained on. Performance drops to R² = 0.514 with LoA [−0.80, +0.83] and categorical accuracy 0.84 (κ = 0.65). Critically, the ~0.37 drop in R² from column 2 to column 3 is of the same order as the inter-method disagreement between Colilert and MF themselves (column 1, R² = 0.572). Most of the apparent loss is attributable to reference-method disagreement, not sensor limitations.
To isolate the effect of reference-method choice, the Lume regression is refit using MF as the training target, over the full bucket dataset. Performance jumps back to R² = 0.872, essentially matching the Colilert-trained model against Colilert. Bias is 0.00 with LoA [−0.93, +0.93]; the wider LoA (compared with ±0.42 for the Colilert-trained model) reflects the higher within-method variability of MF replicates (57.9% RPD vs. 43.5% for Colilert), not a sensor deficiency. Categorical accuracy is 0.81 (κ = 0.66).
Sensor-to-reference agreement is bounded by reference-method reproducibility, not by Lume hardware. Whichever culture method is adopted as truth, the Lume fits it at R² ≈ 0.87–0.88. The gap between columns 2 and 3 (a drop of ~0.37 in R²) is of the same order as the disagreement between the two lab methods themselves (column 1, R² = 0.572). The Lume is method-agnostic: its ceiling is set by the reference it is trained against, and it already achieves quantitative performance at or above the inter-method agreement ceiling between the two accepted laboratory techniques, while providing continuous temporal coverage that grab-sample laboratory methods cannot.