Paired calibration of the Lume 1.2 sensor against two EPA-approved reference methods — Colilert (IDEXX defined-substrate, MPN) and membrane filtration (MF, CFU) — and the Aquagenx Compartment Bag Test (CBT, MPN) used in international monitoring, for E. coli and total coliform quantification.
Live data from the mWater Lume 1.2 – 2026 Validation Data datagrid. Each water sample collection event is paired with a reference enumeration; the Method column distinguishes which reference was used — Colilert (IDEXX defined-substrate MPN), membrane filtration (MF, CFU), or compartment bag tests (CBT, MPN). The Use in Calibration column flags rows that are unusable because no /diagnostics record (water temperature, required by the CFU regression) was streaming within ±20 min of the sample. All date/time columns are displayed in UTC. Times are corrected from the mWater-stored value using the Timezone Entered column: mWater records times using the data-entry device’s local clock (Boulder, MDT = UTC−6); for samples collected in a different timezone the stored time is adjusted accordingly.
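The timestamp handling and the ±20 min matching rule described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the `UTC_OFFSET_HOURS` lookup, the function names, and the example timestamps are hypothetical, and fixed offsets stand in for whatever the Timezone Entered column really contains.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup from "Timezone Entered" labels to fixed UTC offsets (hours).
UTC_OFFSET_HOURS = {"MDT": -6, "UTC": 0}

# Matching window quoted in the text: a /diagnostics record within ±20 min.
MATCH_WINDOW = timedelta(minutes=20)


def to_utc(stored_local: datetime, tz_label: str) -> datetime:
    """Interpret a naive stored timestamp in the given zone and convert it to UTC."""
    offset = timezone(timedelta(hours=UTC_OFFSET_HOURS[tz_label]))
    return stored_local.replace(tzinfo=offset).astimezone(timezone.utc)


def use_in_calibration(sample_utc: datetime, diagnostics_utc: list) -> bool:
    """True when at least one /diagnostics record lies within ±20 min of the sample."""
    return any(abs(d - sample_utc) <= MATCH_WINDOW for d in diagnostics_utc)


# Made-up example: a sample entered at 09:30 on an MDT device.
sample = to_utc(datetime(2026, 6, 1, 9, 30), "MDT")
diag = [datetime(2026, 6, 1, 15, 45, tzinfo=timezone.utc)]
print(sample.isoformat(), use_in_calibration(sample, diag))
```

Keeping every timestamp timezone-aware before comparing avoids silently mixing local and UTC values when the ±20 min window is applied.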
All unique Colilert grab samples collected at Boulder Creek sites (n = 39, deduplicated). Values in CFU/100 mL. EPA single-sample recreational threshold: 126 CFU/100 mL.
| Site | n | Min | Median | Max | ≥126 CFU | All values (CFU/100 mL) |
|---|---|---|---|---|---|---|
| BC-CU | 12 | 12 | 47 | 1986 | 2 (17%) | 12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986 |
| BC-55 | 13 | 6 | 53 | 866 | 5 (38%) | 6, 20, 23, 24, 26, 28, 53, 75, 131, 145, 166, 378, 866 |
| BC-30 | 3 | 36 | 105 | 517 | 1 (33%) | 36, 105, 517 |
| BC-Can | 8 | 2 | 16 | 30 | 0 (0%) | 2, 3, 5, 7, 16, 17, 27, 30 |
| BC-Eben | 3 | 6 | 28 | 30 | 0 (0%) | 6, 28, 30 |
| All BC | 39 | 2 | 28 | 1986 | 8 (21%) | median = 28 • mean = 135 • ≥126: 8 of 39 |
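As a sketch of how one row of this table can be derived from its listed values, the snippet below recomputes the BC-CU summary. The function name `summarize` and the dictionary keys are illustrative choices, not names from the dashboard.

```python
from statistics import median

EPA_THRESHOLD = 126  # CFU/100 mL, the threshold quoted in the table caption


def summarize(values, threshold=EPA_THRESHOLD):
    """Return n, min, median, max and the count/percent of values at or above threshold."""
    n_exceed = sum(v >= threshold for v in values)
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "n_exceed": n_exceed,
        "pct_exceed": round(100 * n_exceed / len(values)),
    }


# BC-CU values copied from the table above.
bc_cu = [12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986]
print(summarize(bc_cu))
```

For BC-CU this gives n = 12, a median of 47, and 2 exceedances (17%), matching the row in the table.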
Pooled OLS: log10(colilert) ~ barcode + mon2_val + temperature + tof_mean + mon2_val×temperature (led_power = 512, sipm_bias ≈ 3000, reference barcode: 50046). Left: all matched grabs. Right: post burn-in only (sensors 50052 and 50066 excluded). Source data: ⬇ field_matched_512.csv
Sensor signal is first run through a physics-motivated correction pipeline derived from Bedell et al. 2022 (temperature) and Skinner et al. 2024 (turbidity), with ρ and k fit empirically from this field dataset rather than taken from the literature seeds:
mon2_corrected = mon2_val · exp(−ρ · (sipm_temperature − 20)) · exp(−k · NTU), NTU = max(0, −145.89 + 2.0488 · signal_per_spad_kcps)
Then a single-predictor OLS: log10(colilert) ~ barcode + mon2_corrected — 2 free coefficients instead of 5, sensor offsets in the FE intercept, all temperature and turbidity dependence absorbed into mon2_corrected. Fitted ρ = −0.111/°C (vs Bedell literature −0.03) and k = +0.0004/NTU (vs Skinner literature −0.004) on full data — the field Lume has a much steeper temperature dependence than Bedell measured in lab tryptophan standards, and the turbidity coefficient effectively vanishes in this drinking-water deployment.
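The correction formula can be written out as a short function. This is a sketch under stated assumptions: the function names are hypothetical, the defaults are the fitted ρ and k quoted above, and the example inputs are made up.

```python
import math

RHO = -0.111  # per °C, fitted value quoted in the text
K = 0.0004    # per NTU, fitted value quoted in the text


def estimate_ntu(signal_per_spad_kcps):
    """Turbidity proxy from the formula above, clipped at zero."""
    return max(0.0, -145.89 + 2.0488 * signal_per_spad_kcps)


def mon2_corrected(mon2_val, sipm_temperature, signal_per_spad_kcps, rho=RHO, k=K):
    """Apply the temperature and turbidity factors to the raw mon2_val reading."""
    ntu = estimate_ntu(signal_per_spad_kcps)
    temperature_factor = math.exp(-rho * (sipm_temperature - 20.0))
    turbidity_factor = math.exp(-k * ntu)
    return mon2_val * temperature_factor * turbidity_factor


# Made-up readings: at 20 °C with zero estimated NTU the value is unchanged.
print(mon2_corrected(1000.0, 20.0, 50.0))
# At 10 °C the negative rho scales the same raw reading down.
print(mon2_corrected(1000.0, 10.0, 50.0))
```

With ρ = −0.111/°C, a raw reading of 1000 taken at 10 °C is scaled to about 330, which illustrates how strongly the temperature term dominates the correction compared with the near-zero turbidity term.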
Compared to the 4-predictor original regression directly above, this 1-predictor corrected model achieves essentially equivalent in-sample R² on post-burn-in data (0.68 vs 0.66) and is far less overfit: the original's standard LOO R² is −0.92 on the full dataset (worse than predicting the mean); the corrected model's standard LOO is 0.43.
Two LOO numbers are reported because ρ and k were tuned on the data, and a fair generalization check has to account for that. Standard LOO refits only the OLS coefficients in each fold; ρ and k stay at their full-data values, so they leak into every fold. Nested LOO also refits ρ and k inside each fold, which makes it the rigorous measure. The nested LOO R² is meaningfully lower (0.31 full / 0.29 post-burn-in, versus 0.43 for standard LOO) because the fitted ρ wanders by ±0.02 across folds; that instability is real, and the nested value is what the page now displays as the honest generalization statistic.
The empirically fit ρ implies that the Lume's effective temperature sensitivity in the field is roughly 3.7× stronger than what Bedell measured for a pure tryptophan standard in the lab (0.111 vs 0.03 per °C in magnitude). This is consistent with the field signal including SiPM gain drift and UV LED output drift on top of fluorophore quenching. With N = 37 and 8 effective parameters (intercept + 4 sensor FE + slope + ρ + k), the model is at the threshold where more data is the only real remedy for the remaining nested LOO gap.
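The difference between standard and nested LOO can be illustrated with a small, self-contained sketch. It is not the page's implementation. The data are synthetic, only ρ is tuned (the turbidity term is omitted), ρ is chosen by grid search, and all function names are hypothetical. Sensor fixed effects are handled by within-sensor demeaning, which gives the same slope as OLS with one dummy per sensor.

```python
import math
import random
from collections import defaultdict


def correct(mon2_val, temp, rho):
    """Temperature-only version of the correction (turbidity term omitted)."""
    return mon2_val * math.exp(-rho * (temp - 20.0))


def fit_fe(rows):
    """OLS of y on x with one intercept per sensor, via within-sensor demeaning."""
    groups = defaultdict(list)
    for sensor, x, y in rows:
        groups[sensor].append((x, y))
    means = {s: (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
             for s, g in groups.items()}
    sxx = sum((x - means[s][0]) ** 2 for s, x, _ in rows)
    sxy = sum((x - means[s][0]) * (y - means[s][1]) for s, x, y in rows)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercepts = {s: my - slope * mx for s, (mx, my) in means.items()}
    return slope, intercepts


def predict(model, sensor, x):
    slope, intercepts = model
    # Unseen sensor: fall back to the average intercept (an assumption of this sketch).
    a = intercepts.get(sensor, sum(intercepts.values()) / len(intercepts))
    return a + slope * x


def design(data, rho):
    return [(s, correct(m, t, rho), y) for s, m, t, y in data]


def sse(data, rho):
    rows = design(data, rho)
    model = fit_fe(rows)
    return sum((y - predict(model, s, x)) ** 2 for s, x, y in rows)


def fit_rho(data, grid):
    """Pick the rho on the grid that minimizes in-sample squared error."""
    return min(grid, key=lambda rho: sse(data, rho))


def loo_r2(data, grid, nested):
    """LOO R²; nested=True refits rho inside every fold, nested=False reuses full-data rho."""
    rho_full = fit_rho(data, grid)
    ys = [row[3] for row in data]
    ybar = sum(ys) / len(ys)
    press = 0.0
    for i, (s, m, t, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        rho = fit_rho(train, grid) if nested else rho_full
        model = fit_fe(design(train, rho))
        press += (y - predict(model, s, correct(m, t, rho))) ** 2
    return 1.0 - press / sum((v - ybar) ** 2 for v in ys)


# Synthetic data: 3 sensors x 8 samples, simulated with rho = -0.1.
rng = random.Random(0)
data = []
for sensor, offset in [("A", 0.0), ("B", 0.3), ("C", -0.2)]:
    for _ in range(8):
        temp = rng.uniform(5.0, 25.0)
        true_signal = rng.uniform(1.0, 5.0)
        mon2_val = true_signal * math.exp(-0.1 * (temp - 20.0))
        y = offset + 0.4 * true_signal + rng.gauss(0.0, 0.2)
        data.append((sensor, mon2_val, temp, y))

grid = [round(-0.20 + 0.01 * i, 2) for i in range(21)]
standard_r2 = loo_r2(data, grid, nested=False)
nested_r2 = loo_r2(data, grid, nested=True)
print(f"standard LOO R2 = {standard_r2:.3f}, nested LOO R2 = {nested_r2:.3f}")
```

The only difference between the two calls is where `fit_rho` runs: once on all rows (standard) or once per fold on the training rows only (nested), so the held-out sample never influences the ρ used to predict it.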
Logistic regression: features mon2_val, temperature, tof_mean, mon2_val×temperature (z-scored per LOO fold) plus barcode fixed effects (reference: 50046). Left: all matched grabs. Right: post burn-in only. Metrics computed via LOO-CV. Source data: ⬇ field_matched_512.csv
Same logistic regression with the same Bedell + Skinner correction pipeline as the corrected OLS above. Single continuous feature mon2_corrected (z-scored per LOO fold) plus barcode fixed effects (reference: 50046). One free continuous coefficient instead of four. Empirically-fit ρ and k from the OLS section are reused.
Jan 2–6, 2026 calibration sessions (paper training range). Fluorescence signal (mon2_val) at the paper operating point: led_power = 1024, sipm_bias = 3040. CBT calibration data is in the field-calibration table above.
Pooled OLS (Jan 2–6, n = 125): log₁₀(colilert) ~ barcode + signal × floor_temp × tof_mean. Barcode is a fixed-effect intercept shift (reference: 50030); slopes shared. Fit separately for each LED/bias operating point. In-sample R².
LED = 1024 · bias = 3040 — paper
LED = 512 · bias = 3000 — production
LED = 256 · bias = 3300 — original
Logistic regression trained on the full lab dataset (Jan 2–21 2026, n = 300 rows with valid production-combo readings), binary label: Colilert ≥ 126 CFU/100 mL. Features: mon2_val_512, floor_temp, tof_mean (all z-scored; no sensor fixed effect so the model is sensor-agnostic). Performance estimated by leave-one-out cross-validation (LOO-CV): each fold re-standardizes from the training set of n = 299 before predicting the held-out row. Note: the 9 positive examples all come from one contamination event (Jan 8, three consecutive time points). LOO-CV performance for the positive class may be over-optimistic due to temporal correlation between the 9 rows.
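The per-fold re-standardization described above can be sketched as follows. This is an illustrative outline only: the function names and the toy rows are hypothetical, and the classifier itself is left out so that the focus stays on how the held-out row is scaled with training-fold statistics.

```python
from statistics import mean, pstdev


def zscore_fold(train_rows, test_row):
    """Standardize each feature using the training fold's mean and standard deviation."""
    columns = list(zip(*train_rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, mus, sds)]

    return [scale(r) for r in train_rows], scale(test_row)


def loo_folds(rows):
    """Yield (standardized training rows, standardized held-out row) for each fold."""
    for i, held_out in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        yield zscore_fold(train, held_out)


# Toy feature rows (two features); the last row is an outlier in the first feature.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]]
folds = list(loo_folds(rows))
train_z, test_z = folds[-1]
print(test_z)
```

Because the outlier is excluded from its own fold's mean and standard deviation, its standardized value stays extreme, which is the behavior wanted when estimating out-of-sample performance.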
Logistic regression fit on the combined lab (Jan 2–21 2026, n = 300) and field (n = 30) datasets. Binary label: Colilert ≥ 126 CFU/100 mL (16 positives total: 9 lab, 7 field). Features: mon2_val, temperature, tof_mean (continuous, z-scored per LOO fold) plus a dummy variable for every sensor relative to reference 50030 (unscaled 0/1). All 9 sensors are included. Field sensors with very few samples (50052 n=2, 50062 n=1, 50066 n=1) have sparse dummy estimates; their LOO predictions fall back to the pooled signal when their dummy is unobserved in training.
How does the Lume compare against the two EPA-approved laboratory methods — Colilert (IDEXX) and membrane filtration (MF) — and against itself when retrained on a different reference? Each column analyzes paired samples across three frameworks: log-log regression (top), Bland-Altman agreement (middle), and categorical classification (bottom).
The dedicated method comparison study pairs Colilert (n = 2 replicates) with membrane filtration (n = 3 replicates) across 161 datetimes; 8 zero-valued pairs are excluded from the log-scale analysis, yielding 153 observations. The two EPA-approved methods show R² = 0.572 with a +0.35 log10 bias: MF systematically reads ~2.2× higher than Colilert. The 95% limits of agreement span [−0.64, +1.34], meaning an MF result can be up to ~22× higher or ~4.4× lower than its paired Colilert result. Categorical accuracy is 0.66 (Cohen’s κ = 0.40), i.e. “fair” agreement. This inter-method disagreement sets the ceiling for what any sensor can be expected to achieve against either reference.
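The Bland-Altman quantities and their conversion from log10 units to fold differences can be sketched as below. The function names are illustrative, and the sign convention (second method minus first) is an assumption chosen so that a positive bias means the second method reads higher.

```python
from statistics import mean, stdev


def bland_altman(log_a, log_b):
    """Bias and 95% limits of agreement for paired log10 values (b minus a)."""
    diffs = [b - a for a, b in zip(log_a, log_b)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


def fold(log10_diff):
    """Convert a log10 difference into a multiplicative factor (always >= 1)."""
    return 10 ** abs(log10_diff)


# Fold factors for the bias and the two limits quoted in the text.
print(round(fold(0.35), 1), round(fold(-0.64), 1), round(fold(1.34), 1))  # 2.2 4.4 21.9
```

Because the limits are centered on the +0.35 bias rather than on zero, the two ends translate into different fold factors: about 22× on the upper side and about 4.4× on the lower side.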
The Colilert-trained Lume regression is evaluated against Colilert across all bench (n = 176) and field (n = 33) observations. The sensor achieves R² = 0.881, a bias of 0.00 log10, and tight limits of agreement [−0.42, +0.42] — Lume predictions stay within ~2.6× of the reference. Categorical accuracy is 0.89 with κ = 0.88, which is “almost perfect” agreement. Against its training reference, the Lume performs as well as or better than the two EPA methods perform against each other.
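For readers who want to reproduce the categorical metrics, a minimal sketch of accuracy and Cohen's κ follows. The category labels in the example are hypothetical, since the text does not list the classes used for the categorical comparison.

```python
from collections import Counter


def accuracy(truth, pred):
    """Fraction of paired labels that match."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def cohens_kappa(truth, pred):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    n = len(truth)
    p_o = accuracy(truth, pred)
    truth_counts, pred_counts = Counter(truth), Counter(pred)
    labels = set(truth_counts) | set(pred_counts)
    p_e = sum(truth_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for four paired samples.
truth = ["low", "low", "high", "high"]
pred = ["low", "high", "high", "high"]
print(accuracy(truth, pred), cohens_kappa(truth, pred))  # 0.75 0.5
```

κ discounts the agreement expected by chance from the marginal class frequencies, which is why it is usually lower than raw accuracy when one class dominates.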
The same Colilert-trained Lume model is now evaluated against membrane filtration, a reference method it was never trained on. Performance drops to R² = 0.514 with LoA [−0.80, +0.83] and categorical accuracy 0.84 (κ = 0.65). Critically, the ~0.37 drop in R² from column 2 to column 3 (0.881 to 0.514) is of the same order as the variance left unexplained between Colilert and MF themselves (column 1, R² = 0.572, so about 0.43 unexplained). Most of the apparent loss is therefore attributable to reference-method disagreement, not sensor limitations.
To isolate the effect of reference-method choice, the Lume regression is refit using MF as the training target, over the full bucket dataset. Performance returns to R² = 0.872, essentially matching the Colilert-trained model against Colilert. Bias is 0.00 with LoA [−0.93, +0.93]; the wider LoA relative to the Colilert-trained model (±0.42) reflects the higher within-method variability of MF replicates (57.9% RPD vs. 43.5% for Colilert), not a sensor deficiency. Categorical accuracy is 0.81 (κ = 0.66).
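The text does not say how relative percent difference (RPD) is computed for replicate sets. One common definition, shown here purely as an assumption, is the range of the replicates divided by their mean. For two replicates this reduces to the usual |a − b| / mean(a, b); extending it to three replicates is a choice made for this sketch, and the function name and example values are hypothetical.

```python
def rpd(replicates):
    """Range of the replicates divided by their mean, in percent (assumed definition)."""
    m = sum(replicates) / len(replicates)
    if m == 0:
        return 0.0
    return 100.0 * (max(replicates) - min(replicates)) / m


# Made-up replicate counts (CFU or MPN per 100 mL).
print(rpd([40, 60]))      # duplicate pair
print(rpd([30, 50, 70]))  # triplicate set
```

Averaging such per-sample values across all sampling events would give a single within-method variability figure comparable in form to the percentages quoted above.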
Sensor-to-reference agreement is bounded by reference-method reproducibility, not by Lume hardware. Whichever culture method is adopted as truth, the Lume fits it at R² ≈ 0.87–0.88. The gap between columns 2 and 3 is almost exactly the disagreement between the two lab methods themselves (column 1). The Lume is method-agnostic; its ceiling is set by the reference it is trained against, and it already achieves quantitative performance at or above the inter-method agreement ceiling between the two accepted laboratory techniques — while providing continuous temporal coverage that grab-sample laboratory methods cannot.