Paired calibration of the Lume 1.2 sensor against two EPA-approved reference methods — Colilert (IDEXX defined-substrate, MPN) and membrane filtration (MF, CFU) — and the Aquagenx Compartment Bag Test (CBT, MPN) used in international monitoring, for E. coli and total coliform quantification.
Field Validation
Lab Validation
Live data from the mWater Lume 1.2 – 2026 Validation Data datagrid. Each water sample collection event is paired with a reference enumeration; the Method column distinguishes which reference was used — Colilert (IDEXX defined-substrate MPN), membrane filtration (MF, CFU), or compartment bag tests (CBT, MPN). The Use in Calibration column flags rows that are unusable because no /diagnostics record (water temperature, required by the CFU regression) was streaming within ±20 min of the sample. All date/time columns are displayed in UTC. Times are corrected from the mWater-stored value using the Timezone Entered column: mWater records times using the data-entry device’s local clock (Boulder, MDT = UTC−6); for samples collected in a different timezone the stored time is adjusted accordingly.
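The UTC correction and the ±20 min diagnostics check described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual code: the function names, the `utc_offset_hours` argument, and the example timestamps are all hypothetical, and it assumes the stored value is a naive wall-clock time whose UTC offset is known from the Timezone Entered column.

```python
from datetime import datetime, timedelta, timezone

def to_utc(stored_local, utc_offset_hours):
    """Attach the known offset to a naive wall-clock time and convert to UTC."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return stored_local.replace(tzinfo=zone).astimezone(timezone.utc)

def usable_for_calibration(sample_utc, diagnostics_utc, window_min=20):
    """True if any diagnostics record lies within +/- window_min of the sample."""
    window = timedelta(minutes=window_min)
    return any(abs(d - sample_utc) <= window for d in diagnostics_utc)

# Hypothetical sample entered on a device set to MDT (UTC-6).
sample = to_utc(datetime(2026, 6, 3, 9, 15), utc_offset_hours=-6)
diagnostics = [
    to_utc(datetime(2026, 6, 3, 9, 30), -6),   # 15 min after the sample
    to_utc(datetime(2026, 6, 3, 11, 0), -6),   # well outside the window
]

print(sample.isoformat())                           # 2026-06-03T15:15:00+00:00
print(usable_for_calibration(sample, diagnostics))  # True
```

A row with no diagnostics record inside the window would return `False` and be flagged as unusable.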
All unique Colilert grab samples collected at Boulder Creek sites (n = 39, deduplicated). Values in CFU/100 mL. EPA single-sample recreational threshold: 126 CFU/100 mL.
| Site | n | Min | Median | Max | ≥126 CFU | All values (CFU/100 mL) |
|---|---|---|---|---|---|---|
| BC-CU | 12 | 12 | 47 | 1986 | 2 (17%) | 12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986 |
| BC-55 | 13 | 6 | 53 | 866 | 5 (38%) | 6, 20, 23, 24, 26, 28, 53, 75, 131, 145, 166, 378, 866 |
| BC-30 | 3 | 36 | 105 | 517 | 1 (33%) | 36, 105, 517 |
| BC-Can | 8 | 2 | 11.5 | 30 | 0 (0%) | 2, 3, 5, 7, 16, 17, 27, 30 |
| BC-Eben | 3 | 6 | 28 | 30 | 0 (0%) | 6, 28, 30 |
| All BC | 39 | 2 | 28 | 1986 | 8 (21%) | median = 28 • mean = 135 • ≥126: 8 of 39 |
Sensor signal is run through a physics-motivated correction pipeline derived from Bedell et al. 2022 (temperature) and Skinner et al. 2024 (turbidity), with ρ and k fit empirically from this field dataset:
mon2_corrected = mon2_val · exp(−ρ · (sipm_temperature − 20)) · exp(−k · NTU), NTU = max(0, −145.89 + 2.0488 · signal_per_spad_kcps)
Single-predictor OLS with per-sensor fixed-effect intercept: log10(colilert) ~ barcode + mon2_corrected. Fitted ρ = −0.111/°C (vs Bedell literature −0.03) and k = +0.0004/NTU (vs Skinner literature −0.004) on full data — the field Lume has a steeper temperature dependence than Bedell measured in lab tryptophan standards, and the turbidity coefficient effectively vanishes in this drinking-water deployment. Source data: ⬇ field_matched_512.csv
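The correction formula above translates directly into code. The sketch below uses the field-fit ρ and k quoted in the text; the function and constant names are illustrative, and the scatter argument stands in for `signal_per_spad_kcps`.

```python
import math

def ntu_from_scatter(scatter_kcps):
    """Turbidity estimate from the scatter channel, clipped at zero."""
    return max(0.0, -145.89 + 2.0488 * scatter_kcps)

def mon2_corrected(mon2_val, temp_c, scatter_kcps, rho, k):
    """Apply the temperature and turbidity corrections to a raw mon2 reading."""
    ntu = ntu_from_scatter(scatter_kcps)
    return mon2_val * math.exp(-rho * (temp_c - 20.0)) * math.exp(-k * ntu)

RHO_FIELD = -0.111   # per degC, field fit quoted above
K_FIELD = 0.0004     # per NTU, field fit quoted above

# At 20 degC with scatter below the clipping point, the signal is unchanged.
print(mon2_corrected(1000.0, 20.0, 50.0, RHO_FIELD, K_FIELD))  # 1000.0

# At 25 degC the same reading is scaled by exp(0.111 * 5), roughly 1.74x.
print(mon2_corrected(1000.0, 25.0, 50.0, RHO_FIELD, K_FIELD))
```

Because ρ is negative, readings taken above 20 °C are scaled up and readings below 20 °C are scaled down.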
Same correction pipeline driving a logistic regression: single continuous feature mon2_corrected plus per-sensor fixed effects. One free continuous coefficient. ROC curves below show out-of-sample probabilities. Source data: ⬇ field_matched_512.csv
Jan 2–6, 2026 calibration sessions (paper training range). Fluorescence signal (mon2_val) at the paper operating point: led_power = 1024, sipm_bias = 3040. CBT calibration data is in the field-calibration table above.
Pooled OLS (Jan 2–6, n = 125): log₁₀(colilert) ~ barcode + signal × floor_temp × tof_mean. Barcode is a fixed-effect intercept shift (reference: 50030); slopes shared. Fit separately for each LED/bias operating point. In-sample R².
LED = 1024 · bias = 3040 — paper
LED = 512 · bias = 3000 — production
LED = 256 · bias = 3300 — original
Same Bedell + Skinner correction pipeline used in the field section, refit on the lab data (production combo LED 512 / bias 3000, n = 300 across Jan 2–21 2026):
mon2_corrected = mon2_val_512 · exp(−ρ · (floor_temp − 20)) · exp(−k · NTU), NTU = max(0, −145.89 + 2.0488 · tof_mean)
Single-predictor OLS: log₁₀(colilert) ~ barcode + mon2_corrected. Reference barcode 50030. Lab uses floor_temp (water temperature) as the temperature input — the proper Bedell input. Lab-fit ρ = −0.2015/°C and k = +0.01015/NTU. Lab ρ is roughly 2× the field-fit ρ = −0.111: the lab covers a wider temperature range (3.4 → 26.5 °C vs ~15 °C in the field) and a real turbidity range, so the fit has more leverage to recover the underlying physics. Lab corrected R² = 0.902 beats the 3-way interaction's R² = 0.867 on a larger dataset with half the free parameters.
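A model of the form log₁₀(colilert) ~ barcode + mon2_corrected can be fit with ordinary least squares on a design matrix holding one slope column plus a dummy column per non-reference barcode. The sketch below assumes NumPy and uses synthetic, noise-free numbers rather than the validation data. The function name and return layout are illustrative only.

```python
import numpy as np

def fit_fixed_effect_ols(barcodes, x, y, reference):
    """OLS of y on x with an intercept shift per non-reference barcode."""
    barcodes = np.asarray(barcodes)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    levels = [b for b in sorted(set(barcodes.tolist())) if b != reference]
    design = np.column_stack(
        [np.ones_like(x), x] + [(barcodes == b).astype(float) for b in levels]
    )
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return {
        "intercept": float(coef[0]),
        "slope": float(coef[1]),
        "shifts": {b: float(c) for b, c in zip(levels, coef[2:])},
        "r2": float(r2),
    }

# Synthetic illustration: three sensors sharing one slope, different offsets.
barcodes = [50030] * 4 + [50031] * 4 + [50032] * 4
x = np.array([100, 300, 600, 900] * 3, dtype=float)
true_shift = {50030: 0.0, 50031: 0.3, 50032: -0.2}
y = np.array([0.5 + 0.002 * xi + true_shift[b] for b, xi in zip(barcodes, x)])

fit = fit_fixed_effect_ols(barcodes, x, y, reference=50030)
print(fit["shifts"])  # offsets of 50031 and 50032 relative to 50030
```

The slope is shared across sensors, and each entry in `shifts` is that sensor's intercept relative to the reference barcode.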
Logistic regression trained on the full lab dataset (Jan 2–21 2026, n = 300 rows with valid production-combo readings), binary label: Colilert ≥ 126 CFU/100 mL. Features: mon2_val_512, floor_temp, tof_mean (all z-scored; no sensor fixed effect so the model is sensor-agnostic). Performance estimated by leave-one-out cross-validation (LOO-CV): each fold re-standardizes from the training set of n = 299 before predicting the held-out row. Note: the 9 positive examples all come from one contamination event (Jan 8, three consecutive time points). LOO-CV performance for the positive class may be over-optimistic due to temporal correlation between the 9 rows.
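The per-fold re-standardization matters because z-scoring with statistics from the full dataset would leak the held-out row into training. The sketch below shows the pattern with NumPy. The logistic fit is a plain gradient-descent stand-in with a small ridge penalty, chosen here only to keep the example self-contained, and is not necessarily the solver used for the results above. The data are synthetic.

```python
import numpy as np

def fit_logistic(X, y, l2=0.1, iters=3000):
    """Gradient-descent logistic regression with a ridge penalty on slopes."""
    A = np.column_stack([np.ones(len(X)), X])
    ridge = np.full(A.shape[1], l2)
    ridge[0] = 0.0  # leave the intercept unpenalized
    # Step size from the Lipschitz bound of the penalized log-loss gradient.
    step = 1.0 / (0.25 * np.linalg.norm(A, 2) ** 2 + l2)
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(A @ w)))
        w -= step * (A.T @ (p - y) + ridge * w)
    return w

def loo_probabilities(X, y):
    """Leave-one-out probabilities, z-scoring each fold from training rows only."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    probs = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        mu = X[train].mean(axis=0)
        sd = X[train].std(axis=0)
        w = fit_logistic((X[train] - mu) / sd, y[train])
        z = (X[i] - mu) / sd  # held-out row scaled with training statistics
        probs[i] = 1.0 / (1.0 + np.exp(-(w[0] + z @ w[1:])))
    return probs

# Synthetic stand-ins for mon2_val_512, floor_temp, tof_mean.
signal = np.concatenate([np.linspace(100, 200, 12), np.linspace(600, 700, 8)])
temp = np.tile([5.0, 15.0, 25.0, 10.0], 5)
tof = np.tile([72.0, 80.0, 88.0, 76.0, 84.0], 4)
X = np.column_stack([signal, temp, tof])
y = np.array([0] * 12 + [1] * 8)

probs = loo_probabilities(X, y)
accuracy = np.mean((probs > 0.5) == (y == 1))
print(accuracy)
```

Plain LOO treats rows as independent, so it cannot account for the single-event caveat noted above. Holding out whole events rather than single rows would be one way to probe that.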
Same correction pipeline as the corrected lab OLS above, driving a logistic regression for the same ≥126 CFU/100 mL threshold. Single continuous feature mon2_corrected (z-scored per LOO fold) plus barcode dummies for 50031 and 50032 (reference: 50030). One free continuous coefficient instead of three. Uses the lab-fit ρ = −0.2015/°C and k = +0.01015/NTU. Same single-event caveat applies — the 9 positive examples are all from the Jan 8 contamination event, so LOO sensitivity is over-optimistic.
Continuous regression on the combined lab dataset (Jan 2–21 2026, n = 300, 3 sensors) and the post-burn-in field dataset (n = 24, 4 sensors — 50046 / 50048 / 50059 / 50066, with the singleton 50062 dropped and pre-burnin samples excluded per the field burn-in dates). Same Bedell + Skinner correction pipeline, with ρ and k fit jointly on the pooled data:
mon2_corrected = mon2 · exp(−ρ·(T−20)) · exp(−k·NTU), with T = floor_temp for lab rows and T = sipm_temperature for field rows (no water probe in the field build).
Single-predictor OLS with per-sensor fixed effects: log₁₀(colilert) ~ barcode + mon2_corrected. Reference 50030. Pooled-fit ρ = −0.2015/°C and k = +0.00936/NTU (essentially the lab fit; field rows are too few to move the joint optimum).
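The text does not say which optimizer produced the joint ρ, k fit. One simple way to illustrate the idea is a coarse grid search: for each candidate (ρ, k), correct the signal, fit the OLS line, and keep the pair with the lowest residual sum of squares. The sketch below assumes NumPy, omits the per-sensor fixed effects for brevity, and uses synthetic data generated with known coefficients.

```python
import numpy as np

def corrected(mon2, temp_c, ntu, rho, k):
    """Temperature- and turbidity-corrected signal."""
    return mon2 * np.exp(-rho * (temp_c - 20.0)) * np.exp(-k * ntu)

def line_sse(x, y):
    """Residual sum of squares of a straight-line OLS fit of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def grid_fit(mon2, temp_c, ntu, y, rho_grid, k_grid):
    """Return the (rho, k) pair on the grid that minimizes the OLS residual."""
    candidates = [
        (line_sse(corrected(mon2, temp_c, ntu, r, k), y), r, k)
        for r in rho_grid
        for k in k_grid
    ]
    _, rho, k = min(candidates)
    return rho, k

# Synthetic data: distort a clean signal with known rho and k, then recover them.
true_rho, true_k = -0.20, 0.01
temp_c = np.tile([5.0, 12.0, 20.0, 26.0], 6)
ntu = np.repeat([0.0, 10.0, 20.0, 40.0, 60.0, 80.0], 4)
clean = np.linspace(100.0, 2000.0, 24)
mon2 = clean / (np.exp(-true_rho * (temp_c - 20.0)) * np.exp(-true_k * ntu))
y = 0.5 + 0.001 * clean

rho_hat, k_hat = grid_fit(
    mon2, temp_c, ntu, y,
    rho_grid=[-0.30, -0.25, -0.20, -0.15, -0.10],
    k_grid=[0.0, 0.005, 0.01, 0.015],
)
print(rho_hat, k_hat)  # -0.2 0.01
```

For pooled data, `temp_c` would hold floor_temp for lab rows and sipm_temperature for field rows, as stated above.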
The lab subset dominates the pooled R² because it has 12.5× as many rows as the post-burnin field subset (300 vs 24). The within-field fit is much weaker (R² = 0.28): the field sensor FE intercepts range from −1.00 to +1.91 log₁₀(MPN) (50046 +0.01, 50048 +1.91, 50059 −1.00, 50066 +1.74), meaning the corrected mon2 still leaves substantial per-sensor offset on field hardware. Most likely culprits: (1) field uses SiPM die temperature as a proxy for water temperature whereas lab uses floor_temp directly, and (2) field sensors have additional drift the lab sensors don't (e.g. cumulative LED ageing, biofouling, optical-window scaling) that the empirical ρ, k cannot absorb.
Same pooled dataset (n = 324 = 300 lab + 24 post-burnin field) driving a logistic regression for the ≥126 CFU/100 mL threshold. 18 positives total (9 lab from the Jan 8 contamination event + 9 field post-burnin). Features: mon2_corrected (z-scored per LOO fold) using the pooled-fit ρ = −0.2015/°C and k = +0.00936/NTU, plus barcode dummies for all 6 non-reference sensors (reference: 50030). One free continuous coefficient + 6 fixed-effect intercepts. LOO-CV. Single-event caveat applies to the lab positives — all 9 lab ≥126 readings are from Jan 8; field positives are from independent grabs across multiple sensors.
How does the Lume compare against the two EPA-approved laboratory methods — Colilert (IDEXX) and membrane filtration (MF) — and against itself when retrained on a different reference? Each column analyzes paired samples across three frameworks: log-log regression (top), Bland-Altman agreement (middle), and categorical classification (bottom).
The dedicated method comparison study pairs Colilert (n = 2 replicates) with membrane filtration (n = 3 replicates) across 161 datetimes; 8 zero-valued pairs are excluded from the log-scale analysis, yielding 153 observations. The two EPA-approved methods show R² = 0.572 with a +0.35 log10 bias: MF systematically reads ~2.2× higher than Colilert. 95% limits of agreement span [−0.64, +1.34], meaning MF can read up to ~22× higher or ~4.4× lower than the paired Colilert result. Categorical accuracy is 0.66 (Cohen’s κ = 0.40), i.e. “fair” agreement. This inter-method disagreement sets the ceiling for what any sensor can be expected to achieve against either reference.
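Bland-Altman statistics on the log scale convert to fold differences by exponentiation. The sketch below is a minimal version using only the standard library. It assumes differences are taken as log10(MF) − log10(Colilert), which matches the sign of the bias quoted above, and the function names are illustrative.

```python
import math
import statistics

def bland_altman_log10(method_a, method_b):
    """Bias and 95% limits of agreement of log10(b) - log10(a).

    Pairs with a zero in either method are skipped, as in the text.
    """
    diffs = [
        math.log10(b) - math.log10(a)
        for a, b in zip(method_a, method_b)
        if a > 0 and b > 0
    ]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def fold(log10_diff):
    """Fold difference corresponding to a log10 difference."""
    return 10 ** abs(log10_diff)

# Converting the figures quoted above.
print(round(fold(0.35), 1))   # 2.2  (bias)
print(round(fold(1.34), 1))   # 21.9 (upper limit of agreement)
print(round(fold(-0.64), 1))  # 4.4  (lower limit of agreement)
```

The limits are symmetric about the bias on the log scale, so they are not symmetric as fold differences whenever the bias is non-zero.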
The Colilert-trained Lume regression is evaluated against Colilert across all bench (n = 176) and field (n = 33) observations. The sensor achieves R² = 0.881, a bias of 0.00 log10, and tight 95% limits of agreement [−0.42, +0.42]: about 95% of Lume predictions fall within ~2.6× of the reference. Categorical accuracy is 0.89 with κ = 0.88, which is “almost perfect” agreement. Against its training reference, the Lume performs as well as or better than the two EPA methods perform against each other.
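Cohen's κ rescales raw accuracy by the agreement expected from the category frequencies alone. The verbal labels used in this comparison ("fair", "almost perfect") match the Landis and Koch bands. The sketch below uses only the standard library; the two-category "low"/"high" labels are a hypothetical example, not the dashboard's actual categories.

```python
from collections import Counter

def cohens_kappa(reference, predicted):
    """Cohen's kappa for two equally long sequences of category labels."""
    n = len(reference)
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    return (observed - expected) / (1.0 - expected)

def landis_koch(kappa):
    """Verbal label for kappa using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"

# Hypothetical example: 8 of 10 labels agree.
reference = ["low"] * 6 + ["high"] * 4
predicted = ["low"] * 5 + ["high"] + ["high"] * 3 + ["low"]
kappa = cohens_kappa(reference, predicted)
print(round(kappa, 2), landis_koch(kappa))  # 0.58 moderate
print(landis_koch(0.40), "|", landis_koch(0.88))  # fair | almost perfect
```

In this example the raw accuracy is 0.80 but κ is 0.58, because 0.52 agreement would be expected from the category frequencies alone.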
The same Colilert-trained Lume model is now evaluated against membrane filtration, a reference method it was never trained on. Performance drops to R² = 0.514 with LoA [−0.80, +0.83] and categorical accuracy 0.84 (κ = 0.65). Critically, the ~0.37 drop in R² from column 2 to column 3 is of the same order as the inter-method disagreement between Colilert and MF themselves (column 1: R² = 0.572, i.e. ~0.43 of the variance unexplained). Most of the apparent loss is attributable to reference-method disagreement, not sensor limitations.
To isolate the effect of reference-method choice, the Lume regression is refit using MF as the training target, over the full bucket dataset. Performance jumps back to R² = 0.872, essentially matching the Colilert-trained model against Colilert. Bias is 0.00 with LoA [−0.93, +0.93]; this is wider than the ±0.42 of the Colilert-trained model, which reflects the higher within-method variability of MF replicates (57.9% RPD vs. 43.5% for Colilert), not a sensor deficiency. Categorical accuracy is 0.81 (κ = 0.66).
Sensor-to-reference agreement is bounded by reference-method reproducibility, not by Lume hardware. Whichever culture method is adopted as truth, the Lume fits it at R² ≈ 0.87–0.88. The gap between columns 2 and 3 (a ~0.37 drop in R²) is comparable to the disagreement between the two lab methods themselves (column 1, R² = 0.572). The Lume is method-agnostic: its ceiling is set by the reference it is trained against. It already achieves quantitative performance at or above the inter-method agreement ceiling between the two accepted laboratory techniques, while providing continuous temporal coverage that grab-sample laboratory methods cannot.