Lume 1.2 – Method Calibration

Paired calibration of the Lume 1.2 sensor against two EPA-approved reference methods — Colilert (IDEXX defined-substrate, MPN) and membrane filtration (MF, CFU) — and the Aquagenx Compartment Bag Test (CBT, MPN) used in international monitoring, for E. coli and total coliform quantification.

Field Calibration Data

Live data from the mWater Lume 1.2 – 2026 Validation Data datagrid. Each water sample collection event is paired with a reference enumeration; the Method column distinguishes which reference was used — Colilert (IDEXX defined-substrate MPN), membrane filtration (MF, CFU), or compartment bag tests (CBT, MPN). The Use in Calibration column flags rows that are unusable because no /diagnostics record (water temperature, required by the CFU regression) was streaming within ±20 min of the sample. All date/time columns are displayed in UTC. Times are corrected from the mWater-stored value using the Timezone Entered column: mWater records times using the data-entry device’s local clock (Boulder, MDT = UTC−6); for samples collected in a different timezone the stored time is adjusted accordingly.
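
The UTC correction and the ±20 min diagnostics match described above can be sketched as follows. This is a simplified illustration, not the dashboard's actual code: the function names, the numeric-offset representation of the Timezone Entered column, and the example timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Window taken from the text: a /diagnostics record must lie within ±20 min.
MATCH_WINDOW = timedelta(minutes=20)

def to_utc(stored_local: datetime, utc_offset_hours: float) -> datetime:
    """Interpret a naive stored timestamp as local time at the given UTC
    offset (e.g. -6 for MDT) and convert it to UTC."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return stored_local.replace(tzinfo=tz).astimezone(timezone.utc)

def use_in_calibration(sample_utc: datetime, diagnostics_utc: list) -> bool:
    """True when at least one diagnostics record is within the match window."""
    return any(abs(d - sample_utc) <= MATCH_WINDOW for d in diagnostics_utc)

# Hypothetical example: a sample entered at 09:30 on a device set to MDT.
sample = to_utc(datetime(2026, 5, 12, 9, 30), -6)
diag = [
    datetime(2026, 5, 12, 15, 45, tzinfo=timezone.utc),  # 15 min after sample
    datetime(2026, 5, 12, 16, 10, tzinfo=timezone.utc),  # 40 min after sample
]
print(sample.isoformat(), use_in_calibration(sample, diag))
```

Working in timezone-aware UTC throughout avoids comparing a local sample time against UTC diagnostics timestamps by accident.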

Boulder Creek E. coli Distribution — All Colilert Grabs

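All unique Colilert grab samples collected at Boulder Creek sites (n = 39, deduplicated). Values are Colilert MPN/100 mL, reported here as CFU/100 mL for comparison with the EPA criterion. EPA recreational criterion, applied here as a single-sample threshold: 126 CFU/100 mL (EPA defines this value as a geometric mean).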

Site | n | Min | Median | Max | ≥126 CFU | All values (CFU/100 mL)
BC-CU | 12 | 12 | 47 | 1986 | 2 (17%) | 12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986
BC-55 | 13 | 6 | 53 | 866 | 5 (38%) | 6, 20, 23, 24, 26, 28, 53, 75, 131, 145, 166, 378, 866
BC-30 | 3 | 36 | 105 | 517 | 1 (33%) | 36, 105, 517
BC-Can | 8 | 2 | 11.5 | 30 | 0 (0%) | 2, 3, 5, 7, 16, 17, 27, 30
BC-Eben | 3 | 6 | 28 | 30 | 0 (0%) | 6, 28, 30
All BC | 39 | 2 | 28 | 1986 | 8 (21%) | median = 28 • mean = 135 • ≥126: 8 of 39
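
As a minimal sketch of how the per-site summary columns can be derived from the listed values, the snippet below recomputes the BC-CU row. The function name and dictionary keys are illustrative only.

```python
from statistics import median

THRESHOLD = 126  # CFU/100 mL, the threshold used throughout this page

# BC-CU values as listed in the table above.
bc_cu = [12, 15, 15, 21, 26, 44, 50, 53, 60, 75, 152, 1986]

def summarize(values, threshold=THRESHOLD):
    """Return n, min, median, max and the count/percent at or above threshold."""
    exceed = sum(v >= threshold for v in values)
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),  # mean of the two middle values when n is even
        "max": max(values),
        "exceed": exceed,
        "exceed_pct": round(100 * exceed / len(values)),
    }

summary = summarize(bc_cu)
print(summary)
```

For BC-CU this gives a median of 47 and 2 of 12 samples (17%) at or above the threshold, matching the table row.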

Observed vs. Predicted E. coli — Colilert

Sensor signal is run through a physics-motivated correction pipeline derived from Bedell et al. 2022 (temperature) and Skinner et al. 2024 (turbidity), with ρ and k fit empirically from this field dataset:
   mon2_corrected = mon2_val · exp(−ρ · (sipm_temperature − 20)) · exp(−k · NTU),  NTU = max(0, −145.89 + 2.0488 · signal_per_spad_kcps)
Single-predictor OLS with per-sensor fixed-effect intercept: log10(colilert) ~ barcode + mon2_corrected. Fitted on the full field dataset, ρ = −0.111/°C (vs. Bedell literature −0.03) and k = +0.0004/NTU (vs. Skinner literature −0.004). The field Lume therefore shows a steeper temperature dependence than Bedell measured in lab tryptophan standards, and the turbidity coefficient effectively vanishes in this drinking-water deployment. Source data: field_matched_512.csv
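
The correction formula above translates directly into code. The sketch below uses the field-fit ρ and k quoted in the text as defaults. The function names are illustrative, and the input values in the example are made up.

```python
import math

def ntu_from_signal(signal_per_spad_kcps: float) -> float:
    """Turbidity proxy from the text: NTU = max(0, -145.89 + 2.0488 * signal)."""
    return max(0.0, -145.89 + 2.0488 * signal_per_spad_kcps)

def mon2_corrected(mon2_val: float, temperature_c: float,
                   signal_per_spad_kcps: float,
                   rho: float = -0.111, k: float = 0.0004) -> float:
    """mon2_val * exp(-rho * (T - 20)) * exp(-k * NTU).

    Defaults are the field-fit values quoted in the text. With a negative rho,
    readings taken above 20 °C are scaled up and readings below 20 °C are
    scaled down.
    """
    ntu = ntu_from_signal(signal_per_spad_kcps)
    return mon2_val * math.exp(-rho * (temperature_c - 20.0)) * math.exp(-k * ntu)

# Hypothetical reading: raw signal 1000 at 25 °C with a scatter signal of 100 kcps.
print(mon2_corrected(1000.0, 25.0, 100.0))
```

At exactly 20 °C with zero estimated turbidity the correction is the identity, which is a convenient sanity check when refitting ρ and k.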

All matched data — n=37, R²=0.60, RMSE=0.46 log₁₀(MPN)
Post burn-in only — n=24, R²=0.68, RMSE=0.45 log₁₀(MPN)
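
A model of the form log10(colilert) ~ barcode + mon2_corrected has one shared slope and one intercept per sensor. One way to fit it without a statistics library is the within (demeaning) transformation, which gives the same slope as dummy-variable OLS. The sketch below is an illustration on synthetic numbers, not the fitting code behind the figures above. It returns absolute per-sensor intercepts rather than offsets from a reference barcode.

```python
from collections import defaultdict

def fit_fixed_effects(sensors, x, y):
    """OLS of y ~ sensor + x: shared slope, one intercept per sensor."""
    groups = defaultdict(list)
    for s, xi, yi in zip(sensors, x, y):
        groups[s].append((xi, yi))
    # Per-sensor means of x and y.
    means = {s: (sum(p[0] for p in pts) / len(pts),
                 sum(p[1] for p in pts) / len(pts))
             for s, pts in groups.items()}
    # Slope from sensor-demeaned data.
    sxy = sum((xi - means[s][0]) * (yi - means[s][1])
              for s, xi, yi in zip(sensors, x, y))
    sxx = sum((xi - means[s][0]) ** 2 for s, xi in zip(sensors, x))
    slope = sxy / sxx
    intercepts = {s: my - slope * mx for s, (mx, my) in means.items()}
    return slope, intercepts

def r2_rmse(sensors, x, y, slope, intercepts):
    """In-sample R² and RMSE of the fitted model."""
    pred = [intercepts[s] + slope * xi for s, xi in zip(sensors, x)]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot, (ss_res / len(y)) ** 0.5

# Synthetic, noise-free data: two hypothetical sensors with different offsets.
sensors = ["A", "A", "A", "B", "B", "B"]
x = [100.0, 200.0, 300.0, 150.0, 250.0, 350.0]
y = [0.5 + 0.002 * v for v in x[:3]] + [1.5 + 0.002 * v for v in x[3:]]
slope, intercepts = fit_fixed_effects(sensors, x, y)
r2, rmse = r2_rmse(sensors, x, y, slope, intercepts)
print(slope, intercepts, r2, rmse)
```

A sensor with a single observation contributes nothing to the slope under this scheme, because its demeaned x and y are both zero.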

Binary Detection: ≥126 CFU/100 mL

The same correction pipeline drives a logistic regression with a single continuous feature, mon2_corrected, plus per-sensor fixed effects, so there is one free continuous coefficient. The ROC curves show out-of-sample probabilities. Source data: field_matched_512.csv

Lab Calibration Data

Jan 2–6, 2026 calibration sessions (paper training range). Fluorescence signal (mon2_val) at the paper operating point: led_power = 1024, sipm_bias = 3040. CBT calibration data is in the field-calibration table above.

Lab: Predicted vs. Observed — Operating Point Comparison

Pooled OLS (Jan 2–6, n = 125): log₁₀(colilert) ~ barcode + signal × floor_temp × tof_mean. Barcode is a fixed-effect intercept shift (reference: 50030); slopes shared. Fit separately for each LED/bias operating point. In-sample R².

LED = 1024 · bias = 3040 — paper

LED = 512 · bias = 3000 — production

LED = 256 · bias = 3300 — original

Lab: Predicted vs. Observed — Corrected Single-Predictor Model

Same Bedell + Skinner correction pipeline used in the field section, refit on the lab data (production combo LED 512 / bias 3000, n = 300 across Jan 2–21 2026):
   mon2_corrected = mon2_val_512 · exp(−ρ · (floor_temp − 20)) · exp(−k · NTU),  NTU = max(0, −145.89 + 2.0488 · tof_mean)
Single-predictor OLS: log₁₀(colilert) ~ barcode + mon2_corrected. Reference barcode 50030. Lab uses floor_temp (water temperature) as the temperature input — the proper Bedell input. Lab-fit ρ = −0.2015/°C and k = +0.01015/NTU. Lab ρ is roughly 2× the field-fit ρ = −0.111: the lab covers a wider temperature range (3.4 → 26.5 °C vs ~15 °C in the field) and a real turbidity range, so the fit has more leverage to recover the underlying physics. Lab corrected R² = 0.902 beats the 3-way interaction's R² = 0.867 on a larger dataset with half the free parameters.

In-sample R² = 0.902 · LOO R² = 0.899 · RMSE = 0.41 log₁₀(MPN) · MAE = 0.27 · n = 300

Lab Binary Logistic — ≥126 CFU/100 mL

Logistic regression trained on the full lab dataset (Jan 2–21 2026, n = 300 rows with valid production-combo readings), binary label: Colilert ≥ 126 CFU/100 mL. Features: mon2_val_512, floor_temp, tof_mean (all z-scored; no sensor fixed effect so the model is sensor-agnostic). Performance estimated by leave-one-out cross-validation (LOO-CV): each fold re-standardizes from the training set of n = 299 before predicting the held-out row. Note: the 9 positive examples all come from one contamination event (Jan 8, three consecutive time points). LOO-CV performance for the positive class may be over-optimistic due to temporal correlation between the 9 rows.
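
The key detail in the LOO-CV procedure above is that each fold recomputes the z-score parameters from its own training rows, so the held-out row never influences the scaling. The sketch below illustrates that loop on toy data. The solver (plain gradient descent with a small L2 penalty), its settings, and the two-feature example are assumptions made to keep the example self-contained. The text does not say how the dashboard fits its model.

```python
import math
from statistics import mean, pstdev

def zscore_params(rows):
    """Per-column (mean, std) from training rows only; std 0 falls back to 1."""
    return [(mean(c), pstdev(c) or 1.0) for c in zip(*rows)]

def standardize(row, params):
    return [(v - m) / s for v, (m, s) in zip(row, params)]

def sigmoid(z):
    # Numerically stable form for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, row):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))

def fit_logistic(X, y, steps=500, lr=0.5, l2=1e-3):
    """Gradient descent on the mean log-loss; w[0] is the unpenalized bias."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [w[0] - lr * grad[0] / n] + [
            wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w[1:], grad[1:])
        ]
    return w

def loo_probabilities(X, y):
    """Out-of-sample probability for every row, re-standardizing per fold."""
    probs = []
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        params = zscore_params(train_X)
        w = fit_logistic([standardize(r, params) for r in train_X], train_y)
        probs.append(predict(w, standardize(X[i], params)))
    return probs

# Toy data: first feature separates the classes, second is noise.
X = [[1.0, 10.0], [2.0, 11.0], [3.0, 9.0], [8.0, 10.5], [9.0, 9.5], [10.0, 10.0]]
y = [0, 0, 0, 1, 1, 1]
probs = loo_probabilities(X, y)
print([round(p, 3) for p in probs])
```

Because LOO holds out single rows, it cannot account for correlation between rows from the same event, which is the caveat raised above for the Jan 8 positives. Holding out whole events at a time would be one way to probe that.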

Lab Binary Logistic — Corrected Single-Predictor + Sensor FE

Same correction pipeline as the corrected lab OLS above, driving a logistic regression for the same ≥126 CFU/100 mL threshold. Single continuous feature mon2_corrected (z-scored per LOO fold) plus barcode dummies for 50031 and 50032 (reference: 50030). One free continuous coefficient instead of three. Uses the lab-fit ρ = −0.2015/°C and k = +0.01015/NTU. Same single-event caveat applies — the 9 positive examples are all from the Jan 8 contamination event, so LOO sensitivity is over-optimistic.

Pooled Lab + Post-Burnin Field — Corrected Single-Predictor OLS

Continuous regression on the combined lab dataset (Jan 2–21 2026, n = 300, 3 sensors) and the post-burn-in field dataset (n = 24, 4 sensors — 50046 / 50048 / 50059 / 50066, with the singleton 50062 dropped and pre-burnin samples excluded per the field burn-in dates). Same Bedell + Skinner correction pipeline, with ρ and k fit jointly on the pooled data:
   mon2_corrected = mon2 · exp(−ρ·(T−20)) · exp(−k·NTU), with T = floor_temp for lab rows and T = sipm_temperature for field rows (no water probe in the field build).
Single-predictor OLS with per-sensor fixed effects: log₁₀(colilert) ~ barcode + mon2_corrected. Reference 50030. Pooled-fit ρ = −0.2015/°C and k = +0.00936/NTU (essentially the lab fit; field rows are too few to move the joint optimum).

Pooled in-sample R² = 0.891 · LOO R² = 0.878 · LOO RMSE = 0.46 log₁₀(MPN) · n = 324   |   per-source: lab R² = 0.901 (n=300), field R² = 0.279 (n=24)

The lab subset dominates the pooled R² because it has about 12 times as many rows as the post-burnin field subset (300 vs. 24). The within-field fit is much weaker (R² = 0.28): the field sensor FE intercepts range from −1.00 to +1.91 log₁₀(MPN) (50046 +0.01, 50048 +1.91, 50059 −1.00, 50066 +1.74), so the corrected mon2 still leaves a substantial per-sensor offset on field hardware. Most likely culprits: (1) the field build uses SiPM die temperature as a proxy for water temperature, whereas the lab uses floor_temp directly, and (2) field sensors have additional drift that lab sensors do not (e.g. cumulative LED ageing, biofouling, optical-window scaling) and that the empirical ρ and k cannot absorb.

Merged Lab + Post-Burnin Field Binary Logistic — ≥126 CFU/100 mL

Same pooled dataset (n = 324 = 300 lab + 24 post-burnin field) driving a logistic regression for the ≥126 CFU/100 mL threshold. 18 positives total (9 lab from the Jan 8 contamination event + 9 field post-burnin). Features: mon2_corrected (z-scored per LOO fold) using the pooled-fit ρ = −0.2015/°C and k = +0.00936/NTU, plus barcode dummies for all 6 non-reference sensors (reference: 50030). One free continuous coefficient + 6 fixed-effect intercepts. LOO-CV. Single-event caveat applies to the lab positives — all 9 lab ≥126 readings are from Jan 8; field positives are from independent grabs across multiple sensors.
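
The "one continuous coefficient + 6 fixed-effect intercepts" structure corresponds to a design row with a global intercept, the z-scored mon2_corrected value, and one 0/1 dummy per non-reference sensor. A minimal sketch of that encoding follows. The barcodes are the seven sensors named in the text, while the function name and column order are illustrative.

```python
REFERENCE = "50030"
# Three lab sensors plus the four post-burnin field sensors named in the text.
SENSORS = ["50030", "50031", "50032", "50046", "50048", "50059", "50066"]
DUMMY_SENSORS = [s for s in SENSORS if s != REFERENCE]

def design_row(barcode: str, mon2_corrected_z: float) -> list:
    """[intercept, mon2_corrected_z, dummy_50031, ..., dummy_50066].

    The reference sensor gets all-zero dummies, so each dummy coefficient is
    that sensor's intercept shift relative to 50030.
    """
    if barcode not in SENSORS:
        raise ValueError(f"unknown sensor barcode: {barcode}")
    dummies = [1.0 if barcode == s else 0.0 for s in DUMMY_SENSORS]
    return [1.0, mon2_corrected_z] + dummies

print(design_row("50030", 0.5))
print(design_row("50048", -1.2))
```

With this coding, a sensor that appears in only one or two rows has a poorly determined dummy coefficient, which is consistent with dropping the singleton 50062 before fitting.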

Four-Panel Method Comparison

How does the Lume compare against the two EPA-approved laboratory methods — Colilert (IDEXX) and membrane filtration (MF) — and against itself when retrained on a different reference? Each column analyzes paired samples across three frameworks: log-log regression (top), Bland-Altman agreement (middle), and categorical classification (bottom).

Four-panel comparison: MF vs Colilert, Lume vs Colilert, Lume vs MF (Colilert-trained), Lume vs MF (MF-trained)

Column 1 · MF vs. Colilert (n = 153)

The dedicated method comparison study pairs Colilert (n = 2 replicates) with membrane filtration (n = 3 replicates) across 161 datetimes; 8 zero-valued pairs are excluded from the log-scale analysis, yielding 153 observations. The two EPA-approved methods show R² = 0.572 with a +0.35 log10 bias: MF systematically reads ~2.2× higher than Colilert. The 95% limits of agreement span [−0.64, +1.34], meaning MF can read up to ~22× higher or ~4.4× lower than Colilert on a paired sample. Categorical accuracy is 0.66 (Cohen’s κ = 0.40), i.e. “fair” agreement. This inter-method disagreement sets the ceiling for what any sensor can be expected to achieve against either reference.
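
For reference, the Bland-Altman quantities quoted in this section (bias and 95% limits of agreement on the log10 scale) and their conversion to fold-differences can be computed as sketched below. The use of the sample standard deviation and the 1.96 multiplier are the conventional choices and are assumed here. The function names are illustrative.

```python
import math
from statistics import mean, stdev

def bland_altman_log10(method_a, method_b):
    """Bias and 95% limits of agreement of log10(a) - log10(b).

    Pairs where either value is zero or negative are skipped, since the
    log is undefined there (the text excludes zero-valued pairs this way).
    """
    diffs = [math.log10(a) - math.log10(b)
             for a, b in zip(method_a, method_b) if a > 0 and b > 0]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width

def fold(log10_diff: float) -> float:
    """Convert a log10 difference into a multiplicative fold-difference."""
    return 10 ** abs(log10_diff)

# Converting the column 1 bias and LoA bounds to fold-differences:
print(round(fold(0.35), 1), round(fold(1.34), 1), round(fold(-0.64), 1))
# → 2.2 21.9 4.4
```

Because the bias is nonzero, the limits are asymmetric about zero, so the upper and lower bounds correspond to different fold-differences.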

Column 2 · Lume vs. Colilert (n = 209, Colilert-trained)

The Colilert-trained Lume regression is evaluated against Colilert across all bench (n = 176) and field (n = 33) observations. The sensor achieves R² = 0.881, a bias of 0.00 log10, and tight limits of agreement [−0.42, +0.42] — Lume predictions stay within ~2.6× of the reference. Categorical accuracy is 0.89 with κ = 0.88, which is “almost perfect” agreement. Against its training reference, the Lume performs as well as or better than the two EPA methods perform against each other.
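
Categorical accuracy and Cohen's κ, as reported for each column, can be computed from paired category labels as in the sketch below. The category labels in the example are hypothetical, since the page does not list the bins used.

```python
from collections import Counter

def accuracy_and_kappa(ref_labels, pred_labels):
    """Observed agreement and Cohen's kappa for two label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_e is the agreement expected by
    chance from the two marginal label distributions. Undefined (division by
    zero) when both sequences consist of a single identical label.
    """
    n = len(ref_labels)
    observed = sum(r == p for r, p in zip(ref_labels, pred_labels)) / n
    ref_counts = Counter(ref_labels)
    pred_counts = Counter(pred_labels)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical two-category example with balanced marginals.
ref = ["low", "low", "low", "low", "high", "high", "high", "high"]
pred = ["low", "low", "low", "high", "high", "high", "high", "low"]
acc, kappa = accuracy_and_kappa(ref, pred)
print(acc, kappa)
```

κ discounts the agreement that would occur by chance, so it is always at or below the raw accuracy and falls further below it when one category dominates.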

Column 3 · Lume vs. MF (n = 173, Colilert-trained)

The same Colilert-trained Lume model is now evaluated against membrane filtration — a reference method it was never trained on. Performance drops to R² = 0.514 with LoA [−0.80, +0.83] and categorical accuracy 0.84 (κ = 0.65). Critically, the ~0.37 drop in R² from column 2 to column 3 is of the same order as the inter-method disagreement between Colilert and MF themselves (column 1, R² = 0.572). Most of the apparent loss is attributable to reference-method disagreement, not sensor limitations.

Column 4 · Lume vs. MF (n = 303, MF-trained)

To isolate the effect of reference-method choice, the Lume regression is refit using MF as the training target, over the full bucket dataset. Performance jumps back to R² = 0.872, essentially matching the Colilert-trained model against Colilert. Bias is 0.00 with LoA [−0.93, +0.93]. This LoA is wider than in column 2 (±0.42) and slightly wider than in column 3; it reflects the higher within-method variability of MF replicates (57.9% RPD vs. 43.5% for Colilert), not a sensor deficiency. Categorical accuracy is 0.81 (κ = 0.66).

Headline Finding

Sensor-to-reference agreement is bounded by reference-method reproducibility, not by Lume hardware. Whichever culture method is adopted as truth, the Lume fits it at R² ≈ 0.87–0.88. The gap between columns 2 and 3 (a drop of ~0.37 in R²) is of the same order as the disagreement between the two lab methods themselves (column 1, R² = 0.572). The Lume is method-agnostic: its ceiling is set by the reference it is trained against, and it already achieves quantitative performance at or above the level of agreement between the two accepted laboratory techniques, while providing continuous temporal coverage that grab-sample laboratory methods cannot.