Lume – CBT Calibration

Study context — programs & dMRV

This page documents the paired field validation of the Lume 1.2 fluorimeter against the Aquagenx Compartment Bag Test (CBT), conducted between May and June 2026 across two water treatment programs:

Amazi Meza — Rwanda. School-based drinking water filtration program operated by Virridy Rwanda Ltd with the Rwanda Environmental Management Authority and district governments. Currently serves ~600,000 students across 500+ schools, scaling to 1.5M students and 1,500 schools by 2028. Treatment is LifeStraw Community gravity filters at the point of use. Carbon credits are issued under Gold Standard GS12240 (33,911 tCO₂e issued to date; ~200,000 tCO₂e projected by 2028) and the revenue funds ongoing operations.

DRIP FUNDI — Kenya. USAID-funded Drought Resilience Impact Platform that secures water for ~120,000 people across 200 boreholes in five northern counties (Marsabit, Garissa, Isiolo, Turkana, Wajir). IoT sensors, satellite remote sensing, and machine learning predict groundwater demand and prioritize repairs; uptime has risen from 56% to 91% and median repair time has fallen from 214 to 26 days. As with Amazi Meza, carbon credit revenue underwrites continued maintenance.

Why we sample before AND after treatment. Carbon credit issuance under the Gold Standard methodology Emission Reductions from Safe Drinking Water Supply (v1.0) depends on demonstrating both halves of the story: (1) the baseline water source carries fecal contamination that would have caused real disease burden, and (2) the treated water actually delivered to households or students is safe. In practice we measure the baseline at each school or water pump once per crediting period to establish whether treatment is warranted, then take an annual sample of the filtered / treated water during the crediting period. A site earns credits only where (a) the baseline showed treatment was needed and (b) the annual treated-water samples come back free of contamination — credits are issued proportional to the share of those treated-water samples that pass.
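
As a rough illustration of the crediting rule described above, the sketch below gates sites on the baseline result and then takes the passing share of treated-water samples. The record layout (`baseline_contaminated`, `treated_pass`) and the function name are hypothetical, and the actual Gold Standard calculation may weight sites or samples differently; this only shows the gate-then-proportion logic.

```python
from dataclasses import dataclass

@dataclass
class SiteRecord:
    site_id: str
    baseline_contaminated: bool  # baseline sample showed treatment was needed
    treated_pass: bool           # annual treated-water sample free of contamination

def creditable_share(records):
    """Share of eligible treated-water samples that pass.

    Sites whose baseline did not show contamination are excluded entirely.
    Returns 0.0 when no site is eligible.
    """
    eligible = [r for r in records if r.baseline_contaminated]
    if not eligible:
        return 0.0
    passed = sum(1 for r in eligible if r.treated_pass)
    return passed / len(eligible)

# Made-up records, for illustration only.
sites = [
    SiteRecord("school-A", True, True),
    SiteRecord("school-B", True, False),
    SiteRecord("school-C", False, True),  # clean baseline: not eligible
    SiteRecord("school-D", True, True),
]
share = creditable_share(sites)
print(f"creditable share: {share:.2f}")
```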

Gold Standard dMRV Pilot 14 — approved September 2025. The Lume sensor is the technology piece in Gold Standard dMRV Pilot 14, which authorizes Lume estimates of E. coli to substitute for the laboratory grab-sample requirement under SDWS Parameter 18. This page exists in part to close FAR 1 (sensor validation & calibration) and FAR 3 (integration of manual and digital sampling) from the pilot — every CBT pair below is direct evidence for those FARs. Continuous Lume readings — cross-checked here against CBT grab samples — let the program report SDWS Parameter 18 (Microbial Drinking Water Quality) at >200× the daily sample density the methodology was originally written for, and at a fraction of the cost of running a lab assay on every sample.

Data & Code Availability

All paired sensor–CBT observations, model coefficients, and figure-generation code are available for download and inspection. These files support full reproducibility of the results presented on this page and in the accompanying manuscript.

Paired Observations (CSV)

216 matched sensor–CBT records with raw and processed features, in-sample and LOOCV predictions, and WHO risk classifications. One row per observation.

Model Specification (JSON)

Tobit regression coefficients (mon2, temperature, ToF, per-sensor intercepts and slopes), z-score normalization means/SDs, per-sensor baselines, and agreement-band definition.

GitHub Repository

Public repository with the paired dataset, model specification, and Python script for generating publication figures (scatter, confusion matrices, feature relationships).

Sensor GPS Track — Since 2026-05-25 00:00 UTC

Every GPS fix reported since 2026-05-25 00:00 UTC by sensors 50045 (Amazi Meza, Rwanda), 50053 (DRIP, Kenya), and 50065. Cell-tower fixes are excluded; positions are deduplicated to ~1 m and connected in time order. Marker size scales with the number of fixes at that spot.
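
One plausible way to deduplicate positions to about 1 m is to round coordinates to five decimal places (roughly 1.1 m of latitude) and count fixes per rounded cell. The sketch below assumes fixes arrive as `(timestamp, lat, lon)` tuples; the function name, the rounding approach, and the sample coordinates are illustrative assumptions, not the page's actual implementation.

```python
from collections import Counter

def dedupe_fixes(fixes, decimals=5):
    """Collapse fixes onto a ~1 m grid and count fixes per grid cell.

    fixes: iterable of (timestamp, lat, lon) tuples.
    Returns (track, counts): track is the time-ordered list of grid cells
    with consecutive repeats dropped; counts maps each cell to its number
    of fixes, which could drive marker size.
    """
    counts = Counter()
    track = []
    for _, lat, lon in sorted(fixes):  # tuples sort by timestamp first
        cell = (round(lat, decimals), round(lon, decimals))
        counts[cell] += 1
        if not track or track[-1] != cell:
            track.append(cell)
    return track, counts

# Made-up coordinates, for illustration only.
fixes = [
    ("2026-05-25T00:10:00Z", -1.944071, 30.061902),
    ("2026-05-25T00:00:00Z", -1.944069, 30.061898),
    ("2026-05-25T00:05:00Z", -1.944072, 30.061901),
    ("2026-05-25T00:15:00Z", -1.950000, 30.070000),
]
track, counts = dedupe_fixes(fixes)
print(len(track), counts[track[0]])
```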

Live Sensor Time Series — Amazi Meza (Rwanda) & DRIP (Kenya)

CBT grab-sample E. coli results (diamonds) over raw TLF (mon2), raw ToF signal-per-SPAD (kcps), and water temperature from 50045 (Amazi Meza, Rwanda), 50053 (DRIP, Kenya), and 50065 starting 2026-05-25 00:00 UTC. SiPM rows are filtered to the calibrated combo (LED=512, bias~3000).

Equivalence testing — Lume vs. CBT E. coli

The Lume sensor estimates are statistically equivalent to CBT reference measurements under a two one-sided test (TOST) with a ±0.65 log₁₀ equivalence margin, the half-width of the published 95% confidence interval of a single CBT bag. The underlying estimation model is a right-censored linear regression (Tobit) on log₁₀(E. coli + 1) with eight parameters: baseline-subtracted raw fluorescence (mon2), water temperature, and the turbidity proxy (tof) as shared terms, plus a per-sensor 2-point calibration (a per-sensor intercept offset and a per-sensor mon2 slope). Temperature is an explicit model predictor, not a pre-correction. Each CBT grab is paired to the lowest mon2 reading in a ±10 min in-water window, because removing the sensor for sampling spikes mon2 upward. No statistical (IQR/Cook’s D) screening is applied; the only exclusions are four physically documented data-quality cases (one cross-sensor fault, two turbidity-compromised readings, and one first-day baseline-transition reading). Agreement is assessed by leave-one-observation-out cross-validation (LOOCV): 88% of held-out predictions fall within the combined ±0.92 log₁₀ measurement uncertainty. Classification performance, the primary evidence for dMRV, is reported in the pooled classifier section below.
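
The pairing rule (lowest mon2 within ±10 min of the grab) can be sketched as follows. The tuple layout and the `in_water` flag are assumptions made for illustration; the page does not say how in-water status is recorded.

```python
from datetime import datetime, timedelta

def pair_grab_to_reading(grab_time, readings, window_min=10):
    """Return the in-water reading with the lowest mon2 near grab_time.

    readings: iterable of (timestamp, mon2, in_water) tuples.
    Only readings within +/- window_min minutes are considered.
    Returns None when no in-water reading falls inside the window.
    """
    window = timedelta(minutes=window_min)
    candidates = [
        r for r in readings
        if r[2] and abs(r[0] - grab_time) <= window
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r[1])

# Made-up readings at a 5-minute cadence, for illustration only.
t0 = datetime(2026, 6, 1, 9, 0)
readings = [
    (t0 - timedelta(minutes=15), 100.0, True),   # outside the window
    (t0 - timedelta(minutes=10), 140.0, True),
    (t0 - timedelta(minutes=5), 150.0, True),
    (t0, 900.0, False),                          # sensor lifted out for sampling
    (t0 + timedelta(minutes=5), 152.0, True),
]
paired = pair_grab_to_reading(t0, readings)
print(paired[1])
```

Taking the minimum rather than the nearest reading is what keeps the removal spike out of the paired value.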

About the CBT method. The Aquagenx Compartment Bag Test is a portable, field-deployable MPN assay approved for use in lower-resource settings. Like all enzyme-substrate E. coli assays, it requires an incubation period (typically 24–48 h at ambient or body-temperature) before the compartments can be read — but it does not need a laboratory-grade thermostatic incubator and can be run at room temperature in the field, which is why it has become the standard reference for international water safety monitoring programs.

Published CBT precision benchmarks

The Compartment Bag Test is itself a noisy reference — these are the ceilings any sensor-vs-CBT comparison can approach, and they're the basis for the ±CBT error bar drawn on each point in the scatter below.

Intrinsic single-test 95% CI

~1.2–1.4 log₁₀

Width per bin in the Aquagenx 5-compartment MPN table (Gronewold 2017). A "5 CFU" reading is consistent with 1.5–30 at 95% confidence.

CBT vs membrane filtration

r = 0.904

n = 270 metro-Atlanta samples; sensitivity 94.9%, specificity 96.6% (Stauber et al. 2014).

Field CBT vs lab CBT

ρ = 0.88

Inter-operator repeatability across Peruvian field staff vs lab analysts (Heitzinger et al. 2017). No significant difference, P = 0.50.

Threshold agreement vs Colilert

~92–93%

Presence/absence match at ≥ 1 and ≥ 10 CFU/100 mL across 5 natural waters, 60 paired samples (KWR/JMP 2022).

What this means for the calibration below. The intrinsic CBT noise floor (~1.2–1.4 log₁₀ per single bag) is the irreducible lower bound on how tight any sensor regression can land. The horizontal error bars on each scatter point are the ±0.65 log₁₀ half-width of that 95% CI — predictions that fall inside an error bar are statistically consistent with the lab call.
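
The link between the quoted interval and the ±0.65 log₁₀ bar is simple arithmetic. The check below uses the 1.5–30 bounds quoted above for a "5 CFU" reading; treating the interval as symmetric on the log scale is a simplifying assumption, since the bounds are not exactly centred on log₁₀(5).

```python
import math

low, high = 1.5, 30.0                # 95% bounds quoted for a "5 CFU" reading
width = math.log10(high / low)       # full width of the interval in log10 units
half_width = width / 2               # the +/- bar drawn on each scatter point

print(f"width = {width:.2f} log10, half-width = {half_width:.2f} log10")
```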

How the error bars and the agreement test work. Both bars on each point are the same ±0.65 log₁₀, the half-width of the published per-bag CBT 95% CI (Gronewold 2017). The horizontal bar is the lab (CBT) uncertainty; the vertical bar applies that same interval to the Lume prediction, on the deliberately conservative assumption that a Lume reading is no more precise than a single CBT bag. (Our measured continuous residual, σ̂ ≈ 0.61 log₁₀, is in fact nearly twice CBT's σ ≈ 0.33 [0.65 / 1.96], so treating the Lume as CBT-equivalent is generous to the lab side, not to us.)

Agreement is NOT judged by whether the bars visually overlap. Two 95% CIs can overlap and still be significantly different, so overlap overstates agreement. The correct test compares the difference of the two measurements, combining their standard errors in quadrature: a point agrees with the lab when |residual| ≤ 1.96·√(σ_CBT² + σ_Lume²) ≈ 0.92 log₁₀, which is tighter than the ±1.30 you'd get by summing the two ±0.65 bars. That difference test drives the green/red rings and the headline "agree with lab" percentage.

Equivalence (TOST) answers a different question: is there a systematic bias between Lume and CBT? The banner reports a two one-sided test of the mean Lume−CBT difference against a ±0.65 log₁₀ equivalence margin at 90% confidence. Passing means the methods are statistically interchangeable on average, not merely "not proven different."
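
A minimal TOST sketch on paired differences is shown below. It uses a normal (z) approximation from the standard library; whether the page's banner uses a t or z reference distribution is not stated, so this is an assumption, and the sample differences are made up.

```python
import math
from statistics import NormalDist, mean, stdev

def tost_equivalence(diffs, margin=0.65, alpha=0.05):
    """Two one-sided tests on the mean of paired differences (z approximation).

    Returns (mean_diff, p_value, equivalent). The p-value is the larger of
    the two one-sided p-values; alpha=0.05 corresponds to a 90% interval.
    """
    n = len(diffs)
    m = mean(diffs)
    se = stdev(diffs) / math.sqrt(n)
    z_lower = (m + margin) / se              # H0: mean <= -margin
    z_upper = (m - margin) / se              # H0: mean >= +margin
    p_lower = 1 - NormalDist().cdf(z_lower)
    p_upper = NormalDist().cdf(z_upper)
    p = max(p_lower, p_upper)
    return m, p, p < alpha

# Made-up Lume minus CBT differences in log10 units, for illustration only.
diffs = [0.3, -0.4, 0.1, -0.2, 0.5, -0.1, 0.0, 0.2, -0.3, 0.1]
m, p, equivalent = tost_equivalence(diffs)
print(f"mean diff = {m:.2f}, p = {p:.3g}, equivalent = {equivalent}")
```

Both one-sided nulls must be rejected, which is why the larger of the two p-values is the one reported.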

Net: under the assumption σ_Lume = σ_CBT used for the agreement band, the Lume is treated like a second independent reference. Neither a sensor nor a second CBT bag can be expected to agree with a single CBT bag more often than the ≈92–93% threshold-agreement ceiling reported for CBT against Colilert (KWR/JMP 2022).
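
The ≈0.92 log₁₀ agreement band quoted above can be reproduced directly from the ±0.65 half-width. The helper names below are hypothetical; the arithmetic follows the quadrature formula as stated, with both σ values set to 0.65 / 1.96.

```python
import math

SIGMA_CBT = 0.65 / 1.96  # per-bag sigma implied by the +/-0.65 log10 95% half-width

def agreement_band(sigma_cbt=SIGMA_CBT, sigma_lume=SIGMA_CBT, z=1.96):
    """Half-width of the 95% band on the difference of two measurements."""
    return z * math.sqrt(sigma_cbt ** 2 + sigma_lume ** 2)

def agrees(pred_log10, cbt_log10, band=None):
    """True when the residual lies inside the combined band."""
    band = agreement_band() if band is None else band
    return abs(pred_log10 - cbt_log10) <= band

band = agreement_band()
print(f"band = {band:.2f} log10")          # equals 0.65 * sqrt(2)
print(agrees(1.0, 1.9), agrees(1.0, 2.0))  # inside the band, then outside
```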

Sources: Stauber et al. 2014, J Microbiol Methods · Heitzinger et al. 2017, Peruvian DHS · Gronewold et al. 2017, Sci Total Environ · KWR/JMP 2022 lab evaluation.

Chlorination detection

Can the Lume detect chlorination efficacy? At sites in Kenya where free chlorine residual (Cl₂) was measured alongside CBT and Lume readings, we identify matched water systems that have both pre-chlorination (source, Cl₂ = 0) and post-chlorination (treated, Cl₂ > 0) samples collected on the same day. For each matched system, we compare source vs. treated Lume predictions to test whether fluorescence drops after chlorination. Tryptophan-like fluorescence (TLF) is attenuated by free chlorine, so effective chlorination should produce a measurable reduction in the Lume signal.
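
The matching step can be sketched as a group-by on system and date, keeping only groups that contain both a source (Cl₂ = 0) and a treated (Cl₂ > 0) sample. The dictionary keys (`system`, `date`, `cl2`, `lume_log10`) and the sample values are hypothetical and may not match the dataset's column names.

```python
from collections import defaultdict

def matched_source_treated(samples):
    """Find same-day source/treated pairs per water system.

    samples: dicts with keys system, date, cl2, lume_log10.
    Returns a list of (system, date, mean_source, mean_treated, delta),
    where delta = treated - source (negative means the signal dropped).
    """
    groups = defaultdict(lambda: {"source": [], "treated": []})
    for s in samples:
        kind = "treated" if s["cl2"] > 0 else "source"
        groups[(s["system"], s["date"])][kind].append(s["lume_log10"])
    out = []
    for (system, date), g in sorted(groups.items()):
        if g["source"] and g["treated"]:
            src = sum(g["source"]) / len(g["source"])
            trt = sum(g["treated"]) / len(g["treated"])
            out.append((system, date, src, trt, trt - src))
    return out

# Made-up samples, for illustration only.
samples = [
    {"system": "BH-01", "date": "2026-06-02", "cl2": 0.0, "lume_log10": 1.8},
    {"system": "BH-01", "date": "2026-06-02", "cl2": 0.5, "lume_log10": 0.2},
    {"system": "BH-02", "date": "2026-06-02", "cl2": 0.0, "lume_log10": 1.1},  # no treated sample
    {"system": "BH-01", "date": "2026-06-03", "cl2": 0.4, "lume_log10": 0.1},  # no same-day source
]
pairs = matched_source_treated(samples)
print(pairs)
```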

Pooled classifier

Logistic regression models are fitted to all paired observations pooled across sensors. The binary classifiers fit log-odds(CBT ≥ threshold) ~ mon2c_n + tof_n + FE + slopes, where FE denotes per-sensor fixed effects, using class-balanced weights and an L2 penalty (λ = 0.02). The operating threshold τ is tuned to maximize balanced accuracy (equivalently, Youden's J). The "of CBT ceiling" figure expresses balanced accuracy as a fraction of the ~92.5% threshold-agreement ceiling any method faces against single-bag CBT labels (KWR/JMP 2022). The 3-level risk classifier fits a multinomial softmax model to three categories: <10, 10–99, and ≥100 CFU/100 mL.
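
Tuning τ for balanced accuracy amounts to sweeping candidate thresholds over the predicted probabilities and keeping the best one. The sketch below assumes probabilities and 0/1 CBT labels are already available; the function name and the sample values are illustrative, and ties are resolved in favour of the lowest threshold, which is a choice made here rather than something the page specifies.

```python
def best_threshold(probs, labels):
    """Return (tau, balanced_accuracy) maximizing balanced accuracy.

    A sample is called positive when its probability is >= tau.
    Youden's J equals 2 * balanced_accuracy - 1, so both share a maximizer.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required")
    best_tau, best_ba = None, -1.0
    for tau in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= tau)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < tau)
        ba = 0.5 * (tp / pos + tn / neg)
        if ba > best_ba:
            best_tau, best_ba = tau, ba
    return best_tau, best_ba

# Made-up classifier outputs, for illustration only.
probs = [0.05, 0.10, 0.30, 0.40, 0.60, 0.70, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
tau, ba = best_threshold(probs, labels)
print(tau, ba, 2 * ba - 1)
```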

Machine learning benchmark

To verify that the transparent linear Tobit model is not leaving substantial predictive power on the table, we benchmark it against gradient-boosted tree models (LightGBM and XGBoost) using identical features and evaluation protocols. Both ML models use class-balanced weights, depth-2 trees with monotonicity constraints on fluorescence, and are evaluated via LOOCV, leave-one-day-out, and 200× repeated 70/30 splits. The ML models are shown as a benchmark only — the production crediting model remains the transparent 8-parameter Tobit for Gold Standard auditability.
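
For readers unfamiliar with the LOOCV protocol, the loop below refits on all observations but one and predicts the held-out point. It uses a one-feature ordinary least-squares line as a stand-in; it is not the Tobit model or either boosted-tree model, and the data are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def loocv_predictions(xs, ys):
    """Predict each point from a model fitted to all the other points."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        preds.append(a + b * xs[i])
    return preds

# Exactly linear toy data: every held-out prediction recovers its target.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
preds = loocv_predictions(xs, ys)
print(preds)
```

Leave-one-day-out follows the same pattern, with all observations from one day held out together instead of a single row.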

Validated use cases and limitations

Based on 216 paired Lume–CBT field observations across 3 sensors, 2 countries, and both source and treated water, the following use cases are supported or not yet supported.

Supported

Screening for contaminated water (≥10 CFU/100 mL)

Binary classification at the WHO “intermediate risk” threshold reaches 83% balanced accuracy with a class-balanced logistic classifier (sensitivity 84%, specificity 82%, AUC 0.885), or 85% with the deployed Tobit model thresholded at the boundary (sensitivity 76%, specificity 93%, AUC 0.892). This is ~90% of the ~92.5% ceiling imposed by CBT’s own inter-method variability (KWR/JMP 2022). The operationally relevant question for SDWS Parameter 18 — “is this water safe?” — is answered correctly with high sensitivity to true contamination.
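
The balanced-accuracy and "share of ceiling" figures above follow from the quoted sensitivity and specificity. The short check below only uses numbers stated in this section; the variable names are illustrative.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2

CBT_CEILING = 0.925  # threshold-agreement ceiling quoted from KWR/JMP 2022

logistic_ba = balanced_accuracy(0.84, 0.82)
tobit_ba = balanced_accuracy(0.76, 0.93)

print(f"logistic: BA = {logistic_ba:.3f}, share of ceiling = {logistic_ba / CBT_CEILING:.2f}")
print(f"tobit:    BA = {tobit_ba:.3f}, share of ceiling = {tobit_ba / CBT_CEILING:.2f}")
```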

WHO risk category assignment (±1 tier)

The 3-level risk classifier (<10, 10–99, ≥100 CFU/100 mL) agrees with CBT to within ±1 category in 97% of samples; that is, only about 3 in 100 predictions are off by more than one risk tier. The Lume reliably distinguishes “safe” from “very high risk” water.
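
Exact and within-one-tier agreement can be computed as below. The sketch derives tiers from CFU values for both methods, whereas the page's classifier predicts the tier directly with a softmax model, so this is only an illustration of the metric; the sample values are made up.

```python
def risk_tier(cfu_per_100ml):
    """Map a CFU/100 mL value to tier 0 (<10), 1 (10-99), or 2 (>=100)."""
    if cfu_per_100ml < 10:
        return 0
    if cfu_per_100ml < 100:
        return 1
    return 2

def tier_agreement(pred_cfu, cbt_cfu):
    """Return (exact, within_one): shares of samples matching the CBT tier
    exactly, and matching it to within one tier."""
    pairs = [(risk_tier(p), risk_tier(c)) for p, c in zip(pred_cfu, cbt_cfu)]
    n = len(pairs)
    exact = sum(1 for p, c in pairs if p == c) / n
    within_one = sum(1 for p, c in pairs if abs(p - c) <= 1) / n
    return exact, within_one

# Made-up predictions and CBT results, for illustration only.
pred = [0, 4, 12, 150, 3, 80]
cbt = [0, 12, 40, 100, 120, 100]
exact, within_one = tier_agreement(pred, cbt)
print(exact, within_one)
```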

Confirming chlorination efficacy

In samples where free chlorine residual was measured (n = 66, all Kenya), 100% of chlorinated samples (Cl₂ > 0, n = 30) had 0 CFU by CBT and were correctly classified as safe by the Lume. The sensor clearly distinguishes chlorinated from unchlorinated water.

Continuous monitoring between grab samples

A single annual CBT grab sample captures one snapshot with ±0.65 log₁₀ noise. The Lume takes readings every ~5 minutes — roughly 288 readings per day — and can detect transient contamination events that a single grab sample would miss. Even with a per-reading agreement of 88%, the aggregate of continuous monitoring provides far higher temporal coverage than periodic grab sampling.

Limitations

Cannot reliably distinguish 0 from 1–9 CFU

The ≥1 CFU binary classifier achieves only 73% balanced accuracy (sensitivity = 62%). The Lume misses roughly 40% of “low risk” (1–9 CFU) samples, classifying them as conforming (0 CFU). This reflects both the signal and the reference: 0 and 1–9 CFU produce nearly identical fluorescence signals, and CBT itself is noisy at these levels. The Lume should not be used to certify drinking water as zero-E. coli.

Per-sensor calibration required

Per-sensor LOOCV agreement varies — see the Per-sensor LOOCV agreement table below the scatter plot for current figures (computed dynamically from the live dataset). The model carries a per-sensor intercept and a per-sensor mon2 slope, but the Rwanda-deployed units (50045, 50065) show a residual fluorescence-sensitivity offset in high-DOM spring water that these terms do not fully absorb. Each new sensor requires per-sensor calibration before its predictions are reliable.

Prediction is a risk category, not a precise count

The model explains ~52% of variance in log₁₀(CFU+1) (R² = 0.52, σ̂ = 0.61 log₁₀). Individual predictions carry substantial uncertainty. The Lume is validated for categorical risk classification (“safe” vs “contaminated”), not for reporting exact CFU concentrations.
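
One way to read a prediction as a risk statement rather than a count is to convert it into an exceedance probability. The sketch below assumes normally distributed errors with the σ̂ = 0.61 log₁₀ quoted above and ignores parameter uncertainty and censoring; it is an illustration, not the page's classification method.

```python
import math
from statistics import NormalDist

SIGMA_HAT = 0.61  # residual SD in log10 units, as quoted above

def prob_at_least(pred_log10, threshold_cfu=10, sigma=SIGMA_HAT):
    """P(true log10(CFU + 1) >= log10(threshold + 1)) under a normal error model."""
    cut = math.log10(threshold_cfu + 1)
    return 1 - NormalDist(mu=pred_log10, sigma=sigma).cdf(cut)

for pred in (0.3, math.log10(11), 2.0):
    print(f"prediction {pred:.2f} log10 -> P(>= 10 CFU) = {prob_at_least(pred):.2f}")
```

A prediction sitting exactly on the threshold gives 0.5, which is why calls near a boundary are the least certain.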

Ongoing CBT verification needed

The Lume does not replace CBT — it extends it. Periodic CBT grab samples remain necessary to (a) verify the model has not drifted, (b) recalibrate baselines if sensor optics change over time, and (c) provide ground truth for any new deployment sites or water chemistries. The recommended cadence is at least one paired CBT session per sensor per quarter.

A note on the validation distribution. The CBT reference values in this dataset are concentrated at two poles: 57% at 0 CFU (conformity) and 22% at ≥100 CFU (the CBT upper detection limit), with only 21% in the 1–99 range. This is not a sampling artefact; it reflects the operational reality of these water treatment programs. In Amazi Meza (Rwanda), LifeStraw gravity filters either work (producing 0 CFU treated water) or the source water is untreated and often heavily contaminated. In DRIP (Kenya), boreholes are either naturally clean or post-treatment water has been effectively chlorinated. The “intermediate” zone (10–99 CFU) is genuinely rare in the field because partial treatment failure at these programs does not produce a smooth gradient of contamination. It produces a binary outcome.

This means the reported classifier performance is measured on the distribution the Lume will actually encounter in deployment. A hypothetical uniform distribution across WHO risk categories would be more challenging for the model near the boundaries, but it would also be unrealistic — no operational water program produces equal proportions of conformity, low-risk, intermediate, and very-high-risk water. The Lume is validated on — and optimised for — the water it will actually see. Where the model is weakest (distinguishing 5 from 15 CFU, or 50 from 100 CFU) is precisely where CBT itself is noisiest and where the operational distinction matters least: both 5 and 15 CFU indicate a treatment problem that needs attention, regardless of which side of a threshold they fall on.

Bottom line for dMRV. For the purpose of Gold Standard SDWS Parameter 18 — determining whether treated water is microbiologically safe — the Lume provides a validated, continuous alternative to periodic CBT grab sampling at the ≥10 CFU threshold. It agrees with CBT within CBT’s own measurement noise 88% of the time, is statistically equivalent on average (TOST, p < 0.001), and delivers >200× the temporal sampling density of a single annual grab sample. The model is fully specified by 8 transparent, auditable coefficients — no black-box inference. The Lume should be paired with quarterly CBT verification and should not be used to certify zero-E. coli conformity.
