Field Calibration
Lume 1.2 – CBT Calibration
Paired calibration of the Lume 1.2 sensor against the Aquagenx Compartment Bag Test (CBT) — an MPN-based reference method widely used in international water quality monitoring. Samples are collected by enumerators on two field programs: Amazi Meza in Rwanda and DRIP in Kenya. Each grab sample is matched to the lowest-fluorescence sensor reading within a ±10 minute window.
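The pairing rule above (lowest-fluorescence reading within ±10 minutes of the grab sample) can be sketched in a few lines. This is a minimal illustration, not the project's pipeline code: the function name, the `(timestamp, mon2)` tuple layout, and the example readings are all hypothetical.

```python
from datetime import datetime, timedelta

def pair_grab_to_reading(grab_time, readings, window_minutes=10):
    """Return the (timestamp, mon2) reading with the lowest mon2 within
    +/- window_minutes of grab_time, or None if the window is empty.

    `readings` is an iterable of (datetime, float) tuples (assumed layout).
    """
    window = timedelta(minutes=window_minutes)
    in_window = [r for r in readings if abs(r[0] - grab_time) <= window]
    if not in_window:
        return None
    # Lowest fluorescence wins: handling the sensor pushes mon2 up,
    # so the minimum is the least disturbed in-water reading.
    return min(in_window, key=lambda r: r[1])

# Hypothetical readings around a 10:00 grab sample.
readings = [
    (datetime(2026, 5, 26, 9, 50), 412.0),
    (datetime(2026, 5, 26, 9, 55), 398.5),
    (datetime(2026, 5, 26, 10, 0), 640.2),   # spike while the sensor is handled
    (datetime(2026, 5, 26, 10, 5), 401.1),
    (datetime(2026, 5, 26, 10, 20), 350.0),  # outside the window, ignored
]
grab = datetime(2026, 5, 26, 10, 0)
best = pair_grab_to_reading(grab, readings)
print(best)
```

Taking the minimum rather than the nearest-in-time reading is what keeps the handling spike at 10:00 out of the paired dataset.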
Study context — programs & dMRV
This page documents the paired field validation of the Lume 1.2 fluorimeter against the Aquagenx Compartment Bag Test (CBT), conducted between May and June 2026 across two water treatment programs:
Amazi Meza — Rwanda.
School-based drinking water filtration program operated by Virridy Rwanda Ltd with the Rwanda Environment Management Authority and district governments. Currently serves ~600,000 students across 500+ schools, scaling to 1.5M students and 1,500 schools by 2028. Treatment is LifeStraw Community gravity filters at the point of use. Carbon credits are issued under Gold Standard GS12240 (33,911 tCO2e issued to date; ~200,000 tCO2e projected by 2028) and the revenue funds ongoing operations.
DRIP FUNDI — Kenya.
USAID-funded Drought Resilience Impact Platform that secures water for
~120,000 people across 200 boreholes in five northern counties (Marsabit, Garissa, Isiolo, Turkana, Wajir). IoT sensors, satellite remote sensing, and machine learning predict groundwater demand and prioritize repairs; uptime has risen from 56% to 91% and median repair time has fallen from 214 to 26 days. As with Amazi Meza, carbon credit revenue underwrites continued maintenance.
Why we sample before AND after treatment.
Carbon credit issuance under the Gold Standard methodology Emission Reductions from Safe Drinking Water Supply (v1.0) depends on demonstrating both halves of the story: (1) the baseline water source carries fecal contamination that would have caused real disease burden, and (2) the treated water actually delivered to households or students is safe. In practice we measure the baseline at each school or water pump once per crediting period to establish whether treatment is warranted, then take an annual sample of the filtered / treated water during the crediting period. A site earns credits only where (a) the baseline showed treatment was needed and (b) the annual treated-water samples come back free of contamination — credits are issued proportional to the share of those treated-water samples that pass.
Gold Standard dMRV Pilot 14 — approved September 2025.
The Lume sensor is the technology piece in Gold Standard dMRV Pilot 14, which authorizes Lume estimates of E. coli to substitute for the laboratory grab-sample requirement under SDWS Parameter 18. This page exists in part to close FAR 1 (sensor validation & calibration) and FAR 3 (integration of manual and digital sampling) from the pilot; every CBT pair below is direct evidence for those FARs. Continuous Lume readings, cross-checked here against CBT grab samples, let the program report SDWS Parameter 18 (Microbial Drinking Water Quality) at >200× the daily sample density the methodology was originally written for, and at a fraction of the cost of running a lab assay on every sample.
Data & Code Availability
All paired sensor–CBT observations, model coefficients, and figure-generation code are available for download and inspection. These files support full reproducibility of the results presented on this page and in the accompanying manuscript.
Paired Observations (CSV)
216 matched sensor–CBT records with raw and processed features, in-sample and LOOCV predictions, and WHO risk classifications. One row per observation.
Download CSV
Model Specification (JSON)
Tobit regression coefficients (mon2, temperature, ToF, per-sensor intercepts and slopes), z-score normalization means/SDs, per-sensor baselines, and agreement-band definition.
Download JSON
GitHub Repository
Public repository with the paired dataset, model specification, and Python script for generating publication figures (scatter, confusion matrices, feature relationships).
View on GitHub
Sensor GPS Track — Since 2026-05-25 00:00 UTC
Every GPS fix reported since 2026-05-25 00:00 UTC by 50045 & 50053 (deployed on Amazi Meza in Rwanda and DRIP in Kenya) and 50065. Cell-tower fixes are excluded; positions are deduped to ~1 m and connected in time order. Marker size scales with the number of fixes at that spot.
Live Sensor Time Series — Amazi Meza (Rwanda) & DRIP (Kenya)
CBT grab-sample E. coli results (diamonds) over raw TLF (mon2), raw ToF signal-per-SPAD (kcps), and water temperature from 50045 (Amazi Meza, Rwanda), 50053 (DRIP, Kenya), and 50065 starting 2026-05-25 00:00 UTC. SiPM rows are filtered to the calibrated combo (LED=512, bias~3000).
Equivalence testing — Lume vs. CBT E. coli
The Lume sensor estimates are statistically equivalent to CBT reference measurements under a two one-sided test (TOST) with a ±0.65 log10 equivalence margin, the half-width of the published 95% confidence interval of a single CBT bag. The underlying estimation model is a right-censored linear regression (Tobit) on log10(E. coli + 1) with eight parameters: baseline-subtracted raw fluorescence (mon2), water temperature, and the turbidity proxy (tof) as shared terms, plus a per-sensor 2-point calibration (a per-sensor intercept offset and a per-sensor mon2 slope). Temperature is an explicit model predictor (not a pre-correction). Each CBT grab is paired to the lowest mon2 reading in a ±10 min in-water window, because removing the sensor for sampling spikes mon2 upward. No statistical (IQR/Cook’s D) screening is applied; the only exclusions are four physically documented data-quality cases (one cross-sensor sensor fault, two turbidity-compromised readings, and one first-day baseline-transition reading). Agreement is assessed by leave-one-observation-out cross-validation (LOOCV): 88% of predictions fall within the combined ±0.92 log10 measurement uncertainty. Classification performance, the primary evidence for dMRV, is reported in the pooled classifier section below.
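To make "right-censored linear regression (Tobit)" concrete, the sketch below writes out the negative log-likelihood such a model minimizes. It is an illustration under stated assumptions, not the published fitting code: the censoring point is assumed to be log10(100 + 1), matching the ≥100 CFU ceiling of the CBT, and the linear predictor `mu` is passed in directly rather than built from mon2, temperature, tof, and the per-sensor terms.

```python
import math

def _norm_logpdf(z):
    """Log density of the standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def _norm_sf(z):
    """Upper-tail probability P(Z > z) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def tobit_neg_loglik(y, mu, sigma, upper):
    """Negative log-likelihood of a right-censored normal (Tobit) model.

    y     : observed log10(E. coli + 1) values
    mu    : linear-predictor values, one per observation
    sigma : residual standard deviation (log10 units)
    upper : censoring point on the same scale (assumed, see lead-in)
    """
    nll = 0.0
    for yi, mi in zip(y, mu):
        if yi >= upper:
            # Censored: we only know the true value is at least `upper`.
            # The floor avoids log(0) if the tail probability underflows.
            nll -= math.log(max(_norm_sf((upper - mi) / sigma), 1e-300))
        else:
            # Uncensored: ordinary Gaussian density term.
            nll -= _norm_logpdf((yi - mi) / sigma) - math.log(sigma)
    return nll

UPPER = math.log10(100 + 1)      # assumed censoring point
y = [0.0, 1.2, UPPER]            # hypothetical observations
mu = [0.3, 1.0, 2.4]             # hypothetical predictions
print(round(tobit_neg_loglik(y, mu, sigma=0.61, upper=UPPER), 3))
```

Fitting the model amounts to minimizing this quantity over the coefficients and sigma with a general-purpose numerical optimizer. The censored branch is what stops a ≥100 CFU bag from being treated as if it read exactly 100.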
About the CBT method. The Aquagenx Compartment Bag Test is a portable, field-deployable MPN assay approved for use in lower-resource settings. Like all enzyme-substrate E. coli assays, it requires an incubation period (typically 24–48 h at ambient or body temperature) before the compartments can be read. It does not, however, need a laboratory-grade thermostatic incubator and can be run at room temperature in the field, which is why it has become the standard reference for international water safety monitoring programs.
Published CBT precision benchmarks
The Compartment Bag Test is itself a noisy reference — these are the ceilings any sensor-vs-CBT comparison can approach, and they're the basis for the ±CBT error bar drawn on each point in the scatter below.
Intrinsic single-test 95% CI
~1.2–1.4 log10
Width per bin in the Aquagenx 5-compartment MPN table (Gronewold 2017). A "5 CFU" reading is consistent with 1.5–30 at 95% confidence.
CBT vs membrane filtration
r = 0.904
n = 270 metro-Atlanta samples; sensitivity 94.9%, specificity 96.6% (Stauber et al. 2014).
Field CBT vs lab CBT
ρ = 0.88
Inter-operator repeatability across Peruvian field staff vs lab analysts (Heitzinger et al. 2017). No significant difference, P = 0.50.
Threshold agreement vs Colilert
~92–93%
Presence/absence match at ≥ 1 and ≥ 10 CFU/100 mL across 5 natural waters, 60 paired samples (KWR/JMP 2022).
What this means for the calibration below. The intrinsic CBT noise floor (~1.2–1.4 log10 per single bag) is the irreducible lower bound on how tight any sensor regression can land. The horizontal error bars on each scatter point are the ±0.65 log10 half-width of that 95% CI — predictions that fall inside an error bar are statistically consistent with the lab call.
How the error bars and the agreement test work. Both bars on each point are the same ±0.65 log10, the half-width of the published per-bag CBT 95% CI (Gronewold 2017). The horizontal bar is the lab (CBT) uncertainty; the vertical bar applies that same interval to the Lume prediction, on the deliberately conservative assumption that a Lume reading is no more precise than a single CBT bag. (Our measured continuous residual, σ̂ ≈ 0.61 log10, is in fact nearly twice CBT's σ ≈ 0.33, so treating the Lume as CBT-equivalent narrows the agreement band and is generous to the lab side, not to us.)
- Agreement is NOT judged by whether the bars visually overlap. Two 95% CIs can overlap and still be significantly different — overlap overstates agreement. The correct test compares the difference of the two measurements, combining their standard errors in quadrature: a point agrees with the lab when
|residual| ≤ 1.96·√(σCBT² + σLume²) ≈ 0.92 log10 — tighter than the ±1.30 you'd get by summing the two ±0.65 bars. That difference test drives the green/red rings and the headline "agree with lab" percentage.
- Equivalence (TOST) answers a different question: is there a systematic bias between Lume and CBT? The banner reports a two-one-sided-test of the mean Lume−CBT difference against a ±0.65 log10 equivalence margin at 90% confidence. Passing means the methods are statistically interchangeable on average, not merely "not proven different."
Net: when the Lume is treated as no more precise than a single CBT bag, it behaves like a second independent reference. Neither a sensor nor a second CBT bag can be expected to agree with a single CBT bag more often than the ≈92–93% threshold agreement that CBT itself achieves against Colilert (KWR/JMP 2022).
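The difference test and the TOST described above can both be written out in a few lines. This is a sketch under stated assumptions rather than the page's actual analysis code: σ is recovered from the ±0.65 half-width by dividing by 1.96, the Lume is assigned the same σ as CBT (the conservative choice stated in the text), and the TOST uses a large-sample normal approximation where a real analysis might use a t distribution. The `diffs` values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

CBT_HALF_WIDTH = 0.65                 # per-bag 95% CI half-width, log10
SIGMA_CBT = CBT_HALF_WIDTH / 1.96     # about 0.33
SIGMA_LUME = SIGMA_CBT                # conservative assumption from the text

# Band for the difference of two independent measurements (quadrature sum).
AGREE_BAND = 1.96 * math.sqrt(SIGMA_CBT**2 + SIGMA_LUME**2)
SUMMED_BARS = 2 * CBT_HALF_WIDTH      # the looser "bars overlap" criterion

def agrees(residual, band=AGREE_BAND):
    """True when a Lume minus CBT residual (log10) lies inside the band."""
    return abs(residual) <= band

def tost_pvalue(diffs, margin=0.65):
    """Large-sample TOST p-value for the mean of paired differences.

    Rejecting both one-sided nulls (mean <= -margin, mean >= +margin)
    supports equivalence; the reported p-value is the larger of the two.
    """
    se = stdev(diffs) / math.sqrt(len(diffs))
    d = mean(diffs)
    z = NormalDist()
    p_lower = 1.0 - z.cdf((d + margin) / se)   # H0: mean <= -margin
    p_upper = z.cdf((d - margin) / se)         # H0: mean >= +margin
    return max(p_lower, p_upper)

diffs = [0.1, -0.2, 0.05, 0.3, -0.15, 0.0, 0.2, -0.1]  # hypothetical residuals
print(round(AGREE_BAND, 2), round(SUMMED_BARS, 2))
print(agrees(0.9), agrees(1.0))
print(tost_pvalue(diffs) < 0.05)
```

The quadrature band comes out at about 0.92 log10 against 1.30 for summed bars, which is why overlap of the drawn bars is the weaker criterion.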
Sources:
Stauber et al. 2014, J Microbiol Methods ·
Heitzinger et al. 2017, Peruvian DHS ·
Gronewold et al. 2017, Sci Total Environ ·
KWR/JMP 2022 lab evaluation.
Chlorination detection
Can the Lume detect chlorination efficacy? At sites in Kenya where free chlorine residual (Cl2) was measured alongside CBT and Lume readings, we identify matched water systems that have both pre-chlorination (source, Cl2 = 0) and post-chlorination (treated, Cl2 > 0) samples collected on the same day. For each matched system, we compare source vs. treated Lume predictions to test whether fluorescence drops after chlorination. Tryptophan-like fluorescence (TLF) is attenuated by free chlorine, so effective chlorination should produce a measurable reduction in the Lume signal.
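The matching step described above (same water system, same day, both a Cl2 = 0 and a Cl2 > 0 sample) could be sketched as follows. The record layout, field names, and values are hypothetical and chosen only for illustration; the real dataset's column names may differ.

```python
from collections import defaultdict
from statistics import mean

def matched_chlorination_pairs(samples):
    """Return (system, date, mean source prediction, mean treated prediction)
    for every system-day that has both an unchlorinated (cl2 == 0) and a
    chlorinated (cl2 > 0) sample."""
    groups = defaultdict(lambda: {"source": [], "treated": []})
    for s in samples:
        kind = "treated" if s["cl2"] > 0 else "source"
        groups[(s["system"], s["date"])][kind].append(s["lume_pred"])
    pairs = []
    for (system, date), g in sorted(groups.items()):
        if g["source"] and g["treated"]:   # keep only fully matched system-days
            pairs.append((system, date, mean(g["source"]), mean(g["treated"])))
    return pairs

# Hypothetical records; lume_pred is in log10(CFU + 1).
samples = [
    {"system": "BH-01", "date": "2026-06-02", "cl2": 0.0, "lume_pred": 1.8},
    {"system": "BH-01", "date": "2026-06-02", "cl2": 0.5, "lume_pred": 0.2},
    {"system": "BH-02", "date": "2026-06-02", "cl2": 0.0, "lume_pred": 1.1},  # no treated sample
]
pairs = matched_chlorination_pairs(samples)
for system, date, src, trt in pairs:
    print(system, date, "drop after chlorination:", round(src - trt, 2))
```

Restricting the comparison to same-day pairs keeps day-to-day changes in source water quality out of the source-versus-treated difference.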
Pooled classifier
Logistic regression models are fitted to all paired observations pooled across sensors. The binary classifiers fit log-odds(CBT ≥ threshold) ~ mon2cn + tofn + FE + slopes, with class-balanced weights and an L2 penalty (λ = 0.02). The operating threshold τ is tuned to maximize balanced accuracy, which is equivalent to maximizing Youden's J. The "of CBT ceiling" figure expresses balanced accuracy as a fraction of the ~92.5% threshold-agreement ceiling any method faces against single-bag CBT labels (KWR/JMP 2022). The 3-level risk classifier fits a multinomial softmax model to three categories: <10, 10–99, and ≥100 CFU/100 mL.
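Tuning the operating threshold τ for balanced accuracy can be illustrated with a simple sweep over the observed scores. This is a hedged sketch, not the page's fitting code: the labels and scores are invented, and the function names are hypothetical. Because Youden's J equals sensitivity + specificity − 1, it is 2 × balanced accuracy − 1, so the same τ maximizes both.

```python
def balanced_accuracy(labels, scores, tau):
    """Mean of sensitivity and specificity when predicting positive for
    score >= tau. Assumes both classes are present in `labels`."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= tau)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < tau)
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)

def tune_threshold(labels, scores):
    """Try each distinct score as a cut-off and return (tau, balanced accuracy)
    for the best one; ties resolve to the lowest tau."""
    candidates = sorted(set(scores))
    return max(
        ((t, balanced_accuracy(labels, scores, t)) for t in candidates),
        key=lambda pair: pair[1],
    )

# Hypothetical labels (1 = CBT at or above threshold) and classifier scores.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.8, 0.9]
tau, best_ba = tune_threshold(labels, scores)
print(tau, best_ba)
```

A threshold tuned on the same data it is scored on will look optimistic, so in practice τ would be chosen inside the cross-validation loop.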
Machine learning benchmark
To verify that the transparent linear Tobit model is not leaving substantial predictive power on the table, we benchmark it against gradient-boosted tree models (LightGBM and XGBoost) using identical features and evaluation protocols. Both ML models use class-balanced weights, depth-2 trees with monotonicity constraints on fluorescence, and are evaluated via LOOCV, leave-one-day-out, and 200× repeated 70/30 splits. The ML models are shown as a benchmark only — the production crediting model remains the transparent 8-parameter Tobit for Gold Standard auditability.
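Of the three evaluation protocols named above, leave-one-day-out is the least standard, so a small sketch of the split may help. It holds out every observation from one calendar day at a time, which guards against same-day samples leaking information between training and test sets (that rationale is our reading, not a statement from the study). The function name and example dates are hypothetical.

```python
def leave_one_day_out(days):
    """Yield (held_out_day, train_indices, test_indices), holding out all
    observations that share one calendar day in each fold."""
    for day in sorted(set(days)):
        test_idx = [i for i, d in enumerate(days) if d == day]
        train_idx = [i for i, d in enumerate(days) if d != day]
        yield day, train_idx, test_idx

# Hypothetical sampling dates, one entry per paired observation.
days = ["2026-05-26", "2026-05-26", "2026-05-27", "2026-05-28", "2026-05-28"]
folds = list(leave_one_day_out(days))
for day, train_idx, test_idx in folds:
    print(day, "train:", train_idx, "test:", test_idx)
```

Each fold is then fitted on `train_idx` and scored on `test_idx`, exactly as in LOOCV but with whole days as the unit.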
Validated use cases and limitations
Based on 216 paired Lume–CBT field observations across 3 sensors, 2 countries, and both source and treated water, the following use cases are supported or not yet supported.
Supported
Screening for contaminated water (≥10 CFU/100 mL)
Binary classification at the WHO “intermediate risk” threshold reaches 83% balanced accuracy with a class-balanced logistic classifier (sensitivity 84%, specificity 82%, AUC 0.885), or 85% with the deployed Tobit model thresholded at the boundary (sensitivity 76%, specificity 93%, AUC 0.892). This is ~90% of the ~92.5% ceiling imposed by CBT’s own inter-method variability (KWR/JMP 2022). The operationally relevant question for SDWS Parameter 18 (“is this water safe?”) is answered with 76–84% sensitivity to true contamination, depending on which classifier is used.
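The balanced-accuracy and "of CBT ceiling" figures quoted above follow from simple arithmetic on the reported sensitivity and specificity, reproduced here as a check. Only numbers stated on this page are used; the helper name is hypothetical.

```python
def ba_from_rates(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2

CBT_CEILING = 0.925   # threshold-agreement ceiling quoted from KWR/JMP 2022

logistic_ba = ba_from_rates(0.84, 0.82)   # class-balanced logistic classifier
tobit_ba = ba_from_rates(0.76, 0.93)      # deployed Tobit model, thresholded

print(round(logistic_ba, 3), round(logistic_ba / CBT_CEILING, 3))
print(round(tobit_ba, 3), round(tobit_ba / CBT_CEILING, 3))
```

Both classifiers land at roughly nine-tenths of the ceiling, the Tobit model trading some sensitivity for specificity.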
WHO risk category assignment (±1 tier)
The 3-level risk classifier (<10, 10–99, ≥100 CFU) achieves 97% within-±1-category agreement with CBT. In other words, only about 3 in 100 predictions are off by more than one risk tier, which with three tiers means confusing <10 CFU with ≥100 CFU. The Lume reliably distinguishes “safe” from “very high risk” water.
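The ±1 tier agreement metric can be computed as below. This is an illustrative sketch: the production classifier predicts the category directly with a softmax model, whereas here both CBT and Lume values are hypothetical CFU counts that are binned into tiers first.

```python
def risk_tier(cfu_per_100ml):
    """Map a CFU/100 mL value to the three tiers used on this page:
    0 for <10, 1 for 10-99, 2 for >=100."""
    if cfu_per_100ml < 10:
        return 0
    if cfu_per_100ml < 100:
        return 1
    return 2

def within_one_tier_rate(cbt_cfu, lume_cfu):
    """Share of pairs whose tiers differ by at most one."""
    hits = sum(
        abs(risk_tier(c) - risk_tier(p)) <= 1
        for c, p in zip(cbt_cfu, lume_cfu)
    )
    return hits / len(cbt_cfu)

# Hypothetical paired values; only the last pair is a two-tier miss.
cbt = [0, 5, 20, 150, 0]
lume = [2, 12, 8, 90, 120]
rate = within_one_tier_rate(cbt, lume)
print(rate)
```

With only three tiers, the sole way to fail this metric is to call <10 CFU water ≥100 CFU or the reverse, so it measures gross misclassification rather than fine discrimination.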
Confirming chlorination efficacy
In samples where free chlorine residual was measured (n = 66, all Kenya), 100% of chlorinated samples (Cl2 > 0, n = 30) had 0 CFU by CBT and were correctly classified as safe by the Lume. In this sample the sensor never flagged effectively chlorinated water as contaminated.
Continuous monitoring between grab samples
A single annual CBT grab sample captures one snapshot with ±0.65 log10 noise. The Lume takes readings every ~5 minutes — roughly 288 readings per day — and can detect transient contamination events that a single grab sample would miss. Even with a per-reading agreement of 88%, the aggregate of continuous monitoring provides far higher temporal coverage than periodic grab sampling.
Limitations
Cannot reliably distinguish 0 from 1–9 CFU
The ≥1 CFU binary classifier achieves only 73% balanced accuracy (sensitivity = 62%). The Lume misses roughly 40% of “low risk” (1–9 CFU) samples, classifying them as conforming (0 CFU). Two factors drive this: 0 and 1–9 CFU produce nearly identical fluorescence signals, and the CBT reference itself is noisy at these levels. The Lume should not be used to certify drinking water as zero-E. coli.
Per-sensor calibration required
Per-sensor LOOCV agreement varies — see the Per-sensor LOOCV agreement table below the scatter plot for current figures (computed dynamically from the live dataset). The model carries a per-sensor intercept and a per-sensor mon2 slope, but the Rwanda-deployed units (50045, 50065) show a residual fluorescence-sensitivity offset in high-DOM spring water that these terms do not fully absorb. Each new sensor requires per-sensor calibration before its predictions are reliable.
Prediction is a risk category, not a precise count
The model explains ~52% of variance in log10(CFU+1) (R² = 0.52, σ̂ = 0.61 log10). Individual predictions carry substantial uncertainty. The Lume is validated for categorical risk classification (“safe” vs “contaminated”), not for reporting exact CFU concentrations.
Ongoing CBT verification needed
The Lume does not replace CBT — it extends it. Periodic CBT grab samples remain necessary to (a) verify the model has not drifted, (b) recalibrate baselines if sensor optics change over time, and (c) provide ground truth for any new deployment sites or water chemistries. The recommended cadence is at least one paired CBT session per sensor per quarter.
A note on the validation distribution. The CBT reference values in this dataset are concentrated at two poles: 57% at 0 CFU (conformity) and 22% at ≥100 CFU (the CBT upper detection limit), with only 21% in the 1–99 range. This is not a sampling artefact; it reflects the operational reality of these water treatment programs. In Amazi Meza (Rwanda), either the LifeStraw gravity filters work (producing 0 CFU treated water) or the water is untreated source water, which is often heavily contaminated. In DRIP (Kenya), either boreholes are naturally clean or post-treatment water has been effectively chlorinated. The “intermediate” zone (10–99 CFU) is genuinely rare in the field because partial treatment failure at these programs does not produce a smooth gradient of contamination; it produces a binary outcome.
This means the reported classifier performance is measured on the distribution the Lume will actually encounter in deployment. A hypothetical uniform distribution across WHO risk categories would be more challenging for the model near the boundaries, but it would also be unrealistic — no operational water program produces equal proportions of conformity, low-risk, intermediate, and very-high-risk water. The Lume is validated on — and optimised for — the water it will actually see. Where the model is weakest (distinguishing 5 from 15 CFU, or 50 from 100 CFU) is precisely where CBT itself is noisiest and where the operational distinction matters least: both 5 and 15 CFU indicate a treatment problem that needs attention, regardless of which side of a threshold they fall on.
Bottom line for dMRV. For the purpose of Gold Standard SDWS Parameter 18 (determining whether treated water is microbiologically safe), the Lume provides a validated, continuous alternative to periodic CBT grab sampling at the ≥10 CFU threshold. In leave-one-out validation, 88% of its predictions agree with CBT within the combined ±0.92 log10 measurement uncertainty; it is statistically equivalent on average (TOST, p < 0.001); and it delivers >200× the daily sample density the methodology was originally written for (roughly 288 readings per day at a ~5-minute interval). The model is fully specified by 8 transparent, auditable coefficients, with no black-box inference. The Lume should be paired with quarterly CBT verification and should not be used to certify zero-E. coli conformity.