Headline result
Each test setup is evaluated with its own per-experiment classifier. Closed and Bucket are trained independently: the Closed model never sees Bucket data and vice versa, matching the live tool’s per-toggle evaluation. All figures are per-point classification accuracy under leave-one-region-out cross-validation.
Both setups exceed 90% overall accuracy with Cohen’s κ above 0.85. For an integral-over-time deployment metric (SDWS 23 water volume, SDWS 27 operational days), the combined misclassification rate corresponds to a ~7% time-budget error per measurement period, well within the precision needed for monthly carbon-credit verification cycles.
Why overall accuracy, not balanced accuracy
The methodological choice
For this test setup, overall accuracy is the methodologically correct primary metric; balanced accuracy is misleading because the experimental class frequencies reflect a deliberate physical design, not a sampling bias to be corrected for.
Below is the full case for that choice. Cohen’s κ is reported alongside overall accuracy as a chance-corrected secondary metric. Per-class precision, recall, and F1 are reported for diagnostic purposes — not as the primary score.
What the two metrics measure
| Metric | Definition | Implicit assumption |
|---|---|---|
| Overall accuracy | (correct predictions) / (total predictions). Each point counts equally regardless of class. | Class frequencies in the test set match the deployment-time frequencies you’ll actually encounter. |
| Balanced accuracy | Mean of per-class recalls. A class with 2% of samples contributes 33% of the score. | Every class has equal cost-per-sample regardless of frequency. Used when minority classes matter just as much as majority ones. |
The three reasons overall accuracy wins for this test setup
1. The downstream metric is an integral over time
The classifier’s output feeds SDWS 23 (water volume) and SDWS 27 (operational days). Both are integrals: flowing seconds × calibrated flow rate = volume; minutes-in-water above a daily threshold = operational day. Each minute of real time has the same cost in the integral, regardless of the underlying class. Overall accuracy is exactly the time-budget error of these integrals. Balanced accuracy is not — it would lie about the integral by giving a 10-minute Air event the same metric weight as a 60-minute Still period.
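To illustrate why per-point errors map directly onto these integrals, here is a minimal sketch that accumulates Flowing samples into a volume and in-water samples into operational days. The flow rate, the daily threshold, the reading of "in water" as Flowing or Still, and the function names are all illustrative assumptions, not values taken from the SDWS definitions.

```python
from collections import defaultdict

SAMPLE_MINUTES = 6          # cadence at the time of the snapshot (~6 min per reading)
FLOW_RATE_L_PER_MIN = 2.0   # hypothetical calibrated flow rate
DAILY_THRESHOLD_MIN = 60    # hypothetical in-water minutes required for an operational day


def water_volume_litres(labels):
    """Integrate flowing time: each Flowing sample contributes one sample interval."""
    flowing_minutes = sum(SAMPLE_MINUTES for label in labels if label == "Flowing")
    return flowing_minutes * FLOW_RATE_L_PER_MIN


def operational_days(samples):
    """Count days whose in-water time meets the threshold.

    `samples` is an iterable of (day, label) pairs. Treating both Flowing and
    Still as "in water" is an assumption made for this sketch.
    """
    in_water = defaultdict(int)
    for day, label in samples:
        if label in ("Flowing", "Still"):
            in_water[day] += SAMPLE_MINUTES
    return sum(1 for minutes in in_water.values() if minutes >= DAILY_THRESHOLD_MIN)


truth = ["Flowing"] * 3 + ["Still"] * 7
predicted = ["Flowing"] * 2 + ["Still"] * 8   # one Flowing sample misread as Still
print(water_volume_litres(truth), water_volume_litres(predicted))
```

Every misclassified sample shifts the integral by exactly one sample interval, whichever class it belongs to, which is the sense in which overall accuracy reads out the time-budget error.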
2. The class frequencies reflect the physical operating regime
The Closed protocol of 15 min flowing + 45 min still per hour produces a 1:3 Flowing:Still time ratio by design. A real water system in this configuration will produce mostly Still readings — that’s not a sampling problem, it’s the physical truth the system is supposed to measure. A classifier judged by balanced accuracy is penalized for matching the actual prior distribution, which is the opposite of what we want from a deployment metric.
3. Balanced accuracy amplifies tiny anomaly classes to the point of distortion
The Closed Pipe Flow experiment has only 13 Air points (out of 696 total = 1.9%), distributed across two operator-flagged anomaly events labeled “Likely air gap” and “Re-set system with grease”. These are not planned operating conditions; they’re maintenance incidents the operator recorded for traceability.
Under balanced accuracy, each of those 13 anomaly points carries a per-sample weight about 40× larger than a point in the dominant Still class (521 / 13 ≈ 40). With Air recall = 0/13, balanced accuracy reports 64.8% for Closed even though overall accuracy on the same predictions is 95.8%. The 31-percentage-point gap is entirely an artifact of the metric’s weighting scheme.
Worked example.
Closed has Flowing recall 96.3%, Still recall 98.1%, Air recall 0%. Overall accuracy = (156 + 511 + 0) / 696 = 95.8%. Balanced accuracy = (96.3 + 98.1 + 0) / 3 = 64.8%. Same model, same predictions, same data, but two different scores depending on whether each point or each class is weighted equally.
When balanced accuracy is the right metric
- Each class has equal real-world cost regardless of frequency (e.g. medical screening for a rare disease).
- The training/test class frequencies reflect sampling, and you want to weight as if each class were equally likely in deployment.
- You’re optimizing a model and want to penalize collapse-to-majority-class behavior.
None of these conditions apply here.
What we report instead
| Metric | Role |
|---|---|
| Overall accuracy | Primary score. Direct readout of integral-over-time error budget for SDWS 23 / 27. |
| Cohen’s κ | Chance-corrected secondary metric. |
| Per-class precision / recall / F1 | Diagnostic only — to identify which class drives the errors. |
| Balanced accuracy | Reported for completeness but explicitly not the primary metric. |
Methods
Sensor & experiment
A single Lume v1.2 sensor (barcode 50051) was deployed in two distinct test fixtures over a two-week period (2026-04-13 → present). The sensor reports uvled_temperature, sipm_temperature, and board_temperature on its /diagnostics stream and signal_per_spad_kcps + distance_mm on its /tof stream. Sample cadence at the time of this snapshot was approximately one reading every 6 minutes per stream.
Each annotation marks the start of a steady-state operating condition (Flowing, Still, or Air); the next annotation marks its end. Spans are clipped at experiment boundaries so disabled experiments (e.g. the firmware-bug window) do not pollute neighboring training data.
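The span construction described above could be sketched as follows. The tuple layout, the function name, and the choice to let the last annotation run to the end of the experiment window are assumptions made for illustration; the live tool's data model may differ.

```python
from datetime import datetime


def build_spans(annotations, window_start, window_end):
    """Turn (timestamp, label) annotations into clipped (start, end, label) spans.

    Each span runs from its annotation to the next one; the last span runs to
    window_end. Spans are clipped to [window_start, window_end], and spans that
    end up empty are dropped.
    """
    ordered = sorted(annotations, key=lambda a: a[0])
    spans = []
    for i, (start, label) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else window_end
        start, end = max(start, window_start), min(end, window_end)
        if start < end:
            spans.append((start, end, label))
    return spans


# Hypothetical annotations around the start of an experiment window.
example = [
    (datetime(2026, 4, 13, 13, 50), "Still"),
    (datetime(2026, 4, 13, 14, 15), "Flowing"),
    (datetime(2026, 4, 13, 14, 30), "Still"),
]
spans = build_spans(example, datetime(2026, 4, 13, 14, 0), datetime(2026, 4, 13, 15, 0))
```

In this example the first span is clipped to start at 14:00, so readings taken before the experiment window never enter the training data.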
Features (per segment)
- maxDrop, maxRise: magnitude of the largest sustained monotonic drop / rise in uvled_temperature within the segment.
- sipmMaxDrop, sipmMaxRise: same on the SiPM thermistor.
- boardMaxDrop, boardMaxRise: same on the board thermistor.
- uvledBoardDiff: mean of the (UVLED − Board) temperature gap across the segment.
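One plausible reading of these features is sketched below. It assumes a "sustained monotonic" run is a maximal stretch of consecutive readings that never reverse direction, and that the feature is the total change across the largest such run. That interpretation and the function names are assumptions, not the tool's confirmed implementation.

```python
def max_sustained_change(values):
    """Return (max_drop, max_rise) over runs of consecutive same-direction steps.

    A falling run is extended while each reading is <= the previous one, and a
    rising run while each reading is >= the previous one. The magnitude of a
    run is the total temperature change across it.
    """
    max_drop = max_rise = 0.0
    drop = rise = 0.0
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        drop = drop - delta if delta <= 0 else 0.0   # extend or reset the falling run
        rise = rise + delta if delta >= 0 else 0.0   # extend or reset the rising run
        max_drop = max(max_drop, drop)
        max_rise = max(max_rise, rise)
    return max_drop, max_rise


def uvled_board_diff(uvled, board):
    """Mean of the (UVLED - Board) temperature gap across the segment."""
    return sum(u - b for u, b in zip(uvled, board)) / len(uvled)


# Hypothetical uvled_temperature readings for one segment.
drop, rise = max_sustained_change([25.0, 24.5, 24.0, 24.2, 23.9])
```

With only two or three readings per segment, a run contains at most one or two steps, which is why these features carry little signal at the 6-min cadence.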
Classifier
Distance-weighted KNN with k=3 in the 7-feature space, normalized per fold, with class-frequency-balanced weights. Each segment receives one KNN prediction; the prediction is then expanded to all points in the segment.
Per-experiment training
Each experiment (Closed and Bucket) is trained on its own segments only. The Closed classifier never sees Bucket data; the Bucket classifier never sees Closed data.
Air rule (post-KNN)
Any KNN prediction of Air with low signal_per_spad_kcps is downgraded to Still; any high-turbidity reading is treated as Air evidence. The segment’s final label is the majority vote across its post-rule point predictions.
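The downgrade and majority vote could look roughly like this. The threshold value and function name are hypothetical, since the source does not state the cutoff, and the high-turbidity half of the rule is left out because the text does not say how turbidity is measured.

```python
from collections import Counter

LOW_SIGNAL_KCPS = 50.0  # hypothetical cutoff; the real threshold is not given in the source


def apply_air_rule(knn_label, signal_kcps_per_point):
    """Expand one segment-level KNN label to points, apply the downgrade, then vote.

    Returns (point_labels, final_label). Ties in the vote go to the label that
    appears first, which is an arbitrary choice for this sketch.
    """
    point_labels = []
    for signal in signal_kcps_per_point:
        label = knn_label
        if label == "Air" and signal < LOW_SIGNAL_KCPS:
            label = "Still"  # Air prediction not supported by the ToF signal
        point_labels.append(label)
    final_label = Counter(point_labels).most_common(1)[0][0]
    return point_labels, final_label


# Hypothetical signal_per_spad_kcps readings for a segment the KNN called Air.
points, final = apply_air_rule("Air", [10.0, 80.0, 90.0])
```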
Evaluation
Leave-One-Region-Out cross-validation: each annotated segment is held out in turn, KNN is retrained on the remaining segments, and a prediction is generated. Reported metrics are point-weighted.
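A minimal, dependency-free sketch of this evaluation loop follows. It assumes z-score normalization fitted on each training fold and inverse-class-frequency vote weights; the source says only "normalized per fold" and "class-frequency-balanced", so both choices, along with the function names, are assumptions.

```python
import math
from collections import Counter, defaultdict


def zscore_params(rows):
    """Per-feature mean and standard deviation, computed from training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def knn_predict(train_x, train_y, query, k=3):
    """Distance-weighted KNN vote, each vote scaled by inverse class frequency."""
    class_counts = Counter(train_y)
    nearest = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / ((dist + 1e-9) * class_counts[label])
    return max(votes, key=votes.get)


def leave_one_segment_out(features, labels, k=3):
    """Hold out each segment in turn and predict it from the remaining segments."""
    predictions = []
    for i, held_out in enumerate(features):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        means, stds = zscore_params(train_x)  # the held-out row never affects scaling

        def scale(row):
            return [(v - m) / s for v, m, s in zip(row, means, stds)]

        predictions.append(
            knn_predict([scale(r) for r in train_x], train_y, scale(held_out), k))
    return predictions
```

Fitting the normalization inside the loop keeps the held-out segment from leaking into the scaling statistics, which is the point of normalizing per fold.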
Test 1: Closed Pipe Flow
A pump-driven closed pipe loop, alternating ~15 min of pumped flow with ~45 min of static water per hour, run continuously from 2026-04-13 14:00 through 2026-04-16 12:15.
2026-04-13 14:00 → 2026-04-16 12:15 · 142 segments · 696 points
Confusion matrix (point counts)
| Actual (row) / Predicted (column) | Air | Flowing | Still |
|---|---|---|---|
| Air (n=13) | 0 | 0 | 13 |
| Flowing (n=162) | 0 | 156 | 6 |
| Still (n=521) | 0 | 10 | 511 |
Per-class metrics
| Class | n | Recall | Precision | F1 |
|---|---|---|---|---|
| Air | 13 | 0.0% | — | — |
| Flowing | 162 | 96.3% | 94.0% | 95.1% |
| Still | 521 | 98.1% | 96.4% | 97.2% |
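The per-class figures can be recomputed from the Closed Pipe Flow confusion matrix with a short helper like the one below. The function name and dictionary layout are illustrative; the κ formula is the standard observed-versus-expected agreement definition.

```python
CLASSES = ["Air", "Flowing", "Still"]
# Rows are actual classes, columns are predicted classes (Closed Pipe Flow, point counts).
CLOSED = [
    [0, 0, 13],
    [0, 156, 6],
    [0, 10, 511],
]


def summarize(matrix):
    """Overall accuracy, balanced accuracy, Cohen's kappa, per-class recall and precision."""
    n = len(matrix)
    total = sum(map(sum, matrix))
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    diag = [matrix[i][i] for i in range(n)]

    recall = [d / r if r else None for d, r in zip(diag, row_sums)]
    precision = [d / c if c else None for d, c in zip(diag, col_sums)]  # None: never predicted

    overall = sum(diag) / total
    balanced = sum(r for r in recall if r is not None) / sum(r is not None for r in recall)
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2  # chance agreement
    kappa = (overall - expected) / (1 - expected)
    return {"overall": overall, "balanced": balanced, "kappa": kappa,
            "recall": recall, "precision": precision}


stats = summarize(CLOSED)
for name, rec, prec in zip(CLASSES, stats["recall"], stats["precision"]):
    print(name, f"recall={rec:.1%}", "precision=n/a" if prec is None else f"precision={prec:.1%}")
```

Air precision is undefined because the classifier never predicts Air in this experiment, which is why the table shows a dash rather than 0%.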
Discussion
- Both planned operating conditions exceed 96% recall: Still at 98.1%, Flowing at 96.3%, with precision of 94.0% or higher. Excellent for an SDWS 23/27 use case.
- Air recall is 0/13, which is acceptable here. Both Closed-Air segments are operator-flagged anomalies, not planned operating conditions. They contribute only 1.9% of the time integral, so the impact on the SDWS volume estimate is negligible.
- The total error is 4.2% (29 of 696 points): the 13 Air points plus 16 points (2.3%) of Flowing↔Still cross-confusion at segment boundaries, where the temperature dynamics of a brief flowing window fall below the KNN’s discrimination threshold.
Test 2: Filling/Draining Bucket
A bucket dispenser configuration where the sensor sits in a reservoir that fills, holds, and drains on a longer cycle. Air exposure is intentional and recurrent (between fills). Run from 2026-04-17 15:24 to present, excluding the firmware-bug window 2026-04-23 15:00 → 2026-04-27 13:00.
2026-04-17 15:24 → present · 276 segments · 903 points
Confusion matrix (point counts)
| Actual (row) / Predicted (column) | Air | Flowing | Still |
|---|---|---|---|
| Air (n=481) | 459 | 0 | 19 |
| Flowing (n=219) | 0 | 169 | 49 |
| Still (n=205) | 0 | 16 | 187 |
Per-class metrics
| Class | n | Recall | Precision | F1 |
|---|---|---|---|---|
| Air | 481 | 96.0% | 100.0% | 98.0% |
| Flowing | 219 | 77.5% | 91.4% | 83.9% |
| Still | 205 | 92.1% | 73.3% | 81.7% |
Discussion
- Air discrimination is excellent (recall 96.0%, precision 100.0%). The turbidity-based Air rule provides a strong physical handle.
- Flowing recall (77.5%) is the bottleneck. 49 of 219 Flowing points misclassified as Still. At 6-min cadence, 15-min Flowing windows produce only 2–3 sample points, leaving the maxDrop feature with limited signal.
- Still recall is high (92.1%) but precision is lower (73.3%); the 49 Flowing points wrongly predicted as Still inflate the false-positive count. For SDWS-23 volume estimates this is a directionally favorable bias (conservative estimate).
Combined
418 segments · 1,599 points
Confusion matrix (combined)
| Actual (row) / Predicted (column) | Air | Flowing | Still |
|---|---|---|---|
| Air (n=494) | 459 | 0 | 32 |
| Flowing (n=381) | 0 | 325 | 55 |
| Still (n=720) | 0 | 26 | 698 |
Per-class metrics (combined)
| Class | n | Recall | Precision | F1 |
|---|---|---|---|---|
| Air | 494 | 92.9% | 100.0% | 96.3% |
| Flowing | 381 | 85.3% | 92.6% | 88.8% |
| Still | 720 | 96.9% | 88.9% | 92.7% |
Implication for SDWS 23 / 27
For an integral-over-time deployment metric, the time-weighted misclassification rate is ~7% (113 of 1,595 points in the combined confusion matrix). Because readings are evenly spaced at the current 6-min sample cadence, this corresponds to ~7 minutes of misclassified state per 100 minutes observed. Closed Pipe Flow at 95.8% gives a ~4-min-per-100 error budget; Bucket Dispenser at 90.7% gives ~9 min per 100.
Limitations & next steps
Sensor cadence is the dominant limit
At the snapshot sample rate (1 reading per ~6 min), 15-min Flowing windows yield only 2–3 sample points. The sustained-monotonic-run features lose statistical power below 4 samples per segment. Returning the firmware to 1-min cadence is the single change with the largest expected impact on Flowing recall.
Recommended next experiments
- Re-run the Closed protocol at 1-min sensor cadence; expected outcome is Flowing recall > 90%.
- Lengthen Flowing windows to 30 min in protocol design at 1-min cadence.
- Calibrate per-site flow rate so the classifier output can be reported in units of dispensed volume (litres) rather than time-fraction.
Reproducibility
This analysis is fully reproducible from three static snapshot files taken on 2026-04-27. The live tool at piped-flow-test.pages.dev may show slightly different numbers as new annotations are added or sensor data accumulates; this page reports a frozen-in-time view.
Data files
| File | Contents |
|---|---|
| annotations-snapshot-2026-04-27.json | 827 operator annotations covering the two enabled experiments. |
| diagnostics-snapshot-2026-04-27.json | 2,445 diagnostic readings (UVLED / SiPM / board temperature) for sensor 50051. |
| tof-snapshot-2026-04-27.json | 2,461 ToF readings for sensor 50051. |
| analysis-results-2026-04-27.json | Computed per-experiment confusion matrices and per-class metrics. |
Pipeline
- Filter annotations to active experiments only (firmware-bug window excluded).
- Build per-segment spans, clipping at experiment boundaries.
- Join each /diagnostics reading to its nearest /tof reading within ±2 min.
- Compute 7 features per span.
- Skip segments with <2 points.
- Train one KNN classifier per experiment under leave-one-region-out cross-validation.
- Expand segment predictions to points, apply Air-strip rule, majority-vote final label.
- Assemble point-weighted confusion matrices; compute overall accuracy + Cohen’s κ.
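The nearest-reading join in the pipeline could be sketched as below. It assumes timestamps are already converted to seconds since epoch and that the /tof timestamps are sorted; the function name and return convention are illustrative rather than the tool's actual code.

```python
from bisect import bisect_left

MAX_GAP_SECONDS = 120  # the ±2 min tolerance from the pipeline


def nearest_join(diag_times, tof_times, max_gap=MAX_GAP_SECONDS):
    """For each diagnostics timestamp, return the index of the nearest ToF timestamp.

    tof_times must be sorted ascending. The result holds None wherever no ToF
    reading lies within max_gap seconds.
    """
    matches = []
    for t in diag_times:
        pos = bisect_left(tof_times, t)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(tof_times)]
        best = min(candidates, key=lambda i: abs(tof_times[i] - t), default=None)
        if best is not None and abs(tof_times[best] - t) > max_gap:
            best = None
        matches.append(best)
    return matches


# Hypothetical timestamps: the third diagnostics reading has no ToF partner in range.
pairs = nearest_join([100, 500, 1000], [90, 400, 1300])
```

Binary search keeps the join at O(n log m), which matters little at a few thousand readings but avoids a quadratic scan if the cadence returns to 1 min.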