Flow-State Classifier Analysis — Snapshot 2026-04-27

Headline result

Each test setup is evaluated with its own per-experiment classifier (Closed and Bucket are trained independently — the Closed model never sees Bucket data and vice versa, matching the live tool’s per-toggle evaluation). Per-point classification accuracy under leave-one-region-out cross-validation:

Closed Pipe Flow: 95.8% · Bucket Dispenser: 90.7% · Combined: 93.0% · Cohen’s κ: 0.85

For an integral-over-time deployment metric (SDWS 23 water volume, SDWS 27 operational days), the combined 93.0% accuracy corresponds to ~7% time-budget error per measurement period, well within the precision needed for monthly carbon-credit verification cycles. Both setups exceed 90% overall accuracy, with Cohen’s κ of at least 0.85.

Why overall accuracy, not balanced accuracy

The methodological choice

For this test setup, overall accuracy is the methodologically correct primary metric; balanced accuracy is misleading because the experimental class frequencies reflect a deliberate physical design, not a sampling bias to be corrected for.

Below is the full case for that choice. Cohen’s κ is reported alongside overall accuracy as a chance-corrected secondary metric. Per-class precision, recall, and F1 are reported for diagnostic purposes — not as the primary score.

What the two metrics measure

Metric	Definition	Implicit assumption
Overall accuracy	(correct predictions) / (total predictions). Each point counts equally regardless of class.	Class frequencies in the test set match the deployment-time frequencies you’ll actually encounter.
Balanced accuracy	Mean of per-class recalls. With three classes, a class holding 2% of the samples still contributes one third of the score.	Every class matters equally regardless of how many samples it has. Used when minority classes matter just as much as majority ones.
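The two definitions in the table can be sketched in a few lines of Python. The confusion matrix below is a hypothetical toy example, not data from this analysis; it only shows how a rare, always-missed class moves the two scores apart.

```python
def overall_accuracy(confusion):
    """Fraction of all points on the diagonal; every point counts equally."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def balanced_accuracy(confusion):
    """Unweighted mean of per-class recalls; every class counts equally."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Hypothetical 3-class matrix (rows = actual, columns = predicted).
# The rare first class holds 2 of 100 points and is always missed.
toy = [
    [0, 0, 2],
    [0, 28, 0],
    [0, 0, 70],
]
print(f"overall  = {overall_accuracy(toy):.3f}")   # 0.980
print(f"balanced = {balanced_accuracy(toy):.3f}")  # 0.667
```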

The three reasons overall accuracy wins for this test setup

1. The downstream metric is an integral over time

The classifier’s output feeds SDWS 23 (water volume) and SDWS 27 (operational days). Both are integrals: flowing seconds × calibrated flow rate = volume; minutes-in-water above a daily threshold = operational day. Each minute of real time has the same cost in the integral, regardless of the underlying class. Because readings are evenly spaced in time, the overall error rate (1 − overall accuracy) is the fraction of time assigned to the wrong state, which is the time-budget error of these integrals. Balanced accuracy does not have this property: it would misstate the integral by giving a 10-minute Air event the same metric weight as a 60-minute Still period.
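A minimal sketch of the two integrals, assuming evenly spaced readings and assuming that "in water" means a Flowing or Still label. The function name, the flow rate, and the daily threshold are hypothetical placeholders, not values from the SDWS methodology.

```python
from collections import defaultdict


def integrate_labels(points, interval_s, flow_rate_lps, wet_minutes_per_day):
    """points: list of (day, label) pairs, one per evenly spaced reading.

    Returns (volume in litres, number of operational days).
    """
    # Volume: flowing seconds x calibrated flow rate.
    flowing_seconds = sum(interval_s for _, label in points if label == "Flowing")
    volume_l = flowing_seconds * flow_rate_lps

    # Operational days: days whose minutes-in-water exceed the threshold.
    wet_seconds = defaultdict(float)
    for day, label in points:
        if label in ("Flowing", "Still"):  # assumption: both mean "in water"
            wet_seconds[day] += interval_s
    operational_days = sum(
        1 for s in wet_seconds.values() if s / 60 > wet_minutes_per_day
    )
    return volume_l, operational_days


# Hypothetical readings at a 6-minute (360 s) cadence.
readings = (
    [("2026-04-14", "Flowing")] * 3
    + [("2026-04-14", "Still")] * 5
    + [("2026-04-14", "Air")] * 2
    + [("2026-04-15", "Still")] * 1
    + [("2026-04-15", "Air")] * 9
)
volume, days = integrate_labels(
    readings, interval_s=360, flow_rate_lps=0.1, wet_minutes_per_day=30
)
print(volume, days)
```

Every reading contributes the same `interval_s` to either integral, which is why a per-point error rate maps directly onto a time-budget error.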

2. The class frequencies reflect the physical operating regime

The Closed protocol of 15 min flowing + 45 min still per hour produces a 1:3 Flowing:Still time ratio by design. A real water system in this configuration will produce mostly Still readings — that’s not a sampling problem, it’s the physical truth the system is supposed to measure. A classifier judged by balanced accuracy is penalized for matching the actual prior distribution, which is the opposite of what we want from a deployment metric.

3. Balanced accuracy amplifies tiny anomaly classes to the point of distortion

The Closed Pipe Flow experiment has only 13 Air points (out of 696 total = 1.9%), distributed across two operator-flagged anomaly events labeled “Likely air gap” and “Re-set system with grease”. These are not planned operating conditions; they’re maintenance incidents the operator recorded for traceability.

Under balanced accuracy, each of those 13 anomaly points gets a per-sample weight about 40× larger than a point in the dominant Still class (521 / 13 ≈ 40). With Air recall = 0/13, balanced accuracy reports 64.8% for Closed even though overall accuracy on the same predictions is 95.8%. The 31-percentage-point gap is entirely an artifact of the metric’s weighting scheme.

Worked example.

Closed has Flowing recall 96.3% (156/162), Still recall 98.1% (511/521), Air recall 0% (0/13). Overall accuracy = (156 + 511 + 0) / 696 = 95.8%. Balanced accuracy = (96.3 + 98.1 + 0) / 3 = 64.8%. Same model, same predictions, same data: two different scores depending on whether you weigh each point equally or each class equally.

When balanced accuracy is the right metric

Each class has equal real-world cost regardless of frequency (e.g. medical screening for a rare disease).
The training/test class frequencies reflect sampling, and you want to weight as if each class were equally likely in deployment.
You’re optimizing a model and want to penalize collapse-to-majority-class behavior.

None of these conditions apply here.

What we report instead

Metric	Role
Overall accuracy	Primary score. Direct readout of integral-over-time error budget for SDWS 23 / 27.
Cohen’s κ	Chance-corrected secondary metric.
Per-class precision / recall / F1	Diagnostic only — to identify which class drives the errors.
Balanced accuracy	Reported for completeness but explicitly not the primary metric.

Methods

Sensor & experiment

A single Lume v1.2 sensor (barcode 50051) was deployed in two distinct test fixtures over a two-week period (2026-04-13 → present). The sensor reports uvled_temperature, sipm_temperature, and board_temperature on its /diagnostics stream and signal_per_spad_kcps + distance_mm on its /tof stream. Sample cadence at the time of this snapshot was approximately one reading every 6 minutes per stream.

Each annotation marks the start of a steady-state operating condition (Flowing, Still, or Air); the next annotation marks its end. Spans are clipped at experiment boundaries so disabled experiments (e.g. the firmware-bug window) do not pollute neighboring training data.

Features (per segment)

maxDrop, maxRise — magnitude of the largest sustained monotonic drop / rise in uvled_temperature within the segment.
sipmMaxDrop, sipmMaxRise — same on the SiPM thermistor.
boardMaxDrop, boardMaxRise — same on the board thermistor.
uvledBoardDiff — mean of (UVLED − Board) temperature gap across the segment.
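The drop and rise features can be sketched as follows. The source does not define "sustained monotonic" precisely, so this sketch assumes a run is a stretch of consecutive samples whose steps never move against the chosen direction, and that a run needs at least `min_samples` points (a hypothetical parameter).

```python
def max_monotonic_change(values, direction, min_samples=2):
    """Largest total change over a run of consecutive samples moving one way.

    direction=-1 measures drops, direction=+1 measures rises. Returns a
    non-negative magnitude, or 0.0 if no run has at least min_samples points.
    """
    best = 0.0
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when a step reverses.
        broke = i == len(values) or (values[i] - values[i - 1]) * direction < 0
        if broke:
            if i - start >= min_samples:
                best = max(best, (values[i - 1] - values[start]) * direction)
            start = i
    return best


def mean_gap(uvled, board):
    """Mean of (UVLED - Board) over paired readings in a segment."""
    return sum(u - b for u, b in zip(uvled, board)) / len(uvled)


# Hypothetical uvled_temperature trace for one segment, in degrees C.
temps = [30.0, 29.5, 29.0, 29.2, 28.0, 27.0, 27.5]
max_drop = max_monotonic_change(temps, direction=-1)
max_rise = max_monotonic_change(temps, direction=+1)
print(round(max_drop, 2), round(max_rise, 2))  # 2.2 0.5
```

The same function would apply unchanged to the SiPM and board thermistor traces.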

Classifier

Distance-weighted KNN with k=3 in the 7-feature space, normalized per fold, with class-frequency-balanced weights. Each segment receives one KNN prediction; the prediction is then expanded to all points in the segment.
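A minimal pure-Python sketch of such a classifier. The source does not specify the distance metric, the normalization, or the exact weighting formula, so this sketch assumes Euclidean distance, z-score normalization fitted on the training rows only, inverse-distance votes, and a class-balancing factor of 1 / (training count of the class). The two-feature data is hypothetical; the real feature space has 7 dimensions.

```python
import math
from collections import Counter, defaultdict


def fit_scaler(rows):
    """Per-feature mean and standard deviation from the training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0  # avoid /0
        for c, m in zip(cols, means)
    ]
    return means, stds


def scale(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k=3, eps=1e-9):
    """Distance-weighted, class-balanced k-nearest-neighbour vote."""
    means, stds = fit_scaler(train_x)
    q = scale(query, means, stds)
    class_counts = Counter(train_y)
    neighbours = sorted(
        (math.dist(scale(x, means, stds), q), y) for x, y in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        # Closer neighbours and rarer classes get a larger vote.
        votes[label] += 1.0 / (dist + eps) / class_counts[label]
    return max(votes, key=votes.get)


# Hypothetical two-feature training segments.
train_x = [[0.1, 0.0], [0.2, 0.1], [0.1, 0.2], [2.0, 1.5], [2.2, 1.4]]
train_y = ["Still", "Still", "Still", "Flowing", "Flowing"]
print(knn_predict(train_x, train_y, [2.1, 1.5]))   # Flowing
print(knn_predict(train_x, train_y, [0.15, 0.1]))  # Still
```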

Per-experiment training

Each experiment (Closed and Bucket) is trained on its own segments only. The Closed classifier never sees Bucket data; the Bucket classifier never sees Closed data.

Air rule (post-KNN)

Any KNN prediction of Air with low signal_per_spad_kcps is downgraded to Still; any high-turbidity reading is treated as Air evidence. The segment’s final label is the majority vote across its post-rule point predictions.
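The downgrade and the majority vote can be sketched as below. This covers only the first clause of the rule: the source does not define how a high-turbidity reading is detected, so that clause is left out. The threshold value and function name are hypothetical.

```python
from collections import Counter

LOW_SIGNAL_KCPS = 50.0  # hypothetical threshold; the source does not state one


def apply_air_rule(segment_label, signals_kcps, low_signal=LOW_SIGNAL_KCPS):
    """Expand a segment prediction to its points, downgrade weak Air points,
    then majority-vote the segment's final label.

    signals_kcps must hold at least one reading.
    """
    point_labels = []
    for signal in signals_kcps:
        label = segment_label  # segment prediction expanded to each point
        if label == "Air" and signal < low_signal:
            label = "Still"  # Air prediction with low signal is downgraded
        point_labels.append(label)
    final_label = Counter(point_labels).most_common(1)[0][0]
    return point_labels, final_label


print(apply_air_rule("Air", [10.0, 20.0, 200.0]))
# (['Still', 'Still', 'Air'], 'Still')
```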

Evaluation

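Leave-one-region-out cross-validation: each annotated segment is held out in turn, KNN is retrained on the remaining segments, and a prediction is generated for the held-out segment. Reported metrics are point-weighted: each segment’s prediction counts once for every point the segment contains.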
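The evaluation loop can be sketched as follows. To stay self-contained it takes any `predict` callable and uses a 1-nearest-neighbour stand-in, which is not the classifier described above. The segment data and field names are hypothetical, and the post-KNN Air rule is omitted for brevity.

```python
def leave_one_region_out(segments, predict):
    """segments: list of dicts with 'features', 'label', 'n_points'.
    predict(train_x, train_y, query) -> label.
    Returns (correct_points, total_points), i.e. point-weighted counts.
    """
    correct = total = 0
    for i, held_out in enumerate(segments):
        train = segments[:i] + segments[i + 1:]  # retrain without this segment
        pred = predict(
            [s["features"] for s in train],
            [s["label"] for s in train],
            held_out["features"],
        )
        total += held_out["n_points"]
        if pred == held_out["label"]:
            correct += held_out["n_points"]  # every point inherits the label
    return correct, total


def nearest_label(train_x, train_y, query):
    """Stand-in 1-nearest-neighbour classifier, used only to run the example."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[dists.index(min(dists))]


segments = [
    {"features": [0.1], "label": "Still", "n_points": 8},
    {"features": [0.2], "label": "Still", "n_points": 7},
    {"features": [2.0], "label": "Flowing", "n_points": 3},
    {"features": [2.2], "label": "Flowing", "n_points": 2},
    {"features": [1.0], "label": "Flowing", "n_points": 2},  # ambiguous segment
]
correct, total = leave_one_region_out(segments, nearest_label)
print(correct, total)  # 20 22
```

Because long segments carry more points, a single wrong segment prediction costs more or less depending on its duration, which matches the time-integral view of the metric.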

Test 1: Closed Pipe Flow

A pump-driven closed pipe loop, alternating ~15 min of pumped flow with ~45 min of static water per hour, run continuously from 2026-04-13 14:00 through 2026-04-16 12:15.

Overall accuracy: 95.8% · Cohen’s κ: 0.89 · Still recall: 98.1% · Flowing recall: 96.3%

2026-04-13 14:00 → 2026-04-16 12:15 · 142 segments · 696 points

Confusion matrix (point counts)

Predicted
Actual	Air	Flowing	Still
Air (n=13)	0	0	13
Flowing (n=162)	0	156	6
Still (n=521)	0	10	511

Per-class metrics

Class	n	Recall	Precision	F1
Air	13	0.0%	—	—
Flowing	162	96.3%	94.0%	95.1%
Still	521	98.1%	96.4%	97.2%
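The summary figures for this test can be recomputed from the confusion matrix above. The sketch below is a plain-Python check, not the tool's own code; it uses the standard definition of Cohen's κ, (p_obs − p_exp) / (1 − p_exp).

```python
LABELS = ["Air", "Flowing", "Still"]
# Rows = actual, columns = predicted, in LABELS order (Closed Pipe Flow table).
CLOSED = [
    [0, 0, 13],
    [0, 156, 6],
    [0, 10, 511],
]


def summarize(matrix):
    """Return (overall accuracy, Cohen's kappa, recalls, precisions)."""
    n = sum(map(sum, matrix))
    row_sums = [sum(r) for r in matrix]
    col_sums = [sum(c) for c in zip(*matrix)]
    diag = [matrix[i][i] for i in range(len(matrix))]
    p_obs = sum(diag) / n
    # Agreement expected by chance from the row and column marginals.
    p_exp = sum(r * c for r, c in zip(row_sums, col_sums)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    recall = [d / r if r else None for d, r in zip(diag, row_sums)]
    precision = [d / c if c else None for d, c in zip(diag, col_sums)]
    return p_obs, kappa, recall, precision


accuracy, kappa, recall, precision = summarize(CLOSED)
print(f"accuracy={accuracy:.3f} kappa={kappa:.2f}")  # accuracy=0.958 kappa=0.89
for name, r, p in zip(LABELS, recall, precision):
    print(name, r, p)
```

Air precision comes out as `None` because no point was ever predicted as Air, which is why the table shows a dash in that cell.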

Discussion

Both planned operating conditions exceed 96% recall (Still 98.1%, Flowing 96.3%), with precision of 94.0% or higher. This is excellent for an SDWS 23/27 use case.
Air recall is 0/13 — and this is fine. Both Closed-Air segments are operator-flagged anomalies, not planned operating conditions. They contribute only 1.9% of the time integral; impact on the SDWS volume estimate is negligible.
Of the 4.2% total error, 1.9% is the 13 missed Air points. The remaining 2.3% (16 of 696 points) is Flowing↔Still cross-confusion at segment boundaries, where the temperature dynamics of a brief flowing window fall below the KNN’s discrimination threshold.

Test 2: Filling/Draining Bucket

A bucket dispenser configuration where the sensor sits in a reservoir that fills, holds, and drains on a longer cycle. Air exposure is intentional and recurrent (between fills). Run from 2026-04-17 15:24 to present, excluding the firmware-bug window 2026-04-23 15:00 → 2026-04-27 13:00.

Overall accuracy: 90.7% · Cohen’s κ: 0.85 · Air recall: 96.0% · Still recall: 92.1%

2026-04-17 15:24 → present · 276 segments · 903 points

Confusion matrix (point counts)

Predicted
Actual	Air	Flowing	Still
Air (n=481)	459	0	19
Flowing (n=219)	0	169	49
Still (n=205)	0	16	187

Per-class metrics

Class	n	Recall	Precision	F1
Air	481	96.0%	100.0%	98.0%
Flowing	219	77.5%	91.4%	83.9%
Still	205	92.1%	73.3%	81.7%

Discussion

Air discrimination is excellent (recall 96.0%, precision 100.0%). The turbidity-based Air rule provides a strong physical handle.
Flowing recall (77.5%) is the bottleneck: 49 of 219 Flowing points are misclassified as Still. At 6-min cadence, 15-min Flowing windows produce only 2–3 sample points, leaving the maxDrop feature with limited signal.
Still recall is high (92.1%) but precision is lower (73.3%): the 49 Flowing points and 19 Air points wrongly predicted as Still inflate the false-positive count. For SDWS-23 volume estimates this is a directionally favorable bias, because under-counting Flowing time makes the volume estimate conservative.

Combined

418 segments · 1,599 points

Overall accuracy: 93.0% · Cohen’s κ: 0.85 · Balanced accuracy: ~76% · Correct points: 1,483 / 1,595

Confusion matrix (combined)

Predicted
Actual	Air	Flowing	Still
Air (n=494)	459	0	32
Flowing (n=381)	0	325	55
Still (n=720)	0	26	698

Per-class metrics (combined)

Class	n	Recall	Precision	F1
Air	494	92.9%	100.0%	96.3%
Flowing	381	85.3%	92.6%	88.8%
Still	720	96.9%	88.9%	92.7%

Implication for SDWS 23 / 27

For an integral-over-time deployment metric, the time-weighted misclassification rate is ~7% (~112 of 1,595 points). Because readings are evenly spaced in time, this corresponds to ~7 minutes of misclassified state per 100 minutes observed. Closed Pipe Flow at 95.8% gives a ~4-min-per-100 error budget; Bucket Dispenser at 90.7% gives ~9 min per 100.

Limitations & next steps

Sensor cadence is the dominant limit

At the snapshot sample rate (1 reading per ~6 min), 15-min Flowing windows yield only 2–3 sample points. The sustained-monotonic-run features lose statistical power below 4 samples per segment. Returning the firmware to 1-min cadence is the single change with the largest expected impact on Flowing recall.
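The 2–3 point figure follows from simple arithmetic, sketched below under the assumption of perfectly even sampling and a half-open window; the helper name is hypothetical. How many readings land in a window depends on where the window starts relative to the sampling grid.

```python
import math


def samples_in_window(window_min, cadence_min):
    """Fewest and most evenly spaced readings that can fall in a window."""
    ratio = window_min / cadence_min
    return math.floor(ratio), math.ceil(ratio)


print(samples_in_window(15, 6))  # (2, 3)   current cadence
print(samples_in_window(15, 1))  # (15, 15) proposed 1-min cadence
```

At 1-min cadence the same 15-min window clears the 4-sample level noted above by a wide margin.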

Recommended next experiments

Re-run the Closed protocol at 1-min sensor cadence; expected outcome is Flowing recall > 90%.
Lengthen Flowing windows to 30 min in protocol design at 1-min cadence.
Calibrate per-site flow rate so the classifier output can be reported in units of dispensed volume (litres) rather than time-fraction.

Reproducibility

This analysis is fully reproducible from three static snapshot files taken on 2026-04-27; a fourth file holds the computed results. The live tool at piped-flow-test.pages.dev may show slightly different numbers as new annotations are added or sensor data accumulates; this page reports a frozen-in-time view.

Data files

File	Contents
`annotations-snapshot-2026-04-27.json`	827 operator annotations covering the two enabled experiments.
`diagnostics-snapshot-2026-04-27.json`	2,445 diagnostic readings (UVLED / SiPM / board temperature) for sensor 50051.
`tof-snapshot-2026-04-27.json`	2,461 ToF readings for sensor 50051.
`analysis-results-2026-04-27.json`	Computed per-experiment confusion matrices and per-class metrics.

Pipeline

Filter annotations to active experiments only (firmware-bug window excluded).
Build per-segment spans, clipping at experiment boundaries.
Join each /diagnostics reading to its nearest /tof reading within ±2 min.
Compute 7 features per span.
Skip segments with <2 points.
Train one KNN classifier per experiment under leave-one-region-out cross-validation.
Expand segment predictions to points, apply Air-strip rule, majority-vote final label.
Assemble point-weighted confusion matrices; compute overall accuracy + Cohen’s κ.
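The nearest-reading join in the pipeline can be sketched with a binary search. This assumes both streams are sorted lists of epoch-second timestamps; the function name and the example timestamps are hypothetical.

```python
from bisect import bisect_left


def join_nearest(diag_times, tof_times, tolerance_s=120):
    """Pair each diagnostics timestamp with the nearest ToF timestamp.

    Both inputs are sorted lists of epoch seconds. A diagnostics reading
    with no ToF reading within tolerance_s is paired with None.
    """
    pairs = []
    for t in diag_times:
        i = bisect_left(tof_times, t)
        # Only the readings just before and just after t can be nearest.
        candidates = tof_times[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda c: abs(c - t), default=None)
        if best is not None and abs(best - t) > tolerance_s:
            best = None
        pairs.append((t, best))
    return pairs


diag = [0, 360, 720, 1080]
tof = [30, 350, 900]
print(join_nearest(diag, tof))
# [(0, 30), (360, 350), (720, None), (1080, None)]
```

Unmatched readings are kept with `None` here; whether the real pipeline drops them or keeps them is not stated in the source.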