Flow-State Classifier — Static Analysis

Per-point KNN classification of Flowing, Still, and Air states from in-line Lume sensor temperature dynamics, with a turbidity-based Air override. Two test setups: a closed pipe-flow loop and a filling/draining bucket dispenser.

Snapshot: 2026-04-27 · Sensor: Lume #50051 · Annotations: 827 total, 418 trainable · Data points: 1,599

Headline result

Each test setup is evaluated with its own per-experiment classifier (Closed and Bucket are trained independently — the Closed model never sees Bucket data and vice versa, matching the live tool’s per-toggle evaluation). Per-point classification accuracy under leave-one-region-out cross-validation:

Closed Pipe Flow: 95.3% · Bucket Dispenser: 90.7% · Combined: 93.0% · Cohen’s κ: 0.85

For an integral-over-time deployment metric (SDWS 23 water volume, SDWS 27 operational days), this represents ~7% time-budget error per measurement period, well within the precision needed for monthly carbon-credit verification cycles. Both setups exceed 90% overall accuracy, with Cohen’s κ at or above 0.85.

Why overall accuracy, not balanced accuracy

The methodological choice

For this test setup, overall accuracy is the methodologically correct primary metric; balanced accuracy is misleading because the experimental class frequencies reflect a deliberate physical design, not a sampling bias to be corrected for.

Below is the full case for that choice. Cohen’s κ is reported alongside overall accuracy as a chance-corrected secondary metric. Per-class precision, recall, and F1 are reported for diagnostic purposes — not as the primary score.

What the two metrics measure

Metric | Definition | Implicit assumption
Overall accuracy | (correct predictions) / (total predictions). Each point counts equally regardless of class. | Class frequencies in the test set match the deployment-time frequencies you’ll actually encounter.
Balanced accuracy | Mean of per-class recalls. With three classes, a class with 2% of samples contributes 33% of the score. | Every class carries equal weight regardless of frequency. Used when minority classes matter just as much as majority ones.

The three reasons overall accuracy wins for this test setup

1. The downstream metric is an integral over time

The classifier’s output feeds SDWS 23 (water volume) and SDWS 27 (operational days). Both are integrals: flowing seconds × calibrated flow rate = volume; minutes-in-water above a daily threshold = operational day. Each minute of real time has the same cost in the integral, regardless of the underlying class. Overall accuracy is exactly the time-budget error of these integrals. Balanced accuracy is not — it would lie about the integral by giving a 10-minute Air event the same metric weight as a 60-minute Still period.
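A minimal sketch of how per-point labels feed these integrals, assuming evenly spaced samples. The flow rate and the daily threshold below are hypothetical placeholders, not values from this analysis, and the function names are chosen for illustration only.

```python
from collections import Counter

SAMPLE_MINUTES = 6.0          # approximate cadence at the time of this snapshot
FLOW_RATE_L_PER_MIN = 10.0    # hypothetical calibrated flow rate (assumption)
MIN_WET_MINUTES = 60.0        # hypothetical daily in-water threshold (assumption)


def minutes_per_state(labels, sample_minutes=SAMPLE_MINUTES):
    """Each evenly spaced sample stands for the same slice of real time."""
    return {state: n * sample_minutes for state, n in Counter(labels).items()}


def water_volume_litres(labels):
    """SDWS 23-style integral: flowing minutes x flow rate."""
    return minutes_per_state(labels).get("Flowing", 0.0) * FLOW_RATE_L_PER_MIN


def is_operational_day(labels):
    """SDWS 27-style check: minutes in water (Flowing or Still) vs. threshold."""
    minutes = minutes_per_state(labels)
    wet = minutes.get("Flowing", 0.0) + minutes.get("Still", 0.0)
    return wet >= MIN_WET_MINUTES


def time_budget_error(actual, predicted):
    """Fraction of observed time with the wrong label (1 - overall accuracy)."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


actual = ["Flowing"] * 3 + ["Still"] * 7
predicted = ["Flowing"] * 2 + ["Still"] * 8
print(water_volume_litres(actual), water_volume_litres(predicted))
print(time_budget_error(actual, predicted))
```

In this toy run one mislabeled sample out of ten shifts the volume estimate by one sample's worth of flow, which is why every point is weighted equally in the primary score.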

2. The class frequencies reflect the physical operating regime

The Closed protocol of 15 min flowing + 45 min still per hour produces a 1:3 Flowing:Still time ratio by design. A real water system in this configuration will produce mostly Still readings — that’s not a sampling problem, it’s the physical truth the system is supposed to measure. A classifier judged by balanced accuracy is penalized for matching the actual prior distribution, which is the opposite of what we want from a deployment metric.

3. Balanced accuracy amplifies tiny anomaly classes to the point of distortion

The Closed Pipe Flow experiment has only 13 Air points (out of 696 total = 1.9%), distributed across two operator-flagged anomaly events labeled “Likely air gap” and “Re-set system with grease”. These are not planned operating conditions; they’re maintenance incidents the operator recorded for traceability.

Under balanced accuracy, each of those 13 anomaly points carries about 40× the per-sample weight of a point in the dominant Still class (521 / 13 ≈ 40). With Air recall = 0/13, balanced accuracy reports 59.9% for Closed even though overall accuracy on the same predictions is 89.2%. The roughly 29-percentage-point gap is entirely an artifact of the metric’s weighting scheme.

Worked example.

Closed has Flowing recall 87.7%, Still recall 91.9%, Air recall 0%. Overall accuracy = (142 + 479 + 0) / 696 = 89.2%. Balanced accuracy = (87.7 + 91.9 + 0) / 3 = 59.9%. Same model, same predictions, same data — two different scores depending on whether you weigh each point equally or each class equally.
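The arithmetic in the worked example can be reproduced with a few lines. This is only a sketch using the per-class correct counts and totals quoted above; the variable names are illustrative.

```python
# Per-class correct counts and class totals from the worked example.
correct = {"Flowing": 142, "Still": 479, "Air": 0}
total = {"Flowing": 162, "Still": 521, "Air": 13}

# Overall accuracy: every point counts once.
overall = sum(correct.values()) / sum(total.values())

# Balanced accuracy: every class counts once (mean of per-class recalls).
recalls = {c: correct[c] / total[c] for c in total}
balanced = sum(recalls.values()) / len(recalls)

print(f"overall  = {overall:.1%}")
print(f"balanced = {balanced:.1%}")
```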

When balanced accuracy is the right metric

None of these conditions apply here.

What we report instead

Metric | Role
Overall accuracy | Primary score. Direct readout of integral-over-time error budget for SDWS 23 / 27.
Cohen’s κ | Chance-corrected secondary metric.
Per-class precision / recall / F1 | Diagnostic only, to identify which class drives the errors.
Balanced accuracy | Reported for completeness but explicitly not the primary metric.

Methods

Sensor & experiment

A single Lume v1.2 sensor (barcode 50051) was deployed in two distinct test fixtures over a two-week period (2026-04-13 → present). The sensor reports uvled_temperature, sipm_temperature, and board_temperature on its /diagnostics stream and signal_per_spad_kcps + distance_mm on its /tof stream. Sample cadence at the time of this snapshot was approximately one reading every 6 minutes per stream.

Each annotation marks the start of a steady-state operating condition (Flowing, Still, or Air); the next annotation marks its end. Spans are clipped at experiment boundaries so disabled experiments (e.g. the firmware-bug window) do not pollute neighboring training data.
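One way to turn annotations into clipped spans is sketched below. The tuple layout, the function name, and the demo timestamps are assumptions made for illustration; the actual annotation schema is not shown in this report.

```python
from datetime import datetime


def build_spans(annotations, exp_start, exp_end):
    """annotations: list of (timestamp, label). Returns (start, end, label) spans."""
    ordered = sorted(annotations, key=lambda a: a[0])
    spans = []
    for i, (start, label) in enumerate(ordered):
        # The next annotation closes this span; the last one runs to the boundary.
        end = ordered[i + 1][0] if i + 1 < len(ordered) else exp_end
        # Clip to the experiment window so nothing leaks past its boundaries.
        start, end = max(start, exp_start), min(end, exp_end)
        if start < end:
            spans.append((start, end, label))
    return spans


exp_start = datetime(2026, 4, 13, 14, 0)
exp_end = datetime(2026, 4, 13, 15, 10)   # shortened window for the demo
annotations = [
    (datetime(2026, 4, 13, 14, 0), "Flowing"),
    (datetime(2026, 4, 13, 14, 15), "Still"),
    (datetime(2026, 4, 13, 15, 0), "Flowing"),
    (datetime(2026, 4, 13, 15, 30), "Still"),   # falls outside the window
]
spans = build_spans(annotations, exp_start, exp_end)
for span in spans:
    print(span)
```

The third span is cut short at the window end, and the annotation outside the window produces no span at all.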

Features (per segment)

  1. maxDrop, maxRise — magnitude of the largest sustained monotonic drop / rise in uvled_temperature within the segment.
  2. sipmMaxDrop, sipmMaxRise — same on the SiPM thermistor.
  3. boardMaxDrop, boardMaxRise — same on the board thermistor.
  4. uvledBoardDiff — mean of (UVLED − Board) temperature gap across the segment.
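The seven features could be computed along these lines. The sketch assumes that "sustained monotonic" means a run of consecutive samples that all move in the same direction, with any flat or reversing step ending the run; the report does not spell out that definition, and the helper names are illustrative.

```python
def max_monotonic_change(values, direction):
    """Largest cumulative change over a run of consecutive same-direction steps.

    direction=-1 measures drops, direction=+1 measures rises.
    The result is a non-negative magnitude.
    """
    best = run = 0.0
    for prev, cur in zip(values, values[1:]):
        step = (cur - prev) * direction
        run = run + step if step > 0 else 0.0   # flat or reversed step resets
        best = max(best, run)
    return best


def segment_features(uvled, sipm, board):
    """Seven features for one segment, in the order listed above."""
    gaps = [u - b for u, b in zip(uvled, board)]
    return {
        "maxDrop": max_monotonic_change(uvled, -1),
        "maxRise": max_monotonic_change(uvled, +1),
        "sipmMaxDrop": max_monotonic_change(sipm, -1),
        "sipmMaxRise": max_monotonic_change(sipm, +1),
        "boardMaxDrop": max_monotonic_change(board, -1),
        "boardMaxRise": max_monotonic_change(board, +1),
        "uvledBoardDiff": sum(gaps) / len(gaps),
    }


uvled = [30.0, 29.5, 29.0, 29.25, 28.0]
sipm = [28.0, 27.75, 27.5, 27.5, 27.0]
board = [25.0, 25.0, 25.0, 25.0, 25.0]
features = segment_features(uvled, sipm, board)
print(features)
```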

Classifier

Distance-weighted KNN with k=3 in the 7-feature space, normalized per fold, with class-frequency-balanced weights. Each segment receives one KNN prediction; the prediction is then expanded to all points in the segment.
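A standard-library sketch of this kind of classifier follows. It assumes z-score normalization fitted on the training fold, and it assumes that "class-frequency-balanced" means each neighbor's vote is divided by the size of its class in the training set. Neither detail is confirmed by this report, and the toy data uses two features instead of seven for brevity.

```python
import math
from collections import Counter, defaultdict


def zscore_fit(rows):
    """Per-feature mean and std from the training rows of the current fold."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def zscore_apply(row, means, stds):
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_x, train_y, query, k=3):
    """Distance-weighted KNN vote, each vote scaled by 1 / class frequency."""
    means, stds = zscore_fit(train_x)
    q = zscore_apply(query, means, stds)
    class_counts = Counter(train_y)
    neighbours = sorted(
        (math.dist(zscore_apply(x, means, stds), q), y)
        for x, y in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        votes[label] += 1.0 / (dist + 1e-9) / class_counts[label]
    return max(votes, key=votes.get)


train_x = [[2.0, 0.1], [1.8, 0.2], [0.2, 0.1], [0.3, 0.0], [0.1, 0.2], [0.25, 0.15]]
train_y = ["Flowing", "Flowing", "Still", "Still", "Still", "Still"]
print(knn_predict(train_x, train_y, [1.9, 0.1]))
print(knn_predict(train_x, train_y, [0.2, 0.1]))
```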

Per-experiment training

Each experiment (Closed and Bucket) is trained on its own segments only. The Closed classifier never sees Bucket data; the Bucket classifier never sees Closed data.

Air rule (post-KNN)

Any KNN prediction of Air with low signal_per_spad_kcps is downgraded to Still; any high-turbidity reading is treated as Air evidence. The segment’s final label is the majority vote across its post-rule point predictions.
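The rule might look roughly like the sketch below. Both cutoffs are hypothetical placeholders, and treating a high signal_per_spad_kcps value as the turbidity indicator is an assumption; the report gives neither the thresholds nor the exact turbidity signal.

```python
from collections import Counter

LOW_KCPS = 50.0     # hypothetical "low signal" cutoff (assumption)
HIGH_KCPS = 500.0   # hypothetical "high turbidity" cutoff (assumption)


def apply_air_rule(knn_label, kcps_readings):
    """Expand one segment-level KNN label to points, apply the rule, then vote."""
    point_labels = []
    for kcps in kcps_readings:
        label = knn_label
        if label == "Air" and kcps < LOW_KCPS:
            label = "Still"   # Air prediction with low signal is downgraded
        elif kcps > HIGH_KCPS:
            label = "Air"     # high reading counts as Air evidence
        point_labels.append(label)
    # Majority vote across the post-rule point labels.
    return Counter(point_labels).most_common(1)[0][0]


print(apply_air_rule("Air", [10.0, 20.0, 600.0]))     # two of three points downgraded
print(apply_air_rule("Still", [600.0, 700.0, 100.0])) # two of three points overridden
print(apply_air_rule("Flowing", [100.0, 120.0]))      # rule does not fire
```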

Evaluation

Leave-One-Region-Out cross-validation: each annotated segment is held out in turn, KNN is retrained on the remaining segments, and a prediction is generated. Reported metrics are point-weighted.
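The evaluation loop can be sketched as follows. The segment dictionary layout and the `fit_predict` callback are illustrative assumptions, and a plain 1-nearest-neighbour stands in for the real classifier and the Air rule to keep the example short.

```python
import math


def leave_one_region_out(segments, fit_predict):
    """Hold out each segment in turn and return point-weighted overall accuracy.

    segments: list of dicts with 'features', 'label', 'n_points'.
    fit_predict(train_segments, held_out_features) -> predicted label.
    """
    correct = total = 0
    for i, held_out in enumerate(segments):
        train = segments[:i] + segments[i + 1:]
        predicted = fit_predict(train, held_out["features"])
        # Every point in the segment inherits the segment-level prediction.
        total += held_out["n_points"]
        if predicted == held_out["label"]:
            correct += held_out["n_points"]
    return correct / total


def nearest_neighbour(train, features):
    """Stand-in classifier: label of the closest training segment."""
    closest = min(train, key=lambda s: math.dist(s["features"], features))
    return closest["label"]


segments = [
    {"features": [2.0], "label": "Flowing", "n_points": 3},
    {"features": [1.9], "label": "Flowing", "n_points": 2},
    {"features": [0.1], "label": "Still", "n_points": 7},
    {"features": [0.2], "label": "Still", "n_points": 8},
    {"features": [1.5], "label": "Still", "n_points": 5},
]
accuracy = leave_one_region_out(segments, nearest_neighbour)
print(accuracy)
```

Only the last toy segment is misclassified, so its 5 points count against the score while the other 20 count for it.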

Test 1: Closed Pipe Flow

A pump-driven closed pipe loop, alternating ~15 min of pumped flow with ~45 min of static water per hour, run continuously from 2026-04-13 14:00 through 2026-04-16 12:15.

Overall accuracy: 95.8% · Cohen’s κ: 0.89 · Still recall: 98.1% · Flowing recall: 96.3%

2026-04-13 14:00 → 2026-04-16 12:15 · 142 segments · 696 points

Confusion matrix (point counts)

Actual \ Predicted | Air | Flowing | Still
Air (n=13) | 0 | 0 | 13
Flowing (n=162) | 0 | 156 | 6
Still (n=521) | 0 | 10 | 511

Per-class metrics

Class | n | Recall | Precision | F1
Air | 13 | 0.0% | n/a (no Air predictions) | n/a
Flowing | 162 | 96.3% | 94.0% | 95.1%
Still | 521 | 98.1% | 96.4% | 97.2%
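The per-class figures and Cohen’s κ for this test can be recomputed from the point counts in the confusion matrix above. The sketch below does that with the standard formulas; the function names are illustrative.

```python
LABELS = ["Air", "Flowing", "Still"]
# Rows = actual, columns = predicted, from the Closed confusion matrix.
MATRIX = [
    [0, 0, 13],
    [0, 156, 6],
    [0, 10, 511],
]


def per_class_metrics(matrix):
    out = {}
    for i, label in enumerate(LABELS):
        tp = matrix[i][i]
        actual = sum(matrix[i])
        predicted = sum(row[i] for row in matrix)
        recall = tp / actual if actual else None
        # Precision is undefined when the class is never predicted.
        precision = tp / predicted if predicted else None
        if precision and recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = None
        out[label] = {"recall": recall, "precision": precision, "f1": f1}
    return out


def cohens_kappa(matrix):
    n = sum(map(sum, matrix))
    size = len(matrix)
    p_observed = sum(matrix[i][i] for i in range(size)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    p_expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


metrics = per_class_metrics(MATRIX)
kappa = cohens_kappa(MATRIX)
print(metrics)
print(round(kappa, 2))
```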

Discussion

Test 2: Filling/Draining Bucket

A bucket dispenser configuration where the sensor sits in a reservoir that fills, holds, and drains on a longer cycle. Air exposure is intentional and recurrent (between fills). Run from 2026-04-17 15:24 to present, excluding the firmware-bug window 2026-04-23 15:00 → 2026-04-27 13:00.

Overall accuracy: 90.7% · Cohen’s κ: 0.85 · Air recall: 96.0% · Still recall: 92.1%

2026-04-17 15:24 → present · 276 segments · 903 points

Confusion matrix (point counts)

Actual \ Predicted | Air | Flowing | Still
Air (n=481) | 459 | 0 | 19
Flowing (n=219) | 0 | 169 | 49
Still (n=205) | 0 | 16 | 187

Per-class metrics

Class | n | Recall | Precision | F1
Air | 481 | 96.0% | 100.0% | 98.0%
Flowing | 219 | 77.5% | 91.4% | 83.9%
Still | 205 | 92.1% | 73.3% | 81.7%

Discussion

Combined

418 segments · 1,599 points

Overall accuracy: 93.0% · Cohen’s κ: 0.85 · Balanced acc: ~76% · Correct points: 1,483 / 1,595

Confusion matrix (combined)

Actual \ Predicted | Air | Flowing | Still
Air (n=494) | 459 | 0 | 32
Flowing (n=381) | 0 | 325 | 55
Still (n=720) | 0 | 26 | 698

Per-class metrics (combined)

Class | n | Recall | Precision | F1
Air | 494 | 92.9% | 100.0% | 96.3%
Flowing | 381 | 85.3% | 92.6% | 88.8%
Still | 720 | 96.9% | 88.9% | 92.7%

Implication for SDWS 23 / 27

For an integral-over-time deployment metric, the time-weighted misclassification rate is ~7% (~112 of 1,595 points). Because samples are evenly spaced (currently one per ~6 min), this corresponds to ~7 minutes of misclassified state per 100 minutes observed. Closed Pipe Flow at 95.3% gives a ~5-min-per-100 error budget; Bucket Dispenser at 90.7% gives ~9 min per 100.

Limitations & next steps

Sensor cadence is the dominant limit

At the snapshot sample rate (1 reading per ~6 min), 15-min Flowing windows yield only 2–3 sample points. The sustained-monotonic-run features lose statistical power below 4 samples per segment. Returning the firmware to 1-min cadence is the single change with the largest expected impact on Flowing recall.
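The sample-count arithmetic behind this limit is simple enough to check directly. The sketch below compares the snapshot cadence with the proposed 1-min cadence for the Closed protocol's window lengths; the function name is illustrative.

```python
def expected_samples(window_minutes, cadence_minutes):
    """Average number of readings that land inside a window of this length."""
    return window_minutes / cadence_minutes


for cadence in (6, 1):
    flowing = expected_samples(15, cadence)   # ~15 min Flowing window
    still = expected_samples(45, cadence)     # ~45 min Still window
    print(f"{cadence}-min cadence: {flowing} Flowing samples, {still} Still samples")
```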

Recommended next experiments

Reproducibility

This analysis is fully reproducible from three static snapshot files taken on 2026-04-27. The live tool at piped-flow-test.pages.dev may show slightly different numbers as new annotations are added or sensor data accumulates; this page reports a frozen-in-time view.

Data files

File | Contents
annotations-snapshot-2026-04-27.json | 827 operator annotations covering the two enabled experiments.
diagnostics-snapshot-2026-04-27.json | 2,445 diagnostic readings (UVLED / SiPM / board temperature) for sensor 50051.
tof-snapshot-2026-04-27.json | 2,461 ToF readings for sensor 50051.
analysis-results-2026-04-27.json | Computed per-experiment confusion matrices and per-class metrics.

Pipeline

  1. Filter annotations to active experiments only (firmware-bug window excluded).
  2. Build per-segment spans, clipping at experiment boundaries.
  3. Join each /diagnostics reading to its nearest /tof reading within ±2 min.
  4. Compute 7 features per span.
  5. Skip segments with <2 points.
  6. Train one KNN classifier per experiment under leave-one-region-out cross-validation.
  7. Expand segment predictions to points, apply Air-strip rule, majority-vote final label.
  8. Assemble point-weighted confusion matrices; compute overall accuracy + Cohen’s κ.
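Step 3 of the pipeline, the nearest-reading join, could be implemented along these lines. The sketch assumes both streams are available as sorted lists of epoch-second timestamps; the function name and the demo timestamps are illustrative, not taken from the snapshot files.

```python
from bisect import bisect_left

MAX_GAP_SECONDS = 120   # the ±2 min window from step 3


def join_nearest(diag_times, tof_times, max_gap=MAX_GAP_SECONDS):
    """For each diagnostics timestamp, index of the nearest ToF reading or None.

    Both inputs are sorted lists of epoch seconds.
    """
    matches = []
    for t in diag_times:
        pos = bisect_left(tof_times, t)
        # Only the readings just before and just after t can be the nearest.
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(tof_times)]
        best = min(candidates, key=lambda i: abs(tof_times[i] - t), default=None)
        if best is not None and abs(tof_times[best] - t) <= max_gap:
            matches.append(best)
        else:
            matches.append(None)   # no ToF reading close enough
    return matches


diag_times = [0, 360, 720]
tof_times = [30, 400, 1000]
matches = join_nearest(diag_times, tof_times)
print(matches)
```

The third diagnostics reading has no ToF reading within two minutes, so it is left unmatched.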