Four Sensors, Three Rewrites, One Working Breath Detector

Building a real-time breath detector for iOS looked like a weekend project. It took four sensing approaches, three complete rewrites, eight unexpected bugs, and one key moment where the tech lead proposed a different frame for the problem — and everything finally clicked.


A post-mortem on what it actually takes to make a phone feel your lungs move


The idea

An app that listens to your breathing, matches a metronome, and scores your compliance. Simple wellness concept. Diaphragmatic breathing at 5.5-second intervals is measurably calming — the research is solid, the use case is clear. What turned out not to be simple was the sensing.

The phone has a microphone. It has a speaker. It has an accelerometer and a gyroscope — together, an inertial measurement unit (IMU). AirPods have their own sensors. In theory, any of these could detect the rhythmic expansion and contraction of your chest and belly. In practice, each approach failed in a different way before one finally worked.

This is the story of four sensing strategies, three full rewrites of the detection engine, what an AI copilot got right and what it got wrong, and why testing a physical sensor app is fundamentally different from testing software.


Approach 1 — Microphone

The idea

The most obvious starting point: breath is audible. A person breathing near a microphone produces detectable sound — the turbulence of air through the nose and mouth, the soft percussion of the chest. Point the microphone at the user, classify the sound, detect phases.

Apple provides SNClassifySoundRequest in the SoundAnalysis framework, which runs a built-in Core ML model that classifies environmental sounds in real time. "Breathing" is one of its known classes.

Why it failed

Sensitivity to environment. SNClassifySoundRequest is designed for ambient sound classification, not proximity breath detection. In a quiet room with the phone 30 cm away, confidence for "breathing" was inconsistent. A fan, a TV in the next room, or keyboard clicks degraded detection to noise.

The phone isn't near your mouth. The use case is the phone lying on your chest or belly — the microphone is pointing away from your face, and your breath is happening 50–60 cm from the sensor. Sound intensity falls off with the square of distance.

No phase information. Even when breath sounds were detected, the model classified them as "breathing" — not inhale vs. exhale. The acoustic difference between the two is subtle and speaker-dependent. Building a custom classifier would require labeled training data from many users and a model running continuously at low latency. Battery and latency both become problems fast.

Privacy optics. A wellness app requiring always-on microphone is a harder sell. Users notice the orange indicator.

Abandoned after a week of prototyping.


Approach 2 — Speaker (FMCW acoustic sensing)

The idea

If passive listening doesn't work, use active sonar. Frequency-Modulated Continuous-Wave (FMCW) is how modern automotive radar detects objects at millimeter precision. The same principle works acoustically: emit a frequency chirp from the speaker, capture the reflection with the microphone, and look for phase changes in the reflected signal caused by the moving chest wall.

The iPhone speaker and microphone are separated by only a few centimeters. If the phone is placed on the chest, the speaker emits a chirp in the 18–22 kHz range (near-ultrasound, inaudible), the chest reflects it, and the microphone picks up the reflection. As the chest rises and falls 0.5–2 cm per breath cycle, the path length changes, and the reflected frequency shifts by a detectable amount.

Research papers have demonstrated millimeter-accurate chest wall tracking with phone speakers. This seemed like the elegant solution — no external hardware, pure software.

Why it failed

Acoustic coupling is terrible on a soft body. FMCW radar works on rigid reflectors. A chest wall covered by clothing, skin, and subcutaneous fat is a poor acoustic reflector — it absorbs and scatters rather than reflecting coherently. The SNR of the reflected chirp was too low to extract meaningful phase information.

Speaker-to-mic bleedthrough. On a phone at rest, the direct path from speaker to microphone through the device body is stronger than the reflected path from the chest. Isolating the reflection required careful deconvolution that was feasible in a lab but not robust across different phone models and surface types.

Chirp generation variability. Generating a precise FMCW chirp (instantaneous frequency = f₀ + kt) with accurate timing and doing the cross-correlation in real time on-device got deep into vDSP signal processing before hitting a wall: acoustic environment variability was larger than the breathing signal itself.
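For concreteness, the phase of a linear chirp is the integral of its instantaneous frequency: φ(t) = 2π(f₀t + kt²/2). A minimal sketch of generating one chirp buffer in the 18–22 kHz band described above — the sample rate, chirp duration, and all names here are illustrative assumptions, and feeding the buffer to the audio engine is omitted:

```swift
import Foundation

// Generate one linear FMCW chirp sweeping f0 → f1 over `duration` seconds.
// Phase is the integral of instantaneous frequency f(t) = f0 + k·t,
// so φ(t) = 2π(f0·t + k·t²/2). Parameters are illustrative, not the app's.
func fmcwChirp(f0: Double = 18_000, f1: Double = 22_000,
               duration: Double = 0.04, sampleRate: Double = 48_000) -> [Float] {
    let n = Int(duration * sampleRate)
    let k = (f1 - f0) / duration // sweep rate in Hz per second
    return (0..<n).map { i in
        let t = Double(i) / sampleRate
        let phase = 2 * Double.pi * (f0 * t + 0.5 * k * t * t)
        return Float(sin(phase))
    }
}
```

The hard part, as described above, isn't generating this buffer — it's recovering the reflected copy from underneath the speaker-to-mic bleedthrough.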

Abandoned after two weeks. The concept is physically sound — Apple reportedly uses a variant internally — but making it work robustly in consumer conditions without hardware-level acoustic isolation is a significant research problem, not a weekend implementation.


Approach 3 — AirPods

The idea

AirPods Pro have an accelerometer, a gyroscope, and three microphones. When worn during breathing exercises, they're physically coupled to the head — and the head moves slightly with each breath. CMHeadphoneMotionManager provides head pose (roll, pitch, yaw) and acceleration from AirPods at up to 100 Hz.

Breathing causes subtle head movement, particularly when lying down. The chest rise physically transmits through the spine and neck to produce a measurable rocking motion. This seemed worth testing — existing hardware, no new sensors.

Why it failed

The signal is even weaker. The head is further from the breathing axis than the chest is. Chest wall displacement per breath is 0.5–2 cm; head displacement from that breath, transmitted through the spine, is a fraction of a millimeter.

Head movement dominates. Any actual head movement — swallowing, small adjustments, looking slightly to one side — produces signals 10–100× larger than the breathing-induced motion. Breathing is the smallest thing the head is doing at any moment.

Wearing requirement. The core use case is a phone lying on your belly during a session. Requiring the user to also wear AirPods adds friction, changes the use case, and introduces a dependency on Bluetooth state.

Abandoned after a few days of signal analysis. The data was clean; the signal-to-noise ratio for breathing specifically was insufficient.

Sensing approach comparison

| Approach | Signal quality | Effort | Reliability | Outcome |
| --- | --- | --- | --- | --- |
| Microphone (SNClassifySoundRequest) | Poor — distance + ambient noise | Low | Inconsistent | Abandoned |
| Speaker FMCW (18–22 kHz chirp + mic) | Theoretical only | Very high | Bleedthrough dominates | Abandoned |
| AirPods (CMHeadphoneMotionManager) | Very poor — head too far | Medium | Head noise dominates | Abandoned |
| Accelerometer (CoreMotion + PCA) | Good — direct coupling | High — 3 rewrites | Good with calibration | Shipped |

Approach 4 — Phone accelerometer (what finally worked)

The fourth approach was the most obvious one, saved for last: the phone itself, lying on the chest, measuring the chest's movement directly.

Place an iPhone face-up on your belly. Your abdomen rises on inhale and falls on exhale. That rise and fall moves the phone by 0.5–2 cm, producing a change in the accelerometer reading of approximately 0.008–0.040 G. That's a real, direct, physical signal — no reflection, no transmission through bone, no ambient noise interference.

The accelerometer samples at 50 Hz. The breathing band is 0.1–0.8 Hz (6–48 BPM). The signal exists. The engineering problem is extracting it cleanly, which required three complete rewrites.


The three accelerometer iterations

v1 — Z-axis with position normalization

The pipeline: Raw Z at 50 Hz → high-pass filter (removes gravity) → low-pass filter (removes noise above 1.5 Hz) → 750-sample buffer → detection.

Detection logic: Normalize the current sample against the recent 250-sample window: pos = (current − wMin) / (wMax − wMin). If pos > 0.60, inhale. If pos < 0.40, exhale. The normalization is self-calibrating — it adapts to different breathing depths and body types without any fixed amplitude threshold.
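The v1 detection step is small enough to sketch directly. Names and types here are illustrative, not the app's actual code; the 0.60/0.40 thresholds and the hold-in-between behavior are from the description above:

```swift
// Sketch of the v1 position-normalization detector. `window` is the recent
// 250-sample buffer of filtered Z-axis values (illustrative names).
enum BreathPhase { case inhale, exhale, hold }

func detectPhase(current: Double, window: [Double], previous: BreathPhase) -> BreathPhase {
    guard let wMin = window.min(), let wMax = window.max(), wMax > wMin else {
        return previous // degenerate or empty window: keep the last phase
    }
    let pos = (current - wMin) / (wMax - wMin) // self-calibrating 0…1 position
    if pos > 0.60 { return .inhale }
    if pos < 0.40 { return .exhale }
    return previous // dead zone between thresholds: hold the current phase
}
```

The dead zone between 0.40 and 0.60 is what provides hysteresis — without it, noise near the midpoint would flip the phase on every sample.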

What broke: The placement transient. Setting the phone on your chest creates a step change of 0.5–0.8 G. The high-pass filter time constant is 5 seconds, so the spike takes 25–30 seconds to fully drain. With a 250-sample window, the spike locked the detector on "Inhale" for the first 5 seconds. Six different suppression approaches all failed — stability gates, zero-crossing, range checks — because any threshold that blocked the spike also blocked deep breathing.

The fix: A 3-second silent warmup. Don't fire detection for the first 150 samples. The spike drains, the buffer fills with clean signal, then detection starts.

V1 also added a Tibetan bowl metronome, guided cues, sync scoring, and session logging on top of the core detector.

v2 — Guided breathing coach

The core DSP from v1 was unchanged. V2 was about product: a 5.5-second metronome, per-window sync scoring (0–100%), session log, waveform display, and history.

Most bugs were state machine ordering bugs, not signal bugs. The metronome timer fired, set the expected phase, and on the very next line overwrote the sync result before the user's breathing had time to respond. Fixed with a 1.5-second asyncAfter delay for scoring evaluation.

The fundamental limitation of v2: the phone had to be placed precisely face-up. Any tilt changed the projection of breathing motion onto Z, weakening the signal. Put the phone at a 30° angle and detection degraded silently.

v3 — PCA calibration engine

The big architectural change: stop assuming Z is the breathing axis. Find it algorithmically.


The gyroscope detour

Before getting to PCA, the AI copilot suggested using the gyroscope to detect whether the user is holding the phone during calibration.

The problem it was solving: if a user holds the phone in their hand instead of placing it on their belly during the 66-second calibration phase, the training data is corrupted. The calibration produces thresholds and direction flags that don't match actual belly breathing.

The hypothesis: a hand holding a phone produces rotational noise (jerk in the gyro signal) that a phone lying still on a belly doesn't. Measure |Δgyro| per sample; if it stays elevated, flag the session.

Why it failed: The jerk magnitudes overlapped completely.

| Condition | Gyro jerk magnitude |
| --- | --- |
| Phone on belly, normal breathing | 0.008–0.012 rad/s |
| Phone held in still hand | 0.008–0.012 rad/s |
| Phone in hand, actively moving | 0.020–0.080 rad/s |

When the hand is genuinely still, the gyro is indistinguishable from belly placement. The discriminative signal only appears with active hand movement — not when a user carefully holds the phone following calibration prompts. To solve this properly would require FFT features across the full gyro spectrum, labeled data from dozens of users, and a trained classifier. The direction quality check (described below) catches most miscalibration cases instead.


Why calibration exists

The accelerometer axis problem has no static solution. The breathing signal lives at a different angle for every user:

  • Whether the phone is flat or tilted
  • Whether the user is lying down, reclined, or sitting
  • Whether they're breathing into their chest or their belly
  • Whether the phone is above or below the navel

A fixed Z-axis assumption works when conditions match and fails silently when they don't. The alternative: let the user calibrate the phone to their body before each session.

The calibration phase is 66 seconds of guided breathing — six full inhale/exhale cycles prompted by audio cues. During that time, the detector is doing three things simultaneously:

1. Locking the breathing axis (PCA)

All three accelerometer axes are buffered continuously. After enough samples accumulate, the detector runs Principal Component Analysis on the 3D data. PC1 — the first principal component — is the axis of maximum variance. During active breathing, the axis of maximum variance is the breathing axis, because that's the direction the body is moving the most. Every subsequent sample gets projected onto PC1 to produce a scalar signal, regardless of phone orientation.
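The axis lock reduces to a 3×3 covariance matrix plus power iteration. A minimal, self-contained sketch of that computation — plain arrays rather than the Accelerate/simd types a real implementation would use, with illustrative names:

```swift
// Find PC1: the dominant eigenvector of the covariance of buffered,
// mean-removed accelerometer samples. Illustrative sketch, not the
// shipped BreathDetector code.
struct Vec3 { var x, y, z: Double }

func dominantAxis(_ samples: [(x: Double, y: Double, z: Double)],
                  iterations: Int = 100) -> Vec3 {
    let n = Double(samples.count)
    let mx = samples.map { $0.x }.reduce(0, +) / n
    let my = samples.map { $0.y }.reduce(0, +) / n
    let mz = samples.map { $0.z }.reduce(0, +) / n
    // Build the symmetric 3×3 covariance matrix
    var c = [[Double]](repeating: [Double](repeating: 0, count: 3), count: 3)
    for s in samples {
        let d = [s.x - mx, s.y - my, s.z - mz]
        for i in 0..<3 { for j in 0..<3 { c[i][j] += d[i] * d[j] / n } }
    }
    // Power iteration: repeated multiply-and-normalize converges to the
    // eigenvector with the largest eigenvalue — the axis of max variance.
    var v = [0.0, 0.0, 1.0] // seed near Z, the common flat-on-belly case
    for _ in 0..<iterations {
        var w = [0.0, 0.0, 0.0]
        for i in 0..<3 { for j in 0..<3 { w[i] += c[i][j] * v[j] } }
        let len = (w[0]*w[0] + w[1]*w[1] + w[2]*w[2]).squareRoot()
        if len < 1e-12 { break }
        v = w.map { $0 / len }
    }
    return Vec3(x: v[0], y: v[1], z: v[2])
}
```

Each subsequent sample is then reduced to a scalar via a dot product with the returned axis.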

2. Learning per-user thresholds

Peak slope magnitudes are recorded separately for labeled inhale and exhale windows. The detection threshold is set at 30% of the median peak — large enough to reject sensor tremor and noise, small enough to catch real breathing including gentle breathers. Shallow breathers get low thresholds; deep breathers get high ones. No fixed constant works for both.

3. Determining breathing direction

PCA gives you an axis but not a polarity. Projecting the 3D signal onto PC1 gives a scalar that correlates with breathing — but it might be positive on inhale or positive on exhale depending on phone orientation. The detector computes the signed mean slope separately for inhale-labeled windows and exhale-labeled windows during calibration. If the inhale mean slope exceeds the exhale mean slope, positive slope = inhale. Otherwise flip. Six real sessions confirmed that correct belly breathing always produces opposite-sign means with a ratio between −0.96 and −1.23.
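The polarity decision reduces to comparing two signed means. A sketch of that check, with same-sign means treated as an invalid calibration (correct belly breathing always produced opposite-sign means in the reference sessions); the enum and names are illustrative:

```swift
// Decide breathing polarity from labeled calibration windows.
// Opposite-sign means (ratio ≈ −1) indicate real respiratory coupling;
// same-sign or zero means suggest the phone wasn't on the body.
enum DirectionCalibration { case inhalePositive, inhaleNegative, invalid }

func calibrateDirection(inhaleSlopes: [Double], exhaleSlopes: [Double]) -> DirectionCalibration {
    guard !inhaleSlopes.isEmpty, !exhaleSlopes.isEmpty else { return .invalid }
    let im = inhaleSlopes.reduce(0, +) / Double(inhaleSlopes.count)
    let em = exhaleSlopes.reduce(0, +) / Double(exhaleSlopes.count)
    if im * em >= 0 { return .invalid }          // same sign: no coupling
    return im > em ? .inhalePositive : .inhaleNegative
}
```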


What the training phase tests

The 66-second calibration has three adversarial user states:

Correct belly breathing produces a strong, clean PC1 projection with alternating slope signs matching the prompts. The signed mean check passes, inhaleIsPositiveSlope is set correctly, thresholds calibrate to the user's actual amplitude. Happy path.

Hand-holding (phone not on body) produces high-frequency noise with no consistent rhythm. Training median amplitude is usually very low, or extremely high and noisy. The same-sign means check often catches this — without respiratory coupling driving direction-specific slopes, the means are both near zero or same-sign.

Reverse breathing (belly contracts on inhale instead of expanding) is the hardest case. Signal present, amplitude normal, but inhaleIsPositiveSlope gets set backwards. The early-flip logic handles most cases: within the first 50 active samples of the live session, if slope consistently contradicts the calibrated direction, flip once and lock. No perfect signal-only solution exists for the calibration phase itself — a post-calibration confirmation step ("take one deep breath in now") is the right UX fix.


The detection math

Once calibration is done and the live session starts, the per-sample pipeline is:

Step 1 — Gravity removal

grav_{t} = grav_{t-1} × (1 − α) + raw_{t} × α    [α = 0.0005, τ ≈ 40s]
accel_{t} = raw_{t} − grav_{t}

The exponential filter with α = 0.0005 tracks the gravity vector on a 40-second timescale — slow enough to ignore breathing, fast enough to follow gradual phone repositioning.

Step 2 — PCA projection

projection = accel.x × pc1.x + accel.y × pc1.y + accel.z × pc1.z

Dot product of the gravity-removed acceleration vector with the locked PC1 eigenvector. Output is a scalar in G.

Step 3 — Dual LP filter (slope extraction)

fastLP_{t} = fastLP_{t-1} × 0.92 + projection × 0.08
slowLP_{t} = slowLP_{t-1} × 0.985 + projection × 0.015
slope_{t}  = fastLP_{t} − slowLP_{t}

fastLP tracks the signal on a ~12-sample timescale (≈0.25s). slowLP tracks it on a ~65-sample timescale (≈1.3s). Their difference — the slope — represents the rate of change of the breathing signal while canceling DC drift. When the chest is actively rising, slope is large and signed. At the peak of an inhale (momentarily stationary), slope approaches zero.

Step 4 — Phase detection

signedSlope = inhaleIsPositiveSlope ? slope : −slope

if signedSlope > inhaleThreshold  → inhale
if signedSlope < −exhaleThreshold → exhale
else                              → hold current phase

inhaleThreshold = median(inhalePeaks) × 0.30 and exhaleThreshold = median(exhalePeaks) × 0.30 from calibration data. The 30% factor provides a noise margin while remaining sensitive to real breathing. Exhale is typically passive (smaller slope magnitude) so separate thresholds matter.
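Composed, steps 1–4 amount to a short per-sample state update. A sketch using the constants above; the struct, its example calibration values, and all names are illustrative stand-ins, not the shipped BreathDetector:

```swift
// Steps 1–4 of the live pipeline in one per-sample update (50 Hz input).
// Calibration outputs are hardcoded example values here; in the app they
// come from the 66-second calibration phase.
struct BreathPipeline {
    var pc1 = (x: 0.0, y: 0.0, z: 1.0)  // locked PC1 axis (example: flat phone)
    var inhaleIsPositiveSlope = true
    var inhaleThreshold = 0.004          // median(inhalePeaks) × 0.30 (example)
    var exhaleThreshold = 0.003          // median(exhalePeaks) × 0.30 (example)

    var grav = (x: 0.0, y: 0.0, z: 1.0)  // gravity estimate, seeded near 1 G on Z
    var fastLP = 0.0, slowLP = 0.0
    var phase = "hold"

    mutating func process(raw: (x: Double, y: Double, z: Double)) -> String {
        // Step 1 — gravity removal (α = 0.0005, τ ≈ 40 s at 50 Hz)
        let a = 0.0005
        grav.x = grav.x * (1 - a) + raw.x * a
        grav.y = grav.y * (1 - a) + raw.y * a
        grav.z = grav.z * (1 - a) + raw.z * a
        let ax = raw.x - grav.x, ay = raw.y - grav.y, az = raw.z - grav.z

        // Step 2 — projection onto the locked PC1 axis
        let projection = ax * pc1.x + ay * pc1.y + az * pc1.z

        // Step 3 — dual low-pass slope extraction
        fastLP = fastLP * 0.92 + projection * 0.08    // ≈ 0.25 s timescale
        slowLP = slowLP * 0.985 + projection * 0.015  // ≈ 1.3 s timescale
        let slope = fastLP - slowLP

        // Step 4 — thresholded phase detection; hold in the dead zone
        let signed = inhaleIsPositiveSlope ? slope : -slope
        if signed > inhaleThreshold { phase = "inhale" }
        else if signed < -exhaleThreshold { phase = "exhale" }
        return phase
    }
}
```

Feeding this a slow sinusoid on Z (a synthetic belly breath) produces "inhale" on the rising half-cycle and "exhale" on the falling one, with the phase held through the near-zero-slope peaks.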

Step 5 — Scoring

The live session scores against a 5.5-second metronome. A sliding 50-sample window computes immediate response; a cumulative ramp reaches 100% after a full correct window. At each metronome tick, the sliding window score becomes the final window score and is appended to session history.


Why testing this is different from testing software

Every bug described above was found by lying on the floor with a phone on your belly.

There is no simulator for an accelerometer breathing signal. There is no unit test for "correctly detects gentle exhale in a tired user at 10 BPM." The feedback loop is:

  1. Write code
  2. Build to device (1–3 minutes)
  3. Lie down
  4. Run a 66-second calibration + full session
  5. Read the log output (10–11 log lines per 5.5-second scoring window)
  6. Infer causation from correlation in a time series

Cause and effect are separated by a full session run. What looks like a scoring bug is often a calibration direction bug. What looks like a direction bug is often a gravity baseline initialization ordering bug. What looks like a threshold bug is often a state reset happening in the wrong order.

The gyroscope hypothesis is a good illustration. The suggestion was reasonable, the code was straightforward, and confirming it didn't work required 10+ physical sessions — half with the phone placed correctly, half deliberately hand-held — comparing the resulting log distributions. That's an afternoon of lying on the floor, not an afternoon of writing code.

State interactions were the most persistent bug class. By v3, the detector has: a gravity exponential filter, a rolling 250-sample XYZ buffer, PCA computation every 25 frames, two LP filters on the projection, a calibration state machine with 6 guided cycles, a session warmup, a per-window scoring accumulator, a sliding history buffer, and a direction-flip flag. Fixing any one component often invalidated assumptions in another.

The only reliable debugging technique: log every state transition with a timestamp, run the session, read the transcript like an audit trail.


What the AI copilot contributed

Structural framing. PCA as the orientation-robust axis was the copilot's suggestion. Slope (LP difference) rather than raw position as the primary detection variable was also a copilot suggestion. These are architectural decisions that determine everything downstream. Getting them wrong — as v1 and v2 did by assuming Z-axis — costs an entire iteration.

Drafting the linear algebra. The 3×3 covariance matrix construction and power iteration loop for finding the dominant eigenvector is standard but tedious. Having a first draft with the sign-flip convention already handled saved real time.

Generating falsifiable hypotheses. The gyroscope idea was wrong, but it was the right kind of wrong — a specific, testable claim with clear success criteria. It took one afternoon to falsify. Good hypotheses, even bad ones, are faster to process than vague intuitions.

What the copilot could not contribute: physical presence. Knowing that gyro values would overlap between hand-held and belly-placed phones required running sessions. Knowing that LP seeding from warmup data would cause scale mismatches required seeing it happen. Knowing that the scoring latch bug only manifests after five minutes of correct breathing required five minutes of correct breathing.

The development pattern that worked best: copilot proposes structure → human implements and runs sessions → log analysis reveals the actual failure mode → copilot proposes targeted fix → repeat. The ratio of physical testing time to coding time was roughly 3:1 across all three iterations.


Current state

The v3 detector works reliably for normal belly breathing sessions. Calibration correctly identifies breathing direction in most runs. Per-user thresholds handle the full range from shallow to deep breathers. Scoring is responsive and holds correctly through breath transitions.

What still needs work: a post-calibration breath confirmation step, per-user threshold adaptation across sessions, and the TestFlight build. App Store next.


How much did accuracy actually improve?

The honest answer requires a caveat first: the scoring metric changed between versions, so the numbers aren't directly comparable.

v1/v2 scoring measures how close each detected breath duration was to the 5.5-second target:

score = max(0, 1 − |duration − 5.5s| / 5.5s)

A perfect 5.5-second breath = 100%. A 4-second breath = 73%. A 3-second breath = 55%. Session score is the average across all scored breaths.
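As a function, this is a direct transcription of the formula above (the 5.5 s target is the metronome interval):

```swift
// v1/v2 duration score: 1 at the 5.5 s target, linear falloff, clamped at 0.
func durationScore(_ duration: Double, target: Double = 5.5) -> Double {
    max(0, 1 - abs(duration - target) / target)
}
```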

v3 scoring measures something different: the fraction of the last 50 accelerometer samples (1 second) that are in the correct direction, compared to the metronome prompt. It's a real-time compliance score, not a duration accuracy score. A 4-second hold count prevents score drops during breath transitions.

With that caveat in place, here's what the data actually shows.

v1 early (Z-axis, hysteresis detection)

Three measured sessions from the first working implementation:

| Session | Score | Notes |
| --- | --- | --- |
| 1 | 91% | Good placement, normal breathing |
| 2 | 72% | Slight phone angle, some mid-breath flips |
| 3 | 70% | Gentle breathing, signal near threshold |

Mean: 77.7%. The hysteresis dead-zone caused the most failures: with the exit threshold at 25% of range, gentle breathing often couldn't exit the dead zone and phases got stuck for 14+ seconds.

v1 final (sustain counter, best iteration)

Five sessions from the sustain counter version — the best the Z-axis approach reached:

| Session | Score |
| --- | --- |
| 1 | 91% |
| 2 | 93% |
| 3 | 95% |
| 4 | 97% |
| 5 | 100% |

[Chart: session score progression by version — v1 early (measured): 70/72/91; v1 final (measured): 91/93/95/97/100; v2: estimated range only; v3: omitted, uses a different scoring metric.]

Mean: 95.2%. This is with the phone placed precisely face-up under normal breathing conditions. The caveat: these sessions were run by the developer under controlled conditions. Off-axis placement was expected to degrade scores significantly, but was never systematically tested.

v2 (pos normalization, guided mode)

No hard session numbers in the source files. The algorithm was the same core pipeline as v1 final. The main regression was orientation: a 30° phone tilt reduced the breathing signal projection onto Z by cos(30°) ≈ 0.87, and at 45° by cos(45°) ≈ 0.71 — a 29% signal reduction. With shallow breathing already near the noise floor, this was enough to cause detection failures. Estimated real-world score under typical placement conditions: 70–88% depending on phone angle.

v3 (PCA calibration — current)

Two separate quality metrics apply here.

Direction calibration accuracy — derived from 6 real reference sessions stored as code comments in BreathDetector.swift. In every session, the signed mean slope for inhale-labeled windows and exhale-labeled windows had opposite signs, with ratios between −0.96 and −1.23:

| Session | Inhale mean slope | Exhale mean slope | Ratio | Correct direction? |
| --- | --- | --- | --- | --- |
| 1 | +0.01767 | −0.01977 | −1.12 | Yes |
| 2 | +0.01168 | −0.01257 | −1.08 | Yes |
| 3 | +0.00221 | −0.00245 | −1.11 | Yes |
| 4 | +0.03268 | −0.03143 | −0.96 | Yes |
| 5 | +0.02009 | −0.02207 | −1.10 | Yes |
| 6 | +0.00650 | −0.00797 | −1.23 | Yes |

Direction calibration: 6/6 (100%) across all measured sessions. Note: session 3 had the smallest amplitude (median = 0.015 G, near the 0.012 threshold) — the calibration still held. Session 4 had the largest amplitude (median = 0.181 G, deep breathing) — no false flip. The ratio consistency (all between −0.96 and −1.23) confirms the slope-mean approach is robust across very different breathing depths.

[Chart: v3 calibration quality — inhale vs. exhale mean slope across the 6 reference sessions. All six points fall in the opposite-sign quadrant (positive inhale slope, negative exhale slope) within the valid calibration band of ratios −0.96 to −1.23; same-sign quadrants would indicate miscalibration.]

Live session compliance score — the 50-sample sliding window with 4-second hold countdown is designed to stay above 80% during correct breathing and drop responsively when breathing stops or inverts. No controlled comparison sessions between v2 and v3 were logged with identical scoring metrics, so a direct number-to-number comparison isn't possible. The structural improvement is the hold countdown: in v2, score dropped sharply at every breath transition (correct behavior that felt like a bug); in v3 the 4-second hold absorbs the transition and score only decays if the next direction is genuinely wrong.

Accuracy across versions

[Chart: v1 early (hysteresis), mean 77.7% · v1 final (sustain), mean 95.2% · v2 (pos normalization), est. 70–88% · v3 (PCA calibration), direction 6/6.]

v2 and v3 use different scoring metrics — not directly comparable to v1.

The signal-to-noise problem that constrained all versions

The fundamental limit isn't algorithmic — it's physical. Normal belly breathing moves the phone by 0.008–0.020 G. The accelerometer noise floor is ~0.003 G. That's an SNR of roughly 8–16 dB. Deep breathing moves the phone by up to 0.180 G (session 4 above) — SNR of ~35 dB. Shallow breathing near the noise floor produces SNR of 2–4 dB.

No detection algorithm can reliably extract phase from a 2 dB SNR signal. This is why the amplitude-normalization trick in v1 (pos) and the per-user threshold in v3 (30% of calibration median) matter: they adapt the sensitivity to wherever the user's signal actually lives, rather than using a universal threshold that breaks at one extreme or the other.


Surprises once real data arrived

Every assumption baked in during development looked reasonable at the time. Here is what the logs revealed.

Surprise 1 — The minActive guard was silencing valid sessions

A guard checked windowAnyActive < minActive before awarding any score. The intent was to zero out windows where the phone was clearly not on the body. In practice it zeroed out soft breathers, older users, and anyone breathing gently. The signal was real; the guard couldn't distinguish "not enough movement" from "gentle breathing." Removed entirely. The per-user calibration threshold already handles the noise floor — the guard was redundant and harmful.

Surprise 2 — Score collapsed at every breath transition

Between inhale and exhale there is a natural pause while the chest reverses direction. Slope approaches zero during that pause. The sliding window accumulated false entries for 0.5–1 second and the score dropped sharply at every transition — even during a perfect session. The behavior was technically correct but felt completely broken to anyone using the app. Fix: a 4-second hold countdown. When active correct breathing is detected, the hold resets. During the hold, score stays fixed rather than decaying. Score only drops if the hold expires and active wrong-direction samples appear. Transitions became invisible to the user.
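A sketch of the hold-countdown mechanism, combining it with the asymmetric decay rates described later in the fine-tuning log (−20 per sample for wrong-direction breathing, −1 per sample once the hold expires); the +5 ramp rate and struct shape are illustrative assumptions:

```swift
// Hold countdown: while correct breathing is active, the hold refills.
// During natural pauses the score is frozen instead of decaying; it only
// drops after the hold expires or on active wrong-direction samples.
struct HoldScore {
    var score = 0.0
    var holdFrames = 0
    let holdLength = 200 // 4 s at 50 Hz

    mutating func update(activeCorrect: Bool, activeWrong: Bool) {
        if activeCorrect {
            score = min(100, score + 5) // ramp up (rate is illustrative)
            holdFrames = holdLength     // refill the hold
        } else if activeWrong {
            score = max(0, score - 20)  // wrong direction: immediate penalty
        } else if holdFrames > 0 {
            holdFrames -= 1             // natural pause: score frozen
        } else {
            score = max(0, score - 1)   // hold expired: gentle fade
        }
    }
}
```

With this in place, a 0.5–1 second breath transition never touches the score — it just consumes a fraction of the 4-second hold.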

Surprise 3 — Exhale systematically underscored

After fixing transitions, exhale windows still averaged 15–25 percentage points below inhale windows. Exhale is passive — the chest falls under gravity and muscle relaxation rather than active effort. Slope magnitude during exhale is typically 60–80% of the inhale peak. A single shared threshold meant exhale was below detection much more often than inhale. Fix: separate inhaleThreshold and exhaleThreshold calibrated independently from labeled training peaks. Each direction now gets a threshold appropriate to its actual signal amplitude.

Surprise 4 — Direction correct in training, inverted in the session

inhaleIsPositiveSlope was being set correctly by the calibration. But the live session started with 5–10 seconds of inverted scoring before it seemed to self-correct. Root cause: the LP filters (fastLP, slowLP) were reset to 0 at warmup, then gravity was reseeded from fresh samples. The calibration and the session were operating in slightly different coordinate frames during the filter ramp-up period. Fix: after warmup, the last 50 warmup projections are used to seed fastLP and slowLP to the same value. Slope = fastLP − slowLP = 0 at session start, and both filters ramp up from a consistent baseline. The inversion disappeared.
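The seeding fix in sketch form. The text above says the last 50 warmup projections seed both filters to the same value; averaging them is an assumption here, and the function name is illustrative:

```swift
// Seed both LP filters from the warmup tail so that
// slope = fastLP − slowLP = 0 at session start.
func seedFilters(warmupProjections: [Double]) -> (fastLP: Double, slowLP: Double) {
    let tail = warmupProjections.suffix(50)
    guard !tail.isEmpty else { return (0, 0) }
    let mean = tail.reduce(0, +) / Double(tail.count)
    return (mean, mean) // identical seeds ⇒ zero initial slope
}
```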

Surprise 5 — Deep breathing tripped the spike filter

The spike filter was designed to reject phone-pickup events. Its limit was a fixed constant of 0.05. A user breathing with full diaphragmatic effort (session 4 in the reference data, median 0.181G) produced slope values that exceeded 0.05 during peak inhalation. The detector thought the phone was being picked up and zeroed the signal. Fix: spikeLimit = max(0.05, median × 2.0) computed from calibration. For deep breathers this raised the limit to 0.36G — still well below a real phone-pickup event (~0.5G+) but clear of legitimate breathing peaks.

Surprise 6 — The run-length filter cut the first 3 samples of every window

The run-length filter requires 3 consecutive active samples before scoring anything, to reject single-sample taps and noise spikes. activeRunLength was being reset to 0 at every 5.5-second window boundary. Users are almost always mid-breath at a window boundary — their breathing cycle doesn't align with the metronome. The reset silently discarded the first ~0.06 seconds of each window. Fix: activeRunLength is now explicitly not reset at window boundaries. The code comment marks this as intentional: carry the run across the boundary.

Surprise 7 — Score latch held 100% after the user stopped

The cumulative window score used windowSampleCount, which only incremented. Once it crossed the 41-sample target, the window was locked at 100% for its remaining duration regardless of what the user did. A user could breathe correctly for 3 seconds, stop completely, and score 100% for the window. This wasn't caught in development because the developer always breathed correctly. Fix: score is now max(slidingScore, windowTarget). The sliding score decays within 1 second of stopping. The window target only rises. The max gives fast response on start, gradual ramp to 100% on sustained breathing, and immediate decay if breathing stops.

Surprise 8 — Drone audio produced an audible click at window boundaries

At each window reset, audio.stopAll() was being called. This stopped both the drone sound and any in-progress voice prompt ("Breathe in", "Breathe out"). Users heard a sharp audio cut every 5.5 seconds that broke the meditation context. Fix: audio.stopDroneOnly() — the drone stops and restarts with the new window, but voice cues complete naturally regardless of window timing.


Fine-tuning log — how the numbers changed as sessions accumulated

The architecture stabilized before the constants did. Every number below started as an estimate and was revised based on session logs.

Detection thresholds — from fixed to per-user

Original: a single hardcoded threshold applied to all users and both directions.

Problem: a threshold that worked for normal breathing was too high for a gentle breather and too low for a deep breather. Any fixed value broke someone.

Final: inhaleThreshold = median(inhalePeaks) × 0.30 and exhaleThreshold = median(exhalePeaks) × 0.30, computed during calibration. The 30% factor gives a noise margin (the noise floor is typically 5–10% of the peak) while remaining sensitive to real breathing. It's conservative enough not to trigger on stillness and permissive enough to catch gentle breathing.

The 0.30 multiplier itself was tuned: 0.20 caused false triggers on phone vibrations, 0.40 missed exhales on softer breathers. 0.30 held across the reference session set.
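The threshold rule as code, one call per direction. The exact median convention (midpoint for even counts) is an assumption; names are illustrative:

```swift
// Per-user detection threshold: median of labeled calibration peaks × 0.30.
// Called once with inhale peaks and once with exhale peaks.
func breathThreshold(peaks: [Double], factor: Double = 0.30) -> Double {
    precondition(!peaks.isEmpty, "calibration must record at least one peak")
    let sorted = peaks.sorted()
    let mid = sorted.count / 2
    let median = sorted.count % 2 == 0
        ? (sorted[mid - 1] + sorted[mid]) / 2
        : sorted[mid]
    return median * factor
}
```

The median, rather than the mean, keeps one unusually deep calibration breath from inflating the threshold.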

Spike limit — from fixed to calibration-derived

Original: fixed at spikeLimit = 0.05.

Problem: deep breathers exceeded this and got rejected as phone-pickup events.

Final: spikeLimit = max(0.05, median × 2.0). The floor of 0.05 protects shallow breathers from noise; the median × 2.0 factor scales with actual breathing amplitude. A user with median 0.020G gets a limit of 0.05G. A user with median 0.181G gets a limit of 0.36G.

Hold countdown — from nothing to 4 seconds

Original: no hold. Score decayed as soon as active breathing stopped.

First attempt: 1 second (50 frames). Insufficient — breath transitions can last 0.5–1.0 seconds and some users pause at the top of a breath for longer. The score still dipped visibly.

Final: 200 frames (4 seconds at 50 Hz). This covers the full transition period for even slow breathers. The 4-second hold means the score is essentially locked at the window target during any natural pause, and only drops if the user genuinely stops for more than 4 seconds or actively breathes the wrong way.

Score decay rates — asymmetric by intent

Two distinct decay rates are applied depending on what's happening:

  • Wrong direction active: liveScore -= 20 per sample. At 50 Hz this drops the score from 100% to 0% in about 0.1 seconds. Should feel immediate — the user is doing the wrong thing.
  • Phone off belly (hold expired, no active signal): liveScore -= 1 per sample. This drops from 100% to 0% in 2 seconds. Slow enough not to punish a brief stillness but responsive enough to show the score draining.

The asymmetry was deliberate: wrong direction should feel like an instant penalty; phone off belly should feel like a gentle fade.
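A quick numeric check of the two rates (Python sketch, illustrative names):

```python
RATE_HZ = 50

def seconds_to_zero(score, per_sample_decay):
    # Count how many samples the decay takes, convert to seconds at 50 Hz.
    n = 0
    while score > 0:
        score -= per_sample_decay
        n += 1
    return n / RATE_HZ

wrong_direction = seconds_to_zero(100, 20)   # 0.1 s: feels like an instant penalty
phone_off_belly = seconds_to_zero(100, 1)    # 2.0 s: a gentle fade
```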

Sliding window size — 50 samples (1 second)

The 50-sample window size sets how quickly the live score responds to breathing changes. Smaller = more responsive but noisier. Larger = smoother but laggy.

50 samples (1 second) was the smallest value that produced stable scores during normal breathing. At 25 samples (0.5 seconds), score flickered noticeably during breath-peak stillness even with the hold countdown active. At 100 samples (2 seconds), the score lagged behind breathing changes long enough to confuse users.

The slidingExpected threshold is 0.20 × 50 = 10 — only 10 of the last 50 samples need to be active correct matches for a sliding score of 100%. This is intentionally permissive: 10 samples is 0.2 seconds of active breathing within the last second, which easily happens during normal breathing even with transitions.

Window target calibration — 41 samples for 100%

The cumulative window score reaches 100% when windowSampleCount = 41. That's 0.15 × 275 = 41.25, where 275 is the total samples in a 5.5-second window at 50 Hz.

15% of the window is the target because users are never actively moving for the full window. The chest is stationary for ~0.5 seconds at each breath peak, and slope only exceeds threshold for the rising and falling portions of the breath. A realistic upper bound for active samples in a well-executed 5.5-second breath is about 60–70% of the window. Setting the 100% target at 15% means a user breathing correctly from the start hits 100% roughly halfway through the window.

The 0.15 factor was tuned down from 0.25 (too hard — good breathing sessions only reached 80%) and from 0.10 (too easy — partial breathing could hit 100% without full effort).

Run-length filter — 3 consecutive samples

The run-length filter rejects single-sample slope spikes by requiring 3 consecutive samples above threshold. The value 3 was the minimum that eliminated tap artifacts without introducing noticeable detection lag (3 samples = 0.06 seconds at 50 Hz). At 2 samples, a sharp phone tap still occasionally triggered. At 5 samples, the detector occasionally missed the very start of a slow exhale where slope built up gradually.
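A sketch of the filter (Python; the flag lists are illustrative):

```python
RUN_REQUIRED = 3                     # consecutive above-threshold samples

def filter_active(above_threshold_flags):
    run, out = 0, []
    for above in above_threshold_flags:
        run = run + 1 if above else 0
        out.append(run >= RUN_REQUIRED)   # 3 samples = 0.06 s of lag at 50 Hz
    return out

tap = filter_active([False, True, False, False])   # one-sample spike: rejected
breath = filter_active([True] * 6)                 # sustained slope: accepted
```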

Early flip threshold — 3:1 ratio over 50 active samples

The early flip fires if earlyWrongCount > earlyRightCount × 3 after the first 50 active samples.

The 3:1 ratio is conservative by design. A genuinely backwards calibration produces near 100% wrong and 0% right — caught easily. A correct calibration with a noisy warmup might produce 60% right and 40% wrong during the first breath — not caught, as intended. Tested with ratios of 2:1 (too aggressive, occasionally flipped correct calibrations), 4:1 (too conservative, missed some backwards calibrations for 1–2 full windows), and settled on 3:1.

The 50-sample minimum ensures the check fires on real data, not warmup noise. At 50 Hz with the run-length filter, 50 active samples represents roughly the first 1–2 breaths of the session — fast enough to catch miscalibration before the user gives up, but with enough data to make the ratio meaningful.
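The whole check fits in a few lines (Python sketch; gating on the total active-sample count is a simplification of the real warmup logic):

```python
MIN_ACTIVE, RATIO = 50, 3

def should_flip(wrong_count, right_count):
    if wrong_count + right_count < MIN_ACTIVE:
        return False                 # too little data: could be warmup noise
    return wrong_count > right_count * RATIO

backwards = should_flip(48, 2)       # near-100% wrong: flip the direction
mostly_right = should_flip(20, 30)   # 60% right, 40% wrong: leave alone
```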

Training quality gates — minPhasePeak and minPhaseActiveCount

Two constants guard against bad calibration phases being included in threshold computation:

  • minPhasePeak = 0.008G — the peak slope during a calibration phase must exceed this. Below 0.008G means the phone was essentially stationary. This threshold sits at about two-thirds of the refMinMedian (0.012G), leaving headroom for phases where the user was breathing gently but correctly.
  • minPhaseActiveCount = 15 — at least 15 active samples must occur during the phase. This prevents a phase with one brief movement from contributing to the threshold median.

Both values were tuned together. Too high and valid shallow-breathing calibration phases get rejected, forcing the user to redo them. Too low and noise phases contaminate the median. The current values pass all 6 reference sessions and reject hand-holding.

Key constants — original value vs final tuned value

| Constant | Original | Final | Direction | Why changed |
|---|---|---|---|---|
| pos thresholds | 0.65 / 0.35 | 0.60 / 0.40 | narrowed ↓ | Too wide — missed transitions at normal breathing rates |
| minFlipInterval | 0.8 s | 0.5 s | decreased ↓ | 0.8 s lag felt unresponsive to the user |
| Threshold multiplier | fixed constant | median × 0.30 | → per-user | Fixed value broke at amplitude extremes |
| spikeLimit | 0.05 G (fixed) | max(0.05, median × 2.0) | → adaptive | Deep breathers tripped the fixed limit |
| holdFrames | none → 50 | 200 (4 s) | increased ↑ | 50 frames insufficient for slow breathers at peaks |
| Wrong-dir decay | −5 / sample | −20 / sample | steeper ↑ | −5 felt too gradual — wrong direction should be instant |
| Idle decay | −5 / sample | −1 / sample | shallower ↓ | −5 penalised natural pauses at breath peaks |
| slidingN window | 100 samples (2 s) | 50 samples (1 s) | decreased ↓ | 100 samples lagged behind breathing changes |
| Window 100% target | 25% of window | 15% of window (41 samples) | decreased ↓ | 25% too hard — correct sessions only reached 80% |
| Run-length filter | 2 consecutive | 3 consecutive | increased ↑ | 2 still let phone taps through |
| Early flip ratio | 2:1 | 3:1 | increased ↑ | 2:1 occasionally flipped correct calibrations |
| warmupFrames | 200 (4 s) | 150 (3 s) | decreased ↓ | 4 s felt too slow; 3 s still clears the placement transient |

Subtle, critical, and regressive bugs

Not all bugs announce themselves with zero scores or crashes. Some corrupted data silently for weeks. Some were introduced by a fix. Some are sitting in the current codebase right now.


Subtle bugs — wrong in ways that were invisible

Gravity seeded from a single sample

The gravity filter initializes on the very first accelerometer callback:

if !gravSeeded { gravX = x; gravY = y; gravZ = z; gravSeeded = true; return }

One sample. If the phone was still moving when the first callback fired — the user just set it down, or bumped the table — gravX/Y/Z is seeded from a transient value. The exponential filter (α = 0.0005, τ ≈ 40 seconds) then spends the next 40 seconds slowly correcting. Every projection during that period is computed against a slightly wrong gravity baseline, producing a systematic offset in the slope signal.

This was never noticed because the 3-second warmup absorbs most of the transient, and the LP filters tolerate small offsets. But a user who places the phone with a thud will have subtly degraded detection quality for the first 10–15 seconds of the session. It's below the threshold of user perception — the score doesn't obviously drop — but it contributes to the occasional unexplained low-score first window.
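The 40-second correction is easy to verify numerically (Python sketch on a synthetic stationary signal):

```python
ALPHA, RATE_HZ = 0.0005, 50       # filter coefficient and sample rate from the text

grav = 1.1                        # seeded from a transient: 0.1G off the true 1.0G
for _ in range(40 * RATE_HZ):     # 40 s of perfectly still samples at 1.0G
    grav = ALPHA * 1.0 + (1 - ALPHA) * grav

residual = grav - 1.0             # remaining seed error after 40 s
# (1 - alpha)^2000 is roughly e^-1, so ~37% of the 0.1G error is still present
```

After 40 s about 37% (1/e) of the seed error remains, matching the τ ≈ 40 s figure.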


PCA sign convention mismatch between training and live session

There are two PCA functions. computePCAFromArrays runs at the end of calibration and uses this sign convention:

if vz < 0 { vx = -vx; vy = -vy; vz = -vz }   // force positive Z

updatePCA runs every 25 frames during the live session and uses a different convention:

if vx*pc1x+vy*pc1y+vz*pc1z < 0 { vx = -vx; vy = -vy; vz = -vz }  // preserve continuity

The training function anchors PC1 to positive Z regardless of the actual breathing axis direction. The live function preserves sign continuity with the previous PC1 — whatever direction it was last time.

For a normally placed phone (face-up, flat), the breathing axis has a positive Z component and both conventions agree. But for an unusual placement — phone slightly face-down, or at an extreme angle — the breathing axis might have a negative Z component. Training would then flip PC1 to the opposite direction. When the live session starts with the default pc1z = 1 and runs updatePCA, it locks on to the correct breathing axis but with the opposite sign from what training recorded.

The result: inhaleIsPositiveSlope was calibrated against training's sign convention, but the live session is running in the flipped convention. The early flip logic may or may not catch this depending on how quickly the mismatch accumulates to 50 active samples. For most users this never manifests. For a user with a consistent unusual phone placement who doesn't trigger the early flip, every session scores poorly with no obvious explanation.


activeRunLength reset contradicts its own comment

In advanceMetronome, the window reset block at line 627 does:

activeRunLength = 0

Two lines later, at line 631:

// NOTE: activeRunLength intentionally NOT reset here — user is mid-breath
// at window transitions; carry the run across the boundary.

The comment describes the intended behavior. The code does the opposite. The comment was written to document a deliberate choice; at some point the reset was added back (possibly during a debugging session to rule out state accumulation as a bug source) and the comment was never updated.

The practical effect: surprise #6 from the previous section — the first 3 samples of each new window are silently discarded — is still happening in the current codebase. The activeRunLength = 0 reset means every window boundary starts the run-length filter from scratch. A user who is mid-exhale when the window ticks over loses 0.06 seconds of scoring credit. At 50 Hz and 5.5-second windows, this happens ~10 times per minute and costs roughly 0.5% of total possible score. Small, systematic, and currently unfixed.


seenPositive && seenNegative gate — always dropped the first exhale

In v2, a both-sides gate was added to prevent the placement spike from triggering false detections: don't fire any phase until pos had been seen in both the inhale zone (> 0.65) and the exhale zone (< 0.35). During the spike, pos ≈ 1.0 constantly, so seenNegative never set, and the gate held.

The gate worked as intended for the spike. But when the spike drained and the gate cleared, it cleared on the inhale side only — the signal came down from the spike into inhale territory first. The detector fired inhale. Then it needed pos < 0.35 to fire exhale. But pos during a normal exhale starting from the spike-cleared state was still high because wMin was elevated. The first exhale after warmup was systematically missed.

Every session had an implicit "one free pass" for the first exhale. The user would inhale, get credit, exhale, get silence, then the second inhale and everything after worked correctly. It looked like a timing issue or a user error. It was a structural property of the gate's clearing logic.


Critical bugs — broke core functionality completely

The confirmation timer self-cancellation loop

In v1, a 0.6-second confirmation delay was added to reduce false phase flips: a raw flip had to hold for 0.6 seconds before the phase was officially committed. The implementation:

raw flip detected → start 0.6s confirmation timer
timer fires → commit phase

The defect: if breathPhase was still oscillating near the threshold during the 0.6-second wait, a second flip arrived and cancelled the first timer, starting a new one. Which then got cancelled by the third flip. The timer kept restarting on every oscillation cycle and the phase was never committed. The note player waited for a committed phase. No note ever played.

This ran for a full development session before anyone noticed — the waveform looked fine, the raw phase was flipping correctly, the audio category was set up correctly. The confirmation timer was silently eating every phase event. Discovered by adding a log line at the timer fire point and finding it never appeared.
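The loop can be reproduced with a small event-time simulation (Python; times in seconds, names illustrative):

```python
CONFIRM_DELAY = 0.6

def committed_phases(flip_times, session_end):
    # Each raw flip cancels any pending confirmation and starts a new one.
    commits, pending_deadline = [], None
    for t in list(flip_times) + [session_end]:
        if pending_deadline is not None and t >= pending_deadline:
            commits.append(pending_deadline)   # timer fired before the next flip
        pending_deadline = t + CONFIRM_DELAY if t < session_end else None
    return commits

# Threshold noise flipping every 0.3 s for 10 s: the timer restarts every time.
noisy = committed_phases([0.3 * i for i in range(1, 34)], session_end=10.0)
# A single clean flip with no follow-up oscillation commits 0.6 s later.
clean = committed_phases([1.0], session_end=10.0)
```

With flips every 0.3 s the deadline keeps moving and nothing ever commits; the single flip commits 0.6 s after it fires.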


LP filter seeding from warmup projections inverted the direction

After calibration, a 3-second warmup re-seeds the gravity baseline. The idea was to also seed fastLP and slowLP from the warmup projection data to eliminate the 1–2 breath ramp-up transient at session start:

// attempt 1 — seeded from training projections
let seedVal = trainingProjections.last ?? 0
self.fastLP = seedVal; self.slowLP = seedVal

This caused direction inversion for the first 5–10 seconds of every session. The root cause: training projections were computed against the training gravity baseline. After warmup, gravity was reseeded from fresh samples, putting the projection in a slightly different coordinate frame. The seed value was no longer the neutral baseline — it was an offset. slope = fastLP − slowLP was nonzero at session start even though the chest was stationary, and its sign was determined by the frame mismatch, not the user's breathing.

The early flip logic would sometimes catch this but sometimes not, depending on how quickly the user started breathing. The fix: seed from warmup projections (computed with the new warmup gravity, not training gravity) or don't seed at all. The current code seeds from the last 50 warmup projections computed in the warmup coordinate frame, which eliminates the mismatch.
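The transient is reproducible with the two filter coefficients from the text (Python sketch; the 0.05 seed offset is a made-up stand-in for a projection carried over from the old gravity frame):

```python
FAST_A, SLOW_A = 0.08, 0.015     # the two LP coefficients from the text

def slope_trace(seed, proj_stream):
    fast = slow = seed
    slopes = []
    for p in proj_stream:
        fast = FAST_A * p + (1 - FAST_A) * fast
        slow = SLOW_A * p + (1 - SLOW_A) * slow
        slopes.append(fast - slow)
    return slopes

still = [0.0] * 250              # chest stationary in the new gravity frame
bad = slope_trace(seed=0.05, proj_stream=still)    # offset seed: phantom slope
good = slope_trace(seed=0.0, proj_stream=still)    # seed matches new baseline
```

The offset seed produces a slope excursion (peaking around sample 25) because the fast filter forgets the offset sooner than the slow one; its sign is set by the frame mismatch, not by breathing. A seed in the session's own frame stays flat.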


Zero-crossing noise gate made the noise floor larger than the signal

In v2, zero-crossing detection was tried as an alternative to pos normalization: positive slope = inhale, negative = exhale. To suppress noise, a noise gate was added: only cross if |current| > noiseFloor, where noiseFloor = 20% × peakAmp and peakAmp was the maximum absolute value in the 250-sample window.

The 250-sample window at the time of testing still contained the placement spike. The spike peak was 0.3–0.5G. noiseFloor = 0.20 × 0.3 = 0.06G. Breathing amplitude was 0.003–0.008G. The noise floor was 8–20× larger than the signal it was supposed to protect.

The gate blocked every breathing sample for the full 5 seconds the spike remained in the window. This looked — at first — like the placement spike problem was back. It was actually the noise gate, not the spike, that was blocking detection. The two bugs were additive and their interaction obscured the root cause. Logged noiseFloor ≈ 0.06, current ≈ 0.003, confirmed the problem, abandoned zero-crossing entirely.


Regressive bugs — a fix that broke something else

Repositioning detection halted detection during fast breathing

After the stable v2 was working, a mid-session repositioning feature was added: if range > 0.05 for N consecutive frames, assume the phone was moved and reset warmup. The implementation:

if range > 0.05 {
    largeRangeCount += 1
    if largeRangeCount >= repositionThreshold {
        resetWarmup()
    }
    return   // ← this line
}
largeRangeCount = 0

The return statement was inside the range > 0.05 block, so it fired on every frame where range > 0.05, not only when largeRangeCount reached the repositioning threshold. Fast breathing and deep breathing both push range above 0.05. Every frame during deep breathing returned early, skipping the detection code entirely. The detector produced silence during exactly the use case that was most important to support.

This ran undetected for several sessions because the test sessions used relaxed breathing at normal depth. The first time a deep-breathing session was run, score was 0% for the entire session with no log output from the detection path. Found by adding a log line at the top of the detection function and finding it never fired.

The fix for the return bug introduced a second regression: after removing the return, largeRangeCount now accumulated at 50 Hz and could hit the repositioning threshold during a sustained deep breath (a 5-second deep inhale holds range > 0.05 continuously). The detector would reset warmup mid-session. This produced a 3-second silence gap in every long breath. The repositioning feature was removed entirely.


windowAnyActive < minActive guard silenced shallow breathers

The guard was added to zero out windows where the phone was clearly not on the belly — if fewer than minActive samples exceeded threshold in a window, the whole window scored 0. The intent was to distinguish "user took the phone off" from "user didn't breathe well."

Problem: shallow breathers produce fewer active samples per window. Older users, users doing light breathing exercises, users lying at an angle that reduced the amplitude — all of these generated active sample counts near the minActive boundary. Their sessions scored 0% for every window despite breathing correctly.

This is a classic precision-recall tradeoff. The guard had high precision for the intended use case (genuine phone-off-belly windows correctly scored 0) but terrible recall (many valid sessions also scored 0). The per-user calibration threshold was supposed to handle this by setting detection sensitivity appropriately — the guard was a second, competing threshold that broke the calibration's job. Removed.


Aggressive responsiveness tuning caused constant flickering

At one point minFlipInterval was dropped from 0.8s to 0.25s and pos thresholds were narrowed from 0.65/0.35 to 0.55/0.45 to make the app feel more responsive. On paper this made the detection faster. In practice:

With range ≈ 0.008G (gentle breathing), the dead zone between 0.45 and 0.55 is 0.10 × 0.008 = 0.0008G — about 40% of the noise floor. The signal was oscillating across the boundary multiple times per noise cycle. With minFlipInterval = 0.25s, the detector was firing at the maximum rate continuously. The note was switching at 4 Hz. No user could breathe that fast.

The root cause: narrowing the thresholds assumes a clean signal. The signal is not clean near zero. Any threshold system that puts the decision boundary inside the noise requires hysteresis to be stable, and hysteresis (the 0.65/0.35 gap) was exactly what was removed. Reverted to 0.60/0.40 with 0.5s flip interval.
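The effect is easy to demonstrate with a synthetic flat-but-noisy pos signal (Python sketch; the noise model is illustrative):

```python
import math

def count_flips(pos_values, hi, lo):
    # Simple hysteresis detector: flip phase only when pos crosses hi or lo.
    phase, flips = None, 0
    for pos in pos_values:
        if pos > hi and phase != "inhale":
            phase, flips = "inhale", flips + 1
        elif pos < lo and phase != "exhale":
            phase, flips = "exhale", flips + 1
    return flips

# pos hovering around 0.5 with noise excursions of +/-0.08, no breathing at all:
noise = [0.5 + 0.08 * math.sin(2.5 * i) for i in range(500)]
narrow = count_flips(noise, hi=0.55, lo=0.45)   # decision boundary inside noise
wide = count_flips(noise, hi=0.65, lo=0.35)     # hysteresis gap spans the noise
```

The narrow thresholds flip constantly on a signal that contains no breathing; the wide pair never fires once.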

Bug taxonomy — category · iteration found · severity · status

| Bug | Category | Found in | Found by | Status |
|---|---|---|---|---|
| Gravity seeded from single sample | Subtle | v3 | QA | Open |
| PCA sign convention mismatch | Subtle | v3 | QA | Open |
| activeRunLength reset contradicts comment | Subtle | v3 | QA review | Open |
| seenPositive && seenNegative drops first exhale | Subtle | v2 | QA | Fixed |
| Confirmation timer self-cancels (note never plays) | Critical | v1 | QA | Fixed |
| LP seed coordinate frame mismatch → direction inverted | Critical | v3 | QA (4 sessions) | Fixed |
| Zero-crossing noise gate floor > signal | Critical | v2 | QA | Fixed |
| Score latch holds 100% after user stops | Critical | v3 | QA | Fixed |
| return inside repositioning guard halts detection | Regressive | v2 | QA | Fixed (removed feature) |
| minActive guard silences shallow breathers | Regressive | v3 | QA | Fixed (removed guard) |
| Aggressive threshold tuning → constant flickering | Regressive | v2 | QA | Fixed (reverted) |
| Repositioning reset triggers during deep breath | Regressive | v2 | QA | Fixed (removed feature) |

3 bugs remain open in current codebase. All are low-impact systematic biases rather than failures.


How the tech lead and QA actually built this

The tech lead walks in with a whiteboard

By the end of v2, the Z-axis approach had hit its ceiling. The algorithm worked on a precisely placed phone under controlled conditions. In the real world — phone at any angle, user lying on a soft mattress, breathing gently or breathing deeply — the detection was inconsistent in ways that were hard to diagnose. The signal was real but the architecture wasn't robust enough to extract it reliably.

This is when the tech lead came in.

In this project, that role was played by an AI coding assistant (Claude). The tech lead's contribution wasn't writing production code — it was stepping back from the incremental iteration that had produced v1 and v2 and asking a structural question: why is this brittle, and what's the right primitive?

The answer came in three connected proposals, delivered in the span of one conversation.

First: PCA for orientation independence. The Z-axis assumption was the root cause of orientation brittleness. The solution: don't assume any axis. Run Principal Component Analysis on the full 3D accelerometer data. The first principal component is by definition the axis of maximum variance, which during active breathing is the breathing axis — regardless of how the phone is tilted. Project all three axes onto PC1 to produce a scalar signal. Phone at 30°, 45°, lying face-down on a soft surface — PC1 finds the breathing axis automatically.

Second: a calibration phase to learn the user. Fixed thresholds were breaking across the amplitude range (0.008–0.180G, a 22× span). The solution: don't use fixed thresholds at all. Run a 66-second guided calibration before each session. During calibration, record labeled inhale and exhale peaks from the user's actual breathing. Set the detection threshold at 30% of the median peak. The threshold adapts to whoever is using the app, not to whoever was sitting next to the developer when the constant was chosen.

Third: signed slope means for direction detection. PCA gives you an axis but not a polarity. The tech lead proposed computing the mean slope separately for inhale-labeled windows and exhale-labeled windows during calibration. If inhale mean slope > exhale mean slope, positive slope is inhale. Otherwise flip. Set the inhaleIsPositiveSlope flag once and apply it for the session. Data from sessions confirmed the ratio is always between −0.96 and −1.23 for correct belly breathing — tight enough to be a reliable quality check.
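The polarity decision itself is one comparison over labeled data (Python sketch with made-up slope values):

```python
# Direction from labeled calibration windows: if the mean slope over
# inhale-labeled windows exceeds the mean over exhale-labeled windows,
# positive slope means inhale. Slope values here are invented.
from statistics import mean

def inhale_is_positive_slope(inhale_slopes, exhale_slopes):
    return mean(inhale_slopes) > mean(exhale_slopes)

normal = inhale_is_positive_slope([0.004, 0.005, 0.006], [-0.004, -0.005, -0.006])
flipped = inhale_is_positive_slope([-0.004, -0.005], [0.004, 0.005])
```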

These three proposals were architectural. They didn't replace any of the existing DSP (the gravity filter, the LP filters, the sliding window scoring were all kept). They were a new layer underneath, making the existing DSP orientation-independent and user-adaptive.

The tech lead also proposed the gyroscope for hand-holding detection. That one didn't work — gyro jerk values overlap completely between a still hand and a belly-placed phone — but it was the right kind of proposal: specific, testable, falsifiable within a known budget of sessions. Took one afternoon to rule out.


QA comes in and finds what the whiteboard missed

The architecture was sound. The implementation took three weeks of testing to make it actually work.

QA in this project was also the developer: one person with a phone, lying on a hard floor, running sessions and reading logs. There is no other way to test a physical sensor app. The feedback loop was:

write code → build to device (2–3 min) → lie down →
run 66s calibration + full session → read logs →
identify failure mode → hypothesize cause → repeat

The ratio of physical testing time to coding time was roughly 3:1. Most of the hard problems required multiple sessions to reproduce, one session to instrument with new logging, and another session to confirm the fix.

QA found every bug described in the previous sections. The ones that took the longest:

The LP seeding coordinate frame mismatch took four sessions to diagnose. The symptom — direction inverted for the first 5–10 seconds — was intermittent and seemed to correlate with how quickly the user started breathing after warmup. The real cause was a coordinate frame mismatch between training gravity and warmup gravity that only manifested when fastLP was seeded from training projections. Finding it required adding log lines to print the seed value, the first 10 post-warmup slope values, and the current gravity baseline — and then doing one session with fast breathing, one with slow breathing, and one with the phone placed at a different angle to isolate the variable.

The activeRunLength reset took even longer to notice — it's a consistent small bias rather than an obvious failure. It's still in the current code. The comment at line 631 says it's intentionally not reset; line 627 resets it. The discrepancy between comment and code was found during a code review pass, not during testing, because a 0.06-second systematic bias per window doesn't produce a visible score change.

The return inside the repositioning guard was found in the first 60 seconds of the first session that used deep breathing. Score was 0% the entire session. No detection log output at all. Adding a single logger.debug at the top of the detection function — which never fired — confirmed the detection code was being skipped on every frame. The fix took 30 seconds; finding it took one full session.

The minActive guard took several sessions to attribute correctly because it only affected certain users. The developer, breathing at normal depth, never triggered it. A session with deliberately gentle breathing (simulating a shallow breather) produced 0% scores and led to the discovery. Without that deliberate adversarial test case, the guard could have shipped and silently broken the app for a significant portion of real users.

The common pattern: bugs that only appear under conditions the developer doesn't naturally reproduce. Normal breathing, correct placement, phone flat — all of that worked fine. Gentle breathing, unusual angles, deep breathing, picking up the phone mid-session, breathing backwards during calibration — all of those were QA scenarios that required explicit intention to test.


What the collaboration actually looked like

The tech lead proposed architecture. QA validated it physically and returned with specific failure modes. The tech lead proposed targeted fixes. QA tested the fixes and returned with new failure modes uncovered by the fix. This iterated for about three weeks on the v3 codebase.

A rough breakdown of who contributed what:

| Contribution | Source |
|---|---|
| PCA for orientation independence | Tech lead (AI) |
| Calibration phase design | Tech lead (AI) |
| Signed slope means for direction | Tech lead (AI) |
| Gyroscope hand-holding detection | Tech lead (AI) — falsified by QA |
| Dual LP filter slope extraction | Tech lead (AI) |
| Gravity single-sample seed bug | QA (physical testing) |
| LP seeding coordinate frame mismatch | QA (log analysis, 4 sessions) |
| `activeRunLength` reset contradiction | QA (code review) |
| `minActive` guard silencing shallow breathing | QA (adversarial testing) |
| Score latch at 41 samples | QA (deliberate stop test) |
| `return` in repositioning guard | QA (first deep breathing session) |
| Asymmetric score decay rates (−20 / −1) | QA (user feel testing) |
| 4-second hold countdown duration | QA (iterated from 50 to 200 frames) |
| 3:1 early flip threshold | QA (tested 2:1, 4:1, settled on 3:1) |
| All numerical constants | QA (iterated through sessions) |

The tech lead's proposals were structurally correct, but most didn't work exactly as specified until QA found the failure mode and iterated the implementation. PCA was right but the sign convention between training and live was wrong. Calibration was right but the LP seeding after warmup was wrong. Signed slope means worked perfectly on the first try — the one proposal that required no QA iteration.

The honest summary: the architecture came from the tech lead, the constants came from QA, and most of the bugs were found by one person lying on the floor with a phone on their chest and a log file on their laptop.


How the collaboration actually worked — and what it produced

Three roles, one problem

This project had three distinct contributors with distinct jobs. They were never in the same room. Two of them were the same person.

The developer ran the project end to end — product owner, implementer, and QA tester. He defined what the app needed to do, wrote every line of production Swift, and spent more cumulative hours lying on a hard floor with a phone on his chest than any reasonable person should. The physical testing, the log analysis, the judgment calls about what felt right versus what the metrics said — all of that was him.

Claude played the tech lead role. Not by writing production code, but by proposing architecture at the moments when the existing approach had hit a wall. When v2 was fragile due to orientation sensitivity, Claude proposed PCA. When fixed thresholds were breaking across users, Claude proposed a calibration phase. When direction detection was unreliable, Claude proposed signed slope means from labeled data. These were structural proposals — they changed what the system was doing, not how the code was written.

The feedback loop between them was the actual product. The developer would hit a wall. Claude would propose a structural approach. The developer would implement it, run sessions, find the failure mode, and bring the logs back. Claude would analyze the failure and propose a targeted fix. The developer would test the fix. This cycle repeated dozens of times across three versions.


What each contributed that the other couldn't

The clearest way to see the division is to look at what each contribution actually required.

Claude's contributions required knowledge — of signal processing primitives, linear algebra, the properties of exponential filters, the published literature on respiratory sensing. PCA is a known tool. Slope as a DC-immune rate-of-change detector is a known technique. Signed mean comparison for direction disambiguation is a straightforward application of labeled statistics. None of these required physical access to the problem. They could be proposed from a description.

The developer's contributions required presence. The single-sample gravity seed bug was only visible in session logs from a specific sequence of actions — placing the phone quickly while the accelerometer was still settling. The LP coordinate frame mismatch took four sessions with different placement angles to isolate. The minActive guard silencing shallow breathers required deliberately simulating a shallow breather — something that would never happen naturally during development. The return statement inside the repositioning guard was found in the first 60 seconds of the first deep-breathing test session.

None of these bugs were findable from code inspection alone. They required the physical system to behave unexpectedly, under specific conditions, in ways that produced observable artifacts in the logs. That is QA work. It cannot be delegated to anyone who isn't physically present with the hardware.


The moment the collaboration worked best

There were several cycles where the collaboration produced something neither contributor could have reached alone.

The clearest example: the LP seeding coordinate frame mismatch.

The developer observed the symptom — direction inverted for 5–10 seconds at session start, intermittently, correlated with how quickly he started breathing. He had a hypothesis (something to do with filter initialization) but couldn't pinpoint which part of initialization was wrong. He brought the symptom and the hypothesis to Claude.

Claude analyzed the filter state initialization sequence and identified the mismatch: training projections were computed with training gravity; warmup reseeded gravity before the session; seeding the LP filters from training projections after gravity had changed meant the seed value was in the wrong coordinate frame. The fix: seed from warmup projections, computed after the gravity reseed, so both use the same baseline.

The developer implemented it, ran two sessions — one starting with fast breathing, one with slow — and confirmed the inversion was gone.

Neither could have reached this alone. The developer had the symptom data but not the signal processing intuition to trace it to a coordinate frame mismatch. Claude could model the filter chain but had no access to the session logs showing the symptom was intermittent and onset-correlated. The diagnosis required both.


What the final solution actually looks like

After four sensing approaches, three rewrites, twelve bugs, and dozens of tuning iterations, the core detection logic at the heart of v3 is about 15 lines of arithmetic:

// Gravity removal: a slow per-axis EMA tracks gravity; subtract it to leave motion
gravX = gravAlpha * x + (1 - gravAlpha) * gravX
let dx = x - gravX  // (same for y, z)

// PCA projection onto the calibrated breathing axis (first principal component)
let proj = dx * pc1x + dy * pc1y + dz * pc1z

// Slope extraction: the difference of a fast and a slow low-pass approximates the derivative
fastLP = 0.08 * proj + 0.92 * fastLP
slowLP = 0.015 * proj + 0.985 * slowLP
let slope = fastLP - slowLP

// Phase detection against calibrated, per-user thresholds
let signed = inhaleIsPositiveSlope ? slope : -slope
if signed > inhaleThreshold  { /* inhale */ }
if signed < -exhaleThreshold { /* exhale */ }

That is the entire detection core. Remove gravity, find the breathing axis via PCA, differentiate with two filters, compare to a calibrated threshold.

The calibration that feeds it is more complex — 66 seconds, labeled windows, covariance matrix, power iteration, peak extraction, median computation, signed mean comparison. The scoring state machine that sits on top of it is more complex — sliding windows, hold countdowns, drift flip logic, decay rates. But the detection itself, the thing that answers "is the user inhaling or exhaling right now," is 15 lines.
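The PCA step in that calibration needs no linear algebra library: for a 3×3 covariance matrix, power iteration converges to the dominant eigenvector — the breathing axis — in a handful of multiplies. A sketch under assumed names (`Vec3`, `covariance`, `principalAxis` are illustrative, not the app's actual code):

```swift
import Foundation

struct Vec3 { var x, y, z: Double }

// 3x3 covariance of gravity-removed samples (means are ~0 after gravity
// removal, so this reduces to an averaged outer product).
func covariance(_ samples: [Vec3]) -> [[Double]] {
    var c = [[Double]](repeating: [0, 0, 0], count: 3)
    for s in samples {
        let v = [s.x, s.y, s.z]
        for i in 0..<3 { for j in 0..<3 { c[i][j] += v[i] * v[j] } }
    }
    let n = Double(samples.count)
    for i in 0..<3 { for j in 0..<3 { c[i][j] /= n } }
    return c
}

// Power iteration: repeatedly multiply a guess by the covariance matrix
// and renormalize; it converges to the axis of maximum variance (PC1).
func principalAxis(_ c: [[Double]], iterations: Int = 50) -> [Double] {
    var v = [1.0, 1.0, 1.0]
    for _ in 0..<iterations {
        var w = [0.0, 0.0, 0.0]
        for i in 0..<3 { for j in 0..<3 { w[i] += c[i][j] * v[j] } }
        let norm = (w[0]*w[0] + w[1]*w[1] + w[2]*w[2]).squareRoot()
        guard norm > 0 else { break }
        v = w.map { $0 / norm }
    }
    return v
}
```

Note that power iteration fixes only the axis, not its sign — which is exactly why a signed mean comparison (the `inhaleIsPositiveSlope` flag) has to come afterward.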

This is the pattern every good engineering outcome follows. The simplicity of the final solution is not the starting point — it is the destination. Every abandoned approach, every regression, every tuning iteration was the work of finding the simplest thing that was actually true about the problem.

The mic approach failed because it wasn't measuring the right thing. FMCW failed because the physics of soft bodies defeated the approach. AirPods failed because the signal-to-noise ratio was too low at the sensing point. The Z-axis assumption failed because orientation isn't fixed. Fixed thresholds failed because users vary. Once you've eliminated everything that isn't true, what remains is: the breathing axis is the axis of maximum variance, the slope of the projection is the rate of breathing, and the calibrated median sets the threshold.

Simple. And it took months to get there.


The broader lesson

AI coding tools are often framed as productivity multipliers — write code faster, autocomplete boilerplate, generate test cases. That framing misses what was most valuable here.

The most valuable moments weren't when Claude wrote code. They were when the tech lead proposed a different frame for the problem. "Stop assuming Z is the breathing axis" is not a code suggestion. It's a restatement of the problem that makes a whole class of solutions visible that weren't visible before. "Use the calibration phase to learn the user rather than tuning constants" is not a code suggestion either. It's a shift from a systems engineering problem (find the right constant) to a machine learning problem (learn the constant from data).

Human expertise was equally irreplaceable, but it operated at a different level. The tech lead knew what the app was supposed to feel like. He knew that a score that drops to zero at every breath transition feels broken even if it's technically correct. He knew that a 4-second hold felt right and a 1-second hold felt anxious. He knew that the repositioning detection feature — which was causing regressions — should be removed rather than fixed, because the use case it was solving (mid-session phone movement) was rare enough not to be worth the complexity. These are product judgments. They can't be derived from signal processing theory.

The collaboration worked because it kept these contributions separate. Architecture from the tech lead. Physical validation from QA. Product judgment from the human who actually used the app. Each role stayed in its lane. The 15 lines of detection core at the end are the result of all three lanes running in parallel for long enough.
