Real-Time Multimodal Sleep Staging from Consumer Wearable Sensors Validated Against Ear-EEG: A Study Protocol
Jonathan Berent
NextSense, Inc., Mountain View, CA, USA · Correspondence: jb@nextsense.io
Abstract
Background. Polysomnography (PSG) is the clinical gold standard for sleep staging but is lab-bound and obtrusive, limiting longitudinal, real-world monitoring. Consumer wearables promise accessible sleep tracking, yet most are validated offline against PSG rather than in real time, and few resolve the full stage structure. Objective. We describe a protocol to evaluate the feasibility of predicting sleep stages in real time from heart-rate, accelerometer, and microphone data recorded by a consumer smartwatch, using simultaneously recorded ear-EEG as the reference standard, and to compare three sensor combinations. Methods. Prospective, single-center, observational study. Up to 18 healthy adults (≥22 years, no known sleep disorders) will be enrolled to yield ≥16 evaluable participants; each will be recorded for one overnight session with a smartwatch (heart rate, tri-axial accelerometer, microphone) and a simultaneous ear-EEG device, with the ear-EEG scored into four stages (wake, light, deep, REM). Planned analysis. The primary endpoint is agreement (Cohen's κ) between watch-predicted and ear-EEG-reference stages; a one-sided non-inferiority test evaluates whether κ is non-inferior to a substantial-agreement criterion of 0.61, at α = 0.05 and 80% power. Secondary analyses compare the three sensor arms and report per-stage performance and derived sleep parameters. Status. Pre-data; execution pending IRB approval.
1. Introduction
Sleep is increasingly recognized as central to cognitive, metabolic, and emotional health, yet the gold-standard tool for measuring it, polysomnography (PSG), remains confined to the laboratory. PSG is comprehensive but obtrusive, expensive, and typically limited to one or two nights, making it poorly suited to the longitudinal, ecological monitoring that both researchers and consumers increasingly want. Wearable sensors embedded in consumer devices have transformed access to sleep information, but two gaps persist. First, most consumer systems estimate sleep offline, after the night has ended, whereas many of the most valuable applications (smart-alarm timing, closed-loop audio, just-in-time interventions) require staging in real time. Second, validation has overwhelmingly compared wrist actigraphy against PSG, and non-EEG modalities alone struggle to resolve the full set of sleep stages.
Ear-EEG offers a practical reference standard that is itself wearable: in- and around-ear electrodes recover sleep-stage structure with agreement against PSG approaching expert inter-scorer reliability. This makes possible a study that would be impractical with PSG at scale: validating real-time, smartwatch-based staging against a comfortable neural reference across full nights at home or in a sleep-friendly setting. This protocol specifies such a study.
2. Objectives
Primary objective. To evaluate the feasibility of predicting sleep stages in real time from heart-rate, accelerometer, and microphone data recorded with a consumer smartwatch, by comparing watch-derived stage predictions against simultaneously recorded ear-EEG reference stages.
Secondary objective. To compare real-time staging performance across three sensor combinations: (i) heart rate + accelerometer; (ii) microphone; and (iii) heart rate + accelerometer + microphone.
3. Methods
3.1 Study design. Prospective, single-center, observational study. Each participant contributes one overnight recording session.
3.2 Participants. Adults aged ≥22 years with no known sleep disorders will be enrolled. Up to 18 participants will be enrolled to ensure a minimum of 16 evaluable participants, allowing for ~10% attrition or data loss. Inclusion/exclusion criteria, recruitment, and informed-consent procedures will be specified in the IRB-approved protocol.
3.3 Apparatus and data acquisition. Smartwatch signals will be acquired via standard mobile APIs: tri-axial accelerometer at 50–100 Hz (CoreMotion), heart rate at ~1 Hz (HealthKit), and stereo audio from the built-in microphones (AVFAudio). Simultaneously, an ear-EEG device will record overnight to provide the reference. All streams are timestamped to a common clock so that they can be aligned despite their differing sampling rates. The ear-EEG reference will be scored into four sleep stages: wake, light sleep, deep sleep, and REM.
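As an illustration of the common-clock alignment described above, the following minimal sketch bins timestamped samples from streams with different sampling rates into shared 30-second epochs. The function names, the stream labels ("hr", "acc"), and the choice to keep only epochs present in every stream are assumptions made for this example, not part of the protocol.

```python
from collections import defaultdict

EPOCH_SECONDS = 30.0  # staging epoch length used in this protocol


def bin_into_epochs(samples, t0, epoch_seconds=EPOCH_SECONDS):
    """Group (timestamp_s, value) samples by epoch index relative to t0.

    Timestamps are assumed to be seconds on the common clock.
    Samples recorded before t0 are dropped.
    """
    epochs = defaultdict(list)
    for t, value in samples:
        if t < t0:
            continue
        epochs[int((t - t0) // epoch_seconds)].append(value)
    return dict(epochs)


def align_streams(streams, t0):
    """Return {epoch_index: {stream_name: [values]}}.

    Only epochs that contain data from every stream are kept
    (an assumption of this sketch; other rules are possible).
    """
    binned = {name: bin_into_epochs(s, t0) for name, s in streams.items()}
    common = set.intersection(*(set(b) for b in binned.values()))
    return {k: {name: binned[name][k] for name in binned} for k in sorted(common)}


# Toy data: 60 s of heart rate at 1 Hz and accelerometer magnitude at 50 Hz.
heart_rate = [(float(t), 60.0) for t in range(60)]
accel = [(i / 50.0, 1.0) for i in range(50 * 60)]
aligned = align_streams({"hr": heart_rate, "acc": accel}, t0=0.0)
```

Each aligned epoch then holds the raw samples of every stream for the same 30-second window, which is the form a per-epoch feature extractor would consume.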
3.4 Real-time staging pipeline. A staging model will be trained offline on existing labeled data and then applied online to incoming smartwatch streams. Per-epoch features (band-limited and statistical features from accelerometer, heart rate, and audio) feed a classifier producing stage estimates at the standard 30-second epoch cadence. The same trained pipeline will be evaluated under each of the three sensor-combination arms to isolate the contribution of each modality. Real-time operation tolerates modest latency, so on-device or phone-side inference is feasible.
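The sketch below shows one way per-epoch statistical features could be assembled separately for each of the three sensor arms. The arm labels, stream names, and the specific features (mean, standard deviation, RMS) are hypothetical placeholders; band-limited features and the classifier itself are omitted and would be specified in the final pipeline.

```python
import math
import statistics

# Hypothetical arm definitions mirroring the three sensor combinations.
ARMS = {
    "hr_acc": ("hr", "acc"),
    "mic": ("mic",),
    "hr_acc_mic": ("hr", "acc", "mic"),
}


def epoch_features(values, prefix):
    """Simple statistical features for one stream within one 30-second epoch."""
    values = list(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD within the epoch
    rms = math.sqrt(statistics.fmean([v * v for v in values]))
    return {f"{prefix}_mean": mean, f"{prefix}_sd": sd, f"{prefix}_rms": rms}


def feature_vector(epoch, arm):
    """Build features using only the streams that belong to the given arm."""
    features = {}
    for stream in ARMS[arm]:
        features.update(epoch_features(epoch[stream], stream))
    return features


# Toy epoch with a few samples per stream.
epoch = {"hr": [60.0, 62.0, 64.0], "acc": [1.0, 1.0, 1.0], "mic": [0.0, 0.5, -0.5]}
hr_acc_features = feature_vector(epoch, "hr_acc")
all_features = feature_vector(epoch, "hr_acc_mic")
```

Restricting the feature set by arm while keeping the rest of the pipeline fixed is what lets the per-arm comparison attribute differences in performance to the sensors rather than to the model.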
3.5 Reference standard. Ear-EEG recordings will be scored into the four stages above to serve as the per-epoch reference against which watch predictions are compared. Scoring procedure, scorer training, and any consensus rules will be pre-specified in the IRB-approved protocol; current AASM-aligned guidance will be followed where applicable.
3.6 Endpoints. Primary endpoint. Epoch-by-epoch agreement between watch-predicted stages and ear-EEG reference stages, quantified by Cohen's κ. κ values of 0.61–0.80 are conventionally interpreted as substantial agreement, so 0.61 is used as the acceptance criterion. Secondary endpoints. Per-arm κ (the three sensor combinations); per-stage sensitivity and specificity and confusion matrices; and agreement on derived sleep parameters (total sleep time, sleep-onset latency, sleep efficiency, wake after sleep onset) assessed with Bland–Altman analysis.
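For concreteness, the following is a minimal sketch of the epoch-level agreement measures named above: the four-stage confusion matrix, unweighted Cohen's κ, and one-vs-rest sensitivity and specificity for a single stage. The stage labels and function names are assumptions for illustration; the study's actual analysis code may differ.

```python
from collections import Counter

STAGES = ("wake", "light", "deep", "rem")


def confusion_matrix(reference, predicted):
    """Rows are reference stages, columns are predicted stages, in STAGES order."""
    counts = Counter(zip(reference, predicted))
    return [[counts[(r, p)] for p in STAGES] for r in STAGES]


def cohens_kappa(reference, predicted):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(reference)
    if n == 0 or n != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    expected = sum(ref_counts[s] * pred_counts[s] for s in STAGES) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def sensitivity_specificity(reference, predicted, stage):
    """One-vs-rest sensitivity and specificity; None if a denominator is zero."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == stage and p == stage for r, p in pairs)
    fn = sum(r == stage and p != stage for r, p in pairs)
    fp = sum(r != stage and p == stage for r, p in pairs)
    tn = sum(r != stage and p != stage for r, p in pairs)
    sens = tp / (tp + fn) if (tp + fn) else None
    spec = tn / (tn + fp) if (tn + fp) else None
    return sens, spec


# Toy six-epoch example.
reference = ["wake", "light", "light", "deep", "rem", "rem"]
predicted = ["wake", "light", "deep", "deep", "rem", "light"]
kappa = cohens_kappa(reference, predicted)
matrix = confusion_matrix(reference, predicted)
light_sens, light_spec = sensitivity_specificity(reference, predicted, "light")
```

In this toy example four of six epochs agree (observed agreement 0.667) and chance agreement is 0.25, giving κ = 5/9 ≈ 0.56.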
4. Statistical analysis plan
The primary analysis is a one-sided non-inferiority test evaluating whether the Cohen's κ for watch-based staging is non-inferior to an acceptance criterion of 0.61, at a significance level of α = 0.05 and statistical power of 0.80. Secondary analyses compare κ across the three sensor arms and summarize per-stage performance; derived sleep parameters are compared with Bland–Altman limits of agreement. Missing or unscorable epochs will be handled by a pre-specified rule and reported transparently.
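The two calculations named above can be sketched as follows. This is an illustration under stated assumptions, not the pre-specified analysis: it treats per-participant κ as the unit of analysis, uses a normal approximation for the one-sided lower confidence bound (consistent with a z-based sample-size formula, although a t-based bound may be preferable at small N), and uses the conventional mean ± 1.96 SD Bland–Altman limits. All function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, stdev

KAPPA_CRITERION = 0.61  # acceptance criterion from the protocol
ALPHA = 0.05            # one-sided significance level from the protocol


def one_sided_kappa_test(kappas, criterion=KAPPA_CRITERION, alpha=ALPHA):
    """Compare the one-sided lower confidence bound of mean kappa to the criterion.

    Assumes one kappa per participant and a normal approximation.
    """
    n = len(kappas)
    if n < 2:
        raise ValueError("need at least two participants")
    mean = fmean(kappas)
    se = stdev(kappas) / math.sqrt(n)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    lower = mean - z_alpha * se
    return {"mean": mean, "lower_bound": lower, "passes": lower > criterion}


def bland_altman(watch, reference):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [w - r for w, r in zip(watch, reference)]
    bias = fmean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width


# Toy per-participant kappas and total-sleep-time values in minutes.
result = one_sided_kappa_test([0.70, 0.66, 0.72, 0.64, 0.68, 0.71, 0.65, 0.69])
bias, lo_limit, hi_limit = bland_altman([400.0, 420.0, 390.0], [410.0, 415.0, 400.0])
```

Under this reading, the criterion is met when the lower bound of the one-sided 95% interval for mean κ lies above 0.61.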
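Sample-size justification. The sample size derives from a power analysis for a one-sided test against the criterion κ = 0.61, N = (Zα + Zβ)² · σ² / d², where σ is the assumed standard deviation of κ and d is the margin by which the expected κ exceeds the criterion, with Zα = 1.96 and Zβ = 0.84 (power = 0.80). Note that 1.96 is the critical value for a one-sided α of 0.025 (equivalently, two-sided α = 0.05); the one-sided α = 0.05 test specified above corresponds to Zα = 1.645, so the values below are conservative for that test. Required N, rounded up to whole participants, across plausible standard deviations (σ) and effect sizes (d):
| σ | Effect size d (expected κ = 0.61 + d) | Required N |
|---|---|---|
| 0.05 | 0.10 (0.71) | 2 |
| 0.05 | 0.05 (0.66) | 8 |
| 0.05 | 0.02 (0.63) | 49 |
| 0.07 | 0.10 (0.71) | 4 |
| 0.07 | 0.05 (0.66) | 16 |
| 0.07 | 0.02 (0.63) | 97 |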
Assuming σ = 0.07 and d = 0.05 with a 10% dropout allowance, 18 participants will be enrolled to achieve a minimum of 16 evaluable participants. The final number may be refined from pilot data.
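The sample-size arithmetic can be checked with a short script. The sketch below evaluates N = (Zα + Zβ)² · σ² / d² with exact rational arithmetic, rounds up to whole participants, and then inflates the evaluable target for dropout. The function names are hypothetical, and inputs are passed as decimal strings so that no floating-point rounding affects the ceiling.

```python
import math
from fractions import Fraction


def required_n(sigma, d, z_alpha="1.96", z_beta="0.84"):
    """Participants required, rounded up to the next whole number."""
    z = Fraction(z_alpha) + Fraction(z_beta)
    n = z ** 2 * Fraction(sigma) ** 2 / Fraction(d) ** 2
    return math.ceil(n)


def enrollment_target(evaluable, dropout):
    """Enrollment needed so that `evaluable` remain after the given dropout rate."""
    return math.ceil(Fraction(evaluable) / (1 - Fraction(dropout)))


# Planning scenario from the text: sigma = 0.07, d = 0.05, 10% dropout.
n_planned = required_n("0.07", "0.05")
n_enrolled = enrollment_target(n_planned, "0.10")
```

For the planning scenario this yields 16 evaluable participants (from 15.37 before rounding) and an enrollment target of 18 (from 16 / 0.9 ≈ 17.8). Substituting z_alpha="1.645", the usual critical value for a one-sided α of 0.05, lowers the same scenario to 13 participants.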
5. Data management
All signals are timestamped to a common clock and stored with provenance. Smartwatch and ear-EEG streams, derived features, model predictions, and reference scores will be retained to permit re-analysis. Data handling, de-identification, retention, and security will follow the IRB-approved data-management plan; given the sensitivity of physiological data, on-device processing and data minimization are preferred where feasible.
6. Ethics, status, and timeline
This study requires Institutional Review Board (IRB) approval prior to any data collection; informed consent will be obtained from all participants. The indicative timeline once IRB approval is secured is: device/app readiness and pilot (1–2 participants), enrollment of up to 18 participants (one overnight session each), scoring and analysis, and manuscript preparation. This protocol may be deposited as a preprint and/or pre-registered prior to data collection.