The Apple Watch tops Stanford's heart rate accuracy study — and here's why

If you're planning on using a wrist-based monitor to track your heart rate while walking, running, or cycling, a group of scientists at Stanford (in partnership with the Swedish School of Sport and Health Sciences in Stockholm) says the Apple Watch is the monitor to get: it had the smallest margin of error (2%) of the seven devices tested.

The experiment also looked at each device's caloric estimations (or "EE", for energy expenditure). Though the Apple Watch doesn't do poorly in this arena, that doesn't mean much: the lowest median error across the pack was 27.4%, and the PulseOn came in at a whopping 92.6%. In short: there's still a long way to go before a wrist-worn device can calculate calories burned accurately.

We evaluated the Apple Watch, Basis Peak, Fitbit Surge, Microsoft Band, Mio Alpha 2, PulseOn, and Samsung Gear S2. Participants wore devices while being simultaneously assessed with continuous telemetry and indirect calorimetry while sitting, walking, running, and cycling. Sixty volunteers (29 male, 31 female, age 38 ± 11 years) of diverse age, height, weight, skin tone, and fitness level were selected.

How was this experiment conducted?

In studies like this, scientists are primarily looking at margins of error when determining what device works "best": In other words, you want a device that regularly reports within a certain margin of error as compared to the control heart rate, or "gold standard."

For this experiment, Stanford used the following for its gold standard:

Gas analysis data from indirect calorimetry (VO2 and VCO2) served as the gold standard measurement for calculations of EE (kcal/min). ECG data was used as the gold standard for HR (beats-per-minute; bpm).
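
To make the comparison concrete, here is a minimal Python sketch of how a device reading can be scored against a gold-standard reading. The function names and the sample numbers are hypothetical, and the study's exact error definition and aggregation may differ; this only illustrates the general idea of percent error against a reference.

```python
from statistics import median


def relative_error(device_value: float, gold_value: float) -> float:
    """Absolute percent error of a device reading against the gold standard."""
    if gold_value <= 0:
        raise ValueError("gold-standard value must be positive")
    return abs(device_value - gold_value) / gold_value * 100.0


def median_relative_error(device_readings, gold_readings) -> float:
    """Median percent error over paired readings taken at the same moments."""
    if len(device_readings) != len(gold_readings):
        raise ValueError("readings must be paired one-to-one")
    errors = [relative_error(d, g) for d, g in zip(device_readings, gold_readings)]
    return median(errors)


# Hypothetical sample data (not from the study): wrist HR vs. ECG HR, in bpm.
watch_bpm = [118, 125, 131, 140]
ecg_bpm = [120, 125, 130, 138]

print(f"median HR error: {median_relative_error(watch_bpm, ecg_bpm):.1f}%")
```

The same calculation would apply to EE by swapping in kcal/min values from the device and from indirect calorimetry.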

Because so little testing has been done on wrist-worn devices, there is no "official" standard for such experiments:

Prior studies of wrist-worn devices have focused on earlier stage devices, or have focused exclusively on HR or estimation of EE. Some have made comparisons among devices without reference to a U.S. Food and Drug Administration (FDA) approved gold standard. None proposed an error model or framework for device validation.

As such, the scientists have also proposed a public repository of validated heart monitor data.

To do this first experiment, the scientists identified 45 potential manufacturers, then limited it to eight based on the following criteria:

wrist-worn watch or band; continuous measurement of HR; stated battery life >24 h; commercially available direct to consumer at the time of the study; one device per manufacturer. Eight devices met the criteria: Apple Watch; Basis Peak; ePulse2; Fitbit Surge; Microsoft Band; MIO Alpha 2; PulseOn; and Samsung Gear S2. Multiple ePulse2 devices had technical problems during pre-testing and were therefore excluded.

After excluding the ePulse2, the experiment was left with seven devices.

Notably, neither Garmin's nor Polar's sport-specific wrist trackers were included in this study. We don't know if they were originally considered and then discarded, but the omission stands out given both manufacturers' prior expertise in sport-specific heart tracking.

Devices were tested in two phases. The first phase included the Apple Watch, Basis Peak, Fitbit Surge and Microsoft Band. The second phase included the MIO Alpha 2, PulseOn and Samsung Gear S2. Healthy adult volunteers (age ≥18) were recruited for the study through advertisements within Stanford University and local amateur sports clubs. From these interested volunteers, study participants were selected to maximize demographic diversity as measured by age, height, weight, body mass index (BMI), wrist circumference, and fitness level. In total, 60 participants (29 men and 31 women) performed 80 tests (40 with each batch of devices, 20 men and 20 women).

So what do the heart rate (HR) results mean?

Essentially, after all these tests, the scientists determined that the Apple Watch has the lowest margin of error when it comes to calculating heart rate while walking, running, or biking.

For the walking task, three of the devices achieved a median error rate below 5%: the Apple Watch, 2.5% (1.1%–3.9%); the PulseOn, 4.9% (1.4%–8.6%); and the Microsoft Band, 5.6% (4.9%–6.3%). The remaining four devices had median error between 6.5% and 8.8%. Across devices and modes of activities, the Apple Watch achieved the lowest error in HR, 2.0% (1.2%–2.8%), while the Samsung Gear S2 had the highest HR error, 6.8% (4.6%–9.0%) (Figure 3A and Figure 4A).

Most of the devices tested came within a median 5% margin of error throughout the tests, with only the Samsung Gear S2 falling outside the range on all activities (5.1% on cycling; a range of 6.5-8.8% on walking; and 6.8% total average).

So the Apple Watch is the best at heart rate for wrist-worn devices, right? According to this study, yes, but its competition is nipping at its heels — a less than 5% margin of error is still quite good when it comes to overall monitoring, so there's no need to throw out your Fitbit Surge if you're otherwise happy with it.

It's also worth noting that this experiment only tested wrist-worn devices in common exercise situations like biking, running, and walking. Yoga, weight-lifting, and other wrist-bending activities were excluded, all of which have been known to negatively affect the accuracy of wrist-worn heart monitoring.

What about the caloric (EE) results?

"Calories burned" has always been a bit of a mysterious stat on wrist-worn devices, in part because the calculations behind energy expenditure (or EE) are obscured on a per-device basis. From the study:

It is not immediately clear why EE estimations perform so poorly. While calculations are proprietary, traditional equations to estimate EE incorporate height, weight, and exercise modality. It is likely that some algorithms now include HR. Since height and weight are relatively fixed and HR is now accurately estimated, variability likely derives either from not incorporating heart rate in the predictive equation or from inter-individual variability in activity specific EE. There is evidence for this—for example, 10,000 steps have been observed to represent between 400 kilocalories and 800 kilocalories depending on a person's height and weight.
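
The 10,000-steps figure in that passage lends itself to a quick back-of-the-envelope check. The Python sketch below assumes a hypothetical device that credits every wearer with a flat 600 kcal per 10,000 steps (the midpoint of the quoted range; this is an illustrative assumption, not how any tested device is known to work) and shows how far off that single estimate would be at either end of the range.

```python
def percent_error(estimate_kcal: float, actual_kcal: float) -> float:
    """Absolute percent error of an EE estimate against the true value."""
    return abs(estimate_kcal - actual_kcal) / actual_kcal * 100.0


def ee_error_range(estimate_kcal: float, low_actual: float, high_actual: float):
    """Errors of one fixed estimate for the lowest- and highest-burning wearer."""
    return (
        percent_error(estimate_kcal, low_actual),
        percent_error(estimate_kcal, high_actual),
    )


# Hypothetical flat estimate: 600 kcal for 10,000 steps, regardless of wearer.
DEVICE_ESTIMATE_KCAL = 600.0

# Range quoted from the study: 10,000 steps can represent 400 to 800 kcal.
low_err, high_err = ee_error_range(DEVICE_ESTIMATE_KCAL, 400.0, 800.0)

print(f"error for a 400 kcal wearer: {low_err:.0f}%")   # 50%
print(f"error for an 800 kcal wearer: {high_err:.0f}%")  # 25%
```

Under these assumptions, a one-size-fits-all estimate misses by 25% to 50% before sensor noise even enters the picture, which is consistent with the study's point about inter-individual variability.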

As noted above, the calculation of EE involves a lot of variables, some of which require user input (height, weight, and activity type), so it's much harder for any device to give you an accurate estimate. The study's results bear that out:

EE error rates significantly exceed the 10% threshold for all devices on both the cycling and walking tasks… The Apple Watch had the most favorable overall error profile while the PulseOn had the least favorable overall error profile.

Error in estimation of EE was considerably higher than for HR for all devices (Figure 2B and Figure 3B). Median error rates across tasks varied from 27.4% (24.0%–30.8%) for the Fitbit Surge to 92.6% (87.5%–97.7%) for the PulseOn. For EE, the lowest relative error (RE) rates across devices were achieved for the walking (31.8% (28.6%–35.0%)), and running (31.0% (28.0%–34.0%)) tasks, and the highest on the sitting tasks (52.4% (48.9%–57.0%)).… No device achieved an error in EE below 20 percent. The Apple Watch achieved the lowest overall error in both HR and EE, while the Samsung Gear S2 reported the highest.

In other words: The Apple Watch may have had the most favorable error profile for energy expenditure among the devices in the study, but its estimates still aren't anywhere near the level of accuracy provided by the study's gold standard.

What does this mean for wrist monitors going forward?

For health tech junkies, Stanford's study is actually an incredibly important step forward in getting more reliable data from our devices. Stanford's proposal for a "wearable sensor evaluation framework" alone is a pretty exciting development — if scientists standardize a baseline testing framework and data repository, it allows experiments to be done all over the world with large testing groups, getting us comprehensive data.

Essentially, the more scientific experiments done on wrist-worn devices, the better: More data leads to competition from manufacturers to better their sensors, which gives us (the end-users) even better devices down the line.

And Apple Watch users? For now, you can rest smugly knowing that you'll get a pretty accurate heart rate for most walking, running, and biking activities. (And hope that Apple works on a better system for measuring energy expenditure in the future.)

Serenity was formerly the Managing Editor at iMore, and now works for Apple. She's been talking, writing about, and tinkering with Apple products since she was old enough to double-click. In her spare time, she sketches, sings, and in her secret superhero life, plays roller derby. Follow her on Twitter @settern.