The Apple Watch tops Stanford's heart rate accuracy study — and here's why

If you're planning on using a wrist-based monitor to track your heart rate while walking, running, or cycling, a group of scientists at Stanford (in partnership with the Swedish School of Sport and Health Sciences in Stockholm) claims the Apple Watch is the monitor to get, with the smallest margin of error (2%) out of seven tested devices.

The experiment also looked at each device's caloric estimations (or "EE", for energy expenditure). Though the Apple Watch doesn't do poorly in this arena, that doesn't mean much: the lowest median error across the pack was 27.4% (for the Fitbit Surge), and the highest was a whopping 92.6% (for the PulseOn). In short: wrist-worn devices still have a long way to go when it comes to calculating calories burned.

We evaluated the Apple Watch, Basis Peak, Fitbit Surge, Microsoft Band, Mio Alpha 2, PulseOn, and Samsung Gear S2. Participants wore devices while being simultaneously assessed with continuous telemetry and indirect calorimetry while sitting, walking, running, and cycling. Sixty volunteers (29 male, 31 female, age 38 ± 11 years) of diverse age, height, weight, skin tone, and fitness level were selected.

How was this experiment conducted?

In studies like this, scientists are primarily looking at margins of error when determining what device works "best": In other words, you want a device that regularly reports within a certain margin of error as compared to the control heart rate, or "gold standard."
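
To make "margin of error" concrete, here is a minimal Python sketch of how a percent error against a gold-standard reading might be computed. It assumes error is the absolute difference between the device and the reference, divided by the reference; the function names and sample readings are made up for illustration and are not taken from the study.

```python
from statistics import median


def relative_error(device_value, reference_value):
    """Absolute error as a fraction of the reference (gold-standard) reading."""
    return abs(device_value - reference_value) / reference_value


def median_error_percent(device_readings, reference_readings):
    """Median relative error, in percent, over paired readings."""
    errors = [
        relative_error(d, r)
        for d, r in zip(device_readings, reference_readings)
    ]
    return 100 * median(errors)


# Hypothetical paired readings in bpm (illustrative only, not study data).
reference_bpm = [100, 120, 150, 160]
device_bpm = [98, 123, 150, 152]

print(round(median_error_percent(device_bpm, reference_bpm), 2))
```

A lower median means the device tracks the reference more closely across the whole session, and a median is less swayed by one bad reading than a mean would be.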

For this experiment, Stanford used the following for its gold standard:

Gas analysis data from indirect calorimetry (VO2 and VCO2) served as the gold standard measurement for calculations of EE (kcal/min). ECG data was used as the gold standard for HR (beats-per-minute; bpm).
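
The excerpt doesn't say which formula turned the gas measurements into calories. One widely used option is the abbreviated Weir equation, sketched below in Python purely as an illustration. Treating it as the study's method is an assumption, and the function name and sample values are hypothetical.

```python
def weir_ee_kcal_per_min(vo2_l_per_min, vco2_l_per_min):
    """Abbreviated Weir equation: energy expenditure in kcal/min.

    Inputs are oxygen uptake (VO2) and carbon dioxide output (VCO2),
    both in litres per minute. Assumed here for illustration; the
    study may have used a different conversion.
    """
    return 3.941 * vo2_l_per_min + 1.106 * vco2_l_per_min


# Hypothetical values for light exercise (not data from the study).
ee = weir_ee_kcal_per_min(vo2_l_per_min=1.0, vco2_l_per_min=0.8)
print(round(ee, 2))
```

Whatever conversion is used, the point is that the reference EE comes from measured gas exchange, while the wrist devices have to infer it from indirect signals.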

Because so little testing has been done on wrist-worn devices, there is no "official" standard for such experiments:

Prior studies of wrist-worn devices have focused on earlier stage devices, or have focused exclusively on HR or estimation of EE. Some have made comparisons among devices without reference to a U.S. Food and Drug Administration (FDA) approved gold standard. None proposed an error model or framework for device validation.

As such, the scientists have also proposed a public repository of validated heart monitor data.

To do this first experiment, the scientists identified 45 potential manufacturers, then limited it to eight based on the following criteria:

wrist-worn watch or band; continuous measurement of HR; stated battery life >24 h; commercially available direct to consumer at the time of the study; one device per manufacturer. Eight devices met the criteria: Apple Watch; Basis Peak; ePulse2; Fitbit Surge; Microsoft Band; MIO Alpha 2; PulseOn; and Samsung Gear S2. Multiple ePulse2 devices had technical problems during pre-testing and were therefore excluded.

After excluding the ePulse2, the experiment was left with seven devices.

It is interesting to note that neither Garmin nor Polar's sport-specific wrist trackers were included in this study — we don't know if they were originally considered and then discarded, but it's worth noting given both manufacturers' prior expertise in sport-specific heart tracking.

Devices were tested in two phases. The first phase included the Apple Watch, Basis Peak, Fitbit Surge and Microsoft Band. The second phase included the MIO Alpha 2, PulseOn and Samsung Gear S2. Healthy adult volunteers (age ≥18) were recruited for the study through advertisements within Stanford University and local amateur sports clubs. From these interested volunteers, study participants were selected to maximize demographic diversity as measured by age, height, weight, body mass index (BMI), wrist circumference, and fitness level. In total, 60 participants (29 men and 31 women) performed 80 tests (40 with each batch of devices, 20 men and 20 women).

So what do the heart rate (HR) results mean?

Essentially, after all these tests, the scientists determined that the Apple Watch has the lowest margin of error when it comes to calculating heart rate while walking, running, or biking.

For the walking task, three of the devices achieved a median error rate below 5%: the Apple Watch, 2.5% (1.1%–3.9%); the PulseOn, 4.9% (1.4%–8.6%); and the Microsoft Band, 5.6% (4.9%–6.3%). The remaining four devices had median error between 6.5% and 8.8%. Across devices and modes of activities, the Apple Watch achieved the lowest error in HR, 2.0% (1.2%–2.8%), while the Samsung Gear S2 had the highest HR error, 6.8% (4.6%–9.0%) (Figure 3A and Figure 4A).

Most of the devices tested came within a median 5% margin of error throughout the tests, with only the Samsung Gear S2 falling outside that range on all activities (5.1% on cycling; somewhere between 6.5% and 8.8% on walking; and 6.8% overall).
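
As a quick sanity check on the 5% line, the short Python sketch below takes the three walking figures quoted from the study and keeps only those strictly under a threshold. The dictionary and function names are hypothetical; the percentages are the ones quoted above.

```python
# Median walking HR error (%) for the three devices quoted from the study.
walking_error = {"Apple Watch": 2.5, "PulseOn": 4.9, "Microsoft Band": 5.6}


def within_threshold(errors, threshold=5.0):
    """Return device names whose error is strictly below the threshold, best first."""
    ranked = sorted(errors.items(), key=lambda item: item[1])
    return [name for name, err in ranked if err < threshold]


print(within_threshold(walking_error))  # ['Apple Watch', 'PulseOn']
```

Note that the Microsoft Band's quoted 5.6% sits just above the 5% line, so a strict cutoff leaves two devices rather than three for the walking task.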

So the Apple Watch is the best at heart rate for wrist-worn devices, right? According to this study, yes, but its competition is nipping at its heels — a less than 5% margin of error is still quite good when it comes to overall monitoring, so there's no need to throw out your Fitbit Surge if you're otherwise happy with it.

It's also worth noting that this experiment only tested wrist-worn devices in common exercise situations like biking, running, and walking. Yoga, weight-lifting, and other wrist-bending activities were excluded, all of which have been known to negatively affect the accuracy of wrist-worn heart monitoring.

What about the caloric (EE) results?

"Calories burned" has always been a bit of a mysterious stat on wrist-worn devices, in part because the calculations behind energy expenditure (or EE) are obscured on a per-device basis. From the study:

It is not immediately clear why EE estimations perform so poorly. While calculations are proprietary, traditional equations to estimate EE incorporate height, weight, and exercise modality. It is likely that some algorithms now include HR. Since height and weight are relatively fixed and HR is now accurately estimated, variability likely derives either from not incorporating heart rate in the predictive equation or from inter-individual variability in activity specific EE. There is evidence for this—for example, 10,000 steps have been observed to represent between 400 kilocalories and 800 kilocalories depending on a person's height and weight.

As noted above, because there are a lot of variables involved in the calculation of EE (some of which require user input, like height, weight, and activity type), it's much harder for any device to give you an accurate estimate. The study's results bear that out:

EE error rates significantly exceed the 10% threshold for all devices on both the cycling and walking tasks… The Apple Watch had the most favorable overall error profile while the PulseOn had the least favorable overall error profile.

Error in estimation of EE was considerably higher than for HR for all devices (Figure 2B and Figure 3B). Median error rates across tasks varied from 27.4% (24.0%–30.8%) for the Fitbit Surge to 92.6% (87.5%–97.7%) for the PulseOn. For EE, the lowest relative error (RE) rates across devices were achieved for the walking (31.8% (28.6%–35.0%)), and running (31.0% (28.0%–34.0%)) tasks, and the highest on the sitting tasks (52.4% (48.9%–57.0%)).… No device achieved an error in EE below 20 percent. The Apple Watch achieved the lowest overall error in both HR and EE, while the Samsung Gear S2 reported the highest.

In other words: The Apple Watch may have had the fewest variations in energy expenditure when compared to the other devices in the study, but it still isn't anywhere near the level of accuracy provided by the study's gold standard.

What does this mean for wrist monitors going forward?

For health tech junkies, Stanford's study is actually an incredibly important step forward in getting more reliable data from our devices. Stanford's proposal for a "wearable sensor evaluation framework" alone is a pretty exciting development — if scientists standardize a baseline testing framework and data repository, it allows experiments to be done all over the world with large testing groups, getting us comprehensive data.

Essentially, the more scientific experiments done on wrist-worn devices, the better: More data leads to competition from manufacturers to better their sensors, which gives us (the end-users) even better devices down the line.

And Apple Watch users? For now, you can rest smugly knowing that you'll get a pretty accurate heart rate for most walking, running, and biking activities. (And hope that Apple works on a better system for measuring energy expenditure in the future.)

Serenity Caldwell

Serenity was formerly the Managing Editor at iMore, and now works for Apple. She's been talking, writing about, and tinkering with Apple products since she was old enough to double-click. In her spare time, she sketches, sings, and in her secret superhero life, plays roller derby. Follow her on Twitter @settern.

17 Comments
  • Since the beginning of the Apple Charitable Matching program, to Stanford University Hospitals, Cook himself has donated $50 million. Take from that what you will.
  • Given that the experiment made its entire repository and testing data available to other scientists to replicate and retest as they see fit, I'll take from it that Cook's $50 million has been well-spent on some great servers for the scientific community.
  • Makes me question my Fitbit Charge HR. I like the Apple Watch but don't care for the smartphone features, just want a device that performs as a fitness band. I wish Apple offered a budget Watch that removed all the smartphone features and sold at a fraction of the cost.
  • Hi ohezm, don't throw out your Charge HR just yet. The second paragraph in the article incorrectly quotes the study saying the Surge has an error rate of 92.6%; that actually belongs to the PulseOn. Reading through the rest of the study, the EE error rate with the Surge was actually 27.4%, which might be the best of the bunch and NOT the Apple Watch. So Apple was only ahead with HR accuracy. Also, the Charge 2 has an improved HR tracking, which is the technology behind the guided breathing sessions and advanced sleep tracking. I look forward to see the Charge 2 tested against the rest of the devices.
  • You have to understand, I don't think I've seen anything do better than Apple on this site.
  • They didn't bother to test a single Android Wear Watch? Ok...
  • Android watches are irrelevant, insignificant, not to be considered in such studies. They simply don’t have the processing power to do anything other than tell time. You know, what you people say about Apple products. Deal with it.
  • First of all, let me give you some unsolicited life advice and suggest that in the future you refrain from saying "you people", as if you have someone completely figured out by a couple of sentences in a post. Case in point, you're implying that I'm an Android/Google fanboy and I can assure that is most definitely not the case. My current primary phone is an iPhone 7+ and I use my iPad Air 2 a lot during the week. I also have a Nexus 6P, a Nexus 9 tablet and my laptop is a Surface Pro 4. I'm very brand agnostic and a fan of technology in general, which is precisely why I found it very strange that they basically ignored one of the major smartwatch/wearable platforms on the market. Nothing more, nothing less.
  • It says Samsung G2
  • The Gear S2 runs Tizen, not Android Wear.
  • It may be the best but it is still not that accurate. Don't even get me started on the calorie counting.
  • OK I think there's something wrong with this data because THE ONLY COMPANY THAT COMES CLOSE TO CHEST WORN HEART RATE MONITOR is MIO Especially for GYM Exercises. I'm saying this with personal experience and you can even see that in many reviews. every other hand based heart rate monitor fails at tracking gym/hiit/cross fit exercises including MIO but the closest is the MIO.
  • As a serious road cyclist I would never consider a phone-based HRM accurate, instead we use a chest strap HRM that communicates by ANT+ or BlueTooth LE to a Garmin bike computer, problem solved.
  • Yeah I know that but the closest one is MIO not the Apple Watch.
  • Apparently the Apple Watch matched the Mio: http://i.imgur.com/fsloX6z.png
  • I do not see Kardia AliveCor in test ? Why a serious study would not take into consideration an existing wearable for AppleWatch (medical grade device by the way) ?
  • This study would have carried more weight had they included a broader range of devices, particularly high end models from Garmin and Polar, and also chest-band models for reference, since those are considered the gold standard. It almost seems as if they cherry-picked lower end products to make the Apple Watch appear more favorable.