The Apple Heart Study, recently published in The New England Journal of Medicine, is one of the largest studies of the modern era. It enrolled over 400,000 participants to test whether the Apple Watch was effective at diagnosing arrhythmias in “real world” settings. Any new tech gadget invariably gets a lot of media coverage, and while the size and scope of the Apple Heart Study are objectively impressive, the results of the study are more nuanced than some headlines suggested.

First, how does this impact clinical trials?

While people use the term “game-changer” far too much in medical journalism, we should acknowledge that the Apple Heart Study was a massive undertaking and used a radically different approach to study recruitment. It enrolled over 400,000 patients, which makes it a huge trial. While not the largest clinical trial ever done (the March of Dimes launched a study of 1.3 million children to test Jonas Salk’s polio vaccine in 1954), it is still impressive. The Apple Heart Study also recruited patients “virtually” by leveraging the power of the Internet. Rather than recruiting patients through their physicians or via a physical testing center, the investigators recruited them through a downloadable app, which means patients could enroll in the study from home, without visiting any study site. Enrolling 400,000 patients in 9 months, as this study did, is no mean feat, and would not have been easy using traditional recruitment methods. Given how difficult it is to recruit patients into clinical trials, this type of Internet-based recruitment might be something we will be seeing more of in the near future.

But the scale of the study obscures some downsides. Since patients had to already own an Apple Watch and iPhone, and had to download the app to participate and record the data, the patient population skewed towards the young, the wealthy, and the healthy. Over half of the participants were under 40, and only 6% were over age 65. Given our aging population, that was not a representative sample.

As a consequence of the young age of the participants, very few patients actually had an irregular heartbeat during the trial. This is not entirely surprising, since heart disease becomes more common with age (age is arguably the most important risk factor for heart disease). In fact, only 0.5% of the people enrolled in the Apple Heart Study ended up receiving an alert from their watch. Sadly, most people seem to have ignored these alerts. The study protocol required people to go get a proper heart monitor (in this case, an ECG patch worn for 7 days) to determine whether they actually had an arrhythmia. But only about one in five people who got an alert actually complied. In the end, only 450 people submitted their ECG patches for analysis, which makes the Apple Heart Study fairly small by most measures.

This loss to follow-up is perhaps not totally surprising. Downloading an app on your phone is easy, but making a doctor’s appointment is hard, or at the very least time consuming. Many of these young, and presumably otherwise healthy, individuals probably just ignored the alerts from their Apple Watch, or looked into them with their own physicians without notifying the study personnel.

Unfortunately, loss to follow-up is a major problem in clinical trials. Usually, clinical trial staff spend a great deal of time and effort following up with patients and making sure they follow through on the research protocol. With this type of “virtual recruitment,” that simply wasn’t possible. To what extent this loss to follow-up affected the results of the Apple Heart Study is hard to say. Perhaps the study would have done better with more data points for researchers to analyze, or perhaps it would have done worse if more healthy people with no other problems were thrown into the mix. The most interesting question, though, is this: if “virtual recruitment” is going to be the future of clinical trials, then we need to work out how we are going to deal with the problem of people deciding to drop out.

Now on to the study itself.

What were they actually testing?

At its core, the Apple Heart Study was trying to do something that is quite difficult. It was trying to prove that population screening for atrial fibrillation is a good idea.

Atrial fibrillation is a very common arrhythmia that affects 2.7 million Americans and increases the risk of stroke. To prevent this rather severe complication, many patients with atrial fibrillation (or a.fib for short) are given blood thinners. There has been considerable research over the years into who should get blood thinners. In brief, anyone with risk factors for stroke (advanced age, diabetes, high blood pressure, heart failure, or a previous stroke) is generally prescribed these medications.

But screening for atrial fibrillation is not as easy as it seems. Screening implies testing a large number of symptom-free individuals to find a previously undiagnosed disease. And screening is a low yield activity that can generate many false positives. To understand why, consider the following theoretical scenario. Imagine a population of 1 million where about one thousand people have a.fib.

                                     Have a.fib   Do not have a.fib       Total
Total                                     1,000             999,000   1,000,000

Now imagine we give all these people an Apple Watch to try to detect a.fib. Let’s assume for simplicity that the Apple Watch is 99% accurate in its analysis. This means that of the 1,000 people who have a.fib, it will correctly identify 990 and miss 10.

                                     Have a.fib   Do not have a.fib       Total
Apple Watch detects a.fib                   990
Apple Watch does not detect a.fib            10
Total                                     1,000             999,000   1,000,000

Now, of the 999,000 people who do not have a.fib, the watch will correctly identify the heart rhythm as normal in 99% of them, or 989,010 individuals. It will err with the remaining 1%, flagging 9,990 people as having an arrhythmia when they are not.

                                     Have a.fib   Do not have a.fib       Total
Apple Watch detects a.fib                   990               9,990
Apple Watch does not detect a.fib            10             989,010
Total                                     1,000             999,000   1,000,000

So the Apple Watch will record atrial fibrillation in 990 + 9,990 = 10,980 patients and classify the other 989,020 as normal.

                                     Have a.fib   Do not have a.fib       Total
Apple Watch detects a.fib                   990               9,990      10,980
Apple Watch does not detect a.fib            10             989,010     989,020
Total                                     1,000             999,000   1,000,000
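The arithmetic behind these tables can be sketched in a few lines of Python. (The 99% accuracy figure is the illustrative assumption used above, not a measured property of the watch.)

```python
# Build the illustrative 2x2 table: 1,000,000 people, 1,000 of whom
# have a.fib, tested with a hypothetical watch that is 99% accurate
# in both directions.
population = 1_000_000
with_afib = 1_000
without_afib = population - with_afib
accuracy = 0.99  # assumed for illustration

true_positives = round(with_afib * accuracy)       # 990 correctly flagged
false_negatives = with_afib - true_positives       # 10 missed
true_negatives = round(without_afib * accuracy)    # 989,010 correctly cleared
false_positives = without_afib - true_negatives    # 9,990 wrongly flagged

print(true_positives + false_positives)   # 10,980 people get an alert
print(false_negatives + true_negatives)   # 989,020 people are called normal
```

Running the sketch reproduces the totals in the table: 10,980 alerts and 989,020 normal readings.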

In this fictitious patient population, you can be pretty certain that if your Apple Watch calls your heart rhythm normal, then it is probably normal. Only 10 of the 989,020 people who were recorded as having normal heart rhythms were misdiagnosed. In other words, there are few false negatives.

                                     Have a.fib               Do not have a.fib           Total
Apple Watch detects a.fib            990 (true positives)     9,990 (false positives)     10,980
Apple Watch does not detect a.fib    10 (false negatives)     989,010 (true negatives)    989,020
Total                                1,000                    999,000                     1,000,000

But its positive predictive value (PPV) is quite low. Of the 10,980 patients whom the Apple Watch alerted to an abnormal rhythm, only 990 actually had a.fib in our example; the rest were false positives. The PPV is 990 ÷ 10,980 ≈ 9%. In other words, when the watch detected a.fib, it was right only 9% of the time.
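The predictive values fall straight out of the cell counts in this hypothetical example; a quick Python sketch:

```python
# Predictive values from the illustrative 2x2 table:
# 990 true positives, 9,990 false positives,
# 10 false negatives, 989,010 true negatives.
tp, fp = 990, 9_990
fn, tn = 10, 989_010

ppv = tp / (tp + fp)   # chance that an alert reflects real a.fib
npv = tn / (tn + fn)   # chance that a "normal" reading is truly normal

print(f"PPV = {ppv:.1%}")   # about 9%
print(f"NPV = {npv:.3%}")   # well above 99.9%
```

The asymmetry is the whole story: the same 99%-accurate test gives a near-perfect negative predictive value but a dismal positive one, simply because so few people in this population have the disease.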

Again, this was a theoretical example to illustrate a point. But it’s useful to see what happens if we run the same scenario in another population of one million people where atrial fibrillation is more common. In this population, where 10,000 people (1% of the population) have atrial fibrillation, we can reconstruct the same table as above.

                                     Have a.fib   Do not have a.fib       Total
Apple Watch detects a.fib                 9,900               9,900      19,800
Apple Watch does not detect a.fib           100             980,100     980,200
Total                                    10,000             990,000   1,000,000

In this case, the negative predictive value is still quite high. If the Apple Watch does not detect a.fib, then the patient likely does not have it. Only 100 of the 980,200 people with normal recordings were misdiagnosed. The negative predictive value is still over 99%.

But the PPV has improved. Of the 19,800 people in whom a.fib was detected, 9,900 actually had it. This means a positive reading from the watch is correct 50% of the time.

What we can see from these two hypothetical examples is that the usefulness of a test improves when you use it in a population where more people have the disease. We unfortunately cannot reconstruct a table like this for the Apple Heart Study, because only the people with abnormal notifications went on to get further testing to confirm the arrhythmia. But what we do know from the main result of the paper is that of the people who got an abnormal alert from their Apple Watch (and who followed the protocol to get further cardiac monitoring), only 34% had atrial fibrillation confirmed on subsequent monitoring. In other words, the positive predictive value was 34%. What’s interesting is that for people under the age of 40, who made up most of the study population, the positive predictive value was only 18%.

Why the Apple Watch did so poorly is not hard to see. Atrial fibrillation is a very common arrhythmia, but it is very unusual to get it at a young age. About 9% of people over age 65 have a.fib, while only 2% of people under age 65 have it. For people under 40, who were half the study population, a.fib is quite rare.

So what does the Apple Watch study mean?

The Apple Watch clearly works and is obviously an improvement over the previous-generation device, whose positive predictive value was only 8% in one of their study cohorts, meaning it generated false positives 92% of the time. The problem with the Apple Watch and screening for a.fib is that the Apple Watch is invariably bought and used by young, otherwise healthy individuals in whom a.fib is rare and in whom the Apple Watch will generate many false positives. The other problem is that it’s not clear what to do with short episodes of atrial fibrillation in young people. Even if you do make a diagnosis of atrial fibrillation, you wouldn’t necessarily give someone blood thinners if they were young and had no other risk factors. Also, most physicians would only diagnose atrial fibrillation if the arrhythmia lasted for more than 30 seconds, which is the definition used in this study. What to do with people who have brief episodes lasting only a few seconds is less clear.

The problem with the Apple Watch is not its technology; it’s how we use it. Even good tests and good technology fail if you use them on the wrong group of people.


Posted by Christopher Labos

Dr. Christopher Labos MD CM MSc FRCPC is a physician with a Royal College certification in cardiology. After his clinical training at McGill University he pursued a master’s degree in epidemiology. His main research focus is cardiovascular prevention. He realizes that half of his research findings will be disproved in five years: he just doesn’t know which half. He is also an associate with the McGill Office for Science and Society whose mission is to promote critical thinking and present science to the public. He co-hosts a podcast called The Body of Evidence. He is a freelance contributor for the Montreal Gazette, CJAD, and has also appeared on CBC Radio and CBC Television. To date, no one has recognized him on the street.