Estimating vital signs like heart rate, breathing rate and SpO2 levels from facial videos using computer vision.

Photo by Ryan Stone on Unsplash

Let's start with a small exercise. But first I have to ask you to put on your fitness tracker. You do own one, right? Oh... you thought the exercise was mental? Disastrous. The point is, if you’ve been even a tiny bit observant about the fitness device that sits on your wrist, you will have noticed a shiny light on its back (usually green). And if you’ve been curious or a bit more observant, you’ll know that this light is used as a medium (pun unintended) to extract the pulse signal from your wrist! That pulse signal is what all the vital, physiology-related metrics you see on the screen, like heart rate and breathing rate, are computed from.

Over the past year, I’ve been working in this field of physiological signal estimation, which is the great enabler behind my anecdotal introduction. Hopefully, the ubiquity of fitness trackers nowadays makes it relatable!

Physiological signal estimation is traditionally modelled as a signal processing task. However, we’ve been looking at estimating these vital signs with Deep Learning; Computer Vision, to be specific. First, let us get some terminology out of the way, which I intend to present as succinctly as possible.

Photo-plethysmography (PPG)

Photoplethysmography in action: using reflected light to extract pulsatile PPG signal. Photo from Wikipedia (CC BY-SA 4.0)

Fitness trackers employ a technique called Photo-plethysmography (PPG) for ‘tracking’ your vital signs. Shining a light through the wrist allows these trackers to measure the cardiovascular Blood Volume Pulse (BVP) through changes in the amount of light reflected or absorbed by the blood vessels underneath the skin. Generally, in healthcare, this is done by placing contact sensors at the fingertips, the chest, or the feet of a patient. Either way, the idea is to extract the plethysmographic signal which can be used to estimate physiological metrics like heart rate, breathing rate, oxygen saturation levels etc. You guessed right: the photo- in photo-plethysmography stands for the medium, which is light. Importantly, PPG is a low-cost alternative to the conventional approach of plethysmography that involves elaborate equipment and trained healthcare professionals to measure and make sense of a patient’s vital signs.

Photo left to right by Tim Sheerman-Chase (CC BY 2.0) and Vladislav Bychkov on Unsplash

Having conceptualised the overarching principle of photo-plethysmography, we now look at the mathematical model which makes it possible to extract the PPG signal from reflected light.

Dichromatic Reflectance Model (DRM)

The Dichromatic Reflectance Model (DRM) describes C(t), i.e. the light reflected from the surface of an object and received at the sensor, as a linear combination of two major components: specular (interface) reflection and diffuse (body) reflection.

Specular reflection is the regular reflection that is observed at mirror-like surfaces where a ray of light gets reflected at an angle that is equal to its angle of incidence. On the other hand, in the case of diffuse reflection, light is scattered at many angles.

Modelling the reflected light C(t) as a combination of its constituent terms. Image by Author.
Note: These time-varying signals are recorded continuously at the sensor for a period of time, t = [0, T].

As per the DRM, the two reflection terms, Vs(t) for specular and Vd(t) for diffuse, are modulated by an intensity term I(t). Additionally, we account for a noise term Vn(t), which quantifies the noise incurred at the sensor.
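The equation from the figure above is not reproduced in this text. A reconstruction consistent with the description, assuming the two reflection terms are summed before being scaled by the intensity and that the sensor noise is additive, would read:

```latex
C(t) = I(t)\,\bigl(V_s(t) + V_d(t)\bigr) + V_n(t), \qquad t \in [0, T]
```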

Note that for all this to happen, the measuring device (even though a relatively inexpensive fitness wearable) still needs to be in contact with your (i.e. the patient’s) body. In other words, PPG is a contact-based method, where signals are usually measured with contact sensors placed at the fingertips, the wrist, or the feet. However, this kind of sustained contact may not be suitable for applications such as sports and driving, where it can restrict motion, cause inconvenience, or become a distraction.

Prefixing the ‘Remote-’ in Remote-PPG

In the wake of the COVID-19 outbreak, we all feel better when things are less touchy and more contactless than ever before. Voilà! The spotlight falls on the concept of remote photoplethysmography, with ‘remote’ being the keyword. We will refer to it henceforth as ‘r-PPG’.

Studies have shown that the pulsatile plethysmographic signal is strong enough to be captured by observing changes in skin colour from a series of images using low-cost RGB camera sensors.

Moving towards contactless. Photo to the right by Wesley Fryer (CC BY 2.0)

Extending the brief description of photo-plethysmography that I offered in the introduction, r-PPG provides a contactless, unobtrusive and concomitant method of measuring the plethysmographic signal by allowing us to extract the pulse signal from images and videos rather than through a contact sensor on a wearable device (see image above).

Revising the previous definition, r-PPG allows remote measurement of the cardiovascular Blood Volume Pulse (BVP) through changes in the amount of light reflected or absorbed by the blood vessels under the skin.

The premise of r-PPG hence enables the extraction of pulsatile information from facial videos. This ability to remotely estimate the BVP signal and by extension compute the vital physiological metrics like heart rate, breathing rate and oxygen saturation, can be applied to the most ubiquitous camera sensors in the world, the ones that reside in our smartphones!

So, we’ve established that there is a pulsatile signal of interest, p(t) that can be extracted from a sequence of images containing skin pixels; for example, facial videos. Now let’s see how this extraction is actually performed.

Adapting the DRM for images

Images are made up of multiple pixels. It helps to think of each pixel as an independent sensor for the task of extracting the pulsatile signal p(t). Following every pixel over the consecutive frames of a video, we are left with a time series of pixel intensities (RGB) for each pixel over the timeline of the video. Relating this to our previous, signal-processing-based example, each of these time series can be thought of as an individual C(t). We now subscript C(t) with a k to denote the time series of pixel intensities belonging to the kth pixel. The DRM can be revised for an image input as follows:

Adapting the Dichromatic Reflectance Model for image data. Image by Author.
Note: These time-varying signals are obtained over multiple image frames in a video input i.e. t = [0, T] for a video consisting of T frames. We obtain such a signal from each pixel in the image.
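To make the "every pixel is a sensor" idea concrete, here is a minimal NumPy sketch. It assumes the video has already been decoded into an array of shape (T, H, W, 3); the random array below is only a hypothetical stand-in for real frames, and `pixel_time_series` is an illustrative helper name, not part of any library.

```python
import numpy as np


def pixel_time_series(video, row, col):
    """Return C_k(t) for the pixel at (row, col) as an array of shape (T, 3)."""
    # Slicing along the time axis gives one RGB triplet per frame.
    return video[:, row, col, :].astype(np.float64)


# Hypothetical stand-in for a decoded video: 90 frames of 36x36 RGB pixels.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(90, 36, 36, 3), dtype=np.uint8)

c_k = pixel_time_series(video, row=10, col=20)
print(c_k.shape)  # (90, 3): one RGB sample per frame
```

Each column of `c_k` is the intensity trace of one colour channel for that single pixel, which is the per-pixel C(t) that the adapted model describes.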

Observe that all the time-varying components of C(t) can be written as a combination of static DC components (terms that are not a function of time) and time-varying AC components. Having some knowledge about the signal of interest p(t) and given the state of this model, we make two minor re-arrangements to the definition of C(t).

Vn(t) approximated by spatial averaging of pixels via interpolation. Image by Author.

Firstly, we get rid of the Vn(t) term, which accounts for the camera noise and quantization error. This rests on the assumption that the camera noise can be reasonably eliminated by grouping (or averaging) the intensities of a sufficiently large number of pixels that are in close proximity. In image processing, this is called ‘spatial averaging’, and it is performed by downsampling the image using interpolation. Leaving out the technical details in the interest of brevity, we have the following:

Spatial averaging eliminates the effect of camera quantization error. Image by Author.
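The effect of spatial averaging can be illustrated with a short sketch. It uses plain block averaging as a stand-in for the interpolation-based downsampling mentioned above, and the frame is synthetic: a constant colour plus Gaussian noise, which is an assumption made purely for illustration. The helper name `spatial_average` is hypothetical.

```python
import numpy as np


def spatial_average(frame, block):
    """Downsample an (H, W, 3) frame by averaging non-overlapping block x block patches."""
    h, w, c = frame.shape
    h_out, w_out = h // block, w // block
    # Crop so that the frame divides evenly into blocks.
    cropped = frame[:h_out * block, :w_out * block].astype(np.float64)
    # Split both spatial axes into (number of blocks, block size) and average within each block.
    patches = cropped.reshape(h_out, block, w_out, block, c)
    return patches.mean(axis=(1, 3))


# Synthetic frame: constant skin-like intensity plus additive sensor noise (assumed Gaussian).
rng = np.random.default_rng(0)
clean = np.full((64, 64, 3), 120.0)
noisy = clean + rng.normal(0.0, 5.0, size=clean.shape)

averaged = spatial_average(noisy, block=8)
print(averaged.shape)                # (8, 8, 3)
print(noisy.std(), averaged.std())   # the averaged frame is far less noisy
```

Averaging N independent noisy pixels shrinks the noise standard deviation by roughly a factor of the square root of N, which is why the Vn(t) term can be treated as negligible after this step.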

Secondly, we re-arrange the components of C(t) by splitting them into their constituent specular and diffuse reflection parts and group the resulting terms based on their DC/AC nature (as mentioned previously).

Quantifying the colour signal C(t) received at the kth pixel as a simplification. Image by Author.

Notice that the time-independent DC terms from the specular and diffuse components are collected into a single term, Uc, for simplicity. We are now left with a combination of an aggregate DC reflection term Uc and two time-varying signals s(t) and p(t), all of which are being modulated by the intensity term i(t). While I’m aware that this might be considered a bit of an over-simplification, the basic idea is to extract the pulsatile signal of interest, p(t), from the colour signal C(t) that we receive at the camera sensor.
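The simplified equation from the figure above is not reproduced in this text. Going only by the description in the previous paragraph, a sketch of it could be written as follows, where the intensity term (written I(t) earlier and i(t) in the paragraph above) is shown as I(t); any colour-direction vectors or interaction terms in the original figure are assumed to be omitted here:

```latex
C_k(t) \approx I(t)\,\bigl(U_c + s(t) + p(t)\bigr)
```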

As mentioned before, traditionally, the r-PPG extraction task has been treated as a signal processing problem. However, recent advances using deep-learning-based approaches have shown enough merit to justify further research.

Enter Deep Learning

Deep Learning, specifically computer vision, has given us the ability to develop end-to-end deep neural models for solving many complex tasks that would otherwise require multi-stage processing pipelines, often involving hand-crafted feature manipulations. The pandemic has brought about a bona fide transformation in healthcare all over the world. For example, remote consultation and diagnosis via telehealth platforms, or even video conferencing, have become commonplace over the past year or so. Hence, an end-to-end framework for recovering physiological signals is desirable and seems like the logical next step, given the context.

Visualising an end-to-end Deep Learning pipeline for r-PPG extraction. Image by Author.

This is a rough visualisation of what an end-to-end r-PPG pipeline using Deep Learning would look like. A lot of research is going into making this pipeline robust to videos affected by motion (i.e. where the subject might be moving, as in a fitness activity) and to videos affected by poor or heterogeneous illumination. The underlying premise, however, is to extract the r-PPG pulse signal p(t) from the input videos as well as possible.

I advise you to take all this information (especially the mathematical modelling) with a pinch of salt, as there are certain interaction terms that I’ve purposely left out for the sake of simplicity. I want the reader to focus on the idea that a video taken by even the most basic smartphone camera contains a trace of the pulsatile signal p(t), which can be extracted given the right set of tools. This plethysmographic signal p(t) can be leveraged to determine crucial vital signs like heart rate, breathing rate and SpO2 levels (a measure of oxygen saturation in the blood).
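As a rough illustration of how a heart rate could be read off once a pulse trace is available, here is a minimal sketch that picks the dominant frequency of the trace. This is a classical signal-processing baseline, not the deep learning pipeline discussed above. The trace is synthetic, and the 0.7 to 4 Hz search band (42 to 240 beats per minute), the 30 fps frame rate and the function name are all assumptions made for the example.

```python
import numpy as np


def estimate_heart_rate_bpm(signal, fps, low_hz=0.7, high_hz=4.0):
    """Estimate heart rate as the strongest frequency within an assumed plausible band."""
    signal = np.asarray(signal, dtype=np.float64)
    ac = signal - signal.mean()                      # drop the DC component
    spectrum = np.abs(np.fft.rfft(ac))               # magnitude spectrum
    freqs = np.fft.rfftfreq(len(ac), d=1.0 / fps)    # frequency of each bin in Hz
    band = (freqs >= low_hz) & (freqs <= high_hz)    # keep only plausible pulse rates
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz


# Synthetic trace: DC level + slow illumination drift + 1.2 Hz pulse + noise.
fps = 30.0
t = np.arange(600) / fps                             # 20 seconds of video
rng = np.random.default_rng(0)
trace = (100.0
         + 2.0 * np.sin(2 * np.pi * 0.1 * t)         # drift, outside the search band
         + 0.5 * np.sin(2 * np.pi * 1.2 * t)         # pulse at 72 beats per minute
         + rng.normal(0.0, 0.2, size=t.shape))

bpm = estimate_heart_rate_bpm(trace, fps)
print(round(bpm, 1))  # 72.0
```

Breathing rate and SpO2 need more than a single spectral peak, so they are deliberately left out of this sketch.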

Through this introductory article, I hope the reader can appreciate the promise and potential of applying Deep Learning to the task of remote-PPG estimation, modelled as a computer vision task.

Hi! Thank you for making it through the article. Broadly speaking, this is the topic of my graduate thesis. For the sake of introducing the idea and building the premise, I have abstracted quite a few technicalities. For the more technically inclined, I am adding a list of some useful articles. Additionally, hit me up if you want to know more! And give me a follow for more content :)


  1. “Photoplethysmogram”
  2. “Photoplethysmography by holographic laser Doppler imaging” by username: Micatlan, Wikipedia, licensed with CC BY-SA 4.0
  3. Chen et al. DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks. Proceedings of the European Conference on Computer Vision (ECCV), 2018
  4. “EEG Brain Scan” by Tim Sheerman-Chase, licensed with CC BY 2.0
  5. “Rachel on Facetime” by Wesley Fryer, licensed with CC BY-SA 2.0
  6. Scientific blog on r-PPG

Modelling Physiological Signal Estimation as a Deep Learning Problem was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.