What is it? Where and When is it used?
The goal of this article is to introduce Principal Component Analysis (PCA), a useful method that can speed up the development of your machine learning project. The second goal is to answer a question: should you really care about PCA at all? Those are the two main reasons I decided to write this article. In the following paragraphs, I will explain what PCA is, and where and when it is used. There are plenty of articles on Medium that explain how PCA works from both a math and a code perspective, so here I will focus mainly on its usage and its pros and cons.
What is PCA — Principal Component Analysis?
Principal Component Analysis, or PCA for short, is a dimensionality reduction method: it reduces the dimensionality of a large dataset by replacing the original features with a smaller set of new ones (the principal components), each a linear combination of the original features. Reducing the number of features comes with a trade-off between accuracy and simplicity. As you reduce the dimensionality, you lose some “knowledge” about the dataset and therefore usually some accuracy. PCA is still very helpful because its result, a dataset with fewer dimensions, is easier to use, explore and visualize. A further benefit is that machine learning algorithms run faster on the reduced dataset, which also speeds up development. To make things clear, let’s define some of the terminology I use in this article:
- Visualization of datasets
Imagine we have a tabular dataset with 100 columns. These columns are our features, so the dataset is made up of 100 features. The number of features defines the dataset’s dimensionality: each record is a point in a 100-dimensional space. The rows are the individual records; for instance, we might have 50 records (rows), each with a value in all 100 columns. To visualize this dataset, you would normally plot it in 2D or 3D space. This process is called visualization of the dataset.
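To make the 50-rows-by-100-columns example concrete, here is a minimal sketch of reducing such a table to 2 columns that could be plotted. It assumes NumPy is available and uses random synthetic data; the helper name `pca_reduce` is my own, not a library function.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (rows = records, columns = features) onto its top principal components."""
    # PCA works on centered data, so subtract each column's mean.
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Synthetic stand-in for the dataset described above: 50 records, 100 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))

X_2d = pca_reduce(X, n_components=2)
print(X.shape, "->", X_2d.shape)  # (50, 100) -> (50, 2)
```

The two resulting columns can be passed directly to a scatter plot. In a real project you would more likely reach for a ready-made implementation in your ML library of choice than write this by hand.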
Where is it used?
PCA is widely used in areas such as data science projects, face recognition and image compression. Take a data science project as an example: imagine you have tabular data with hundreds of features (columns). Visualizing the dataset is very important in such projects, but with that many features it becomes almost impossible. To visualize your dataset in 2D or 3D space, you need to reduce the number of features to a small number. This is where PCA helps: it condenses the dataset into a small number of new features, e.g. 2 or 3, that describe it as well as possible.
When should it be used?
Among the main scenarios, I can highlight three use cases where PCA can be helpful:
- Better perspective and less complexity: PCA is useful when we need an intuitive understanding of a given dataset and do not need to keep all of its many features.
- Better visualization: When we cannot get a good visualization because of the high number of dimensions, we use PCA to project the data down to a 2D or 3D “shadow” of itself.
- Reduced size: When we have too much data, PCA can keep only a small fraction of the features (e.g. 1%) while still explaining as much of the variance in the dataset as possible.
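For the “reduced size” case, a common way to decide how many components to keep is to look at how much variance each one explains. The sketch below assumes NumPy and synthetic data built from 3 hidden factors; the function names and the 95% threshold are illustrative choices of mine, not fixed rules.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance along each component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(X, threshold=0.95):
    """Smallest number of components whose cumulative variance reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Synthetic data: 20 features that are really driven by 3 hidden factors plus tiny noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

k = components_needed(X, threshold=0.95)
print("components kept:", k, "out of", X.shape[1])
```

Because the 20 columns here are almost entirely determined by 3 factors, at most 3 components are needed to pass the threshold. On real data the drop-off is usually less clean, so it is worth inspecting the ratios before picking a number.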
Is PCA always recommended?
Spoiler — No!
After all this praise for the algorithm, it is time to look at the other side of the coin. Applying PCA indiscriminately to any project is always a bad idea. PCA may be very helpful in one kind of ML application but a poor fit for another. So before applying PCA to any of your projects, you should be clear about why you need it and weigh its pros and cons. One important detail: before applying PCA, you should know whether your dataset has the characteristics that make PCA the right choice for your project.
One of the main limitations of PCA is its linearity. PCA is a linear model that tries to find linear relationships between variables, but in reality your variables may have non-linear relationships, and in that case PCA is a bad choice for your project. Apply PCA only when it is really needed. So here is my recommendation: do not blindly apply PCA to every ML project you work on, and always consider its pros and cons before you start.
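To illustrate the linearity limitation, here is a small sketch (assuming NumPy, with toy data I made up) that compares two 2D datasets. Both are fully described by a single underlying variable, but only one of them is a linear relationship.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Linear relationship: y = 2x. One component captures essentially everything.
x = np.linspace(-1.0, 1.0, 400)
line = np.column_stack([x, 2.0 * x])

# Non-linear relationship: points on a circle, fully described by one angle t.
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])

line_ratio = explained_variance_ratio(line)
circle_ratio = explained_variance_ratio(circle)
print("line:  ", np.round(line_ratio, 3))    # first component holds ~all the variance
print("circle:", np.round(circle_ratio, 3))  # variance split ~evenly between both
```

For the line, dropping the second component loses nothing. For the circle, PCA sees two equally important directions, so reducing to one component throws away about half of the variance even though the data really depends on a single variable. This is the kind of dataset where a non-linear method is the better choice.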