3️⃣Dimensionality Reduction (PCA, LLE, t-SNE)
Dimensionality reduction is a technique used in machine learning and statistics to reduce the number of features, or dimensions, in a dataset while retaining as much of the relevant information as possible. The goal is to simplify the data without losing the structure that matters for the task at hand.
There are two main approaches to dimensionality reduction: feature selection and feature extraction.
Feature selection: This approach involves selecting a subset of the original features, based on certain criteria such as correlation or mutual information. Feature selection methods include techniques such as forward selection, backward elimination, and recursive feature elimination.
Feature extraction: This approach involves creating a new set of features from the original features, based on certain mathematical transformations such as linear or non-linear projections. Feature extraction methods include techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Non-negative Matrix Factorization (NMF).
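To make the feature-selection approach concrete, the sketch below applies recursive feature elimination (RFE) with scikit-learn; the synthetic dataset and the logistic-regression estimator are illustrative choices, not part of the original text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic dataset: 10 features, of which only 4 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, n_redundant=2,
                           random_state=42)

# Recursively eliminate features until 4 remain, ranking them by the
# coefficients of a logistic-regression estimator.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

X_selected = selector.transform(X)  # keeps only the 4 selected columns
print("Selected feature mask:", selector.support_)
print("Reduced shape:", X_selected.shape)
```

Unlike feature extraction, the columns of `X_selected` are a subset of the original features, so they keep their original meaning.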

Dimensionality reduction is useful in many applications such as image and speech recognition, natural language processing, and predictive analytics. It can be used to improve the performance of machine learning algorithms by reducing the computational cost, increasing the interpretability of the model, and preventing overfitting.
Additionally, dimensionality reduction can also help to visualize high-dimensional data by reducing it to two or three dimensions, making it easier to interpret and understand.
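The section title also names t-SNE and LLE (Locally Linear Embedding), two non-linear methods commonly used for exactly this kind of 2D visualization. A minimal sketch using scikit-learn, on a synthetic dataset of our choosing:

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, LocallyLinearEmbedding

# Synthetic high-dimensional data: 150 points, 10 features, 3 clusters.
X, y = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# t-SNE preserves local neighborhood structure; mainly for visualization.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# LLE reconstructs each point from its nearest neighbors before embedding.
X_lle = LocallyLinearEmbedding(n_components=2,
                               n_neighbors=10).fit_transform(X)

print(X_tsne.shape, X_lle.shape)  # both are 2D embeddings of the 10D data
```

Both methods return 2D coordinates that can be passed straight to a scatter plot, colored by cluster label.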
However, dimensionality reduction is not always necessary; whether it helps depends on the problem and the dataset. It is important to evaluate the model's performance before and after dimensionality reduction, to make sure performance has not been degraded.
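One way to carry out that before/after check is to compare a classifier's accuracy on the original features against its accuracy on PCA-reduced features. The setup below is our own illustration with scikit-learn, not code from the original text:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: train on all 20 features.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc_full = clf.score(X_te, y_te)

# Reduced: fit PCA on the training set only, then reuse it on the test set.
pca = PCA(n_components=5).fit(X_tr)
clf_pca = LogisticRegression(max_iter=1000).fit(pca.transform(X_tr), y_tr)
acc_pca = clf_pca.score(pca.transform(X_te), y_te)

print(f"accuracy with all features: {acc_full:.3f}")
print(f"accuracy after PCA:         {acc_pca:.3f}")
```

Note that the PCA is fit on the training split only; fitting it on all of the data would leak information into the test set.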
Example
In this example, the make_blobs function is used to generate synthetic data with four features and four clusters. The PCA class is then imported and used to perform dimensionality reduction on the data. The number of components is set to 2, so the final dataset has only two features. The fit_transform() method fits the model to the data and transforms the data into the new feature space.
Finally, the results are plotted as a scatter plot in which the x and y axes are the first and second principal components, respectively, and the color of each point indicates the cluster it belongs to. PCA transformed the data from 4D to 2D while capturing most of the variance and preserving the main structure of the original data.
Note that this is a simple example; real-world datasets are more complex, and the number of components should be chosen based on the problem and the amount of variance you want to capture, but the idea behind PCA remains the same. PCA is a powerful dimensionality reduction method that can improve the performance of machine learning algorithms, visualize high-dimensional data, and extract the most informative features from the data.
Python code
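The original listing is not preserved here; the following is a reconstruction of the example described above, assuming scikit-learn and matplotlib:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display the plot
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Generate synthetic data: 4 features, 4 clusters.
X, y = make_blobs(n_samples=300, n_features=4, centers=4, random_state=42)

# Reduce from 4 dimensions to 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the two principal components, colored by cluster.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="viridis")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("PCA of synthetic blob data (4D to 2D)")
plt.savefig("pca_blobs.png")
```

The explained variance ratio shows how much of the original variance each principal component captures; with well-separated blobs, the first two components typically account for most of it.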
Output: a scatter plot of the first two principal components, with points colored by cluster.
