3️⃣ Dimensionality Reduction (PCA, LLE, t-SNE)

Dimensionality reduction is a technique used in machine learning and statistics to reduce the number of features, or dimensions, in a dataset while retaining as much information as possible. The goal is to simplify the data without losing important information.

There are two main approaches for dimensionality reduction: feature selection and feature extraction.

  • Feature selection: This approach selects a subset of the original features based on criteria such as correlation or mutual information with the target. Common techniques include forward selection, backward elimination, and recursive feature elimination.

  • Feature extraction: This approach creates a new set of features from the original features through mathematical transformations such as linear or non-linear projections. Common techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Non-negative Matrix Factorization (NMF). A minimal sketch contrasting the two approaches follows this list.
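To make the difference between the two approaches concrete, here is a minimal sketch (the dataset, estimator, and parameter values are illustrative assumptions, not taken from the text above) that selects 4 of the original features with recursive feature elimination and, for comparison, extracts 4 new features with PCA:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

# synthetic dataset with 10 features, only some of which are informative
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

# feature selection: keep 4 of the original columns, ranked by a linear model
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)      # (300, 4) -- a subset of the original features
print(selector.support_)     # boolean mask showing which features were kept

# feature extraction: build 4 new features as linear combinations of all 10
X_extracted = PCA(n_components=4).fit_transform(X)
print(X_extracted.shape)     # (300, 4) -- new, transformed features

Note that feature selection keeps columns that already exist in the data, while feature extraction produces entirely new columns, which is why the extracted features are usually harder to interpret individually.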

Dimensionality reduction is useful in many applications such as image and speech recognition, natural language processing, and predictive analytics. It can improve the performance of machine learning algorithms by reducing computational cost, increasing the interpretability of the model, and helping to prevent overfitting.

Dimensionality reduction can also help visualize high-dimensional data by reducing it to two or three dimensions, making it easier to interpret and understand.
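As an illustrative sketch of that visualization use (the dataset and parameter values here are assumptions chosen for demonstration, not part of the example later in this section), t-SNE and LLE, the other two methods named in this section's title, can embed the 64-dimensional digits dataset into two dimensions for plotting:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, LocallyLinearEmbedding
import matplotlib.pyplot as plt

digits = load_digits()                                # 1797 samples, 64 features each

# non-linear embedding into 2 dimensions with t-SNE
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

# the same idea with Locally Linear Embedding (LLE)
X_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10).fit_transform(digits.data)

# plot the t-SNE embedding, colored by the digit each point represents
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=digits.target, cmap='rainbow', s=10)
plt.show()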

However, dimensionality reduction is not always necessary; whether it helps depends on the problem and the dataset. It is important to evaluate the model's performance before and after dimensionality reduction to make sure it is not negatively affected.
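One way to run that before-and-after check (a minimal sketch; the dataset, classifier, and number of components are assumptions chosen for illustration) is to cross-validate the same model with and without a PCA step in a scikit-learn pipeline:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # 30 original features

# baseline: scale the features and fit the classifier directly
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# reduced: insert a PCA step that keeps 10 components
reduced = make_pipeline(StandardScaler(), PCA(n_components=10),
                        LogisticRegression(max_iter=5000))

print("without PCA:", cross_val_score(baseline, X, y, cv=5).mean())
print("with PCA:   ", cross_val_score(reduced, X, y, cv=5).mean())

If the reduced pipeline scores noticeably worse, the dropped dimensions were carrying information the model needed, and dimensionality reduction should be reconsidered or tuned.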

Analogy:

Dimensionality reduction can be thought of as organizing a toolbox. Imagine you have a toolbox that contains a lot of tools, but you only use a few of them regularly. To make it easier to find and use the tools you need, you decide to remove some of the rarely used tools and organize the remaining ones in a way that makes sense to you.

This process is similar to dimensionality reduction in machine learning. You have a dataset with a lot of features, but not all of them are useful or relevant to the task at hand. By removing some of the less informative features, or by combining similar features, you can simplify the data and make it easier to use, just like how organizing a toolbox makes it easier to find and use the tools you need.

In this analogy, the tools in the toolbox can be thought of as the features in a dataset, and the process of removing or combining them is similar to the process of feature selection or feature extraction in dimensionality reduction. By reducing the number of features, you can make the data more manageable and interpretable, similar to how organizing a toolbox makes it more manageable and efficient to use.

Example

In this example, the make_blobs function is used to generate synthetic data with four features and four clusters. Then, the PCA class is imported and used to perform dimensionality reduction on the data. The number of components is set to 2, which means that the final dataset will have 2 features only. The fit_transform() method is used to fit the model to the data and transform the data to the new feature space.

Finally, the results are plotted using a scatter plot where the x and y axes are the first and second principal components, respectively, and the color of each point represents the cluster it belongs to. PCA transformed the data from 4D to 2D while capturing most of the variance in the original data and preserving its main structure.

It's important to note that this is a simple example; real-world datasets are more complex, and the number of components should be chosen based on the problem and the amount of variance you want to capture, but the idea behind PCA remains the same. PCA is a powerful method for dimensionality reduction that can be used to improve the performance of machine learning algorithms, visualize high-dimensional data, and extract the most informative features from the data.

Python code

from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# generate synthetic dataset
X, y = make_blobs(n_samples=200, n_features=4, centers=4, cluster_std=1.0, shuffle=True, random_state=0)

# perform PCA, keeping the first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# plot the results: first vs. second principal component, colored by cluster
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='rainbow')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()
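As a quick follow-up to the listing above (not part of the original code), the fitted PCA object reports how much of the original variance each retained component explains, which is one way to judge whether two components are enough:

# fraction of the total variance explained by each retained component
print(pca.explained_variance_ratio_)

# total fraction of variance captured by the 2D projection
print(pca.explained_variance_ratio_.sum())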

Output

(Scatter plot of the samples projected onto the first two principal components, with points colored by cluster.)

