K-Means Tutorial in Python

Hi everyone,

Welcome back to my series of Machine Learning Algorithms Tutorials. This time we’ll be looking at K-Means, one of the most popular and powerful clustering algorithms in Machine Learning. In this article, I will explain what it is, how it works, and why it’s useful for finding patterns in data. We’ll have our tutorial in Python as well, of course!

I hope you enjoy reading this article as much as I enjoyed researching and writing it. With this series, I developed a fascination for Machine Learning and how it can help us solve complex problems and make better decisions. If you share this passion, then you are in the right place. Let’s dive in!

K-Means is one of the most popular and simplest clustering algorithms. It is an unsupervised learning technique that aims to partition a set of data points into a number of groups (called clusters) based on their similarity.

The basic idea of K-Means is to assign each data point to the cluster whose center (called centroid) is closest to it. The centroid of a cluster is the average of all the data points in that cluster. The algorithm iterates until the centroids stop changing or a maximum number of iterations is reached.

First, we need to know what the algorithm requires in order to work. K-Means needs the following inputs:

  1. The number of clusters (k), where we specify the number of clusters we want the algorithm to group the data into.

  2. The data to be clustered. Each data point should have a set of features or attributes that describe it.

  3. Initial centroids for each cluster. These centroids can be randomly selected from the data points or manually specified.

Determining the optimal number of clusters (k) is an important step in using the k-means algorithm effectively. There are several methods that can be used to estimate the optimal value of k, including:

  • Elbow method, which involves plotting the sum of squared distances between each data point and its assigned centroid for different values of k. The value of k at which the rate of decrease in the sum of squared distances slows down and forms an elbow-like shape is considered the optimal number of clusters.

  • Silhouette method involves calculating the silhouette score for different values of k. The silhouette score measures how similar a data point is to its assigned cluster compared to other clusters. The value of k that maximizes the average silhouette score is considered the optimal number of clusters.

  • Gap statistic method, which involves comparing the within-cluster variation for different values of k to a null reference distribution. The value of k that maximizes the gap statistic is considered the optimal number of clusters.

It’s important to note that these methods are not foolproof and may not always give a clear indication of the optimal number of clusters. Therefore, it’s often useful to try multiple methods and compare the results to choose the best value of k.
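As a rough sketch of how two of these methods can be computed with scikit-learn (the synthetic data and variable names below are illustrative, not part of the tutorial):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias, sil_scores, k_values = [], [], range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # sum of squared distances (elbow method)
    sil_scores.append(silhouette_score(X, km.labels_))    # silhouette method

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(k_values, inertias, marker="o"); ax1.set_title("Elbow method")
ax2.plot(k_values, sil_scores, marker="o"); ax2.set_title("Silhouette method")
plt.show()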

The next thing to do is to initialize the centroids, there are different ways to initialize the k centroids in this algorithm, including:

  • Random initialization: k centroids are randomly selected from the data points. This is a simple and commonly used method, but it may result in suboptimal clustering if the initial centroids are not representative of the data distribution.

  • K-means++ Initialization: aims to select k centroids that are far apart from each other and representative of the data distribution. It involves selecting the first centroid randomly from the data points and then selecting subsequent centroids based on the distance from the previously selected centroids. This method typically results in better clustering performance than random initialization.

  • Manual Initialization: in some cases, the user may have prior knowledge about the data and the expected clusters, and can manually specify the initial centroids.

Notice that the choice of initialization method can affect the clustering result, so it’s often recommended to run the algorithm multiple times with different initializations and choose the best result.
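As a quick sketch of how this choice is exposed in scikit-learn (the cluster count here is arbitrary, and n_init controls how many times the algorithm is rerun with different initializations):

from sklearn.cluster import KMeans

km_random = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
km_plus = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
# Manual initialization: pass an array of shape (n_clusters, n_features) to init.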

Once we have our initialization method defined, we can start with our iterative process which consists of calculating the distance between the points and each centroid, assigning the points to each cluster, and updating the centroid positions.

For each data point in the dataset, the algorithm calculates the Euclidean distance between the point and each centroid. The Euclidean distance is simply the straight-line distance between two points in a Euclidean space, such as a two-dimensional plane. This metric is widely used because it is simple to compute, and it’s also an intuitive distance metric that can be easily understood and visualized.

Moreover, the Euclidean distance is suitable for continuous data and mathematical models. However, there are cases where the Euclidean distance may not be appropriate, such as text clustering problems. In that situation, the cosine distance metric, which measures the angle between two vectors, is commonly used instead.

The choice of distance metric depends on the nature of the data and the problem at hand. It is always a good practice to explore different metrics.

Once the distance is calculated, the algorithm assigns each data point to the cluster with the closest centroid.

After this step, the algorithm recalculates the centroid positions, which represent the mean of all data points assigned to each cluster. The next step is to repeat this iterative process until convergence is reached, which happens when the assignment of data points to clusters no longer changes or when the change is below a predefined threshold.
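To make the iterative process concrete, here is a minimal NumPy sketch of one K-Means run (illustrative only; it does not handle empty clusters):

import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # random initialization
    for _ in range(n_iters):
        # Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                        # assignment step
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:      # convergence check
            break
        centroids = new_centroids                                # update step
    return labels, centroids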

The final output of the algorithm is a set of k clusters, each represented by its centroid, and a label for each data point indicating its assigned cluster. The quality of the clustering result can then be evaluated using metrics such as the within-cluster sum of squares or the silhouette score.

After providing an overview of the k-means algorithm, it’s important to discuss its strengths and limitations. Understanding these is important for making informed decisions about its use in different applications.

Among its benefits we can include the following:

  • It’s computationally efficient and suitable for large datasets. This is because the algorithm only requires a few simple computations for each iteration, making it a suitable choice for clustering tasks where efficiency is an important consideration.

  • It is easy to understand and implement, as it does not require advanced mathematical or statistical knowledge, making it accessible to practitioners with varying levels of expertise in data science and machine learning.

  • It can handle data with a large number of dimensions. K-means is able to find patterns and structure in high-dimensional data, making it a valuable tool in many applications.

However, K-Means is not without its limitations, including:

  • The algorithm relies on the initial selection of centroids, which can affect the final clustering results. As we previously discussed, running the algorithm multiple times with different initializations can help mitigate, though not eliminate, this limitation.

  • K-means assumes that the clusters are spherical, which can lead to incorrect cluster assignments when clusters are non-spherical. In real-world datasets, clusters can have complex shapes and structures that do not fit the spherical assumption of k-means. In these cases, more advanced clustering algorithms such as density-based clustering or hierarchical clustering may be more appropriate.

  • The algorithm struggles with identifying clusters of varying sizes and densities. This is because the algorithm assigns data points to the closest centroid, which can result in one large cluster and several small clusters.

Overall, understanding the limitations of k-means is important for making informed decisions about when and how to apply the algorithm. It is worth noting that despite these limitations, K-Means remains one of the most widely used clustering algorithms in various domains because of its simplicity and efficiency.

The K-Means algorithm has several applications in various disciplines, including:

  1. Market segmentation: K-means clustering is often used in marketing to segment customers based on their behavior, preferences, and demographics. By grouping customers with similar characteristics, companies can tailor their marketing strategies to each segment and improve customer satisfaction and loyalty.

  2. Image segmentation: segmentation of images based on their color or texture features. This technique is commonly used in image compression, object recognition, and image retrieval.

  3. Anomaly detection: it can be used for anomaly detection in various fields, such as finance, cybersecurity, and fraud detection. By clustering normal data points and identifying outliers that do not belong to any cluster, k-means can help detect unusual patterns that may indicate fraudulent or suspicious activity.

  4. Bioinformatics: clustering of genes, proteins, or samples based on their expression levels or sequence similarity. This technique can help identify patterns in large biological datasets and enable researchers to study the relationships between different biological entities.

  5. Social Network Analysis: K-means clustering can be used in social network analysis to cluster users based on their behavior, interests, or social connections. By identifying groups of users with similar characteristics, researchers can gain insights into the structure and dynamics of social networks and predict user behavior.

While K-Means is a very effective algorithm with plenty of applications, it may not be suitable for some situations, such as categorical data, as mentioned earlier. These shortcomings have led to the development of variants and extensions of the algorithm, including K-Modes.

K-Modes is a clustering algorithm that is specifically designed for categorical data and is based on the same principles as K-Means. In this way, the algorithm represents an important extension and highlights the ongoing development of clustering techniques to meet the diverse needs of researchers and practitioners.

There are several variants and extensions of the K-Means algorithm that have been proposed. Some examples are K-Medoids, Fuzzy C-Means, and K-Prototype. The first one replaces the mean calculation with the selection of a representative data point from each cluster, known as a medoid, making it more robust to outliers and noise in the data.

Fuzzy C-Means assigns a degree of membership to each data point for every cluster. This allows for more nuanced clustering and can be useful when there is uncertainty or overlap between clusters. For example, in image segmentation, a pixel may belong to multiple regions with different colors, and this type of clustering can provide a more accurate representation of the underlying structure of the data.

Finally, the K-Prototype extension is a hybrid algorithm that combines both K-Means and K-Modes to cluster datasets with both numeric and categorical data. It assigns a weight to each feature based on its type and uses this to calculate the distance between data points.
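As a hedged illustration only, assuming the third-party kmodes package is installed and that mixed_data is a placeholder matrix whose columns 2 and 3 are categorical, K-Prototypes could be used roughly like this:

from kmodes.kprototypes import KPrototypes

kproto = KPrototypes(n_clusters=3, init="Cao", random_state=42)
# 'categorical' lists the indices of the categorical columns in the data matrix.
labels = kproto.fit_predict(mixed_data, categorical=[2, 3])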

These variants and extensions demonstrate the ongoing efforts to improve and adapt clustering algorithms to better suit the needs of different applications and types of data.

Python Tutorial

To ensure compatibility, it is recommended to use an Anaconda distribution for this tutorial. However, if you don’t have Anaconda installed and you want to use your trusted Kernel, you can manually install the required packages using pip. You can execute the provided code block by uncommenting from the “import sys” line onwards to automatically install the necessary packages.
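The install cell described above might look roughly like this (the exact package list is an assumption based on the libraries used later in the tutorial):

# Uncomment from the "import sys" line onwards to install the packages inside the notebook.
# import sys
# !{sys.executable} -m pip install numpy pandas matplotlib seaborn scikit-learn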

To perform this tutorial, we need to import several essential libraries in Python.

We import NumPy, a powerful library for numerical operations on arrays, which provides efficient mathematical functions and tools. Next, we import pandas, a widely used data manipulation and analysis library that allows us to work with structured data in a tabular format.

To visualize the results of our clustering analysis, we import matplotlib.pyplot, a plotting library that enables us to create types of charts and graphs. This will help us understand the patterns and relationships within the data.

For the actual clustering process, we import scikit-learn’s KMeans module. To ensure accurate results, we also import the StandardScaler module from scikit-learn’s preprocessing submodule. It is used for feature scaling, which helps to normalize the data and improve the performance of the clustering algorithm.

Lastly, we import silhouette_score from scikit-learn’s metrics module. The silhouette score is a metric used to evaluate the quality of the clustering results. It measures how well each data point fits within its assigned cluster.
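Putting those imports together, the cell might look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score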

Next, we use the housing data from a CSV file called “housing.csv”, which you can find on the following Kaggle page: California Housing Prices | Kaggle. We specify the columns of interest (longitude, latitude, and median_house_value).

Then we remove any rows that have missing values (NaN). This ensures that we are working with a clean and complete dataset for further analysis.
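A sketch of this loading step, assuming housing.csv sits next to the notebook and uses the Kaggle column names:

data = pd.read_csv("housing.csv")
data = data[["longitude", "latitude", "median_house_value"]].dropna()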

Now we import the seaborn library to create a scatter plot. We indicate to the function the axis values and the hue, which will help us understand the relationship between the longitude and latitude coordinates of the housing data with the median house value. It displays the following scatter plot.
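A possible version of that plot:

import seaborn as sns

sns.scatterplot(data=data, x="longitude", y="latitude", hue="median_house_value")
plt.show()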

We create an instance of the StandardScaler class from scikit-learn’s preprocessing module. This will help us to normalize the data and bring it to a standard scale.

When we call fit_transform, the method calculates the mean and standard deviation of each feature in the dataset and applies the scaling transformation accordingly. We save the result in the data_scaled variable.

By scaling the features, we ensure that they have a similar range and variance, which can be beneficial for certain machine learning algorithms and data analysis techniques.
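A sketch of the scaling step:

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)   # standardize each feature to zero mean, unit variance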

In this step, we initialize an empty list called “silhouette_scores” to store the scores. Then, we iterate through the range of k values from 2 to 10. For each value of k, we create an instance of the KMeans class with k clusters and fit the scaled data to the model.

Next, we calculate the silhouette score for the clustered data using the silhouette_score function, which measures the quality of the clustering results. The resulting score is appended to the silhouette_scores list.

Finally, we plot the silhouette scores against the values of k, where the x-axis represents the number of clusters (k), and the y-axis represents the silhouette coefficient. The plot is labeled with appropriate axes labels and a title, and displayed.
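The loop and plot described above might look like this:

silhouette_scores = []
k_values = range(2, 11)
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data_scaled)
    silhouette_scores.append(silhouette_score(data_scaled, kmeans.labels_))

plt.plot(k_values, silhouette_scores, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette coefficient")
plt.title("Silhouette scores for different values of k")
plt.show()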

After plotting this graph of the silhouette coefficient for different values of k, we can analyze the results to determine the optimal number of clusters for our data. We need to identify the value of k that corresponds to the peak, or highest, silhouette coefficient on the graph. This will be the number of clusters that yields the most distinct and well-separated groups within the data. In this case, k equals 2.

We set the number of clusters (k) to 2, partitioning the data into two distinct groups. We then create an instance of the KMeans class with the specified number of clusters. We use k-means++ as our init method, which is widely used and helps improve the convergence of the algorithm. Additionally, we set a random state of 42 to ensure reproducibility of the results.

After that, we fit the scaled data to the KMeans model using the fit() method. This process calculates the cluster centroids and assigns each data point to its corresponding cluster based on the proximity of the centroids.

Finally, we obtain the cluster labels and centroids and set the scatter plot labels to show our graph.
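A sketch of this final clustering step, plotting the first two scaled columns (longitude and latitude):

k = 2
kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(data_scaled)
centroids = kmeans.cluster_centers_

plt.scatter(data_scaled[:, 0], data_scaled[:, 1], c=labels, cmap="viridis", s=10)
plt.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=200)  # centroids as red X's
plt.xlabel("Longitude (scaled)")
plt.ylabel("Latitude (scaled)")
plt.title("K-Means clusters (k=2)")
plt.show()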

The cluster centroids are marked as red X’s. For further exploration of this algorithm, you can visit our GitHub Repository, where you can access the complete code for convenient execution and customization.

In conclusion

K-Means is a popular clustering algorithm in machine learning that aims to partition data points into clusters based on their similarity. It is an unsupervised learning technique that can find patterns in data.

The algorithm works by iteratively assigning data points to the cluster with the closest centroid and updating the centroids based on the assigned points. It continues this process until convergence.

K-Means requires specifying the number of clusters (k), providing the data to be clustered, and initializing the centroid. Determining the optimal value of k can be done using methods like the elbow method, silhouette method, or gap statistic method.

This technique has strengths such as computational efficiency, ease of implementation, and the ability to handle high-dimensional data. However, it has limitations such as sensitivity to initial centroid selection and the assumption of spherical clusters.

Evaluation of the clustering result can be done using metrics like the within-cluster sum of squares or silhouette score. K-Means finds applications in market segmentation, image segmentation, anomaly detection, bioinformatics, and social network analysis.

Are you eager to delve deeper into the fascinating world of machine learning and explore more powerful algorithms like K-Means? If so, I invite you to continue your learning journey and unlock the potential of this exciting field. You can explore our other tutorials and resources to provide in-depth explanations and practical examples that can guide you step-by-step through implementing Machine Learning algorithms.

Additionally, consider joining our community, where you can engage with like-minded individuals, exchange insights, and collaborate on different kinds of projects. The collective wisdom and support can enhance your learning experience and open doors to exciting opportunities. The possibilities are waiting for you, start your journey now!

DBSCAN Algorithm Tutorial in Python

Density-based Spatial Clustering of Applications with Noise (DBSCAN)

In my previous article, HCA Algorithm Tutorial, we did an overview of clustering with a deep focus on the Hierarchical Clustering method, which works best when looking for a hierarchical solution. When we don’t want a hierarchical solution and we don’t want to specify the number of clusters in advance, the Density-based Spatial Clustering of Applications with Noise, abbreviated DBSCAN, is a fantastic choice.

This technique performs better with arbitrary-shaped clusters (clusters without a simple geometric shape like a circle or square) and with noise, meaning data points that don’t belong to any cluster. DBSCAN also aids in outlier detection by grouping points near each other with the use of two parameters termed Eps and minPoints.

Eps, also known as epsilon, is the radius of the circle we generate around every data point, while minPoints is the minimum number of data points that must fall within a central data point’s Eps radius in order to make it a core point.

All points that are not within another point’s Eps distance and do not have the corresponding number of minPoints within their own Eps distance are called noise or outliers.

The selection of Eps and minPoints must be done with specific focus because a single change in values might influence the entire clustering process. But how can we know the recommended values for our problems?

We can use some estimations based on our problem’s dataset. We should pick Eps depending on the distances within the dataset, and we may utilize a k-distance graph to assist us. We should be aware that a small value is generally preferable, since a large value would combine more data points per cluster and some information can be lost.


Example of variation of eps values. Image extracted from How to Use DBSCAN Effectively

It is evident that the most optimal results are achieved when the value of eps lies between 0.17 and 0.25. When the value of eps is smaller than this range, there is an excessive amount of noise or outliers, represented by the green color in the plot. On the other hand, when it is larger, the clusters become too inclusive; with an eps value of 1, everything merges into a single cluster.

The value of minPoints depends on the number of data points we have. However, we must consider that its minimum value should be the dataset dimensionality plus 1, meaning the number of features we’re working with plus one, while there is no fixed maximum. Therefore, the larger the data collection, the greater the minPoints value we should select.

Now that we understand the concepts, let’s see how this algorithm works. The first step is to classify the data points. The algorithm will visit all data points but arbitrarily selects one to start with. If the iteration confirms that there are at least minPoints points within the radius Eps around the selected data point, it considers these points part of the same cluster.

Then the algorithm will repeat the process with the neighbors just selected, possibly expanding the cluster until there are no more near data points. At that point, it will select another point arbitrarily and start doing the same process.

It is possible to have points that don’t get assigned to any cluster; these points are considered noise or outliers, and they are discarded once we stop iterating through the dataset.

Now that we have knowledge of how this algorithm works, let’s compute it with a simple Tutorial in Python.

Python Tutorial

As always, let’s use an Anaconda distribution, but in case you are not familiar with it, you can install the packages beforehand with pip, or you can run the following code block, uncommenting from the “import sys” line onwards:

Now, we’re going to make some basic imports to use them in our program.
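A sketch of those imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors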

For this tutorial, we’ll use a dataset known from our previous article, a list of mall customers (extracted from Machine Learning A-Z: Download Codes and Datasets - Page - SuperDataScience | Machine Learning | AI | Data Science Career | Analytics | Success). As before, we’re going to use the annual income and spending score values.
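A sketch of the loading step; the file name and column names below are assumptions based on that download:

dataset = pd.read_csv("Mall_Customers.csv")
X = dataset[["Annual Income (k$)", "Spending Score (1-100)"]].values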

At this point we have our data points in memory, so let’s compute a plot where we can see those points for a clearer view of this example.
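A possible version of that plot:

plt.scatter(X[:, 0], X[:, 1], s=15)
plt.xlabel("Annual income")
plt.ylabel("Spending score")
plt.show()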

As previously mentioned, to compute this algorithm we need some defined parameters: Eps and minPoints. Although there’s no automatic way to determine the minPoints value, we can compute some functions that will give us our Eps. This is done by computing the k-distance between all points in the dataset; the elbow of the curve will give us an approximation of the Eps value.

When we call this function, we get a floating-point (decimal) value, which we have to round.
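A rough sketch of this estimation step; the neighbor count and the elbow-finding heuristic below are stand-ins for whatever helper the original notebook used:

neighbors = NearestNeighbors(n_neighbors=5).fit(X)
distances, _ = neighbors.kneighbors(X)
kth_dist = np.sort(distances[:, -1])          # distance to each point's 5th nearest neighbor, sorted

plt.plot(kth_dist)
plt.ylabel("Distance to 5th nearest neighbor")
plt.show()

# Rough heuristic: take the point of maximum curvature of the k-distance curve and round it.
elbow_idx = np.argmax(np.diff(kth_dist, 2))
eps = round(float(kth_dist[elbow_idx]))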

In this case, we get an Eps value of 5, which we pass to our next function along with the estimated minPoints. Next, we give a label to every cluster, going from -1 (representing the noise or outliers) to 4 (our last visited cluster).
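A sketch of the fitting step; the min_samples value here is an assumption:

db = DBSCAN(eps=eps, min_samples=4).fit(X)
labels = db.labels_          # -1 marks noise/outliers, other integers mark clusters
print(set(labels))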

Finally, we make our scatter plot, assigning a color to each of the labels, which finishes this tutorial.
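A possible version of the final plot:

for label in set(labels):
    mask = labels == label
    name = "Noise" if label == -1 else f"Cluster {label}"
    plt.scatter(X[mask, 0], X[mask, 1], s=15, label=name)
plt.xlabel("Annual income")
plt.ylabel("Spending score")
plt.legend()
plt.show()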

In conclusion, the DBSCAN algorithm is a powerful and versatile method for clustering data in a variety of applications. It is particularly well-suited for handling data with irregular shapes and varying densities, and is able to identify noise points and outliers in the data. DBSCAN is also relatively easy to implement and does not require prior knowledge of the number of clusters in the data, making it a popular choice for exploratory data analysis.

However, like any clustering algorithm, DBSCAN has some limitations and assumptions that should be considered when applying it to real-world data. For example, it assumes that clusters are dense regions separated by areas of lower density, and may not perform well on data with very different density levels or noise points that are distributed throughout the data. Additionally, it requires careful selection of its parameters such as the neighborhood size and the minimum number of points required to form a cluster, which can affect the clustering results.

Overall, the DBSCAN algorithm is a valuable tool for data clustering and has been applied successfully in a wide range of fields including image processing, text mining, and bioinformatics. By understanding its strengths and limitations, researchers and practitioners can make informed choices about when and how to apply DBSCAN to their own data.

If you’re interested in exploring the code behind the project discussed in this article, I invite you to visit the corresponding repository on GitHub. There, you can find the source code and view it in action. By exploring the repository, you’ll gain a deeper understanding of how the project was built and how you might be able to use it for your own purposes. I’m available to answer any questions you may have, so don’t hesitate to reach out if you need support. You can find the link to the repository in the article’s footer or by visiting my GitHub profile. Happy coding!

Hierarchical Clustering Algorithm Tutorial in Python

When researching a topic or starting to learn about a new subject a powerful strategy is to check for influential groups and make sure that sources of information agree with each other. In checking for data agreement, it may be possible to employ a clustering method, which is used to group unlabeled comparable data points based on their features. For example, you can use clustering to group documents by topics.

On the other hand, clustering can also be used in market segmentation, social network analysis, medical imaging, and anomaly detection.

There are different types of clustering algorithms, and their application goes according to the type of problem or approach we want to implement. For example, if you’re searching for a hierarchical method, which implies you’re attempting a multi-lever learning technique and learning at multiple grain-size spaces, you may use hierarchical clustering.

Hierarchical clustering is a prominent Machine Learning approach for organizing and classifying data to detect patterns and group items to differentiate one from another.

The hierarchy should display the data in a manner comparable to a tree data structure known as a Dendrogram, and there are two methods for grouping the data: agglomerative and divisive.

Before entering into the deep knowledge of this, we’re going to explain the importance of a dendrogram on clustering. Not only does it give a better representation of the data grouping, but it also gives us information about the perfect number of clusters we might compute for our number of data points.

The agglomerative method is the most common type of hierarchical clustering, consisting of a “bottom-up” approach in which each object starts in its cluster, called a leaf, and the two most comparable clusters are joined into a new larger cluster at each phase of the algorithm, called nodes.

It is an iterative method, repeated until all points belong to a single large cluster called root that will contain all the data.

The divisive method, which is the opposite of the agglomerative method, is not often used. This divisive approach is less typically employed in hierarchical clustering since it is more computationally costly, resulting in slower performance.

To make this algorithm possible, using an agglomerative approach, we must complete the following steps:

The first step is to create a proximity matrix, computing the distance between each pair of data points, which means the distance between a data point and the others. The Euclidean distance function is commonly used, given by the formula

$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

where $p$ and $q$ are two data points and $n$ is the number of features.

With this in mind, the two data points we’re going to select will be according to our chosen linkage criterion. These criteria should be chosen based on theoretical concerns from the domain of application. A few commonly used ones are:

  • Min Linkage, also known as single-linkage, is the calculation of distance between the two most comparable components of a cluster, which means the closest data points. It can also be defined as the minimum distance between points. The Min technique has the benefit of being able to handle non-elliptical forms properly. One of its drawbacks is that it is susceptible to noise and outliers.

  • Max Linkage, also known as complete linkage, is based on the two least similar points of a cluster, or equivalently the maximum distance between points. This linkage method is less vulnerable to noise and outliers than the MIN technique, but it tends to break massive clusters and prefers globular clusters.

  • Centroid Linkage, which calculates the distance between the centroids of each cluster.

  • Average Linkage, defines cluster distance as the average pairwise distance between all pairs of points in the clusters.

When there are no theoretical concerns in our problem, it’s very helpful to use the linkage criterion called Ward Linkage. This method examines cluster variation rather than calculating distances directly, reducing the variance between clusters. This is accomplished by lowering or reducing the sum of squared distances between each cluster’s centroids. This method comes with the great advantage of being more resistant to noise and outliers.

Now that you’ve calculated the distance using your chosen linkage criterion, you can merge the data points, creating a new cluster for each pair. After that, all you have to do is keep iterating until you have a single cluster. You may accomplish this by generating a new proximity matrix and computing the distances using the same linkage criterion as before.

Tutorial in Python

Now that we’ve passed through all the basic knowledge, we’re ready to enter the Python tutorial.

Remember that we’re using the Anaconda distribution, so in case you’re not using it, you’ll need to make the corresponding pip installs if you haven’t used these libraries before.

python -m pip install pip

pip install numpy

pip install pandas

pip install matplotlib

pip install scipy

pip install scikit-learn

Or you can uncomment the first code block in our Jupyter Notebook and execute it.

Now that you have installed the packages, we’re going to start coding. First, we need to import some basic libraries

For this example, we’re using a Comma-separated values file, here we’re using a list of mall customers (extracted from Machine Learning A-Z: Download Codes and Datasets — Page — SuperDataScience | Machine Learning | AI | Data Science Career | Analytics | Success). In this file, we obtain a list of customers and information about their annual income and their spending score, and our goal is to separate them into clusters using the HCA Algorithm.

Now we’re going to assign to a variable our wanted data points, which are the annual income and spending score by customer. It’s stored as an array with only the values.
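A sketch of these first steps; the file and column names are assumptions based on the SuperDataScience download:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv("Mall_Customers.csv")
X = dataset[["Annual Income (k$)", "Spending Score (1-100)"]].values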

The next step is to make some other imports.

Next, we generate a dendrogram using the scipy library importing, since our problem isn’t involved in a theoretical approach and we want a simple result, we’re using a ward linkage to treat it.
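A sketch of the dendrogram step using SciPy:

import scipy.cluster.hierarchy as sch

plt.figure(figsize=(10, 6))
dendrogram = sch.dendrogram(sch.linkage(X, method="ward"))
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean distance")
plt.show()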

Taking a break from coding, we want to explain a little about the use of the dendrogram. As you read before, the dendrogram will give us the recommended number of clusters we should compute for our problem. But how can we know this? It’s as easy as drawing a horizontal line so that it intersects the greatest number of vertical lines, making sure not to hit a horizontal line or ‘branching point.’

When two clusters are merged, the dendrogram joins them in a node; each node has a vertical distance, which can be interpreted as its length on the y-axis.

In our problem’s dendrogram, we see that the line crossing the most considerable vertical distance, which means the greatest distance of a node, marks 5 different groups, meaning it recommends five clusters for the problem. With this in mind, we can advance to the next step.

Note: if the parameter “affinity” gives you an error, try changing it to “metric”, as the parameter was renamed in recent versions of the scikit-learn library.

We’re going to compute the Agglomerative Clustering using the Euclidean distance metric and the same linkage criterion. Notice that we specify that we want 5 clusters for this problem.
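A sketch of that step, using the renamed metric parameter:

from sklearn.cluster import AgglomerativeClustering

# "metric" replaces the deprecated "affinity" parameter in recent scikit-learn versions.
hc = AgglomerativeClustering(n_clusters=5, metric="euclidean", linkage="ward")
y_hc = hc.fit_predict(X)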

Then, we just assign the data points to a corresponding cluster.

Finally, we’re going to compute our scatterplot and give each cluster a respective tag, according to their spending score and their annual income.
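A sketch of the final plot; the cluster tags are illustrative, since the actual label-to-segment mapping depends on the fitted model:

cluster_tags = ["Careful", "Standard", "Target", "Careless", "Sensible"]   # illustrative tags
for i in range(5):
    plt.scatter(X[y_hc == i, 0], X[y_hc == i, 1], s=20, label=cluster_tags[i])
plt.xlabel("Annual income (k$)")
plt.ylabel("Spending score (1-100)")
plt.legend()
plt.show()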

This throws the final result: a scatter plot showing the five different clusters we computed and their respective legend.

In Conclusion

Hierarchical clustering is a robust method that works with unlabeled data, which is useful because most data, especially new or original data, does not come pre-labeled and thus requires a significant amount of time to classify/annotate. In this article, you learned the fundamental ideas of clustering, how the algorithm works, and some additional resources for a better understanding, such as dendrograms, Euclidean distance computation, and linkage criteria.

Despite its benefits, such as not requiring a fixed number of clusters, we must note that hierarchical clustering does not perform well with huge data sets owing to the high space and time complexity of the algorithm. This drawback comes from the need to calculate the pairwise distance between all data points, as discussed in our section on linkages, as well as the analysis of the dendrogram being too computationally intensive on huge data sets. Keeping these facts in mind, we understand that on huge data sets hierarchical clustering may take too long or require too many computational resources to provide useful results, but this type of algorithm is great for small to standard data sets and is particularly useful for an early understanding of unlabeled data.

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

Introduction to Computer Vision: Image segmentation with Scikit-image

Computer Vision is an interdisciplinary field in Artificial Intelligence that enables machines to derive and analyze information from imagery (images and videos) and other forms of visual input. Computer Vision imitates the human eye and is used to train models to perform various functions with the help of cameras, algorithms, and data rather than optic nerves and the visual cortex. Computer vision has very significant real-world applications, including facial recognition, self-driving cars, and predictive analysis. With self-driving cars, Computer Vision (CV) allows the car’s computer to make sense of the visual input from the car’s cameras and other sensors. In various industries, CV is used for many tasks, such as x-ray analysis in healthcare, quality control in manufacturing, and predictive maintenance in construction, just to name a few.

Outside of just recognition, other methods of analysis include

  • Video motion analysis which uses computer vision to estimate the velocity of objects in a video, or the camera itself.

  • In image segmentation, algorithms partition images into multiple sets of views which we will discuss later in this article.

  • Scene reconstruction which creates a 3D model of a scene inputted through images or video and is popular on social media.

  • In image restoration, noise such as blurring is removed from photos using Machine Learning based filters.

Scikit-Image

A great tool is Scikit-image which is a Python package dedicated to image processing. We’ll be using this tool throughout the article so to follow along you can use the code below to install it:

pip install scikit-image

# For Conda-based distributions

conda install -c conda-forge scikit-image

Basics for Scikit-image

Before getting into image segmentation, we will familiarize ourselves with the scikit-image ecosystem and how it handles images.

Importing images from skimage library

The skimage data module contains some built-in example data sets which are generally stored in jpeg or png format. We will use matplotlib, an amazing visualization library in Python for 2D plots of arrays, to plot the images. You can find the link to our notebook here. The operations listed below are illustrated in a combined sketch after the list.

  • Importing a grayscale image

  • Importing a colored image

  • Importing images from an external source

Various factors, such as color, format, and even size, affect the methods used to process images. Higher-contrast images would need more advanced tools.

  • Loading multiple images

A ValueError will be raised if images in the ImageCollection don’t have identical shapes.

  • Saving images
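A combined sketch of the operations listed above; the external URL and file patterns are placeholders:

import matplotlib.pyplot as plt
from skimage import data, io

gray_image = data.camera()         # built-in grayscale example image
color_image = data.astronaut()     # built-in color example image
# external = io.imread("https://example.com/photo.jpg")   # importing from an external source (placeholder URL)
collection = io.ImageCollection("*.png")                  # loading multiple images; stacking them requires identical shapes
io.imsave("camera_copy.png", gray_image)                  # saving an image

plt.imshow(color_image)
plt.show()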

Converting image format

RGB color model is an additive color model in which the red, green, and blue primary colors of light are added together at different intensities to reproduce a broad array of colors. RGB is the most common color model used today. Every television or computer monitor uses the RGB color model to display images.

  • RGB to Grayscale

To apply filters and other processing techniques, the expected input is a two-dimensional array, i.e. a monochrome image. This is great for basic segmentation but would not work properly with high-contrast images. The rgb2gray() function of the skimage.color module is used to convert a 3-channel RGB image to a one-channel monochrome image.
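A minimal sketch of the conversion:

from skimage.color import rgb2gray

gray_astronaut = rgb2gray(data.astronaut())
plt.imshow(gray_astronaut, cmap="gray")
plt.show()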

Output:

  • RGB to HSV

An HSV (Hue, Saturation, and Value) color model is a color model designed to more closely resemble how human vision perceives color. HSV is great for editing because it separates the lightness variations from the hue and saturation variations. The rgb2hsv() function is used to convert an RGB image to HSV format.
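A minimal sketch of the conversion, displaying the hue channel:

from skimage.color import rgb2hsv

hsv_astronaut = rgb2hsv(data.astronaut())
plt.imshow(hsv_astronaut[:, :, 0], cmap="hsv")   # hue channel
plt.show()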

Output:

Image Segmentation

Image Segmentation is the process of splitting an image into multiple layers, represented by an intelligent, pixel-wise mask. Simply put, it is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics (for example, color, intensity, or texture). It involves merging, blocking, and separating an image from its integration level. Splitting a picture into a collection of image objects with comparable properties is the first stage in image processing. For this article, we will cover image segmentation with thresholding using supervised and unsupervised algorithms.

Thresholding

This is a simple way of segmenting objects in the background by choosing pixels of intensities above or below a certain threshold value. It is a way to create a binary image from a grayscale or full-color image. This is typically done in order to separate “object” or foreground pixels from background pixels to aid in image processing.

Supervised learning

This type of segmentation requires external input that includes things like setting a threshold, converting formats, and correcting external biases.

Segmentation by Thresholding — Manual Input

For this part, an external pixel value ranging from 0 to 255 is used to separate the picture from the background. The intensity value for each pixel is a single value for a gray-level image or three values for a color image. This will result in a modified picture whose pixels fall above or below the specified threshold, as we will see below. To implement this thresholding, we first normalize the image from 0–255 to 0–1. A threshold value is fixed, and on comparison, if it evaluates to true, we store the result as 1, otherwise 0.
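A sketch of manual thresholding on a built-in grayscale image; the threshold value here is arbitrary:

import numpy as np

image = data.coins() / 255.0            # coins() is already grayscale; scale 0-255 down to 0-1
threshold = 0.5                         # manually chosen threshold
binary = np.where(image > threshold, 1, 0)
plt.imshow(binary, cmap="gray")
plt.show()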

Output:

This globally binarized image can be used to detect edges as well as analyze contrast and color differences.

Active Contour Segmentation

An active contour is a segmentation approach that uses energy forces and restrictions to separate the pixels of interest from the remainder of the picture for further processing and analysis. The active contour model is among the dynamic approaches in image segmentation that use the image’s energy restrictions and pressures to separate regions of interest. It is a technique for minimizing the energy function resulting from external and internal forces. An exterior force is specified as curves or surfaces, while an interior force is defined as picture data. The external force is a force that allows initial outlines to automatically transform into the forms of objects in pictures. Active contour segmentation, also called snakes, is initialized using a user-defined contour or line around the area of interest. This contour then slowly contracts and is attracted or repelled by light and edges. The snakes model is popular in computer vision, and snakes are widely used in applications like object tracking, shape recognition, segmentation, edge detection, and stereo matching.

In the example below, after importing the necessary libraries, we convert our image from the scikit-image package to grayscale. Then we plot and draw a circle around the astronaut’s head to initialize the snake. The active_contour() function fits snakes (active contours) to image features. A Gaussian filter is also applied to denoise the image. For the parameters alpha and beta, higher values of alpha will make the snake contract faster, while higher values of beta make the snake smoother.
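A sketch of that example; the initial circle’s center and radius are assumptions chosen to sit around the astronaut’s head:

import numpy as np
from skimage import data
from skimage.color import rgb2gray
from skimage.filters import gaussian
from skimage.segmentation import active_contour

img = rgb2gray(data.astronaut())
s = np.linspace(0, 2 * np.pi, 400)
r = 100 + 100 * np.sin(s)               # initial circle around the astronaut's head
c = 220 + 100 * np.cos(s)
init = np.array([r, c]).T

snake = active_contour(gaussian(img, sigma=3), init, alpha=0.015, beta=10, gamma=0.001)

plt.imshow(img, cmap="gray")
plt.plot(init[:, 1], init[:, 0], "--r")   # initial contour
plt.plot(snake[:, 1], snake[:, 0], "-b")  # fitted snake
plt.show()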

Output:

Chan-Vese Segmentation

The Chan-Vese segmentation algorithm is designed to segment objects without clearly defined boundaries. The well-known Chan-Vese iterative segmentation method splits a picture into two groups with the lowest intra-class variance. The implementation of this algorithm is only suitable for grayscale images. Some of the parameters used are lambda1 and mu. The typical values for lambda1 and lambda2 are 1. However, if the ‘background’ is very different from the segmented object in terms of distribution then these values should be different from each other, for example, a uniform black image with figures of varying intensity. Typical values for mu are between 0 and 1, though higher values can be used when dealing with shapes with very ill-defined contours. The algorithm then returns a list of values that corresponds to the energy at each iteration. This can be used to adjust the various parameters we have discussed above.

In the example below, we begin by using rgb2gray to convert our image to grayscale. The chan_vese() function is used to segment objects using the Chan-Vese Algorithm whose boundaries are not clearly defined. Then we will plot the output tuple of 3 values which are the original image, the final level image, and one that shows the evolution of energy.
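A sketch of that example; the parameter values are typical ones, and older scikit-image versions may call max_num_iter max_iter instead:

from skimage.segmentation import chan_vese

gray = rgb2gray(data.astronaut())
cv = chan_vese(gray, mu=0.25, lambda1=1, lambda2=1, max_num_iter=200, extended_output=True)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(gray, cmap="gray"); axes[0].set_title("Original image")
axes[1].imshow(cv[0], cmap="gray"); axes[1].set_title("Chan-Vese segmentation")
axes[2].plot(cv[2]); axes[2].set_title("Evolution of energy")
plt.show()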

Output:

Unsupervised Learning

This type of image segmentation thresholding algorithm requires no user input. Consider an image that is so large that it is not feasible to consider all pixels simultaneously. So in such cases, Unsupervised segmentation can break down the image into several sub-regions, so instead of millions of pixels, you have tens to hundreds of regions. You may still be able to tweak certain settings to obtain desired outputs.

SLIC (Simple Linear Iterative Clustering)

SLIC algorithm utilizes K-means, a machine learning algorithm, under the hood. It takes in all the pixel values of the image and tries to separate them out into the given number of sub-regions.

SLIC works well with color so we do not need to convert images to grayscale. We will set the subregion to the average of that region which will make it look like an image that has decomposed into areas that are similar. label2rgb() replaces each discrete label with the average interior color.
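A sketch of SLIC followed by label2rgb averaging:

from skimage.segmentation import slic
from skimage.color import label2rgb

img = data.astronaut()
segments = slic(img, n_segments=100, compactness=10)
averaged = label2rgb(segments, img, kind="avg")   # replace each segment with its average color
plt.imshow(averaged)
plt.show()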

Output:

Mark Boundaries

This technique produces an image with highlighted borders between labeled areas, where the pictures were segmented using the SLIC method.

In the example below, we have segmented the image into 100 regions with compactness = 1, and this segmented image will act as a labeled array for the mark_boundaries() function. The mark_boundaries() function returns images with the boundaries between labeled regions highlighted.
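A sketch of that step, continuing with the same image:

from skimage.segmentation import mark_boundaries

segments = slic(img, n_segments=100, compactness=1)
plt.imshow(mark_boundaries(img, segments))
plt.show()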

Output:

Felzenszwalb’s Segmentation

Felzenszwalb uses minimum-spanning tree clustering as the machine-learning algorithm behind the scenes. Felzenszwalb doesn’t tell us the exact number of clusters that the image will be partitioned into. It will run and generate as many clusters as it thinks are appropriate for that given scale or zoom factor on the image. This may be used to isolate features and identify edges.

In the example below, the seg.felzenszwalb() function is used to compute Felzenszwalb’s efficient graph-based image segmentation. The parameter scale determines the level of observation, and sigma is used to smooth the picture before segmentation. Scale is the sole way to control the quantity of generated segments as well as their size; the size of individual segments within a picture might change drastically depending on local contrast. This is useful in confining individual features, foreground isolation, and noise reduction, and can help analyze an image more intuitively.
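A sketch of that call; the parameter values are illustrative:

import skimage.segmentation as seg

img = data.astronaut()
segments_fz = seg.felzenszwalb(img, scale=100, sigma=0.5, min_size=50)
plt.imshow(segments_fz)
plt.show()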

Output:

We can calculate the number of unique regions the image was partitioned into.

Let’s recolor the image using label2rgb() just like we did with the SLIC algorithm.
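Continuing from the previous snippets:

print(f"Number of regions: {len(np.unique(segments_fz))}")
plt.imshow(label2rgb(segments_fz, img, kind="avg"))
plt.show()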

It is similar to a posterized image which is essentially just a reduction in the number of colors.

Conclusion

Image segmentation is a vital step in image processing. It is actively researched with its application in traffic and video surveillance to medical imaging. In this article, we have gone over image segmentation techniques using only the scikit image module. You could attempt some of these image segmentation methods with libraries like OpenCV. It is however important to mention some of the image segmentation techniques which use deep learning.

Training Neural Networks with JAX

JAX is a Python library made to boost machine learning research using accelerators like TPUs/GPUs. Due to its speed and efficiency, coupled with the familiarity of Python and NumPy, it has been widely adopted by machine learning researchers. Besides training neural networks faster, another advantage of JAX is that it saves memory cost and energy. In this tutorial, we’ll be using JAX to create a simple neural network which we’ll use to solve a regression task. If you are new to JAX, this article here is a solid introduction. For our example, we’ll use a small dataset available from Yellowbrick, an open-source, pure Python project that extends scikit-learn with visual analysis and diagnostic tools.

What is a neural network?

Simply put, it is a mathematical function that maps a given input, in conjunction with information from other nodes, to develop an output. It is inspired by and modeled on the human brain. In this tutorial, I won’t explain many of the basics of neural networks, so if you’re new to them I will refer you to this article here.

Regression

Regression is a method of investigating the relationship between independent variables or features and a dependent variable or outcome. It’s used as a method for predictive modeling in machine learning, in which an algorithm is used to predict continuous outcomes. We will create a neural network with JAX to solve a regression task, using the concrete compressive strength dataset available from Yellowbrick. Below we import JAX and some of its submodules, stax and optimizers, that we will use to train neural networks. We also import the jax.numpy module, as we’ll need it to convert input data to JAX arrays and for a few other calculations. Here is the link to our notebook.

Load dataset

We first load the concrete strength dataset available from Yellowbrick. Concrete is the most important material in civil engineering, and predicting the concrete compressive strength is our regression problem; the compressive strength is a highly nonlinear function of age and ingredients. We load the data features into variable X and the target values into variable Y. We split the dataset into train (80%) and test (20%) sets according to the Pareto principle, which states that “for many events, roughly 80% of the effects come from 20% of the causes”. After dividing the dataset, we convert each NumPy array to a JAX array using the jax.numpy.array() constructor. We also print the shapes of the train and test datasets at the end.
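A sketch of this loading step, assuming Yellowbrick’s load_concrete() loader and an 80/20 split:

import numpy as np
import jax
import jax.numpy as jnp
from jax.example_libraries import stax, optimizers
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_concrete

X, Y = load_concrete()                      # features and compressive strength target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, random_state=123)

# Convert everything to JAX arrays.
X_train, X_test, Y_train, Y_test = [jnp.array(np.asarray(a), dtype=jnp.float32)
                                    for a in (X_train, X_test, Y_train, Y_test)]
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)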

Normalize data

To normalize data, we first calculated the mean and standard deviation of the training dataset for each feature of data. We then subtracted the mean from both training and testing sets. Finally, we divided subtracted values by standard deviation. The main reason to normalize data is to bring the values of each feature to almost the same scale. This helps the optimization algorithm gradient descent to converge faster. When values of different features are on a different scale and vary a lot then it can increase training time because the gradient descent algorithm will have a hard time converging.
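A sketch of the normalization step, continuing from the arrays above:

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

X_train = (X_train - mean) / std
X_test = (X_test - mean) / std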

Creating the Neural Network

The JAX module, stax provides various readily available layers that we can stack together to create a neural network. The process of creating a neural network using the stax module is almost the same as that of creating a neural network using Sequential() API of keras. The stax module provides a serial() method that accepts a list of layers and activation functions as input and creates a neural network. It applies the layers in the sequence in which they are given as input when performing forward pass-through data. Using the Dense() method we can create fully connected and dense layers. We can also provide a weight initialization and bias initialization function if we don’t want internal initialization performed by JAX after we create the layer using Dense().

Most stax module methods return 2 callable functions as output when executed:

  1. init_fun — This function takes a seed for weight initialization of that layer/network and the input shape for that layer/network as input. It then returns the output shape together with the initialized parameters: for a single layer, just that layer’s weights and biases, and for a full network, a list of weights and biases, one entry per layer.

  2. apply_fun — This function takes weights & biases of layer/network and data as input. It then executes the layer/network on input data using weights. It performs forward pass-through data for the network.

All activation functions are available as basic attributes of the stax module and we don’t need to call them with brackets. We can just give them as input to the serial() method after layers and they will be applied to the layer’s output.

Below is an example of a Dense() layer with 5 units to show the output returned by it. We can notice that it returns two callable functions which we described above.
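A sketch of that cell:

dense_init, dense_apply = stax.Dense(5)
print(dense_init, dense_apply)   # two callables: an initializer and a forward-pass (apply) function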

We then create our neural network whose layer sizes are [5, 10, 15, 1]; the last layer is the output layer and all other layers are hidden layers. We create the layers using the Dense() method followed by the Relu (Rectified Linear Unit) activation function, and provide all the layers and activations to the serial() method in sequence. The Relu function that we use takes an array as input and returns a new array of the same size where all values less than 0 are replaced by 0.

By calling the init_fun() function we initialize the weights of our neural network. We give it a seed, i.e. jax.random.PRNGKey(123), and the input data shape. The seed and shape information will be used to initialize the weights and biases of each layer of the neural network.

Below I have printed the weights and biases for each layer after initializing the weights.

We can now perform a forward pass through our neural network. For this, we take a few samples of our data and give them as input to the apply_fun() function along with the weights. The weights are given first, followed by a small batch of data; apply_fun() then performs one forward pass through the data using the weights and returns predictions.
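A combined sketch of the network definition, weight initialization, and forward pass described above:

# Network with hidden layers of 5, 10, and 15 units and a single output unit.
init_fun, apply_fun = stax.serial(
    stax.Dense(5), stax.Relu,
    stax.Dense(10), stax.Relu,
    stax.Dense(15), stax.Relu,
    stax.Dense(1),
)

rng = jax.random.PRNGKey(123)
out_shape, weights = init_fun(rng, (-1, X_train.shape[1]))   # initialize weights and biases per layer

preds = apply_fun(weights, X_train[:5])                      # forward pass on a small batch
print(preds)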

Define loss function

In this part, we will calculate the gradient of the loss function with respect to the weights and then update the weights using the gradients. We’ll use the mean squared error (MSE) as our loss function. It simply subtracts predictions from actual values, squares the differences, and then takes their mean. Our loss function takes weights, data, and actual target values as input. It then performs a forward pass through the neural network using the apply_fun() function, providing weights and data to it. The predictions made by the network are stored in a variable, and we then calculate the MSE using the actual target values and the predictions.
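A sketch of that loss function:

def mse_loss(weights, X, Y):
    preds = apply_fun(weights, X).squeeze()      # forward pass through the network
    return jnp.mean((Y - preds) ** 2)            # mean squared error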

Train Neural Network

We will create a function that we will call to train our neural network. The function takes data features, target values, number of epochs, and optimizer state as input. The Optimizer state is an object created by the optimizer that has our model’s weights.

Our function loops a number of epochs time, each time, it first calculates loss value and gradients using the value_and_grad() function. This function takes another function as input, the MSE loss function in our case. It then returns another callable which when called will return the actual value of the function as well as the gradient of the function with respect to the first parameter which is weights in our case. In this instance, we have given our loss function to the value_and_grad() function as input and then called the returned function by providing weights, data features, and target values. With these three as inputs of our loss function, the call will return MSE value and gradients for weights and biases of each layer of our neural network.

Then we will call an optimizer state update method that takes epoch number, gradients, and current optimizer state that has current weights as inputs. The method returns a new optimizer state which will have weights updated by subtracting gradients from it. We will print MSE at every 100 epochs to keep track of our training progress and finally, we return the last optimizer state (final updated weights).
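A sketch of the training routine described above; opt_update and params_fn come from the optimizer created in the next step:

def train(X, Y, epochs, opt_state):
    for epoch in range(1, epochs + 1):
        # Loss value and gradients with respect to the current weights.
        loss, gradients = jax.value_and_grad(mse_loss)(params_fn(opt_state), X, Y)
        opt_state = opt_update(epoch, gradients, opt_state)   # update weights
        if epoch % 100 == 0:
            print(f"MSE after epoch {epoch}: {float(loss):.2f}")
    return opt_state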

Now that we have initialized an optimizer for our neural network, we can go into what it is. The optimizer is an algorithm responsible for finding the minimum value of our loss function. The optimizers module available from the example_libraries module of JAX provides us with a list of different optimizers. In our case, we’ll use the sgd() (stochastic gradient descent) optimizer. We initialize our optimizer by giving it a learning rate of 0.001.

The optimizer returns three callables necessary for maintaining and updating the weights of the neural network.

  1. init — This function takes weights of a neural network as input and returns the OptimizerState object which is a wrapper for holding and updating weights.

  2. update_fn — This function takes the epoch number, gradients, and optimizer state as input. It then updates the weights present in the optimizer state object by subtracting the learning rate times the gradients from them. It then returns a new OptimizerState object with the updated weights.

  3. params_fn — This function takes the OptimizerState object as input and returns the actual weights of the neural network.

Here we will train the neural network with the function we created in the previous cell. After initializing the optimizer with the weights, we call our training routine to actually perform training by providing the data, target values, number of epochs, and optimizer state (weights). We train the network for 2500 epochs.
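
A sketch of how these pieces fit together, again assuming the init_fun network and X_train/y_train arrays from earlier (sgd() returns the three callables described above):

```python
from jax.example_libraries import optimizers

opt_init, opt_update, get_params = optimizers.sgd(step_size=0.001)

rng = jax.random.PRNGKey(123)
_, weights = init_fun(rng, (-1, X_train.shape[1]))

opt_state = opt_init(weights)                       # wrap the initial weights
final_state = train(X_train, y_train, epochs=2500, opt_state=opt_state)
```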

Output:

As we can see from the MSE printed every 100 epochs, the model is getting better at the task.

Make Predictions

In this section, we make predictions for both the train and test datasets. We retrieve the weights of the neural network using the params_fn optimizer function, and then give the weights and data features as input to the apply_fun() function, which makes the predictions.
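
For example, assuming the trained final_state from above and a train/test split named X_train and X_test:

```python
final_weights = get_params(final_state)          # unwrap the trained weights

train_preds = apply_fun(final_weights, X_train)
test_preds = apply_fun(final_weights, X_test)
```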

Evaluating the Model performance

Here we will evaluate how our model is actually performing. We calculate the R² score for both our train and test predictions using the r2_score() method of scikit-learn. The score is at most 1, with values near 1 indicating a good model; it typically falls in the range [0, 1], though it can be negative for very poor fits.
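
A short sketch, assuming the predictions computed above and target arrays named y_train and y_test:

```python
from sklearn.metrics import r2_score

print("Train R²:", r2_score(y_train, train_preds.squeeze()))
print("Test  R²:", r2_score(y_test, test_preds.squeeze()))
```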

We can notice from the R² score that our model seems to be doing a good job.

Train the Model on Batches of Data

Some datasets are quite large and do not really fit into the main memory of the computer. In cases like this, we only bring a small batch of data into the main memory of the computer to train the model a batch at a time until the whole data is covered. The optimization algorithm used in this instance is referred to as stochastic gradient descent and it works on a small batch of data at a time.

The function we have below takes data features, target values, number of epochs, optimizer state (weights), and batch size (default 32) as input. It loops for the given number of epochs, calculating the start and end indexes of each batch of data at every step. We perform a forward pass, calculate the loss, and update the weights on a single batch of data at a time until the whole dataset is covered. When training in batches, we update the weights for each batch and repeat this over the whole dataset for the given number of epochs.
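
A sketch of such a batched training routine, reusing the loss and optimizer callables from the previous sections:

```python
def train_in_batches(X, y, epochs, opt_state, batch_size=32):
    step = 0
    for epoch in range(epochs):
        for start in range(0, X.shape[0], batch_size):
            X_batch = X[start:start + batch_size]
            y_batch = y[start:start + batch_size]
            loss, grads = value_and_grad(mse_loss)(get_params(opt_state), X_batch, y_batch)
            opt_state = opt_update(step, grads, opt_state)
            step += 1
        if epoch % 100 == 0:
            print(f"MSE after epoch {epoch} (last batch): {float(loss):.3f}")
    return opt_state
```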

Now we have the function for training our neural network. We initialize the weights of the neural network using init_fun, giving it a seed and the input shape. Next, we initialize our optimizer by calling the sgd() function with a learning rate of 0.001. Then we create the first optimizer state with the weights and call our function from the previous cell to perform training in batches. We will be training the neural network for 500 epochs.

Making predictions in batches

Because all the data cannot fit into our main memory we will make predictions in batches. Below is a function that takes weights and data as input and then makes predictions on data in batches.
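
A minimal sketch of batched prediction, assuming the trained weights and apply_fun from above:

```python
def predict_in_batches(weights, X, batch_size=32):
    preds = [apply_fun(weights, X[start:start + batch_size])
             for start in range(0, X.shape[0], batch_size)]
    return jnp.concatenate(preds)   # combine the per-batch predictions
```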

We will call the function above to make predictions on test and train datasets in batches. We will combine the prediction of the batches.

Evaluate model performance

We will calculate our R² score on train and test predictions to see how our network is performing.

Conclusion

You can attempt the example above yourself, even with other tasks like classification, which will use similar code. With modules like stax and optimizers from JAX, you write less code, which helps with efficiency. There are more libraries and modules from JAX to explore that may improve your machine learning research. As you can see, with JAX you can vastly improve the speed of your machine learning work, depending on your field of research.

References

https://coderzcolumn.com/tutorials/artificial-intelligence/create-neural-networks-using-high-level-jax-api

https://coderzcolumn.com/tutorials/artificial-intelligence/guide-to-create-simple-neural-networks-using-jax

https://jax.readthedocs.io/en/latest/notebooks/neural_network_with_tfds_data.html

SVD Algorithm Tutorial in Python

Singular Value Decomposition Algorithm



The Singular Value Decomposition is a matrix decomposition approach that aids in matrix reduction by generalizing the eigendecomposition of a square matrix (same number of columns and rows) to any matrix. It will help us to simplify matrix calculations.

If you don't have clear concepts of eigendecomposition, I invite you to read my previous article about the Principal Component Analysis Algorithm, specifically section 3: "Calculate the Eigendecomposition of the Covariance Matrix" (PCA Algorithm Tutorial in Python. Principal Component Analysis (PCA) | by Anthony Barrios | Accel.AI | Apr, 2022 | Medium). It will be of great help since SVD is a very similar approach to the PCA Algorithm, but done in a more general way: PCA works on a square matrix (the covariance matrix), while SVD can decompose any matrix.

In general, when we work with real-number matrices, the formula of SVD is the following:

M = UΣVᵀ

Where M is the m x n matrix we wish to decompose, U is the left singular m x m matrix that contains the eigenvectors of the matrix MMᵀ, the Greek letter Sigma (Σ) represents a diagonal matrix containing the square roots of the eigenvalues of MMᵀ (equivalently, of MᵀM) arranged in descending order, and V is the right singular n x n matrix, containing the eigenvectors of the matrix MᵀM.

For a simple understanding of the function of each matrix, we can say that the matrices U and Vᵀ cause a rotation, while the Sigma matrix causes scaling. A singular matrix, by contrast, refers to a matrix whose determinant is zero, indicating it doesn't have a multiplicative inverse.

Python Tutorial

That’s it! Now, let’s see a basic example of this algorithm using Python. We’ll consider this matrix for our demonstration.


The thing about Python and some of its libraries is that we can run the whole SVD Algorithm by calling a single function. But we can also recreate it to watch the step-by-step process. The first thing we'll do is import the libraries; we're using NumPy and SciPy. SciPy is an open-source Python library that is used to solve mathematical, scientific, engineering, and technological problems. It enables users to manipulate and visualize data using a variety of high-level Python commands, and it is built on the NumPy extension of Python. You can follow along in this Jupyter Notebook.


Now we're going to create some functions that make the corresponding calculations; they're all commented in case you want to study them. A sketch of what they might look like is shown below.
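
As a rough sketch of such helper functions (assuming a real matrix A with at least as many rows as columns and full column rank, so that no singular value is zero):

```python
import numpy as np

def svd_from_scratch(A):
    """Thin SVD of A via the eigendecomposition of A^T A: A ≈ U @ diag(s) @ Vt."""
    A = np.asarray(A, dtype=float)
    # Eigendecomposition of the symmetric matrix A^T A
    eigvals, V = np.linalg.eigh(A.T @ A)
    # Sort eigenpairs from largest to smallest eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    # Singular values are the square roots of the (non-negative) eigenvalues
    s = np.sqrt(np.clip(eigvals, 0, None))
    # Columns of U follow from A v_i = s_i u_i
    U = (A @ V) / s
    return U, s, V.T
```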



We create our matrix, which we'll call "A".



Now we’ll assign values to our variables by calling the functions we created in the previous steps.



When we print our variables, we’re going to obtain the following output:



And that's pretty much everything. Now, Python also allows us to call the SVD function directly (imported from the SciPy library), making the calculations simple and leaving far less room for error.
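
For instance, with a hypothetical 3 x 3 matrix standing in for the one shown above:

```python
import numpy as np
from scipy.linalg import svd

A = np.array([[3, 4, 3],
              [1, 2, 3],
              [4, 2, 1]])      # placeholder values; use your own matrix

U, s, Vt = svd(A)
print(U, s, Vt, sep="\n")

# Rebuild A to confirm the decomposition
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
```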




As we can see, the values are much the same, except for some sign changes. That said, researchers usually want to work quickly and efficiently, so we encourage using the SciPy function.

Applications

Now that we have the basics, note that the SVD algorithm is not used only for a plain decomposition; there are many applications we can build on top of it.

SVD can be used to calculate the pseudoinverse of a matrix. This is an extension of the matrix inverse for square matrices to non-square ones (meaning they have a different number of rows and columns). It's useful for recovering information from matrices that don't have an ordinary inverse. The SVD is used to compute the pseudoinverse of the matrix so we can work with it.
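
A small sketch of this idea, using a hypothetical non-square matrix M and NumPy's SVD (a more careful implementation would skip any zero singular values):

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])     # more rows than columns, so no ordinary inverse

# Pseudoinverse via SVD: M+ = V Σ+ Uᵀ, where Σ+ inverts the non-zero singular values
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(M_pinv, np.linalg.pinv(M)))   # True
```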

But the SVD Algorithm is also widely used as a dimensionality reduction method, specifically in image compression. With that said, let's see a Python example of image compression using the SVD Algorithm.

Image Compression in Python using SVD Algorithm

When we want to compress a file, we’re always looking for the most efficient approach with the lowest amount of unnecessary data. The smaller the image, the less the cost of storage and transmission. The SVD Algorithm will help us by decomposing a given matrix (an image is a matrix with different values representing colors) into three matrices to finally represent the image with a smaller set of values.

That way, image compression will be achieved while preserving the important features that make the original picture.

To start working with the algorithm, we're going to keep the first k singular values, together with the corresponding columns of U and rows of Vᵀ. In the following example, we're going to use the SVD Algorithm and show some variations according to the number of values we keep. For this demonstration, we'll use the photo of a kitten.


First, we’re going to import our libraries



At this point, we're very familiar with NumPy. Let's add other libraries like Matplotlib, a charting library for Python that works as an extension of NumPy. Matplotlib contains the image module, which supports basic image loading, rescaling, and display operations. Specifically, it contains imread, a function that will help us by reading the image as an array.

Also, we’re importing the pyplot interface, it provides an implicit, MATLAB-like graphing method.

Finally the os library. It has features for creating and deleting directories (folders), retrieving their contents, updating and identifying the current directory, and so forth.

Next, we'll adjust the size of the graphic, read our picture, and convert it to grayscale color, making it easier to see the distinctions between the several photos we'll be displaying.


When we call the plt.show function, it will display our new photograph in grayscale.



Now, let's see where the magic begins. We're going to compute the SVD using the function provided by NumPy.




At first, this might be tricky to follow, but what we're doing here is extracting the diagonal entries of the Sigma matrix and arranging them in descending order. Then, iteratively, we select the first k values of every matrix: the first k columns of U, the first k singular values, and the first k rows of Vᵀ. This way we can see how closely each truncated decomposition approaches the original image.
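
A condensed sketch of the whole procedure, assuming a hypothetical image file name and a few example values of k:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

img = imread('kitten.jpg')                 # hypothetical file name
gray = np.mean(img, axis=2)                # collapse the color channels to grayscale

U, s, Vt = np.linalg.svd(gray, full_matrices=False)

for k in (5, 20, 100):                     # keep only the first k singular values
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    plt.imshow(approx, cmap='gray')
    plt.title(f'Rank-{k} approximation')
    plt.axis('off')
    plt.show()
```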

When calling the last function, it will show us three different approaches with the given k values.

The first one won’t show us anything clear, but we can guess that the image is showing us a cat because of the pointy ears.



In the next picture, we can clearly see a cat, but obviously, it’s a blurry image, so we can’t appreciate the details the original image has.



The final image has a greater number of k values, which allows us to see a more clear picture. Sure, it has some details missing, but it’s a great number of values we can use for compression. This image’s great for future use in a project we want to work on, maybe a cat album page or a veterinary site.

In conclusion

The Singular Value Decomposition Algorithm is a powerful tool for dimensionality reduction. In this article, we made a quick review of some math terms that help us understand how this algorithm works and how it can be applied in important fields such as image compression.

In the case of image compression, where it is of great help for reducing transmission and storage costs, we saw that a simple use of this algorithm comes down to choosing an adequate value of k. As we increase this number, the reconstructed matrix looks more and more like the original image. In that sense it is a simple algorithm to apply in practice.

But image compression is not the only thing this algorithm works with; it can also be applied to other data, such as databases themselves or video compression. I encourage you to do further reading and practice with the algorithm we worked on in the Python tutorial sections. Happy coding!



Introduction to the JAX library for ML in Python

Introduction to the JAX library for ML in Python

JAX (Just After eXecution) is a recent machine learning library used for expressing and composing numerical programs. JAX is able to compile numerical programs for the CPU and even accelerators like GPU and TPU to generate optimized code all while using pure python. JAX works great for machine-learning programs because of the familiarity of Python and NumPy together with hardware acceleration. This is great for the definition and composition of user-wielded function transformations. These transformations include automatic differentiation, automatic batching, end-to-end compilation (via XLA), parallelizing over multiple accelerators, and more. Researchers use it for a wide range of advanced applications, from studying training dynamics of neural networks to developing Machine Learning solutions, to probabilistic programming, to developing accelerated numerical code, and to scientific applications in physics and biology. Various tests have shown that JAX can perform up to 8600% faster when used for basic functions. This is highly valuable for data-heavy application-facing models, or just for getting more machine learning experiments done in a day.

Already understand why you want to use JAX? Jump forward to the code!

Some of its vital features are:

  • Just-in-Time (JIT) compilation.

  • Enabling NumPy code on not only CPUs but GPUs and TPUs as well.

  • Automatic differentiation of both NumPy and native Python code.

  • Automatic vectorization.

  • Expressing and composing transformations of numerical programs.

  • Advanced (pseudo) random number generation.

  • More options for control flow.

JAX's popularity is rising in the deep-learning industry because of its speed, and it is used increasingly in machine learning programs and to accelerate research. JAX provides a general foundation for high-performance scientific computing, which is useful in many different fields, not just deep learning. If you want to build some sort of hybrid model-based / neural-network system, then it is probably worth using JAX going forward. If most of your work is not in Python, or you're using specialized software for your studies (thermodynamics, semiconductors, etc.), then JAX probably isn't the tool for you, unless you want to export data from these programs for some sort of custom computational processing. Suppose your area of interest is closer to physics/mathematics and incorporates computational methods (dynamical systems, differential geometry, statistical physics) and most of your work is in e.g. Mathematica. In that case, it's probably worth sticking with what you're using, especially if you have a large custom codebase.

Getting started with JAX

You can follow along in this Jupyter Notebook. Here we install JAX easily with pip from the command line (for example, pip install --upgrade "jax[cpu]" for a CPU-only build).

This, however, supports CPU only, which is useful for local development. If you want both CPU and GPU support, you should first install CUDA and cuDNN if they are not already installed. Also, make sure the jaxlib version matches the CUDA version you have.

Here is the JAX installation Github Guide for more installation options and troubleshooting.

We will import both JAX and Numpy into our notebook link here for a comparison of different use cases:
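
A minimal sketch of the imports used throughout the rest of the examples:

```python
import numpy as np            # classic NumPy, for comparison
import jax
import jax.numpy as jnp       # JAX's NumPy-compatible API

print(jax.devices())          # lists the available devices (CPU/GPU/TPU)
```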

Why use JAX?

Accelerated Linear Algebra (XLA compiler) — XLA is a domain-specific compiler for linear algebra that has been used extensively by Tensorflow and is one of the factors that make JAX so fast. In order to perform matrix operations as fast as possible, the code is compiled into a set of computation kernels that can be extensively optimized based on the nature of the code.

Examples of such optimizations include:

  • Fusion of operations: Intermediate results are not saved in memory

  • Optimized layout: Optimize the “shape” an array is represented in memory

Just-in-time compilation to speed up functions — Just-in-time compilation is a way of executing code that entails the compilation of the code at run time rather than before the execution. Just-in-time compilation comes with Accelerated Linear Algebra (XLA compiler). If we have a sequence of operations, the @jit decorator comes into play to compile multiple operations together using XLA. In order to use XLA and jit, one can use either the jit() function or @jit decorator.

Using the timeit command, we can see the improvement in execution time quite clearly. We use block_until_ready because JAX uses asynchronous execution by default. Although jit is incredibly useful in deep learning, it is not without limitations. One of its flaws is that when you use "if" statements in your function, jit may be unable to represent the function accurately.
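
A small sketch of this, using the kind of element-wise function commonly used to benchmark jit (the array size and function are only for illustration):

```python
import jax
import jax.numpy as jnp
from jax import jit

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1_000_000,))

def selu(x, alpha=1.67, lmbda=1.05):
    # element-wise activation used purely as a benchmark example
    return lmbda * jnp.where(x > 0, x, alpha * jnp.exp(x) - alpha)

selu_jit = jit(selu)                  # compile with XLA
selu_jit(x).block_until_ready()       # first call triggers compilation

# In a notebook you could then compare:
#   %timeit selu(x).block_until_ready()
#   %timeit selu_jit(x).block_until_ready()
```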

Auto differentiation with grad() function

As well as evaluating numerical functions, we also want to transform them. One transformation is automatic differentiation. JAX is able to differentiate through all sorts of python and NumPy functions including loops, branches, recursions, and more. This is very useful in deep learning as backpropagation becomes very easy.

In the example below, we define a simple quadratic function and take its derivative at the point 1.0. We will also work out the derivative manually in order to confirm that the result is correct.
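
A sketch of what that looks like, using a hypothetical quadratic:

```python
from jax import grad

def f(x):
    return 3.0 * x ** 2 + 2.0 * x + 1.0   # f'(x) = 6x + 2

df = grad(f)        # function returning the derivative of f
print(df(1.0))      # 8.0, matching the manual derivative at x = 1.0
```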

There is so much more to doing auto differentiation with JAX, if you are interested in its full capabilities, you can find more about it in the official documentation.

Auto-vectorization with vmap

Another transformation in JAX’s API that you might find useful is vmap(), the vectorizing map. It has the familiar semantics of mapping a function along array axes, but instead of keeping the loop on the outside, it pushes the loop down into a function’s primitive operations for better performance. When composed with jit(), it can be just as fast as adding the batch dimensions beforehand. In the example below we will take a function that operates on a single data point and vectorize it so it can accept a batch of these data points (or a vector) of arbitrary size.

vmap batches all the values together and passes them through the function, so it squares all values at once. When d(x) is run without vmap, the square of each value is computed one at a time and the results are appended to a list. Needless to say, the vectorized version gives a clear gain in both speed and memory efficiency.
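
A brief sketch of the idea (the function and batch here are illustrative):

```python
from jax import jit, vmap
import jax.numpy as jnp

def square(x):
    return x ** 2            # written for a single scalar

batched_square = vmap(square)      # now accepts a whole vector at once
xs = jnp.arange(10.0)
print(batched_square(xs))          # [0., 1., 4., ..., 81.]

fast_batched_square = jit(vmap(square))   # composes with jit for extra speed
```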

Replicate computation across devices with pmap

Pmap is another transformation that enables us to replicate the computation into multiple cores or devices and execute them in parallel. It automatically distributes computation across all the current devices and handles all the communication between them. You can run jax.devices() to check out the available devices.

Notice that DeviceArray is now SharedDeviceArray; this is the structure that handles the parallel execution. JAX also supports collective communication between devices. Suppose, for example, we want to perform an operation on values held on different devices, such as finding their mean: we need to gather the data from all devices and compute it collectively.
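
A small sketch of this pattern (it only does something interesting when more than one device is available):

```python
import jax
import jax.numpy as jnp
from jax import pmap

n = jax.device_count()
xs = jnp.arange(1.0, n + 1.0)      # one value per device

# Divide each device's value by the mean across all devices (a collective operation)
normalize = pmap(lambda x: x / jax.lax.pmean(x, axis_name='i'), axis_name='i')
print(normalize(xs))
```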

The function above collects all “x” from the devices, finds the mean, and returns the result to each device to continue with the parallel computation. The code above however will not run unless you have more than one device communicating with each other to have the parallel computation. With pmap, we can define our own computation patterns and exploit our devices in the best possible way.

Control flows

In Python programming, the order in which the program’s code is executed at runtime is called control flow. The control flow of a Python program is regulated by conditional statements, loops, and function calls.

Python has three types of control structures:

  • Sequential — whose execution process happens in a sequence.

  • Selection — used for decisions and branching, i.e., if, if-else statements

  • Repetition — used for looping, i.e., repeating a piece of code multiple times.

Control flow with autodiff

When using grad in your python functions you can use regular python control-flow structures with no problems, as if you were using Autograd (or Pytorch or TF Eager).

Control flow with jit

Control flow with jit however is more complicated, and by default, it has more constraints.

When jit-compiling a function we want to compile a function that can be cached and reused for many different argument values. To get a view of your Python code that is valid for many different argument values, JAX traces it on abstract values that represent sets of possible inputs. There are multiple different levels of abstraction and different transformations which use different abstraction levels. If we trace using the abstract value we get a view of the function that can be reused for any concrete value in the corresponding functions (e.g while working on different sets of arrays) which means we can save on compile time.

The function being traced above isn't committed to a specific concrete value. In the line with if x < 3, the expression x < 3 is a traced boolean. When Python attempts to coerce it to a concrete True or False, we get an error: we don't know which branch to take, and can't continue tracing. You can relax the traceability constraints by having jit trace on more refined abstract values. Using the static_argnums argument to jit, we can specify that certain arguments should be traced on their concrete values.
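
A sketch of the situation and the static_argnums workaround:

```python
from jax import jit

def f(x):
    if x < 3:                 # Python-level branch on the traced value
        return 3.0 * x ** 2
    else:
        return -4.0 * x

# jit(f)(2.0) would raise a tracer error: the branch needs a concrete value.
# Marking the argument as static tells JAX to trace on its concrete value instead:
f_jit = jit(f, static_argnums=(0,))
print(f_jit(2.0))             # 12.0 (note: recompiles for each new value of x)
```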

Asynchronous dispatch

Essentially what this means is control is returned to the python program even before operations are complete. It instead returns a DeviceArray which is a future, i.e., a value that will be produced in the future on an accelerator device but isn’t necessarily available immediately. The future can be passed to other operations before the computation is completed. Thus JAX allows Python code to run ahead of the accelerator, ensuring that it can enqueue operations for the hardware accelerator without it having to wait.

Pseudo-Random number generator (PRNG)

A random number generator has a state. The following “random” number is a function of the previous number and the state. The sequence of random values is limited and does repeat. Instead of a typical stateful PseudoRandom Number Generator (PRNGs) as in Numpy and Scipy, JAX random functions require a PRNG state to be passed explicitly as a first argument.

Something to note is PRNGs work well when dealing with vectorization and parallel computation between devices.
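
A short sketch of the explicit-key style:

```python
import jax

key = jax.random.PRNGKey(42)           # the PRNG state, passed around explicitly
key, subkey = jax.random.split(key)    # split before each use to get fresh randomness
x = jax.random.normal(subkey, (3,))
print(x)
```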

JAX vs NumPy

  • Accelerator Devices — The differences between NumPy and JAX can be seen in relation to accelerator devices, such as GPUs and TPUs. Classic NumPy's promotion rules are too willing to over-promote to 64-bit types, which is problematic for a system designed to run on accelerators. JAX uses floating-point promotion rules that are more suited to modern accelerator devices and are less aggressive about promoting floating-point types, similar to PyTorch.

  • Control Behavior — When performing unsafe type casts, JAX's behavior may be backend dependent and, in general, may diverge from NumPy's behavior. NumPy allows control over the result in these scenarios via the casting argument; JAX does not provide any such configuration, instead directly inheriting the behavior of XLA:ConvertElementType.

  • Arrays — JAX’s array update functions, unlike their NumPy versions, operate out-of-place. That is, the updated array is returned as a new array and the original array is not modified by the update.

  • Inputs — NumPy is generally happy accepting Python lists or tuples as inputs to its API functions; JAX, however, returns an error. This is deliberate, because passing lists or tuples to traced functions can lead to silent performance degradation that might otherwise be difficult to detect. The short sketch below illustrates the last two points.
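
A short sketch of the out-of-place updates and the list-input restriction:

```python
import numpy as np
import jax.numpy as jnp

np_arr = np.zeros(3)
np_arr[0] = 1.0                    # NumPy updates the array in place

jax_arr = jnp.zeros(3)
new_arr = jax_arr.at[0].set(1.0)   # JAX returns a new, updated array
print(jax_arr)                     # unchanged: [0. 0. 0.]
print(new_arr)                     # [1. 0. 0.]

# jnp.sum([1, 2, 3]) raises a TypeError; pass an array instead:
print(jnp.sum(jnp.array([1, 2, 3])))
```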

Conclusion

I have briefly covered what makes JAX a great library: it promises to make ML programming more intuitive, structured, and clean. There is so much more to this library that we haven't covered, so go ahead and explore more in-depth uses of JAX. You can learn more from its documentation here.

JAX also provides a whole ecosystem of exciting libraries like:

  • Haiku is a neural network library providing object-oriented programming models.

  • RLax is a library for deep reinforcement learning.

  • Jraph, pronounced “giraffe”, is a library used for Graph Neural Networks (GNNs).

  • Optax provides an easy one-liner interface to utilize gradient-based optimization methods efficiently.

  • Chex is used for testing purposes.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.

References

https://jax.readthedocs.io/en/latest/notebooks/quickstart.html

https://theaisummer.com/jax/

https://developer.nvidia.com/gtc/2020/video/s21989

https://www.shakudo.io/blog/a-quick-introduction-to-jax

Introduction to Machine Learning with Decision Trees

A machine learning model is a program that combs through data to learn, find patterns, and make predictions. A model is trained on data, called training data, from which an algorithm can reason and learn. An example of where this is used is an application that recognizes whether a voice is male or female: you can train a model by providing it with various voices labeled either male or female, and the algorithm will learn the differences in pitch or speech patterns and recognize whether a new voice is male or female. While there are various models in machine learning, in this tutorial we will begin with one called the decision tree. Decision trees are the basic building block for some of the best models in data science and they are easy to pick up.

Decision Tree

When it comes to decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision-making. It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. First, we need to understand decision nodes and leaves.

The leaf node gives the final outcome; the decision node is where the data splits. There are two main types of decision trees, these are classification and regression. Classification trees are when the target variable can take a discrete set of values. For these tree structures, the leaves represent class labels and the branches represent the features that lead to those class labels. Regression trees are when the target variable takes continuous values, usually numbers. For simplicity, we will begin with a fairly simple decision tree.

Making the model

Exploring the data

When beginning any machine learning project the first step is familiarizing yourself with the data. For this, we will use the pandas library which is the primary tool data scientists use when exploring and manipulating data. It is used with the import pandas as pd command below. To follow along, click this link to the jupyter notebook.

Demonstration of the import command.

A vital part of the pandas library is the data frame where data is represented in a tabular format similar to a sheet in Excel or a table in a SQL database. Pandas has powerful methods that will be useful for this kind of data. In this tutorial, we’ll look at a dataset that contains data for housing prices. You can find this dataset in Kaggle.

We will use the pandas function describe() that will give us a summary of statistics from the columns in our dataset. The summary will only be for the columns containing numerical values which are easier to use with most machine learning models. Loading and understanding your data is very important in the process of making a machine learning model.

We load and explore the data with the commands below:

Demonstration of the describe() command.
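
A minimal sketch of those commands (the file name is a placeholder for wherever you saved the Kaggle dataset):

```python
import pandas as pd

housing_data = pd.read_csv('train.csv')   # hypothetical path to the housing dataset
housing_data.describe()
```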

The summary results for the columns can be understood this way.

  • Count shows how many rows have no missing values.

  • The mean is the average.

  • Std is the standard deviation, which measures how numerically spread out the values are.

  • To interpret the min, 25%, 50%, 75%, and max values, imagine sorting each column from the lowest to the highest value. The first (smallest) value is the min. If you go a quarter way through the list, you’ll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced “25th percentile”). The 50th and 75th percentiles are defined analogously, and the max is the largest number.

Selecting data for modeling

Datasets sometimes have a lot of variables that make it difficult to get an accurate prediction. We can use our intuition to pare down this overwhelming information by picking only a few of the variables. To choose variables/columns, we’ll need to see a list of all columns in the dataset. That is done with the command below.

Demonstration of the .columns command.
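
Assuming the dataframe from the previous step, it is simply:

```python
housing_data.columns
```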

There are many ways to select a subset of your data but we will focus on two approaches for now.

  1. Dot notation, which we use to select the “prediction target”

  2. Selecting with a column list, which we use to select the “features”

Selecting The Prediction Target

We can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data. We’ll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices is:

Demonstration of the variable `y` assignment to housing price.
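
For instance, assuming the price column in this dataset is called Price:

```python
y = housing_data.Price   # dot notation; equivalent to housing_data['Price']
```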

Choosing Features

The variables/columns chosen to be added to the model and later used to make predictions are what are referred to as features. For this tutorial, it will be the columns that determine home prices. There are times when you may use all your columns as features and other times fewer features are preferred.

Our model will have fewer features. This is done by selecting multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

It looks like this:

Demonstration of the house_features list.

Traditionally this data is called X.

Demonstration of the variable `x` assignment to the selected housing features.
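
A sketch with hypothetical column names (use whichever columns exist in your dataset):

```python
house_features = ['Rooms', 'Bathrooms', 'LandSize', 'YearBuilt']
X = housing_data[house_features]
```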

You can review the data in the features using the .head() command:

Demonstration of the .head() command.

The Model

When creating our model we will use the library Scikit-learn which is easily the most popular library for modeling typical data frames. More on Scikit-learn.

There are steps to building and using an effective model, they are as follows:

  • Defining — There are various types of models other than the decision tree. Picking the right model and the parameters that go with it is key

  • Train/Fit — this is when patterns are learned from the data.

  • Predict — the model will make predictions from the patterns it learned when training.

  • Evaluation — Check the accuracy of the predictions

Below is an example of a decision tree model defined with scikit-learn and fitted with the features and the target variable.

Example of a decision tree model.
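
A sketch of the definition and fitting steps with scikit-learn:

```python
from sklearn.tree import DecisionTreeRegressor

# Define the model; random_state makes the run reproducible
housing_model = DecisionTreeRegressor(random_state=1)

# Fit (train) the model on the features X and the target y
housing_model.fit(X, y)
```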

The library is written as sklearn. By specifying a number for random_state you ensure you get the same results in each run. Many machine learning models allow some randomness in model training, so fixing the random seed like this is considered good practice.

We could predict the first few rows of the training data using the predict function.

Demonstration of the .predict() function.
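
For example:

```python
print(housing_model.predict(X.head()))
```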

How good is our model?

Measuring the quality of our model is imperative to improving it. A fitting measure of a model's quality is its predictive accuracy. There are many metrics used for summarizing model quality, but we'll begin with Mean Absolute Error (MAE for short): each error is converted to a positive number by taking its absolute value, and the average of these is taken. Simply put, on average our predictions are off by this value.

This is how to calculate the mean absolute error.

Calculating the mean absolute error.
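
A sketch of the in-sample calculation, using the model fitted above:

```python
from sklearn.metrics import mean_absolute_error

predictions = housing_model.predict(X)
print(mean_absolute_error(y, predictions))
```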

What we calculated above is called the "in-sample" score. We used the same sample of data for house prices to build and evaluate our model. Since the patterns were learned from the training data, the model appears accurate on the training data, but those patterns may not hold when new data is introduced, which is the real problem. Because a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model. The way to do this is to exclude some data from the model-building process and then use that data to test the model's accuracy on examples it hasn't seen before. This data is called validation data.

The scikit-learn library has the function train_test_split to break up the data into two pieces. We’ll use some of that data as training data to fit the model, and we’ll use the other data as validation data to calculate mean_absolute_error. Here is the code:

Calculating mean_absolute_error.
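
A sketch of that workflow:

```python
from sklearn.model_selection import train_test_split

# Split into training and validation sets; random_state keeps the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

housing_model = DecisionTreeRegressor(random_state=1)
housing_model.fit(train_X, train_y)

val_predictions = housing_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
```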

This is the difference between a model that is almost exactly right and one that is unusable for most practical purposes.

Overfitting and Underfitting

Overfitting refers to when a model takes too well to training data. This means that the inaccuracies and random fluctuations are picked up as patterns and concepts by the models. The problem with this is that these patterns do not apply to new data and this negatively impacts the models’ ability to generalize.

Overfitting can be reduced by:

  • Increasing the training data

  • Reducing the models’ complexity

  • Using a resampling technique to estimate model accuracy.

  • Holding back validation data

  • Limiting the depth of the decision tree with parameters(see below)

There are various ways to control tree depth but here we will look at the max_leaf_nodes argument. This will control the depth of our tree. Below we will create a function to compare MAE scores from different values for max_leaf_nodes:

Function to compare MAE scores.
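
A sketch of such a helper function:

```python
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)
```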

We can follow this by using a for-loop to compare the accuracy of the model with different values for max_leaf_nodes.

for-loop to compare the accuracy of the model.
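
For example, comparing a few candidate values:

```python
for max_leaf_nodes in [5, 50, 500, 5000]:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {mae:.0f}")
```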

Of the options compared, 50 leaf nodes is the most optimal value since it gives the lowest MAE.

Underfitting is when a model cannot learn patterns from data. When this happens it’s because the model or algorithm does not fit the data well. It happens usually when you do not have enough data to build an accurate model.

Underfitting can be reduced by:

  • Increasing the models’ complexity

  • Increasing the number of features

  • Increase the duration of training

While both overfitting and underfitting can lead to poor model performance, overfitting is the more recurring problem.

Conclusion

There are many ways to improve this model, such as experimenting to find better features or different model types. A random forest algorithm would also suit this problem well, as it has better predictive accuracy than a single decision tree and it works well with default parameters. As you keep modeling you will learn to use even more sophisticated models and the parameters that go with them.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!

References

https://towardsdatascience.com/build-your-first-machine-learning-model-with-python-in-7-minutes-30b9e1a3eafa

https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/

PCA Algorithm Tutorial in Python

Principal Component Analysis (PCA)

Principal Component Analysis is an essential dimensionality reduction algorithm. It entails lowering the dimensionality of data sets to reduce the number of variables. It keeps the most crucial information in this manner. This method is a helpful tool capable of several applications such as data compression or simplifying business decisions. Keep reading if you want to learn more about this algorithm.

The focus of this algorithm is to reduce the data set to make it simpler while retaining the most valuable possible information. Simplifying the dataset makes it easier to manipulate and visualize the data, resulting in quicker analysis.

Already understand how PCA works? Jump forward to the code!

How is PCA possible?

There are multiple ways to calculate PCA. We’re going to explain the widely used: Eigendecomposition of the covariance matrix.

An eigendecomposition is the factorization of a matrix in linear algebra. We can use a mathematical expression to represent this through the eigenvalues and eigenvectors. These concepts will be explained in section 3: “Calculate the Eigendecomposition of the Covariance Matrix”. Having this in mind, let’s dive further into the five steps to compute the PCA Algorithm:

1. Standardization

Standardization is a form of scaling in which the values are centered around the mean and have a unit standard deviation. This allows us to work, for example, with features measured on unrelated scales.

It means every feature is standardized to have a mean of 0 and a variation of 1, putting them on the same scale. In this manner, each feature may contribute equally to the analysis regardless of the variance of variables or if they are of different types. This step prevents variables with ranges of 0 to 100 from dominating the ones with values of 0 to 1, making a less biased and more accurate outcome.

Standardization can be done mathematically by subtracting the mean and dividing by the standard deviation for each variable value: z = (value − mean) / standard deviation. The letter z represents the standard score.

Once standardized, every feature will be on the same scale.
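
A one-function sketch of this step in NumPy (assuming samples are rows and features are columns):

```python
import numpy as np

def standardize(X):
    # Center each feature at 0 and scale it to unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)
```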

2. Covariance Matrix Computation

The covariance matrix is a square matrix that displays the relation between two random features, avoiding duplicated information since the variables we’re working with sometimes can be strongly related.

E.g., let's imagine we have milk and coffee, two drinks that people commonly consume. In conducting a survey, we can ask some people which beverage gives them more energy, noting their score from 0 to 5. The results are the data points we collect to build our covariance matrix. The following table represents the collected data from three subjects, with two variables "C" and "M" that identify coffee and milk, respectively.

This table can be represented in a graph like the following.

As we can see, the horizontal axis represents Coffee, while the vertical axis represents Milk. Here, we can notice some vectors created with the values shown in the previous table, and every vector is labeled according to the subjects’ answers. This graph can be represented as a covariance matrix. A covariance matrix with these features will compare the values each feature has and will proceed with the calculations.

This covariance matrix is calculated by multiplying the transpose of the data matrix A by A itself, then dividing by the number of samples: cov = (Aᵀ·A) / n.

Resulting in matrices like these:

We have two instances of Covariance Matrices here. One is a 2x2 matrix, while the other is a 3x3 matrix. Each with the number of variables according to its dimension. These matrices are symmetric with respect to the main diagonal. Why?

The covariance of a variable with itself represents the main diagonal of each matrix.

cov(x,x)= var(x)

Also, we have to take into account that covariance is commutative.

cov(x,y)= cov(y,x)

That way, we see that the top and bottom triangular parts are therefore equal, making the covariance matrix symmetric with respect to the main diagonal.

At this point, we need to take into account the sign of the covariance. If it’s positive, both variables are correlated; if it’s negative, they’re inversely correlated.

3. Calculate the Eigendecomposition of the Covariance Matrix

In this phase, we will compute the eigenvectors and eigenvalues of the matrix from the previous step. That way, we’re going to obtain the Principal Component. Before entering into the process of how we do this, we’re going to talk about these linear algebra concepts.

In the realm of data science, eigenvectors and eigenvalues are both essential notions that scientists employ. These have a mathematical basis.

An eigenvector of a matrix is a nonzero vector whose direction is not changed when the matrix is applied to it; instead, it simply becomes a scaled version of the original vector. The scaling is given by the eigenvalue, a number that stretches (or shrinks) the vector. Eigenvectors and eigenvalues come in pairs, with the number of pairs equaling the number of dimensions of the matrix.

Principal Components, on the other hand, being the major focus of this approach, are new variables formed as linear combinations of the starting features. The components are uncorrelated with one another and are constructed so that the first component captures as much of the information (variance) as possible, the second captures as much of the remaining information as possible, and so on until all the information is accounted for.

This procedure will assist you in reducing dimensionality while retaining as much information as possible and discarding the components with low information.

Example of a Principal Component, showing the contributions of variables in Python. Image Source: plot — Contributions of variables to PC in python — Stack Overflow

Now we have all the concepts clear. The real question is how can PCA Algorithm construct the Principal Component?

All of the magic in this computation is due to the eigenvectors and eigenvalues. Eigenvectors represent the axes’ directions where the maximum information is permitted. These are referred to as Principal Components. Eigenvalues, as we know, are values attached to the eigenvectors, giving the amount of variance that every principal component has.

If you rank the eigenvectors in order of every eigenvalue, highest to lowest, you’re going to get the principal components in order of significance, just like the following picture.

Eigenvectors and eigenvalues of a 2-dimensional covariance matrix. Image source: A Step-by-Step Explanation of Principal Component Analysis (PCA) | Built In

With the principal components established, the only thing left to do is calculate the proportion of variation accounted for by each component. It is possible to do this by dividing the eigenvalues of each component by the sum of eigenvalues.
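
A small sketch of this step, given a covariance matrix computed as above:

```python
def principal_components(cov):
    # Eigendecomposition of the symmetric covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rank eigenpairs from largest to smallest eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Proportion of variance accounted for by each component
    explained_variance_ratio = eigvals / eigvals.sum()
    return eigvals, eigvecs, explained_variance_ratio
```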

4. Feature Vector

A Feature Vector is a vector (or matrix) containing multiple variables. In this case, it is formed with the most significant Principal Components, which means the eigenvectors corresponding to the highest eigenvalues. The number of principal components we want to add to our feature vector is up to us and depends on the problem we're going to solve.

Following the same example of the last step, knowing that λ1 > λ2, our feature vector can be written this way:

Feature Vector with two Principal Components. Image source: A Step-by-Step Explanation of Principal Component Analysis (PCA) | Built In

Also, we know that our second vector contains less relevant information, which is why we can skip it and only create our feature vector with the first principal component.

Feature Vector with only a Principal Component. Image Source: A Step-by-Step Explanation of Principal Component Analysis (PCA) | Built In

We need to take into consideration that reducing our dimensionality will cause information loss, affecting the outcome of the problem. We can discard principal components when it looks convenient, which means when the information of those principal components we want to discard is less significant.

5. Recast the Data Along the Axes of the Principal Components

Because the input data is still in terms of the starting variables, we must utilize the Feature Vector to reorient the axes to those represented by our matrix’s Principal Components.

It can be done by multiplying the transposed standardized original data set by the transpose of the Feature Vector.

Now that we finally have the Final Data Set, that’s it! We’ve completed a PCA Algorithm.

PCA Python Tutorial

However, when you have basic data, such as the previous example, it is quite straightforward to do it with code. When working with massive amounts of data, scientists are continually looking for ways to compute this Algorithm. That’s why we’re going to complete an example in Python, you can follow along in this Jupyter Notebook.

The first thing we’re going to do is import all the datasets and functions we’re going to use. For a high-level explanation of the scientific packages: NumPy is a library that allows us to use mathematical functions, this will help us to operate with the matrices (also available thanks to this library).

Seaborn and Matplotlib are both used to generate plots and graphics. Matplotlib is also used as an extension of NumPy, including more mathematical functions.

Sklearn is the reserved word for scikit-learn, a machine learning library for Python, it has some loaded datasets and various Machine Learning algorithms.

Importing Data and Functions

Next, we will declare a class called PCA, which will have all the steps we learned previously in this blog.

PCA Class and Functions

Functions for Eigenvectors and Projection
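
The notebook's exact implementation may differ, but a condensed sketch of such a class, following the five steps above, might look like this:

```python
import numpy as np

class PCA:
    """Minimal from-scratch PCA sketch (not the scikit-learn class)."""

    def __init__(self, X, n_components=2):
        self.X = np.asarray(X, dtype=float)
        self.n_components = n_components

    def standardize(self):
        self.Z = (self.X - self.X.mean(axis=0)) / self.X.std(axis=0)
        return self.Z

    def covariance_matrix(self):
        n = self.Z.shape[0]
        self.cov = (self.Z.T @ self.Z) / (n - 1)
        return self.cov

    def eigendecomposition(self):
        eigvals, eigvecs = np.linalg.eigh(self.cov)
        order = np.argsort(eigvals)[::-1]
        self.eigvals, self.eigvecs = eigvals[order], eigvecs[:, order]
        return self.eigvals, self.eigvecs

    def project(self):
        W = self.eigvecs[:, :self.n_components]   # the feature vector
        return self.Z @ W                         # data recast along the principal components
```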

These functions are pretty intuitive and easy to follow. The next step is to implement this PCA Class. We’re going to initialize the variables we’re going to use.

Initialization of Variables

Here we can notice we're using the datasets module, imported from scikit-learn. Then, we assigned those values to our variables. Scikit-learn comes with various datasets; we're going to use the well-known "toy datasets". These are small standard datasets, used mostly in algorithm examples or tutorials.

Specifically, we’re loading the diabetes dataset. According to the authors, this data is based on “Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.”

Following the code, now we’re going to initialize the class and start calling out the functions of the class. When running the code, we can have the following outcome.

PCA Initialization

PCA Example in Python
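
A sketch of running the class above on the diabetes data and plotting the first two components (the class and variable names here are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn import datasets

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target      # 442 patients, 10 features

pca = PCA(X, n_components=2)
pca.standardize()
pca.covariance_matrix()
pca.eigendecomposition()
projected = pca.project()

plt.scatter(projected[:, 0], projected[:, 1], c=y)
plt.xlabel('PC 1'); plt.ylabel('PC 2')
plt.show()
```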

To see the difference from the original dataset, we can call just the first two functions of the class, producing the following scatterplot:

Scatterplot Graph of Diabetes Dataset

If we do the whole PCA process, we can better visualize the data in this way:

PCA Graph of Diabetes Dataset

Now, we've learned about the key computing procedures, which included some crucial linear algebra basics.

In Conclusion

The Principal Component Analysis is a straightforward yet powerful algorithm for reducing, compressing, and untangling high-dimensional data. It allows us to isolate the data more clearly, and use it for various machine learning methods.

Since the original data was reduced in dimensionality while retaining its trends and patterns, the final output of this algorithm is easier to manipulate, making further analysis much easier and faster for the machine learning algorithm, and allowing us to set aside unnecessary variables and avoid problems like the curse of dimensionality.

The next step is to use your new transformed variables with other machine learning algorithms to better understand your data and finally finish your research. If you want a general guide on the existing different machine learning methods and when to use them, you can check my previous entry on “Machine Learning Algorithms”.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!

You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.


Data Processing in Python

Data Processing in Python

Generally speaking, data processing consists of gathering and manipulating data elements to return useful, potentially valuable information. Different encoding types have different processing formats. The best-known formats for encodings are XML, CSV, JSON, and HTML.

With Python, you can manage many of these encoding processes, and it is well suited for data processing thanks to its simple syntax, scalability, and clean style, which allow complex problems to be solved in multiple ways. All you need are some libraries or modules to make those encoding methods work, for example, Pandas.

Why is Data processing essential?

Data processing is a vital part of data science. Having inaccurate and bad-quality data can be damaging to processes and analysis. Good clean data will boost productivity and provide great quality information for your decision-making.

What is Pandas?

When we talk about Pandas, most people associate the name with the black and white bear from Asia. But in the tech world, it's a recognized open-source Python library, developed as an extension of NumPy. Its function is to support data analysis, processing, and manipulation, offering data structures and operations to manage numerical tables and time series.

With this said, we agree that Pandas is a powerful essential programming tool for those interested in the Machine Learning field.

Processing CSV Data

Most Data Scientists rely on CSV files (which stand for “Comma Separated Values”) in their day-to-day work. It’s because of the simplicity of the storage in a tabular form as plain text, making it easier to read and comprehend.

CSV files are easy to create. We can use Notepad or another text editor to make a file, for example:

Then, save the file using the .csv extension (example.csv). And select the save as All Files (*.*) option. Now you have a CSV data file.

In the Python environment, you will use the Pandas library to work with this file. The most basic function is reading the CSV data.
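
For example, assuming the example.csv file created above:

```python
import pandas as pd

df = pd.read_csv('example.csv')
print(df)
```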

Processing Data using Pandas

We will use a simple dataset for this tutorial, i.e., a highest-grossing movies dataset. You can download this and other datasets from kaggle.com.

To start working with pandas we will import the library into our jupyter notebook which you can find here to follow along with this tutorial.

Pandas is one of the more notable libraries essential to the data science workflow as it provides you with the means to process and wrangle the data. This is vital as many consider the data pre-processing stage to occupy as much as 80% of a data scientist’s time.

Import dataset

The next step is to import the dataset for this we will use the read_csv() which is a function of pandas. Since the dataset is in a tabular format, pandas will convert it to a dataframe called data. A DataFrame is a two-dimensional, mutable data structure in Python. It is a combination of rows and columns like an excel sheet.
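
For instance (the file name is a placeholder for whatever the downloaded CSV is called):

```python
import pandas as pd

data = pd.read_csv('highest_grossing_movies.csv')   # hypothetical file name
```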

This dataset contains data on the highest-grossing movies of each year. When working with datasets it is important to consider: where did the data come from? Some will be machine-generated data. Some of them will be data that’s been collected via surveys. Some could be data that are recorded from human observations. Some may be data that’s been scraped from websites or pulled via APIs. Don’t jump right into the analysis; take the time to first understand the data you are working with.

Exploring the data

The head() function is a built-in function in pandas for the dataframe used to display the rows of the dataset by default; it displays the first five rows of the dataset. We can specify the number of rows by giving the number within the parenthesis.

Here we also get to see what data is in the dataset we are working with. As we can see there are not a lot of columns which makes the data easier to work with and explore.

We can also see how the last five rows look using the tail() function.
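
For example:

```python
data.head()     # first five rows (pass a number for more, e.g. data.head(10))
data.tail()     # last five rows
```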

The function memory_usage() returns a pandas Series with the memory usage (in bytes) of each column in a pandas dataframe. Knowing the memory usage of a dataframe helps when tackling errors like MemoryError in Python.

In datasets, the information is presented in tabular form so data is organized in rows and columns. Each column has a name, a data type, and other properties knowing how to manipulate the data in the columns is quite useful. We can continue and check the columns we have.

Keep in mind, because this is a simple dataset there are not a lot of columns.

loc[:] can be used to access specific rows and columns as per what you require. If for instance, you want the first 2 columns and the last 3 rows you can access them with loc[:]. One can use the labels or row and column numbers with the loc[:] function.
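
A sketch of that kind of selection:

```python
data.loc[0:4, ['YEAR', 'MOVIE', 'TOTAL IN 2019 DOLLARS']]
```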

The above code will return the “YEAR”, “MOVIE”, and “TOTAL IN 2019 DOLLARS” columns for the first 5 movies. Keep in mind that the index starts from 0 in Python and that loc[:] is inclusive of both values mentioned. So 0:4 will mean indices 0 to 4, both included.

sort_values() is used to sort values in a column in ascending or descending order.
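
For instance, sorting by the year column:

```python
data.sort_values(by='YEAR', ascending=False, inplace=False)
```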

The ‘inplace’ attribute here is False but by specifying it to be True you can make a change in the original dataframe.

You can look at basic statistics from your data using the simple data frame function i.e. describe(), this helps to better understand your data.

value_counts() returns a Pandas Series containing the counts of unique values. value_counts() helps in identifying the number of occurrences of each unique value in a Series. It can be applied to columns containing data.

value_counts() can also be used to plot bar graphs of categorical and ordinal data syntax below.
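
For example, counting and plotting the values in one of the columns:

```python
data['YEAR'].value_counts()                   # counts of each unique value
data['YEAR'].value_counts().plot(kind='bar')  # bar graph of those counts
```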

Finding and Rebuilding Missing Data

Pandas has functions for finding null values if any are in your data. There are several ways to find missing values, and we will look at them below.

isnull() function: This function provides the boolean value for the complete dataset to know if any null value is present or not.

isna() function: This is the same as the isnull() function

isna().any() function: This function also gives a boolean value if any null value is present or not, but it gives results column-wise, not in tabular format.

isna().sum() function: This function gives the sum of the null values preset in the dataset column-wise.

isna().any().sum() function: This function gives output in a single value if any null is present or not. In this case there is no null value.

When there are null values present in the dataset, the fillna() function will replace the missing (NA/NaN) values with a value you specify, such as 0. Below is the syntax.
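
A short sketch of the missing-value check and fillna():

```python
data.isna().sum()    # missing values per column
data.fillna(0)       # replace any missing values with 0 (returns a new dataframe)
```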

De-Duplicate

This is removing all duplicate values. When analyzing data, duplicate values affect the accuracy and efficiency of the results. To find duplicate values the function duplicated() is used as seen below.

While this dataset does not contain any duplicate values if a dataset contains duplicate values it can be removed using the drop_duplicates() function.

Below is the syntax of this function:
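
A sketch of both functions:

```python
data.duplicated()         # boolean Series flagging duplicate rows
data.drop_duplicates()    # returns the dataframe with duplicates removed
```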

As we have seen here, we can already conduct fairly interesting data analysis with Pandas, which provides various useful functionalities that are straightforward and easy to use. Different approaches can be used for many different kinds of datasets to find patterns and trends, and to apply more advanced machine learning techniques in the future.

Follow me here for more AI, Machine Learning, and Data Science tutorials to come!


Neural Networks

What is a neural network?

A Neural Network is a system inspired by the human brain that is designed to recognize patterns. Simply put it is a mathematical function that maps a given input in conjunction with information from other nodes to develop an output.

What can a neural network do?

Neural Networks have a wide range of real-world and industrial applications a few examples are:

  • Guidance systems for self-driving cars

  • Customer behavior modeling in business analytics

  • Adaptive learning software for education

  • Facial recognition technology

And that’s just scratching the surface.

How does a neural network work?



A simple neural network includes an input layer, an output layer, and, in between, a hidden layer.

  • Input layer- this is where data is fed and passed to the next layer

  • Hidden layer- this does all kinds of calculations and feature extractions described below

  • Output layer- this delivers the final result

Each layer is connected via nodes. The nodes are activated when data is fed to the input layer. It is then passed to the hidden layer where processing and calculations take place through a system of weighted connections. Finally, the hidden layers link to the output layer — where the outputs are retrieved.



What python libraries or packages can you use?

There are a variety of useful Python libraries that can be used:

  • Pytorch- Apart from Python, PyTorch also has support for C++ with its C++ interface if you’re into that.

  • Keras- is one of the most popular and open-source neural network libraries for Python.

  • Tensorflow- is one of the best libraries available for working with Machine Learning on Python

  • Scikit-learn- It includes easy integration with different ML programming libraries like NumPy and Pandas

  • Theano- is a powerful Python library that makes it easy to define, optimize, and evaluate powerful mathematical expressions

  • Numpy- concentrates on handling extensive multi-dimensional data and the intricate mathematical functions operating on the data.

  • Pandas- is a Python data analysis library and is used primarily for data manipulation and analysis.

How to implement a simple neural network with one hidden layer in basic Python to solve an XOR logic problem?

The Python language can be used to build neural networks from simple ones to the most complex. In this tutorial, you will learn how to solve an XOR logic problem. The XOR problem is a valuable challenge because it is the simplest linearly inseparable problem that exists. It is a classic problem that helps in understanding the basics of Deep Learning. XOR (Exclusive OR) compares two input bits and generates one output bit; in other words, you must have one or the other but not both. The table below represents what we will implement in the code.
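
For reference, this is the standard XOR truth table that the network will learn to reproduce:

  • Inputs 0, 0 → Output 0

  • Inputs 0, 1 → Output 1

  • Inputs 1, 0 → Output 1

  • Inputs 1, 1 → Output 0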

To begin, we will use the Python library NumPy, which provides convenient functions and simplifies the calculations. You can follow along in the linked Jupyter Notebook or in your favorite editor, starting by defining the parameters.
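
As a minimal sketch of that setup (the variable names are my own, not necessarily those of the linked notebook):

```python
import numpy as np

# XOR truth table: four input pairs and their expected outputs
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])

# training parameters
learning_rate = 0.5
epochs = 10000
hidden_neurons = 2   # one small hidden layer is enough for XOR
```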

Next is to define the weights and biases between each layer. Weights and biases are the learnable parameters of neural networks and other machine learning models. Weights determine how much influence an input has on the output. Biases are constant terms with no incoming connections, only outgoing connections with their own weights; they let a neuron activate even when all of its inputs are zero. In this case, I did not use a bias. The weights have been set to random values.

I will then create functions for the activation of the neuron. Here we use the Sigmoid function, which is commonly used when the model is predicting a probability. After activation, forward propagation begins: this is the calculation of the predicted output. Once the output activations have been calculated, they are returned to be used in further calculations. All values calculated during forward propagation are stored, as they will be required during backpropagation. Backpropagation is how the network learns from its mistakes: it determines how much we should adjust the weights to improve the accuracy of our predictions.
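
A sketch of those pieces might look like this (again, the names are illustrative):

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(z)
    return a * (1.0 - a)

def forward(inputs, w_hidden, w_output):
    # forward propagation: input layer -> hidden layer -> output layer
    hidden_activations = sigmoid(inputs @ w_hidden)
    output_activations = sigmoid(hidden_activations @ w_output)
    # both activations are returned because backpropagation needs them
    return hidden_activations, output_activations
```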

In the end, we run the neural network for 10,000 epochs and view the loss. An epoch is one complete cycle through the training data, in which the data is passed both forward and backward; a forward pass and a backward pass together count as one, and within an epoch each sample is used just once. A loss function quantifies the difference between the expected outcome and the outcome produced by the neural network.
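
Putting the pieces together, a complete end-to-end sketch could look like the following. One caveat: unlike the notebook described above, this version adds small bias terms, since a sigmoid network without biases has a hard time fitting XOR; even so, convergence for such a tiny network can depend on the random initialization.

```python
import numpy as np

np.random.seed(0)

# XOR inputs and expected outputs
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = np.array([[0], [1], [1], [0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    return a * (1.0 - a)

# random starting weights; bias terms added for reliability (see note above)
w_hidden = np.random.uniform(-1, 1, size=(2, 2))
b_hidden = np.random.uniform(-1, 1, size=(1, 2))
w_output = np.random.uniform(-1, 1, size=(2, 1))
b_output = np.random.uniform(-1, 1, size=(1, 1))

learning_rate = 0.5
epochs = 10000

for epoch in range(epochs):
    # forward propagation
    hidden = sigmoid(inputs @ w_hidden + b_hidden)
    output = sigmoid(hidden @ w_output + b_output)

    # mean squared error between expected and predicted outputs
    loss = np.mean((targets - output) ** 2)

    # backpropagation: push the error back through the network
    d_output = (output - targets) * sigmoid_derivative(output)
    d_hidden = (d_output @ w_output.T) * sigmoid_derivative(hidden)

    # gradient-descent updates
    w_output -= learning_rate * hidden.T @ d_output
    b_output -= learning_rate * d_output.sum(axis=0, keepdims=True)
    w_hidden -= learning_rate * inputs.T @ d_hidden
    b_hidden -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

    if epoch % 2000 == 0:
        print(f"epoch {epoch}: loss {loss:.4f}")

print("predictions:", output.round(3).ravel())
```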

Following are the predictions of the neural network on test inputs:



We know that for XOR, the inputs 1,0 and 0,1 should give output 1, while 1,1 and 0,0 should output 0. Another way to replicate this experiment is with TensorFlow, which differs from NumPy in one major respect: TensorFlow is designed specifically for machine learning and AI applications, and so ships with libraries and functions built for them. PyTorch is also a great option because of its strong GPU support, which is very useful when visualizing data to gain additional insights into your work. In conclusion, writing the code and experimenting with the various ways to build this neural network will strengthen your skills and deepen your knowledge of the inner workings of neural networks.








You can stay up to date with Accel.AI; workshops, research, and social impact initiatives through our website, mailing list, meetup group, Twitter, and Facebook.








Machine Learning Algorithms Cheat Sheet

Machine learning is a subfield of artificial intelligence (AI) and computer science that focuses on using data and algorithms to mimic the way people learn, progressively improving its accuracy. As a result, Machine Learning is one of the most interesting fields in Computer Science these days, and it is applied behind the scenes in products and services we use in everyday life.

In case you want to know what Machine Learning algorithms are used in different applications, or if you are a developer and you’re looking for a method to use for a problem you are trying to solve, keep reading below and use these steps as a guide.

Machine Learning Algorithms Cheat Sheet by LatinX in AI™. Download the pdf: https://github.com/latinxinai/AI-Educational-Resources/raw/master/CheatSheets/Machine%20Learning%20Cheat%20Sheet.pdf

Machine Learning can be divided into three different types of learning: Unsupervised Learning, Supervised Learning, and Semi-supervised Learning.


Unsupervised learning uses data that is not labeled, so the machine must work without guidance, discovering patterns, similarities, and differences on its own.

On the other hand, supervised learning involves a “teacher” who trains the machine by labeling the data it works with. The machine then receives examples that allow it to learn how to produce the correct outcome.

There is also a hybrid approach: semi-supervised learning works with both labeled and unlabeled data. This method uses a small set of labeled data to train a model, which then labels the rest of the data with its predictions, ultimately giving a solution to the problem.

To begin, you need to know the number of dimensions you’re working with, that is, the number of inputs in your problem (also known as features). If you’re working with a large dataset or many features, you can opt for a Dimension Reduction algorithm.


Unsupervised Learning: Dimension Reduction

A large number of dimensions in a data collection can have a significant influence on the performance of machine learning algorithms. The "curse of dimensionality" is a term used to describe the troubles high dimensionality can cause, for example the “Distance Concentration” problem in clustering, where the distances between different data points become nearly indistinguishable as the dimensionality of the data increases.

Techniques for minimizing the number of input variables in training data are referred to as “Dimension Reduction”. 


Now you need to be familiar with the concept of Feature Extraction and Feature Selection to keep going. The process of translating raw data into numerical features that can be processed while keeping the information in the original data set is known as feature extraction. It produces better outcomes than applying machine learning to raw data directly. 

Feature extraction underlies three well-known dimensionality reduction algorithms: Principal Component Analysis, Singular Value Decomposition, and Linear Discriminant Analysis, but you need to know exactly which tool you want to use to find patterns or infer new information from the data.

If you’re not looking to combine the variables of your data, and instead want to remove unneeded features by keeping only the most important ones, then you can use the Principal Component Analysis algorithm.


PCA (Principal Component Analysis)

It's a mathematical algorithm for reducing the dimension of data sets to simplify the number of variables while retaining most of the information. This trade-off of accuracy for simplicity is extensively used to find patterns in large data sets.


For linear relationships, it has a wide range of uses when large amounts of data are present, such as media editing, statistical quality control, portfolio analysis, face recognition, and image compression.
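
As a rough sketch with scikit-learn (using the bundled iris dataset purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # 150 samples, 4 features

pca = PCA(n_components=2)                # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # share of variance kept by each component
```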

Alternatively, if you want an algorithm that works by combining the variables of your data, a simple PCA may not be the best tool for you to use. Next, you can choose between a probabilistic model and a non-probabilistic one. Probabilistic data involves an element of random selection and is preferred by most scientists for more accurate results, while non-probabilistic data doesn’t involve that randomness.

If you are working with non-probabilistic data, you should use the Singular Value Decomposition algorithm.

SVD (Singular Value Decomposition)

In the realm of machine learning, SVD allows data to be transformed into a space where categories can be easily distinguished. This algorithm decomposes a matrix into three different matrices. In image processing, for example, a reduced number of vectors are used to rebuild a picture that is quite close to the original.

Compression of an image with a given number of components. Source: Singular Value Decomposition | SVD in Python (analyticsvidhya.com)

Compared with the PCA algorithm, both can reduce the dimensionality of the data. But while PCA skips the less significant components, SVD re-expresses the data as three separate matrices that are easier to manipulate and analyze.
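
A small NumPy sketch of the idea, using a random matrix in place of a real image:

```python
import numpy as np

A = np.random.rand(100, 80)                       # stand-in for an image

# decompose A into three matrices: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# rebuild an approximation using only the k largest singular values
k = 10
A_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print("reconstruction error:", np.linalg.norm(A - A_approx))
```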

When it comes to probabilistic approaches, it’s better to use the Linear Discriminant Analysis algorithm for more abstract problems.

LDA (Linear Discriminant Analysis)

Linear Discriminant Analysis (LDA) is a classification approach in which two or more groups have previously been identified, and fresh observations are categorized into one of them based on their features. It’s different from PCA since LDA discovers a feature subspace that optimizes group separability while the PCA ignores the class label and focuses on capturing the dataset's highest variance direction.

This algorithm uses Bayes’ Theorem, a probabilistic theorem used to determine the likelihood of an occurrence based on its relationship to another event. It is frequently used in face recognition, customer identification, and medical fields to identify the patient’s disease status.

Distribution of 170 face images of five subjects (classes) randomly selected from the UMIST database in (a) PCA-based subspace, (b) D-LDA-based subspace, and (c) DF-LDA-based subspace. Source: (PDF) Face recognition using LDA-based algorithms (researchgate.net)
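
A hedged scikit-learn sketch (again with the iris dataset standing in for real data):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)   # unlike PCA, LDA uses the class labels y

print(X_projected.shape)                # (150, 2)
print(lda.predict(X[:5]))               # LDA can also classify new observations
```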



The next step is to decide whether or not you want your algorithm to have responses, in other words, whether you want to develop a predictive model based on labeled data to teach your machine. If you’d rather use unlabeled data, so your machine can work without guidance and search for similarities, you may use Clustering techniques.

On the other hand, the process of picking a subset of relevant features (variables, predictors) for use in model creation is known as feature selection. It keeps models simple and easier to comprehend for researchers and users, reduces training time, and helps avoid the curse of dimensionality.

It includes the Clustering, Regression, and Classification methods.

Unsupervised Learning: Clustering 

Clustering is a technique for separating groups with similar characteristics and assigning them to clusters. If you're looking for a hierarchical algorithm:

Hierarchical Clustering

This type of clustering is one of the most popular techniques in Machine Learning. Hierarchical Clustering helps an organization classify data to identify similarities as well as distinct groupings and features, so that pricing, goods, services, marketing messages, and other aspects of the business can be better targeted. The hierarchy presents the data in a tree-like structure known as a Dendrogram. There are two ways of grouping the data: agglomerative and divisive.

Agglomerative clustering is a "bottom-up" approach. To put it another way, each item is first treated as a single-element cluster (a leaf). At each step of the method, the two most similar clusters are joined into a new, larger cluster (a node). This process is repeated until all points belong to a single large cluster (the root).

Divisive clustering works in a “top-down” way. It starts at the root, where all items are grouped in a single cluster, then splits the most heterogeneous cluster in two at each iteration. The procedure is repeated until every item ends up in its own cluster.
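
A minimal sketch of agglomerative clustering and its dendrogram, on invented blob data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# "bottom-up" merging of clusters using Ward's criterion
Z = linkage(X, method="ward")

# the dendrogram shows how points merge into larger and larger clusters
dendrogram(Z)
plt.show()
```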

In case you’re not looking for a hierarchical solution, you must determine whether your method requires you to specify the number of clusters to be used. You can utilize the Density-based Spatial Clustering of Applications with Noise algorithm if you don't need to define it.


DBSCAN (Density-based Spatial Clustering of Applications with Noise)

When it comes to arbitrary-shaped clusters or detecting outliers, it’s better to use Density-Based Clustering. DBSCAN is a method for detecting those arbitrary-shaped clusters and the ones with noise by grouping points close to each other based on two parameters: eps and minPoints.

The eps parameter defines the maximum distance between two points for them to be considered part of the same cluster, while minPoints is the minimum number of points needed to form a cluster. This algorithm has been used to analyze outliers among Netflix servers. The streaming service runs thousands of servers, and normally less than one percent of them become unhealthy, which degrades streaming performance. The real problem is that these failures aren’t easily visible; to solve it, Netflix uses DBSCAN by specifying a metric to monitor, collecting data, and then passing it to the algorithm to detect the outlier servers.



A more everyday use case is when an e-commerce site makes product recommendations to its customers by applying DBSCAN to the data on products the user has bought before.
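
A quick sketch on synthetic "two moons" data, which K-Means would struggle with:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: minimum points to form a dense region
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(set(labels))   # cluster ids; any points labeled -1 are treated as noise/outliers
```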

In case you need to specify the number of clusters, there are three existing algorithms you could use, including K-Modes, K-Means, and Gaussian Mixture Model. Next, you need to know if you’re going to work with categorical variables, which are discrete variables that capture qualitative consequences by grouping observations (or levels). If you’re going to use them, you may opt for K-Modes.


K-Modes

This approach is used to group categorical variables. We determine the total mismatches between these types of data points. The fewer the differences between our data points, the more similar they are. The main difference between K-Modes and K-Means is that for categorical data points we can’t calculate the distance since they aren’t numeric values.

This algorithm is used for text mining applications, document clustering, topic modeling (where each cluster group represents a specific subject), fraud detection systems, and marketing.
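
A small sketch, assuming the third-party kmodes package (pip install kmodes) and a toy categorical dataset:

```python
import numpy as np
from kmodes.kmodes import KModes

# purely categorical data: (color, size, shape)
data = np.array([
    ["red", "small", "round"],
    ["red", "small", "square"],
    ["blue", "large", "round"],
    ["blue", "large", "square"],
])

km = KModes(n_clusters=2)
print(km.fit_predict(data))   # cluster label per row, based on matching categories
```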

For numeric data, you should use K-Means clustering.

K-Means

Data is clustered into a k number of groups in such a manner that data points in the same cluster are related while data points in other clusters are further apart. This distance is frequently measured with the Euclidean distance. In other words, the K-Means algorithm tries to minimize distances within a cluster and maximize the distance between different clusters.

Search engines, consumer segmentation, spam/ham detection systems, academic performance, defects diagnosis systems, wireless communications, and many other industries use k-means clustering.
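
A minimal scikit-learn sketch on invented blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)        # the final centroids
print(kmeans.labels_[:10])            # cluster assigned to the first 10 points
print(kmeans.predict([[0.0, 0.0]]))   # cluster of a new point
```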

If the intended result is based on probability, then the Gaussian Mixture Model should be used.

GMM (Gaussian Mixture Model)

This approach implies the presence of many Gaussian distributions, each of which represents a cluster. The algorithm will determine the probability of each data point belonging to each of the distributions for a given batch of data.

GMM differs from K-Means in that GMM does not assign each data point to a single cluster outright; it uses probability to express that uncertainty, whereas K-Means makes a hard assignment for every data point and iterates over the whole data set. The Gaussian Mixture Model is frequently used in signal processing, language recognition, anomaly detection, and genre classification of music.
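
A short sketch of the soft, probabilistic assignment with scikit-learn:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# probability of each point belonging to each Gaussian (soft assignment)
print(gmm.predict_proba(X[:3]).round(3))
# one label per point, if a hard assignment is needed
print(gmm.predict(X[:3]))
```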

If you use labeled data to train your machine, you first need to decide whether it is going to predict numbers; if it is, you can choose Regression algorithms.


Supervised Learning: Regression

Regression is a machine learning algorithm in which the outcome is predicted as a continuous numerical value. This method is commonly used in banking, investment, and other fields.

Here, you need to decide whether you would rather have speed or accuracy. In case you’re looking for speed, you can use a Decision Tree algorithm or a Linear Regression algorithm.

Decision Tree

A decision tree is a flowchart-like tree data structure. Here, the data is continuously split according to a given parameter: each internal node tests a parameter, while the outcomes of the whole tree are located in the leaves. There are two types of decision trees:

  • Classification trees (Yes/No types), here the decision variable is categorical.

  • Regression trees (Continuous data types), where the decision or the outcome variable is continuous.

When there are intricate interactions between the features and the output variables, decision trees come in handy. When there are missing features, a mix of categorical and numerical features, or a large variance in the scale of features, they perform better in comparison to other methods.

This algorithm is used to enhance the accuracy of promotional campaigns, detection of fraud, and detection of serious or preventable diseases in patients.
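
A hedged regression-tree sketch, with scikit-learn's bundled diabetes dataset standing in for real data:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a shallow tree keeps the splits easy to read and fast to train
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

print("R^2 on held-out data:", tree.score(X_test, y_test))
```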

Linear Regression

Based on a given independent variable, this method predicts the value of a dependent variable. As a result, this regression approach determines if there is a linear connection between the input (independent variable) and the output (dependent variable). Hence, the term Linear Regression was coined.

Linear regression is ideal for datasets in which the features and the output variable have a linear relationship. It's usually used for forecasting (which is particularly useful for small firms to understand the sales effect), understanding the link between advertising expenditure and revenue, and in the medical profession to understand the correlations between medicine dose and patient blood pressure.
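
A tiny sketch, with made-up numbers standing in for advertising spend versus revenue:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10], [20], [30], [40], [50]])   # hypothetical ad spend
y = np.array([25, 45, 62, 85, 105])            # hypothetical revenue

model = LinearRegression().fit(X, y)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("forecast for a spend of 60:", model.predict([[60]])[0])
```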


Alternatively, if you need accuracy for your algorithm you can use the following three algorithms: Neural Network, Gradient Boosting Tree, and Random Forest.


Neural Network

A Neural Network is required to learn the intricate non-linear relationship between the features and the target. It’s an algorithm that simulates the workings of neurons in the human brain. There are several types of Neural Networks, including the Vanilla Neural Network (that handles structured data only), as well as Recurrent Neural Network and Convolutional Neural Network which both can work with unstructured data.


When you have a lot of data (and processing capacity), and accuracy is important to you, you'll almost certainly utilize a neural network. This algorithm has many applications, such as paraphrase detection, text classification, semantic parsing, and question answering.
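
The cheat sheet doesn't prescribe a library, but as one possible sketch, scikit-learn's multi-layer perceptron can stand in for a small "vanilla" network on tabular data:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# two hidden layers of 64 neurons; more data usually justifies bigger networks
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
mlp.fit(X_train, y_train)

print("R^2 on held-out data:", mlp.score(X_test, y_test))
```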


Gradient Boosting Tree

Gradient Boosting Tree is a method for merging the outputs of separate trees to perform regression or classification. This supervised learning method combines a large number of decision trees to lessen the danger of overfitting (a statistical modeling mistake that happens when a function is fitted too tightly to a small number of data points, which can reduce the predictive power of the model) that each tree would face alone. This algorithm employs Boosting, which entails consecutively combining weak learners (typically decision trees with just one split, known as decision stumps) so that each new tree corrects the preceding one's faults.

We usually employ the Gradient Boosting algorithm when we wish to reduce bias error, which is the amount by which a model's prediction differs from the target value. Gradient boosting is most beneficial when the data has relatively few dimensions, a basic linear model performs poorly, interpretability is not critical, and there is no stringent latency limit.
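
A brief sketch with scikit-learn's gradient boosting regressor on the same kind of toy data:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# many shallow trees added one after another, each correcting the previous ones
gbt = GradientBoostingRegressor(n_estimators=300, max_depth=2, learning_rate=0.05,
                                random_state=0).fit(X_train, y_train)

print("R^2 on held-out data:", gbt.score(X_test, y_test))
```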


It has been used in many studies, for example a gender prediction algorithm for masters athletes that applied gradient boosted decision trees to explore their capacity to predict gender from psychological dimensions evaluating reasons to participate in masters sports.

Random Forest

Random Forest is a method for resolving regression and classification problems. It makes use of ensemble learning, which is a technique for solving complicated problems by combining several classifiers. It consists of many decision trees, and the final result is obtained by averaging the trees' outputs (for regression) or taking the majority vote (for classification). The greater the number of trees, the more precise the outcome.

Random Forest is appropriate when we have a huge dataset and interpretability is not a key problem, as it becomes increasingly difficult to grasp as the dataset grows larger. This algorithm is used in stock market analysis, diagnosis of patients in the medical field, to predict the creditworthiness of a loan applicant, and in fraud detection.
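
A minimal classification sketch (iris again, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# an ensemble of decision trees; the majority vote gives the final class
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("accuracy on held-out data:", forest.score(X_test, y_test))
```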

For non-numerical prediction algorithms, you can choose the Classification methods over regression.

Supervised Learning: Classification

As with the regression methods, you need to choose whether you would rather have speed or accuracy in your outcomes.

If you’re looking for accuracy, you not only may opt for the Kernel Support-Vector Machine, but you can use other algorithms that were mentioned previously, such as Neural Network, Gradient Boosting Tree, and Random Forest. Now, let’s introduce this new algorithm.

Kernel Support-Vector Machine

To bridge linearity and non-linearity, the kernel technique is commonly utilized in the Support-Vector Machine model. To understand this, it is essential to know that the SVM method learns how to separate different groups by forming decision boundaries.

But when we face a data set with higher dimensions and computing in that space would be expensive, it is recommended to use the kernel method: it enables us to work in the original feature space without having to compute the data's coordinates in a higher-dimensional space.

It’s mostly used in text classification problems since most of them can be linearly separated.
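
A short sketch on data that a straight line cannot separate:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the RBF kernel lets the SVM draw a non-linear boundary without explicitly
# computing coordinates in a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("accuracy on held-out data:", clf.score(X_test, y_test))
```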

When speed is needed, we need to see if the technique we're going to employ is explainable, which implies it can explain what happens in your model from start to finish. In that case, we might use a Decision Tree algorithm or a Logistic Regression.

Logistic Regression

Logistic Regression is used when the dependent variable is categorical. Through probability estimation, it helps in understanding the relationship between the dependent variable and one or more independent variables.

There are three different types of Logistic Regression:

  • Binary Logistic Regression, where the response only has two possible values.

  • Multinomial Logistic Regression, three or more outcomes with no order.

  • Ordinal Logistic Regression, three or more categories with ordering.

The Logistic Regression algorithm is widely used in hotel booking: it shows you (through statistical research) the options you may want to add to your booking, such as a hotel room, trips in the area, and more.
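
A binary logistic regression sketch, with scikit-learn's breast cancer dataset standing in for real booking data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # the response has two possible values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("accuracy:", clf.score(X_test, y_test))
print("class probabilities for one sample:", clf.predict_proba(X_test[:1]).round(3))
```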

If you’re only interested in the input and output of your problem, you can check whether the data you’re working with is very large. If it is, you can use a Linear Support-Vector Machine.


Linear Support-Vector Machine

Linear SVM is used for linearly separable data, that is, data whose classes can be separated with a simple straight line (the linear SVM classifier). This straight line acts as the decision boundary separating the outcomes of the stated problem.

Since texts are often linearly separable and have a lot of features, the Linear SVM is the best option to use for their classification. In the case of our next algorithm, you can use it whether the data is large or not.
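
A tiny sketch on two invented, well-separated groups:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# two clearly separated groups: one straight line can split them
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = LinearSVC().fit(X, y)

print("training accuracy:", clf.score(X, y))
print("separating line defined by:", clf.coef_, clf.intercept_)
```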


Naïve Bayes

This algorithm is based on Bayes Theorem. It consists of predictions through objects’ probabilities. It’s called Naïve because it assumes that the appearance of one feature is unrelated to the appearance of other characteristics.

This method is well-liked because it can surpass even the most sophisticated classification approaches. Furthermore, it is simple to construct and can be built rapidly. Thanks to its ease of use and efficiency, it is used to make real-time decisions. Gmail also uses this algorithm to decide whether an email is spam or not.

Gmail's spam detection picks a set of words or ‘tokens’ to identify spam email (this method is also used in text classification and is commonly known as bag of words). Next, it compares those tokens against known spam and non-spam emails. Finally, using the Naive Bayes algorithm, it calculates the probability that the email is spam.
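
A toy version of that pipeline, with invented emails (not Gmail's actual system, of course):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny hypothetical training set of labeled emails (1 = spam, 0 = not spam)
emails = ["win money now", "cheap pills win prize", "meeting at noon",
          "project update attached", "claim your free prize now"]
labels = [1, 1, 0, 0, 1]

# bag of words: turn each email into token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)

new_email = vectorizer.transform(["free money prize"])
print("probability of (not spam, spam):", clf.predict_proba(new_email).round(3))
```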

In Conclusion

We find that Machine Learning is a widely used technology with many applications we barely notice because they are part of everyday life. In this article, we have not only distinguished between the different approaches to machine learning but also shown how to choose among them according to the data we’re working with and the problem we want to solve.

To learn Machine Learning, you need some knowledge of calculus, linear algebra, and statistics, as well as programming skills. You can implement these algorithms in different programming languages, from Python to C++ and R. It’s up to you to make the best decision and start learning along with your machine.






