Support Vector Machines (SVMs) are machine learning algorithms used to classify objects. In areas such as text or image classification, they have advantages over neural networks: they can be trained more quickly and already deliver good results with small amounts of training data.
How do Support Vector Machines work?
Suppose we have data with two classes (blue, yellow) and two features (x, y). We want the SVM to decide, based on the features x and y, whether a data object belongs to the blue or the yellow class. Since we have only two dimensions, we can plot our training data in a coordinate system.
The Support Vector Machine returns a so-called hyperplane that best separates the two groups; in two-dimensional space, this is simply a line. The hyperplane is then used to decide which class a data object falls into. In our example, all objects to the left of the hyperplane are classified as “yellow” and all to the right as “blue”.
During training, the Support Vector Machine selects the hyperplane so that the so-called margin becomes maximal. The margin measures the distance from the nearest element of each group to the hyperplane. Maximizing it means the hyperplane is chosen so that the SVM separates the two classes as clearly as possible.
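Formally, with class labels y_i ∈ {−1, +1}, this margin maximization can be written as the standard hard-margin optimization problem (a textbook formulation added here for reference; w denotes the normal vector of the hyperplane and b its offset, neither of which appears in the original text):

\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1 \quad \text{for all } i

The margin equals 2 / \lVert w \rVert, so minimizing \lVert w \rVert maximizes the distance between the hyperplane and the nearest points of each class, the so-called support vectors that give the method its name.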
How does it behave with Non-Linearly Separable Data?
Unfortunately, real-world data can rarely be separated as cleanly by a simple line as in our example. Changing the data set just a little makes the problem considerably more complex.
Although we can clearly distinguish the two sets of data by eye, they cannot be separated with a simple linear hyperplane. One way to use Support Vector Machines anyway is to introduce a new dimension, constructed so that the data points become separable by a hyperplane in the higher-dimensional space.
Since finding such a dimension by hand can be very difficult in some cases, usually becomes computationally expensive, and unnecessarily inflates the algorithm, we will not discuss this alternative in more detail in this article; the brief sketch below merely illustrates the idea.
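As a minimal illustration (using scikit-learn's synthetic make_circles data and the common textbook mapping that adds the squared distance from the origin as a third feature; these specific choices are ours, not part of the article's original example):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by a straight line in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add a third feature: the squared distance from the origin.
# In this 3D space, the inner and outer circle sit at different
# "heights", and a flat (linear) hyperplane can separate them.
X_3d = np.hstack([X, (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)])

clf = SVC(kernel="linear").fit(X_3d, y)
print("Training accuracy with explicit mapping:", clf.score(X_3d, y))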
How does the Kernel Trick for Support Vector Classification work?
Rather than constructing new dimensions explicitly, Support Vector Machines use the so-called kernel trick. Our problem is that SVMs can only separate data classes using linear hyperplanes, so we need to transform the non-linear data so that it becomes separable by a linear boundary. To do this, we need a higher-dimensional space in which the data is linearly separable.
Mathematically, this again corresponds to the optimization problem we saw above: we want to find the hyperplane with the maximum distance to the nearest data point of each class. For a non-linear data set, a so-called mapping function additionally appears in this optimization problem. It maps each data object to a point in a higher-dimensional space in which the data is separable by a hyperplane.
The biggest problem is finding this mapping, which we need to solve the optimization problem. Theoretically, there are infinitely many functions that could serve this purpose, but we do not want our computer to have to try all of these possibilities. Here a mathematical result comes to our aid: the so-called Mercer’s Theorem.
In simple words, it says that we do not need to know the exact mapping to solve our optimization problem; it is enough to know how to compute the scalar products of the data points with each other. This is exactly what the so-called kernel functions provide (e.g. the Gaussian kernel or the spectrum kernel). Any function that satisfies Mercer’s theorem is a kernel function and can be used in place of the explicit mapping, which makes optimizing over non-linearly separable data much easier.
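To see where the kernel enters mathematically, here is the dual form of the SVM optimization problem as found in standard textbooks (added here for reference; the α_i are the Lagrange multipliers of the margin constraints):

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0

The data enters only through the pairwise kernel evaluations K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j), which is exactly why the explicit mapping \varphi never has to be computed. A popular choice is the Gaussian (RBF) kernel K(x, z) = \exp(-\gamma \lVert x - z \rVert^2).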
How to evaluate the performance of an SVM?
Various key figures can be calculated to evaluate the performance of a Support Vector Machine objectively. In most cases, it makes sense to determine several of the following key figures, as each of them highlights a different aspect of the model’s performance.
- Accuracy: This measures the ratio of correctly classified data points to the total number of data points in the data set.
- Precision: Precision deals exclusively with the samples of a class and measures the ratio of the correctly predicted samples of this class to the total number of predictions assigned to this class.
- Recall: Recall differs slightly from precision in that it does not divide by the total number of predictions of a class but by the total number of actual samples of a class in the data set.
- F1 score: The F1 score forms the harmonic mean of precision and recall to be able to evaluate these two metrics together. Both metrics are weighted equally.
- Confusion matrix: This matrix is used for binary classification problems, i.e. those that differentiate between only two classes. Its four fields show the true and false predictions for each class (true positives, false positives, true negatives, and false negatives).
- ROC curve: This curve plots the sensitivity (true positive rate) against 1 − specificity (false positive rate) to evaluate the performance of the model at different threshold values.
- Precision-recall curve: This curve follows the same pattern as the ROC curve, with the difference that precision and recall are visualized as key figures.
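Expressed in terms of the four confusion-matrix fields (TP, FP, TN, FN for true/false positives and negatives), the first four key figures read (standard definitions, added here for reference):

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}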
Depending on the type of classification, some of these metrics are more important than others. In medical diagnostics, for example, the non-detection of a sick patient can have far greater consequences than the incorrect diagnosis of a healthy subject. Therefore, the focus here should be on recall, as this key figure evaluates how well the model recognizes all sick persons in a data set.
In addition to the choice of key figures, the method of splitting the data set into training and test sets is also important. With a single train-test split, the key figures calculated on the test data set may not be meaningful enough. It is therefore advisable to use cross-validation, for example, to calculate the key figures on several different test sets and thus obtain a more reliable estimate, as sketched below.
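As a minimal sketch of both points, the following snippet computes several of the key figures above on a held-out test set and then uses 5-fold cross-validation for a more robust estimate. The Iris data and the linear-kernel SVC mirror the implementation section below; the "macro" averaging is our choice for the three-class problem, not something the article prescribes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = SVC(kernel="linear", C=1).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Key figures on the held-out test set ("macro" averages over the three classes)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall:   ", recall_score(y_test, y_pred, average="macro"))
print("F1 score: ", f1_score(y_test, y_pred, average="macro"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# 5-fold cross-validation: five different train/test splits instead of one
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))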
What are the Advantages and Disadvantages of SVMs?
| Advantages | Disadvantages |
| --- | --- |
| Simple and fast training of the model | Finding the right kernel function and its parameters can be computationally intensive |
| Scale well to high-dimensional data | Cannot handle noisy data well |
| Good alternative to neural networks | Require more training samples than features to work well |
| Usable for both linearly and non-linearly separable data | No probabilistic interpretation of the result (the decision depends solely on which side of the hyperplane a point lies) |
What are the different types of Support Vector Machines?
Depending on the problem and the data set on which SVMs are trained, there are generally three different types of support vector machines:
- Linear SVMs: If the data classes can be separated by a straight line, or by a hyperplane in a higher-dimensional space, this is called a linear SVM. The advantage of these models is their comparatively simple and fast computation, which enables straightforward classification even on large data sets. However, the structures and relationships in real data sets are not always linear; often they are much more complex and cannot be captured by a linear boundary.
- Non-linear SVMs: For such cases, non-linear SVMs are used, which separate the data with significantly more complex, non-linear decision surfaces. To do this, they use so-called kernel functions, which determine the shape of the hyperplane in a high-dimensional space. Frequently used kernel functions include the polynomial, radial basis function (RBF), and sigmoid kernels.
- Support vector regression (SVR): Support vector machines can also be used when continuous values are to be predicted rather than classes. Support vector regression is used for this. The procedure remains similar: a hyperplane is sought, but now with the goal of minimizing the distance between the predicted and the actual outputs. In this way, both linear and non-linear relationships between inputs and outputs can be learned. Here, too, kernel functions are used.
Which of these support vector machines is used depends on the application; the structure of the data set also plays a decisive role in this choice.
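As a brief sketch of how these three variants look in scikit-learn (the kernel choices and parameter values shown are illustrative defaults, not recommendations):

from sklearn.svm import SVC, SVR

# Linear SVM: a straight hyperplane as the decision boundary
linear_svm = SVC(kernel="linear")

# Non-linear SVMs: kernel functions shape the decision boundary
rbf_svm = SVC(kernel="rbf", gamma="scale")   # radial basis function (Gaussian)
poly_svm = SVC(kernel="poly", degree=3)      # polynomial
sigmoid_svm = SVC(kernel="sigmoid")          # sigmoid

# Support vector regression for continuous target values
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)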
How to implement a Support Vector Machine in Python?
To implement a Support Vector Machine in Python, you can use the scikit-learn library, which provides ready-made models that can be imported and used with little effort. In this example, we show how to set up a Support Vector Machine with the Iris dataset in just a few steps. First, the required modules are imported:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
The data set can then be loaded. You can either load your own data, for example from an Excel file, or use one of the ready-made data sets from scikit-learn. For this example, we use the publicly available Iris dataset, which contains measurements of various iris flowers and uses their species as the target variable.
iris = datasets.load_iris()
X = iris.data
y = iris.target
We split the data set into training and test data sets so that we can later calculate the model’s key figures on independent data. A 70/30 or 80/20 split into training and test is common:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
The model can now be instantiated and customized via its hyperparameters. The most important parameter is the kernel, which determines the type of decision boundary. By default, scikit-learn uses the Gaussian RBF kernel, which can also handle non-linear classification; here we choose a linear kernel, which is sufficient for the Iris data. The parameter C controls the regularization of the model.
svm = SVC(kernel='linear', C=1)
The model can now be trained with the training data:
svm.fit(X_train, y_train)
After training, the test data is classified so that the performance of the model can then be evaluated:
y_pred = svm.predict(X_test)
These predictions can now be used to calculate key figures for the test set.
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)
This section presents a simple example of how an SVM can be implemented in Python. You can use this as a starting point for your own calculation and customize it to your data.
This is what you should take with you
- Support Vector Machines are machine learning algorithms for classifying data objects.
- SVMs try to find the best so-called hyperplane, which separates the data groups most clearly from each other.
- If the data is not separable by a linear boundary, for example, a straight line or a plane, we can use the so-called kernel trick.
- SVMs are good alternatives to neural networks in the classification of objects.
Other Articles on the Topic of SVMs
- Transformation of non-linear applications into higher dimensions using Monkey Learn as an example.