Introduction to machine learning
What is machine learning?
Machine learning comprises a variety of tools and methodologies designed to uncover patterns within datasets. This lesson introduces a selection of these techniques, although there are numerous others beyond the scope of this session. These techniques can be broadly categorised into two main groups: predictors and classifiers. Predictors forecast a value or a set of values from a given set of inputs. For instance, they may predict the cost of an item given economic conditions and the price of raw materials, or forecast a country’s GDP based on its life expectancy. Classifiers, on the other hand, categorise data into distinct groups. For example, they might recognise the characters within an image of written text, or determine whether a message is spam or legitimate.
Training Data
Many machine learning systems, although not all, acquire knowledge by processing a sequence of input and output data, which they then utilize to construct a model. The mathematical underpinnings of machine learning are agnostic to the nature of the data, as long as it can be represented numerically or categorised. Examples of such applications include:
- Estimating an individual’s weight based on their height
- Predicting commute durations given prevailing traffic conditions
- Forecasting housing prices based on stock market fluctuations
- Distinguishing between spam and legitimate emails
- Identifying whether an image contains a person or not
Typically, these models require extensive training with hundreds, thousands, or even millions of examples before they achieve sufficient accuracy for practical predictions or classifications. Some systems undertake training as a one-time process, resulting in the creation of a model. Others may continuously refine their training through real-world system usage and human feedback, an approach also known as reinforcement learning. For instance, every time a user labels an email as spam or not spam, they likely contribute to further training of the spam filter’s model.
Types of output
Predictors usually produce an output on a continuous scale, such as the price of something. Classifiers tell you which class (or classes) are present in the data. For example, a system to recognise handwriting from an input image will need to classify the output into one of a set of potential characters.
Machine learning vs Artificial Intelligence
Artificial Intelligence encompasses systems with generalised intelligence, theoretically capable of solving a wide array of problems. However, AI is a broad term with varying interpretations. Machine learning systems, on the other hand, are typically trained to address specific problems. While they may exhibit learning behaviour, they lack the generalised intelligence to solve any problem a human could tackle. These systems often require hundreds or thousands of examples to learn and are limited to relatively straightforward classifications, whereas a human can often learn from a single example. Another definition of Artificial Intelligence traces back to the 1950s and Alan Turing’s “Imitation Game”: a system could be deemed intelligent if it could deceive a human into believing they were interacting with another human when, in fact, they were conversing with a computer. Modern systems are approaching the point of successfully fooling humans, yet a machine with full human-like intelligence remains a distant prospect.
Applications of machine learning
Machine learning in our daily lives
- Image Recognition
- Object Detection
- Character Recognition
- Insurance Premiums
- Energy usage
Example of machine learning in research
- Detecting water leaks in pipes.
- Cancer detection.
- Improving farming productivity.
Limitations of Machine Learning
Garbage In = Garbage Out
In Computer Science, there’s a well-known saying: “Garbage In = Garbage Out.” This adage highlights the principle that if the input data provided is of poor quality or irrelevant, the resulting output will likely be similarly flawed. For example, if we attempt to train a machine learning system to establish a correlation between two variables that are fundamentally unrelated, the model may still generate a semblance of a connection, but the output will lack meaningful significance. This is often apparent when the model’s output appears erratic or seemingly random.
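As a quick, hedged illustration (the variables here are invented for this sketch and are not part of the lesson), fitting a linear model to two completely unrelated random variables still produces coefficients, but they carry no real meaning:
> set.seed(42)
> shoe_size <- rnorm(100, mean = 42, sd = 3) ## two made-up, unrelated variables
> lottery_wins <- rnorm(100, mean = 10, sd = 5)
> junk_model <- lm(lottery_wins ~ shoe_size) ## the model fits without complaint
> summary(junk_model) ## but R-squared is near zero and the slope is not significant: garbage out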
Bias or lacking training data
The input data may also lack sufficient diversity to encompass all potential scenarios. Biases present in the data collection process can subsequently manifest in the machine learning system. For instance, if data on crime reporting is gathered, it may skew towards wealthier areas where incidents are more likely to be reported. Historical data might be inadequate in terms of coverage or relevance to the specific context being analysed. For example, imagine creating a model to transcribe written text from historical documents. If the model is trained solely on documents from the 1950s to 2000, it may perform well when tested on similar samples from that era. However, testing the model on pre-1950s material might yield poor results because handwriting styles and language usage evolve over time.
Extrapolation
We can only confidently forecast outcomes for data that falls within the range of our training data. When attempting to extrapolate beyond the scope of our training data, it is likely that our predictions will be inaccurate. An easy way to see this is to plot your training data based on its features along with the sample you want to analyse. If the sample is nowhere near your training data then you could consider it an outlier, as in the sketch below.
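A minimal sketch of that check, using made-up heights (continuing the weight-from-height example above):
> set.seed(1)
> train_height <- rnorm(50, mean = 170, sd = 10) ## hypothetical training feature (height in cm)
> new_height <- 230 ## a new sample we are asked to predict for
> range(train_height) ## the range covered by the training data
> new_height >= min(train_height) && new_height <= max(train_height) ## FALSE, so this is extrapolation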
Over fitting
Sometimes ML algorithms become over-trained on their training data and struggle when presented with new, real data. The model has focused too closely on characteristics that are specific to the training set, and these may not generalise when it is used to predict on the test set, again resulting in effectively random predictions. It is therefore critical not to over-train (train for too long on) your model.
Inability to explain answers
Many machine learning techniques will give us an answer given some input data even if that answer is wrong. Most are unable to explain any kind of logic in arriving at that answer. This can make diagnosing and even detecting problems with them difficult.
Clustering
Clustering
Clustering involves grouping data points based on their similarities, offering a robust method for detecting patterns within datasets. It typically operates without the need for training, distinguishing it as an unsupervised learning approach. This lack of a training requirement makes it quick to apply.
Applications of Clustering
- Looking for trends in data
- Data compression: all data clustered around a point can be reduced to just that point. For example, reducing the colour depth of an image.
- Pattern recognition
K-means Clustering
The K-means clustering algorithm is a straightforward technique aimed at pinpointing the centroid of each cluster. It achieves this by seeking a point that minimises the distance between the centroid and all the points within the cluster. While the algorithm requires a predetermined number of clusters to identify, a common approach involves experimenting with various cluster numbers and employing additional tests to determine the optimal configuration.
Let's look at our data
First, let's have a look at the features within our dataset:
> data("iris") ## load in data
> head(iris) ## show just the first few rows
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
We could also compare different features; let's compare petal length against petal width:
> plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Iris Data") ## plot two features against each other
> legend("top", levels(iris$Species), pch = 21,col = c("red","green3","blue")) ### well of course we want a legend
As observed, the “setosa” species clusters quite distinctly, while there is some overlap between “versicolor” and “virginica”. Now, let’s fit the model. Since the kmeans function is included with base R, there’s no need to install any additional packages. When using the kmeans function, it’s essential to specify the “centers” parameter, which is the number of clusters we intend to create. In this scenario, we know that the appropriate value is 3, so let’s set it accordingly and try to cluster all the features:
> set.seed(0)
> irisCluster <- kmeans(iris[,1:4], centers=3, nstart=20) ### take only the feature columns that we want
> irisCluster
Now let's have a look at the 3 clusters the model has come up with. To do this we use a library called "cluster", so we can see the regions/groups that the points have been separated into.
> library(cluster)
> clusplot(iris[,1:4], irisCluster$cluster, color=T, shade=T, labels=0, lines=0) ## special kind of plot for showing clusters
Limitations of K-Means
- Requires the number of clusters to be known in advance (see the sketch after this list for one common way to choose it)
- Struggles when clusters have irregular shapes
- Will always produce an answer, finding the requested number of clusters even if the data isn't actually clustered (or doesn't contain that many clusters)
- Requires linear cluster boundaries
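One common way to handle the first limitation, choosing the number of clusters, is the "elbow" method: run K-means for a range of values of k and plot the total within-cluster sum of squares, looking for the point where the curve stops dropping sharply. A minimal sketch (assuming the iris data loaded above):
> wss <- sapply(1:8, function(k) kmeans(iris[,1:4], centers=k, nstart=20)$tot.withinss) ## total within-cluster sum of squares for k = 1..8
> plot(1:8, wss, type="b", xlab="Number of clusters k", ylab="Total within-cluster sum of squares") ## look for the "elbow" in the curve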
Advantages of K-Means
- Simple algorithm, fast to compute. A good choice as the first thing to try when attempting to cluster data.
- Suitable for large datasets due to its low memory and computing requirements.
Spectral Clustering
Spectral clustering is a method utilised in machine learning and data analysis to cluster data points according to their likeness. This approach entails converting the data into a format where clusters are discernible, followed by applying a clustering algorithm to this altered data. In the R Programming Language, spectral clustering achieves this transformation by leveraging the eigenvalues and eigenvectors of a similarity matrix.
Spectral clustering works by transforming the data into a lower-dimensional space where clustering is performed more effectively. The key steps involved in spectral clustering are as follows:
Affinity Matrix
Begin with a dataset containing data points. Calculate an affinity or similarity matrix that measures the connections between these data points, indicating their level of similarity or correlation. This matrix encapsulates the degree of similarity or relationship between each pair of data points. Popular methods for computing affinity include Gaussian similarity, k-nearest neighbors, or a custom similarity function provided by the user.
Graph Representation
View the affinity matrix as the adjacency matrix of a weighted undirected graph. In this graph, each data point represents a vertex, and the edge weight between vertices indicates the similarity between the respective data points.
Laplacian Matrix
Construct the graph Laplacian matrix, which captures the connectivity of the data points in the graph. There are two main types of Laplacian matrices used in spectral clustering.
- Unnormalised Laplacian: L = D - A, where D is the degree matrix and A is the affinity matrix. The degree of a vertex is the sum of the weights of its adjacent edges.
- Normalised Laplacian: L_norm = I - D^(-1/2) * A * D^(-1/2), where D^(-1/2) is the diagonal matrix of the inverse square roots of the node degrees.
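The worked example further below uses the similarity matrix directly rather than a Laplacian, but as a hedged sketch (the variable names are our own) the two matrices above could be constructed in R like this:
> A <- as.matrix(exp(-dist(iris[,1:4])^2 / (2 * 1^2))) ## Gaussian (RBF) affinity matrix
> diag(A) <- 0 ## no self-similarity on the diagonal
> D <- diag(rowSums(A)) ## degree matrix
> L_unnorm <- D - A ## unnormalised Laplacian
> D_inv_sqrt <- diag(1 / sqrt(rowSums(A))) ## D^(-1/2)
> L_norm <- diag(nrow(A)) - D_inv_sqrt %*% A %*% D_inv_sqrt ## normalised Laplacian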
Eigenvalue Decomposition
Compute the eigenvalues (λ_1, λ_2, …, λ_n) and the corresponding eigenvectors (v_1, v_2, …, v_n) of the Laplacian matrix. You typically compute a few eigenvectors, corresponding to the smallest non-zero eigenvalues.
Embedding
Use the selected eigenvectors to embed the data into a lower-dimensional space. The eigenvectors represent new features that capture the underlying structure of the data. The matrix containing these eigenvectors is referred to as the spectral embedding.
### Using Euclidean distance as a similarity measure
> similarity_matrix <- exp(-dist(iris[, 1:4])^2 / (2 * 1^2))
### Compute Eigenvalues and Eigenvectors
> eigen_result <- eigen(similarity_matrix)
> eigenvalues <- eigen_result$values
> eigenvectors <- eigen_result$vectors
### Choose the First k Eigenvectors
> k <- 3
> selected_eigenvectors <- eigenvectors[, 1:k]
### Apply K-Means Clustering
> cluster_assignments <- kmeans(selected_eigenvectors, centers = k)$cluster
### Add species information to the clustering results
> iris$Cluster <- factor(cluster_assignments)
> iris$Species <- as.character(iris$Species)
### Plot the clustering results
> library(ggplot2)
> ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Cluster, label = Species)) + geom_point() +
geom_text(check_overlap = TRUE, vjust = 1.5) +
labs(title = "Spectral Clustering with k-means of Iris Dataset",
x = "Sepal Length", y = "Sepal Width")
Exercise: Increasing the number of cluster centres
Have a go at increasing the number of centres for your K-means clustering to find. What does it look like if you try 4, 5 or even 6? How could we find the optimal number?
Dimensional Reduction
Dimensionality Reduction
Dimensionality reduction serves as a potent technique for analysing and visualising data sets, especially when dealing with high-dimensional data such as large feature sets or outputs from machine learning models. These methods reduce the number of features in your data, which is crucial given that visualising anything beyond two dimensions is challenging. In this section we will focus on two commonly used methods for dimensionally reducing your data: Principal Component Analysis (PCA), a linear method, and t-SNE, a non-parametric, non-linear method.
Examine the dataset
Let's make some plots of each of our features, so we can see their distributions.
> par(mfrow = c(2, 2))
> hist(iris$Sepal.Length, breaks = 20) ## histograms for each feature
> hist(iris$Sepal.Width, breaks = 20)
> hist(iris$Petal.Length, breaks = 20)
> hist(iris$Petal.Width, breaks = 20)
Principal Component Analysis (PCA)
PCA is a technique that applies rotations to data in a two-dimensional array to decompose it into combinations of vectors that are orthogonal and can be ordered according to the amount of information they carry. There are as many principal components as there are variables in the data, and the components are constructed so that the first principal component accounts for the largest possible variance in the data set. Hence, when you condense your data into two dimensions, you're essentially using the two principal components with the highest variance.
# PCA
# Make sure you reset your variables
> data(iris)
> pc <- prcomp(iris[,-5],center = T,scale. = T) ## start PCA
> pc
> summary(pc)
Standard deviations (1, .., p=4):
[1] 1.7083611 0.9560494 0.3830886 0.1439265

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5210659 -0.37741762  0.7195664  0.2612863
Sepal.Width  -0.2693474 -0.92329566 -0.2443818 -0.1235096
Petal.Length  0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width   0.5648565 -0.06694199 -0.6342727  0.5235971

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
Now let's visualise our reduced features:
> library(ggbiplot)
> ggbiplot(pc,obs.scale = 1, var.scale = 1, groups = iris$Species) ##plot our pca results
t-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a statistical approach used to visually represent high-dimensional data by assigning each data point a position on a two- or three-dimensional map. Unlike linear techniques, t-SNE is nonlinear and is particularly effective for reducing the dimensionality of data to enable visualization in a lower-dimensional space. It accomplishes this by modeling each high-dimensional object as a point in two or three dimensions, ensuring that similar objects are positioned close together while dissimilar ones are placed farther apart with high probability.
# t-SNE embedding
> library(tsne)
> features <- subset(iris, select = -c(Species)) ### create a subset data frame without the species labels
> set.seed(0)
> tsne <- tsne(features, initial_dims = 2) ### let's reduce our features from 4 dimensions to 2
> tsne <- data.frame(tsne) ## put it in a data frame so it's easier to use
> pdb <- cbind(tsne,iris$Species) ## add back in the labels
> summary(tsne)
       X1                X2          
 Min.   :-16.857   Min.   :-5.276300 
 1st Qu.:-10.994   1st Qu.:-2.199154 
 Median : -2.691   Median : 0.009581 
 Mean   :  0.000   Mean   : 0.000000 
 3rd Qu.: 12.147   3rd Qu.: 2.051889 
 Max.   : 20.724   Max.   : 5.731033 
> plot(tsne, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Iris Data") ## plot tsne
> legend("top",levels(iris$Species), pch = 21, col = c("red","green3","blue"))
Exercise: Parameters
Look up parameters that can be changed in PCA and t-SNE, and experiment with these. How do they change your resulting plots? Might the choice of parameters lead you to make different conclusions about your data?
Exercise: Other Algorithms
There are other algorithms that can be used for dimensionality reduction, for example the Higher Order Singular Value Decomposition (HOSVD). Do an internet search for some of these and examine the example data that they are used on. Are there cases where they do poorly? What level of care might you need to take before applying such methods for automation in critical scenarios? What about for interactive data exploration?
Regression
Linear regression
We now create a basic linear model for a given dataset. It would be valuable to assess the accuracy of this model. One way to achieve this is by computing the predicted y-values for each x-value in our original dataset and comparing them with the actual y-values. We can aggregate these individual discrepancies into a single overall error metric, the root mean squared error: square each difference, sum them, divide by the total number of observations, and take the square root of the result. Squaring before taking the square root prevents negative errors from offsetting positive ones, giving us an overall measure of how well the model fits the data.
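As a minimal sketch (using the petal-width model that we fit with lm later in this section), the error metric described above can be computed as:
> fit <- lm(Petal.Width ~ Petal.Length, data=iris) ## the simple linear model used below
> rmse <- sqrt(mean((iris$Petal.Width - predict(fit, iris))^2)) ## square, average, square root
> rmse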
Preprocess the dataset
An easy way to calculate our coefficients is to use a least squares fit.
> lsfit(iris$Petal.Length, iris$Petal.Width)$coefficients # find linear fit intercepts
 Intercept          X 
-0.3630755  0.4157554 
So now we have our coefficients, let's plot our line of best fit over our data.
> plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Petal width")
> abline(lsfit(iris$Petal.Length, iris$Petal.Width)$coefficients, col="black") ### plot the clusters with linear line.
> legend("top",levels(iris$Species), pch = 21, col = c("red","green3","blue"))
Now let's have a go at building a linear model instead, using "lm":
> lm_fit <- lm(Petal.Width ~ Petal.Length, data=iris) ## create linear model
> lm_fit$coefficients
(Intercept) Petal.Length
-0.3630755 0.4157554
Again, let's plot our linear model:
> plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Petal width")
> abline(lm(Petal.Width ~ Petal.Length, data=iris)$coefficients, col="black") ## plot linear model
> legend("top",levels(iris$Species), pch = 21, col = c("red","green3","blue"))
We can also look at how well our linear model fits the data by examining the p-values, and we can have the model predict values of petal width.
> summary(lm(Petal.Width ~ Petal.Length, data=iris))
> newdata = data.frame(Petal.Length=c(2,3,5)) ##create dataframe of features to predict
> predict(lm_fit, newdata) ## predict linear model
Call:
lm(formula = Petal.Width ~ Petal.Length, data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.56515 -0.12358 -0.01898 0.13288 0.64272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.363076 0.039762 -9.131 4.7e-16 ***
Petal.Length 0.415755 0.009582 43.387 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
Prediction Results
1 2 3
0.4684353 0.8841907 1.7157016
Try different features
Have a go at using the same code but with the sepal measurements instead of the petal ones, or any combination.
Logistic Regression
We’ve now seen how we can use linear regression to make a simple model and use that to predict values, but what do we do when the relationship between the data isn’t linear?
Logarithms Introduction
Logarithms are the inverse of an exponent (raising a number by a power).
If log_b(a) = c, then b^c = a
For example:
2^5 = 32, so log_2(32) = 5
If you need more help on logarithms see the Khan Academy’s page
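As a quick aside (not part of the original code), R's log() function takes the base as its second argument, so you can check the example above directly:
> 2^5 ## 32
> log(32, base=2) ## 5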
This time, instead of focusing on plotting, we're going to use logistic regression as a classifier. First we need to preprocess our dataset by splitting it into training and test data. Then we will fit a logistic regression (the binomial family) using the sepal length feature.
> library(caTools)
> set.seed(1)
> split = sample.split(iris$Sepal.Length, SplitRatio = 0.75) ## create dataset split
> train = subset(iris, split==TRUE) ## train split
> test = subset(iris, split==FALSE) ## test split
> y<-train$Species; x<-train$Sepal.Length ## use sepal length as features
> glfit<-glm(y~x, family = 'binomial')
> summary(glfit)
Call:
glm(formula = y ~ x, family = "binomial")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.94538  -0.50121   0.04079   0.45923   2.26238  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -25.386      5.517  -4.601 4.20e-06 ***
x              4.675      1.017   4.596 4.31e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110.854  on 79  degrees of freedom
Residual deviance:  56.716  on 78  degrees of freedom
AIC: 60.716

Number of Fisher Scoring iterations: 6
So we have now created our model and we want to predict some of the samples in our test set.
> newdata<- data.frame(x=test$Sepal.Length) ## convert data into dataframe
> predicted_val<-predict(glfit, newdata, type="response") ## predict test set
> prediction<-data.frame(test$Sepal.Length, test$Species,predicted_val) ## cast prediction to dataframe
> prediction
   test.Sepal.Length test.Species predicted_val
1                4.6       setosa   0.014429053
2                5.0       setosa   0.098256223
3                4.8       setosa   0.038406518
4                5.4       setosa   0.447809228
5                5.1       setosa   0.152523368
6                4.9       setosa   0.061887119
7                4.4       setosa   0.005337797
8                5.1       setosa   0.152523368
9                5.0       setosa   0.098256223
10               6.4   versicolor   0.991906259
11               6.5   versicolor   0.995084059
12               5.2   versicolor   0.229146102
13               6.1   versicolor   0.964535637
14               5.6   versicolor   0.688708107
15               5.9   versicolor   0.908836090
16               6.8   versicolor   0.998904845
17               6.7   versicolor   0.998192419
18               5.5   versicolor   0.572554250
19               5.8   versicolor   0.857868639
20               5.4   versicolor   0.447809228
21               6.0   versicolor   0.942746684
22               6.3   versicolor   0.986701696
23               5.6   versicolor   0.688708107
24               5.5   versicolor   0.572554250
25               5.7   versicolor   0.785142952
26               4.9    virginica   0.061887119
27               7.2    virginica   0.999852714
28               5.7    virginica   0.785142952
29               5.8    virginica   0.857868639
30               6.4    virginica   0.991906259
31               6.1    virginica   0.964535637
32               7.7    virginica   0.999988017
33               6.3    virginica   0.986701696
34               6.0    virginica   0.942746684
35               6.9    virginica   0.999336667
36               6.7    virginica   0.998192419
37               6.2    virginica   0.978223885
Looking at our results, the predicted_val column gives the model's confidence that a sample belongs to the positive class; typically in machine learning we use a 0.5 confidence threshold. Now let's have a look at what our chart looks like.
> qplot(prediction[,1], round(prediction[,3]), col=prediction[,2], xlab = 'Sepal Length', ylab = 'Prediction using Logistic Reg.') ## plot our predictions
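One detail worth noting (this is our reading of R's binomial family, not something spelled out in the lesson): when the response is a factor, glm treats the first level ("setosa") as failure and every other level as success, so predicted_val is roughly the probability that a sample is not setosa. A minimal sketch of applying the 0.5 threshold and checking the result:
> predicted_class <- ifelse(predicted_val > 0.5, "not setosa", "setosa") ## apply the 0.5 threshold
> actual_class <- ifelse(test$Species == "setosa", "setosa", "not setosa") ## true setosa / not-setosa labels
> table(actual_class, predicted_class) ## simple confusion matrix
> mean(predicted_class == actual_class) ## proportion classified correctly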
Try different features
Again, have a go at using different features to see how the predictions change.
Day 1 practical
Exercise 1
There were exercises throughout the sections that we covered today. If you didn't get a chance to do them, please go back and have a go at them now. If you have completed them, go on to Exercise 2.
Exercise 2
Please use the download link to access the penguin dataset, which will be used for this exercise.
Task 1:
Using the penguin dataset, do some k-means clustering using the features bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g, with species as the labels. Use any combination of the features you want, and produce a plot of your k-means results. Then do the same again with spectral clustering, again with a plot.
Task 2:
Now it's time to do some dimensionality reduction: using PCA and t-SNE, create a plot of your data with each method. If you want to be adventurous, try reducing the data to 3 dimensions and creating a 3D plot.
Task 3:
Have a go at fitting linear and logistic regression models to your data. Again, plots would be nice.
Non-Linear Classifiers
K-Nearest Neighbour (KNN)
The k-nearest neighbours algorithm, commonly referred to as KNN or k-NN, is a non-parametric supervised learning classifier. It uses proximity to classify or predict the grouping of a given data point. Although it can tackle both regression and classification tasks, it is predominantly employed as a classifier, based on the assumption that similar data points tend to be found near one another. In classification, the algorithm assigns a class label through a majority vote: the label that appears most frequently among the neighbouring data points is adopted. Strictly speaking this is "plurality voting", although "majority vote" is the term commonly used in the literature; a true majority (over 50%) only arises naturally in binary classification, whereas with, say, four classes a label can win with just over 25% of the votes.
Before we train any non-linear machine learning models, we need to divide our data into train and test sets; to do this we use a library called caTools. Furthermore, most machine learning models work best when the input features are on a similar scale, so we will also need to scale our data.
> library(caTools)
> set.seed(1)
> split = sample.split(iris$Sepal.Length, SplitRatio = 0.75)
> train = subset(iris, split==TRUE)
> test = subset(iris, split==FALSE)
> train_scaled = scale(train[-5])
> test_scaled = scale(test[-5])
> train_scaled
attr(,"scaled:center")
Sepal.Length  Sepal.Width Petal.Length  Petal.Width       setosa    virginica   versicolor 
   5.8522124    3.0663717    3.6734513    1.1513274    0.3628319    0.3362832    0.3008850 
attr(,"scaled:scale")
Sepal.Length  Sepal.Width Petal.Length  Petal.Width       setosa    virginica   versicolor 
   0.8523180    0.4524952    1.8304477    0.7617080    0.4829586    0.4745415    0.4606857 
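Note that the code above scales the training and test sets independently. A common alternative, shown here only as a hedged sketch rather than the lesson's approach, is to reuse the centring and scaling learned from the training set when transforming the test set:
> train_scaled = scale(train[-5])
> test_scaled = scale(test[-5], center=attr(train_scaled, "scaled:center"), scale=attr(train_scaled, "scaled:scale")) ## reuse the training set's centre and spread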
Now let's build our KNN model, for which we use a library called class.
> library(class)
> test_pred <- knn(train = train_scaled, test = test_scaled,cl = train$Species, k=2)
> test_pred
 [1] setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     versicolor versicolor
[12] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
[23] versicolor versicolor versicolor virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
[34] virginica  virginica  virginica  virginica  virginica 
Levels: setosa versicolor virginica
Confusion Matrix
To look at how our model performed, there are a number of approaches. A good one is to look at the confusion matrix, and luckily R has a built-in function that does this for us: all we have to do is pass our prediction results and the actual labels to the table function. Furthermore, by summing the diagonal and dividing by the length of our test set we can compute an accuracy value.
> actual <- test$Species
> cm <- table(actual,test_pred)
> cm
> accuracy <- sum(diag(cm))/length(actual)
> sprintf("Accuracy: %.f%%", accuracy*100)
"Accuracy: 92%"
test_pred
actual setosa versicolor virginica
setosa 9 0 0
versicolor 0 16 0
virginica 0 3 9
Support Vector Machines (SVM)
The Support Vector Machine (SVM) is a powerful supervised algorithm that works particularly well on smaller yet complex datasets. While adept at handling both regression and classification tasks, SVMs particularly shine in classification scenarios. Originating in the 1990s, SVMs remain a popular choice for high-performance algorithms, often requiring minimal tuning to yield robust results. They tackle classification, regression and outlier detection problems by finding optimal transformations of the data, which define boundaries between data points based on predefined classes, labels, or outputs. This section outlines the core principles of SVMs and how to apply them in R.
Strengths of support vector machines:
- Effective in navigating high-dimensional spaces.
- Remain potent even when faced with a higher number of dimensions compared to samples.
- Operate efficiently on memory by utilizing a subset of training points known as support vectors in the decision-making process.
- Offer versatility through the option to specify various Kernel functions for the decision function, including the provision for custom kernels.
Drawbacks of support vector machines:
- When the number of features significantly exceeds the number of samples, guarding against over-fitting necessitates careful selection of Kernel functions and regularization terms.
- Direct probability estimates are not provided by SVMs; obtaining such estimates involves resource-intensive techniques such as five-fold cross-validation.
SVM in R
To create an SVM model we are going to use the library "e1071". We are also going to reuse our train/test split from above.
> library(e1071)
> Species <- train$Species
> svm_model <- svm(Species ~ ., data=train_scaled, kernel="linear") #linear/polynomial/sigmoid
Now let's have a go at predicting our test set using the SVM model. Again we are going to produce a confusion matrix and generate an accuracy score.
> pred = predict(svm_model,test_scaled)
> tab = table(Predicted=pred, Actual = test$Species)
> tab
> accuracy <- sum(diag(tab))/length(test$Species)
> sprintf("Accuracy: %.f%%", accuracy*100)
"Accuracy: 92%"
Actual
Predicted setosa versicolor virginica
setosa 9 0 0
versicolor 0 16 3
virginica 0 0 9
Try a different non-linear classifier
Have a go at implementing a different non-linear classifier. An example of decision trees can be found at https://www.datacamp.com/tutorial/decision-trees-R, or even random forests: https://www.r-bloggers.com/2021/04/random-forest-in-r/
Neural Networks
Introduction
Neural networks, drawing inspiration from the workings of the human brain, represent a machine learning approach adept at discerning patterns and categorizing data, frequently leveraging images as input. This technique, rooted in the 1950s, has evolved through successive iterations, surmounting inherent constraints. Today, the pinnacle of neural network advancement is often denoted as deep learning.
Perceptrons
Perceptrons serve as the foundational units within neural networks, mirroring the functionality of individual neurons in the brain. Typically equipped with one or more inputs and a single output, they operate by weighting each input and summing these weighted values. The summed result is then passed to an activation function, which determines whether the neuron emits a signal. Some activation functions use a simple threshold step, outputting zero or one depending on whether the sum exceeds the threshold; other designs use smoother functions, but these still commonly yield outputs between zero and one and rise sharply around a threshold.
Coding a perceptron
The function requires three parameters: Inputs, a list of input values; Weights, a list of weight values; and Threshold, denoting the activation threshold. Initially, we perform element-wise multiplication of each input with its corresponding weight. Subsequently, the total sum of these products is computed. If this sum falls below the activation threshold, the output is zero; otherwise, it is one.
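The lesson does not include the code itself, so here is a minimal sketch in R that matches the description above (the function and argument names are our own):
> perceptron <- function(inputs, weights, threshold) {
    total <- sum(inputs * weights) ## multiply each input by its weight, then sum
    if (total < threshold) 0 else 1 ## step activation: output 1 only at or above the threshold
  }
> perceptron(c(1,1), c(0.5,0.5), threshold=1) ## behaves like a logical AND: returns 1
> perceptron(c(1,0), c(0.5,0.5), threshold=1) ## returns 0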
Perceptron limitations
A solitary perceptron is incapable of resolving any function that lacks linear separability, necessitating the ability to partition input and output classes with a straight line. An illustrative instance is the XOR function below:
| Input 1 | Input 2 | Output |
|---------|---------|--------|
| 0       | 0       | 0      |
| 0       | 1       | 1      |
| 1       | 0       | 1      |
| 1       | 1       | 0      |
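As a quick sketch (not part of the original lesson), plotting the truth table shows that no single straight line can separate the points with output one from those with output zero:
> xor_data <- data.frame(input1=c(0,0,1,1), input2=c(0,1,0,1), output=c(0,1,1,0)) ## the XOR truth table
> plot(xor_data$input1, xor_data$input2, pch=19, cex=2, col=c("red","blue")[xor_data$output + 1], xlab="Input 1", ylab="Input 2", main="XOR is not linearly separable")
> legend("center", legend=c("output 0","output 1"), pch=19, col=c("red","blue"))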
The XOR function yields an output of zero when both inputs are the same (both one or both zero) and one otherwise, so its two output classes cannot be separated by a straight line. This limitation of single perceptrons, highlighted in the 1960s, led to a stagnation in neural network research for over a decade, a period often referred to as the "AI Winter".
Multi-layer Perceptrons
A single perceptron lacks the capability to address functions that lack linear separability. To tackle such nonlinear challenges, we rely on multiple perceptrons, often organized into several layers. These layers constitute networks of artificial neurons, each capable of processing one or more inputs and producing a single output. The neurons interconnect within expansive networks, commonly comprising tens to thousands of units. Typically, these networks are structured in layers, encompassing an input layer, one or more hidden layers, and ultimately, an output layer.
Training Multi-layer perceptrons
Multi-layer perceptrons need to be trained by showing them a set of training data and measuring the error between the network’s predicted output and the true value. Training takes an iterative approach that improves the network a little each time a new training example is presented. There are a number of training algorithms available for a neural network today, but we are going to use one of the best established and well known, the backpropagation algorithm. The algorithm is called back propagation because it takes the error calculated between an output of the network and the true value and takes it back through the network to update the weights. If you want to read more about back propagation, please see this chapter from the book “Neural Networks - A Systematic Introduction”.
Multi-layer perceptrons training in R
We’re preparing to construct a multi-layer perceptron to predict species in the iris dataset. With the dataset’s four computed features representing two aspects of the plant, along with width and height, we’ll set up four input neurons. The number of hidden layers can vary and is typically determined through experimentation. Since there are three different species of plants, we’ll incorporate three output neurons.
Before delving into the construction, let’s organize our data for ingestion into the neural network:
> data(iris)
> iris$setosa <- iris$Species=="setosa"
> iris$virginica <- iris$Species == "virginica"
> iris$versicolor <- iris$Species == "versicolor"
> iris.train.idx <- sample(x = nrow(iris), size = nrow(iris)*0.5)
> iris.train <- iris[iris.train.idx,]
> iris.valid <- iris[-iris.train.idx,]
Now let's build our neural network, for which we use a library called neuralnet:
> library(neuralnet)
> iris.net <- neuralnet(setosa+versicolor+virginica ~
Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data=iris.train, hidden=c(10,10), rep = 5, err.fct = "ce",
linear.output = F, lifesign = "minimal", stepmax = 1000000,
threshold = 0.001)
hidden: 10, 10    thresh: 0.001    rep: 1/5    steps: 1763    error: 0.00011    time: 0.29 secs
hidden: 10, 10    thresh: 0.001    rep: 2/5    steps:  661    error: 0.00035    time: 0.1 secs
hidden: 10, 10    thresh: 0.001    rep: 3/5    steps:  688    error: 0.00044    time: 0.1 secs
hidden: 10, 10    thresh: 0.001    rep: 4/5    steps: 1014    error: 0.00018    time: 0.15 secs
hidden: 10, 10    thresh: 0.001    rep: 5/5    steps: 1023    error: 0.00023    time: 0.15 secs
Now let's take a look at our trained neural network:
> plot(iris.net, rep="best")
Prediction using a multi-layer perceptron
Confusion Matrix
As with KNN, we can evaluate the model using a confusion matrix, built by passing our predictions and the true labels to the table function, and compute an accuracy value by summing the diagonal and dividing by the size of the validation set.
> iris.prediction <- compute(iris.net, iris.valid[-5:-8])
> idx <- apply(iris.prediction$net.result, 1, which.max)
> predicted <- c('setosa', 'versicolor', 'virginica')[idx]
> cm <- table(predicted, iris.valid$Species)
> accuracy <- sum(diag(cm))/length(iris.valid$Species)
> sprintf("Accuracy: %.f%%", accuracy*100)
predicted    setosa versicolor virginica
  setosa         25          0         0
  versicolor      0         23         2
  virginica       0          2        23
[1] "Accuracy: 95%"
Changing the characteristics of the neural network
There are a number of characteristics you can change in your model that may increase or decrease its performance. Have a go at adjusting the number of steps, the linear output setting ("T" or "F") and the number of hidden layers.
Cloud APIs
Google, Microsoft, Amazon, and many others now have Cloud based Application Programming Interfaces (APIs) where you can upload an image and have them return you the result. Most of these services rely on a large pre-trained (and often proprietary) neural network.
Exercise: Try cloud image classification
Take a photo with your phone camera or find an image online of a common daily scene. Upload it to Google's Vision AI example at https://cloud.google.com/vision/. How many objects has it correctly classified? How many did it incorrectly classify? Try the same image with Microsoft's Computer Vision API at https://azure.microsoft.com/en-gb/services/cognitive-services/computer-vision/. Does it do any better or worse than Google?
Existing API’s of machine learning models
A vast collection of deep learning machine models can be found on the platform known as Hugging Face. Serving as an AI community hub, it offers a diverse array of pre-trained, cutting-edge machine learning models accessible to all users.
Exercise: Existing API’s of machine learning models
go to https://huggingface.co/ and have ago at some of the different models, alot of them have inference API so you can have ago on the website.
Ethics and Implications of Machine Learning
Ethics and Machine Learning
There are increasing worries about the ethics of using machine learning. In recent years we've seen a number of worrying problems from machine learning entering all kinds of aspects of daily life and the economy:
- The first death from an autonomous car which failed to brake for a pedestrian.[1]
- Highly targeted advertising based around social media and internet usage. [2]
- The outcomes of elections and referendums being influenced by highly targeted social media posts. This is compounded by the data being obtained without the users' consent. [3]
- The mass deployment of facial recognition technologies. [4]
- The possible first use of autonomous military robots making a decision to kill in battle. [5]
Problems with bias
Machine learning systems are often presented as more impartial and consistent ways to make decisions, for example in sentencing criminals or deciding whether somebody should be granted bail. However, there have been a number of recent examples where machine learning systems have been shown to be biased because the data they were trained on was already biased. This can occur when the training data is unrepresentative and under-represents certain groups. For example, if you were trying to automatically screen job candidates and trained on a sample of people the same company had previously decided to employ, then any biases in their past employment processes would be reflected in the machine learning.
Problems with explaining decisions
Many machine learning systems (e.g. neural networks) can't really explain their decisions. Although the input and output are known, explaining why the training caused the network to behave in a certain way can be very difficult. If a decision is questioned by a human, it is difficult to provide any rationale as to how it was arrived at.
Problems with accuracy
No machine learning system is ever 100% accurate. Getting into the high 90s is usually considered good, but when we're evaluating millions of data items this can translate into hundreds of thousands of mis-identifications. If the implications of these incorrect decisions are serious then this will cause major problems, for instance if it results in somebody being imprisoned or investigated for a crime, or simply being denied insurance or a credit card.
Energy Usage
Many machine learning systems (especially deep learning ones) need vast amounts of computational power, which in turn can consume vast amounts of energy. Depending on the source of that energy, this might account for significant amounts of fossil fuels being burned. It is not uncommon for a modern GPU-accelerated computer to use several kilowatts of power; running this for one hour could easily use as much energy as a typical home would use in an entire day. This can be particularly bad when models are constantly being retrained or when "parameter sweeps" are done to find the best set of parameters to train with.
Ethics of machine learning in research
Not all research using machine learning will have major ethical implications. Many research projects don’t directly affect the lives of other people, but this isn’t always the case.
Some questions you might want to ask yourself (and which an ethics committee might also ask you):
- Will anything your machine learning system does make a decision that somehow affects a person’s life?
- Will anything your machine learning system does make a decision that somehow affects an animal's life?
- Will you be using any people to create your training data? Will they have to look at any disturbing or traumatic material during the training process?
- Are there any inherent biases in the dataset(s) you’re using for training?
- How much energy will this computation use? Are there more efficient ways to get the same answer?
Exercise: Ethical implications of your own research
Split into pairs or groups of three. Think of a use case for machine learning in your research areas. What ethical implications (if any) might there be from using machine learning in your research? Write down your group’s answers in the etherpad.
Find out more
Other algorithms
There are many other machine learning algorithms that might be suitable for helping to answer your research questions.
The Scikit Learn webpage has a good overview of all the features available in the library.
Ensemble Learning
Ensemble learning is a technique which combines multiple machine learning algorithms to improve results. A popular ensemble technique is the random forest, which creates a "forest" of decision trees and then tries to prune it down to the most effective ones. It's a flexible algorithm that can work as both a regression and a classification system. See the article Random Forest Simple Explanation for more information.
Genetic Algorithms
Genetic algorithms are a technique which tries to mimic biological evolution. They learn to solve a problem through a gradual process of simulated evolution. Each generation is mutated slightly and then evaluated with a fitness function, and the fittest "genes" are selected for the next generation. Sometimes this is combined with neural networks to change the network's size and structure.
This video shows a genetic algorithm evolving neural networks to play a video game.
Useful Resources
- Machine Learning for Everyone - A useful overview of many different machine learning techniques, all introduced in an easy to follow way.
- Google machine learning crash course - A quick course from Google on how to use some of their machine learning products.
- Facebook Field Guide to Machine Learning - A good introduction to machine learning concepts from Facebook.
- Amazon Machine Learning guide - An introduction to the key concepts in machine learning from Amazon.
- Azure AI - Microsoft's Cloud based AI platform.
Day 2 practical
Exercise 1
There were exercises throughout the sections that we covered today. If you didn't get a chance to do them, please go back and have a go at them now. If you have completed them, go on to Exercise 2.
Exercise 2
Please use the download link to access the penguin dataset, which will be used for this exercise.
Task 1:
Using the penguin dataset, create a KNN and an SVM model using the features bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g, with species as the labels. See if you can get your models to generate some predictions. How do they perform?
Task 2:
Now it's time to create a deep learning neural network to generate some predictions. Again, how does it perform? Have a go at altering the number of layers and see if you get a performance increase.
Task 3:
Have a think about what you believe are the ethical implications of machine learning. What could be the biggest benefits, and what are the dangers?