# COS80023 - Task 10 Clustering vs. Classifiers

### Overview

With the help of given R commands, run a simple k-means clustering algorithm. Compare its results with the results produced by random forest (RF), which you investigated in Pass task 8.

### Purpose

Learn to use R for clustering. Learn the difference between clustering and classification, and the difference between supervised and unsupervised learning.

Go through the steps described below. Answer the questions in a separate document.

### Time

This task should be completed in your 10th tutorial or the week after and submitted to Canvas for feedback. It should be discussed and signed off in tutorial 11 or 12.

This task should take no more than 2 hours to complete (excluding introductory videos).


### Credit Task 10: Submission Details and Assessment Criteria

Follow the steps below and answer the questions in a separate file, then upload to Canvas as a PDF. Your tutor will give online feedback and discuss the tasks with you in the lab when they are complete.

library("caret")

(This loads the caret library, which has functions that we need. If you need to know what a function does, you can ask for help using ?, e.g. ?trainControl. Note that R is case-sensitive.)

iris.data$species <- as.factor(iris.data$species)

set.seed(32984)

indexes <- createDataPartition(iris.data$species, times = 1, p = 0.7, list = FALSE)

train <- iris.data[indexes,]

test <- iris.data[-indexes,]

trainctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
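To see what a stratified 70/30 partition amounts to, the following is a minimal base-R sketch. It is not how caret implements createDataPartition(); it only illustrates the idea of sampling the same share from each class. It assumes R's built-in iris data frame (whose label column is Species, not species), and the names rows_by_species, train_sketch and test_sketch are hypothetical.

```r
# Base-R sketch of a stratified 70/30 split.
# Assumption: the built-in iris data frame stands in for iris.data.
set.seed(32984)

# Group the row numbers by species, then sample 70% of each group.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]

# Each species contributes the same share to both parts.
print(table(train_sketch$Species))
print(table(test_sketch$Species))
```

Sampling within each species keeps the class balance of the full data set in both the training and the test part, which a plain random split does not guarantee.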

### Random Forest Classification

To remind yourself of classification, apply the Random Forest algorithm to the dataset as you did in Pass task 8.

rfmodel <- train(species ~ ., data=train, method="ranger", trControl=trainctrl, preProcess = c("center", "scale"), tuneLength=10)

rfresult <- predict(rfmodel, newdata=test)

### Clustering using k-means

k-means is a very simple and intuitive clustering algorithm. Before you start, watch the video introduction, which illustrates the k-means process.

The function to perform k-means clustering in R is simply:

kmeans(<data>, k)

Question 1. What variable should you enter for each of the parameters? Why?

Question 2. For clustering, should we use the train, test or complete dataset? Why?

Once you have investigated and made informed decisions, run the kmeans() function and assign the result to kmeansresult.

Show the kmeansresult on the screen. You will see a listing of components at the end. You can view each component by typing kmeansresult$<component>.

kmeansresult$cluster shows the cluster number (assigned by kmeans()) for each entry.
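As an illustration of what the returned object looks like, here is a small sketch that runs kmeans() on R's built-in iris data and inspects a few components. The choice of data, the value 3 for the number of clusters, and the nstart setting are assumptions made for the sketch only; which rows and which values to use in this task is what Questions 1 and 2 ask you to decide. The names features and km_sketch are hypothetical.

```r
# Sketch only: built-in iris data and 3 clusters are assumptions.
set.seed(32984)

# kmeans() needs numeric input, so the Species label column is left out.
features <- iris[, 1:4]

# nstart = 25 repeats the random initialisation and keeps the best run.
km_sketch <- kmeans(features, centers = 3, nstart = 25)

print(names(km_sketch))          # the components listed at the end of the printout
print(km_sketch$size)            # number of rows in each cluster
print(km_sketch$centers)         # one row of feature means per cluster
print(head(km_sketch$cluster))   # cluster number for the first few rows
```

The cluster numbers themselves are arbitrary labels: a different seed can produce the same grouping with the numbers permuted.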

### Comparing the results

Compare the results of kmeans clustering and random forest.

First, you have to find the results of both kmeans and random forest for a matching number of rows.

You can combine the results in a data frame df. To convert data into a data frame, you can use

df <- as.data.frame(<mycolumns>)

To combine columns into a single table, you can use

df <- cbind(df, <othercolumns>)
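The combining step might look like the sketch below. To stay self-contained it uses R's built-in iris data, a hypothetical set of held-out rows (test_rows), and the true species labels as a stand-in for the random forest predictions; in the task itself you would use your own test rows and rfresult instead. The cross-tabulation with table() at the end is one possible way to compare the two columns, not a required step.

```r
# Sketch only: all names and the choice of rows are assumptions.
set.seed(32984)
km_sketch <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Hypothetical held-out rows (every third row of the data set).
test_rows <- seq(1, nrow(iris), by = 3)

# Stand-in for the classifier output: the true labels of those rows.
predicted <- iris$Species[test_rows]

# Start a data frame from the cluster numbers of the same rows ...
df <- as.data.frame(km_sketch$cluster[test_rows])
names(df) <- "cluster"

# ... and add the predictions as a second column.
df <- cbind(df, predicted = predicted)

# Count how often each cluster number meets each predicted species.
agreement <- table(df$cluster, df$predicted)
print(agreement)
```

Both columns must refer to the same rows in the same order, otherwise the table compares unrelated observations.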

Question 3. When you compare the cluster assignment of kmeans with the species prediction of random forest, do these agree? Can you tell which cluster is which species? Discuss in about five sentences.

You can find out more using a scatterplot.

ggplot(data=iris.data,aes(x=sepal_width, y=petal_length, color=species)) + geom_point() + theme_minimal()

This shows two of the four independent variables of the iris data set, with the colouring according to the dependent variable.

Add the kmeans cluster number as an additional column to the data set. Choose this column for the colouring. Make scatterplots of different combinations of the independent variables. Take a screenshot of the most illustrative combination(s) and put it into your answer document, then use it to answer

Question 4. Can you see why kmeans() assigned the clusters as it did? State your observations.
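Adding the cluster number as a column and colouring by it could be sketched as follows. The sketch uses R's built-in iris data (column names Sepal.Width and Petal.Length rather than sepal_width and petal_length) and base graphics instead of ggplot, purely to avoid extra dependencies; iris_km is a hypothetical name. With ggplot you would pass the new column to color= in the same way as species above.

```r
# Sketch only: built-in iris data and 3 clusters are assumptions.
set.seed(32984)
iris_km <- iris

# Store the cluster number as a factor so it is treated as a category,
# not as a continuous number, when used for colouring.
iris_km$cluster <- as.factor(kmeans(iris[, 1:4], centers = 3, nstart = 25)$cluster)

# Scatterplot of two of the variables, coloured by cluster.
plot(iris_km$Sepal.Width, iris_km$Petal.Length,
     col = as.integer(iris_km$cluster), pch = 19,
     xlab = "Sepal.Width", ylab = "Petal.Length")
legend("topright", legend = levels(iris_km$cluster),
       col = seq_along(levels(iris_km$cluster)), pch = 19, title = "cluster")
```

Changing the two plotted columns gives the other variable combinations the task asks you to try.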
