# Exercise 3

**Exercise 3.1 Error rates**

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next, we use 1-nearest neighbors (i.e. k-NN with k = 1) and get an average error rate (averaged over both training and test data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
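One way to explore this question empirically is to measure the training and test error separately for both methods. The sketch below does this on synthetic data; the data set (`make_classification`), its parameters, and the helper `error_rate` are assumptions made for illustration and are not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (an assumption; the exercise names no data set).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)

# Equally sized training and test sets, as in the exercise.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)


def error_rate(model, X, y):
    """Fraction of misclassified observations (hypothetical helper)."""
    return 1.0 - model.score(X, y)


logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

logreg_train_err = error_rate(logreg, X_train, y_train)
logreg_test_err = error_rate(logreg, X_test, y_test)
knn1_train_err = error_rate(knn1, X_train, y_train)
knn1_test_err = error_rate(knn1, X_test, y_test)

for name, train_err, test_err in [
    ("logistic regression", logreg_train_err, logreg_test_err),
    ("1-NN", knn1_train_err, knn1_test_err),
]:
    average = (train_err + test_err) / 2
    print(f"{name}: train {train_err:.1%}, test {test_err:.1%}, "
          f"average {average:.1%}")
```

Comparing the separate training and test errors with their average for each method may help when arguing which number is relevant for new observations.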

**Exercise 3.2 Iris classification using logistic regression and cross validation**

a) Use all 4 features (sepal and petal lengths and widths) of the iris flower data set and train a classifier using logistic regression. Calculate the confusion matrix for the resulting classifier. [Hint: Start from iris_multiclass.ipynb and modify it. You should get an accuracy of 98-99% when training and evaluating performance on the same data set.]

b) Change your previous solution to use cross-validation with cv = 5 to get a more reliable estimate of the accuracy. [Hint: An example of how to use cross_val_score is given in the scikit-learn documentation. You should get ~97-98% accuracy.]
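The two evaluation steps above can be sketched as follows. This is a minimal outline using scikit-learn's bundled copy of the iris data, not the contents of iris_multiclass.ipynb; the setting `max_iter=1000` is an assumption, and the exact accuracies depend on the solver and regularization settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

# All 4 features, 150 flowers, 3 classes.
X, y = load_iris(return_X_y=True)

# max_iter=1000 is an assumption made to avoid convergence warnings.
clf = LogisticRegression(max_iter=1000)

# a) Train and evaluate on the same data (an optimistic estimate).
clf.fit(X, y)
y_pred = clf.predict(X)
cm = confusion_matrix(y, y_pred)  # rows: true class, columns: predicted class
train_accuracy = clf.score(X, y)

# b) 5-fold cross-validation: each fold is scored by a model
#    that was trained on the other four folds.
cv_scores = cross_val_score(clf, X, y, cv=5)
cv_accuracy = cv_scores.mean()

print(cm)
print(f"accuracy on training data: {train_accuracy:.3f}")
print(f"5-fold cross-validated accuracy: {cv_accuracy:.3f}")
```

The cross-validated figure is usually the more trustworthy of the two, since no observation is scored by a model that saw it during training.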

**Exercise 3.3 Optimal classification using prior information**

A certain medical test is used to diagnose a type of cancer in patients. The test result is a real number. From such test results for previous patients, followed by careful further diagnosis, one has learned that:

- If the patient is healthy, the test result has a Gaussian distribution with mean 10 and standard deviation 4
- If the patient has cancer, the test result has a Gaussian distribution with mean 20 and standard deviation 5
- The prevalence of the disease is 1% (i.e. 1% of all persons that were tested turned out to really have cancer)

The test results for three patients turn out to be .

a) What are the probabilities that patients 1, 2 and 3 have cancer, respectively?

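Hint: Use $P(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_{l} \pi_l f_l(x)}$, where $\pi_k$ is the prior probability of class $k$, and $f_k(x)$ is the probability density of $x$ for an observation from class $k$ (Bayes' theorem).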

b) Which prediction should you make for each patient in order to make, on average, as few misclassifications as possible? [Hint: The most probable one (Bayes' classifier)]

c) What threshold t should be used to minimize the total misclassification error if we use a classifier of the form "If measurement > t then patient has cancer"? [Hint: This corresponds to 50% risk when the prior information of cancer prevalence is taken into account.]

Hint:

%matlab

probability = normpdf(15,10,4)

#python

from scipy.stats import norm

probability = norm.pdf(15,10,4)
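Extending the `norm.pdf` hint above, Bayes' theorem for this two-class problem might be coded as in the sketch below. The function name `posterior_cancer` and the example measurements 15 and 25 are hypothetical and chosen only for illustration; they are not the patients' test results.

```python
import numpy as np
from scipy.stats import norm

PRIOR_CANCER = 0.01                  # prevalence stated in the exercise
PRIOR_HEALTHY = 1.0 - PRIOR_CANCER


def posterior_cancer(x):
    """P(cancer | measurement x) via Bayes' theorem (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Prior times class-conditional density for each class.
    joint_cancer = PRIOR_CANCER * norm.pdf(x, 20, 5)
    joint_healthy = PRIOR_HEALTHY * norm.pdf(x, 10, 4)
    # Normalize so that the two posteriors sum to one.
    return joint_cancer / (joint_cancer + joint_healthy)


# Hypothetical measurements, NOT the values from the exercise.
example_measurements = [15.0, 25.0]
posteriors = posterior_cancer(example_measurements)
for x, p in zip(example_measurements, posteriors):
    print(f"measurement {x}: P(cancer | x) = {p:.4f}")
```

Note that `norm.pdf` returns a density, not a probability; it only becomes a probability after the normalization in the last line of the function.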

d) Argue why it would be a bad idea to use the threshold from c).

**Exercise 3.4 Classification Metrics**

Continuing from the previous exercise, the MATLAB file ex5_cancer.m calculates and plots the ROC curve and the precision vs. recall curve for classifiers with varying thresholds t.

a) Study the file and make sure you understand the calculations. (The normcdf(x) function gives the cumulative distribution function for a Gaussian random variable with mean 0 and std 1, i.e. the probability that the Gaussian random variable is smaller than the value x.) The file, however, contains an error. Find the error and plot the correct curves.

b) Say that we want to find the threshold t so that we catch 99% of all the true cancer cases. Is this a requirement on precision, recall, TPR or FPR?

c) If a patient gets a positive test result, i.e. the measurement is above t, the patient is of course interested in the probability that he/she really has cancer. Is it precision, recall, TPR or FPR that answers this? What is this probability, assuming the threshold from b) has been used?
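For readers working in Python rather than MATLAB, the quantities behind the two curves can be computed directly from the Gaussian model of Exercise 3.3. The sketch below is not a translation of ex5_cancer.m, whose contents are not reproduced here; the threshold grid and the variable names are assumptions.

```python
import numpy as np
from scipy.stats import norm

PREVALENCE = 0.01                        # prior probability of cancer
thresholds = np.linspace(0.0, 30.0, 61)  # assumed grid of thresholds t

# norm.sf(t, mean, std) = 1 - norm.cdf(t, mean, std) = P(measurement > t).
tpr = norm.sf(thresholds, 20, 5)  # P(positive | cancer), also called recall
fpr = norm.sf(thresholds, 10, 4)  # P(positive | healthy)

recall = tpr
# Precision = P(cancer | positive), which depends on the prevalence.
precision = PREVALENCE * tpr / (PREVALENCE * tpr + (1 - PREVALENCE) * fpr)

for t, r, f, p in zip(thresholds[::10], tpr[::10], fpr[::10], precision[::10]):
    print(f"t = {t:5.1f}: TPR = {r:.3f}, FPR = {f:.3f}, precision = {p:.3f}")
```

Plotting `tpr` against `fpr` gives the ROC curve, and `precision` against `recall` gives the precision vs. recall curve. Only the latter changes if the prevalence changes.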

**Exercise 3.5 Sonar Classification and Bagging**

Run through and do the tasks in the notebook ex5_bagging.ipynb.

a) You are asked to create a stacked histogram for each feature, where it is colored according to the label.

b) Try to find a few (around 5) good features using either these histograms or the scatter matrix and try the first classification task using both. Did you manage to get close to the same score with your features?

c) Create a random forest classifier and train it. Try to do some hyperparameter tuning and see how well you can get it to perform on the test data. Did you improve on the score from the decision tree?
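As a rough outline of task c), the sketch below compares a single decision tree with a random forest tuned by grid search. It uses a synthetic stand-in data set rather than the sonar data from ex5_bagging.ipynb; the sample and feature counts, the split, and the parameter grid are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with the notebook's data).
X, y = make_classification(n_samples=208, n_features=60, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Baseline: a single decision tree.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_score = tree.score(X_test, y_test)

# Random forest with a small, assumed hyperparameter grid.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "max_features": ["sqrt", 0.3],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X_train, y_train)  # tuning uses the training data only

# GridSearchCV refits the best model on all training data by default.
forest_score = search.score(X_test, y_test)

print(f"decision tree test accuracy: {tree_score:.3f}")
print(f"random forest test accuracy: {forest_score:.3f}")
print(f"best parameters: {search.best_params_}")
```

Tuning with cross-validation on the training data, as here, keeps the test set untouched until the final comparison; tuning directly against the test score tends to give an optimistic result.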

**Solution sol3.pdf**