HW5 - 3101 Matlab -- (INSERT NAME and UNI HERE)

Due 11:59pm on Tuesday, April 15th, 2008

Contents

Goals

Practice reading in a dataset, analyzing it, and visualizing the results. Basic dimensionality reduction, clustering, and classification using machine learning.

Directions

Please fill in your name at the top of this page and write your answers under their corresponding questions. When you are done, publish this file to HTML using File > Publish To HTML (or the command publish('hw5.m', 'html')). Create a new zip file labeled with your UNI and homework number, in this format: bs2018_hw5.zip, that contains this original m file and the html directory created by publishing the file. Make sure the html file adequately shows your work (has images, etc.), and then email the zip file to cs3101@gmail.com, making sure to include your name and UNI in the subject.

Problem 1 - Loading USPS digit data

1.1

Download the file hw5digits.mat from the class website, and load this file into your workspace. List the variables in your workspace.
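A minimal sketch of these two steps, assuming hw5digits.mat has been saved to the current directory:

load hw5digits.mat   % bring the saved variables into the workspace
whos                 % list the workspace variables and their sizes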

1.2

The variable X contains 1000 images of handwritten digits from 0-9 collected by the United States Postal Service. Each digit is represented as a 16x16 pixel black and white image. However, each image has been vectorized, meaning the rows of the image have been concatenated together to make a flat vector representation of the image. Get the 5th image in X as a vector and store that image vector in a variable called v. v should be of size 256x1. Do not display the output.
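One way to do this, assuming each column of X is one image (consistent with Problem 2.2 below, where X transpose times X comes out 1000x1000):

v = X(:, 5);   % 5th image as a 256x1 column vector; the semicolon suppresses output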

1.3

Use the reshape command to make v into a 16x16 matrix, and plot this matrix as an image in figure(1) in black and white.
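A sketch; the transpose after reshape accounts for the row-wise vectorization described in 1.2, since reshape fills its output column by column:

figure(1)
imagesc(reshape(v, 16, 16)')   % back to a 16x16 pixel image
colormap gray                  % black and white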

Problem 2 - Dimensionality reduction

Currently each image can be thought of as a single point in a 256 dimensional space, where each dimension corresponds to the intensity of the image at that single pixel location. If we want to plot all 1000 of these points using only 2 dimensions, so that similar images are placed next to each other, we need to reduce the dimensionality of the data from 256 to 2 without losing a significant amount of information.

2.1

Let's start by computing the similarity between all 1000 images in X. The vectorized representation of X makes this simple: we can use the inner product between two image vectors as a measure of similarity. What is the inner product between the 1st digit and the 2nd? What is the inner product between the 1st and 3rd?
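For example, assuming each column of X is one image:

X(:, 1)' * X(:, 2)   % inner product between the 1st and 2nd digits
X(:, 1)' * X(:, 3)   % inner product between the 1st and 3rd digits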

2.2

Use matrix multiplication to compute the product of X's transpose and X, so that you get a matrix A of size 1000x1000. A can be thought of as an affinity matrix: each entry of A is the inner product between two vectors in X, and represents the similarity between those two vectors; big values in A correspond to pairs of images that are very similar. You should be able to compute A with a single matrix multiply on one line.
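Assuming X is 256x1000, the one-liner is:

A = X' * X;   % 1000x1000 affinity matrix; entry (i,j) is the inner product of images i and j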

2.3

Plot A as an image, using the jet colormap.
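A sketch (opening a fresh figure so figure(1) is left intact):

figure
imagesc(A)     % display the affinity matrix as an image
colormap jet   % use the jet colormap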

2.4

Principal Component Analysis (PCA) allows us to take this high-dimensional data, 1000 points in 256 dimensions, and reduce the dimensionality while preserving as much information as possible. Let's start by computing the eigenvalues and eigenvectors of A; call the eigenvalue matrix D and the eigenvector matrix V.
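In MATLAB this is a single call:

[V, D] = eig(A);   % columns of V are eigenvectors; D holds the eigenvalues on its diagonal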

2.5

Set d equal to the diagonal of D, then show a bar plot of the last 10 elements of d. These are the 10 biggest eigenvalues of A.
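A sketch; since A = X'*X is symmetric, eig returns its eigenvalues in ascending order, which is why the last 10 are the biggest:

d = diag(D);        % eigenvalues as a vector
figure
bar(d(end-9:end))   % bar plot of the 10 biggest eigenvalues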

2.6

Each column of V is an eigenvector of the matrix A, meaning that if we want to plot the data using only 2 dimensions, the best 2 dimensions to choose correspond to the last 2 eigenvectors (the ones with the biggest eigenvalues). Store the last column of V in the variable x, and the second-to-last column of V in the variable y.
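That is:

x = V(:, end);       % eigenvector with the biggest eigenvalue
y = V(:, end - 1);   % eigenvector with the second biggest eigenvalue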

2.7

Now we have 2 numbers to describe every image instead of 256, so we can plot the data in a 2D plane, with one point per digit image, using the variables x and y. Display a scatter plot of x and y. Set the scatter plot so each dot is filled in, not empty.
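A sketch using scatter's 'filled' option:

figure
scatter(x, y, 'filled')   % one filled dot per digit image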

Problem 3 - Clustering

3.1

We know that the data has 10 clusters, each cluster corresponding to a digit between 0-9. The goal of clustering is to give each point a label such that all the points with a certain label are more similar to each other than to points with different labels. We can use MATLAB's built-in clustering tool, K-means. Apply K-means to X transpose to get a list called C, where C labels each point in X with a value from 1 to numClusters representing which cluster that point is in. numClusters should equal 10.
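A sketch using the Statistics Toolbox kmeans function, which expects one observation per row (hence the transpose):

numClusters = 10;
C = kmeans(X', numClusters);   % C(i) is the cluster label of image i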

3.2

Plot a new scatter plot, where each point has coordinates in x and y computed from PCA above, has size 50, and has the color determined from C. Use the colormap jet. You should be able to see how different clusters of points are grouped in similar areas on the map.
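One way, reusing x and y from Problem 2.6:

figure
scatter(x, y, 50, C, 'filled')   % size 50, colored by cluster label
colormap jet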

Problem 4 - Classification

Now assume that someone gives you the ground-truth labels of all the images in X. Can we classify new images based on what we have seen before? You will write a simple 1-nearest-neighbor classifier to accomplish this.

4.1

Download the file hw5trainlabels.mat from the website and load it into the workspace.

4.2

The variable Y contains a label for each image in X. Use Y to find all images of 3s, average those images together, and plot the average 3 in figure(3) as a 16x16 black and white image.
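A sketch, assuming Y is a length-1000 vector aligned with the columns of X:

threes = X(:, Y == 3);            % all images labeled 3
avg3 = mean(threes, 2);           % average across those images
figure(3)
imagesc(reshape(avg3, 16, 16)')   % transpose for the row-wise vectorization
colormap gray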

4.3

Download the file hw5classify.mat from the class website and load it into the workspace.

4.4

Write your own function called getDists; this means making a separate file called getDists.m and properly structuring that file as a function. getDists should accept as input a single point, as a vector of size 1xD, and a set of points, as a matrix of size NxD, where D can be any number of dimensions and N can be any number of points. getDists() should return a new vector of size 1xN that contains the Euclidean distance from the single point to each of the N other points. Make sure getDists is properly formatted and commented, and make sure to include the getDists.m file in your submission zip file.
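A minimal sketch of getDists.m, one of several reasonable vectorized implementations:

function dists = getDists(p, pts)
% GETDISTS Euclidean distance from one point to a set of points.
%   p     - 1xD vector, the single point
%   pts   - NxD matrix, one point per row
%   dists - 1xN vector of Euclidean distances
diffs = pts - repmat(p, size(pts, 1), 1);   % subtract p from every row
dists = sqrt(sum(diffs .^ 2, 2))';          % row-wise norms, returned as 1xN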

4.5

Xtest contains 20 new digit images that we haven't seen before. For each image in Xtest, use your getDists function to find the image in X with the smallest Euclidean distance to it, and use the label of that closest point as your prediction. Create a new variable called predictions that contains your predicted labels for the 20 new images. Display the predictions variable.
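A sketch, assuming Xtest is 256x20 and using the getDists signature above (note the transposes to get 1xD and NxD inputs):

predictions = zeros(1, 20);
for i = 1:20
    dists = getDists(Xtest(:, i)', X');   % distances to all 1000 training images
    [minDist, idx] = min(dists);          % index of the nearest neighbor
    predictions(i) = Y(idx);              % borrow its label
end
predictions   % display the predicted labels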

4.6

Download the file hw5labels.mat from the class website and load it into the workspace. Use the variable Ytest to determine the accuracy of your predictions as a percentage.
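For example, assuming Ytest holds the 20 true labels:

accuracy = 100 * sum(predictions(:) == Ytest(:)) / numel(Ytest)   % percent correct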