%% HW5 - 3101 Matlab -- (INSERT NAME and UNI HERE)
% Due 11:59pm on Tuesday, April 15th, 2008

%% Goals
% Practice with reading in a dataset, analyzing it, and visualizing the
% results. Basic dimensionality reduction, clustering, and classification
% using machine learning.

%% Directions
% Please fill in your name at the top of this page and write your answers
% under their corresponding question. When you are done, publish this file
% to html using File > Publish To HTML (or the command publish('hw1.m',
% 'html')) and create a new zip file that is labeled with your UNI and
% homework number, in this format: bs2018_hw1.zip, that contains this
% original m file and the html directory created from publishing the file.
% Make sure the html file adequately shows your work (has images, etc.),
% and then email the zip file to cs3101@gmail.com, making sure to include
% your name and UNI in the subject.

%% Problem 1 - Loading USPS digit data

%% 1.1
% Download the file hw5digits.mat from the class website, and load this
% file into your workspace. List the variables in your workspace.

%% 1.2
% The variable X contains 1000 images of handwritten digits from 0-9
% collected by the United States Postal Service. Each digit is represented
% as a 16x16 pixel black and white image. However, each image has been
% vectorized, meaning the rows of the image have been concatenated together
% to make a flat vector representation of the image. Get the 5th image
% in X as a vector and store that image vector in a variable called v. v
% should be of size 256x1. Do not display the output.

%% 1.3
% Use the reshape command to make v into a 16x16 matrix, and plot this
% matrix as an image in figure(1) in black and white.

%% Problem 2 - Dimensionality reduction
% Currently each image can be thought of as a single point in a
% 256-dimensional space, where each dimension corresponds to the intensity
% of the image at a single pixel location.
% If we want to plot all of these 1000 points using only 2 dimensions, so
% that images that are similar are placed next to each other, we need to
% reduce the dimensionality of the data from 256 to 2 without losing a
% significant amount of information.

%% 2.1
% Let's start by computing the similarity between all of the 1000 images in
% X. The vectorized representation of X makes it simple to do this. We can
% use the inner product between two image vectors as a measure of
% similarity. What is the inner product between the 1st digit and the
% 2nd? What is the inner product between the 1st and 3rd?

%% 2.2
% Use matrix multiplication to compute the product of X's transpose and
% X, so that you get a matrix A which is of size 1000x1000. A can
% be thought of as an affinity matrix. Each value of A is the inner
% product between two vectors in X, and so represents the similarity
% between those two vectors; large values in A correspond to pairs of
% images that are very similar. You should be able to compute A with a
% single matrix multiply in one line.

%% 2.3
% Plot A as an image, using the jet colormap.

%% 2.4
% Principal Component Analysis (PCA) allows us to take this
% high-dimensional data, 1000 points in 256 dimensions, and reduce the
% dimensionality while preserving as much information as possible. Let's
% start by computing the eigenvalues and eigenvectors of A. Call the
% eigenvalues D and the eigenvectors V.

%% 2.5
% Set d equal to the diagonal of D, then show a bar plot of the last 10
% elements of d. These are the 10 biggest eigenvalues of A.

%% 2.6
% Each column of V is an eigenvector of the matrix A, meaning that if we
% want to plot the data using only 2 dimensions, the best 2 dimensions to
% choose correspond to the last 2 eigenvectors (the ones with the biggest
% eigenvalues).
% Store the last column of V in the variable x, and the second to last
% column of V in the variable y.

%% 2.7
% Now we have 2 numbers to describe every image instead of 256, so we can
% plot the data in a 2D plane with one point per digit image by using the
% variables x and y. Display a scatter plot of x,y. Set the scatter plot
% so each dot is filled in, and not empty.

%% Problem 3 - Clustering

%% 3.1
% We know that the data has 10 clusters, each cluster corresponding to a
% digit between 0-9. The goal of clustering is to give each point a label
% such that all the points with the same label are more similar to each
% other than to points with different labels. We can use Matlab's built-in
% clustering tool called K-means. Apply K-means to X transpose to get a
% list called C, where C labels each point in X with a value from 1 to
% numClusters representing which cluster that point is in. numClusters
% should equal 10.

%% 3.2
% Plot a new scatter plot, where each point has the x and y coordinates
% computed from PCA above, has size 50, and has its color determined by
% C. Use the colormap jet. You should be able to see how different
% clusters of points are grouped in similar areas on the map.

%% Problem 4 - Classification
% Now assume that someone gives you the ground truth labels of all the
% images in X. Can we classify new images based on what we have seen
% before? You will write a simple 1-nearest-neighbor classifier to
% accomplish this.

%% 4.1
% Download the file hw5trainlabels.mat from the website and load it into the
% workspace.

%% 4.2
% The variable Y contains the labels for all of the images in X. Use Y to
% find all images of 3s, average those images together, and plot the
% average 3 in figure 3 as a 16x16 black and white image.
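
%% Sketch (not part of the assignment): averaging columns chosen by label
% A minimal sketch, on made-up toy data, of the select-then-average pattern
% that 4.2 describes: build a logical mask from the labels, pick out the
% matching columns, and average them. The names toyX, toyLabels, toyMask,
% and toyMean are hypothetical, and the sketch assumes one sample per
% column (as the 256x1 shape of v suggests for X). It does not touch X or Y.
toyX = [1 3 5 7; 2 4 6 8];            % 2 features x 4 samples (made-up values)
toyLabels = [3 1 3 1];                % one made-up label per column of toyX
toyMask = (toyLabels == 3);           % logical mask: true where the label is 3
toyMean = mean(toyX(:, toyMask), 2);  % average the selected columns (dim 2)
disp(toyMean')                        % show the averaged sample as a row
% For image data, the averaged vector would still need to be reshaped into
% a matrix before it can be displayed as an image.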
%% 4.3
% Download the file hw5classify.mat from the class website and load it into
% the workspace.

%% 4.4
% Write your own function called getDists. This means making a separate
% file called getDists.m and properly structuring that file as a function.
% getDists should accept as input a single point as a vector of size 1xD,
% and a set of points as a matrix of size NxD, where D can be any number
% of dimensions, and N can be any number of points. getDists() should
% return a new vector of size 1xN that contains the Euclidean distance from
% the single point to each of the N other points. Make sure getDists is
% properly formatted and commented, and make sure to include the getDists.m
% file in your submission zip file.

%% 4.5
% Xtest contains 20 new digit images that we haven't seen before. For each
% image in Xtest, find the image in X with the smallest Euclidean distance
% to it by using your getDists function. Use the label of this closest
% point as your predicted label. Create a new variable called predictions
% that contains your predicted labels for the 20 new images. Display the
% predictions variable.

%% 4.6
% Download the file hw5labels.mat from the class website and load it. Use
% the variable Ytest to determine the accuracy of your predictions as a
% percent.
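
%% Sketch (not part of the assignment): 1-nearest-neighbor on toy points
% A minimal sketch, on made-up 2-D points, of the nearest-neighbor idea
% behind 4.5 and the percent-accuracy calculation in 4.6. Every name
% starting with "toy" is hypothetical. The distances are computed inline
% here; the assignment asks for them to live in a separate getDists.m
% function, which this sketch does not provide. Points are stored one per
% row (NxD), matching the input layout described in 4.4.
toyTrain = [0 0; 0 1; 5 5; 6 5];      % N x D matrix of made-up training points
toyTrainLabels = [1 1 2 2];           % one made-up label per training point
toyQuery = [5 6];                     % 1 x D made-up query point

% Subtract the query from every row, then take the Euclidean norm per row.
toyDiffs = toyTrain - repmat(toyQuery, size(toyTrain, 1), 1);
toyDists = sqrt(sum(toyDiffs .^ 2, 2))';     % 1 x N vector of distances

% The predicted label is the label of the closest training point.
[toyMinDist, toyNearest] = min(toyDists);
toyPrediction = toyTrainLabels(toyNearest);

% Accuracy as a percent: the fraction of predictions that match the truth.
toyPredicted = [2 1 2];               % made-up predictions
toyTruth = [2 1 1];                   % made-up ground truth
toyAccuracy = 100 * mean(toyPredicted == toyTruth);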