Two Network Face Recognition Using Unsupervised Training

Alberto Goldberger
albertog@columbia.edu
CS W4721: Advanced Intelligent Systems
May 13, 1999

1. Introduction

Neural networks are particularly well suited to image classification problems because they require little understanding of the problem domain -- one may simply plug image pixels into the network's input layer and begin training on a set of supervised examples. With the increasing proliferation of commercial face recognition systems, it seems worthwhile to study the performance of neural-network-based face recognition systems to determine their suitability to real-world problems.

Unfortunately, the canonical three-layer backpropagation neural network has proven insufficiently robust for real-world face recognition applications [Zhang]. In particular, these networks tend to generalize poorly when tested on images that have not been preprocessed to very constraining specifications. Furthermore, the size of these networks makes frequent retraining (as is needed when adding or removing users from a face recognition database) an infeasibly time-consuming endeavor. Methods that address these problems with varying degrees of success have been developed, many using advanced network architectures such as directed acyclic graphs rather than a set of fully connected layers [Kung]. In this paper, I evaluate the performance of one such system, developed by Cottrell and Fleming [Cottrell], which uses unsupervised feature extraction in a two-network approach.

My evaluation of Cottrell and Fleming's technique yields surprisingly discouraging results, particularly in light of the remarkably high accuracy rates they report in their own experiments with this network architecture. In particular, my results suggest that their architecture may lack the expressive power necessary for successful generalization, even over relatively simple images. While the reasons for the discrepancy between the two evaluations are not entirely clear, they certainly indicate the need for more rigorous evaluation of this approach.

The rest of this paper is organized as follows: details of the network architectures, training, and testing are given in section 2. Section 3 presents the results obtained from running the networks, along with an analysis of the feature vectors produced through training. Section 4 provides pointers to other face recognition work, with concluding remarks in section 5.


2. The Approach

As previously mentioned, this approach uses two networks to identify human faces. The first network, called the autoextractor, is trained on the target concept f(x) = x: for each image pixel presented at the input layer, it attempts to reconstruct the same pixel value at the output layer. The trick lies in the hidden layer, which is far smaller than the input or output layers. By forcing each pixel value to pass through this narrow channel of hidden units before reconstruction at the output layer, the hidden units are forced to encode face features that are rich in information content. The hidden layer as a whole is sometimes referred to as the feature vector, and its activations form the input to the second network. The autoextractor is trained using standard backpropagation, but the training is unsupervised: although the target concept is clearly defined, the training set needs no hand-labeled target vectors, since each example serves as its own target. Once the autoextractor is trained, no retraining is necessary when enrolling new images into a face database -- the face features extracted by the network are (arguably) universal.
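
To make the scheme concrete, the following sketch implements the autoextractor in Python with NumPy. It is a minimal reconstruction under stated assumptions: the description above does not fix the activation function, learning rate, or weight initialization, so sigmoid units, per-example gradient descent, and small random initial weights are assumed here for illustration.

    import numpy as np

    N_IN, N_HID, LR = 960, 20, 0.1   # 32x30 pixels in, 20 feature units; LR is an assumed rate

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_autoextractor(images, epochs=200, seed=0):
        """images: shape (n, 960), pixel values scaled to [0, 1].
        Each example serves as its own target: f(x) = x."""
        rng = np.random.default_rng(seed)
        W1 = rng.normal(0.0, 0.1, (N_HID, N_IN))   # input -> hidden
        W2 = rng.normal(0.0, 0.1, (N_IN, N_HID))   # hidden -> output
        for _ in range(epochs):
            for x in images:
                h = sigmoid(W1 @ x)          # the "feature vector"
                y = sigmoid(W2 @ h)          # reconstructed image
                # Backpropagate the squared reconstruction error.
                d_out = (y - x) * y * (1.0 - y)
                d_hid = (W2.T @ d_out) * h * (1.0 - h)
                W2 -= LR * np.outer(d_out, h)
                W1 -= LR * np.outer(d_hid, x)
        return W1, W2

    def features(W1, x):
        """Hidden-layer activations: the input to the classification network."""
        return sigmoid(W1 @ x)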

Classification of images takes place at the output layer of the second network. The classification network, as described by Cottrell and Fleming, is a two-layer network (no hidden layer) whose input is the hidden layer of the autoextractor, with one output node corresponding to each known face. Again, training is done through backpropagation (after the autoextractor has been trained), but it proceeds much faster because there are very few weights to adjust at each iteration of the learning algorithm.
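
Continuing the sketch above, and under the same assumptions, the classifier is a single weight matrix mapping the extractor's feature activations to one sigmoid output per subject. Because the extractor is frozen, enrolling a new subject means retraining only these few weights:

    def train_classifier(W1, images, labels, n_subjects=20,
                         epochs=100, lr=0.1, seed=0):
        """Two-layer classifier (no hidden layer): one output unit per
        known face, fed by the frozen autoextractor's feature vector."""
        rng = np.random.default_rng(seed)
        Wc = rng.normal(0.0, 0.1, (n_subjects, N_HID))
        targets = np.eye(n_subjects)             # one-hot target vectors
        for _ in range(epochs):
            for x, label in zip(images, labels):
                f = features(W1, x)              # extractor weights stay fixed
                o = sigmoid(Wc @ f)
                d = (o - targets[label]) * o * (1.0 - o)
                Wc -= lr * np.outer(d, f)        # only n_subjects x N_HID weights
        return Wc

    def classify(W1, Wc, x):
        """Predicted subject: the output unit with the highest activation."""
        return int(np.argmax(sigmoid(Wc @ features(W1, x))))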

The architecture used in this project is a close analog of the one presented above, with some modifications for the specifics of our dataset. Two different autoextractors were trained: one with 20 hidden nodes and one with 40 (both had 960 input and output nodes, corresponding to the number of pixels in the dataset's 32x30 images). Initial experiments used a classification network exactly as described above, but its poor performance led to the addition of a hidden layer, in two variants: one with 20 nodes and one with 30. The output layer contained one node for each of the 20 subjects in the dataset. Feature extractors were trained for 200 epochs, classifiers for 100.

The dataset consisted of 158 images of 20 subjects standardized with respect to pose and scale. Images were each 32 by 30 pixels, a resolution which allows quick training but may be too low to obtain good results. Of the 158 images, 80 were reserved for training, 38 for validation and 40 for testing, with the number of images of each subject split approximately evenly between the training set and the testing and validation sets.
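
A per-subject split of this kind can be sketched as follows; the exact fractions behind the 80/38/40 division are not stated above, so the roughly even halving below is only illustrative:

    def split_by_subject(labels, train_frac=0.5, seed=0):
        """Return index lists so each subject contributes about the same
        share to the training set and to the held-out (validation/test) pool."""
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        train_idx, held_idx = [], []
        for subject in np.unique(labels):
            idx = np.flatnonzero(labels == subject)
            rng.shuffle(idx)
            cut = int(round(train_frac * len(idx)))
            train_idx.extend(idx[:cut])
            held_idx.extend(idx[cut:])
        return train_idx, held_idx   # held_idx is then divided into val/test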


3. Results

Experimental results show much poorer performance than that obtained by Cottrell and Fleming. Initial runs using the two-layer classifier architecture (no hidden layer) performed no better than the baseline, and training converged to a local minimum within 10 epochs. Classifiers with hidden layers performed somewhat better but still yielded unsatisfactory results.

The following table summarizes accuracy as the percentage of examples correctly classified in each run. While reading it, one should keep in mind that the baseline (chance) level for 20 subjects is 5%, that Cottrell and Fleming report 100% accuracy on their own dataset, and that a canonical three-layer backpropagation network run on the same dataset used here yielded accuracies on the order of 90%. H1 is the number of hidden nodes in the feature extractor, H2 the number of hidden nodes in the classifier (H2 = 0 denotes the original two-layer classifier).

Architecture          Training set    Validation set    Test set
H1 = 20, H2 = 0           5.00             5.26            5.00
H1 = 40, H2 = 0           5.00             5.26            5.00
H1 = 20, H2 = 20         23.75            25.00           20.00
H1 = 40, H2 = 20          5.00             5.56            5.00
H1 = 20, H2 = 30         22.50            25.00           20.00
H1 = 40, H2 = 30          5.00             5.56            5.00

The results obtained with the 40-hidden-node feature extractor are unsurprising in light of the images the network reconstructs at its output layer: they are all virtually black, with gray values clustered in the 0 to 10 range (out of 255 possible gray levels) -- the network did not generalize at all after training. Furthermore, the hidden unit representations of the 40-unit feature extractor reveal a tendency to represent images of specific subjects rather than general face features. One possible explanation is that each of the 40 hidden units tuned itself to a particular image in the training set. The examples below are typical of the hidden unit representations seen in this network:

[Figure: weight visualizations for hidden units 4 and 9]

The 20-hidden-unit feature extractor, on the other hand, was able to produce reconstructions of input images at its output layer. This result is consistent with the weights observed in most of its hidden units, which, unlike those in the 40-unit network, seemed to identify general features (akin to principal components, not local features) rather than specific images, as in the following two units:

[Figure: weight visualizations for hidden units 5 and 9]

The following table shows some reconstructions produced by the 20-node feature extractor on images that were part of the training set:

[Figure: originals and reconstructions for Training Images 1 and 2]

On images that were not part of the training set, the following reconstructions were typical:

[Figure: originals and reconstructions for Test Images 1 and 2]

Although it is encouraging to see little difference in reconstruction quality between unseen images and those in the training set, the overall quality is clearly poor. Particularly disturbing is the similarity between the reconstructions of training image 1 and test image 2, even though they depict different subjects and one wears sunglasses while the other does not. There are several possible explanations for the poor reconstructions: noise in the original images (as evidenced by the large black spots on three of the input images above); too many or too few hidden units (though 20 is unlikely to be too few, since 40 performed so much worse); and, most likely, an inadequate training set, both in the number of images and subjects and in image resolution -- Cottrell and Fleming's results were obtained using images with twice the resolution of these.


4. Related Work

Over the past two years, a great deal of face recognition work has appeared as more and more players race to deliver the first commercially successful system. One neural network approach that solves the problem of slow retraining while achieving robustness to pose and scale variations [Kung] was developed at Princeton University and is the basis for the FaceVACS system currently marketed by Siemens-Nixdorf. It uses a directed acyclic graph as its network architecture, with one network subsection for each individual in the face database. Other successful approaches, documented in [Zhang], include Eigenfaces, which relies on statistical analysis to derive the principal components of a face, producing a feature vector analogous to the hidden-layer feature vector of the Cottrell and Fleming extractor. Elastic matching is the most accurate system when evaluated on standard benchmarks, but its computational cost is several orders of magnitude greater than that of the other techniques [Zhang].


5. Conclusion

Overall, the results derived from these experiments expose potentially serious problems with Cottrell's and Fleming's approach. While some of the poor performance encountered can be attributed to the dataset used for testing (particularly its low resolution and noise level), the fact that generalization accuracy did not exceed the baseline level on the two-layer classification networks raises the question of whether the architecture is sufficiently powerful to represent the target concept.

In light of these issues, and considering the technique's documented lack of robustness to rotational and scale variations [Zhang], the only conclusion that can be drawn is that this method should be avoided in face recognition applications until it has been significantly improved. Other neural network approaches, such as the one presented in section 4 [Kung], seem preferable, although Cottrell and Fleming's is the only one that has been evaluated on standard face recognition benchmarks and is thus the one whose limitations are best known. Perhaps the domain of face recognition is best approached from a statistical perspective like that of Eigenfaces, which, despite its mathematical equivalence to the work presented here, seems to perform better on standard benchmark tests.


References

[Cottrell]

Cottrell, G.W. and M. Fleming, "Face Recognition Using Unsupervised Feature Extraction," in Proceedings of the International Neural Network Conference, vol. 1, Paris, France, July 9-13, 1990, pp. 322-325.

[Kung]

Kung, S., M. Fang, S. Liou, M. Chiu and J. Taur, "Decision-Based Neural Network for Face Recognition System," in Proc. ICIP'95, vol. I, Washington, D.C., Oct. 1995, pp. 430-437.

[Zhang]

Zhang, J., Y. Yan and M. Lades, "Face Recognition: Eigenface, Elastic Matching and Neural Nets," Proceedings of the IEEE, vol. 85, no. 9, Sept. 1997, pp. 1423-1435.