Attribute and Simile Classifiers for Face Verification

In this work, we advance the state-of-the-art for face verification ("are these two images of the same person?") in uncontrolled settings with non-cooperative subjects. To this end, we present two novel and complementary methods for face verification. Common to both methods is the idea of extracting and comparing "high-level" visual features, or traits, of a face image that are insensitive to pose, illumination, expression, and other imaging conditions. Our first method -- based on attribute classifiers -- uses binary classifiers trained to recognize the presence, absence, or degree of describable visual attributes (gender, race, age, hair color, etc.). Our second method -- based on simile classifiers -- removes the manual labeling required to train attribute classifiers. The simile classifiers are binary classifiers trained to recognize the similarity of faces, or regions of faces, to specific reference people. The idea is to automatically learn similes that distinguish a person from the general population. An unseen face might be described as having a mouth that looks like Barack Obama's and a nose that looks like Owen Wilson's.

Comparing two faces is simply a matter of comparing trait vectors (i.e., the outputs of the attribute and/or simile classifiers). We present experimental results on the challenging Labeled Faces in the Wild (LFW) data set. This data set is remarkable in its variability, exhibiting all of the variations in pose, illumination, expression, and other imaging conditions mentioned above. Both the attribute and simile classifiers achieve state-of-the-art results on the LFW "restricted images" benchmark, and a hybrid of the two results in a 31.68% drop in error rate compared to the previous best. To our knowledge, this is the first time that a list of such visual traits has been used for face verification. For testing beyond the LFW data set, we introduce PubFig -- a new data set of real-world images of public figures (celebrities and politicians) acquired from the internet. The PubFig data set is both larger (60,000 images) and deeper (on average 300 images per individual) than existing data sets, and allows us to present verification results broken out by pose, illumination, and expression. Finally, we measure human performance on LFW, showing that humans do very well on it -- given image pairs, verification of identity can be performed almost without error.
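The trait-vector comparison described above can be sketched in a few lines. This is only an illustration: the page does not specify the comparison function, so the sketch assumes cosine similarity with a hand-picked threshold, and the function names and trait values are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two trait vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("trait vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no information in an all-zero vector
    return dot / (norm_a * norm_b)


def same_person(a: Sequence[float], b: Sequence[float],
                threshold: float = 0.5) -> bool:
    """Assumed decision rule: declare a match when similarity is high enough."""
    return cosine_similarity(a, b) >= threshold


# Hypothetical classifier outputs (one entry per attribute or simile classifier).
face_1 = [1.2, -0.8, 0.5, -1.1]
face_2 = [1.0, -0.6, 0.7, -0.9]   # similar traits to face_1
face_3 = [-0.9, 1.1, -0.4, 0.8]   # opposite traits to face_1

print(same_person(face_1, face_2))
print(same_person(face_1, face_3))
```

In practice a learned classifier over pairs of trait vectors would likely replace the fixed threshold; the point here is only that verification reduces to a comparison of two fixed-length vectors.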

This research was funded in part by NSF award IIS-03-25867 and ONR award N00014-08-1-0638. We are grateful to Omron Technologies for providing us with the OKAO face detection system.


"Attribute and Simile Classifiers for Face Verification,"
N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar,
IEEE International Conference on Computer Vision (ICCV),
Oct. 2009.


  Training images for attribute classifiers:

Each row shows training examples of face images that match the given attribute label (positive examples) and those that don't (negative examples). We have over a thousand training images for each of our 65 attributes. Accuracies for each attribute classifier are shown in the next image.

  Accuracies of attribute classifiers:

We present accuracies of the 65 attribute classifiers trained for our system. Example training images for the attributes in bold are shown in the previous image.

  Amazon Mechanical Turk job for labeling attributes:

We use Amazon Mechanical Turk to label images with attributes. This online service allows us to easily and inexpensively label images using large numbers of human workers. This image shows an example of our attribute labeling jobs. We were able to collect over 125,000 human labels in a month, for $5,000.

  Attribute classifier outputs:

An attribute classifier can be trained to recognize the presence or absence of a describable aspect of visual appearance. The responses for several such attribute classifiers are shown for a pair of images of Halle Berry. Note that the "flash" and "shiny skin" attributes produce very different responses, while the responses for the remaining attributes are in strong agreement despite the changes in pose, illumination, expression, and image quality.
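To make the notion of responses "in agreement" concrete, the sketch below lists the attributes whose responses differ in sign between two images. It assumes signed classifier outputs (positive meaning the attribute is present); the numeric values and the attribute names other than "flash" and "shiny skin" are made up for illustration and are not taken from the figure.

```python
from typing import Dict, List


def disagreeing_attributes(resp_a: Dict[str, float],
                           resp_b: Dict[str, float]) -> List[str]:
    """Return attributes whose signed responses have opposite signs."""
    shared = resp_a.keys() & resp_b.keys()
    return sorted(name for name in shared if resp_a[name] * resp_b[name] < 0)


# Hypothetical responses for two images of the same person.
image_1 = {"male": -1.3, "flash": 0.9, "shiny skin": 0.7, "black hair": 1.1}
image_2 = {"male": -1.1, "flash": -0.8, "shiny skin": -0.6, "black hair": 0.9}

print(disagreeing_attributes(image_1, image_2))
```

Attributes that describe the imaging conditions rather than the person are the ones expected to show up in this list.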

  Training images for simile classifiers:

Each simile classifier is trained using several images of a specific reference person, limited to a small face region such as the eyes, nose, or mouth. We show here three positive and three negative examples for four regions on two of the reference people used to train these classifiers.

  Simile classifier outputs:

We use a large number of "simile" classifiers trained to recognize the similarities of parts of faces to specific reference people. The responses for several such simile classifiers are shown for a pair of images of Harrison Ford. R_j denotes reference person j, so the first bar on the left displays the similarity to the eyes of reference person 1. Note that the responses are, for the most part, in agreement despite the changes in pose, illumination, and expression.
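One way to picture the simile outputs is as a table indexed by reference person and face region. The sketch below labels a flat list of scores accordingly. The region list, its ordering, and the function name are assumptions made for illustration; the page only says that regions such as the eyes, nose, or mouth are used.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed regions and ordering; the actual system may use others.
REGIONS = ("eyes", "nose", "mouth")


def label_simile_outputs(scores: Sequence[float], num_refs: int,
                         regions: Sequence[str] = REGIONS
                         ) -> Dict[Tuple[int, str], float]:
    """Map a flat score list to {(reference person j, region): score}."""
    if len(scores) != num_refs * len(regions):
        raise ValueError("expected one score per (reference person, region)")
    labeled: Dict[Tuple[int, str], float] = {}
    for j in range(num_refs):
        for r, region in enumerate(regions):
            # Reference people are numbered from 1, as in R_1, R_2, ...
            labeled[(j + 1, region)] = scores[j * len(regions) + r]
    return labeled


# Hypothetical scores for two reference people, three regions each.
scores: List[float] = [0.8, -0.2, 0.1, -0.5, 0.6, 0.3]
similes = label_simile_outputs(scores, num_refs=2)
print(similes[(1, "eyes")])  # similarity to the eyes of reference person 1
```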

  Face Verification Results on LFW:

The performance of our attribute classifiers, simile classifiers, and a hybrid of the two is shown in solid red, blue, and green, respectively. All three of our methods outperform all previous methods (dashed lines). Our highest accuracy is 85.29%, which corresponds to a 31.68% lower error rate than the previous state-of-the-art.
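The link between the 85.29% accuracy and the 31.68% relative error reduction can be checked with a little arithmetic. The sketch below assumes that error rate is 100% minus accuracy and that the reduction is measured relative to the previous best error. The previous error it derives is implied by these two figures and is not stated on this page.

```python
def error_rate(accuracy_pct: float) -> float:
    """Error rate in percent, assuming error = 100 - accuracy."""
    return 100.0 - accuracy_pct


def implied_previous_error(new_error_pct: float,
                           relative_drop_pct: float) -> float:
    """Solve new = prev * (1 - drop) for prev."""
    return new_error_pct / (1.0 - relative_drop_pct / 100.0)


new_error = error_rate(85.29)                          # 14.71% error
prev_error = implied_previous_error(new_error, 31.68)  # implied previous error

print(f"new error: {new_error:.2f}%")
print(f"implied previous error: {prev_error:.2f}%")
print(f"implied previous accuracy: {100.0 - prev_error:.2f}%")
```

Under these assumptions, the previous best error works out to roughly 21.5%, i.e., an accuracy of about 78.5%.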

  Amazon Mechanical Turk job for human verification:

We asked human users on Amazon Mechanical Turk to perform the face verification task on the LFW data set. This image shows an example of what these jobs looked like. Using a total of 240,000 user responses, we were able to plot human performance on LFW.

  Human Face Verification Results on LFW:

Human performance on LFW is almost perfect (99.20%) when people are shown the original images (red line). Showing a more tightly cropped version of the images (blue line) drops their accuracy to 97.53%, due to the lack of context available. The green line shows that even with an inverse crop, i.e., when only the context is shown, humans still perform amazingly well, at 94.27%. This highlights the strong context cues available in the LFW data set. All of our methods mask out the background to avoid using this information.

  The PubFig Data Set:

We show example images for the 140 people used for verification tests on the PubFig benchmark. Below each image is the total number of face images for that person in the entire data set.

  Face Verification Results on PubFig:

Our performance on the entire benchmark set of 20,000 pairs using attribute classifiers is shown in black. Performance on the pose, illumination, and expression subsets of the benchmark is shown in red, blue, and green, respectively. For each subset, the solid lines show results for the "easy" case (frontal pose/lighting or neutral expression), and the dashed lines show results for the "difficult" case (non-frontal pose/lighting or non-neutral expression).



ICCV 2009 presentation


  PubFig Database:

As a complement to the LFW data set, we have created a data set of images of public figures, named PubFig. This data set consists of 60,000 images of 200 people. The larger number of images per person (as compared to LFW) allows us to construct subsets of the data across different poses, lighting conditions, and expressions, while still maintaining a sufficiently large number of images within each set.


FaceTracer: A Search Engine for Large Collections of Images with Faces

Face Swapping: Automatically Replacing Faces in Photographs

Appearance Matching

Labeled Faces in the Wild (UMass Project)