FaceTracer: A Search Engine for Large Collections of Images with Faces

The ability of current search engines to find images based on facial appearance is limited to images with text annotations. Yet, there are many problems with annotation-based search of images: the manual labeling of images is time-consuming; the annotations are often incorrect or misleading, as they may refer to other content on a webpage; and finally, the vast majority of images are simply not annotated.

We have created the first face search engine, allowing users to search through large collections of images which have been automatically labeled based on the appearance of the faces within them. Our system lets users search on the basis of a variety of facial attributes using natural language queries such as "men with mustaches," "young blonde women," or even "indoor photos of smiling children." This face search engine can be directed at all images on the internet, tailored toward specific image collections such as those used by law enforcement or online social networks, or even focused on personal photo libraries.

Like much of the work in content-based image retrieval, the power of our approach comes from automatically labeling images off-line on the basis of a large number of attributes. At search time, only these labels need to be queried, resulting in almost instantaneous searches. Furthermore, it is easy to add new images and face attributes to our search engine, allowing for future scalability. Defining new attributes and manually labeling faces to match those attributes can also be done collaboratively by a community of users.
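The split between off-line labeling and query-time lookup can be sketched as follows. This is a minimal illustration, not the engine's actual implementation: the `FaceRecord` type, the `QUERY_TERMS` keyword table, and the attribute names and values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FaceRecord:
    """One detected face plus the labels assigned to it off-line."""
    image_url: str
    labels: dict = field(default_factory=dict)


# Hypothetical mapping from query words to (attribute, value) pairs.
QUERY_TERMS = {
    "men": ("gender", "male"),
    "women": ("gender", "female"),
    "children": ("age", "child"),
    "smiling": ("smiling", "yes"),
    "indoor": ("environment", "indoor"),
    "outside": ("environment", "outdoor"),
}


def parse_query(query):
    """Turn free text into attribute constraints, ignoring unknown words."""
    constraints = {}
    for word in query.lower().split():
        if word in QUERY_TERMS:
            attribute, value = QUERY_TERMS[word]
            constraints[attribute] = value
    return constraints


def search(records, query):
    """Return records whose stored labels satisfy every constraint."""
    constraints = parse_query(query)
    if not constraints:
        return []  # no recognized attribute words in the query
    return [
        r for r in records
        if all(r.labels.get(a) == v for a, v in constraints.items())
    ]


records = [
    FaceRecord("a.jpg", {"gender": "male", "age": "adult", "smiling": "yes"}),
    FaceRecord("b.jpg", {"age": "child", "smiling": "yes", "environment": "indoor"}),
    FaceRecord("c.jpg", {"age": "child", "smiling": "no", "environment": "outdoor"}),
]
hits = search(records, "indoor photos of smiling children")
print([r.image_url for r in hits])
```

Because the search touches only the stored labels and never the pixels, adding a new attribute in this sketch amounts to adding a new key to each record and a new entry to the keyword table.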

Our search engine owes its superior performance to two main factors:

  1. A large and diverse dataset of face images with a significant subset containing attribute labels. We currently have over 3.1 million aligned faces in our database -- the largest such collection in the world. In addition to its size, our database is also noteworthy for being a completely "real-world" dataset, encompassing a wide range of pose, illumination, and imaging conditions, with images captured using a large variety of cameras. Faces have been automatically extracted and aligned using the Omron face and fiducial point detector. In addition, 10 attributes have been manually labeled on more than 17,000 of the face images, creating a large dataset for training and testing classification algorithms. Our dataset is similar in spirit to the Labeled Faces in the Wild dataset.

    A subset of this dataset has been made publicly available. See below.

  2. A scalable and fully automatic architecture for attribute classification. We present a novel approach tailored toward face classification problems, which uses a boosted set of Support Vector Machines (SVMs) to automatically select the best features for a given attribute. These are used to then train a strong SVM classifier with high accuracy. A key aspect of this work is that classifiers for new attributes can be trained automatically, requiring only a set of labeled examples. Yet, the flexibility of our framework does not come at the cost of reduced accuracy -- we compare against several state-of-the-art classification methods and show the superior classification rates produced by our system.
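The selection step in item 2 can be pictured with a small boosting sketch. It assumes each candidate (standing in for a local SVM trained on one region and feature type) has already been reduced to +1/-1 predictions on the training faces, and it runs plain discrete AdaBoost over that fixed pool. The function name and data layout are hypothetical, and the actual system may differ in its details.

```python
import math


def adaboost_select(predictions, labels, rounds):
    """Rank weak classifiers from a fixed pool with discrete AdaBoost.

    predictions[j][i] is the +1/-1 output of pool classifier j on example i.
    Returns a list of (classifier index, alpha) pairs in selection order.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    selected = []
    for _ in range(rounds):
        # Weighted error of every classifier under the current example weights.
        errors = [
            sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            for preds in predictions
        ]
        best = min(range(len(predictions)), key=errors.__getitem__)
        err = min(max(errors[best], 1e-10), 1 - 1e-10)
        if err >= 0.5:
            break  # nothing in the pool beats chance any more
        alpha = 0.5 * math.log((1 - err) / err)
        selected.append((best, alpha))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [
            w * math.exp(-alpha * y * p)
            for w, p, y in zip(weights, predictions[best], labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return selected


labels = [1, 1, -1, -1]
pool = [
    [1, 1, -1, 1],    # candidate 0: wrong on the last example
    [1, -1, -1, -1],  # candidate 1: wrong on the second example
    [-1, 1, 1, -1],   # candidate 2: wrong on half the examples
]
ranking = adaboost_select(pool, labels, rounds=2)
print([index for index, _ in ranking])
```

In a pipeline like the one described, the features behind the top-ranked candidates would then be used to train the final strong SVM. Plain AdaBoost may pick the same candidate more than once, so a feature-selection variant would typically keep only the first occurrence.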

This work was supported by ONR MURI Award No. N00014-08-1-0638 and NSF grants IIS-03-08185 and ITR-03-25867.


"FaceTracer: A Search Engine for Large Collections of Images with Faces,"
N. Kumar, P. N. Belhumeur, and S. K. Nayar,
European Conference on Computer Vision (ECCV),
pp. 340-353, Oct. 2008.


  Overview of the Database Creation Process:

Images are first downloaded from the internet and run through a face and fiducial point detector to extract the faces within them. These faces are filtered for quality and affine-transformed to a common coordinate system, and then run through a set of attribute classifiers to obtain attribute labels, which are stored in the attribute database. At search time, only this attribute database needs to be searched, resulting in real-time searches.
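The alignment step can be illustrated with a small sketch that maps detected fiducial points onto a common coordinate system. It is a simplification under stated assumptions: it fits a similarity transform (rotation, uniform scale, and shift, a restricted kind of affine transform) from just two eye positions, and the canonical eye coordinates are hypothetical. The actual system may use more fiducial points and a full affine fit.

```python
def fit_similarity(src_a, src_b, dst_a, dst_b):
    """Fit z -> a*z + b mapping two source points onto two target points.

    Points are (x, y) tuples treated as complex numbers, so the complex
    multiplier `a` encodes rotation and uniform scale, and `b` the shift.
    """
    s1, s2 = complex(*src_a), complex(*src_b)
    d1, d2 = complex(*dst_a), complex(*dst_b)
    if s1 == s2:
        raise ValueError("source points must be distinct")
    a = (d2 - d1) / (s2 - s1)
    b = d1 - a * s1
    return a, b


def apply_transform(transform, point):
    """Map one (x, y) point into the common coordinate system."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical canonical eye positions in the aligned face image.
CANONICAL_LEFT_EYE = (30.0, 40.0)
CANONICAL_RIGHT_EYE = (70.0, 40.0)

# Eye positions as a detector might report them for one face.
detected_left, detected_right = (100.0, 120.0), (160.0, 120.0)
transform = fit_similarity(
    detected_left, detected_right, CANONICAL_LEFT_EYE, CANONICAL_RIGHT_EYE
)
mouth = apply_transform(transform, (130.0, 180.0))
print(mouth)
```

Once every face is in the same coordinate system, fixed regions of the aligned image line up with the same parts of each face, which is what allows region-based features to be compared across faces.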

  Image Database Statistics:

We have collected what we believe to be the largest set of aligned real-world face images (over 3.1 million so far). These images have been downloaded from the internet, and exhibit a large amount of variation -- they are in completely uncontrolled lighting and environments, taken using unknown cameras and in unknown imaging conditions, with a wide range of image resolutions. Notice that more than 45% of the downloaded images contain faces, and on average, there is one face per two images.

  List of Labeled Attributes:

We have manually labeled over 17,000 attributes in our face database. These labels are used for training our classifiers, allowing for automatic classification of the remaining faces in our database. Note that these were labeled by a large set of people, and thus the labels reflect a group consensus about each attribute rather than a single user's strict definition.

  Face Regions used for Automatic Feature Selection:

Our feature selection process automatically selects features from our set of 10 regions. On the left is one region corresponding to the whole face, and on the right are the remaining regions, each corresponding to functional parts of the face. The regions are large enough to be robust against small differences between individual faces and overlap slightly so that small errors in alignment do not cause a feature to go outside of its region. The letters in parentheses denote the code letter for the region, used as a shorthand notation for describing particular feature combinations.

  Feature Type Options:

Our feature selection process automatically selects one or more feature types from these options. A complete feature type is constructed by first converting the pixels in a given region to one of the pixel value types from the first column, then applying one of the normalizations from the second column, and finally aggregating these values into the output feature vector using one of the options from the last column. The letters in parentheses are used as code letters in a shorthand notation for concisely designating feature types.
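The three-stage construction (pixel value type, then normalization, then aggregation) can be sketched as a composition of small functions. The specific options below (intensity as a channel average, mean normalization, and mean/variance statistics) are illustrative assumptions, since the table of options is not reproduced in the text, and all function and table names are hypothetical.

```python
from statistics import mean, pvariance


def to_rgb(pixels):
    """Flatten (r, g, b) tuples into one list of channel values."""
    return [float(c) for px in pixels for c in px]


def to_intensity(pixels):
    """Average the three channels of each pixel."""
    return [sum(px) / 3.0 for px in pixels]


def no_normalization(values):
    return list(values)


def mean_normalization(values):
    """Divide by the mean so overall brightness cancels out."""
    m = mean(values)
    return [v / m for v in values] if m else list(values)


def no_aggregation(values):
    return list(values)


def statistics_only(values):
    """Keep only the mean and (population) variance of the values."""
    return [mean(values), pvariance(values)]


PIXEL_TYPES = {"rgb": to_rgb, "intensity": to_intensity}
NORMALIZATIONS = {"none": no_normalization, "mean": mean_normalization}
AGGREGATIONS = {"none": no_aggregation, "stats": statistics_only}


def build_feature(pixels, pixel_type, normalization, aggregation):
    """Apply one option from each stage, in order, to a region's pixels."""
    values = PIXEL_TYPES[pixel_type](pixels)
    values = NORMALIZATIONS[normalization](values)
    return AGGREGATIONS[aggregation](values)


region_pixels = [(30, 60, 90), (90, 120, 150)]
raw = build_feature(region_pixels, "rgb", "none", "none")
summary = build_feature(region_pixels, "intensity", "mean", "stats")
print(raw)
print(summary)
```

Keeping each stage as a separate lookup table means the number of complete feature types is the product of the options per stage, which is the search space a selection procedure would then explore.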

  Results of our Training Algorithm:

This table shows error rates obtained by our training algorithm. In most cases, our error rates are under 10%. The top feature combinations selected by our algorithm are shown in the last column, ranked from most to least important, as "Region:feature type" pairs, where the region and feature types are listed using the code letters defined above. For example, the first combination for the hair color classifier, "H:r.n.s," takes from the hair region (H) the RGB values (r) with no normalization (n) and using only the statistics (s) of these values.
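The shorthand can be unpacked mechanically, as in this small sketch. Only the code letters spelled out in the example above (H, r, n, s) are given names here; any other letter is passed through unchanged rather than guessed. The `FeatureCombo` type and function names are hypothetical.

```python
from typing import NamedTuple


class FeatureCombo(NamedTuple):
    region: str
    pixel_type: str
    normalization: str
    aggregation: str


# Only the codes explained in the text are mapped to names.
REGION_NAMES = {"H": "hair"}
PIXEL_NAMES = {"r": "RGB"}
NORMALIZATION_NAMES = {"n": "none"}
AGGREGATION_NAMES = {"s": "statistics"}


def parse_combo(code):
    """Split a 'Region:pixel.normalization.aggregation' code into its parts."""
    region, _, feature = code.strip().partition(":")
    parts = feature.split(".")
    if not region or len(parts) != 3 or not all(parts):
        raise ValueError("expected 'Region:x.y.z', got %r" % code)
    return FeatureCombo(region, *parts)


def describe(combo):
    """Spell out a combination, leaving unknown code letters as they are."""
    return "{} region, {} values, normalization: {}, aggregation: {}".format(
        REGION_NAMES.get(combo.region, combo.region),
        PIXEL_NAMES.get(combo.pixel_type, combo.pixel_type),
        NORMALIZATION_NAMES.get(combo.normalization, combo.normalization),
        AGGREGATION_NAMES.get(combo.aggregation, combo.aggregation),
    )


combo = parse_combo("H:r.n.s")
print(describe(combo))
```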

  Illustrations of Automatically-Selected Feature Combinations:

The images show the top-ranked feature combinations for (a) gender, (b) smiling, (c) environment, and (d) hair color. These feature combinations were automatically selected by our training algorithm. Notice how each classifier uses different regions and feature types of the face. It would be impossible to define all of these manually for each attribute and maintain our high accuracies.

  Comparison of Classification Performance against Prior Methods:

Our attribute-tuned global SVM performs better than prior state-of-the-art methods. Note how the two AdaBoost methods and the full-face SVM method complement each other across the different attributes, showing the strengths and weaknesses of each approach. By exploiting the advantages of each, our approach achieves the best performance.

  Results for Query "Adults Outside":

Comparison of results for the query "adults outside" using our search engine (left) and Google image search (right). Notice that since Google is dependent on text labels for searching through images, many of its results have no relevance to the query. In contrast, our search engine finds good results because our system has labeled each face off-line using our attribute classifiers.

  Results for Query "Asian Babies":

Comparison of results for the query "asian babies" using our search engine (left) and Flickr (right). Flickr allows users to search using manually annotated image "tags." However, because it is tedious to tag each image, users often tag entire photosets with the same tags, resulting in many incorrect tags for individual images. Thus, searches often return many irrelevant images. Furthermore, the vast majority of images on Flickr are simply not tagged, making them "invisible" to search.

  Results for Query "Children Outside" using a Personalized Image Search:

Our search engine can also be applied to a user's own set of photos, allowing for personalized image searches. In this way, one can quickly locate specific images of friends or family members. Thus, our search engine can be a useful addition to existing personal photo management software by allowing people to easily organize their photos on the basis of the faces within them. Furthermore, users could also search for (and remove) poor-quality images, or even train their own classifiers.



  ECCV 2008 Video:

This video introduces our face search engine. It describes the database creation process and shows several example queries. (With narration)



ECCV 2008 presentation


  FaceTracer Database:

We have made a large portion of our database publicly available for use by vision researchers. This dataset includes several thousand detected faces, locations of fiducial points, links to the original webpages on which the images were found, and also a significant number of manually labeled attributes.


Attribute and Simile Classifiers for Face Verification

Appearance Matching

Face Swapping: Automatically Replacing Faces in Photographs

Labeled Faces in the Wild (UMass Project)