IEEE International Conference on Multimedia & Expo 2001
Tokyo, Japan
Electronic Proceedings
(C) 2001 IEEE


  Nariman Habili and Cheng-Chew Lim
 School of Electrical and Electronic Engineering
Adelaide University, SA 5005, Australia

Alireza Moini
VLSI Group Leader, Intelligent Pixels Inc.
R&D Center, Sanori House, 126 Grand Blvd.
Joondalup, WA 6027, Australia


In this paper, we present a hand and face segmentation algorithm using motion and color cues. The algorithm is proposed for the content based representation of sign language image sequences, where the hands and face constitute a video object. Our hand and face segmentation algorithm consists of three stages, namely color segmentation, temporal segmentation, and video object plane generation. In color segmentation, we model the skin color as a normal distribution and classify each pixel as skin or non-skin based on its Mahalanobis distance. The aim of temporal segmentation is to localize moving objects in image sequences. A statistical variance test is employed to detect object motion between two consecutive images. Finally, the results from color and temporal segmentation are analyzed to yield a change detection mask. The performance of the algorithm is illustrated by simulation carried out on the silent test sequence.


There is a growing trend towards content-based representation in image and video processing applications, as shown by the recent MPEG-4 and 7 standardization efforts. Content-based representation requires the decomposition of an image or video sequence into specific objects, known as video objects (VOs). In this context, a VO may represent a moving person, a fixed background or audio. The instances of VOs at a given time (i.e. frame) are called video object planes (VOPs). A frame can be decomposed into VOPs by means of segmentation.

In sign language communication, or simply signing, the hands and face are perceptually important and thus constitute a VO. The main objective of our research is to devise an algorithm for the segmentation of VOPs in sign language sequences. A comprehensive study on the segmentation of the hands and face, and the coding of sign language sequences was presented in [1]. The author modeled the skin color distribution as a normal mixture in the L*a*b color-space and used the Bayesian classifier to classify image pixels as skin or non-skin. The algorithm required a separate skin location algorithm to identify skin pixels for distribution training. Due to hand and face motion during signing, motion serves as an important cue for VOP segmentation. The author did not take advantage of motion information to enhance the segmentation results.

Our hand and face segmentation algorithm is composed of three stages. In the first stage, image pixels are classified as skin or non-skin to yield a skin detection mask (SDM). The skin color distribution is modeled as a bivariate normal distribution and the image pixels are classified based on their Mahalanobis distance. In the second stage, the statistical variance test is employed to localize moving objects in the image sequence and yield a change detection mask (CDM). The third stage involves the fusion of the SDM and the CDM to generate the VOP. To distinguish between the hands and face, a face identification method is proposed, employing shape features.

This paper is organized as follows. The color segmentation and temporal segmentation techniques are presented in sections 2 and 3, respectively. In section 4, we present the VOP generation method, and experimental results are presented in section 5. The paper is concluded in section 6.


  We employ color information to locate skin regions in each image. The YCbCr color space is considered since it is typically used in video coding and provides an effective use of chrominance information for modeling the human skin color. Experimental results indicate that the skin-color distribution in the CbCr plane remains constant regardless of any variation in the luminance information of an image [2, 3]. Moreover, the CbCr component of the skin pixels of people from European, African and Asian descent occupy the same region in the CbCr plane.

Pixel Classification

This section describes the classification method employed to classify pixels as skin or non-skin. The method is analogous to the single hypothesis classifier described in [4]. Single hypothesis schemes have been proposed to solve problems in which one class is well defined while others are not. It is assumed that the skin class is well defined, while the non-skin class, which may include a wide variety of different colors, is not.

The Skin-Color Model

  Let x denote the feature vector formed by the Cb and Cr components of a pixel, and x is in a 2-dimensional Euclidean space tex2html_wrap_inline688 , called the feature space. The skin and non-skin classes are denoted by tex2html_wrap_inline690 and tex2html_wrap_inline692 , respectively. The skin color distribution in the CbCr plane is modeled as a bivariate normal distribution:


where tex2html_wrap_inline694 and tex2html_wrap_inline696 are the mean vector and covariance matrix of the distribution, respectively. Normal distributions are widely used in the pattern recognition community because of their many desirable properties [4]. The parameters, tex2html_wrap_inline694 and tex2html_wrap_inline696 , are estimated from the skin training pixels. The training pixels were obtained by manually segmenting training images that included people of European, African and Asian descent.

The quantity d in


is known as the Mahalanobis distance from tex2html_wrap_inline704 to tex2html_wrap_inline694 . Pixels can be classified as skin or non-skin based on their Mahalanobis distance. The value of d is related to the probability that a given pixel belongs to class tex2html_wrap_inline690 . A small value of d indicates a high skin pixel probability and vice-versa.

Test of Normality

An effective test to check the assumption of bivariate normality is the chi-square test[5]. Equation (2) can be expressed as:


where tex2html_wrap_inline714 and A is the whitening transformation [4]. Since the mean vector and covariance matrix of tex2html_wrap_inline718 are tex2html_wrap_inline720 and the identity matrix respectively, the tex2html_wrap_inline722 's are independent random variables with zero mean and unity variance. If tex2html_wrap_inline704 is indeed normal, then tex2html_wrap_inline726 in equation (3) is a chi-square ( tex2html_wrap_inline728 ) random variable with n=2 degrees of freedom. Therefore, the test for bivariate normality is to compare the goodness of fit of the Mahalanobis distances


to tex2html_wrap_inline732 , where tex2html_wrap_inline738 and tex2html_wrap_inline740 are estimated from the skin training pixels. The procedure is as follows:

  1. The squared distances in equation (4) are ordered in ascending order as tex2html_wrap_inline734 , where tex2html_wrap_inline736 is the number of skin training pixels. Note that tex2html_wrap_inline738 is the ith smallest squared distance, whereas tex2html_wrap_inline742 is the squared distance associated with the chrominance vector for the ith skin training pixel.
  2.  tex2html_wrap_inline738 is plotted against tex2html_wrap_inline748 , where tex2html_wrap_inline748 is the tex2html_wrap_inline752 percentile of the chi-square distribution with 2 degrees of freedom (the factor of 0.5 is added as a correction for continuity).
The plot should follow a straight line. The chi-square plot of the ordered distances shown in figure 1 does not show any significant deviation from a straight line. It can therefore be asserted that the skin class pixels in the CbCr plane follow a bivariate normal distribution.

Figure 1: The chi-square plot of the ordered distances.

The Segmentation Threshold

The skin detection mask (SDM) is defined as:


where tex2html_wrap_inline758 is the segmentation threshold, and tex2html_wrap_inline760 is the Mahalanobis distance for the pixel at location (m,n). tex2html_wrap_inline758 is derived by examining the probability of classification error, tex2html_wrap_inline766 .

Let tex2html_wrap_inline768 denote the region in the feature space where the classifier decides tex2html_wrap_inline690 and likewise for tex2html_wrap_inline772 and tex2html_wrap_inline692 . There are two ways in which a classification error can occur; either an observation x falls in tex2html_wrap_inline768 and the true class is tex2html_wrap_inline692 , or x falls in tex2html_wrap_inline772 and the true class is tex2html_wrap_inline690 . Since these events are mutually exclusive and collectively exhaustive, the probability of classification error is


where tex2html_wrap_inline784 and tex2html_wrap_inline786 denote the a priori probabilities of the skin and non-skin classes, respectively. For the remainder of this paper, the following notations, borrowed from radar terminology, will be used


tex2html_wrap_inline788tex2html_wrap_inline790 and tex2html_wrap_inline792 are the probabilities of false alarm, detection and miss, respectively. Note that tex2html_wrap_inline794 .

Using the above notations, the probability of classification error can now be expressed as


where tex2html_wrap_inline796 is a threshold. Therefore, the probability of error is a function of tex2html_wrap_inline796 and the a priori probabilities. tex2html_wrap_inline790 and tex2html_wrap_inline788 are evaluated for the set of training images tex2html_wrap_inline804 , k=1,...,K,




where tex2html_wrap_inline808 and tex2html_wrap_inline810 are defined as:




tex2html_wrap_inline820 denotes the feature vector of a pixel in training image k, k=1,...,K. The a priori probabilities can be either estimated or assumed. The tex2html_wrap_inline796 that minimizes equation (8) is then designated as the segmentation threshold.


  In this section, a temporal segmentation method is developed based on the variance statistical test. The motion of a moving object from one image to the next generates intensity variations that can be represented in the form of a difference image. However, intensity variations can also occur due to camera or quantization noise. The noise is usually modeled as a zero-mean normal distribution [6]. The objective of temporal segmentation is to distinguish between temporal variations caused by noise and those caused by object motion. We refer to intensity variations caused by motion as foreground and those caused by noise as background.

Let tex2html_wrap_inline828 denote the variance of the background population, and W a sliding observation window. We use the statistical variance test to detect background and foreground regions in the difference image. The statistical variance test can be formally stated as:


The null hypothesis, tex2html_wrap_inline832 , implies that the set of difference pixels in W is drawn from a normal population with variance tex2html_wrap_inline828 . The hypothesis is rejected if the variance of the difference pixels in W is significantly greater than tex2html_wrap_inline828 . The intensity variation induced by a moving object is greater than that of the background because of a higher intensity gradient at the edge and inside of a moving object. W is set to tex2html_wrap_inline844 samples (i.e. n=9 samples) and the significance level, tex2html_wrap_inline848 , is set to 1%. If the hypothesis is true, then


has a tex2html_wrap_inline850 distribution with n-1=8 degrees of freedom. tex2html_wrap_inline854 is the sample variance. For a significance level of 1%, the critical value of Y is 20.1. Therefore if Y>20.1, we would reject the hypothesis.

The foreground and background regions in the difference image are represented in the form of a binary map, called the change detection mask (CDM). If the null hypothesis is rejected, a binary 1 is allocated to the center pixel in W, otherwise a binary 0 is allocated. The parameter tex2html_wrap_inline828 can be estimated by the histogram fitting technique described in [7] or the least median of squares technique described in [8].

Video Object Plane GENERATION

  This section describes the VOP generation method. Firstly, connected components analysis on both the SDM and the CDM is performed to remove all connected components of 50 or less pixels (with 8-neighborhood connectivity). These regions can be generally attributed to false alarms. After connected components analysis, holes in the remaining connected components are filled. This was performed to promote the formation of semantic objects and improve the accuracy of VOP generation.

SDM and CDM Analysis

Due to face and hand motion during signing, the CDM can be utilized to identify the hands and face in the SDM. First, the SDM is superposed on top of the CDM. When 80% or more of a connected component in the SDM is covered by a foreground region in the CDM, the connected component is declared as either a face or a hand.

Face Identification

It may sometimes be necessary to discriminate between the face and the hands. One method is to compare the areas of the connected components in the VOP. Intuitively, the face would have the largest area, however if a subject has part of an arm exposed, the arm may have a greater area than the face and thus result in inaccurate identification. An effective method to distinguish between the face and the hands is to model the face as a rigid object, and the hands as non-rigid objects, due to wrist and finger motion. Such a model would allow the use of shape features to differentiate between the face and hands. We have devised three tests to make the differentiation.

It is a well known fact that the shape of the face can be approximated by an ellipse [9]. The best-fit-ellipse of a connected component, tex2html_wrap_inline864 , is defined by its center tex2html_wrap_inline866 , its orientation tex2html_wrap_inline868 , and the length of its major (a) and minor (b) axes [10]. The center of gravity of tex2html_wrap_inline864 gives the center of the ellipse:




where N denotes the number of pixels in tex2html_wrap_inline864 . Orientation is defined as the angle of axis of the least moment of inertia. It can be computed by utilizing the central moments tex2html_wrap_inline880 of the connected component:


The first test is the orientation test. We have observed that during signing, the head can tilt in the range tex2html_wrap_inline882 from vertical. Therefore, if the orientation of a connected component is not within this range, it cannot be the face.

The second test deals with the aspect ratio (a/b) of tex2html_wrap_inline864 . We have observed that the aspect ratio of the face can range from 1.4 to 1.8. Therefore, any connected component outside of this range, cannot represent the face. a and b are determined by computing the moments of inertia of tex2html_wrap_inline864 . The least and greatest moments of inertia for an ellipse are




For a given tex2html_wrap_inline868 , the above moments can be calculated as




The requirements for a best fit ellipse are tex2html_wrap_inline896 and tex2html_wrap_inline898 , which gives the lengths of a and b, respectively:




The final test is to assess the similarity between a connected component and its best fit ellipse. This is accomplished by computing the difference between the area of tex2html_wrap_inline864 inside and outside the ellipse. The difference is then divided by the area of the ellipse. We have found that the above similarity measure should be 0.8 or higher for facial regions.


  For simulation we used the silent test sequence. Results for frames 11 and 12 are shown in figure 2. The SDMs are shown in figure 2(b). The false alarms present in the SDMs are due to similar skin and background color characteristics. The CDMs, shown in figure 2(c), also contain false alarms. The false alarms to the subject's right are due to shadow, induced by hand motion. The false alarms are largely eliminated after connected components analysis. The face and hands of the subject have been segmented quite effectively, as shown in figure 2(d).

Figure 2: (a): Frames 11 and 12 of the silent sequence. (b): SDMs. (c): CDMs. (d): VOPs.


  A new hand and face segmentation algorithm has been presented in this paper. The algorithm consists of three steps, namely color segmentation, temporal segmentation and VOP generation. In color segmentation, the aim is to segment skin regions in an image. Meanwhile, in temporal segmentation, moving objects in the image sequence are localized. The color and motion information is then used to generate the VOP. Experimental results indicate that the technique is capable of segmenting the hands and face quite effectively. The algorithm allows the flexibility of incorporating additional techniques to enhance the results. Work is currently under way to incorporate a tracking technique to track the hands and face throughout the sequence.


 R. P. Schumeyer, A Video Coder Based on Scene Content and Visual Perception, PhD thesis, University of Delaware, 1998.
 D. Chai and K. N. Ngan, ``Face Segmentation Using Skin-Color Map in Videophone Applications'', IEEE Trans. Circuits Sys. Video Tech., vol. 9, no. 4, pp. 551-564, June 1999.
 H. Wang and S.-F. Chang, ``A Highly Efficient System for Automatic Face Region Detection in MPEG Video'', IEEE Trans. Circuits Sys. Video Tech., vol. 7, no. 4, pp. 615-628, Aug. 1997.
 K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, Boston, 1990.
 R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Prentice Hall, Englewood Cliffs, N.J, 1982.
 M. Kim, J. G. Choi, D. Kim, H. Lee, M. H. Lee, C. Ahn, and Y.-S. Ho, ``A VOP Generation Tool: Automatic Segmentation of Moving Objects in Image Sequences Based on Spatio-Temporal Information'', IEEE Trans. Circuits Sys. Video Tech., vol. 9, no. 8, pp. 1216-1226, Dec. 1999.
 N. Habili, A. R. Moini, and N. Burgess, ``Histogram Based Temporal Object Segmentation for VOP Extraction in MPEG-4'', in Proc. The Fist IEEE Pacific-Rim Conference on Multimedia, Sydney, Australia, Dec. 2000, pp. 310-313.
 P. L. Rosin, ``Thresholding for Change Detection'', Tech. Rep. ISTR-97-01, Brunel University, UK, June 1997.
 K. Sobottka and I. Pitas, ``A novel method for automatic face segmentation, facial feature extraction and tracking'', Signal Processing: Image Communication, vol. 12, no. 3, pp. 263-281, June 1998.
10     A. Jain, Fundamentals of Digital Image Processing, Prentice Hall, Englewood Cliffs, N.J, 1989.

Fri May 25 15:31:50 CST 2001