Appearance based retrieval for tracked objects in surveillance videos

This paper focuses on indexing and retrieval at the object level for video surveillance. Object retrieval is difficult because object detection and tracking are imprecise. In the indexing phase, a new representative blob detection method selects the most relevant blobs, which represent the various visual aspects of an object. In the retrieval phase, a new robust object matching method successfully retrieves objects even when they are not perfectly tracked. We validate our approach on videos from a subway monitoring project. The representative blob detection method improves on the state of the art. The retrieval results show that the object matching method remains robust when working with imprecise object tracking algorithms.


Introduction
Video surveillance is a rapidly growing industry. Driven by low hardware costs, heightened security fears and increased capabilities, video surveillance equipment is being deployed more widely and with greater storage than ever. This produces a huge amount of video data. Retrieval facilities associated with these video data are therefore very useful for many purposes and many kinds of staff. Recently, several approaches have been dedicated to retrieval facilities for surveillance data (Zhang, Chen et al. 2009). Figure 1 shows how an indexing and retrieval facility can be integrated in a surveillance system. Videos coming from cameras are interpreted by the video analysis module. There are two modes for using the analysed results: (1) the corresponding alarms are sent to members of the security staff to inform them about the situation; (2) the analysed results are stored in order to be used in the future. In this chapter, we focus on analysing current achievements in surveillance video indexing and retrieval. Video analysis (Senior 2009) is beyond the scope of this chapter.
The video analysis module provides two main types of result: objects and events. Thus, surveillance video indexing and retrieval approaches can be divided into two categories: surveillance video indexing and retrieval at the object level (Calderara, Cucchiara et al. 2006; Ma and Cohen 2007) and at the event level (Zhang, Chen et al. 2009; Velipasalar, Brown et al. 2010). As events of interest may vary significantly among different applications and users, this chapter focuses on presenting the work done for surveillance video indexing and retrieval at the object level.
The remainder of the chapter is organized as follows. In Section 2, we give a brief overview of surveillance object retrieval. Section 3 analyses appearance-based surveillance object retrieval in detail. We first give some definitions and point out the existing challenges. Then, we describe the solutions proposed for two important tasks, object signature building and object matching, in order to overcome these challenges. Section 4 presents current achievements and discusses open problems in this domain.
Fig. 1. Indexing and retrieval facility in a surveillance system. Videos coming from cameras are interpreted by the video analysis module. There are two modes for using the analysed results: (1) the corresponding alarms are sent to security staff to inform them about the situation; (2) the analysed results are stored in order to be used in the future.

Object retrieval for surveillance videos
This section aims to give an overview of existing approaches for object retrieval in surveillance videos.

Architecture
In the same way as video analysis systems, which have two main architectures, i.e. centralized and decentralized (Senior 2009), object retrieval for surveillance systems also has two main modes: late fusion and early fusion. In the late fusion mode (cf. Fig. 2), object detection and tracking are performed on the video stream of each camera. Then, the object matching compares the query and the detected objects for each camera. The matching results are fused to form the retrieval results. In the early fusion mode (cf. Fig. 3), the data fusion is done in the object detection and tracking module. The object retrieval method in the early fusion mode has more opportunities to obtain a good result because an object that is not totally observed by one camera may be well captured by other cameras. Most of the state-of-the-art work belongs to the early fusion mode. However, the fusion strategy is not explicitly discussed except in the work of Calderara et al. (Calderara, Cucchiara et al. 2006).
Fig. 2. Late fusion object retrieval approach: object detection and tracking are performed on the video stream of each camera. Then, the object matching compares the query and the detected objects of each camera. The matching results are then fused to form the retrieval results.

Object feature extraction and representation
Since objects in video surveillance are physical objects (e.g. people, vehicles) that are present in the scene at a certain time, they are, in general, detected and tracked in a large number of frames. Objects in videos possess two main kinds of characteristics: spatial and temporal. Spatial characteristics of an object may be its positions in frames (in 2D coordinates) and positions in the scene (in 3D coordinates), its spatial relationships with other objects and its appearance. Temporal characteristics of an object comprise its movement and its temporal relationships with other objects. Therefore, an object may be represented by a single characteristic or by several. Among these characteristics, object movement and object appearance are the two most important and are widely used in the literature.
Concerning object representation based on movement, a number of different approaches have been proposed in the literature for object movement representation and matching (Broilo, Piotto et al. 2010). Certain approaches directly use detected object positions across frames, represented in trajectory form (Zheng, Feng et al. 2005). As an object trajectory may be very complex, other authors segment it into several sub-trajectories (Buchin, Driemel et al. 2010) so that each sub-trajectory represents a relatively stable pattern of object movement. Other work attempts to move to higher levels of object trajectory representation, named the symbolic level and the semantic level. At the symbolic level, (Chen, Ozsu et al. 2004; Hsieh, Yu et al. 2006; Le, Boucher et al. 2007) aim to convert an object trajectory into a character sequence. The advantage is that well-established text retrieval methods, such as the Edit Distance, can then be applied to object trajectory matching. The approaches dedicated to object trajectory representation at the semantic level try to learn semantic meanings such as turn left or low speed from object movement (Hu, Xie et al. 2007). As a result, the output is close to the human manner of thinking. However, these approaches depend strongly on the application.
Object representation based on appearance has attracted a lot of research interest. Appearance-based object retrieval methods for surveillance video are distinguished from each other by two criteria. The first criterion is the appearance feature extracted from the image/frame where the object is detected. The second is the way the object signature is created from all features extracted over the object's lifetime and the way objects are matched based on their signatures. In the next section, we describe in detail the object signature building and object matching methods. In this section, we only present the object appearance features.
There is a great variety of object features used for surveillance object representation. In fact, all features that have been proposed for image retrieval can be applied to surveillance object representation. Appearance features can be divided into two categories: global and local. Global features include the color histogram, dominant color and covariance matrix, to name a few. Local features, such as interest points and the SIFT descriptor, can also be extracted from the object's region.
In (Yuk, Wong et al. 2007), the authors propose to use MPEG-7 descriptors such as dominant colors and edge histograms for surveillance retrieval. In the context of a research project conducted by the IBM research center, the researchers evaluated a large number of color features for surveillance applications: standard color histograms, weighted color histograms, variable bin size color histograms and color correlograms. Their results show that the color correlogram has the best performance. Ma and Cohen (Ma and Cohen 2007) suggest using the covariance matrix as the object feature. According to the authors, the covariance matrix is appealing because it fuses different types of features and has small dimensionality. This small dimensionality is well suited to surveillance videos because the model takes very little storage space. In our research (Le, Boucher et al. 2010), we evaluated the performance of four descriptors, namely dominant color, edge histogram, covariance matrix (CM) and SIFT descriptor, for surveillance object representation and matching. The results show that if the objects are detected without background and context objects being present in the object region, the descriptors allow objects to be retrieved with relatively good results. In the other cases, the covariance matrix is more effective than the other descriptors. According to our experiments, it is interesting to note that while the covariance matrix represents information from all pixels in a blob, the interest points use only a few pixels. The dominant color and the edge histogram use approximate information about pixel color and edges. A pair of descriptors, (covariance matrix and dominant color), (covariance matrix and edge histogram) or (covariance matrix and SIFT descriptors), may be chosen as default descriptors for object representation.
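As an illustration of symbolic-level trajectory matching, the following minimal sketch converts a trajectory into a character sequence and compares two sequences with the Edit Distance. The four-letter alphabet (L, R, U, D) and the function names are assumptions made for this example; the cited works define their own symbol sets.

```python
def symbolize(points):
    """Convert a list of (x, y) positions into a direction string.

    Hypothetical alphabet: L/R for horizontal moves, U/D for vertical
    moves (image coordinates, y grows downward). Stationary steps are skipped.
    """
    symbols = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue
        if abs(dx) >= abs(dy):
            symbols.append("R" if dx > 0 else "L")
        else:
            symbols.append("D" if dy > 0 else "U")
    return "".join(symbols)


def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    previous = list(range(len(b) + 1))
    for i, sym_a in enumerate(a, start=1):
        current = [i]
        for j, sym_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (sym_a != sym_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


trajectory_a = symbolize([(0, 0), (1, 0), (2, 0), (3, 0), (3, -1), (3, -2)])
trajectory_b = symbolize([(0, 0), (1, 0), (2, 0), (2, -1), (2, -2), (2, -3)])
print(trajectory_a, trajectory_b, edit_distance(trajectory_a, trajectory_b))
```

A small distance indicates that the two objects followed similar movement patterns, whatever their exact positions.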

Appearance-based object retrieval in surveillance videos
In this section, we first give some definitions and point out the existing challenges for appearance-based object retrieval in surveillance videos. Then, we describe the solutions proposed for two important tasks, object signature building and object matching, in order to overcome these challenges.

Definitions
Definition 1: An object blob is a region determined by a minimal bounding box in a frame where the object is detected.
The minimal bounding box is calculated by the object detection module in video analysis and an object has one sole minimal bounding box. Fig. 4 gives some examples of detected objects and their corresponding blobs.

Definition 2: Object representation
In surveillance applications, an object is in general detected and tracked in a number of frames. In other words, a set of object blobs is defined for an object. Therefore, an object can be represented as:
$$O = \{B_i\}, \quad i = 1, \dots, N \qquad (1)$$
where O is the object, B_i is the i-th object blob and N is the total number of blobs of object O.
It is worth noting that object blobs can be non-consecutive, since an object may not be detected in certain frames, and that the value of N varies depending on the object's lifetime in the scene. Fig. 5 gives an example of an object that is represented by its blobs. As we can notice, with poor object detection, several object blobs do not cover the object's appearance well.

Challenges in appearance-based object retrieval for surveillance videos
This section points out existing challenges in appearance-based object retrieval for surveillance videos. As object indexing and retrieval take the output of video analysis as their input (cf. Fig. 1), the quality of the video analysis has a huge influence on object indexing and retrieval. Current achievements in surveillance video analysis show that video analysis is far from perfect, since it is hampered by low resolution, pose and lighting variations and object occlusion. We point out the challenges in appearance-based object retrieval by analyzing the effect of two modules of video analysis on the object indexing and retrieval quality: the object detection and the object tracking modules.
The object detection module determines the object blobs. An object detection module is good if all blobs of a detected object (1) totally cover this object and (2) do not contain other objects. However, these constraints are not always met. Object retrieval has to address three difficult cases, as shown in Fig. 6. In the first case, the object is not present at all in the blob (Fig. 6a). In the second case, the object is partially present in the blob (Fig. 6b). In the third case, the blob of the detected object totally covers this object but also contains other objects (Fig. 6c and Fig. 6d).
Concerning the object tracking quality, two metrics are widely used in the video surveillance community for evaluating the performance of object tracking: object ID persistence and object ID confusion (Nghiem, Bremond et al. 2007). The object ID persistence metric computes, over time, how many tracked objects (output of the object tracking module) are associated with one ground-truth object. Conversely, the object ID confusion metric computes the number of ground-truth objects per detected object (having the same ID). A good object tracking algorithm obtains a small value for these two metrics (the minimum is 1). However, the results obtained on several video surveillance benchmarks show that current achievement in object tracking is still limited (object ID persistence and object ID confusion are generally much greater than 1). Fig. 7 shows an example of the object ID persistence problem: two tracked objects are created for a single ground-truth object, therefore object ID persistence is equal to 2. Fig. 8 illustrates an example of object ID confusion: three ground-truth object IDs are associated with a single detected object (object ID confusion = 3).
Fig. 7. An example of the object ID persistence problem: two tracked objects created for one sole ground-truth object (object ID persistence = 2).
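The two tracking metrics described above can be sketched as follows. The input format, a list of (frame, ground-truth ID, tracked ID) associations, and the function name are assumptions made for this illustration; the evaluation protocol of (Nghiem, Bremond et al. 2007) may define the association step differently.

```python
from collections import defaultdict


def id_persistence_and_confusion(associations):
    """Compute per-object tracking metrics.

    associations: iterable of (frame, ground_truth_id, tracked_id) tuples,
    one per frame in which a tracked object is matched to a ground-truth
    object (hypothetical input format).
    Returns two dicts:
      persistence[gt_id]    = number of tracked IDs given to that ground-truth object
      confusion[tracked_id] = number of ground-truth objects sharing that tracked ID
    """
    tracked_per_gt = defaultdict(set)
    gt_per_tracked = defaultdict(set)
    for _frame, gt_id, tracked_id in associations:
        tracked_per_gt[gt_id].add(tracked_id)
        gt_per_tracked[tracked_id].add(gt_id)
    persistence = {gt: len(ids) for gt, ids in tracked_per_gt.items()}
    confusion = {tr: len(ids) for tr, ids in gt_per_tracked.items()}
    return persistence, confusion


# One person tracked under two IDs (persistence problem) and
# three people merged under one ID (confusion problem).
persistence, confusion = id_persistence_and_confusion([
    (1, "person_1", "track_a"), (2, "person_1", "track_b"),
    (3, "person_2", "track_c"), (4, "person_3", "track_c"), (5, "person_4", "track_c"),
])
print(persistence["person_1"], confusion["track_c"])
```

Both values equal 1 for a perfectly tracked object, which matches the minimum stated in the text.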
Based on the above analysis, the main challenge in surveillance object indexing and retrieval is the poor quality of object detection and tracking. An object indexing and retrieval algorithm is robust if it can work with object detection and tracking of varying quality.
With the object representation defined in Eq. 1, we believe that object indexing and retrieval methods can cope with poor object detection and tracking if they combine effective object signature building with robust object matching.
Fig. 8. An example of object ID confusion: three ground-truth object IDs associated to one sole detected object (object ID confusion = 3).

Object signature building
Object signature building is a process that aims at calculating one or a set of descriptors, named object signature, from a set of object blobs.
The calculated signature should (1) be able to represent all aspects of the object's appearance, (2) be distinctive and (3) be as compact as possible. The first two characteristics ensure the robustness of the retrieval part. The third characteristic relates to the efficiency of the indexing part: if the signature is compact, it does not require much storage.
Object signature building methods for surveillance video are divided into two approaches. The first approach is based on the following observation. Surveillance objects are generally detected and tracked in a large number of frames. Consequently, an object is represented by a set of blobs. Due to errors in object detection, using all these blobs for object indexing and retrieval is inappropriate. Moreover, it is redundant because blobs have similar content (two consecutive blobs of an object are very similar). Based on this observation, methods belonging to the first approach select the most relevant and representative blobs from the set of blobs and then compute object features on these blobs. This process is defined by Eq. 2. This approach is composed of two steps. The first step, called representative blob detection, chooses from the object blobs the most relevant and representative ones, which significantly represent the object appearance. The second step computes the object features mentioned in Section 2.2 on these representative blobs.

    
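$$O = \{B_i\}_{i=1}^{N} \;\rightarrow\; \{B^r_j\}_{j=1}^{M} \;\rightarrow\; \{F_j\}_{j=1}^{M} \qquad (2)$$
where B^r_j is the j-th representative blob, F_j is the feature computed on it and M (M ≤ N) is the number of representative blobs.
Instead of calculating only the representative blobs, several authors compute a set of pairs, each made of a representative blob and its associated weight, where the weight expresses the importance of the blob. With this, the first approach is defined as follows:
$$O = \{B_i\}_{i=1}^{N} \;\rightarrow\; \{(B^r_j, w_j)\}_{j=1}^{M} \;\rightarrow\; \{(F_j, w_j)\}_{j=1}^{M} \qquad (3)$$
The methods presented in (Ma and Cohen 2007) and in our own work are the most significant ones of the first object signature building approach. These methods differ from each other in the way they define the representative blobs.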

The representative blob detection method proposed by Ma and Cohen (Ma and Cohen 2007) is based on agglomerative hierarchical clustering and on the covariance matrix extracted from the object blobs. This method is composed of the following three steps:
Step 1. Do agglomerative clustering on the original set of object blobs based on the covariance matrix.
Step 2. Remove clusters having a small number of elements.
Step 3. Select one representative blob for each remaining cluster.
The first step aims at forming clusters of similar blobs. The similarity of two blobs is defined using the covariance matrix. The covariance matrix is built over a feature vector f computed for each pixel:
$$f(x, y) = [x,\; y,\; R(x, y),\; G(x, y),\; B(x, y),\; \nabla R^T(x, y),\; \nabla G^T(x, y),\; \nabla B^T(x, y)]$$
where R, G, B are the colorspace axes and x, y are the coordinates of the pixel contributing to the color and the gradient information. The covariance matrix is computed for each detected blob as follows:
$$C = \frac{1}{n - 1} \sum_{k=1}^{n} (f_k - \bar{f})(f_k - \bar{f})^T \qquad (4)$$
where n is the number of pixels in the blob, f_k is the feature vector of the k-th pixel and \bar{f} is the mean feature vector over the blob. The covariance matrices of blobs of different sizes have the same size: the covariance matrix is a D×D matrix, where D is the dimension of the feature vector f.
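The construction of the covariance descriptor can be sketched in plain Python as below. The gradient operator (central differences clamped at the blob border), the 1/(n-1) normalization and the function names are assumptions of this sketch; the source does not specify them.

```python
def pixel_features(image):
    """Build one feature vector per pixel:
    [x, y, R, G, B, dR/dx, dR/dy, dG/dx, dG/dy, dB/dx, dB/dy].

    image: list of rows, each row a list of (R, G, B) tuples.
    Gradients use central differences clamped at the border (assumption).
    """
    h, w = len(image), len(image[0])
    feats = []
    for y in range(h):
        for x in range(w):
            xl, xr = max(x - 1, 0), min(x + 1, w - 1)
            yu, yd = max(y - 1, 0), min(y + 1, h - 1)
            grads = []
            for ch in range(3):
                grads.append((image[y][xr][ch] - image[y][xl][ch]) / 2.0)
                grads.append((image[yd][x][ch] - image[yu][x][ch]) / 2.0)
            feats.append([float(x), float(y), *map(float, image[y][x]), *grads])
    return feats


def covariance_descriptor(feats):
    """Sample covariance matrix (D x D) of a list of D-dimensional vectors."""
    n, d = len(feats), len(feats[0])
    mean = [sum(f[k] for f in feats) / n for k in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in feats:
        diff = [f[k] - mean[k] for k in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b]
    return [[v / (n - 1) for v in row] for row in cov]


# Two synthetic blobs of different sizes yield descriptors of the same size.
small_blob = [[(10 * x, 5 * y, x + y) for x in range(3)] for y in range(2)]
large_blob = [[(3 * x, 7 * y, x * y) for x in range(6)] for y in range(5)]
c_small = covariance_descriptor(pixel_features(small_blob))
c_large = covariance_descriptor(pixel_features(large_blob))
print(len(c_small), len(c_small[0]), len(c_large), len(c_large[0]))
```

With this feature vector the descriptor is an 11×11 symmetric matrix regardless of the blob size, which is what makes it compact to store.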
The distance between two blobs B_i and B_j is calculated from their covariance matrices C_i and C_j as:
$$d(B_i, B_j) = \sqrt{\sum_{k=1}^{D} \ln^2 \lambda_k(C_i, C_j)} \qquad (5)$$
where \lambda_k(C_i, C_j) are the generalized eigenvalues of C_i and C_j. For the agglomerative clustering, the distance d(A, B) between two clusters A and B is computed by average linkage as:
$$d(A, B) = \frac{1}{|A|\,|B|} \sum_{A_i \in A} \sum_{B_j \in B} d(A_i, B_j) \qquad (6)$$
The objective of the second step is to detect and remove outliers, which are clusters containing a small number of elements. The final step determines one representative blob for each cluster. For a cluster B, the representative blob B_l is defined as:
$$l = \arg\min_{i} \sum_{j \neq i} d(B_i, B_j), \quad B_i, B_j \in B \qquad (7)$$
where d(B_i, B_j) is the blob distance defined in Eq. 5. This method can overcome errors of the object detection if they occur in a small number of frames. However, if the detection error occurs in a large number of frames, the cluster containing the blobs of these frames will be considered a valid cluster by this method, since the validity of clusters is decided by their sizes.
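The three steps (average-linkage clustering, removal of small clusters, selection of one representative per cluster) can be sketched as follows, starting from a precomputed matrix of pairwise blob distances. The stopping threshold, the minimum cluster size and the choice of the relative cluster size as blob weight are assumptions of this sketch, not values given by the cited method.

```python
def average_linkage(cluster_a, cluster_b, dist):
    """Average pairwise distance between the members of two clusters."""
    total = sum(dist[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def representative_blobs(dist, merge_threshold, min_cluster_size):
    """Return [(representative_blob_index, weight), ...].

    dist: symmetric matrix of pairwise blob distances.
    merge_threshold, min_cluster_size: hypothetical tuning parameters.
    """
    clusters = [[i] for i in range(len(dist))]
    # Step 1: agglomerative clustering, merging the two closest clusters
    # until the closest pair is farther apart than the threshold.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > merge_threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Step 2: remove small clusters, treated as outliers.
    kept = [c for c in clusters if len(c) >= min_cluster_size]
    # Step 3: the representative is the blob with the smallest summed
    # distance to the other blobs of its cluster.
    total = sum(len(c) for c in kept)
    result = []
    for c in kept:
        rep = min(c, key=lambda i: sum(dist[i][j] for j in c))
        result.append((rep, len(c) / total))  # weight = relative cluster size (assumption)
    return result


# Toy example: blob "appearances" placed on a line; blob 5 is an isolated outlier.
positions = [0, 1, 2, 50, 51, 90]
dist = [[abs(p - q) for q in positions] for p in positions]
print(representative_blobs(dist, merge_threshold=10, min_cluster_size=2))
```

In the toy example the isolated blob is discarded in Step 2, which illustrates why the method tolerates detection errors only when they affect few frames.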
Our work is an improvement of Ma and Cohen's work (Ma and Cohen 2007), based on two remarks. The first remark is that Ma and Cohen's method cannot work well with imperfect object detection, since it processes all object blobs, relevant and irrelevant alike. We can resolve this drawback by removing all irrelevant blobs before doing the agglomerative clustering. The second remark is that a blob of an object is relevant if it contains this object or objects belonging to the same class as this object. For example, a blob of a detected person is relevant if it represents the person class in some way. Based on these remarks, we add two preliminary steps to Ma and Cohen's method. These steps are performed before the first step of Ma and Cohen's method.
Step 0. Classify blobs of all objects into relevant (with the object of interest) and irrelevant blobs (without object of interest) by a two-class SVM classifier with radial basis function (RBF) kernel using edge histograms (Won, Park et al. 2002).
Step 1. Remove irrelevant blobs from the set of blobs for each object.
It is worth noting that the appearance of tracked objects may vary, but their blobs usually have some common visual characteristics (e.g. human shape characteristics for the blobs of different tracked persons). The two added steps remove irrelevant blobs before the agglomerative clustering. Therefore, this object signature building method is robust when working with poor-quality object detection.
The second object signature building approach does not explicitly perform representative blob detection. It attempts to sum up all object appearances into a single signature. This approach is defined as follows:
$$O = \{B_i\}_{i=1}^{N} \;\rightarrow\; F \qquad (8)$$
The work presented in (Calderara, Cucchiara et al. 2006) belongs to this second approach. In this work, the authors propose three notions: person's appearance (PA), single-camera appearance trace (SCAT) and multi-camera appearance trace (MCAT). The SCAT of a person P on camera C_i is composed of all the past person's appearances (PA) of P:
$$SCAT^P_i = \{PA^P_i(t) \mid t = 1, \dots, N^P_i\} \qquad (9)$$
where t represents the samples in time at which the person P was visible from camera C_i and N^P_i is the total number of frames in which the person was visible and detected.
The MCAT of a person P is composed of all the SCAT^P_i for any camera C_i in which, at the current moment, the person P has been detected for at least one frame. SCAT is equivalent to MCAT if the surveillance system has only one camera, and SCAT corresponds to the set of object blobs in our definition (Eq. 1).
The object signature building based on a mixture of Gaussians is performed as follows:
Step 1. Using the first PA in the MCAT, the ten principal modes of the color histogram are extracted.
Step 2. The Gaussians are initialized with a mean μ equal to the color corresponding to the mode and a fixed variance σ²; weights are equally distributed among the Gaussians.
Step 3. Successive PAs belonging to the MCAT are processed to extract again the ten main modes, which are used to update the mixture. Then, for each mode:
(a) its value is checked against the mean of each Gaussian; if for none of them the difference is within 2.5σ of the distribution, the mode generates a new Gaussian (using the same process reported above) that replaces the existing Gaussian with the lowest weight;
(b) otherwise, the Mahalanobis distance is computed for every Gaussian satisfying the above check, and the mode is assigned to the nearest Gaussian. The mean and the variance of the selected Gaussian are updated with the following adaptive equations:
$$\mu_t = (1 - \alpha)\,\mu_{t-1} + \alpha X_t, \qquad \sigma^2_t = (1 - \alpha)\,\sigma^2_{t-1} + \alpha\,(X_t - \mu_t)^T (X_t - \mu_t) \qquad (10)$$
where X_t is the vector with the values corresponding to the mode and α is the fixed learning factor. The weights are also updated, by increasing that of the selected Gaussian and decreasing those of the other Gaussians accordingly.
At the end of this process, ten Gaussians and the corresponding weights are available for each MCAT and are used as the object signature.
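The per-mode update of Step 3 can be sketched as below. Several details are assumptions of this sketch rather than facts from (Calderara, Cucchiara et al. 2006): each Gaussian uses a single isotropic variance, so the Mahalanobis distance reduces to the Euclidean distance divided by σ; the weight update follows the usual adaptive-mixture rule; and a replaced Gaussian keeps the weight of the one it replaces.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: list       # color vector, e.g. [R, G, B]
    variance: float  # single isotropic variance (assumption)
    weight: float


def update_mixture(gaussians, mode, alpha=0.05, init_variance=100.0):
    """Update the mixture with one color mode and return the affected Gaussian."""
    best, best_d = None, None
    for g in gaussians:
        # Mahalanobis distance under the isotropic-variance assumption.
        d = math.dist(mode, g.mean) / math.sqrt(g.variance)
        if d <= 2.5 and (best is None or d < best_d):
            best, best_d = g, d
    if best is None:
        # (a) no Gaussian matches: replace the one with the lowest weight.
        weakest = min(gaussians, key=lambda g: g.weight)
        weakest.mean = [float(v) for v in mode]
        weakest.variance = init_variance
        return weakest
    # (b) adaptive update of the nearest matching Gaussian.
    best.mean = [(1 - alpha) * m + alpha * x for m, x in zip(best.mean, mode)]
    squared_error = sum((x - m) ** 2 for x, m in zip(mode, best.mean))
    best.variance = (1 - alpha) * best.variance + alpha * squared_error
    # Increase the selected weight, decrease the others (assumed rule).
    for g in gaussians:
        g.weight = (1 - alpha) * g.weight + (alpha if g is best else 0.0)
    return best


mixture = [Gaussian([100.0, 100.0, 100.0], 100.0, 0.5),
           Gaussian([200.0, 50.0, 50.0], 100.0, 0.5)]
matched = update_mixture(mixture, (104, 100, 100), alpha=0.1)
replaced = update_mixture(mixture, (0, 255, 0), alpha=0.1)
print(matched, replaced)
```

With this weight rule the weights keep summing to one as long as they did initially, which is convenient when the signature is later compared with another one.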

Object matching
Object matching is the process that computes the similarity/dissimilarity between two objects based on their signatures, calculated by the above-mentioned approaches. In information retrieval in general, and in surveillance object retrieval in particular, given a query, the system will (1) compute the similarity between this query and all elements in the database and (2) return the retrieved results, which are a list of elements sorted by their similarity to the query. The number of returned results is decided for each application.
Corresponding to the two approaches for object signature building, there are two approaches for object matching. Object matching for the first object signature building approach is expressed in Eq. 11:
$$d(O_q, O_p) = d\big(\{(F^q_i, w^q_i)\}_{i=1}^{M_q},\; \{(F^p_j, w^p_j)\}_{j=1}^{M_p}\big) \qquad (11)$$
In this equation, M_q and M_p are the numbers of representative blobs of the objects O_q and O_p respectively. The object matching methods define a similarity/dissimilarity between two sets of blobs. These sets may have different sizes. It is worth noting that we can always compute the similarity/dissimilarity of a pair of blobs based on visual features such as the color histogram or the covariance matrix.
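The generic retrieval loop described above can be sketched as follows. The function name, the dictionary-based database and the `top_k` parameter are illustrative assumptions; any signature distance (for instance one of the object matching distances discussed in this section) can be passed in.

```python
def retrieve(query_signature, database, distance, top_k=10):
    """Rank database objects by increasing distance to the query.

    database: dict mapping object IDs to signatures (hypothetical layout).
    distance: callable taking two signatures and returning a dissimilarity.
    """
    scored = [(distance(query_signature, signature), object_id)
              for object_id, signature in database.items()]
    scored.sort(key=lambda pair: pair[0])
    return [(object_id, d) for d, object_id in scored[:top_k]]


# Toy signatures reduced to one number, with absolute difference as distance.
database = {"obj_a": 1.0, "obj_b": 5.0, "obj_c": 2.0}
ranking = retrieve(1.8, database, lambda s, t: abs(s - t), top_k=2)
print(ranking)
```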
In (Ma and Cohen 2007), the authors define a similarity measure between two objects O_q and O_p using the Hausdorff distance (Eq. 12). The Hausdorff distance is the maximum, over the points of one set, of the distance to the nearest point in the other set:
$$d_H(O_q, O_p) = \max_{i} \min_{j} d(F^q_i, F^p_j) \qquad (12)$$
where d(F^q_i, F^p_j) is the distance between two blobs computed using the covariance matrix.
The above object matching takes into consideration multiple appearance aspects of the object being tracked. However, the Hausdorff distance is not suitable when working with object tracking algorithms having a high value of object ID confusion, because this distance is extremely sensitive to outliers. Even if two sets of points A and B are similar, with all points perfectly superimposed except a single point in A that is far from any point in B, the Hausdorff distance is determined by this single point.
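The outlier sensitivity can be seen on a small example. The sketch below implements the directed Hausdorff distance as described in the text (maximum over one set of the distance to the nearest element of the other set), together with the common symmetric variant; the distance values are made up for illustration.

```python
def directed_hausdorff(dist):
    """dist[i][j]: distance between blob i of O_q and blob j of O_p."""
    return max(min(row) for row in dist)


def symmetric_hausdorff(dist):
    """Maximum of the two directed distances (common symmetric variant)."""
    transposed = [list(column) for column in zip(*dist)]
    return max(directed_hausdorff(dist), directed_hausdorff(transposed))


# Two well-matched blobs, then the same query with one irrelevant blob added
# (for example a blob of another person caused by ID confusion).
clean = [[0.1, 0.9],
         [0.9, 0.1]]
with_outlier = clean + [[5.0, 5.0]]
print(directed_hausdorff(clean), directed_hausdorff(with_outlier))
```

A single irrelevant blob moves the distance from the best-match level to the outlier's own distance, whatever the quality of the other matches.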
In our work, we propose a new object matching method based on the EMD (Earth Mover's Distance) (Rubner, Tomasi et al. 1998). The EMD is widely and successfully applied in image and scripted video retrieval.
Computing the EMD is based on a solution to the classical transportation problem. This is a bipartite network flow problem which can be formalized as the following linear programming problem. Let I be a set of suppliers, J a set of consumers, and c_ij the cost to ship a unit of supply from i ∈ I to j ∈ J. We want to find a set of flows f_ij that minimizes the overall cost:
$$\sum_{i \in I} \sum_{j \in J} c_{ij} f_{ij} \qquad (13)$$
subject to the constraints
$$f_{ij} \ge 0, \qquad \sum_{i \in I} f_{ij} = y_j \;\; (j \in J), \qquad \sum_{j \in J} f_{ij} \le x_i \;\; (i \in I)$$
where x_i is the total supply of supplier i and y_j is the total capacity of consumer j. Once the transportation problem is solved and we have found the optimal flow F* = {f*_ij}, the EMD is defined as:
$$EMD = \frac{\sum_{i \in I} \sum_{j \in J} c_{ij} f^*_{ij}}{\sum_{i \in I} \sum_{j \in J} f^*_{ij}} \qquad (14)$$
When applied to surveillance object matching, the cost c_ij becomes the distance between two blobs, and the total supply x_i and the total capacity y_j are the blob weights. c_ij can be any descriptor distance between two blobs, such as a color histogram distance or a covariance matrix distance.
In comparison with the matching method based on the Hausdorff distance (Ma and Cohen 2007), our matching method based on the EMD possesses two valuable characteristics. Firstly, it considers the participation of each blob in computing the distance, based on its similarity with other blobs and on its weight. Thanks to the representative blob detection method, the blob weight expresses the degree of importance of this blob in the object representation. The proposed matching method ensures only a minor participation of irrelevant blobs produced by errors in object tracking, because these blobs are relatively different from the other blobs and have a small weight. Therefore, the matching method is robust when working with object tracking algorithms having a high value of object ID confusion. Secondly, the proposed object matching allows partial matching.
We analyze here an example of these object matching methods. We want to compute the similarity/dissimilarity between object O_q with 4 representative blobs and object O_p with 5 representative blobs (Fig. 12). The object ID confusion values of the object tracking module for the first object and the second object are 2 and 1 respectively.
In order to carry out object matching, we first need to compute the distance of each pair of blobs. Tab. 1 shows the distance of each pair of blobs computed with the covariance matrix distance (cf. Eq. 5), while Fig. 12 presents the result of the object matching methods. Hausdorff-based object matching is determined by the distance between blob 1 of object O_q and blob 5 of object O_p (dotted line), while EMD-based object matching searches for an optimal solution with the participation of each blob. This example shows how the EMD-based object matching method overcomes the poor object tracking challenge.
With the output of the second object signature building approach, the object matching is relatively simple.
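As a hedged illustration, the transportation problem can be solved for small blob sets with a plain-Python minimum-cost flow (successive shortest paths with Bellman-Ford). This is a sketch under assumptions, not the solver used in the cited works: it ships as much weight as possible, i.e. the smaller of the two total weights, which coincides with the formulation above when the total capacity does not exceed the total supply. The function name and the toy values are hypothetical.

```python
def emd(weights_q, weights_p, cost):
    """Earth Mover's Distance between two weighted blob sets.

    weights_q[i]: weight (supply) of blob i of the query object
    weights_p[j]: weight (capacity) of blob j of the candidate object
    cost[i][j]:   distance between blob i and blob j (non-negative)
    """
    n, m = len(weights_q), len(weights_p)
    source, sink, size = n + m, n + m + 1, n + m + 2
    # Residual graph; each edge is [target, capacity, cost, reverse_edge_index].
    graph = [[] for _ in range(size)]

    def add_edge(u, v, capacity, c):
        graph[u].append([v, capacity, c, len(graph[v])])
        graph[v].append([u, 0.0, -c, len(graph[u]) - 1])

    for i in range(n):
        add_edge(source, i, weights_q[i], 0.0)
        for j in range(m):
            add_edge(i, n + j, float("inf"), cost[i][j])
    for j in range(m):
        add_edge(n + j, sink, weights_p[j], 0.0)

    eps = 1e-12
    total_flow = total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        dist = [float("inf")] * size
        prev = [None] * size
        dist[source] = 0.0
        for _ in range(size - 1):
            updated = False
            for u in range(size):
                if dist[u] == float("inf"):
                    continue
                for k, (v, capacity, c, _rev) in enumerate(graph[u]):
                    if capacity > eps and dist[u] + c < dist[v] - eps:
                        dist[v], prev[v], updated = dist[u] + c, (u, k), True
            if not updated:
                break
        if prev[sink] is None:
            break  # no more weight can be shipped
        # Push the bottleneck amount along the path.
        push, v = float("inf"), sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        total_flow += push
        total_cost += push * dist[sink]
    return total_cost / total_flow if total_flow > 0 else 0.0


# Same toy distances as a two-blob match plus one low-weight irrelevant blob.
cost = [[0.1, 0.9], [0.9, 0.1], [5.0, 5.0]]
print(emd([0.45, 0.45, 0.10], [0.5, 0.5], cost))
```

In this toy case the low-weight irrelevant blob contributes only in proportion to its weight, so the result stays far below the largest pairwise distance in the cost matrix.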

Databases
Despite the fact that a number of surveillance video systems have been deployed, very few surveillance databases are available. One reason is that surveillance videos raise privacy concerns for people and organizations. Recently, several surveillance video databases such as CAVIAR, i-LIDS and CARETAKER have been released for research purposes. CAVIAR (Context Aware Vision using Image-based Active Recognition) is a project funded by the EC's Information Society Technology programme (project IST 2001 37540). This project addresses two surveillance applications: city centre surveillance and marketing. Corresponding to these applications, two databases are available. Video clips in the first database were filmed with a wide-angle camera lens in the entrance lobby of the INRIA Labs at Grenoble (France), while those of the second database were filmed with a wide-angle lens along and across the hallway in a shopping centre in Lisbon (Portugal). Moreover, the videos of these databases are annotated. The 2008 i-LIDS Multiple-Camera Tracking Scenario (MCTS) is a data set with multiple camera views from a busy airport arrival hall (Zheng, Gong et al. 2009). In the context of CARETAKER (Content Analysis and REtrieval Technologies to Apply Extraction to massive Recording), a video surveillance database is also available. This project aims at studying, developing and assessing multimedia knowledge-based content analysis, knowledge extraction components and metadata management sub-systems in the context of automated situation awareness, diagnosis and decision support. During this project, real testbed sites inside the metros of Rome and Turin, involving more than 30 sensors (20 cameras and 10 microphones), were provided.

Surveillance object retrieval results
In recent years, a number of surveillance video retrieval results have been published. However, given the lack of common benchmarks and databases, the comparison of these results is difficult, if not impossible. Two preliminary comparisons of three object signature building and object matching methods on the CAVIAR and CARETAKER datasets have been presented in (Le, Thonnat et al. 2009a). However, these comparisons were done on a relatively small dataset.

Conclusions
In this chapter, a brief overview of surveillance object retrieval is first given. Then, current work dedicated to appearance-based surveillance object retrieval is analysed in detail. The analysis shows that preliminary and promising results have been obtained for surveillance object retrieval. However, it remains a challenging issue that needs more work and contributions on surveillance video analysis, feature extraction and common benchmarks for surveillance object retrieval evaluation.