Querying in information retrieval has to be "user-friendly", that is, formulated in a way that corresponds as closely as possible to the user's cognition and intuition. The text-based paradigm, which assumes that the information the user is looking for can be described with a few keywords or phrases, is mostly invalid when visual information is sought. The global content of an image may easily be described by the names of the objects composing it, but it is not easy to describe exactly the semantic meaning of the image content, especially when the user has only a "fuzzy" idea of what should be looked for (Rahman, 2002).
A query-by-image-example (QBE) paradigm pursues the goal of solving the above-mentioned problem by a more flexible description of the content of an image and a more versatile formulation of a query than textual annotations may permit. The QBE paradigm assumes that the desired content can be conveyed to the system by an example image (or a sketch), to which the images in the database are then compared.
Because it is unlikely that a single example image contains exactly all the features needed, a natural extension of the QBE paradigm is a querying system that accepts multiple examples from the user, gathers all the presented information into a joint pseudo-example, and responds with the best possible matches to this latter image from the database (Rahman, 2002). The QBE-based retrieval of visual information complements text-based querying rather than tending to replace it.
Due to subjective choices and the typical incompleteness of examples, one cannot expect that the CBVIR system will retrieve the correct data at the very first querying step. To improve the result on the basis of a previous search, relevance feedback is frequently used to generalise the QBE. In this case the returned set of retrieved images, classified by the user into positive and negative matches, is used to reformulate the query. Such a dynamic dialogue between the user and the CBVIR system stimulates on-line learning and facilitates the search for the target information.
For the human user, therefore, the main goal of QBE is to provide a more comfortable and intuitive querying paradigm. From a computational viewpoint, the QBE paradigm relies on a detailed automated analysis of the content of the query image(s). From the cognitive side, the QBE paradigm relies on an explicit representation of knowledge about the search domain (Smeulders e.a., 2000). Typical examples of this general knowledge are syntactic (literal), perceptual, and physical criteria of similarity or equality between pixels or features of images, as well as geometric and topological rules describing the equality of and differences between 3D objects depicted in images.
Although the user seeks mostly semantic similarity, the CBVIR can only provide similarity by data processing results (Smeulders e.a., 2000). The challenge for CBVIR engines is to focus on the narrow information domain the user has in mind via specification, examples, and interaction. Early CBVIR engines required users first to select some low-level visual features of interest and then to specify a relative weight for each of their possible representations (Castelli & Bergman, 2002). In this case the user had to know in detail how the features are represented and used in the engine. Such retrieval was also limited by the difficulty of representing semantic content in terms of low-level features and by the highly subjective nature of human visual perception. The user typically forms queries based on semantics, e.g., "penguins on icebergs" or "a sunset image", rather than on low-level features, such as "predominantly oval black blobs on a white background" or "a predominantly red and orange image", respectively. Obviously, these latter queries will retrieve a large number of irrelevant images with similar dominant colours. One should conclude that low-level features alone cannot adequately represent image content. Moreover, due to the subjective nature of human perception, different users, and even the same user under different conditions, may interpret the same image differently. Images are visually perceived as similar due to similar semantic meaning rather than similar low-level features.
This is why recent experimental CBVIR systems such as Photobook (with FourEyes) or MARS are based on an interactive retrieval process where the user's feedback helps the system adjust the query and bring the results closer to the user's expectations.
Architecture of an interactive CBVIR system
An interactive CBVIR system contains an image database, a feature database, a selector of the feature similarity metric, and a block for evaluating feature relevance. When a query arrives, the system has no prior knowledge about it: all features have the same weight and are used to compute the similarity measure. Then a fixed number of top-ranked images (by their similarity to the query) are returned to the user, who provides relevance feedback. Learning algorithms are used in the feature relevance block to re-evaluate the weight of each feature in accord with the user's feedback. The metric selector chooses the best similarity metric for the weighted features by using reinforcement learning. By iteratively accounting for the user's feedback, the system automatically adjusts the query and brings the retrieval results closer to the user's expectations. At the same time, the user need not map semantic concepts onto features or specify weights. Instead, the user only informs the system which images are relevant to the query, and the weight of each feature in the similarity computation is iteratively updated in accord with high-level and subjective human perception.
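As an illustration, a minimal sketch of such an initial retrieval round is given below, assuming that every image is represented by a fixed-length feature vector stored in a NumPy array; the weighted Euclidean distance and all function names here are illustrative assumptions, not part of any particular CBVIR engine.

```python
import numpy as np

def weighted_distance(query, features, weights):
    """Weighted Euclidean distance between the query feature vector and one
    database feature vector; smaller values mean higher similarity."""
    diff = query - features
    return np.sqrt(np.sum(weights * diff * diff))

def retrieve_top_k(query, feature_db, weights, k=10):
    """Rank all database images by weighted distance to the query and return
    the indices of the k best matches."""
    distances = np.array([weighted_distance(query, f, weights) for f in feature_db])
    return np.argsort(distances)[:k]

# Initial round: the system has no prior knowledge about the query,
# so all features contribute equally to the similarity measure.
n_features = 64
weights = np.ones(n_features) / n_features
```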
Interactive retrieval based on relevance feedback results in a two-stage process of formulating a query. At the first stage, the user submits an initial query, e.g., an example image, and the system returns a fixed number of top-ranked matches computed with equal feature weights.
At the second stage, the user gives some positive and negative feedback to the system by labelling the retrieved images in accord with their relevance to the user's expectations, e.g., as highly relevant, relevant, neutral, irrelevant, or highly irrelevant retrieval results. The CBVIR system then processes both the query and the labelled retrieved images in order to update the feature weights and choose a more adequate similarity metric such that the irrelevant outputs are suppressed and the relevant ones are enhanced (Castelli & Bergman, 2002). For instance, if the range of feature values for the relevant images is similar to that for the irrelevant ones, this feature cannot effectively separate the two image sets and its weight should decrease. But if the "relevant" values vary in a relatively small range containing no or almost no "irrelevant" values, the feature is crucial and its weight should increase.
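One common way to implement this re-weighting heuristic (used, for instance, in MARS-like systems) is to make a feature's weight inversely proportional to the spread of its values over the images the user marked as relevant; the sketch below assumes the feature database is a NumPy array of per-image feature vectors and is only one of many possible update rules.

```python
import numpy as np

def update_weights(feature_db, relevant_ids, eps=1e-6):
    """Re-estimate feature weights from relevance feedback: a feature whose
    values vary little over the relevant images separates them well from the
    rest, so its weight grows; a feature spreading widely over the relevant
    set is down-weighted."""
    relevant = feature_db[relevant_ids]      # feature vectors of the relevant images
    spread = relevant.std(axis=0) + eps      # per-feature spread over the relevant set
    weights = 1.0 / spread                   # small spread -> large weight
    return weights / weights.sum()           # normalise so the weights sum to one
```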
To evaluate the performance of a QBE-based CBVIR system, a test bed should be available that contains a collection of images, a set of benchmark queries referring to the test bed data, and a "ground-truth" quantitative assessment of the relevance of each image to each benchmark query (Castelli & Bergman, 2002). The retrieval performance takes into account the numbers of relevant and irrelevant results presented to the user.
Let a benchmark query to a CBVIR system return the output images in rank order, and let Wr in the range [0,1] denote the quantitative relevance of the image of rank r to that query. Then, for each cutoff number n, n = 1, ..., N, where N is the size of the given test bed, the relevance and irrelevance of the retrieval output are evaluated with the following four characteristics:
An = W1 + ... + Wn
Bn = n - An
Cn = Wn+1 + ... + WN
Dn = N - n - Cn
These values suggest three retrieval effectiveness measures: recall Rn = An / (An + Cn), i.e., the fraction of the total relevance of the test bed captured by the n retrieved images; precision Pn = An / n, i.e., the average relevance of the retrieved images; and fallout Fn = Bn / (Bn + Dn), i.e., the fraction of the total irrelevance captured by the retrieved images.
The retrieval performance of a system can be evaluated on the basis of recall and precision averaged over all the benchmark queries.
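Under the definitions above, these characteristics and the derived measures can be computed directly from the ranked list of ground-truth relevances, as in the following sketch (the function name and the fallback value for an empty denominator are illustrative choices).

```python
import numpy as np

def retrieval_measures(relevances, n):
    """relevances: ground-truth relevance W_r in [0, 1] of every test-bed image,
    listed in the rank order produced by the system (length N).
    n: cutoff, i.e. the number of top-ranked images shown to the user."""
    W = np.asarray(relevances, dtype=float)
    N = len(W)
    A_n = W[:n].sum()              # total relevance of the n retrieved images
    B_n = n - A_n                  # total irrelevance of the retrieved images
    C_n = W[n:].sum()              # total relevance of the non-retrieved images
    D_n = N - n - C_n              # total irrelevance of the non-retrieved images
    recall = A_n / (A_n + C_n) if A_n + C_n > 0 else 0.0
    precision = A_n / n
    fallout = B_n / (B_n + D_n) if B_n + D_n > 0 else 0.0
    return recall, precision, fallout
```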
The MPEG-7 standard, "Multimedia Content Description Interface", developed by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO), pursues the goal of providing semantic descriptions of multimedia data, such as still images, graphics, 3D models, audio, speech, and video. It provides standard representations of image features, extracted objects, and object relationships to enable fast and effective content-based multimedia searching and filtering, namely, standardised descriptors for colour, texture, shape, motion, and other features of audiovisual data. The development of this standard has resulted in a variety of proposals to describe and structure multimedia information.
The search process includes feature extraction and standardised description, although MPEG-7 does not specify feature extraction algorithms and only defines the standard description to be fed to a search engine. The description framework consists of a set of descriptors (Ds), a set of description schemes (DSs), a description definition language (DDL) to specify description schemes, and one or more schemes for encoding the description.
A descriptor represents a feature and defines the syntax and semantics of the feature representation. Multiple descriptors can be assigned to a single feature; e.g., the colour feature has descriptors such as the average colour, the dominant colour, and the colour histogram.
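For illustration only, the following sketch computes these three colour descriptors for an RGB image in a simple, non-normative way; it does not reproduce the exact MPEG-7 descriptor definitions, and the 8-bins-per-channel quantisation is an arbitrary assumption.

```python
import numpy as np

def colour_descriptors(image, n_bins=8):
    """Average colour, dominant colour, and colour histogram of an RGB image
    given as an H x W x 3 array with values in 0..255."""
    pixels = image.reshape(-1, 3).astype(float)
    average_colour = pixels.mean(axis=0)

    # Colour histogram: quantise each channel into n_bins bins and count the
    # pixels falling into each of the n_bins**3 cells of the RGB cube.
    q = np.minimum((pixels / 256.0 * n_bins).astype(int), n_bins - 1)
    cells = q[:, 0] * n_bins * n_bins + q[:, 1] * n_bins + q[:, 2]
    histogram = np.bincount(cells, minlength=n_bins ** 3) / len(pixels)

    # Dominant colour: centre of the most populated histogram cell.
    top = histogram.argmax()
    idx = np.array([top // (n_bins * n_bins), (top // n_bins) % n_bins, top % n_bins])
    dominant_colour = (idx + 0.5) * 256.0 / n_bins

    return average_colour, dominant_colour, histogram
```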
A description scheme specifies the structure and semantics of the relationship between various elements (both descriptors and description schemes). Description schemes refer to image classification or describe image contents with specific indexing structures. For example, a classification DS contains the description tools that allow image classification, and a segment DS represents a section of a multimedia content unit.
The description definition language (DDL) allows one to create new description schemes and descriptors or to extend existing ones. The eXtensible Markup Language (XML) Schema with MPEG-7-specific extensions is adopted as the DDL.
The MPEG-7 framework allows many ways to encode the description, specifies a standard set of descriptors for describing various types of multimedia information, and provides standards for defining other descriptors as well as structures for descriptors and their relationships. The description is associated with the content itself, thus allowing fast and efficient search for material of interest. In particular, images carrying MPEG-7 annotations become more self-describing of their content, enabling richer descriptions of features, structure, and semantic information (Castelli & Bergman, 2002).
Images represent information at multiple levels, starting from the most basic level of pixelwise responses to light (intensities or colours). The pixel patterns produce more general low-level elements such as colour regions, texture, motion (inter-frame changes in a video sequence), shapes (object boundaries), and so on. No special knowledge is involved at these levels. But at the most complex level, images represent abstract ideas depending on individual knowledge, experience, and even on a particular mood (Castelli & Bergman, 2002).
Image syntax refers to perceived visual elements and their spatial-temporal arrangement with no consideration of the meaning of the elements or arrangements, whereas semantics deals precisely with the meaning. Syntax can be met at several perceptual levels, from simple global colour and texture to local geometric forms, such as lines and circles. Semantics can also be treated at different levels.
Objects depicted in images are characterised both with general concepts and with visual concepts. These concepts are different and may vary among individuals. A visual concept includes only visual attributes, whereas a general concept refers to any kind of attributes. In CBVIR, different users have different concepts of even simple objects, and even simple objects can be seen at different conceptual levels. Specifically, general concepts help to answer the question "What is it?", whereas visual concepts help to answer the question "What does it look like?"
General and visual attributes used by different individuals to describe the same object (a ball).
The above figure shows attributes selected by different individuals, namely, a volleyball player (the left circle) and a baseball player (the right circle), for describing the same object. The volleyball player and the baseball player choose "soft, yellow, round, leather, light weight" and "hard, heavy, white, round, leather" as the general attributes, respectively, because the two individuals have different general concepts of a ball. Naturally, there is also a correlation between some visual and general attributes (e.g., big and round). Thus, in creating conceptual indexing structures one needs to discriminate between visual and nonvisual content. The visual content of an image corresponds to directly observed items such as lines, shapes, colours, objects, and so on. The nonvisual content corresponds to information that is closely related to, but not present in, the image. QBE relates primarily to the visual content, although an indexing structure for the nonvisual content is also of notable practical interest. Generally, the visual content is a multilevel structure where the lower levels refer to syntax rather than semantics. The pyramidal indexing structure below was developed at Columbia University, New York, USA, and proposed to MPEG-7 (Castelli & Bergman, 2002).
The pyramidal indexing structure
(the width of each layer represents the amount of knowledge
that is necessary for operating at that level).
The bottom four levels focus on image perception, require no knowledge of actual objects to index an image, and involve only low-level processing. At the most basic level of types (categories) and techniques, the user is interested in the general visual characteristics of the image or the video sequence such as painting, drawing, black and white photo, or colour photo. Digital images may include additional descriptions such as number of colours, compression scheme, resolution, and so on.
The type and technique level provides general information about the image or video sequence but gives almost no information about the visual content. The next, global distribution level provides a global description of the image as a whole, without detecting and processing individual components of the content. Global distribution perceptual features include global colour (e.g., dominant colour, average colour, or colour histogram), global texture in terms of coarseness, contrast, directionality, or other descriptors, and global shape (e.g., aspect ratio). For video data, the features also include global motion (e.g., speed, acceleration, and trajectory), camera motion, global deformation (e.g., growing speed), and temporal and spatial dimensions. Some global characteristics are less intuitive than others; for example, it is difficult for a human to imagine what the colour histogram of an image looks like. Nonetheless, these global low-level features have been successfully used in various CBVIR systems, e.g., QBIC, WebSEEk, or Virage, to perform QBE and to organise the contents of a database for browsing.
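A typical way such a global colour histogram supports QBE is histogram intersection: the query histogram is compared with every database histogram and the images are ranked by the overlap. The sketch below uses hypothetical function names and assumes the histograms are already normalised.

```python
import numpy as np

def histogram_intersection(h_query, h_image):
    """Similarity of two normalised colour histograms: 1.0 for identical
    distributions, 0.0 when no colours are shared."""
    return np.minimum(h_query, h_image).sum()

def rank_by_global_colour(h_query, histogram_db):
    """Order database images by decreasing global-colour similarity to the query."""
    scores = np.array([histogram_intersection(h_query, h) for h in histogram_db])
    return np.argsort(-scores)
```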
At the higher, local structure level, basic image components, or basic syntax symbols, are extracted by low-level processing. The elements include dots, lines, and texture, as well as temporal and spatial position, local colour, local motion, local deformation, local shape, and 2D geometry. Such elements have also been used in CBVIR systems, mainly in query-by-user-sketch interfaces. This level operates on basic elements that represent objects and may include some simple shapes such as a circle, an ellipse, or a polygon.
While local structure is given by basic elements, global composition refers to the arrangement (spatial positioning) of these elements in terms of general concepts such as balance, symmetry, region or centre of viewing, dot, leading line, viewing angle, and so forth. This level involves no specific objects and considers only basic elements or their groups. An image is represented by a structured set of basic forms - lines, squares, circles, etc.
Although perceptual aspects of the image are easier for automatic indexing and classification, humans rely mainly on higher-level attributes when describing, classifying, and searching for images. The level of generic objects accounts for the object attributes that are common to all or most members of a category. The objects are recognised using only such general knowledge. The level of generic scenes uses general knowledge to index an image as a whole, based on all the objects it contains. Both these levels need powerful techniques of object detection and recognition. But in spite of current advances in pattern recognition and computer vision, recognition systems still have serious limitations regarding CBVIR because of a number of additional factors complicating the recognition process, in particular, varying illumination of objects, shadows, occlusions, specular reflections, different scales, large changes of viewpoint, intensive noise, arbitrary backgrounds, and clutter (foreign objects) making feature extraction more difficult.
Even more difficulties arise at the levels of specific objects and specific scenes, where specific knowledge of individual objects and groups of objects is required. In indexing, the correct relationship between the generic and specific labels should be maintained. In particular, consistency of indexing is to be preserved, e.g., by using special templates and vocabularies. The levels of abstract objects and abstract scenes are the most challenging for indexing because they are very subjective (the interpretative knowledge of what the objects or scenes represent varies greatly among different users).
Visual and nonvisual information to semantically characterise an image or its parts.
Relationships between elements of visual content within each level are also of two types: syntactic (related to perception) and semantic (related to meaning). Syntactic relationships, such as spatial, temporal, or photometric (visual) ones, may occur at any level, but semantic relationships occur only at the semantic levels 5 through 10 of the above pyramid.
Due to the difficulties in formally specifying semantics, most existing CBVIR systems operate at the syntactic levels. In particular, WebSEEk and QBIC exploit only level 1 (type and technique) and level 2 (global distribution), VideoQ also involves level 3 (local structure), and Virage adds level 4 (global composition). Only a few experiments have attempted to account at least for the generic levels of semantics.