Querying in information retrieval has to be "user-friendly", that is, formulated in a way that corresponds as closely as possible to the user's cognition and intuition. The text-based paradigm, which assumes that the information the user is looking for can be described with a few keywords or phrases, is mostly invalid when visual information is sought. The global content of an image may easily be described by the names of the objects composing it, but it is not easy to describe exactly the semantic meaning of the image content, especially when the user has only a "fuzzy" idea of what should be looked for (Rahman, 2002).
A query-by-image-example (QBE) paradigm pursues the goal of solving the above-mentioned problem by a more flexible description of the content of an image and a more versatile formulation of a query than textual annotations may permit. The QBE paradigm assumes that the desired content can be conveyed to the system by an example image (or a sketch), to which the images in the database are then compared.
Because it is unlikely that a single example image contains exactly all the features needed, a natural extension of the QBE paradigm is a querying system that accepts multiple examples from the user, gathers all the presented information into a joint pseudo-example, and responds with the best possible matches to this latter image from the database (Rahman, 2002). The QBE-based retrieval of visual information complements text-based querying rather than tending to replace it.
Due to subjective choices and the typical incompleteness of examples, one cannot expect that the CBVIR system will retrieve the correct data at the very first querying step. To improve the result on the basis of a previous search, relevance feedback is frequently used to generalise the QBE. In this case the returned set of retrieved images, classified by the user into positive and negative matches, is used to reformulate the query. Such a dynamic dialogue between the user and the CBVIR system stimulates on-line learning and facilitates the search for the target information.
For the human user, therefore, the main goal of QBE is to provide a more comfortable and intuitive querying paradigm. From a computational viewpoint, the QBE paradigm relies on a detailed automated analysis of the content of the query image(s). From the cognitive side, the QBE paradigm relies on an explicit representation of knowledge about the search domain (Smeulders e.a., 2000). Typical examples of this general knowledge are syntactic (literal), perceptual, and physical criteria of similarity or equality between pixels or features of images, as well as geometric and topological rules describing the equality of and differences between 3D objects depicted in images.
Although the user seeks mostly semantic similarity, the CBVIR can only provide similarity by data processing results (Smeulders e.a., 2000). The challenge for CBVIR engines is to focus on the narrow information domain the user has in mind via specification, examples, and interaction. Early CBVIR engines required users first to select some low-level visual features of interest and then to specify a relative weight for each of their possible representations (Castelli & Bergman, 2002). In this case the user had to know in detail how the features are represented and used in the engine. Such retrieval was also limited by the difficulty of representing semantic content in terms of low-level features and by the highly subjective nature of human visual perception. The user typically forms queries based on semantics, e.g., "penguins on icebergs" or "a sunset image", rather than on low-level features, such as "predominantly oval black blobs on a white background" or "a predominantly red and orange image", respectively. Obviously, these latter queries will retrieve a large number of irrelevant images with similar dominant colours. One should conclude that low-level features alone cannot adequately represent image content. Moreover, due to the subjective nature of human perception, different users, and even the same user under different conditions, may interpret the same image differently. Images are visually perceived as similar due to similar semantic meaning rather than similar low-level features.
This is why recent experimental CBVIR systems such as Photobook (with FourEyes) or MARS are based on an interactive retrieval process where the user's feedback helps the system adjust the query and bring the results closer to the user's expectations.
Architecture of an interactive CBVIR system
An interactive CBVIR system contains an image database, a feature database, a selector of the feature similarity metric, and a block for evaluating feature relevance. When a query arrives, the system has no prior knowledge about it: all features have the same weight and are used to compute the similarity measure. Then a fixed number of top-ranked images (by their similarity to the query) are returned to the user, who provides relevance feedback. Learning algorithms are used in the feature relevance block to re-evaluate the weight of each feature in accord with the user's feedback. The metric selector chooses the best similarity metric for the weighted features by using reinforcement learning. By iteratively accounting for the user's feedback, the system automatically adjusts the query and brings the retrieval results closer to the user's expectations. At the same time, the user need not map semantic concepts onto features or specify weights. Instead, the user only informs the system which images are relevant to the query, and the weight of each feature in the similarity computation is iteratively updated in accord with high-level and subjective human perception.
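As an illustration, a minimal sketch of such an initial retrieval round is given below, assuming that every image is represented by a fixed-length feature vector stored in a NumPy array; the weighted Euclidean distance and all function names here are illustrative assumptions, not part of any particular CBVIR engine.

```python
import numpy as np

def weighted_distance(query, features, weights):
    """Weighted Euclidean distance between the query feature vector and one
    database feature vector; smaller values mean higher similarity."""
    diff = query - features
    return np.sqrt(np.sum(weights * diff * diff))

def retrieve_top_k(query, feature_db, weights, k=10):
    """Rank all database images by weighted distance to the query and return
    the indices of the k best matches."""
    distances = np.array([weighted_distance(query, f, weights) for f in feature_db])
    return np.argsort(distances)[:k]

# Initial round: the system has no prior knowledge about the query,
# so all features contribute equally to the similarity measure.
n_features = 64
weights = np.ones(n_features) / n_features
```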
Interactive retrieval based on relevance feedback results in a two-stage process of formulating a query. At the first stage, the user submits an initial query, e.g., an example image, and the system returns a fixed number of top-ranked matches computed with equal feature weights.
At the second stage, the user gives some positive and negative feedback to the system by labelling the retrieved images in accord with their relevance to the user's expectations, e.g., as highly relevant, relevant, neutral, irrelevant, or highly irrelevant retrieval results. The CBVIR system then processes both the query and the labelled retrieved images in order to update the feature weights and choose a more adequate similarity metric such that the irrelevant outputs are suppressed and the relevant ones are enhanced (Castelli & Bergman, 2002). For instance, if the range of feature values for the relevant images is similar to that for the irrelevant ones, this feature cannot effectively separate the two image sets and its weight should decrease. But if the "relevant" values vary in a relatively small range containing no or almost no "irrelevant" values, the feature is crucial and its weight should increase.
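One common way to implement this re-weighting heuristic (used, for instance, in MARS-like systems) is to make a feature's weight inversely proportional to the spread of its values over the images the user marked as relevant; the sketch below assumes the feature database is a NumPy array of per-image feature vectors and is only one of many possible update rules.

```python
import numpy as np

def update_weights(feature_db, relevant_ids, eps=1e-6):
    """Re-estimate feature weights from relevance feedback: a feature whose
    values vary little over the relevant images separates them well from the
    rest, so its weight grows; a feature spreading widely over the relevant
    set is down-weighted."""
    relevant = feature_db[relevant_ids]      # feature vectors of the relevant images
    spread = relevant.std(axis=0) + eps      # per-feature spread over the relevant set
    weights = 1.0 / spread                   # small spread -> large weight
    return weights / weights.sum()           # normalise so the weights sum to one
```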
To evaluate the performance of a QBE-based CBVIR system, a test bed should be available that contains a collection of images, a set of benchmark queries referring to the test bed data, and a "ground-truth" quantitative assessment of the relevance of each image to each benchmark query (Castelli & Bergman, 2002). The retrieval performance takes into account the numbers of relevant and irrelevant results presented to the user.
Let a benchmark query to a CBVIR system return the output images in rank order, and let Wr in the range [0,1] denote the quantitative relevance of the image of rank r to that query. Then, for each cutoff number n, n = 1, ..., N, where N is the size of the given test bed, the relevance and irrelevance of the retrieval output are evaluated with the following four characteristics:
An = W1 + ... + Wn
Bn = n - An
Cn = Wn+1 + ... + WN
Dn = N - n - Cn
These values suggest three retrieval effectiveness measures: recall Rn = An / (An + Cn), i.e., the fraction of the total relevance of the test bed captured by the n retrieved images; precision Pn = An / n, i.e., the average relevance of the retrieved images; and fallout Fn = Bn / (Bn + Dn), i.e., the fraction of the total irrelevance captured by the retrieved images.
The retrieval performance of a system can be evaluated on the basis of recall and precision averaged over all the benchmark queries.
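Under the definitions above, these characteristics and the derived measures can be computed directly from the ranked list of ground-truth relevances, as in the following sketch (the function name and the fallback value for an empty denominator are illustrative choices).

```python
import numpy as np

def retrieval_measures(relevances, n):
    """relevances: ground-truth relevance W_r in [0, 1] of every test-bed image,
    listed in the rank order produced by the system (length N).
    n: cutoff, i.e. the number of top-ranked images shown to the user."""
    W = np.asarray(relevances, dtype=float)
    N = len(W)
    A_n = W[:n].sum()              # total relevance of the n retrieved images
    B_n = n - A_n                  # total irrelevance of the retrieved images
    C_n = W[n:].sum()              # total relevance of the non-retrieved images
    D_n = N - n - C_n              # total irrelevance of the non-retrieved images
    recall = A_n / (A_n + C_n) if A_n + C_n > 0 else 0.0
    precision = A_n / n
    fallout = B_n / (B_n + D_n) if B_n + D_n > 0 else 0.0
    return recall, precision, fallout
```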
The MPEG-7 standard, "Multimedia Content Description Interface", developed by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO), pursues the goal of providing semantic descriptions of multimedia data, such as still images, graphics, 3D models, audio, speech, and video. It provides standard representations of image features, extracted objects, and object relationships to enable fast and effective content-based multimedia searching and filtering, namely, standardised descriptors for colour, texture, shape, motion, and other features of audiovisual data. The development of this standard has resulted in a variety of proposals to describe and structure multimedia information.
The search process includes feature extraction and standardised description, although MPEG-7 does not specify feature extraction algorithms and only defines the standard description to be fed to a search engine. The description framework consists of a set of descriptors (Ds), a set of description schemes (DSs), a description definition language (DDL) to specify description schemes, and one or more schemes for encoding the description.
A descriptor represents a feature and defines the syntax and semantics of the feature representation. Multiple descriptors can be assigned to a single feature; e.g., the colour feature has descriptors such as the average colour, the dominant colour, and the colour histogram.
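For illustration only, the following sketch computes these three colour descriptors for an RGB image in a simple, non-normative way; it does not reproduce the exact MPEG-7 descriptor definitions, and the 8-bins-per-channel quantisation is an arbitrary assumption.

```python
import numpy as np

def colour_descriptors(image, n_bins=8):
    """Average colour, dominant colour, and colour histogram of an RGB image
    given as an H x W x 3 array with values in 0..255."""
    pixels = image.reshape(-1, 3).astype(float)
    average_colour = pixels.mean(axis=0)

    # Colour histogram: quantise each channel into n_bins bins and count the
    # pixels falling into each of the n_bins**3 cells of the RGB cube.
    q = np.minimum((pixels / 256.0 * n_bins).astype(int), n_bins - 1)
    cells = q[:, 0] * n_bins * n_bins + q[:, 1] * n_bins + q[:, 2]
    histogram = np.bincount(cells, minlength=n_bins ** 3) / len(pixels)

    # Dominant colour: centre of the most populated histogram cell.
    top = histogram.argmax()
    idx = np.array([top // (n_bins * n_bins), (top // n_bins) % n_bins, top % n_bins])
    dominant_colour = (idx + 0.5) * 256.0 / n_bins

    return average_colour, dominant_colour, histogram
```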
A description scheme specifies the structure and semantics of the relationship between various elements (both descriptors and description schemes). Description schemes refer to image classification or describe image contents with specific indexing structures. For example, a classification DS contains the description tools that allow image classification, and a segment DS represents a section of a multimedia content unit.
The description definition language (DDL) allows one to create new description schemes and descriptors or to extend existing ones. The eXtensible Markup Language (XML) Schema with MPEG-7-specific extensions is adopted as the DDL.
The MPEG-7 framework allows many ways to encode the description, specifies a standard set of descriptors for describing various types of multimedia information, and provides standards for defining other descriptors as well as structures for descriptors and their relationships. The description is associated with the content itself, thus allowing fast and efficient search for material of interest. In particular, images carrying MPEG-7 annotations become more self-describing of their content, enabling richer descriptions of features, structure, and semantic information (Castelli & Bergman, 2002).
Images represent information at multiple levels, starting from the most basic level of pixelwise responses to light (intensities or colours). The pixel patterns produce more general low-level elements such as colour regions, texture, motion (inter-frame changes in a video sequence), shapes (object boundaries), and so on. No special knowledge is involved at these levels. But at the most complex level, images represent abstract ideas depending on individual knowledge, experience, and even on a particular mood (Castelli & Bergman, 2002).
Image syntax refers to perceived visual elements and their spatial-temporal arrangement with no consideration of the meaning of the elements or arrangements, whereas semantics deals precisely with the meaning. Syntax can be met at several perceptual levels, from simple global colour and texture to local geometric forms, such as lines and circles. Semantics can also be treated at different levels.
Objects depicted in images are characterised both with general concepts and with visual concepts. These concepts are different and may vary among individuals. A visual concept includes only visual attributes, whereas a general concept refers to any kind of attributes. In CBVIR, different users have different concepts of even simple objects, and even simple objects can be seen at different conceptual levels. Specifically, general concepts help to answer the question "What is it?", whereas visual concepts help to answer the question "What does it look like?"
General and visual attributes used by different individuals to describe the same object (a ball).
The above figure shows attributes selected by different individuals, namely, a volleyball player (the left circle) and a baseball player (the right circle), for describing the same object. The volleyball player and the baseball player choose "soft, yellow, round, leather, light weight" and "hard, heavy, white, round, leather" as the general attributes, respectively, because the two individuals have different general concepts of a ball. Naturally, there is also a correlation between some visual and general attributes (e.g., big and round). Thus, in creating conceptual indexing structures one needs to discriminate between visual and nonvisual content. The visual content of an image corresponds to directly observed items such as lines, shapes, colours, objects, and so on. The nonvisual content corresponds to information that is closely related to, but not present in, the image. QBE relates primarily to the visual content, although an indexing structure for the nonvisual content is also of notable practical interest. Generally, the visual content is a multilevel structure where the lower levels refer to syntax rather than semantics. The pyramidal indexing structure below was developed at Columbia University, New York, USA, and proposed to MPEG-7 (Castelli & Bergman, 2002).
The pyramidal indexing structure
(the width of each layer represents the amount of knowledge
that is necessary for operating at that level).
The bottom four levels focus on image perception, require no knowledge of actual objects to index an image, and involve only low-level processing. At the most basic level of types (categories) and techniques, the user is interested in the general visual characteristics of the image or the video sequence such as painting, drawing, black and white photo, or colour photo. Digital images may include additional descriptions such as number of colours, compression scheme, resolution, and so on.
The type and technique level provides general information about the image or video sequence but gives almost no information about the visual content. The next, global distribution level provides a global description of the image as a whole, without detecting and processing individual components of the content. Global distribution perceptual features include global colour (e.g., dominant colour, average colour, or colour histogram), global texture in terms of coarseness, contrast, directionality, or other descriptors, and global shape (e.g., aspect ratio). For video data, the features also include global motion (e.g., speed, acceleration, and trajectory), camera motion, global deformation (e.g., growing speed), and temporal and spatial dimensions. Some global characteristics are less intuitive than others; for example, it is difficult for a human to imagine what the colour histogram of an image looks like. Nonetheless, these global low-level features have been successfully used in various CBVIR systems, e.g., QBIC, WebSEEk, or Virage, to perform QBE and to organise the contents of a database for browsing.
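A typical way such a global colour histogram supports QBE is histogram intersection: the query histogram is compared with every database histogram and the images are ranked by the overlap. The sketch below uses hypothetical function names and assumes the histograms are already normalised.

```python
import numpy as np

def histogram_intersection(h_query, h_image):
    """Similarity of two normalised colour histograms: 1.0 for identical
    distributions, 0.0 when no colours are shared."""
    return np.minimum(h_query, h_image).sum()

def rank_by_global_colour(h_query, histogram_db):
    """Order database images by decreasing global-colour similarity to the query."""
    scores = np.array([histogram_intersection(h_query, h) for h in histogram_db])
    return np.argsort(-scores)
```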
At the higher, local structure level, basic image components, or basic syntax symbols, are extracted by low-level processing. The elements include dots, lines, and texture, as well as temporal and spatial position, local colour, local motion, local deformation, local shape, and 2D geometry. Such elements have also been used in CBVIR systems, mainly in query-by-user-sketch interfaces. This level operates on basic elements that represent objects and may include some simple shapes such as a circle, an ellipse, or a polygon.
While local structure is given by basic elements, global composition refers to the arrangement (spatial positioning) of these elements in terms of general concepts such as balance, symmetry, region or centre of viewing, dot, leading line, viewing angle, and so forth. This level involves no specific objects and considers only basic elements or their groups. An image is represented by a structured set of basic forms - lines, squares, circles, etc.
Although perceptual aspects of the image are easier for automatic indexing and classification, humans rely mainly on higher-level attributes when describing, classifying, and searching for images. The level of generic objects accounts for the object attributes that are common to all or most members of a category. The objects are recognised using only such general knowledge. The level of generic scenes uses general knowledge to index an image as a whole, based on all the objects it contains. Both these levels need powerful techniques of object detection and recognition. But in spite of current advances in pattern recognition and computer vision, recognition systems still have serious limitations regarding CBVIR because of a number of additional factors complicating the recognition process, in particular, varying illumination of objects, shadows, occlusions, specular reflections, different scales, large changes of viewpoint, intensive noise, arbitrary backgrounds, and clutter (foreign objects) making feature extraction more difficult.
Even more difficulties arise at the levels of specific objects and specific scenes, where specific knowledge of individual objects and groups of objects is required. In indexing, the correct relationship between the generic and specific labels should be maintained. In particular, consistency of indexing is to be preserved, e.g., by using special templates and vocabularies. The levels of abstract objects and abstract scenes are the most challenging for indexing because they are very subjective (the interpretative knowledge of what the objects or scenes represent varies greatly among different users).
Visual and nonvisual information to semantically characterise an image or its parts.
Relationships between elements of visual content within each level are also of two types: syntactic (related to perception) and semantic (related to meaning). Syntactic relationships, such as spatial, temporal, or photometric (visual) ones, may occur at any level, but semantic relationships occur only at the semantic levels 5 through 10 of the above pyramid.
Due to the difficulties in formally specifying semantics, most existing CBVIR systems operate at the syntactic levels. In particular, WebSEEk and QBIC exploit only level 1 (type and technique) and level 2 (global distribution), VideoQ also involves level 3 (local structure), and Virage adds level 4 (global composition). Only a few experiments have attempted to account at least for the generic levels of semantics.