The second half of the 20th century and the beginning of the third millennium can be described as the information era of mankind's development because of the enormous impact of information on the human lifestyle and way of thinking. Permanent and intensive exposure to information broadens people's views and deepens their knowledge and awareness of the environment and the world in general. This process globalises society and creates new living and educational standards (Hanjalic e.a., 2000).
The information era is a consequence of the digital revolution, which started in the 1940s and 1950s and has built up continuously ever since. Representation of information in digital form allows for lossless data compression, or lossy compression with a low loss of quality, which in turn greatly reduces the time and channel capacity needed for data transmission and the space required for data storage. The possibility to combine and transmit or process different types of information such as audio, visual, or textual data without quality loss, together with the permanently increasing performance-to-price ratio of digital transmission, storage, and processing, resulted in the advent and continuous advance of multimedia systems and applications. Today's digital telecommunication networks such as the Internet provide extremely high-speed information transfer, frequently called the "information superhighway".
Digital imagery is the most important of the multimedia data types. While video and audio data are used predominantly in entertainment and news applications, images are actively exploited in almost all human activities (Shih, 2002).
The terms Internet and World Wide Web are not synonymous, although they describe two related things. The Internet is a massive networking infrastructure linking millions of computers together globally. In this network any computer can communicate with any other computer as long as they are both connected to the Internet. More than 100 countries are linked into exchanges of data, news, and opinions. Unlike online services, which are centrally controlled, the Internet is decentralised by design. Each Internet computer, called a host, is independent. Its operator chooses which Internet services to use and which local services to make available to other computers. "Remarkably, this anarchy by design works exceedingly well" (Webopedia, 2002).
Information travels over the Internet via a variety of languages known as protocols. A protocol consists of a set of conventions or rules, which govern communications by allowing networks to interconnect and ensuring compatibility between devices of different manufacturers. Examples of the protocols are:
Protocol | Full name | Purpose
---|---|---
TCP | Transmission Control Protocol | converts messages into streams of packets at the source and reassembles them back into messages at the destination
IP | Internet Protocol | handles addressing and routing of packets across multiple nodes and even multiple networks with multiple standards
TCP/IP | combined TCP and IP | the basic protocol suite over which most Internet traffic runs
FTP | File Transfer Protocol | transfers files from one computer to another; based on the TCP/IP protocol
HTTP | Hypertext Transfer Protocol | transfers compound documents with links; based on the TCP/IP protocol
IPP | Internet Printing Protocol | provides printing services over the Internet
IIP | Internet Imaging Protocol | transports high-quality images and metadata across the Internet, using the Flashpix format; integrated with TCP/IP and HTTP
SMTP | Simple Mail Transfer Protocol | transfers e-mail messages between mail servers
The protocols deal with Internet media types, which identify the type/subtype and encoding of transmitted data. The media types are used by Multipurpose Internet Mail Extensions (MIME) and other standards. The basic media types registered with the Internet Assigned Numbers Authority (IANA) are text, application (e.g., document), audio, image, and video (Buckley & Beretta, 2000).
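As a small illustration of media types in practice, Python's standard mimetypes module maps file names to such type/subtype strings; a minimal sketch (the sample file names are invented):

```python
import mimetypes

# Map file names to IANA-style type/subtype strings.
# The file names below are purely illustrative.
for name in ["report.pdf", "photo.jpg", "clip.mpg", "song.mp3"]:
    media_type, encoding = mimetypes.guess_type(name)
    print(f"{name}: {media_type}")   # e.g. photo.jpg: image/jpeg
```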
The World Wide Web, or simply the Web, is a way of accessing information over the medium of the Internet. It is an information-sharing model built on top of the Internet. The Web is based on three specifications: URL (Uniform Resource Locator) to locate information, HTML (Hypertext Markup Language) to write simple documents, and HTTP. The Web uses the HTTP protocol, only one of the languages spoken over the Internet, to transmit data. Web services, which use HTTP to allow applications to communicate, use the Web to share information. The Web utilises browsers to access Web documents (called Web pages) that are linked to each other via hyperlinks. Besides hyperlinks, Web documents contain text, graphics, sounds, and video.
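To make the interplay of URL, HTTP, and media types concrete, here is a minimal sketch using Python's standard urllib; the URL is a placeholder, not a real resource of interest:

```python
from urllib.request import urlopen

# Fetch a Web document via HTTP; the URL below is a placeholder.
with urlopen("http://example.com/") as response:
    print(response.status)                    # HTTP status code, e.g. 200
    print(response.headers["Content-Type"])   # media type, e.g. text/html
    html = response.read().decode("utf-8", "replace")
    print(html[:80])                          # start of the HTML document
```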
The Web is just one of the ways that information can be disseminated over the Internet. The Internet, not the Web, is also used for e-mail, which relies on SMTP, instant messaging, Usenet news groups, and FTP. Thus the Web is only one, albeit large, portion of the Internet; it is also the basic way to publish and access information on networks within companies (intranets). Although the Web is nominally based on the HTML standard, a steady stream of innovations in the domains of multimedia and interactivity has greatly expanded its capabilities (Blumberg & Hughes, 1997).
Over the last two decades, the average information consumer has steadily raised his or her expectations regarding the amount, variety, and technical quality of received multimedia information, as well as of the systems for receiving, processing, storing, and replaying or displaying it. The Internet and the Web created a virtual world linking us together, having unique multimedia capabilities, and yielding a new concept of E-Utopia. The new concept realises new activities such as e-conferencing, e-entertainment, e-commerce, e-learning, telemedicine, and so forth. All these activities involve distributed multimedia databases and techniques for content-based information search and retrieval (Hanjalic e.a., 2000; Shih, 2002).
The advent of virtual reality environments (e.g., Apple Computer's QuickTime VR) and of the Virtual Reality Modeling Language (VRML) for rendering 3D objects and scenes added much to the Web's unique multimedia capabilities. But the Web continues to grow as both an interactive and a publishing environment and offers new types of interactions and ways to distribute, utilise, and visualise information. Some experts anticipate that today's interaction with a database via the Web will necessarily evolve into interaction with a variety of knowledge bases and will result in a more intelligent Web.
In addition to the present e-activities over the Web, one can expect the advent of smart houses, which can communicate with owners, repair services, shops, police, and others over the Web in order to suggest appropriate decisions using current measurements together with specific knowledge bases. For instance, an appliance may connect through the Web to a central facility and inform the vendor about the status of all of its subsystems, so as to derive the most cost-effective time schedule and routing for service.
In the near future most households are expected to be equipped with receivers for Digital Video Broadcasting (DVB) and Digital Audio Broadcasting (DAB), together providing hundreds of high-quality audiovisual channels, combined with a high-speed Internet connection to access countless archives of information all over the world. Today we witness the fast development of home digital multimedia archives and of digital libraries for large-scale collections of multimedia information, e.g. digital museum archives or professional digital multimedia archives at service providers such as TV and radio broadcasters, Internet providers, etc. Digital medical images are widely used in Web-based telemedicine, mainly for continuing medical education and diagnostic purposes (Della Mea e.a., 2001).
At present, more than a hundred million digital images and videos are already embedded in Web pages, and these collections are rapidly expanding because "a picture is worth a thousand words". Gigabytes of new images, audio, and video clips are stored every day in various repositories accessed through the Web (Shih, 2002). In some cases, such as space and aerial imagery of the Earth's surface, the amounts of stored data exceed thousands of terabytes. Thus, among the other new challenges of the information era, mechanisms for content-based information retrieval, especially efficient retrieval of image and video information stored in Web-based multimedia databases, have become the most important and difficult issue.
"Anyone who has surfed the Web has explained at one point or another that there is so much information available, so much to search and so much to keep up with". (Smeulders & Jain, 1997) |
Multimedia information differs from conventional text or numerical data in that multimedia objects require a large amount of memory and special processing operations. A multimedia database management system should be able to handle various data types (image, video, audio, text) and a large number of such objects, provide high-performance and cost-effective storage of the objects, and support such functions as insert, delete, update, and search (Shih, 2002). A typical multimedia document or presentation contains a number of objects of different types, such as picture, music, and text. Thus content-based multimedia information retrieval has become a very important new research issue. Unlike in a traditional searching scheme based on text and numerical data comparison, it is hard to model the searching and matching criteria of multimedia information.
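As a toy illustration of such a management layer, the following sketch supports insert, delete, update, and search over typed multimedia objects; all names are hypothetical, and an in-memory dictionary stands in for real large-object storage:

```python
from dataclasses import dataclass, field

@dataclass
class MediaObject:
    oid: int
    kind: str                      # "image", "video", "audio", or "text"
    uri: str                       # where the (large) raw data actually lives
    keywords: set = field(default_factory=set)

class MediaStore:
    """Toy multimedia store supporting insert, delete, update, and search."""
    def __init__(self):
        self._objects = {}

    def insert(self, obj: MediaObject):
        self._objects[obj.oid] = obj

    def delete(self, oid: int):
        self._objects.pop(oid, None)

    def update(self, oid: int, **changes):
        for key, value in changes.items():
            setattr(self._objects[oid], key, value)

    def search(self, kind=None, keyword=None):
        return [o for o in self._objects.values()
                if (kind is None or o.kind == kind)
                and (keyword is None or keyword in o.keywords)]

store = MediaStore()
store.insert(MediaObject(1, "image", "img/birds.png", {"bird", "sky"}))
print(store.search(kind="image", keyword="bird"))
```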
Image and video retrieval is based on how the contents of an image or a chain of images can be represented. Conventional techniques of text data retrieval can be applied only if every image and video record is accompanied by a textual content description (image metadata). But image or video content is much more versatile than text, and in most cases the query topics are not reflected in the available textual metadata. Images, by their very nature, contain "non-textual", unstructured information, which can hardly be captured automatically. Computational techniques that pursue the goal of indexing this unstructured visual information are called content-based image retrieval (CBIR), or content-based video information retrieval (CBVIR).
Architecture of a CBVIR system
In CBVIR, the user should describe the desired content in terms of visual features, images should be ranked with respect to similarity to the description, and the top-ranked (most similar) images should be retrieved. At the lowest, or initial, level of description, an image is considered a collection of pixels. Although pixel-level content might be of interest for some specific applications (say, in remote sensing of the Earth's surface), today's CBVIR is based on more elaborate descriptors capturing specific local and global photometric and geometric features of visual objects and semantic relationships between the features.
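A typical low-level photometric descriptor of this kind is the global colour histogram; below is a minimal sketch, assuming NumPy and an RGB image held as an H x W x 3 array (a random array stands in for real image data):

```python
import numpy as np

def colour_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalised per-channel colour histogram of an RGB image (H x W x 3)."""
    channels = []
    for c in range(3):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        channels.append(hist)
    feature = np.concatenate(channels).astype(float)
    return feature / feature.sum()        # normalise so all bins sum to 1

# A random 64x64 RGB "image" stands in for real data.
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(colour_histogram(image).shape)      # (24,) for 8 bins per channel
```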
Features can be divided into general-purpose and domain-specific ones. In most cases the general features are colour, texture, geometric shape, sketch, and spatial relationships. Domain-specific features are used in special applications such as surveying and mapping of the Earth's surface using remotely sensed imagery, or biometrics based on human face or fingerprint recognition. But the extraction of adequate descriptors and especially the inference of semantic content are extremely difficult problems with no universal solution. Higher levels of image content description involve objects and abstract relationships. Such a description is more or less easily formed by human vision, but it is often difficult to automatically detect and recognise objects of interest in one or more images (Castelli & Bergman, 2002).
The most difficult issue of multimedia information retrieval is how to formulate a query describing the needs of the user. For example, it is hard to conduct a query like "Find me a picture with a house and a car", and it is even harder to match such a specification against the large number of picture files in a multimedia database. Generally, human and automated content-based information retrieval differ much. Human retrieval tasks (queries) are stated at the cognitive level and exploit human knowledge, analysis, and understanding of the information context in terms of objects, persons, sceneries, the meaning of speech fragments, or the context of a story in general. Therefore, queries by content can be formulated in many different ways.
The notion of content is hardly formalised at present. First, there exists a "sensory gap" (Smeulders e.a., 2000) caused by distinctions between the properties of an object in the world and the properties of its computational description derived from an image or a series of images. The sensory gap makes the problem of content description ill-posed and notably limits the capabilities for formal representation of image content. Secondly, there is a semantic gap, or "a discrepancy between the query a user ideally would and one which the user actually could submit to an information retrieval system" (Castelli & Bergman, 2002). The semantic gap results in considerable distinctions between the description extracted from the visual data and the human interpretation of the same data in each particular case. The main restriction of content-based retrieval is that the user searches for semantic similarity, whereas the CBVIR system provides only similarity of quantitative features obtained by data processing.
Informally, the content of a still image includes, in increasing order of complexity, perceptual (algorithmic) properties of visual information; semantic properties, e.g. abstract primitives such as objects, roles, and scenes; and subjective attributes such as impressions, emotions, and meaning associated with the perceptual properties (Shih, 2002). Content-based retrieval of video records involves not only the objects shown but also the timing of object movement. But tools for content description by computational image / video understanding, object tracing, and semantic analysis are still under development and will remain so for a long time. First of all, the content of an image is a very subjective notion, and there are no "objective" ways to annotate the content at a semantic level so as to reflect all or even most subjective interpretations of the image. Secondly, the gap between "formal" and "human" (user) semantics should be bridged from both sides, by extending the image descriptions and by adapting the user queries to how a CBVIR system operates.
The users of a CBVIR system have a diversity of goals, in particular, search by association, search for a specific image, or category search (Smeulders e.a., 2000). Search by association initially has no particular aim and implies a highly interactive, iterative refinement of the search using sketches or example images. Search for a precise copy of the image in mind (e.g., in an art catalogue) or for another image of the same object assumes that the target can be interactively specified as similar to a group of given examples. Category search retrieves an arbitrary image representative of a certain class, either specified likewise by an example or derived from labels or other database information.
At present, feasible analysis of a video, an image, a musical piece, a speech fragment, or a text can be performed only at the algorithmic level. Such analyses involve computable features of audio and video signals, e.g. colour, texture, shape, frequency components, and temporal characteristics of signals, as well as algorithms operating on these features.
In image and video retrieval, features are obtained with various algorithms for image segmentation into homogeneous regions, detection of moving objects in successive frames, extraction of particular (e.g., spatially invariant) types of textures and geometric shapes, determination of relations among different objects, and analysis of 2D frequency spectra. But in contrast to most computer vision applications, image and video retrieval combines automatic image recognition with active user participation in the retrieval process (Castelli & Bergman, 2002). Also, retrieval inherently relates to image ranking by similarity to a query example, rather than to image classification by matching to a model. In CBVIR systems the user evaluates system responses, refines the query, and determines whether the received answers are relevant to that query.
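This ranking-by-similarity idea can be sketched directly: given the feature vector of a query example and the feature vectors of database images, the system orders the images by a similarity measure rather than matching them against a fixed model. Histogram intersection is used here as one common choice; all identifiers and data are illustrative:

```python
import numpy as np

def histogram_intersection(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two normalised histograms; 1.0 means identical."""
    return float(np.minimum(a, b).sum())

def rank_by_similarity(query: np.ndarray, database: dict) -> list:
    """Return (image_id, score) pairs sorted from most to least similar."""
    scores = {iid: histogram_intersection(query, feat)
              for iid, feat in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative random features for three database images and a query.
rng = np.random.default_rng(0)
db = {i: rng.dirichlet(np.ones(24)) for i in ("img1", "img2", "img3")}
query = rng.dirichlet(np.ones(24))
print(rank_by_similarity(query, db))       # top-ranked image comes first
```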
Of course, there is almost no parallelism between the results of cognition-based and feature-based retrieval, even in simple tasks like finding "an image containing a bird". As underlined by Chang e.a. (2001), "the multimedia information is highly distributed, minimally indexed, and lacks appropriate schemas. The critical question in multimedia search is how to design a scalable, visual information retrieval system? Such audio and visual information systems require large resources for transmission, storage and processing, factors which make indexing, retrieving, and managing visual information an immense challenge".
Media type | Media format | File extension
---|---|---
text | plain | txt
text | HTML | html, htm
document | PDF (Portable Document Format) | pdf
document | TEX DVI (Device Independent Data) | dvi
document | Postscript | ai, eps, ps
image | PNG (Portable Network Graphics) | png
image | Windows Bitmap | bmp
image | X Bitmap | xbm
image | TIFF (Tag Image File Format) | tif
image | JPEG (Joint Photographic Experts Group) | jpg
image | GIF (Graphics Interchange Format) | gif
audio | Midi | midi
audio | MP3 | mp3
audio | RealAudio | ra, ram
audio | WAV audio | wav
audio | MPEG audio | mp2, mpa, abs, mpega
video | MPEG (Moving Picture Experts Group) video | mpeg, mpg, mpe, mpv, mpegv
video | QuickTime | qt, mov, moov
video | RealMedia | ra, ram
video | AVI | avi
In the case of text- or keyword-based search, users specify keywords, and multimedia relevant to these keywords should be retrieved. Such retrieval relies strongly on metadata represented by text strings, keywords, or full scripts (Shih, 2002). Several recently developed and deployed efficient commercial multimedia search engines, such as Google Image Search, AltaVista Photo Finder, Lycos Pictures and Sounds, Yahoo! Image Surfer, and Lycos Fast MP3 Search, exploit text- or keyword-based retrieval. It requires an inverted file index that describes the multimedia content and allows for fast query responses. Building such an index is the core part of keyword-based multimedia information search, as the sketch below illustrates.
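A minimal sketch of such an inverted file index, assuming every multimedia item carries a short textual annotation (the captions below are invented):

```python
from collections import defaultdict

def build_inverted_index(annotations: dict) -> dict:
    """Map each keyword to the set of item ids whose annotation contains it."""
    index = defaultdict(set)
    for item_id, text in annotations.items():
        for word in text.lower().split():
            index[word].add(item_id)
    return index

# Invented captions standing in for real image metadata.
annotations = {
    "img1": "red car in front of a house",
    "img2": "bird flying over the sea",
    "img3": "house with a garden",
}
index = build_inverted_index(annotations)
print(index["house"])        # {'img1', 'img3'} -- fast keyword lookup
```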
Other indexing techniques are partitioning multimedia content into categories, which the user can browse through for images of interest that match the category keywords, and using the text embedded around multimedia content as a way to identify its content (see the sketch after this paragraph). But keywords and surrounding texts relate only implicitly to the image / video / audio content, and if it were possible to examine such content directly, the search results could be notably refined.
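The use of embedded text can be illustrated with Python's standard html.parser module, harvesting the alt attributes of img tags as candidate keywords for the images they accompany; the HTML fragment is invented:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs of <img> tags as implicit image keywords."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            d = dict(attrs)
            self.images.append((d.get("src"), d.get("alt", "")))

# Invented HTML fragment standing in for a real Web page.
parser = AltTextCollector()
parser.feed('<p>Our trip: <img src="lake.jpg" alt="boat on a mountain lake"></p>')
print(parser.images)     # [('lake.jpg', 'boat on a mountain lake')]
```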
CBVIR systems are most frequently called content-based image retrieval (CBIR) systems, although the same abbreviation stands for content-based information retrieval, too. If a CBVIR system follows a query-by-example (QBE) framework, the colour, texture, shape, or other features of the query image, extracted and stored as metadata, are matched to the image metadata in the database of indexed images, and the returned results are based on matching scores (sketched below). Queries can also be formulated to find images containing certain geometric shapes (Chang e.a., 2001).
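A matching score in a QBE setting is often a weighted combination of per-feature distances; the following toy sketch makes that assumption explicit (feature names, weights, and data are all invented):

```python
import numpy as np

def matching_score(query: dict, item: dict, weights: dict) -> float:
    """Combine per-feature distances into a single similarity score."""
    score = 0.0
    for name, weight in weights.items():
        distance = float(np.linalg.norm(query[name] - item[name]))
        score += weight / (1.0 + distance)   # smaller distance -> higher score
    return score

# Invented feature vectors and weights for a query and one database image.
weights = {"colour": 0.5, "texture": 0.3, "shape": 0.2}
rng = np.random.default_rng(1)
query = {k: rng.random(8) for k in weights}
item  = {k: rng.random(8) for k in weights}
print(matching_score(query, item, weights))
```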
Numerous commercial and research CBVIR systems have been developed in recent years. A combination of textual cues (like keywords) and visual feature extraction is the basis for these systems. Because high-level semantic descriptions can hardly be obtained automatically at present for the majority of available images, the systems mostly take account of selected low-level characteristics such as colour, texture, and shape of dominating image regions, sometimes adding a few specific features that characterise a particular application domain (e.g. human faces, skin features, or fingerprints). Some of the currently developed CBVIR systems are enumerated below (Shih, 2002):
CBVIR system | Developed by | Developed in | Retrieval features | Search criteria
---|---|---|---|---
QBIC (Query By Image Content) | IBM Almaden Research Center, USA | 1993 - 1997 | Example images, user-constructed sketches, selected colour / texture patterns | Content-based image similarity, text-based keyword search
Photobook | MIT Media Lab., USA | 1996 | Shape, texture, face features | Selected subset of features
FourEyes | MIT Media Lab., USA | 1996 | Improved version of Photobook including user relevance feedback | Learning which search model is the best from a given set of positive and negative examples
MARS (Multimedia Analysis and Retrieval System) | University of Illinois at Urbana-Champaign, USA | 1997 - 1998 | Organisation of various visual features into a meaningful retrieval framework that dynamically adapts to different users and applications | Integration of a relevance feedback architecture at various retrieval levels, including query vector refinement, automatic selection of matching tools, and automatic feature adaptation
PicToSeek | University of Amsterdam, The Netherlands | 1999 | Automatic building of a catalogue of images collected by autonomous Web crawlers, classification of the images into predefined classes, and extraction of their relevant features | Query by using image features, an example image, or simple browsing of the precomputed image catalogue
ImageRover | Boston University, USA | 1997 | Gathering information about HTML pages via a fleet of Web-based automated robots that gather, process, and store the image metadata in a vector format | Search of the metadata to provide the user with thumbnail images as relevance feedback; the user selects the images relevant to the search in order to utilise the content-based searching capabilities of the system
VisualSEEk | Columbia University, USA | 1996 - 1997 | Visual features and their spatial relations | Queries based on features and their relationships
WebSEEk | Columbia University, USA | 1996 - 1997 | Similar to ImageRover in Web-robot-based information gathering; also performs video search and collection | User relevance feedback in the form of thumbnail images and motion icons or spatially and temporally reduced video forms (short GIF files)
Blobworld | University of California at Berkeley, USA | 1999 | Regions obtained by automatic image segmentation that roughly correspond to objects or parts of objects; spatial organisation of the regions | Query for images containing particular objects; both textual and content-based searching
The search methods for images differ much from those for texts or numerical strings. Exact queries are of interest only for searching textual metadata. Multimedia information is searched for and retrieved using queries by similarity. The user defines what to retrieve using the available interface, and this query is represented in terms of requirements on a set of quantitative features describing the desired data. The basic groups of similarity requirements are as follows (Castelli & Bergman, 2002):