The second half of the 20th century and the beginning of the third millennium can be described as the information era of mankind's development because of the enormous impact of information on the human lifestyle and way of thinking. Permanent and intensive exposure to information broadens people's views and deepens their knowledge and awareness of the environment and the world in general. This process globalises society and creates new living and educational standards (Hanjalic e.a., 2000).
The information era is a consequence of the digital revolution, which started in the 1940s and 1950s and has built up continuously ever since. Representation of information in digital form allows for lossless data compression, or lossy compression with a low loss of quality, which in turn greatly reduces the time and channel capacity needed for data transmission and the space required for data storage. The possibility to combine and transmit or process different types of information such as audio, visual, or textual data without quality loss, together with the permanently increasing performance-to-price ratio of digital transmission, storage, and processing, resulted in the advent and continuous advance of multimedia systems and applications. Today's digital telecommunication networks such as the Internet provide extremely high-speed information transfer, frequently called the "information superhighway".
Digital imagery is the most important of the multimedia data types. While video and audio data are used predominantly in entertainment and news applications, images are actively exploited in almost all human activities (Shih, 2002).
The terms Internet and World Wide Web are not synonymous, although they describe two related things. The Internet is a massive networking infrastructure linking millions of computers together globally. In this network any computer can communicate with any other computer as long as they are both connected to the Internet. More than 100 countries are linked into exchanges of data, news, and opinions. Unlike online services, which are centrally controlled, the Internet is decentralised by design. Each Internet computer, called a host, is independent. Its operator chooses which Internet services to use and which local services to make available to other computers. "Remarkably, this anarchy by design works exceedingly well" (Webopedia, 2002).
Information travels over the Internet via a variety of languages known as protocols. A protocol consists of a set of conventions or rules, which govern communications by allowing networks to interconnect and ensuring compatibility between devices of different manufacturers. Examples of the protocols are:
Protocol | Full name | Purpose
---|---|---
TCP | Transmission Control Protocol | converts messages into streams of packets at the source and reassembles them back into messages at the destination
IP | Internet Protocol | handles addressing and routing of packets across multiple nodes and even multiple networks with multiple standards
TCP/IP | combined TCP and IP | the basic protocol suite over which most Internet traffic runs
FTP | File Transfer Protocol | transfers files from one computer to another; based on the TCP/IP protocol
HTTP | Hypertext Transfer Protocol | transfers compound documents with links; based on the TCP/IP protocol
IPP | Internet Printing Protocol | provides printing services over the Internet
IIP | Internet Imaging Protocol | transports high-quality images and metadata across the Internet, using the Flashpix format; integrated with TCP/IP and HTTP
SMTP | Simple Mail Transfer Protocol | transfers e-mail messages between mail servers
The protocols deal with Internet media types, which identify the type/subtype and encoding of transmitted data. The media types are used by Multipurpose Internet Mail Extensions (MIME) and other standards. The basic media types registered with the Internet Assigned Numbers Authority (IANA) are text, application (e.g., document), audio, image, and video (Buckley & Beretta, 2000).
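As a small illustration of media types in practice, Python's standard mimetypes module maps file names to such type/subtype strings; a minimal sketch (the sample file names are invented):

```python
import mimetypes

# Map file names to IANA-style type/subtype strings.
# The file names below are purely illustrative.
for name in ["report.pdf", "photo.jpg", "clip.mpg", "song.mp3"]:
    media_type, encoding = mimetypes.guess_type(name)
    print(f"{name}: {media_type}")   # e.g. photo.jpg: image/jpeg
```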
The World Wide Web, or simply the Web, is a way of accessing information over the medium of the Internet. It is an information-sharing model built on top of the Internet. The Web is based on three specifications: URL (Uniform Resource Locator) to locate information, HTML (Hypertext Markup Language) to write simple documents, and HTTP. The Web uses the HTTP protocol, only one of the languages spoken over the Internet, to transmit data. Web services, which use HTTP to allow applications to communicate, use the Web to share information. The Web utilises browsers to access Web documents (called Web pages) that are linked to each other via hyperlinks. Besides hyperlinks, Web documents contain text, graphics, sounds, and video.
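To make the interplay of URL, HTTP, and media types concrete, here is a minimal sketch using Python's standard urllib; the URL is a placeholder, not a real resource of interest:

```python
from urllib.request import urlopen

# Fetch a Web document via HTTP; the URL below is a placeholder.
with urlopen("http://example.com/") as response:
    print(response.status)                    # HTTP status code, e.g. 200
    print(response.headers["Content-Type"])   # media type, e.g. text/html
    html = response.read().decode("utf-8", "replace")
    print(html[:80])                          # start of the HTML document
```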
The Web is just one of the ways that information can be disseminated over the Internet. The Internet, not the Web, is also used for e-mail, which relies on SMTP, instant messaging, Usenet news groups, and FTP. Thus the Web is only one, albeit large, portion of the Internet; it is also the basic way to publish and access information on networks within companies (intranets). Although the Web is nominally based on the HTML standard, a steady stream of innovations in the domains of multimedia and interactivity has greatly expanded its capabilities (Blumberg & Hughes, 1997).
Over the last two decades, the average information consumer has steadily raised his or her expectations regarding the amount, variety, and technical quality of received multimedia information, as well as of the systems for receiving, processing, storing, and replaying or displaying it. The Internet and the Web created a virtual world linking us together, having unique multimedia capabilities, and yielding a new concept of E-Utopia. The new concept realises new activities such as e-conferencing, e-entertainment, e-commerce, e-learning, telemedicine, and so forth. All these activities involve distributed multimedia databases and techniques for content-based information search and retrieval (Hanjalic e.a., 2000; Shih, 2002).
The advent of virtual reality environments (e.g., Apple Computer's QuickTime VR) and of the Virtual Reality Modeling Language (VRML) for rendering 3D objects and scenes added much to the Web's unique multimedia capabilities. But the Web continues to grow as both an interactive and a publishing environment and offers new types of interactions and ways to distribute, utilise, and visualise information. Some experts anticipate that today's interaction with a database via the Web will necessarily evolve into interaction with a variety of knowledge bases and will result in a more intelligent Web.
In addition to the present e-activities over the Web, one can expect the advent of smart houses, which can communicate with owners, repair services, shops, police, and others over the Web in order to suggest appropriate decisions using current measurements together with specific knowledge bases. For instance, an appliance may connect through the Web to a central facility and inform the vendor about the status of all of its subsystems, so as to derive the most cost-effective time schedule and routing for service.
In the near future most households are expected to be equipped with receivers for Digital Video Broadcasting (DVB) and Digital Audio Broadcasting (DAB), together providing hundreds of high-quality audiovisual channels, combined with a high-speed Internet connection to access countless archives of information all over the world. Today we witness the fast development of home digital multimedia archives and of digital libraries for large-scale collections of multimedia information, e.g. digital museum archives or professional digital multimedia archives at service providers such as TV and radio broadcasters, Internet providers, etc. Digital medical images are widely used in Web-based telemedicine, mainly for continuing medical education and diagnostic purposes (Della Mea e.a., 2001).
At present, more than a hundred million digital images and videos are already embedded in Web pages, and these collections are rapidly expanding because "a picture is worth a thousand words". Gigabytes of new images, audio, and video clips are stored every day in various repositories accessed through the Web (Shih, 2002). In some cases, such as space and aerial imagery of the Earth's surface, the amounts of stored data exceed thousands of terabytes. Thus, among the other new challenges of the information era, mechanisms for content-based information retrieval, especially efficient retrieval of image and video information stored in Web-based multimedia databases, have become the most important and difficult issue.
"Anyone who has surfed the Web has explained at one point or another that there is so much information available, so much to search and so much to keep up with". (Smeulders & Jain, 1997) |
Multimedia information differs from conventional text or numerical data in that multimedia objects require a large amount of memory and special processing operations. A multimedia database management system should be able to handle various data types (image, video, audio, text) and a large number of such objects, provide high-performance and cost-effective storage of the objects, and support such functions as insert, delete, update, and search (Shih, 2002). A typical multimedia document or presentation contains a number of objects of different types, such as picture, music, and text. Thus content-based multimedia information retrieval has become a very important new research issue. Unlike in a traditional searching scheme based on text and numerical data comparison, it is hard to model the searching and matching criteria of multimedia information.
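As a toy illustration of such a management layer, the following sketch supports insert, delete, update, and search over typed multimedia objects; all names are hypothetical, and an in-memory dictionary stands in for real large-object storage:

```python
from dataclasses import dataclass, field

@dataclass
class MediaObject:
    oid: int
    kind: str                      # "image", "video", "audio", or "text"
    uri: str                       # where the (large) raw data actually lives
    keywords: set = field(default_factory=set)

class MediaStore:
    """Toy multimedia store supporting insert, delete, update, and search."""
    def __init__(self):
        self._objects = {}

    def insert(self, obj: MediaObject):
        self._objects[obj.oid] = obj

    def delete(self, oid: int):
        self._objects.pop(oid, None)

    def update(self, oid: int, **changes):
        for key, value in changes.items():
            setattr(self._objects[oid], key, value)

    def search(self, kind=None, keyword=None):
        return [o for o in self._objects.values()
                if (kind is None or o.kind == kind)
                and (keyword is None or keyword in o.keywords)]

store = MediaStore()
store.insert(MediaObject(1, "image", "img/birds.png", {"bird", "sky"}))
print(store.search(kind="image", keyword="bird"))
```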
Image and video retrieval is based on how the contents of an image or a chain of images can be represented. Conventional techniques of text data retrieval can be applied only if every image and video record is accompanied by a textual content description (image metadata). But image or video content is much more versatile than text, and in most cases the query topics are not reflected in the available textual metadata. Images, by their very nature, contain "non-textual", unstructured information, which can hardly be captured automatically. Computational techniques that pursue the goal of indexing this unstructured visual information are called content-based image retrieval (CBIR), or content-based video information retrieval (CBVIR).
Architecture of a CBVIR system
In CBVIR, the user should describe the desired content in terms of visual features, images should be ranked with respect to similarity to the description, and the top-ranked (most similar) images should be retrieved. At the lowest, or initial, level of description, an image is considered a collection of pixels. Although pixel-level content might be of interest for some specific applications (say, in remote sensing of the Earth's surface), today's CBVIR is based on more elaborate descriptors capturing specific local and global photometric and geometric features of visual objects and semantic relationships between the features.
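A typical low-level photometric descriptor of this kind is the global colour histogram; below is a minimal sketch, assuming NumPy and an RGB image held as an H x W x 3 array (a random array stands in for real image data):

```python
import numpy as np

def colour_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalised per-channel colour histogram of an RGB image (H x W x 3)."""
    channels = []
    for c in range(3):
        hist, _ = np.histogram(image[:, :, c], bins=bins, range=(0, 256))
        channels.append(hist)
    feature = np.concatenate(channels).astype(float)
    return feature / feature.sum()        # normalise so all bins sum to 1

# A random 64x64 RGB "image" stands in for real data.
image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(colour_histogram(image).shape)      # (24,) for 8 bins per channel
```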
Features can be divided into general-purpose and domain-specific ones. In most cases the general features are colour, texture, geometric shape, sketch, and spatial relationships. Domain-specific features are used in special applications such as surveying and mapping of the Earth's surface using remotely sensed imagery, or biometrics based on human face or fingerprint recognition. But the extraction of adequate descriptors and especially the inference of semantic content are extremely difficult problems with no universal solution. Higher levels of image content description involve objects and abstract relationships. Such a description is more or less easily formed by human vision, but it is often difficult to automatically detect and recognise objects of interest in one or more images (Castelli & Bergman, 2002).
The most difficult issue of multimedia information retrieval is how to formulate a query describing the needs of the user. For example, it is hard to conduct a query like "Find me a picture with a house and a car", and it is even harder to match such a specification against the large number of picture files in a multimedia database. Generally, human and automated content-based information retrieval differ much. Human retrieval tasks (queries) are stated at the cognitive level and exploit human knowledge, analysis, and understanding of the information context in terms of objects, persons, sceneries, the meaning of speech fragments, or the context of a story in general. Therefore, queries by content can be formulated in many different ways.
The notion of content is hardly formalised at present. First, there exists a "sensory gap" (Smeulders e.a., 2000) caused by distinctions between the properties of an object in the world and the properties of its computational description derived from an image or a series of images. The sensory gap makes the problem of content description ill-posed and notably limits the capabilities for formal representation of image content. Secondly, there is a semantic gap, or "a discrepancy between the query a user ideally would and one which the user actually could submit to an information retrieval system" (Castelli & Bergman, 2002). The semantic gap results in considerable distinctions between the description extracted from the visual data and the human interpretation of the same data in each particular case. The main restriction of content-based retrieval is that the user searches for semantic similarity, whereas the CBVIR system provides only similarity of quantitative features obtained by data processing.
Informally, the content of a still image includes, in increasing order of complexity, perceptual (algorithmic) properties of visual information; semantic properties, e.g. abstract primitives such as objects, roles, and scenes; and subjective attributes such as impressions, emotions, and meaning associated with the perceptual properties (Shih, 2002). Content-based retrieval of video records involves not only the objects shown but also the timing of object movement. But tools for content description by computational image / video understanding, object tracing, and semantic analysis are still under development and will remain so for a long time. First of all, the content of an image is a very subjective notion, and there are no "objective" ways to annotate the content at a semantic level so as to reflect all or even most subjective interpretations of the image. Secondly, the gap between "formal" and "human" (user) semantics should be bridged from both sides, by extending the image descriptions and by adapting the user queries to how a CBVIR system operates.
The users of a CBVIR system have a diversity of goals, in particular, search by association, search for a specific image, or category search (Smeulders e.a., 2000). Search by association initially has no particular aim and implies a highly interactive, iterative refinement of the search using sketches or example images. Search for a precise copy of the image in mind (e.g., in an art catalogue) or for another image of the same object assumes that the target can be interactively specified as similar to a group of given examples. Category search retrieves an arbitrary image representative of a certain class, either specified likewise by an example or derived from labels or other database information.
At present, feasible analysis of a video, an image, a musical piece, a speech fragment, or a text can be performed only at the algorithmic level. Such analyses involve computable features of audio and video signals, e.g. colour, texture, shape, frequency components, and temporal characteristics of signals, as well as algorithms operating on these features.
In image and video retrieval, features are obtained with various algorithms for image segmentation into homogeneous regions, detection of moving objects in successive frames, extraction of particular (e.g., spatially invariant) types of textures and geometric shapes, determination of relations among different objects, and analysis of 2D frequency spectra. But in contrast to most computer vision applications, image and video retrieval combines automatic image recognition with active user participation in the retrieval process (Castelli & Bergman, 2002). Also, retrieval inherently relates to image ranking by similarity to a query example, rather than to image classification by matching to a model. In CBVIR systems the user evaluates system responses, refines the query, and determines whether the received answers are relevant to that query.
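This ranking-by-similarity idea can be sketched directly: given the feature vector of a query example and the feature vectors of database images, the system orders the images by a similarity measure rather than matching them against a fixed model. Histogram intersection is used here as one common choice; all identifiers and data are illustrative:

```python
import numpy as np

def histogram_intersection(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two normalised histograms; 1.0 means identical."""
    return float(np.minimum(a, b).sum())

def rank_by_similarity(query: np.ndarray, database: dict) -> list:
    """Return (image_id, score) pairs sorted from most to least similar."""
    scores = {iid: histogram_intersection(query, feat)
              for iid, feat in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative random features for three database images and a query.
rng = np.random.default_rng(0)
db = {i: rng.dirichlet(np.ones(24)) for i in ("img1", "img2", "img3")}
query = rng.dirichlet(np.ones(24))
print(rank_by_similarity(query, db))       # top-ranked image comes first
```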
Of course, there is almost no parallelism between the results of cognition-based and feature-based retrieval, even in simple tasks like finding "an image containing a bird". As underlined by Chang e.a. (2001), "the multimedia information is highly distributed, minimally indexed, and lacks appropriate schemas. The critical question in multimedia search is how to design a scalable, visual information retrieval system? Such audio and visual information systems require large resources for transmission, storage and processing, factors which make indexing, retrieving, and managing visual information an immense challenge".
Media type | Media format | File extension
---|---|---
text | plain | txt
text | HTML | html, htm
document | PDF (Portable Document Format) | pdf
document | TEX DVI (Device Independent Data) | dvi
document | Postscript | ai, eps, ps
image | PNG (Portable Network Graphics) | png
image | Windows Bitmap | bmp
image | X Bitmap | xbm
image | TIFF (Tag Image File Format) | tif
image | JPEG (Joint Photographic Experts Group) | jpg
image | GIF (Graphics Interchange Format) | gif
audio | Midi | midi
audio | MP3 | mp3
audio | RealAudio | ra, ram
audio | WAV audio | wav
audio | MPEG audio | mp2, mpa, abs, mpega
video | MPEG (Moving Picture Experts Group) video | mpeg, mpg, mpe, mpv, mpegv
video | QuickTime | qt, mov, moov
video | RealMedia | ra, ram
video | AVI | avi
In the case of text- or keyword-based search, users specify keywords, and multimedia relevant to these keywords should be retrieved. Such retrieval relies strongly on metadata represented by text strings, keywords, or full scripts (Shih, 2002). Several recently developed and deployed efficient commercial multimedia search engines, such as Google Image Search, AltaVista Photo Finder, Lycos Pictures and Sounds, Yahoo! Image Surfer, and Lycos Fast MP3 Search, exploit text- or keyword-based retrieval. It requires an inverted file index that describes the multimedia content and allows for fast query responses. Building such an index is the core part of keyword-based multimedia information search, as the sketch below illustrates.
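A minimal sketch of such an inverted file index, assuming every multimedia item carries a short textual annotation (the captions below are invented):

```python
from collections import defaultdict

def build_inverted_index(annotations: dict) -> dict:
    """Map each keyword to the set of item ids whose annotation contains it."""
    index = defaultdict(set)
    for item_id, text in annotations.items():
        for word in text.lower().split():
            index[word].add(item_id)
    return index

# Invented captions standing in for real image metadata.
annotations = {
    "img1": "red car in front of a house",
    "img2": "bird flying over the sea",
    "img3": "house with a garden",
}
index = build_inverted_index(annotations)
print(index["house"])        # {'img1', 'img3'} -- fast keyword lookup
```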
Other indexing techniques are partitioning multimedia content into categories, which the user can browse through for images of interest that match the category keywords, and using the text embedded around multimedia content as a way to identify its content (see the sketch after this paragraph). But keywords and surrounding texts relate only implicitly to the image / video / audio content, and if it were possible to examine such content directly, the search results could be notably refined.
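The use of embedded text can be illustrated with Python's standard html.parser module, harvesting the alt attributes of img tags as candidate keywords for the images they accompany; the HTML fragment is invented:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs of <img> tags as implicit image keywords."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            d = dict(attrs)
            self.images.append((d.get("src"), d.get("alt", "")))

# Invented HTML fragment standing in for a real Web page.
parser = AltTextCollector()
parser.feed('<p>Our trip: <img src="lake.jpg" alt="boat on a mountain lake"></p>')
print(parser.images)     # [('lake.jpg', 'boat on a mountain lake')]
```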
CBVIR systems are most frequently called content-based image retrieval (CBIR) systems, although the same abbreviation stands for content-based information retrieval, too. If a CBVIR system follows a query-by-example (QBE) framework, the colour, texture, shape, or other features of the query image, extracted and stored as metadata, are matched to the image metadata in the database of indexed images, and the returned results are based on matching scores (sketched below). Queries can also be formulated to find images containing certain geometric shapes (Chang e.a., 2001).
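A matching score in a QBE setting is often a weighted combination of per-feature distances; the following toy sketch makes that assumption explicit (feature names, weights, and data are all invented):

```python
import numpy as np

def matching_score(query: dict, item: dict, weights: dict) -> float:
    """Combine per-feature distances into a single similarity score."""
    score = 0.0
    for name, weight in weights.items():
        distance = float(np.linalg.norm(query[name] - item[name]))
        score += weight / (1.0 + distance)   # smaller distance -> higher score
    return score

# Invented feature vectors and weights for a query and one database image.
weights = {"colour": 0.5, "texture": 0.3, "shape": 0.2}
rng = np.random.default_rng(1)
query = {k: rng.random(8) for k in weights}
item  = {k: rng.random(8) for k in weights}
print(matching_score(query, item, weights))
```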
Numerous commercial and research CBVIR systems have been developed in recent years. A combination of textual cues (like keywords) and visual feature extraction is the basis for these systems. Because high-level semantic descriptions can hardly be obtained automatically at present for the majority of available images, the systems mostly take account of selected low-level characteristics such as colour, texture, and shape of dominating image regions, sometimes adding a few specific features that characterise a particular application domain (e.g. human faces, skin features, or fingerprints). Some of the currently developed CBVIR systems are enumerated below (Shih, 2002):
CBVIR system | Developed by | Developed in | Retrieval features | Search criteria
---|---|---|---|---
QBIC (Query By Image Content) | IBM Almaden Research Center, USA | 1993 - 1997 | Example images, user-constructed sketches, selected colour / texture patterns | Content-based image similarity, text-based keyword search
Photobook | MIT Media Lab., USA | 1996 | Shape, texture, face features | Selected subset of features
FourEyes | MIT Media Lab., USA | 1996 | Improved version of Photobook including user relevance feedback | Learning which search model is the best from a given set of positive and negative examples
MARS (Multimedia Analysis and Retrieval System) | University of Illinois at Urbana-Champaign, USA | 1997 - 1998 | Organisation of various visual features into a meaningful retrieval framework that dynamically adapts to different users and applications | Integration of a relevance feedback architecture at various retrieval levels, including query vector refinement, automatic selection of matching tools, and automatic feature adaptation
PicToSeek | University of Amsterdam, The Netherlands | 1999 | Automatic building of a catalogue of images collected by autonomous Web crawlers, classification of the images into predefined classes, and extraction of their relevant features | Query by using image features, an example image, or simple browsing of the precomputed image catalogue
ImageRover | Boston University, USA | 1997 | Gathering information about HTML pages via a fleet of Web-based automated robots that gather, process, and store the image metadata in a vector format | Search of the metadata to provide the user with thumbnail images as relevance feedback; the user selects the images relevant to the search in order to utilise the content-based searching capabilities of the system
VisualSEEk | Columbia University, USA | 1996 - 1997 | Visual features and their spatial relations | Queries based on features and their relationships
WebSEEk | Columbia University, USA | 1996 - 1997 | Similar to ImageRover in Web-robot-based information gathering; also performs video search and collection | User relevance feedback in the form of thumbnail images and motion icons or spatially and temporally reduced video forms (short GIF files)
Blobworld | University of California at Berkeley, USA | 1999 | Regions obtained by automatic image segmentation that roughly correspond to objects or parts of objects; spatial organisation of the regions | Query for images containing particular objects; both textual and content-based searching
The search methods for images differ much from those for texts or numerical strings. Exact queries are of interest only for searching textual metadata. Multimedia information is searched for and retrieved using queries by similarity. The user defines what to retrieve using the available interface, and this query is represented in terms of requirements on a set of quantitative features describing the desired data. The basic groups of similarity requirements are as follows (Castelli & Bergman, 2002):