File systems - Part 3

Attributes

We have already spoken about other things which the system must know about files, such as where is the information stored (this could include information both on which device the file is stored on and which areas of the device, almost always disk).

Since this sort of information is going to be stored in some file table it makes sense to include other information as well. What other sorts of things do we want the system to know about our files?

We may want the system to know what type of file it is. This means that the system may be able to help us with what we want to do with the file. More about this in the next section. Similarly many systems want to know the record length of information in the file, and the number of records, fixed or variable length? etc.

The system must know who the file belongs to. As we saw in the section on security and protection, there are different ways in which the system can provide checking of access rights to the files. We could include an access control list, or more commonly just a set of protection codes.

It is useful to know when files are created and modified. When they are last read can show us if a file is necessary or whether it could be moved to archival storage (it is also useful when tracking access violations). We may even find it useful to know when this list of attributes was last changed.

The size of the file must be somewhere. If we have a file system which keeps generations we need to know the version number of the file.

All this information needs to be kept somewhere. This information must hang around between programs which access the file and so must be stored somewhere in the file system also. There are different places to keep the information. It could be kept in the files immediate directory. As long as the amount of file attribute information can be kept small this is reasonable. What advantage does it have to keep the information there?

Some (but not all) of it could be kept with the file itself. What advantage does this have?

UNIX doesn't keep it with either. This is where the on device inodes come in. All this information is kept in information nodes "inodes". These in turn are stored in a table (or a series of tables). All a directory entry needs is the name and the corresponding inode number. (It actually has a few more things, but you get the idea).

Data

So far we haven't talked about the most important aspect of a file, the data stored there. This is partly because with many operating systems the file system does not concern itself with the contents of files.

Because there is an infinite variety of the type of information you may want to store in a file, many file systems are designed to store and retrieve collections of bytes. Any further structure on top of this has to be provided by the application program.

Why should we want anything more than this? Remember that the job of the operating system is to help us get our work done. A file system which has information about the types of files it is dealing with can conceivably give better service to the programs which use it.

From a programmers' point of view, it would be helpful if the file system (in conjunction with the compiler) could store any data type. Imagine being able to store a tree, or a doubly linked list or a sequence of sounds, directly. Instead of having to take the tree to bits and store the nodes in some form from which your program can reconstruct the tree when it retrieves it etc.

Some programming languages give you the ability to store arbitrary structures, but in this case the programming language does the necessary translation and what is possible is usually limited. Pascal allows you to write records to and from files. C doesn't even allow this as such.

Some (even simple) operating systems associate some file type information as one of the file attributes. Frequently though this information is provided for the benefit of programs only, so that they know this file is in the format they need. The file system does not use this information itself or if it does it is minimal e.g. converting linefeed to carriage return for text files.

One type of file most operating systems know something about are the executable files. These have a definite structure, including information about where the different parts need to be loaded in memory, which dynamic libraries are required by the program, and symbol table information for debugging purposes amongst others.

Other file systems do work with data in specific records. The program specifies the size of the record, in this way the file can be regarded as an array of records.

(list of file types?)

Multimedia

As a practical example of the sort of thing that is possible consider a helpful file system designed to work with multimedia applications. In this case there are all sorts of strange things which the file system is asked to store - pictures, video, sound, text, simulations, links.

Until recently most general purpose operating systems haven't worried about real time access to files in the sense of having to know exactly how long it will take the file system to respond to a particular request. Obviously real time systems do exist it is just that most people didn't need one.

However with the arrival of multimedia, with large amounts of pictures, video and audio, things have to change. It is very annoying to be presented with jerky pictures or unsynchronized sound. Because of the vast amounts of memory taken up with digitized video and sound, we seldom want to have all the information of even a short movie in memory. This means we want real-time access to the file. The data has to arrive when we need to display the next frame of our movie, we do not want the user to have wait with the current frame frozen on the screen.

Of course for other types of data, there could be different requirements. It is likely that we could have one file for the video and another for the audio of the same clip. We might do this to provide another language for example. In this case we need to be able to synchronize the arrival of the two sources of data. Work is being done in this area. In fact there is research in operating systems specifically designed for multimedia applications.

Other things

So far our files consist of a name which leads us to a list of attributes and the stored data. In most file systems that is all we get. The Macintosh broke new ground by adding an extra component, the resource fork. The data section then got the name data fork.

Resource forks hold what are known as the resources of the program or file. A resource is usually something which will not be altered by any program which uses the file. For programs this includes the program code, and things like the layout of user interface components. Some files don't have resource forks, everything is regarded as changeable data. Some files don't have a data fork, almost exclusively program files (I think).

The Macintosh file system has special routines to handle the resource forks and there is a resource editor which enables a user to change resources, in order to customise a program for example.

The only thing which stops us generalising this approach and adding even more components to a file is that most file systems are not designed to be extensible.

Object oriented systems

In practise if our file system is to understand and know about all different sorts of files, it is going to get increasingly large, sometimes known as "file system bloat". One way out of this problem is to place the problem back on the files themselves. If we develop an object oriented system in which files are objects. In this case the file knows how to handle itself we can just send it messages. In this case all the file system has to do is to pass messages to the correct file, the file itself knows how to store and retrieve its type of information, as well as more peculiar things such as print itself.

This idea is being promoted as the approach the WWW can take to information distribution. When data is downloaded off the Internet it comes with routines (possibly written in Java) which can be used to display and modify that data.

The Macintosh looks like it does this, where all it really does is store a pointer to the program which can deal with this type of file, there is no concept of class structure. Even more artificially, Windows keeps a table of file extensions with corresponding applications.

Where on the disk

One thing we have mentioned many times, which is stored in the directory or on device file table, is moved to memory into the file information block and is essential for all file accesses is "where are the blocks corresponding to this file on the device?".

Give a list of different ways in which this information could be maintained.

Sequential storage

If the data for each file is stored in contiguous blocks then the only information which is required to know the file location is the address of the first block and the number of blocks. In this case we have to decide on the allocation strategy. When a file wants some space do we put it in the smallest space which fits, this reduces external fragmentation but means that as soon as the file grows a little bit a new place has to be found for it.

Another scheme is to put the file in the first area of space which is big enough to hold it, or the largest possible space, thereby maximising the room for growth.

Usually we have to resort to some method of compacting the files (moving them to one end of the disk for example) to make large areas of space. This can be done in the background by a

If we allow the file data to be stored on any blocks which could possibly be scattered over the device, we need a method to keep track of all blocks associated with the file.

Linked lists

If the directory or on device file table has the number of the first block in the file, we can maintain a linked list whereby each following block is pointed to by a field in the data block. A similar approach is taken by MSDOS with its FAT (File Allocation Table). Except in this case the blocks are represented by an array stored in a particular place on the disk. Rather than storing the links in the blocks themselves the links are stored in this array.

The FAT system has the advantage over storing the links in blocks in that it is possible to find a particular block without having to read all previous blocks.

Indexing

Each file can have a table associated with it with all the blocks used by that file. This method is not as susceptible to corruption as the linked list method and also provides random block assesses easily. We have the usual sorts of problems, how big should the table be. If the table can grow, it will be stored over several blocks itself.

To accommodate very large files, there may be indirect tables. The top level table refers to further tables, which contain the numbers of the blocks used by the file.

It is almost certain that on very large files, the entire table cannot be cached in memory in the file information block. This means multiple reads to make single reads (not all the time, but occasionally).

When we look closely at file block usage we see that the vast majority of files are quite small, usually under 16Kbytes. This has prompted some systems to have different ways of storing this information according to the size of the file.

Many versions of UNIX have a part of the inode with the block numbers of the first few blocks, an indirect pointer to a block with the next few blocks, a double indirect pointer to a block of indirect pointers, and a triple indirect pointer to a block of double indirect pointers. See Figure 11.7.

Fragmentation

Many systems (especially when the device gets quite full) suffer from fragmentation. The only available space is scattered all over the device. In this case assessing the file is much slower than it would be if the data were closer together. We have all used defragmenting programs to improve disk performance.

Some operating systems use cleverer methods of allocating file space than others and hence suffer less from fragmentation, the Linux extended file system 2 method works very well.

Generations and Archiving

We mentioned files several times when discussing security and protection. Two areas we brushed over were making file generations and archiving so we will spend a little time looking at these.

File generations

Remeber that the purpose of keeping file generations was so that we could undo mistakes. As long as we kept copies of older versions of our files we could always revert to them if we did something stupid or just decided that we liked something better the way it was.

Filename extensions

One easy way of doing this is to add an extension to the filename, e.g. bak1, bak2 etc (not a good choice because it has nothing to do with file backups). Of course we don't really want to bother our users with directories full of backup files, so at the least we would normally like the files to be invisible. UNIX has a simple (too simple?) way of making files invisible with most commands. It simply ignores files starting with a ".".

Obviously this type of file generation system can be provided by user level programs. It would be convenient if the operating system helped out.

Extended directories

A preferable method would be to have several spaces in the directory for each file. We will want to keep only a certain number of generations, we can't keep preserving everything for ever. As successive generations are saved we move the information down one bumping off the oldest generation.

When we delete a file we really only want to hide it. We may want to provide a "delete all generations" command, to truly delete all traces of a file. Otherwise we will have to use an age limit to remove such unwanted files.

Tree method

The big problem with an approach like this is that it takes up an awful lot of disk space. A novel approach only stores the changes to the file. When a new generation is saved as much of the previous generation as possible is used. Pointers to the still current information are kept with the new file, along with pointers to any differing information. The different sections of file have to be concatenated to produce the actual generation.