|
|
|
Department of Computer Science, |
||||
|
Telephone : |
- none - |
|||||
|
Fax : |
+ 64 9 373 7453 |
|||||
|
email : |
p.fenwick@auckland.ac.nz |
|||||
|
|
||||||
After about 20 years in the Department of Computer Science (and 12 years before that in the Computer Centre) at the University of Auckland, I have taken early retirement from Jan 2005.
My retiring rank as Associate Professor in the Department of
Computer Science continues with a position as Honorary Associate
Professor until the end of 2007, allowing me to maintain a
relationship with the University.
My University email and other
addresses (but not telephone) should continue to be available, but
will be handled less often ....
Important notice -- I am NOT available for supervising student projects or similar activities.
1 Nov 2007
My primary interests are in computer architecture (including arithmetic), communications and networks, and text compression.
Recently I have been concentrating on the new "block sorting" text compressor (or block reduction, or Burrows-Wheeler Transform), first described by Burrows and Wheeler in April 1994. Early results on this are reported in Tech Report 111 and Tech Report 120, with a final report in Technical Report 130. Following from that is other work in "symbol ranking" text compression.
The complete source code of a fast constant-order symbol ranking compressor is available as srank.c.
Other recent related work has involved fast text searching and universal or variable-length integer codes based on Fibonacci numbers and prime numbers.
And even more recently I have returned to the work that started me in computing 40 years ago (!) on the solution of the potential distribution in a Hall effect plate for a magnetometer.
A recent survey identified me as a "Top
Scholar in Systems and Software Engineering", based on a
publications count in important international journals.
See "An
assessment of systems and software engineering scholars and
institutions (1998 - 2002)",
R.L. Glass, T.Y. Chen, The
Journal of Systems and Software Vol 68 (2003) pp 77 - 84
I have written four chapters in the "Lossless Data Compression Handbook" (ed Khalid Sayood), Academic Press, 2003. ISBN 0-12-620861-1
Universal codes
This
chapter describes all that I have been able to find about Universal
Codes (or Variable Length Codes for the Integers). There is a
surprising variety of codes (and some varieties of surprising
codes).
Burrows Wheeler Compression
A comprehensive review of the work to date on Burrows Wheeler
compression.
Symbol Ranking and ACB
Compression
Essentially a revision of earlier work described
in Software - Practice and Experience and J.UCS.
Compression of Unicode Files
A brief discussion of
the problems of compressing Unicode files, based on the work for the
1998 Data Compression Conference.
Some of my recent ideas on Burrows Wheeler compression were presented at a workshop at Rutgers University in August 2004, celebrating the 10th anniversary of the Burrows Wheeler Transform. Here are a preliminary report and the presentation slides which include later work and go somewhat beyond the report.
The essence of this material has now
been published in The ideas fall into three groups, in
order of decreasing work-done. Use a BW transform to derive all
contexts of some fixed order, say 4, and use these to drive an
escape-free PPM compressor/decompressor. This is the main substance
of the report and presentation; it worked, but not nearly as well as
hoped. "Break open" the reverse
BW transform to generate partial contexts during both compression
and decompression, so aiding the operation of the coder and
decoder. Is there some relation between
data compression and error correction codes? Compression can be
thought of as a very noisy channel (encoding a single symbol into a
range of possible symbols) and then generation of the codes to
select the correct symbol from the "noise" codes; the
encoder and decoder track each other and the "corrections"
are what the compressor emits. (The ideas follow most readily from
my earlier work on "Symbol Ranking" compression which uses
an explicit symbol predictor/corrector mechanism). Can we omit some symbols (giving an erasure code) and then
use a Viterbi-like trellis to recover feasible decodings. (Whew!) The relatively new error-correcting "turbo-codes"
use two interacting but complementary error-correcting codes to give
superlative performance. Can something similar be done for
compression? (It also ties in with the previous point.) (The links for the electronic journal J.UCS, the "Journal for
Universal Computer Science" are to the J.UCS Page for the
appropriate month. The journal contains compressed Postscript files
which may be copied and printed.) Most of my papers and reports are available by anonymous ftp from
ftp.cs.auckland.ac.nz/pub/staff/peter-f “Burrows–Wheeler
compression : Principles and reflections”
Theoretical
Computer Science Vol
387(2007) pp 200–219 "Burrows Wheeler --
Alternatives to Move to Front", "Burrows Wheeler
Compression with Variable Length Integer Codes", Software
- Practice and Experience, Vol 32, No 13, Nov 2002, pp 1307 -
1316. "Variable-Length Integer
Codes Based on the Goldbach Conjecture, and Other Additive Codes",
IEEE Trans. Inform. Theory, Vol 48, No 8, pp 2412 -
2417,Aug 2002. "Zeckendorf Integer
Arithmetic", Fibonacci Quarterly Nov 2003, pp
405 - 413. "Some Perils of
Performance Prediction : a case study on Pattern Matching",
Software - Practice and Experience, Vol 31, pp 835-843
April 2001. "Fast string matching for
multiple searches", Software - Practice and
Experience, Vol 31, pp 815-833 April 2001. "Symbol Ranking Text
Compressors: Review and Implementation", Software -
Practice and Experience Vol 28 No 5 pp 547-559 April 1998.
"Compression
of Unicode Files", DCC-98 Data Compression
Conference, Snowbird, Utah, March 1998. "Symbol
Ranking Text Compression with Shannon Recodings", J.UCS
Vol 3, No 2 pp 70 85 Feb 1997. "The Burrows-Wheeler
Transform for Block Sorting Text Compression -- Principles and
Improvements", The Computer Journal, Vol 39,
No 9, pp 731-740, 1996. "
Queue Prediction: an efficient scheduler for fast ATM cell
transmission ", ACSC'97, Sydney, Australia Feb 1997.
"Block
Sorting Text Compression", ACSC'96, Melbourne,
Australia Jan 1996. "Differential
Ziv-Lempel Text Compression", J.UCS Vol 1,
No 8 pp 587 598 Aug 1995. "Huffman code efficiencies for extensions of
sources", IEEE Trans. Comm., Vol 43 No 2 pp163
165, Feb 1995. "High-radix
Division with Approximate Quotient-digit Estimation",
Journal for Universal Computer Science (J.UCS)
Vol 1, No 1, pp 2 22 Jan 1995. "A new data structure for cumulative frequency
tables", Software Practice and Experience, Vol
24, No 3, pp 327 336 Mar 1994. "
Reflections
on Burrows Wheeler compression", The University of
Auckland, Department of Computer Science Report No 172, Jul.
2004. " A
fast, constant-order, symbol ranking text compressor", The
University of Auckland, Department of Computer Science
Report No 145, Apr. 1997. " Punctured
Elias codes for variable-length coding of the integers",The
University of Auckland, Department of Computer Science
Report No 137, Dec. 1996. "Symbol
Ranking Text Compression", The University of
Auckland, Department of Computer Science Report No 132,
Jun. 1996. "
Block Sorting Text Compression -- Final Report", The
University of Auckland, Department of Computer Science
Report No 130, Apr. 1996. "
Improvements to the Block Sorting Text Compression Algorithm" ,
The University of Auckland, Department of Computer
Science Report No 120, Aug. 1995. "Experiments
with a Block Sorting Text Compression Algorithm", The
University of Auckland, Department of Computer Science
Report No 111, May 1995. [Initial results and miscellaneous
thoughts, on the block sorting algorithm. It includes many
experimental results which are not found elsewhere.]
"
A New Data Structure for Cumulative Probability Tables : an Improved
Frequency-to-Symbol Algorithm", The University of
Auckland, Department of Computer Science Report No 110, Feb.
1995. "Fast LZ-77 string matching",
(with P. Gutmann) The University of Auckland, Department of
Computer Science Report No 102, Sep. 1994.
"Huffman
code efficiencies for extensions of sources", The
University of Auckland, Department of Computer Science Report No 89,
Feb. 1994.
"A
new data structure for cumulative frequency tables", The
University of Auckland, Department of Computer Science Report No 88,
May 1993. [Essentially the S P & E paper.]
For this I simply reproduce the abstract and introduction from my
report. Read further if you are interested!!
This report describes the solution of a problem which arose about
40 years ago and seemed eminently suitable for computer solution.
But numerical limitations made it quite unsuitable for complete
solution on the computers then available. It is only now that a
solution is feasible on generally-available machines.
The problem itself involves the solution of potentials over a
Hall-effect plate driven with a constant voltage instead of the more
usual constant current. The change from current-drive to
voltage-drive greatly reduces the temperature sensitivity of the
Hall output voltage, but at the cost of non-linear response at high
magnetic fields. The problem is to find the shape of the response.
This is the problem that introduced me to computing. For reasons
that will become apparent, its solution languished and it is only
now, 40 years later, that I have at last completed it.
In the summer of 1963-4 I was working as a casual laboratory
technician in the University of Auckland Physics Department. Much of
my work seemed to be with magnets for Advanced Laboratory
experiments - I remember working with two large electromagnets, both
weighing 10s of kilogrammes. One produced a high field of about 1
Tesla (10,000 gauss for the old-fashioned) for studying the Zeeman
effect. The other produced a lower field, about 0.3 T, but very
uniform, for demonstrating Nuclear Magnetic Resonance.
Also working with things magnetic I built a Hall-effect fluxmeter
to measure magnetic fields. The Hall element was driven by a
constant voltage, using AC at about 1 kHz, with a switched gain
amplifier and synchronous rectification. Its maximum full-scale
sensitivity was about 100 microTesla, or 1 gauss, comparable to the
Earth's field.
But I knew that the output saturated at high fields (say B>0.1T)
and wondered just how the output varied with a high magnetic field.
The University had then just got its first computer, an IBM 1620
with all of 20k decimal digits of storage and a FP multiply time of
10 ms. Murray Johns suggested that I solve my problem on the
computer as it seemed a good one for numerical solution. So I taught
myself Fortran (with Format).
But you may read the complete
report as a PDF file....
“Burrows–Wheeler
compression : Principles and reflections”
Theoretical
Computer Science Vol
387(2007) pp 200–219 (see my publications list below for an
abstract).
You can read a preprint of this paper;
the "preliminary report" above contains material that was omitted to save space when
two earlier papers were merged into the final version.
The report starts by developing analogies between the
Fourier Transform and its associated spectra, autocorrelations, etc,
on the one hand and the the Burrows Wheeler transform and PPM on the
other. Invoking ideas of symmetry leads to a significant new
understanding of the Burrows Wheeler operation.
(I have long
thought that we did not really understand the Burrows-Wheeler
Transform. But the forward transform is essentially simple, at least
in concept, and showed little scope for improvement. With these
ideas the reverse transform looks the more likely for further study
and deeper understanding. At least I hope so.)
This third
idea may provide a useful starting point for somebody with more
imagination than me. It leads to two other ideas relating to the
possible relationship between Error Correction and compression.
But do
we have suitably different compressors? The two compressors must
work with the same symbol order, probably precluding a BW and PPM
combination. But of those which retain the original "text"
order, PPM and LZ-77 were shown to be equivalent by Bell, Cleary &
Witten (in their book "Text Compression"). So until
somebody invents say a symbol-by-symbol compressor which does not
depend on contexts, there may be no suitable complementary
candidates!
If you have ideas, please try them!
Some recent papers.
Abstract
After
a general description of the Burrows Wheeler Transform and a brief
survey of recent work on processing its output, the paper examines
the coding of the zero-runs from the MTF recoding stage, an aspect
with little prior treatment. It is concluded that the original
scheme proposed by Wheeler is extremely efficient and unlikely to be
much improved.
The paper then proposes some new interpretations
and uses of the Burrows Wheeler transform, with new insights and
approaches to lossless compression, perhaps including techniques
from error correction.
A preprint of this paper is available
from here
Authors - Dr
Peter Fenwick, Dr Mark Titchener, Michelle Lorenz. Presented at Data
Compression Conference DCC-2003.
Abstract The Burrows
Wheeler Transform, first published in 1994, is a relatively new
approach to text compression and has already proven to produce
excellent results. However, there has been much research directed
towards improving the efficiency of the Move to Front algorithm with
varying degrees of complexity. This paper examines a relatively
simple technique using a cached modeling system proposed by Fenwick
and achieves very promising results.
Information loss during the
Burrows Wheeler Transform is also examined using Deterministic
Information Theory measuring techniques developed by Titchener. This
provides an interesting insight into what is actually occurring
during the complete Burrows Wheeler Transform and can confirm
whether manipulating the MTF algorithm is the correct approach to
refining an already successful compression scheme.
Full papers
in PDF
and in PostScript.
(Please tell me if you have problems with these files. On my
computer the Acrobat Reader has some trouble with fonts, but the
hard-copy print is fine.)
Abstract The final coder in Burrows-Wheeler
compression is usually either an adaptive Huffman coder (for speed)
or a complex of arithmetic coders for better compression. This
article describes the use of conventional pre-defined variable
length codes or universal codes and shows that they too can give
excellent compression. The paper also describes a "sticky
Move-to-Front" modification which gives a useful improvement in
compression for most files.
Abstract This paper introduces a new
family of variable length codes for the integers, initially based on
the Goldbach conjecture that every even integer is the sum of two
primes. For an even integer we decompose the value into its two
constituent primes and encode the ordinal numbers of those primes
with an Elias gamma code. The method is then elaborated to handle
odd integers. The paper then develops a more general method of
encoding any integer as the sum of two integers and developing
suitable basis sets of integers. Although the codes which are
generated by these methods are characterised by widely-varying and
unpredictable lengths, they are over some ranges shorter than most
other variable length codes.
For something quite different.... Fibonacci numbers
are the basis of the "Zeckendorf" number representation
which looks like a binary number but has no consecutive 1 bits. (By
sending the Zeckendorf representation least-significant bit first
and following its most significant 1-bit with another 1-bit the
illegal pair 11 acts as terminator. The result is a very good
variable-length code for integers.)
However, can we perform
arithmetic on these Zeckendorf representations? This paper shows
that we can, giving procedures for addition, subtraction (and
complementing), multiplication and division. "(preprint
in PDF format)"
Abstract Recent experience with string
searching or pattern matching algorithms revealed wide discrepancies
between predicted and observed performance. Further observations on
a variety of computers revealed even greater variations between
those algorithms on different computers. These observations are
collected here as an example of problems in real-world comparison of
algorithms.
This is a companion to the following paper.
Abstract
We present a string matching or pattern matching method which is
especially useful when a single block of text must be searched
repeatedly for different patterns. The method combines linking the
text according to digrams, searching on the least-frequent digram,
and probing selected characters as a preliminary filter before full
pattern comparison. Tests on real alphabetic data show that the
number of character comparisons may be decreased by two orders of
magnitude compared with Knuth-Morris-Pratt and similar searching,
but with an initialisation overhead comparable to 5 to 10
conventional searches.
But see also the previous paper..
This paper reviews several previous text compressors with
'symbol ranking' as a unifying theme. It includes the compressor
described in Tech Rep 145.
[A poster was presented
at DCC-98 on the compression of Unicode files, showing that they can
behave rather differently from conventional byte-oriented data.
Experiments with a compressor using 16-bit data show that it often
works better than standard byte compressors for Unicode data.] A
pdf version of the paper is also available.
[A general introduction to
"symbol ranking" text compression and a description of a
proof-of-concept compressor. An extended version of Tech
Rep 132]
[This is the paper which was written as
the definitive summary of my experience with block sorting
compression.]
[This paper describes a new technique for controlling ATM cell
transmission.]
[This paper is the general overview of block
sorting compression.]
[ An account of an attempt to combine
Ziv-Lempel (lossless) compression with Vector Quamtisation (lossy)
compression. It includes some predictions on the limits of LZ
compression.]
[Athough coding the extensions of sources is a
fundamental concept in Information Theory, there seems to be no work
describing what actually happens in practice. This paper arose from
a student assignment which showed that increasing the code extension
can lead to a poorer code! Also available as Tech
Rep 89.]
[A major problem in high-radix
division is estimating the correct digit (and divisor multiple).
While it is easy to get the right answer most of the time, it is
much harder to get it right all of the time. This paper shows that
allowing a bad estimate to be corrected can lead to much simpler
control logic, with very little speed penalty.]
[This paper describes a new data
structure which is especially designed to handle the cumulative
frequency tables of arithmetic compression. See also Tech
Rep 88 & 110.]
And some Technical Reports
[This report presents some speculations on the nature of
the Burrows Wheeler transform, from analogies with the better-known
techniques of signal processing. These analogies, plus
considerations of symmetry and inverse operations, suggest some
possible developments to the Burrows Wheeler transform for lossless
compression. ]
[A symbol ranking text
compressor which is designed for speed rather than good compression.
A software implementation 3.56 bits/byte on the Calgary Corpus,
running at 1 Mbyte/s on a 275MHz DEC Alpha. A version more suited to
hardware implementation compresses to 3.95 bits/byte and should be
able to run at hundreds of megabytes per second.]
[A general discussion of all of
the variable length codings I have managed to discover, followed by
a variant of the Elias gamma code which is useful for some text
compression codings.]
Later note - these codes are not really
that good....
[This describes an introductory implementation of a
text compressor which follows directly from Shannon's 1951 paper on
the information content of English text. For each context, the
compressor prepares a list of symbols, ranked from most-likely to
least likely, and emits the position of the actual symbol in the
ordered list.]
[The final report on block
sorting compression, extending somewhat the results of the ACSC
paper.]
[Further work on the
block sorting compression, intermediate between TR111 and the ACSC
paper.]
[Correction to the S P & E paper above, and to Tech
Rep 88]
My Hall Effect Project
Abstract
Background