``Why is it that we entertain the belief that for every purpose odd numbers are the most effectual?'' - Pliny the Elder.
The PDL project began in February 1996, when I decided to experiment with writing my own `Data Language'. I am an astronomer. My day job involves a lot of analysis of digital data accumulated on many nights observing on telescopes around the world. Such data might for example be images containing millions of pixels and thousands of images of distant stars and galaxies. Or more abstrusely, many hundreds of digital spectra revealing the secrets of the composition and properties of these distant objects.
Obviously many astronomers before have dealt with these problems, and a large amount of software has been constructed to facilitate their analysis. However, like many of my colleagues, I was constantly frustrated by the lack of generality and flexibility of these programs and the difficulty of doing anything out of the ordinary quickly and easily. What I wanted had a name: `Data Language', i.e. a language which allowed the manipulation of large amounts of data with simple arithmetic expressions. In fact some commercial software worked like this, and I was impressed with the capabilities but not with the price tag. And I thought I could do better.
As a fairly computer literate astronomer (read `nerd' or `geek' according to your local argot) I was very familiar with `Perl', a computer language which now seems to fill the shelves of many bookstores around the world. I was impressed by its power and flexibility, and especially its ease of use. I had even explored the depths of its internals and written an interface to allow graphics (the PGPLOT module). The ease with which I could then create charts and graphs for my papers was refreshing.
Version 5 of Perl had just been released, and I was fascinated by the new features available, especially the support of arbitrary data structures (or `objects' in modern parlance) and the ability to `overload' operators -- i.e. make mathematical symbols like +-*/ do whatever you felt like. It seemed to me it ought to be possible to write an extension to Perl where I could play with my data in a general way: for example, using the maths operators to manipulate whole images at once.
Well, one slow night at an observatory I just thought I would try a little experiment. In a bored moment I fired up a text editor and started to create a file called `PDL.xs' -- a Perl extension module to manipulate data vectors. A few hours later I actually had something half decent working, where I could add two images in the Perl language, fast! This was something I could not let rest, and it probably cost me one or two scientific papers' worth of productivity. A few weeks later the Perl Data Language version 1.0 was born. It was a pretty bare infant: very little was there apart from the basic arithmetic operators. But, encouraged, I made it available on the Internet to see what people thought.
Well, people were fairly critical -- among the most vocal were Tuomas Lukka and Christian Soeller. Unfortunately for them they were both Perl enthusiasts too and soon found themselves improving my code to implement all the features they thought PDL ought to have and I had heinously neglected. PDL is a prime example of that modern phenomenon of authoring large free software packages via the Internet. Large numbers of people, most of whom have never met, have made contributions ranging from core functionality to large modules to the smallest of bug patches. PDL version 2.0 is now here (though it should perhaps have been called version 10 to reflect the amount of growth in size and functionality) and the phenomenon continues.
I firmly believe that PDL is a great tool for tackling general problems of data analysis. It is powerful, fast, easy to add to and freely available to anyone. I wish I had had it when I was a graduate student! I hope you too will find it of immense value, and I hope it will save you heaps of time and frustration in solving complex problems. Of course it can't do everything, but it provides the framework, the hammers and the nails for building solutions without having to reinvent wheels or levers.
- Karl Glazebrook, Sydney, Australia. 4/March/1999
This book is for anybody who has to work on large volumes of data, for mathematical analysis or visualisation. You could be a scientist or an engineer working on research or design problems, a businessman trying to make sense of vast numbers of stock indices, a statistician crunching numbers and trying to figure out trends or a web site manager monitoring page accesses.
You know the benefits of being able to work in a very high-level language such as Perl (or TCL or Python), and you can see how much easier it would be if you could take the same approach to your data analysis rather than groping around in low-level C or FORTRAN. You may have had experience with the high-level approach in commercial packages and wish for something similar which is free, public domain and Open Source so you can share your code with colleagues.
We are not assuming that you are a Perl expert; rather we hope you will pick up enough of a smattering of Perl in the early chapters to try out (and hopefully be impressed by!) the tutorials and examples. We hope this will inspire you to go out and learn more about Perl if you don't know it already. (We mention some books below.)
If you are a Perl aficionado and you work on these types of problems, then PDL is definitely for you! Read on...
The reason is that, though subroutines are available, low-level languages still force a lot of complexity on you. You must manage memory yourself, declare variables however trivial, call subroutines with a whole bunch of arguments in case just one of them is needed, etc. And you must be able to pull together separate subroutine libraries to do file input/output, user interaction, data processing and graphics.
Whereas all you really want to do is tell the computer things like `read this', `Fourier transform that', and `plot this', and have it be smart enough to do the right thing. What you are wishing for is in effect a high-level language; in this case it is called `English'.
While natural language understanding is still quite a long way off, high-level computer languages are currently proliferating. Examples include Perl, TCL, JavaScript, Visual Basic, Python, and many more. Such systems have also been developed for data processing. Worthy of note are commercial software such as IDL (`Interactive Data Language' from Research Systems Inc.), MATLAB (from The Mathworks, Inc.) and the public domain program Octave. These implement special-purpose high-level languages where data is handled in large chunks, via `vector operations'.
What does this mean in practice? It means if you say:
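As an illustration of such a statement, here is a minimal sketch written in PDL itself rather than in IDL, MATLAB or Octave. The variable names and the tiny 3x3 `images' are hypothetical, chosen only to keep the example short, and the sketch assumes the PDL module is installed.

```perl
use strict;
use warnings;
use PDL;

# Two small 3x3 "images" (made-up data, for illustration only).
my $image1 = sequence(3, 3);     # pixel values 0..8
my $image2 = ones(3, 3) * 10;    # every pixel set to 10

# A single expression adds the two images pixel by pixel;
# no explicit loop over rows and columns is written.
my $sum = $image1 + $image2;

# Look at one pixel: column 2, row 1 holds 5 + 10.
print $sum->at(2, 1), "\n";      # prints 15
```

The single `+' acts on every element of both arrays at once, which is what is meant by handling data in large chunks.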
Having used these systems ourselves, we, the authors of PDL, can attest to the superiority of that approach in terms of plain getting things done. We of course believe that PDL is now better than all those systems, for quite a few reasons, and that your life will be easier if you get it and use it.
The free software community has taken off to an extraordinary extent in the last few years. This has been most vivid in the success of Linux, a free UNIX-like Operating System. Sometimes this movement is also described as `Open Source' rather than `free,' and the term `free' is often used to mean freedom of use rather than freedom from price. Although much of the code is indeed free/public domain, money is made out of the sale of packaged distributions, support, books, etc. Nevertheless the software is usually available at minimal cost.
One key point is that the source code is available, so that however the software is obtained, one can in principle change it to do whatever is required.
How is this relevant to data languages? The authors of PDL are all scientists. We write, obviously, as scientists but believe our ideas are directly relevant to all users of PDL. The scientific community has for hundreds of years believed in the free exchange of ideas. It has been traditional to publish full details about how research is done openly in journals. This is very close in spirit to the ideas behind free software. These days much of what scientists do involves software; in fact, large software packages to facilitate certain kinds of analysis are often the subject of major papers themselves, with the software being freely available on the Internet. Such software is commonly written in C or FORTRAN to allow general use.
Why aren't they working at a higher level? As we explained above, this would allow faster creation and make the software more portable and more easily customisable. Well, in our view one of the reasons this has not happened is the lack of a suitable free high-level data-centric language with powerful enough facilities.
This is not just a minor point; it is critical. Even if software is not published and is for internal use among a team of researchers, in the modern world the team is often distributed among dozens of individuals across many institutes and nations. The only way to ensure that all will be able to use software is if it is freely available. All the PDL authors have had direct experience with this problem in the past. We have often been hindered in sharing our code by collaborators lacking access to software.
Moreover, scientific work often involves extensive innovations and modifications to old ways of doing things. For software, as well as being freely available, it is critical to have access to the source code to permit easy customisation.
Finally there is also the issue of cost. Equivalent commercial packages cost several thousand dollars per workstation. We are not anti-commercial; these packages are very powerful and useful. However we certainly think there should be something like PDL that anybody can use and develop for free. Science is a worldwide activity and we like to think that anybody with a PC could use PDL to do research and analysis.
In our view PDL -- a free, public domain, Open Source, data language -- meets a great need. Today it is openly developed by a group of several dozen people collaborating via the Internet. Anybody with time, expertise or dedication can contribute to improving PDL.
So we chose Perl as our implementation language. Our basic data language extensions could have been built around quite a few high-level languages, so why did we choose Perl?
Perl, of course, fills all these constraints most admirably. Perhaps the runner-up would be TCL, though the lack of a consistent object-oriented framework is a problem for TCL. Of course we just said Python was too object-oriented; this is not a contradiction -- in our view Perl gets it just right! Perl also has the singular advantage of being widely used and having a huge collection of well organised modules, publicly distributed worldwide on the CPAN network of Internet sites. For us, as scientists who already used Perl a lot for day-to-day programming tasks, PDL means we can do just about everything in Perl. Such integration is extremely productive.
OK, enough advocacy; you are still reading, so let us get on with the task in hand. So what can the reader expect to get from this book?
This book is intended to be a complete introduction to PDL. We believe the best way to learn something useful is to learn by doing. So we kickstart the book with some examples of real use of PDL to rapidly show the reader what it is all about.
Then we go through the features of PDL systematically, showing how to use them and drawing on example problems from a range of scientific disciplines to give a sense of how real problems can be solved with PDL. We look at PDL's graphics capabilities and how they can be used to visualise problems.
The further in you go, the more technical the book will become: we will look at the feature set and the internals and show how to use advanced features such as modules, dataflow and Object-Oriented Programming.
Finally the book concludes with a demonstration of the power of PDL: we show how clever use of PDL can achieve amazing results in only a few lines of code. Deconstructing these is used as a tool to show how better use can be made of PDL.
This book is not about teaching Perl, although we hope that even if you don't know anything you will learn pidgin Perl as you read through the first few chapters. Perl is a pretty good `learn as you go along' language. For a more formal introduction to Perl the following books are recommended:
`Learning Perl', by Randal Schwartz & Tom Phoenix. Published by O'Reilly and Associates, 2001.
For more advanced usage of Perl we recommend:
`Programming Perl', by Larry Wall, Tom Christiansen & Jon Orwant. Published by O'Reilly and Associates, 2000.
This book is not about building Graphical User Interfaces using widgets, though Perl is quite adept at this and we will show at least one example of combining PDL and a GUI done in perl/gTk. For widgets perhaps a book to try is:
`Learning Perl/Tk', by Nancy Walsh. Published by O'Reilly and Associates, 1999.
This book is not about algorithms for analysing data, though PDL is full of them. For a deep mathematical discussion of fitting, Fourier transforming, sorting, inverting and innumerable other useful and hideous things to do to your data the eternal best book is:
`Numerical Recipes in C : The Art of Scientific Computing' (and similarly `Numerical Recipes in FORTRAN : The Art of Scientific Computing'), by William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery. Published by Cambridge University Press, 1993.
This book is not about Perl advocacy. You have got some of that here in the introduction! For the rest we will let the language speak for itself.
This book assumes at least the following versions of software:
The following typographic conventions are used in this book: