Subsections

The Basics of PDL

In Chapter 1 we had a whirlwind tour of the sort of things you can do with PDL. If you are still reading this you must now be convinced that PDL is worth using and learning in more depth. We will now focus on a more measured introduction to the basic concepts and features of the language.

Basic manipulations of piddles

Perhaps the best place to start with PDL is how the data arrays are represented to Perl. We use Perl objects, which makes it extremely powerful, but we will need to explain how piddles relate to everyday Perl variables.

What is a piddle?

PDL uses Perl `objects' to hold piddle data. An `object' is like a user-defined data-type and is a very powerful feature of Perl, PDL creates it's own class of `PDL' objects to store piddles. These look like ordinary Perl variables such as $x, $Foo, $MyData, etc.

Most of the time you can forget about the fact that piddles are objects and treat them like ordinary variables:

  $x = rfits 'file.fits';
  $y = rvals($x);
  $z = $x/$y;
  print sqrt($x+$y+$z);
The only time the distinction becomes important is when creating piddles and using the `=' operator.

Creating piddles

Creation of new piddles has to be done via a PDL function. Otherwise Perl doesn't know the variable has PDL magic.

The simplest method for small piddles is to use the pdl function which can accept numbers and lists of numbers, as well as other piddles:

   $a = pdl (1,2,3,4,5);                  # Create 1D piddle
   $a = pdl ([1,2,3,4,5], [2,3,4,5,6]);   # Create 2D piddle
   $a = pdl 42;                           # Create 0D (scalar) piddle
The numbers are specified using Perl's `[]' list reference syntax. See the help on pdl for more details. It is not normally necessary to worry about converting single numbers (Perl 'scalars') to piddles -- all PDL functions will do this on-the-fly. So you can say:
   $a = pdl (1,2,3,4,5); 
   $b = $a + 2;
without having to care whether `2' is a piddle or not. The result, $b, is a piddle of course.

Of course if you are a heavy-duty PDL user your piddles will be BIG. Thus you will need to use a PDL I/O function to read your data from a file in some format. For example in Chapter 1 we used rfits:

  $a = rfits "fixed_gal.fits";
rfits is a PDL function which reads the data and returns it as a piddle. Another useful function is rcols which reads columns of data from a text file:
  ($a, $b) = rcols 'text.dat';
It is actually surprisingly easy to write these I/O functions -- most of the code for rfits is straight forward Perl. PDL comes with a plethora of I/O functions for reading free-formatted data, raw binary formats and specially defined formats like FITS and GIFs.

Another way of creating BIG piddles is using functions such as zeroes, This creates piddles, filled with zeroes, of whatever dimensions you specify:

   $a = zeroes 10;      # 10 element 1D piddle
   $a = zeroes 100,100; # 100x100 2D piddle
   $a = zeroes 3,4,5,6  # 3x4x5x6 4D piddle
Or you can use another piddle as a template for the dimensionaligy:
   $a = zeroes $b;
Other creation functions like random and rvals work in this way. Table ??? lists the Top Ten ways of creating piddles.

Table???.

Elementary arithmetic.

The following binary perl operators act on piddles:

  + * - / ** > < <= >= == !=  << >> | & ^  ! % <=>
these behave identically to their ordinary Perl equivalents, except they act element-by-element on the whole piddle. e.g.:
  perldl>   $a = pdl (3,2,1,0,-1,-2,-3);
  perldl>   $b= $a * 2; print $b;
  [6 4 2 0 -2 -4 -6]
  perldl>   print $b % 3;
  [0 1 2 0 -2 -1 0]
  
  perldl>   print !$b;
  [0 0 0 1 0 0 0]
  
  perldl>   print $a <=> $b;
  [-1 -1 -1 0 1 1 1]
Additionally the following perl operators, which normally modify the variables in-place (like C), also act in-place on piddles:
  ++  --  +=  -=  *=  /=  |=  &=  ^=  %= 
  perldl> $a = pdl (3,2,1,0,-1,-2,-3);
  perldl> $a++; print $a;
  [4 3 2 1 0 -1 -2]
  
  perldl> $a *= 3; print $a
  [12 9 6 3 0 -3 -6]
  
  perldl> $a |= 3; print $a
  [15 11 7 3 3 -1 -5]
The operators:
  ~  x  .=
have been redefined in PDL to be more useful. The `~' operator produces an matrix transpose in PDL (more useful than the perl XOR meaning, which can be produced other ways). `x' -- also usually a string operator -- implements matrix multiplication. The `.=' operator is used for the PDL assignment operator. (See below)

The standard Perl mathematical functions:

  sqrt sin cos atan2 abs log exp
also act element-by-element on piddles.
   perldl> print sin pdl (3,2,1,0,-1,-2,-3);
   [0.14112001 0.90929743 0.84147098 0 -0.84147098 -0.90929743 -0.14112001]
PDL does of course come with a whole grab-bag of mathematical functions and algorithms which go far beyond what is available in vanilla Perl. You'll come across some more of these later in this book, and there is a big list in Appendix ??.

In-place arithmetic

It is pretty handy in PDL to be able to say $a++ instead of $b = $a+1, the latter generates a second piddle and lots of memory will be consumed if $a is large. $a++ in PDL is known as an operation. The operators such as `+=', `*=', `/=', etc. allow in-place versions of all the standard arithmetic operations to be performed.

But what about functions like sin, sqrt, etc. which we would also like to do in-place? Well as well as saying:

   $b = sin($a);
with piddles it is also possible to say:
   sin(inplace($a));
The inplace function sets a special flag on the piddle that the next operation is to be done in-place, if the operation supports it. All the standard mathematical functions defined by Perl (Section???) can be done in-place.

Using = and .=

Piddles are pretty convenient to use. But one thing to beware of:

   # $a is a piddle
   $b = $a; # Does NOT copy the piddle data
   $b++;    # Changes $a !!!
This is because piddles are represented by Perl objects, $a just holds a reference to the piddle data (kind of like a pointer in C). Saying $b = $a just copies the reference.

A similar effect happens in a subroutine call:

   $a = pdl (1,2,3);
   foo($a);
   print $a; # Answer: [2,3,4]
   
   sub foo {
     my $x = shift;
     $x++;
   }
The subroutine argument is passed by reference -- any changes to $x change $a. This behaviour is actually a good thing for PDL: because we expect piddles to be LARGE we do not want a lot of unnecessary copying.

So how do we get a real copy? The recommended way is to use the pdl function:2.1

   $b = pdl($a);
Once we have element-by-element arithmetic it turns out that as well as simple copying we need a special assignment operator and `.=' plays this role. Since PDL does not use strings we have hijacked this to provide a convenient assigment operator. For example:
 perldl> $b = pdl (1,2,3);
 perldl> $b .= 2
 perldl> print $b
 [2 2 2]
The assignment operates element-by element -- just like the arithmetic operators. You can see that this is the same as $b *= 0; $b += 2; -- only more succinctly and much faster. The assignment operator, like the arithemetic and other operators, also partakes in sophisticated PDL features such as threading and dataflow which we will talk about in Chapter XXX.

Piddles are NOT Perl `arrays'

It is now time to answer a question which has probably been nagging at the back of your mind for a while.

Why bother with piddles? Why not just use normal Perl `arrays'?

By Perl `arrays' we of couse mean entities like @x and @Data z which one would normally create and manipulate like this:

   @x = (1,2,3);
   push @x, 42;
   $y = pop @x;
So why don't we just use Perl `arrays'? Several very good reasons:

  1. It is impossible to manipulate Perl `arrays' arithmetically as one would like. i.e.:
      @y = @x * 2;  # Wrong!
    
    can not be made to operate element by element.

  2. Perl `arrays' are really what are known in computer science as `lists' (and are represented internally by a list data structure). In fact if the PDL-Porters had their evil way they would ban the term `array' from all of the standard Perl documentation and books. This is why the term `piddle' was invented for use in PDL for what we think really are `true arrays'.

  3. Perl lists are intrinsically one-dimensional. You can have `lists of lists' but this is not the same thing as true multi-dimensional arrays. Honest.

  4. Perl lists consume a lot of memory. At least 20 bytes per number, of which only a few are for the actual value. This is because Perl lists are flexible, and can contains text strings as well as numbers. This flexibility requires an internal complex data structure which contains extra information such as a place holder for the number, a place holder for the text and pointers forward and back along the list.

  5. Perl lists are scattered about memory. The list data structure means consecutive numbers are not stored in a neat block of consecutive memory addresses as C and FORTRAN programmers are used to. This makes it difficult to pass the arrays to low-level C and FORTRAN routines for processing -- the numbers must be collected together -a process known as `packing' -- processed and unpacked back into lists. If you have `lists of lists' then it get's even worse.

  6. Perl lists do not support the range of datatypes that piddles do (byte arrays, integer arrays, single precision, double precision, etc.)

That is why PDL does not use Perl lists. Just to be clear from now on we'll always refer to PDL numeric data arrays as `piddles' and Perl-style number/text arrays as `lists'.

The benefits of piddles.

So what we have done is essentially implemented a new kind of data structure, the piddle, in Perl to hold large chunks of numerical data which we can process all at once.

The piddle data is held in a large chunk of consecutive memory addresses, which means it can be passed easily to C or FORTRAN code as if it was a C or FORTRAN array.

We have written a lot of C code to do all the obvious stuff, such as element-by-element arithmetic, for multiple datatypes, and this is all built into PDL. Thus when you say:

   # $a,$b and $c are piddles
   $b = $a+1;
   $a++;
   $c = 1+$a**2/($b-2);
you have it done on all elements, even if $a, $b, $c contain millions of them. Furthermore this arithmetic proceeds at C/FORTRAN speed.2.2 $b=$a+1 gets interpreted one, but might get executed millions of times in a compiled C loop. In short PDL is a vectorised language.

Lists and Hashes of piddles

Of course all this is not to say that Perl lists do not have tbeir uses. Since piddles are singular objects we can make lists of them:

  @mylist = (rfits("my.fits"), zeroes(3,3), pdl(32));
This can be very convenient for many kinds of processing. For example one common PDL idiom is instead of:
  ($x,$y,$z) = rcols 'datafile'; # Read columns of data
one can say:
  @x =  rcols 'datafile'; # Read columns of data
Then one can manipulate the individual piddles as $x[0], $x[1] etc. (Each of which can have completely different dimensions).

The other major data struture in standard Perl is the hash (also known as an associative array. These can also contain many piddles, this time indexed by a text string rather than position in a list. For example a face recognition program might want to store it's images this way:

  %face = ();
  $face{"Jim"} = rpic('jim.jpg');
  $face{"Fred"} = rpic('fred.jpg');
  $face("Default"} = zeroes(256,256);
So you can see that the combination of piddles, lists and hashes can be quite useful, albeit a tad confusing at times!

PDL Datatypes

Standard Perl only offers long integers and double-precision floating point as datatypes. This is insufficient for efficient numerical processing -- both speed and memory consumption usually requires a number representation be only as accurate as required to represent the data an no more. For example if you are dealing with standard image formats like GIF or JPEG you might want only an 8-bit representation for each colour -- in this case you then want to represent each number with a single byte.

For this reason PDL provides a whole bunch of datatypes. We can have piddles containing any of these:


Datatype PDL function Bytes/ Numerical range
    element  
Byte byte() 1 0-255
unsigned short integer ushort() 2 0-65535
short integer short() 2 -32768 - 32767
long integer long() 4 -2147483648 - 2147483647
IEEE single-precision float() 4 XXXX
floating point      
IEEE double-precision double() 8 XXXX
floating point      


Each of these corresponds to a C primitive datatype used in the low-level PDL code. The philosophy in PDL is that the types are machine independent -- this means that a long integer is always 4 bytes in PDL across all types of computer/OS. Definitions are set up apporopriately at the C level to handle this.

Type conversion and creation

The functions listed in Table ??? provide type conversion:

   # If $a is any piddle 
   
   $b = float $a;  # $b is a 'float' piddle
   $b = byte $a;   # $b is a 'byte' piddle
and piddle creation with a specified type:
   $b = ushort (1,2,3);           # $b is a 1-D 'ushort' piddle 
   $b = double ([1,2,3],[4,5,6]); # $b is a 2-D 'double' piddle
   $b = long 42;                  # $b is a 0-D (scalar) 'long'  piddle
These use the same `[]' list-reference syntax as the pdl function. (Note that the pdl function defaults to double-precision just like Perl.)

Finally the type functions also function without arguments as tokens when creating arbitrary size piddles. For example to create a 512x512x512 byte data cube filled with zeroes:

   $a = zeroes(byte, 512,512,512); # Note the "," after "byte"!
This array consumes 128Mb of memory on Karl's laptop. If the byte token was omitted PDL would try to create the piddle as double precision -- being 8 times bigger this would be far too much for the laptop's memory! 2.3

Complex types

For a time a long debate reigned over where PDL should support types of complex numbers as well. One issue was that standard C, on which PDL is built, does not have a primitive complex number type. Because of this C routines in numerical libraries which handle complex arrays take the real and imaginary components as seperate arguments. As PDL needs to use these routines and work efficiently we take the same approach -- PDL functions which deal with complex numbers use seperate real and imaginary arrays.

A good example of this is the PDL::FFT module for dealing with Fourier Transforms. One says:

   fft $real, $imag;
where the arguments are piddles. The PDL::FFT module also provides cmul and cdiv functions for complex arithmetic.

PDL::Complex module ??? XXXX

PDL projection operators

We've now encountered the basics of PDL -- element by element arithmetic. It is time to look at more advanced features which involve combining elements together inside a piddle. Perhaps the next most common class of operations after arithmetic are those known as projection operations which involve combing elements along axes.

One example of such an operator is summation. The sumover function implements this. To sum along rows:

 perldl> $a = sequence (2,3);
 perldl> print $a;
 [
  [0 1]
  [2 3]
  [4 5]
 ]
 perldl> $b = sumover $a; # $b is now summed along axis 0
 perldl> print $b;
 [1 5 9]
sumover is paired with the related sum function which sums over all elements to form a scalar:
 perldl> print sum $a 
 15
 perldl> print sum $b  # Should be the same of course!
 15
The rest of the projection operators follow this basic pattern, for example maximum and max:
 perldl> print maximum $a
 [1 3 5]
 perldl> print max $a    
 5
One can think of this as maximum projecting from N to N-1 dimensions and max projecting from N to 0 dimensions Here is a complete Table of those in standard PDL (of course you are free to define your own):


Operation Dimensions Projection
  $N \rightarrow N-1$ $N \rightarrow 0$
Pixel sum sumover sum
Pixel average average avg
Pixel minimum minimum min
Pixel maximum maximum max
Pixel median medover median
Pixel `odd median' oddmedover oddmedian


What about projecting along other dimensions?

You might now be asking yourself: what about summing along columns as well as rows? Or more generally projecting along a different axis to the first? Are there special functions for this or are there special options to the projection functions?

The answer is NO. It works much more elegantly than that!

PDL comes with special dimensions manipulation functions. We simply use sumover (or whatever) with a dimension operation which brings the required dimension to the front. For example to sum along columns:

  $a = random(10,3);
  $b = sumover(mv($a,1,0)); # $b has dims 10.
This uses the mv function to transform $a from 10 by 3 to a 3 by 10 piddle. This is done `virtually' -- i.e. no more memory is used. This is an extremely powerful feature of PDL -- functions like sumover only need to be written to handle the dimensions they care about, PDL with it's dimension operators can handle the rest.

`Threading' over extra dimensions.

The projection operators are written in PDL as functions of 1-D vectors -- they project along the vector.

You will notice:

 $a = sumover random(10);        # $a is 0-D: 1     element
 $a = sumover random(10,9);      # $a is 1-D: 9     elements
 $a = sumover random(10,9,8);    # $a is 2-D: 9x8   elements
 $a = sumover random(10,9,8,7);  # $a is 3-D: 9x8x7 elements
i.e. even though sumover is written to handle 1-D piddles if supplied with extra dimensions sumover will automatically loop over the extras. Moreover this iteration happens at the compiled C level, so it is extremely very fast.

This feature of implicit looping over extra dimensions is known as `threading over the dimensions'2.4. Here is another example:

 $a = random (10,20,30);
 $b = random (10,20);
 
 $c = $a + $b;  # $c is 10x20x30
Yes the `+' function is threaded. This means you can do incredibly useful things like add images to datacubes and have the operation repeat for each image slice in the cube. With no special syntax.

You have of course seen before:

  $a = random (10);
  $b = $a + 2;
But perhaps you now realise this is again a case of threading. Here the 0-D piddle `2' is threaded over the 1-D vector $a. The previous example is just a higher-dimensional analogy.

We will look in more detail at the rules regarding threading in Chapter 4. For now we note that implicit threading works pretty much as you might expect:

  1. Extra dimensions of size one are implicitly added on the end of all piddles to match the piddle with the biggest number of dimensions.

  2. Each dimension threaded over must either be of size one or of a size matching the biggest piddle.

  3. Dimensions of size one are continuously repeated during the thread loop.

Thus in:

  sequence(10,20,3) + sequence(10) + sequence(10,20);
This becomes:
  sequence(10,20,3) + sequence(10,1,1) + sequence(10,20,1);
and the result is as if the data was repeated along the unit dimensions, i.e. like:
  sequence(10,20,3) + sequence(10,20,3) + sequence(10,20,3);
We'll restate those rules more formally in Chapter XXX -- and also look at a way to make your own rules called äexplicit threading. But the easiest way to find out how threading works is to experiment.

Before we get too deeply involved in threading and how to write threaded functions we should look in more detail at the other half of the game: the variety of operations available in PDL for manipluating piddle dimensions. Chapter 3 will go into this in detail.

Example Problem

WHAT??? `

lapeyre 2006-07-23