20161216

(R)creational Common Lisp


In my real work, I (or better, my students and colleagues) use R to program our data analyses (if you are interested, you can check out the TRONCO package.)

Of course, being a Common Lisp quite-high priest at this point and a recreational programmer in it, I wondered how to render several R constructs in the language with parentheses.  It turns out that it was quite fun to re-engineer some of the key features of the R language.

This is not the first such project trying to mimic R.  I know at least two of them; the most public one being Common Lisp Stat, but, being known for suffering from NIH and RTW syndromes, I had to jump in.

Here is a short summary of what I did and what was achieved.

Generic Math, Arrays and Vectors

R provides generic math on vectors.  Pretty much everything in R is a vector (stored in column major order; more on this later) including arrays.  The generic math helps you writing things like
> 42 * 1:5 # The ':' operator produces a vector.
[1]  42  84 126 168 210

Indexing

The generic math and the construction of some operators in CL is pretty much straightforward.  Where things get hairy is in the handling of the several ways that a R programmer may use to index, slice and dice vectors and arrays. As an example, you could write
> a = array(11:14, dim = c(2, 2), dimnames = list(list("a", "b"), list("x", "y")))

> a
   x  y
a 11 13
b 12 14

In brief, you have a matrix with "named" rows and columns. There are other idiosyncrasies but this is already something interesting to deal with, as you can now write things like:
> a[1, "y"]
[1] 3

> a["b", ]
 x  y 
12 14

The second "indexing" selects, or slices, the entire second row (counting from 1) and, as a result, returns a vector with the column names retained.

Indexing can also become interesting because of other idiosyncrasies in R.  Consider the following:
> a[2]
[1] 12

> a[42]
[1] NA # 'NA' is the 'not available' indicator.

> a[c(2, 3, 4, 10)]
[1]  12  13  14 NA

Indexing by only one index accesses the underlying vector in column major order. If the index is beyond the current limit, the value NA ('not available') is returned.  If the sole index is a vector, then the indexing returns a vector of the values in the array accessed column major order.

Indexing and Assignments

Of course, once you handle slicing and dicing of arrays, you have to deal with setting these slices and dice.

In R you can have a lot of what you can dream of (and something more, some of which, you wished was not there).

Basic assignment on vectors and arrays works as expected. Consider:
> a = array(11:19, dim = c(3, 3), dimnames = list(list("a", "b", "c"), list("x", "y", "z")))

> a
   x  y  z
a 11 14 17
b 12 15 18
c 13 16 19

> a[2, "y"] = 42

> a
   x  y  z
a 11 14 17
b 12 42 18
c 13 16 19

So far, nothing special is happening. Things actually make sense when trying to set a sub-slice of the array.
> b = array(42, dim = c(2, 2))

> b
     [,1] [,2]
[1,]   42   42
[2,]   42   42

> a[2:3, 2:3] = b

> a
   x  y  z
a 11 14 17
b 12 42 42
c 13 42 42

However, when setting a sub-slice of an array to a vector, or setting the elements of an array using a single vector index (i.e., when accessing the array in column major order), things may become a bit surprising.
> a[2:3, 2:3] = c(1, 1, 1, 1)

> a # This is not a surprise.
   x  y  z
a 11 14 17
b 12  1  1
c 13  1  1

> a[2:3, 2:3] = c(1, 2, 3, 4, 5, 6) # This is also expected (modulo the 'multiple'.)
Error in a[2:3, 2:3] = c(1, 2, 3, 4, 5, 6) : 
  number of items to replace is not a multiple of replacement length

> a[c(1, 2, 6, 7)] = c(111, 111, 111, 111) # Ok, understandable.
> a
    x   y   z
a 111  14 111
b 111   1   1
c  13 111   1

> a[c(1, 2, 12, 7)] = c(1, 1, 1, 1)

> a # What?!?
 [1]   1   1  13  14   1 111   1   1   1  NA  NA   1

Ok, the last example shows how R manages assigning past the limit of the underlying vector.

Not really a problem, once you get the hang of it, but, source of a bit of confusion nevertheless, at least to yours truly.

In any case, all of this must be dealt with once you want to get into building some facilities like these on top of Common Lisp.

Let's "Reinvent the Wheel" (RTW) in Common Lisp

Getting the basics down was not so difficult.  Common Lisp has a somewhat richer set of array and vector data types than R, however, some design decisions resulted in some fun (albeit probably sub-optimal) setup.

Here is a list of decisions taken, which may be subject to debate.

  • Objects are represented as structures.
  • Constructors follow the R conventions (no make-array!)
  • Vectors and arrays are mapped to CL vectors and arrays.
    This means that
    • Arrays are kept as multi-dimensional entities which are accessed in row major order
    • Vector and array "objects" are separate and different.
  • A generic reference function AT is introduced (and its counterpart, (SETF AT).)

Of course, these choices will make porting code from R to the CL library/tool quite cumbersome; let's say that this goal was dropped early on. The CL library (which, by the way, is named ρ, that is: RHO) will look familiar to a R programmer, but it will not be the same. After all, is it Common Lisp, isn't it?

Here is a run down of the basic features.  This was the hard part, in the opinion of the author.  Getting more of R implemented will be somewhat easier.

The AT Generic Function

The AT generic function works as expected on the standard CL objects (as it should).
RHO 10 > (defvar qwerty "qwerty")
QWERTY

RHO 11 > (setf (at qwerty 3) #\R)
#\R

RHO 12 > qwerty
"qweRty"

RHO 13 > (at qwerty 4)
#\t

RHO 14 > (defvar a #2A((1 2 3) (4 5 6) (7 8 9)))
A

;;; ...

RHO 16 > (at a 1 1)
5

RHO 17 > (setf (at a 0 0) 42)
42

RHO 18 > a
#2A((42 2 3) (4 5 6) (7 8 9))

The generic function AT also works on structures and class instances.
RHO 22 > (defstruct foo a s d)
FOO

RHO 23 > (defvar foo (make-foo :s 42))
FOO

RHO 24 > (at foo 'foo-s) ; In this case the accessor function FOO-S is invoked.
42

RHO 25 > (setf (at foo 'foo-d) "Food")
"Food"

RHO 26 > foo
#S(FOO :A NIL :S 42 :D "Food")

So far, so good.  Now let's have some fun.

Creating Arrays (and Vectors and Matrices)

Let's create some arrays, vectors and matrices.
RHO 28 > (setf rr (range 1 9)) ; This would be 'rr = 1:8' in R.
  1  2  3  4  5  6  7  8


RHO 29 > (setf rv (vector 1 22/23 -3.2 42 5.8))
      1  22/23   -3.2     42    5.8

RHO 30 > (setf rv2 (vector* (list 1 22/23 -3.2 42 5.8)
                            :names '(a b c d e)))
      A      B      C      D      E
      1  22/23   -3.2     42    5.8

The objects returned by VECTOR and VECTOR* (the full constructor) are wrappers around the underlying CL vectors.  Their class is RHO:VECTOR.  Note also that *PRINT-READABLY* is set to NIL, therefore, a "pretty printing" of the vectors is produced.
RHO 37 > (setf ra (array (range 1 9) :dim '(3 3)))
      [, 0] [, 1] [, 2]
[0, ]     1     2     3
[1, ]     4     5     6
[2, ]     7     8     1

RHO 38 > (at ra 1 1)
5

The function RANGE excludes the upper limit (following CL conventions), while the ARRAY constructors "recycles" the values passed to it (following R conventions).  The value produces is pretty-printed, while its class is RHO:ARRAY.  Indexing is 0-based, as in CL.

Creating matrices is straightforward and, again, a R-like function is used.
RHO 3 > (setf rm (matrix '(1 1 2 0) :nrow 2 :ncol 2))
      [, 0] [, 1]
[0, ]     1     1
[1, ]     2     0

RHO 4 > (at rm 0 1)
1

RHO 5 > (at rm 1 0)
2

Slicing and Dicing in RHO

Now that we have all the bases covered, we can proceed to some "slicing and dicing".  All the R indexing and assignment methods are provided in RHO, modulo the Common Lisp background choices, bugs and unimplemented features; not many of the last kind.

First of all, let's set some dimension names...
RHO 7 > (setf (dimension-names ra) '(("a" "b") ("x" "y")))
(("a" "b") ("x" "y"))

RHO 8 > ra
      [, x] [, y] [, 2]
[a, ]     1     2     3
[b, ]     4     5     6
[2, ]     7     8     1

RHO 9 > (at ra 1 "y")
5

Next we create a new array (a matrix, but in R, matrices are separate subclasses of arrays) rb and use it as a value to set a sub-matrix of ra.
RHO 13 > (setf rb (array 42 :dim '(2 2)))
      [, 0] [, 1]
[0, ]    42    42
[1, ]    42    42

RHO 29 > (setf (at ra (vector 1 2) (vector 1 2)) rb)
#2A((42 42) (42 42))

RHO 15 > ra
      [, x] [, y] [, 2]
[a, ]     1     2     3
[b, ]     4    42    42
[2, ]     7    42    42

As you can see, everything works as expected.

Remember that we have built a matrix rm; let's use it to set some elements of the array ra, according to the R "access-by-matrix" semantics.
RHO 16 > rm
      [, 0] [, 1]
[0, ]     1     1
[1, ]     2     0

RHO 24 > (type-of rm)
MATRIX

RHO 27 > (setf (at ra rm) 1024)
1024

RHO 28 > ra
      [, x] [, y] [, 2]
[a, ]     1     2     3
[b, ]     4  1024    42
[2, ]  1024    42    42

RHO 29 > (setf (at ra #(1 2) #(1 2)) #(1 1 1 1))
#(1 1 1 1)

RHO 30 > ra
      [, x] [, y] [, 2]
[a, ]     1     2     3
[b, ]     4     1     1
[2, ]  1024     1     1

Note however, that, although "correct", no attempt to signal an error is made for incongruous dimensions.
RHO 42 > (setf (at ra #(1 2) #(1 2)) #(11 12 13 14 15))
#(11 12 13 14 15)

RHO 43 > ra
      [, x] [, y] [, 2]
[a, ]     1     2     3
[b, ]     4    11    12
[2, ]  1024    13    14

Getting the indexing of different combinations of scalars, vectors, and arrays was an interesting exercise. The implementation is probably bloated and not easily understandable; most probably because of the design choice of keeping the underlying CL vectors and arrays "as-they-are", especially multidimensional arrays.

Missing Things...

Many...
Most important one is "negative" indexing, which does not work completely yet.
Of course, I am limiting this comment to arrays and vectors: other R data structures, such as factors and data.frames are in the making, but all the heavy lifting is now done.

Want to Play?

Well, for the time being, drop me a line (you know where to find me).  We can talk about setting up a proper project and discuss some design choices.


Cheers