In my real work, I (or better, my students and colleagues) use R to program our data analyses (if you are interested, you can check out the TRONCO package.)
Of course, being a Common Lisp quite-high priest at this point and a recreational programmer in it, I wondered how to render several R constructs in the language with parentheses. It turns out that it was quite fun to re-engineer some of the key features of the R language.
This is not the first such project trying to mimic R. I know at least two of them; the most public one being Common Lisp Stat, but, being known for suffering from NIH and RTW syndromes, I had to jump in.
Here is a short summary of what I did and what was achieved.
Generic Math, Arrays and Vectors
R provides generic math on vectors. Pretty much everything in R is a vector (stored in column major order; more on this later) including arrays. The generic math helps you writing things like
Indexing can also become interesting because of other idiosyncrasies in R. Consider the following:
> 42 * 1:5 # The ':' operator produces a vector.
[1] 42 84 126 168 210
Indexing
The generic math and the construction of some operators in CL is pretty much straightforward. Where things get hairy is in the handling of the several ways that a R programmer may use to index, slice and dice vectors and arrays. As an example, you could write> a = array(11:14, dim = c(2, 2), dimnames = list(list("a", "b"), list("x", "y"))) > a x y a 11 13 b 12 14In brief, you have a matrix with "named" rows and columns. There are other idiosyncrasies but this is already something interesting to deal with, as you can now write things like:
> a[1, "y"] [1] 3 > a["b", ] x y 12 14The second "indexing" selects, or slices, the entire second row (counting from 1) and, as a result, returns a vector with the column names retained.
Indexing can also become interesting because of other idiosyncrasies in R. Consider the following:
> a[2]
[1] 12
> a[42]
[1] NA # 'NA' is the 'not available' indicator.
> a[c(2, 3, 4, 10)]
[1] 12 13 14 NA
Indexing by only one index accesses the underlying vector in column major order. If the index is beyond the current limit, the value NA
('not available') is returned. If the sole index is a vector, then the indexing returns a vector of the values in the array accessed column major order.Indexing and Assignments
Of course, once you handle slicing and dicing of arrays, you have to deal with setting these slices and dice.
In R you can have a lot of what you can dream of (and something more, some of which, you wished was not there).
Basic assignment on vectors and arrays works as expected. Consider:
So far, nothing special is happening. Things actually make sense when trying to set a sub-slice of the array.
However, when setting a sub-slice of an array to a vector, or setting the elements of an array using a single vector index (i.e., when accessing the array in column major order), things may become a bit surprising.
Not really a problem, once you get the hang of it, but, source of a bit of confusion nevertheless, at least to yours truly.
In any case, all of this must be dealt with once you want to get into building some facilities like these on top of Common Lisp.
Of course, these choices will make porting code from R to the CL library/tool quite cumbersome; let's say that this goal was dropped early on. The CL library (which, by the way, is named ρ, that is:
Here is a run down of the basic features. This was the hard part, in the opinion of the author. Getting more of R implemented will be somewhat easier.
The
The AT generic function works as expected on the standard CL objects (as it should).
In R you can have a lot of what you can dream of (and something more, some of which, you wished was not there).
Basic assignment on vectors and arrays works as expected. Consider:
> a = array(11:19, dim = c(3, 3), dimnames = list(list("a", "b", "c"), list("x", "y", "z"))) > a x y z a 11 14 17 b 12 15 18 c 13 16 19 > a[2, "y"] = 42 > a x y z a 11 14 17 b 12 42 18 c 13 16 19
So far, nothing special is happening. Things actually make sense when trying to set a sub-slice of the array.
> b = array(42, dim = c(2, 2)) > b [,1] [,2] [1,] 42 42 [2,] 42 42 > a[2:3, 2:3] = b > a x y z a 11 14 17 b 12 42 42 c 13 42 42
However, when setting a sub-slice of an array to a vector, or setting the elements of an array using a single vector index (i.e., when accessing the array in column major order), things may become a bit surprising.
> a[2:3, 2:3] = c(1, 1, 1, 1) > a # This is not a surprise. x y z a 11 14 17 b 12 1 1 c 13 1 1 > a[2:3, 2:3] = c(1, 2, 3, 4, 5, 6) # This is also expected (modulo the 'multiple'.) Error in a[2:3, 2:3] = c(1, 2, 3, 4, 5, 6) : number of items to replace is not a multiple of replacement length > a[c(1, 2, 6, 7)] = c(111, 111, 111, 111) # Ok, understandable. > a x y z a 111 14 111 b 111 1 1 c 13 111 1 > a[c(1, 2, 12, 7)] = c(1, 1, 1, 1) > a # What?!? [1] 1 1 13 14 1 111 1 1 1 NA NA 1Ok, the last example shows how R manages assigning past the limit of the underlying vector.
Not really a problem, once you get the hang of it, but, source of a bit of confusion nevertheless, at least to yours truly.
In any case, all of this must be dealt with once you want to get into building some facilities like these on top of Common Lisp.
Let's "Reinvent the Wheel" (RTW) in Common Lisp
Getting the basics down was not so difficult. Common Lisp has a somewhat richer set of array and vector data types than R, however, some design decisions resulted in some fun (albeit probably sub-optimal) setup.
Here is a list of decisions taken, which may be subject to debate.
- Objects are represented as structures.
- Constructors follow the R conventions (no
make-array
!) - Vectors and arrays are mapped to CL vectors and arrays.
This means that
- Arrays are kept as multi-dimensional entities which are accessed in row major order
- Vector and array "objects" are separate and different.
- A generic reference function
AT
is introduced (and its counterpart,(SETF AT)
.)
Of course, these choices will make porting code from R to the CL library/tool quite cumbersome; let's say that this goal was dropped early on. The CL library (which, by the way, is named ρ, that is:
RHO
) will look familiar to a R programmer, but it will not be the same. After all, is it Common Lisp, isn't it?Here is a run down of the basic features. This was the hard part, in the opinion of the author. Getting more of R implemented will be somewhat easier.
The AT
Generic Function
RHO 10 > (defvar qwerty "qwerty")
QWERTY
RHO 11 > (setf (at qwerty 3) #\R)
#\R
RHO 12 > qwerty
"qweRty"
RHO 13 > (at qwerty 4)
#\t
RHO 14 > (defvar a #2A((1 2 3) (4 5 6) (7 8 9)))
A
;;; ...
RHO 16 > (at a 1 1)
5
RHO 17 > (setf (at a 0 0) 42)
42
RHO 18 > a
#2A((42 2 3) (4 5 6) (7 8 9))
The generic function AT
also works on structures and class instances.RHO 22 > (defstruct foo a s d)
FOO
RHO 23 > (defvar foo (make-foo :s 42))
FOO
RHO 24 > (at foo 'foo-s) ; In this case the accessor function FOO-S is invoked.
42
RHO 25 > (setf (at foo 'foo-d) "Food")
"Food"
RHO 26 > foo
#S(FOO :A NIL :S 42 :D "Food")
So far, so good. Now let's have some fun.Creating Arrays (and Vectors and Matrices)
Let's create some arrays, vectors and matrices.
Creating matrices is straightforward and, again, a R-like function is used.
Slicing and Dicing in
Most important one is "negative" indexing, which does not work completely yet.
Of course, I am limiting this comment to arrays and vectors: other R data structures, such as factors and data.frames are in the making, but all the heavy lifting is now done.
RHO 28 > (setf rr (range 1 9)) ; This would be 'rr = 1:8' in R.
1 2 3 4 5 6 7 8
RHO 29 > (setf rv (vector 1 22/23 -3.2 42 5.8))
1 22/23 -3.2 42 5.8
RHO 30 > (setf rv2 (vector* (list 1 22/23 -3.2 42 5.8)
:names '(a b c d e)))
A B C D E
1 22/23 -3.2 42 5.8
The objects returned by VECTOR
and VECTOR*
(the full constructor) are wrappers around the underlying CL vectors. Their class is RHO:VECTOR
. Note also that *PRINT-READABLY*
is set to NIL
, therefore, a "pretty printing" of the vectors is produced.RHO 37 > (setf ra (array (range 1 9) :dim '(3 3))) [, 0] [, 1] [, 2] [0, ] 1 2 3 [1, ] 4 5 6 [2, ] 7 8 1 RHO 38 > (at ra 1 1) 5The function
RANGE
excludes the upper limit (following CL conventions), while the ARRAY
constructors "recycles" the values passed to it (following R conventions). The value produces is pretty-printed, while its class is RHO:ARRAY
. Indexing is 0-based, as in CL.Creating matrices is straightforward and, again, a R-like function is used.
RHO 3 > (setf rm (matrix '(1 1 2 0) :nrow 2 :ncol 2)) [, 0] [, 1] [0, ] 1 1 [1, ] 2 0 RHO 4 > (at rm 0 1) 1 RHO 5 > (at rm 1 0) 2
Slicing and Dicing in RHO
Now that we have all the bases covered, we can proceed to some "slicing and dicing". All the R indexing and assignment methods are provided in
First of all, let's set some dimension names...
Next we create a new array (a matrix, but in R, matrices are separate subclasses of arrays)
Remember that we have built a matrix
Note however, that, although "correct", no attempt to signal an error is made for incongruous dimensions.
Getting the indexing of different combinations of scalars, vectors, and arrays was an interesting exercise. The implementation is probably bloated and not easily understandable; most probably because of the design choice of keeping the underlying CL vectors and arrays "as-they-are", especially multidimensional arrays.
RHO
, modulo the Common Lisp background choices, bugs and unimplemented features; not many of the last kind.First of all, let's set some dimension names...
RHO 7 > (setf (dimension-names ra) '(("a" "b") ("x" "y"))) (("a" "b") ("x" "y")) RHO 8 > ra [, x] [, y] [, 2] [a, ] 1 2 3 [b, ] 4 5 6 [2, ] 7 8 1 RHO 9 > (at ra 1 "y") 5
Next we create a new array (a matrix, but in R, matrices are separate subclasses of arrays)
rb
and use it as a value to set a sub-matrix of ra
.RHO 13 > (setf rb (array 42 :dim '(2 2))) [, 0] [, 1] [0, ] 42 42 [1, ] 42 42 RHO 29 > (setf (at ra (vector 1 2) (vector 1 2)) rb) #2A((42 42) (42 42)) RHO 15 > ra [, x] [, y] [, 2] [a, ] 1 2 3 [b, ] 4 42 42 [2, ] 7 42 42As you can see, everything works as expected.
Remember that we have built a matrix
rm
; let's use it to set some elements of the array ra
, according to the R "access-by-matrix" semantics.RHO 16 > rm [, 0] [, 1] [0, ] 1 1 [1, ] 2 0 RHO 24 > (type-of rm) MATRIX RHO 27 > (setf (at ra rm) 1024) 1024 RHO 28 > ra [, x] [, y] [, 2] [a, ] 1 2 3 [b, ] 4 1024 42 [2, ] 1024 42 42
RHO 29 > (setf (at ra #(1 2) #(1 2)) #(1 1 1 1)) #(1 1 1 1)
RHO 30 > ra [, x] [, y] [, 2] [a, ] 1 2 3 [b, ] 4 1 1 [2, ] 1024 1 1
Note however, that, although "correct", no attempt to signal an error is made for incongruous dimensions.
RHO 42 > (setf (at ra #(1 2) #(1 2)) #(11 12 13 14 15)) #(11 12 13 14 15) RHO 43 > ra [, x] [, y] [, 2] [a, ] 1 2 3 [b, ] 4 11 12 [2, ] 1024 13 14
Getting the indexing of different combinations of scalars, vectors, and arrays was an interesting exercise. The implementation is probably bloated and not easily understandable; most probably because of the design choice of keeping the underlying CL vectors and arrays "as-they-are", especially multidimensional arrays.
Missing Things...
Many...Most important one is "negative" indexing, which does not work completely yet.
Of course, I am limiting this comment to arrays and vectors: other R data structures, such as factors and data.frames are in the making, but all the heavy lifting is now done.
Want to Play?
Well, for the time being, drop me a line (you know where to find me). We can talk about setting up a proper project and discuss some design choices.
Cheers
Cheers
No comments:
Post a Comment