I hope to write up something more easily digestible but for now just doing a bit of a brain dump.
Since K has nested vectors (lists) you could always represent trees in a Lisp-y way. I.e.
(root; ((leftchld; leaf); (rightchld; leaf)))
where a node is (value; (children))
and a leaf
is somehow distinguishable from a non-leaf by being say an atom.
There are some slightly annoying aspects to this, though. For one, you need a convention to distinguish leaves from interior nodes. For another to access interior nodes you need to descend the tree. Sure, nine times out of ten when you're working with trees you're having to walk the tree, but still. Lastly, nested lists mean pointers of pointers which just doesn't feel very array-like.
What can be done? Well there are other representations of trees, but the one I prefer is what I like to call an Apter tree. You can find examples on nsl.com or this article from Vector magazine, but depending on your appetite this may not feel like enough of a description of how they work.
[DISCLAIMER: I haven't done much(any?) research into the history of this data structure but I've also not run into anyone else who has.]
The basic idea is to separate the structure of tree from its contents. By "structure" we mean which nodes are connected to which other nodes. While it might seem natural to do this with a dictionary of node -> list of children this brings us back to nested lists. Here's where a neat observation comes in.
While each node can have multiple children, each node (except the root) has only /one/ parent.
Thus if we have a tree such as (15;(12 9;18 2))
, then the parent of the leaf 9
is 12
and the parent
of the 12
is 15
. Note that even here there's a bit of awkwardness. Is 12
a node or is it a value?
Similarly for 9
.
Let's give each node an index using a depth first preorder of the tree. So we have 5 nodes whose
indices are 0 1 2 3 4
and whose values are 15 12 9 18 2
. Now we can say that the parent of node
2
is 1
and the parent of node 1
is 0
. So using 0N
for the parent of the root (for now),
the parent of each node is 0N 0 1 0 3
. Now we have a "parent vector" and a "values vector".
If nothing else this clears up the confusion of which we're talking about.
Note that there's nothing really special about using depth first preorder. If we shuffled the indices, we'd simply need to shuffle the indices of each parent to point to its new location. What is important is that the parent vector and the values vector be in the same order. Any ordering would faithfully represent the tree, but conventions are useful.
Let's give these names p:0N 0 1 0 3
and n:15 12 9 18 2
. So the value of our node at index 2
is
n[2]~9
and its parent is at index p[2]~1
. The value of the parent is n[p[2]]~12
and the parent
of the parent at index 2
is p[p[2]]~0
. But we're using an array language! We can do this across
all the indices simultaneously. p[!#p]~0N 0 1 0 3
and p[p[!#p]]~0N 0N 0 0N 0
. This last could also
be written 2(p@)/!#p
.
We could also look at the fixed point scan which gives a matrix whose columns give paths from each index
up to the parent of the root, which is currently marked with 0N
(p@)\!#p (0 1 2 3 4 0N 0 1 0 3 0N 0N 0 0N 0 0N 0N 0N 0N 0N)
While this works well enough for many circumstances its more useful to stop at the root. To make this
happen we make the root self-parenting. I.e. we define our parent vector to be p:0 0 1 0 3
.
(p@)\!#p (0 1 2 3 4 0 0 1 0 3 0 0 0 0 0)
Above we basically eyeballed the parent vector. How do we generate a parent vector more generally? This of course depends on the information we're given, but often it's easiest to first shoot for a slightly different representation. That is a depth vector.
The depth vector is simply the length of each path from a leaf up to the parent.
I.e.
+/1&(p@)\!#p:0^p 0 1 2 1 2
Generally it's easier to generate a depth vector than to directly generate a parent vector. For
instance d:{$[`i=@y;x;,/x,o[x+1]'y 1]}[0;(15;(12 9;18 2))]
or +\-/"()"=\:"(a(bc(d)ef))"
. (We'll
talk more about this second in a bit.)
Okay, let's assume depth vectors are easy to come by. How does this help us create a parent vector? Here's where conventions help out. If our depth vector is depth first preorder then we know that our parent is at the index "closest to the left at depth one less than ours". This is calculable! There are different approaches to doing this calculation, but let's start with the easiest.
{|/0,&x=-1+*|x}',\d 0 0 1 0 3
Okay so now let's say we have a parent vector and a node vector, what can we do with them? Well, we
can do things like ask which index is the parent of the node with value 9
? We can simply do p[n?9]
.
Notice that we didn't have to walk the tree to find this!!
But what if we did want to walk the tree? That's still an option. We just need to find children of
each parent vector. We can use "group" to do that =p
. Here, having the root being self-parenting
can be a bit of pain since we want to recurse to proper children and not enter an infinite loop.
With simple trees like this we can just do =0N,1_p
to effectively make our root a child of some
other node we'll never refer to. There's also another little trick to get a useful prototype.
Let's use chld:(=0N,1_p),(,0N)!,!0
as our map from parents to children. Then chld[0]~1 3
and
chld[2]
is !0
.
Now we can do the following to descend the tree:
chld:(=0N,1_p),(,0N)!,!0 {(x@z;o[x;y]'y@z)}[n;chld]0 (15 ((12;,(9;()));(18;,(2;()))))
(Here leaves have empty children, but this is simply an example of what's possible.)
Fair question. Well posed. My motivation was mostly to see what's possible with a data structure I wasn't familiar with, but if you keep your eye out for it similar structures pop up in other domains. The motivation is usually that by using vectors you get the benefit of data locality. I.e. you're likely to have related items accessed together hot in the cache. On the negative side, as is common with vector structures, modifying them means moving them in bulk to ensure that they're in contiguous memory. Thus they really shine when the size of the structure is relatively stable.
Weighing the benefits is not unlike comparing linked lists to arrays. Each node in a linked list might be stable in memory at the expense of data locality whereas you may need to move arrays around if you need to ensure data locality.
Apter trees may not be right for your application, but in my experience they have their good points. To my mind separating the structure from the content reduces some mental load. I find it easier to picture the kind of manipulations that are possible and how to accomplish them.
If you're still curious at this point, read on…
Obviously, Apter trees being trees means they're useful wherever trees are useful. But let's explore a few examples, starting with the string parsing above.
This algorithm for finding the depth of parenthesization is a classic. I first read about it in a quote from Alan Perlis speaking about how he was amazed that APL could do something so useful with so few characters. Let's have a look at it again.
+\-/"()"=\:"(a(bc(d)ef))" 1 1 2 2 2 3 3 2 2 2 1 0
It may be easier to match each character with it's depth:
+(txt;+\-/"()"=\:txt:"(a(bc(d)ef))") (("(";1) ("a";1) ("(";2) ("b";2) ("c";2) ("(";3) ("d";3) (")";2) ("e";2) ("f";2) (")";1) (")";0))
One thing to notice, is the matching close parenthesis for each open parenthesis is one deeper. Another thing to notice is that each opening parenthesis is at the same depth as its contents. This is more than likely not what you want, but is easily fixable.
+(txt;(-*ps)++\-/ps:"()"=\:txt:"(a(bc(d)ef))") (("(";0) ("a";1) ("(";1) ("b";2) ("c";2) ("(";2) ("d";3) (")";2) ("e";2) ("f";2) (")";1) (")";0))
For simple compilers and parsers, this basic idea forms the baseline of the parsing process. As above, the step from a depth vector to a parent vector and thus to a tree structure is not very far.
Doing this again with something more recognizable:
+(txt;d:(-*ps)++\-/ps:"()"=\:txt:"(3*(4+9))-7") (("(";0) ("3";1) ("*";1) ("(";1) ("4";2) ("+";2) ("9";2) (")";1) (")";0) ("-";0) ("7";0)) d 0 1 1 1 2 2 2 1 0 0 0 {|/0,&x=-1+*|x}',\d 0 0 0 0 3 3 3 0 0 0 0
Notice that the open parentheses become parent (i.e. interior) nodes and that everything else is a leaf. Thinking about it a little bit more, we only really needed the close parentheses as a marker of where the corresponding open parentheses end. They act sort of like how null characters act at the end of C strings. Now that we have that structure in the parent vector, we no longer need these nodes.
In general removing items from the parent vector could throw off the indices in that vector.
Consider (1+3)*(4+6)
. There is a trick for doing this, but more straightforward here is
to edit the depth vector. Because closed parentheses don't interupt the "closest index to the
left at depth one less than ours", removing them doesn't effect the parent vector we'll
generate from it.
+(txt;d):(txt;d:(-*ps)++\-/ps)@\:&~*|ps:"()"=\:txt:"(3*(4+9))-7" (("(";0) ("3";1) ("*";1) ("(";1) ("4";2) ("+";2) ("9";2) ("-";0) ("7";0))
It's this sort of thinking about trees that I enjoy about Apter trees. It's as much like
painting as it is like programming. Note that here we're considering txt
to be the values
vector and we make our adjustments to both the depth vector and the values vector in parallel.
Another thing to notice here is that we have several nodes at depth 0
. We probably want a
single root, though. We could fix this by adjusting the original text and adding a set of
parentheses around the whole thing, or we can edit this postprocessed tree.
+(txt;d):("(",txt;0,1+d) (("(";0) ("(";1) ("3";2) ("*";2) ("(";2) ("4";3) ("+";3) ("9";3) ("-";1) ("7";1))
You might be asking yourself at this point "where is this going?" The answer is "nowhere!". We're just playing around to see the kind of thinking that goes into working with these structures.
But as long as we're playing around with something familiar, we probably want these (infix) operators to be parents of the things they're operating on. We could try to manipulate the depth vector, but if we keep the order the same, we'll not be in DFS preorder. Let's instead make a parent vector and rewire it.
:p:{|/0,&x=-1+*|x}',\d 0 0 1 1 1 4 4 4 0 0 txt@=p !/+((0;"((-7") (1;"3*(") (4;"4+9"))
Currently, each infix operator is a sibling of the element to the left and the right. A sibling which is an open parenthesis represents the subexpression (i.e. subtree). Instead, we'd like to have each node to the left and the right of an infix operator have the infix operator as its parent. Note that these are the indices to the left and the right among its siblings. This means we'll have to do this per grouping.
=p !/+((0;0 1 8 9) (1;2 3 4) (4;5 6 7)) {y@(0;-1 1)+/:&~^"+*-%"?x@y}[txt]'.=p !/+((0;,(8;1 9)) (1;,(3;2 4)) (4;,(6;5 7))) |+,/{y@(0;-1 1)+/:&~^"+*-%"?x@y}[txt]'.=p ((1 9;2 4;5 7) 8 3 6) :p:@[p;;:;].|+,/{y@(0;-1 1)+/:&~^"+*-%"?x@y}[txt]'.=p 0 8 3 1 3 6 4 6 0 8
This isn't great for visualizing. Let's use something like the depth from parent calculation above.
`0:(" ",txt)@(1+!#txt)*/:-1+2*~=':(p@)\!#p ((3*(4+9-7 (3*(4+9-7 (3*(4+9 7 3*(4+9 3 (4+9 4+9 4 9
Okay, maybe not the best rendering, but it'll do in a pinch. Actually, let's take a second to appreciate this. This may not be the tree redering you're used to, but you can visualize the shape of the tree from the depth of the nodes. Moreover it's pretty cheap to render from the scan of the parent vector. With a list based tree, the inner nodes are sort of buried in there and harder to get your hands on.
We can see that the operators are now a level up from the things they're operating on, but we're no longer in DFS preorder. Does this matter? Not necessarily. We can still iterate through the tree. In fact, we can iterate through the tree to generate the DFS preorder.
chld:(=0N,1_p),(,0N)!,!0 :g:{,/x[z],o[x;y]'y@z}[!#p;chld]0 0 8 1 3 2 4 6 5 7 9
Just to be clear, DFS preorder is important for the algorithm that converts depth vectors to parent vectors, but is not necessarily important for the parent vectors themselves.
This is the order we'd like the indices to be in (the grade). Rearranging the values vector according to this grade is pretty easy:
:txt@:g "(-(*3(+497"
But as alluded to above, rearranging the parent vector is a little trickier. We first reorder the parent vector just as we do the values vector so that they're in the same order:
p@g 0 0 8 1 3 3 4 6 6 8
These are now in the right order, but we're still pointing to the old indices. What we need to point to is the new location, which amounts to the grade of the grade.
:p:(<g)@p@g 0 0 1 2 3 3 5 6 6 1 `0:(" ",txt)@(1+!#txt)*/:-1+2*~=':(p@)\!#p (-(*3(+497 -(*3(+497 (*3(+497 *3(+49 3(+49 +49 49
Notice that this is basically reversed Polish notation with parens being explicit about which values go with which operators. Also, since the original expression was fully parenthezised, each parenthesized node has a single child. Given this, evaluating this expression as a tree isn't too bad.
chld:(=0N,1_p),(,0N)!,!0 {(n;c):(x;y)@\:z;$[~#c;.n;"("~n;o[x;y]@*c;(.n)/o[x;y]'c]}[txt;chld]0 32
We simply descend the tree, passing through each parenthesis node to it's child and evaluating the operator on each of its children. Leaves are simply eval-ed.
Before moving on, let's take a look at this last attempt to visualize the tree. Let's remove the duplicates and space it out a bit.
`0:" "/'(" ",txt)@(1+!#txt)*/:-1+2*1_>':,/1(,1+&#*:)\=':(p@)\!#p ( - ( 7 * 3 ( + 4 9
If you stare at this for a bit, you start to want to add some characters.
( │ - ├───────────┐ ( 7 │ * ├───┐ 3 ( │ + ├───┐ 4 9
We won't go into details, but it should be fairly clear that the graph we actually had isn't too far off from something more traditional looking.
Let's try something else. Say we had an simple expression that wasn't parenthesized, like
1+2+3+4
. How do we express this as a tree? For starters, since there is no depth, we can
add a dummy root and make the parent vector all zeros. If that's not immediately obvious, it's
worth taking a minute to contemplate why this must be the case.
:p:&# \txt:"(","1+2+3+4" "(1+2+3+4" 0 0 0 0 0 0 0 0
Now how would we rewire this to do say right to left evaluation? That is we want "long right"
precedence. If we did add parens it should look like (1+(2+(3+4
. One option is to simply add
a paren after each operator and use the parser above. Another is to note that each operator has
two children: Its left (leaf) sibling and the next operator to the right.
:p:@[p;;:;].(+(-1+;1_,[;-1+#txt]@)@\:op;op:&"+"=txt) 0 2 0 4 2 6 4 6 `0:(" ",txt)@(1+!#txt)*/:-1+2*~=':(p@)\!#p (1+2+3+4 1+2+3+4 1 2+3+4 2 3+4 3 4
You could apply something similar to each set of siblings for a partially parenthesized expression such as "(2*8*4)+(2*(3+4+6))".
Now is this the way you want to go about creating a general expression parser? Maybe not. It's not immediately clear how to handle monadic operators like negation going down this path. But then again, maybe it's not so bad. The point of this exercise is to introduce new ways of thinking about tree construction. I personally have found this "rector set" approach pretty fruitul, using it to implement a json parser and a parser of regular expressions as well as lambda calculus evaluator.
This last might be a good example of the shortcomings of this approach mentioned above. It's basically a term rewriting exercise. Which means a lot of updates to the tree and thus changes in the size of the tree, which means a lot of copying of the vectors involved as described above. It works, but perhaps is not the best fit.