This document contains some implementation notes for the
mixvlmc
package.
Suffix tree
The package uses suffix trees to build efficiently context trees. Suffix trees are constructed from the integer representation of the discrete time series, assuming that they contain only non negative integers. As a consequence, negative values can be used as sentinels.
Following Garivier’s paper we build the suffix tree of the reverse sequence (in other words the prefix tree).
Representation
The suffix tree is represented by a main C++ class
SuffixTree
which uses EdgeNode
objects
internally. As indicated by the name, an EdgeNode
represents both a node in the suffix tree and the edge used to reach it
from another node.
An EdgeNode
contains:
- a compact representation of the subsequence that labels its edge
using
start
andend
as the position of this subsequence in the main one.start
is included in the subsequenceend
is not. During the construction of the suffix treeend
can be set ton+1
wheren
is the length of the main sequence. This is the classical trick used to represent open ended edges; - a dictionary of its
children
(if any): this maps an element of the sequence to (a pointer to) the corresponding childNodeEdge
; - a
suffix
pointer that represents a suffix link; - a
parent
pointer that points to the parent node (used during node expansion); - a
counts
pointer to a map that maps values to the number of times the subsequence represented by the current node (from the root) is preceded by this value in the original sequence; - an integer
total_count
which gives the number of occurrences of the subsequence when preceded by another value; - an integer
depth
that gives the length of the subsequence represented by this node; - a
positions
pointer that points to a vector of positions of the subsequence in the original sequence.
Notice that only the start
, end
,
suffix
and parent
are set during the suffix
tree construction.
Construction
The suffix tree is constructed using Ukkonen’s algorithm.
More precisely, for a sequence x
, we build the suffix tree
of rev(x)[-1]
: the tree contains all the prefixes of
x
without its last value. This guarantees that all
sub-sequences detected by the tree appeared followed by at least one
value, which makes them potential contexts. Notice that following a link
in the suffix tree corresponds to extending a context into the past
which is exactly what is needed for the practical use of the suffix tree
as a context tree.
Counts
After the construction of the suffix tree, the
compute_counts
method is used to compute the number of
occurrences of each subsequence represented by the suffix tree as well
as the distribution of the values that follow them in the original
sequence, as proposed in Garivier’s paper.
The method uses rev(x)[1]
, that is the last value of
x
to include this final value in the counts.