This document contains some implementation notes for the
mixvlmc package.
Suffix tree
The package uses suffix trees to build efficiently context trees. Suffix trees are constructed from the integer representation of the discrete time series, assuming that they contain only non negative integers. As a consequence, negative values can be used as sentinels.
Following Garivier’s paper we build the suffix tree of the reverse sequence (in other words the prefix tree).
Representation
The suffix tree is represented by a main C++ class
SuffixTree which uses EdgeNode objects
internally. As indicated by the name, an EdgeNode
represents both a node in the suffix tree and the edge used to reach it
from another node.
An EdgeNode contains:
- a compact representation of the subsequence that labels its edge
using
startandendas the position of this subsequence in the main one.startis included in the subsequenceendis not. During the construction of the suffix treeendcan be set ton+1wherenis the length of the main sequence. This is the classical trick used to represent open ended edges; - a dictionary of its
children(if any): this maps an element of the sequence to (a pointer to) the corresponding childNodeEdge; - a
suffixpointer that represents a suffix link; - a
parentpointer that points to the parent node (used during node expansion); - a
countspointer to a map that maps values to the number of times the subsequence represented by the current node (from the root) is preceded by this value in the original sequence; - an integer
total_countwhich gives the number of occurrences of the subsequence when preceded by another value; - an integer
depththat gives the length of the subsequence represented by this node; - a
positionspointer that points to a vector of positions of the subsequence in the original sequence.
Notice that only the start, end,
suffix and parent are set during the suffix
tree construction.
Construction
The suffix tree is constructed using Ukkonen’s algorithm.
More precisely, for a sequence x, we build the suffix tree
of rev(x)[-1]: the tree contains all the prefixes of
x without its last value. This guarantees that all
sub-sequences detected by the tree appeared followed by at least one
value, which makes them potential contexts. Notice that following a link
in the suffix tree corresponds to extending a context into the past
which is exactly what is needed for the practical use of the suffix tree
as a context tree.
Counts
After the construction of the suffix tree, the
compute_counts method is used to compute the number of
occurrences of each subsequence represented by the suffix tree as well
as the distribution of the values that follow them in the original
sequence, as proposed in Garivier’s paper.
The method uses rev(x)[1], that is the last value of
x to include this final value in the counts.