Implementation Notes • mixvlmc

This document contains some implementation notes for the mixvlmc package.

Suffix tree

The package uses suffix trees to build efficiently context trees. Suffix trees are constructed from the integer representation of the discrete time series, assuming that they contain only non negative integers. As a consequence, negative values can be used as sentinels.

Following Garivier’s paper we build the suffix tree of the reverse sequence (in other words the prefix tree).

Representation

The suffix tree is represented by a main C++ class SuffixTree which uses EdgeNode objects internally. As indicated by the name, an EdgeNode represents both a node in the suffix tree and the edge used to reach it from another node.

An EdgeNode contains:

a compact representation of the subsequence that labels its edge using start and end as the position of this subsequence in the main one. start is included in the subsequence end is not. During the construction of the suffix tree end can be set to n+1 where n is the length of the main sequence. This is the classical trick used to represent open ended edges;
a dictionary of its children (if any): this maps an element of the sequence to (a pointer to) the corresponding child NodeEdge;
a suffix pointer that represents a suffix link;
a parent pointer that points to the parent node (used during node expansion);
a counts pointer to a map that maps values to the number of times the subsequence represented by the current node (from the root) is preceded by this value in the original sequence;
an integer total_count which gives the number of occurrences of the subsequence when preceded by another value;
an integer depth that gives the length of the subsequence represented by this node;
a positions pointer that points to a vector of positions of the subsequence in the original sequence.

Notice that only the start, end, suffix and parent are set during the suffix tree construction.

Construction

The suffix tree is constructed using Ukkonen’s algorithm. More precisely, for a sequence x, we build the suffix tree of rev(x)[-1]: the tree contains all the prefixes of x without its last value. This guarantees that all sub-sequences detected by the tree appeared followed by at least one value, which makes them potential contexts. Notice that following a link in the suffix tree corresponds to extending a context into the past which is exactly what is needed for the practical use of the suffix tree as a context tree.

Counts

After the construction of the suffix tree, the compute_counts method is used to compute the number of occurrences of each subsequence represented by the suffix tree as well as the distribution of the values that follow them in the original sequence, as proposed in Garivier’s paper. The method uses rev(x)[1], that is the last value of x to include this final value in the counts.