Ex - Substring Sort Editorial by spheniscine

Alternative Solution

The idea in broad strokes is:

  1. Construct the suffix array and LCP (longest common prefix) array for all the given strings (for the natural extension of these concepts to multiple strings)
  2. Using this information, construct a radix tree (compressed trie) of the substrings of the given strings.
  3. Answer the queries using the preorder traversal of the radix tree.

For step one, the official editorial provides the details for one method. Another method that avoids having to adjust the LCP values after, is to define \(N\) new sentinel “letters” \(\xi_1<\xi_2 < ... <\xi_N < \tt{a}\). Find the suffix and LCP arrays of the string \(S_1\xi_1S_2\xi_2 ... S_N\xi_N\), then ignore the first \(N\) values of both, as they correspond to suffixes starting with a sentinel character. The exact implementation details depend on how/whether your suffix array implementation can deal with an “alphabet” of \(N + 26\) characters. (The AtCoder Library implementation has the alternate signature vector<int> suffix_array(vector<int> s, int upper) that can deal with it nicely; just add \(N\) to all the real letters and use values \([0, N)\) for the sentinels)

For step two, one way to implement the radix tree is to picture a node as a tuple \(\langle i, j, l, r, w, C\rangle\), where:

  • \(i, j\) means that this node contains substrings starting from the \(j\)-th character of \(S_i\).
  • \(l, r\) means that this node contains substrings of lengths between \(l\) and \(r\).
  • \(w\) means that this node contains \(w\) copies of each of those substrings, thus this node represents \(w (r - l + 1)\) substrings.
    • in other words, the node contains \(w\) copies of \(S_i[j, j + l - 1]\), \(w\) copies of \(S_i[j, j+l]\), …, \(w\) copies of \(S_i[j, j+r-1]\).
  • \(C\) is an array-list of zero or more other nodes, the children of this node, in lexicographical order.

The root node is a special node that doesn’t contain any substrings; its only purpose is to contain other nodes in its \(C\).

To build the radix tree, you could use a stack that stores the nodes representing the LCP from the previous string (initially the stack has only the root node, which should never be removed). For each suffix in the suffix array, you first want to “build up” the stack so that it matches the current suffix, this adds \(0\) or \(1\) node to the tree. We wish to add \(1\) to the \(w\) to all nodes in the stack, but since the stack may have too many nodes, we only add it to the last/top node of the stack, with the intent of propagating \(w\) to its parent when that node is popped. Then you want to “tear down” the stack to match the LCP of this suffix with the next suffix. This may pop one or more nodes from the stack, be sure to propagate the \(w\) values to the parents. Additionally, this may require you to “split” a node on the stack (when the LCP is \(< r\) but \(\ge l\)). To do so, create a new node, this node inherits \(i, j, r, w, C\) from the old node, while its \(l\) will start from \(LCP[i] + 1\). The \(C\) of the old node is replaced with a new array-list with a single member, the new node, while the \(r\) of the old node should now be the LCP. This step also adds \(0\) or \(1\) node to the tree, so the total number of nodes on the tree, and the time spent building and traversing it, won’t exceed \(O(\sum|S_i|)\).

This is similar to the method of building a suffix tree using suffix and LCP arrays.

Code for reference (Rust). Note that \(l\) as implemented in this code is one less than the \(l\) described in this tutorial.

posted:
last update: