F - Common Prefixes Editorial by en_translator
We denote the length of string \(S\) by \(|S|\). Also, we use the term similarity and the notation \(f\), \(S_i\) just as defined in the problem statement.
We assume that indices of arrays are 0-based.
The idea of the solution
Since we want to find the sum of lengths of common prefixes of suffixes, using LCP Array (Longest Common Prefix Array) seems to be a good idea.
Suffix Array
Consider sorting the suffixes of a string \(S\) in the lexicographical order.
For example, the suffixes of aacab
are aacab
, acab
, cab
, ab
, b
, (empty string)
, and when lexicographically sorted, they become (empty string)
, aacab
, ab
, acab
, b
, cab
.
If we store actual string to represent the sorted array, it requires \(O(|S|^2)\) memory; however we only need to store the position (the index in \(S\) of the first character) of the prefixes, so it actually need only \(O(|S|)\) memory. In the example above, what we want to find is \((5,0,3,1,4,2)\). We call this array “Suffix Array.” Let us denote the array by \(SA\).
If we actually compare the suffix strings in order to obtain the Suffix Array, it will require \(O(|S|^2\log |S|)\) time, but with an appropriate algorithm it can be constructed in an \(O(|S|)\) time (regarding the number of distinct alphabets as a constant). (This algorithm is called SA-IS. We will not describe the details. There are some other algorithms that requires \(O(|S|\log |S|)\) or \(O(|S|(\log |S|)^2)\) time but are relatively easy to implement)
LCP Array
Now we consider the similarity of suffixes of two strings that are adjacent to each other when sorted lexicographically. For example, in the example above,
- The similarity between
(空文字列)
andaacab
is \(0\) - The similarity between
aacab
andab
is \(1\) - The similarity between
ab
andacab
is \(1\) - The similarity between
acab
andb
is \(0\) - The similarity between
b
andcab
is \(0\)
The resulting array is called LCP Array. In the example above, it is \((0,1,1,0,0)\). If we have obtained Suffix Array, the LCP Array can be constructed in an \(O(|S|)\) time with an appropriate algorithm. We denote this array by \(LCPA\).
Calculating the similarity of suffixes
If three strings \(A,B,C\) are in the lexicographical order of \(A < B < C\), then \(f(A,C)=\min(f(A,B),f(B,C))\). So, for any \(i < j\) we have
\(f(S_{SA_i},S_{SA_j})=\min(f(S_{SA_i},S_{SA_{i+1}}),\ldots,f(S_{SA_{j-1}},S_{SA_j})) =\min(LCPA_i,\ldots,LCPA_{j-1}),\)
which means that the similarity of two suffixes can be obtained as a range min of LCP Array (unless those two are identical). Therefore, in order to solve the original problem, it is sufficient to solve the following problem:
Problem:
Given is a sequence \(A=(A_0,\ldots,A_{N-1})\) of length \(N\). \(A_0\) is \(0\).
For \(i<j\) let \(g(i,j)=\min(A_i,A_{i+1},\ldots,A_{j-1})\). Find
\(g(1,k)+g(2,k)+\ldots+g(k-1,k)+g(k,k+1)+g(k,k+2)+\ldots+g(k,N)\)
for each \(k=1,\ldots,N\).
(Note that \(k\) in this problem is \(SA_k\) in the original problem)
Solving the transformed problem
Let \(B_k\) be the first half of the desired expression, \(g(1,k)+g(2,k)+\ldots+g(k-1,k)\). Here we explain how to find \(B_k\). The latter half can be found in the same way.
We try to find \(B_k\) in the increasing order from \(k=1\). What is the relationship between \(B_k\) and \(B_{k+1}\)? Since
\(\begin{array}{} B_k&=&g(1,k)&+&\ldots&+&g(k-1,k)&\\ B_{k+1}&=&g(1,k+1)&+&\ldots&+&g(k-1,k+1)&+&g(k,k+1)\\ &=& \min(g(1,k),A_k) &+&\ldots&+&\min(g(k-1,k),A_k)&+&A_k, \end{array}\)
\(B_k\) can be found as follows.
- Let \(X\) be a multiset that is initially empty. For each \(k=1,\ldots,N\), perform the following three operations:
- 1. Replace all the elements \(x\) in \(X\) by \(\min(x,A_{k-1})\)
- 2. Add \(A_{k-1}\) to \(X\)
- 3. Find the sums of all the elements in \(X\). This is \(B_k\)
If we implement this naively, it costs \(O(N^2)\) time at worst. Instead, we can manage the pairs of values and their numbers. Since every Operation 1 decreases the numbers of distinct elements in \(X\), the number of operations is at most \(O(N)\), so the problem has been solved in a worst time of \(O(N \log N)\) with ordered multiset or priority queue.
Moreover, since the number added in Operation 2 is no less than \(\max X\) at that point, we can use a stack as a data structure of managing \(X\), in which case the time complexity is \(O(N)\).
Therefore, the problem has been solved in a total of \(O(N)\) time.
Note that the construction of Suffix Array and LCP Array is provided by ACL (AtCoder Library), so if you are using C++, it will be useful to use this.
posted:
last update: