Official

Ex - Substring Sort Editorial by en_translator


Based on the conditions of the problem, if different triples correspond to the same string, you can swap them in the sequence, and those that can be swapped is limited to such triples. Therefore, this problem essentially asks to sort \(M\) strings expressed as \(S_K[L,R]\) and find the triple correspond to \(x_i\)-th string.

Consider sorting the entire substrings of \(S_i\) in the lexicographical order. Note that each (non-empty) substring can be written as one of the suffixes of prefixes.
(Specifically, \(S_K[L,R]\) is a prefix of \(S_K[L,|S_K|]\), which in turn is a suffix of \(S_K\).)

1. Construct a suffix array and an array of longest common prefixes

Therefore, first of all, we try to sort all suffixes of \(S_1,S_2, \ldots,S_N\) (there are \(\displaystyle\sum_{i=1}^N \lvert S_i\rvert\) of them) to obtain a suffix-array-ish sequence \(T\).
That is, we try to obtain \(T=((K_1,L_1),\ldots, (K_{|T|},L_{|T|}))\) such that \(S_{K_i}[L_i,|S_{K_i}|]\leq S_{K_j}[L_j,|S_{K_j}|]\) in the lexicographical order for all \(i\leq j\). This can be obtained by finding the suffix array of the string
\(S'=S_1+\) $ \(+S_2+\) $\( +\cdots +S_{N-1}+\) $ \(+S_N+\) $
that is obtained by concatenating \(S_1, S_2, \ldots, S_N\), separated by a character smaller than any of the letters ($ is used here). This is because, denoting by \(\bar{S}_i\) a string that is obtained by removing the first occurrence of $ and later characters, \(\bar{S}_i\leq \bar{S}_j\) if \(S'[L_i,|S'|]\leq S'[L_j,|S'|]\). (This follows from the fact that the lexicographical order does never change by appending any string to two strings if (\(\bar{S}_i\) +$) \(>\)(\(\bar{S}_j\) +$).) Therefore, we can reconstruct the desired \(T\) from the information that the \(i\)-th character of \(S'\) corresponds to the \(L_i\)-th character of \(S_{K_i}\). Here, the suffices starting from $ occurs in the very beginning, which corresponds to empty strings, so we remove them. We also find the LCP (Longest Common Prefix) array \(LCP[i]\) of \(\bar{S}_1, \ldots, \bar{S}_{|T|}\). This is an array of the length of LCP of \(\bar{S}_i\) and \(\bar{S}_{i+1}\). This can be found as \(LCP[i]=\min(\mathrm{LCP}_{S'}[I+N],\min(|\bar{S}_i|, |\bar{S}_{i+1}|))\), where \(\mathrm{LCP}_{S'}[i]\) is the LCP array of \(S'\). (Note that we add \(N\) to the indices of \(\mathrm{LCP}_{S'}[I]\) because the first \(N\) element of the suffix array of \(S'\) starts with $.) For convenience, we let \(LCP[0]=LCP[|T|]=0\).

2. Count the number of substrings in the lexicographical order efficiently

Next, we consider how to sort the substrings based on this result. We recommend you temporarily stop here and consider how to solve using this before reading further.

Currently, the substring is given as the entire prefixes of \(\bar{S}_i\).
Here, for \(i\) and \(j\) such that \(i<j\), when we compare \(\bar{S}_i[1,R]\) and \(\bar{S}_j[1,R']\), \(\bar{S}_j[1,R']\) is a prefix of \(\bar{S}_i[1,R]\), or \(\bar{S}_i[1,R]<\bar{S}_j[1,R']\). Thus, when we finished enumerating all strings lexicographically less than or equal to \(X\)

  1. Take the unique \((i,R)\) \((1\leq i\leq |T|, 1\leq R\leq |S_i|)\) that satisfies the following three conditions (which exists as long as there is a subsequence greater than \(X\)). In other words, take the lexicographically smallest integer pair \((i,R)\) \((1\leq i\leq |T|, 1\leq R\leq |S_i|)\) that are not chosen yet.

    • For all \(1\leq j\leq i-1\) and \(1\leq R'\leq |\bar{S_i}|\), it holds that \(\bar{S}_j[1,R']\leq X\).
    • For all \(1\leq R'\leq R-1\), it holds that \(\bar{S}_i[1,R']\leq X\).
    • \(\bar{S}_i[1,R']> X\).
  2. Take \(i\leq j\) that satisfies the following two conditions. Here, for \(i\leq i' \leq j\), it holds that \(\bar{S}_{i'}[1,R]=\bar{S}_i[1,R]\), and these are only subsequence that coincides with the string.

    • \(LCP[i']\geq R\) for all \(i\leq i' <j\).
    • \(LCP[j]<R\) (note that \(LCP[|T|]=0\) by definition)

We can start from an empty string \(X\) to obtain all possible subsequences and the numbers. However, trying to iterate them all costs as much as at least the number of all possible subsequences \(S_i\), which won’t finish in time. However, if you focus only on \((i,j)\) among the \((i,R,j)\) chosen in the steps above, you realize that it’s in ascending order of \(i\), ties broken by descending \(j\), so they can be computed at once. Then, \((i,j)\) has corresponding \(R\) only if \(\max(LCP[i-1],LCP[j])<\min(LCP[i],\ldots,LCP[j-1])\), and only up to \(2|T|\) satisfies it. We can use a set to find it in \(O(|T|\log |T|)\) time; by scanning \(i\) from the left, managing possible \(j\) with a slack and fill it in the lexicographically descending order of \(j\), we can implement it in an \(O(|T|)\) time.

For each \((i,j)\), the corresponding \(R\) spans over entire \(a<R\leq b\), where \(a=\max(LCP[i-1],LCP[j])\) and \(b=\min(LCP[i],\ldots,LCP[j-1])\), so there are \((j-i+1)\times (b-a)\) corresponding sequences. By accumulating them as the number of scanned items in the lexicographical order, and if it contains the \(x_i\)-th one, we can obtain the answer based on their lengths and the information on \(T_i\).

3. Wrap-up

To wrap, the answer can be obtained by:

  1. first construct \(S'\) to find its Suffix Array and LCP, which in turn is used to find \(T\) and \(LCP[i]\);
  2. then use information on \(LCP[i]\) to find conforming \((i,j)\) and find the answer based on information on \(T\).

The first step can be processed in an \(O\left(\displaystyle\sum_{i=1}^N \lvert S_i\rvert \right)\) time, and the latter in \(O\left(Q+\displaystyle\sum_{i=1}^N \lvert S_i\rvert \right)\) if we use a stack, so the problem has been solved fast enough.

Sample code in C++:

#include <bits/stdc++.h>
#include <atcoder/all>
using namespace std;
using namespace atcoder;

int main() {
	int n,q;
	cin>>n;
	vector<string> s(n);
	for(int i=0;i<n;i++)cin>>s[i];
	cin>>q;
	vector<long long>x(q);
	for(int i=0;i<q;i++)cin>>x[i];
	vector<tuple<int,int,int> > ans(q);
//1. 
	int tot=0;
	string str;
	for(int i=0;i<n;i++){
		str+=s[i]+'$';
		tot+=s[i].size();
	}
	vector<int>rev(n+tot);
	vector<pair<int,int> >t(tot);
	vector<int>len(tot);
	vector<int>lcp(tot+1);
	vector<int> str_sa = suffix_array(str);
	vector<int> str_lcp=lcp_array(str, str_sa);

	for(int i=n;i<n+tot;i++)rev[str_sa[i]]=i-n;
	int sz,cur=0;
	long long m;
	for(int i=0;i<n;i++){
		sz=s[i].size();
		for(int ii=0;ii<sz;ii++){
			t[rev[cur+ii]]={i+1,ii+1};
			len[rev[cur+ii]]=sz-ii;
		}
		cur+=sz+1;
		m+=((long long)sz)*(sz+1)/2;
	}
	for(int i=1;i<tot;i++)lcp[i]=min(str_lcp[n+i-1],min(len[i-1],len[i]));
//2.  
	stack<int>j_cand;
	int j,a,b,ans_len;
	long long cnt;
	cur=q-1;
	for(int i=tot-1;i>=0;i--){
		j_cand.push(i);
		b=len[i];
		while(!j_cand.empty()){
			j=j_cand.top();
			a=max(lcp[i],lcp[j+1]);
			cnt=((long long)(j-i+1))*(b-a);
			while(cur>=0){
				if(x[cur]<=(m-cnt))break;
				ans_len=(x[cur]+cnt-m-1)/(j-i+1)+1;
				ans[cur]={t[i].first,t[i].second,t[i].second+a+ans_len-1};
				cur--;
			}
                        m-=cnt;
			if(lcp[i]<=lcp[j+1])j_cand.pop();
			if(lcp[i]>=lcp[j+1])break;
			b=a;
		}
	}
	
	for(int i=0;i<q;i++)cout<<get<0>(ans[i])<<" "<<get<1>(ans[i])<<" "<<get<2>(ans[i])<<"\n";
	return 0;
}

posted:
last update: