F - Well-defined Abbreviation Editorial by evima

Let \(f(T)\) denote the set of strings that can result from the operation on \(T\). Additionally, when the size of \(f(T)\) is \(1\), let \(g(T)\) denote the resulting string.

The necessary and sufficient condition

Let us first consider \(S_i\) that contains one of the \(S_j\)’s \((i\neq j)\). If greedily removing \(S_j\)’s \((j \neq i)\) from the beginning of \(S_i\) makes it empty, \(f(T)\) is not affected by the existence or absence of \(S_i\). Therefore, we may remove such \(S_i\) without changing the answer. If \(S_i\) does not become empty, \(S_i\) itself is a bad string.

If this process does not find a bad string, \(S_j\) is not a substring of \(S_i\) for any \(S_i\) and \(S_j\) \((i\neq j)\). Now, the necessary and sufficient condition for there to be no bad string is:

there is no strings \(A\), \(B\), \(C\) \((A\neq C)\) and integers \(1\leq i,\ j \leq n\) such that \(S_i=A+B\) and \(S_j=B+C\).

Proof

Necessity

Assume that \(A\), \(B\), \(C\) and \(i\), \(j\) with the above properties exist. First, if \(A\) or \(C\) contained one of the \(S_k\)’s, \(S_i\) and \(S_j\) would also contain that string as a substring. Thus, both \(A\) and \(C\) contain none of the \(S_k\)s, so we have \(f(A)=\{A \},\ f(C)=\{C \}\). Therefore, we have \(\{A,\ C\} \subset f(A+B+C)\), so \(T=A+B+C\) is a bad string since \(A\neq C\).

Sufficiency

We will show that the size of \(f(T)\) is \(1\) for any \(T\) by induction on \(|T|\). The starting case where \(T\) is an empty string is clear.
Now, assume that the size of \(f(T')\) is \(1\) for strings \(T'\) shorter than \(T\). Let \(T'_1\) and \(T'_2\) be the strings obtained by removing \(S_i\) and \(S_j\) somewhere from \(T\), respectively. We need to show that \(g(T'_1)=g(T'_2)\). If the positions of the removed strings overlap, it follows from the conditions above that \(T'_1=T'_2\), so \(g(T'_1)=g(T'_2)\). If the positions do not overlap, we can take a string \(T'_3\) obtained by removing both \(S_i\) and \(S_j\) from \(T\), for which \(g(T'_1)=g(T'_2)=g(T'_3)\).

Therefore, the problem can be solved by:

Step 1: removing \(S_i\)’s that can be made empty by removing shorter \(S_j\)’s, and
Step 2: determining whether the necessary and sufficient condition above is satisfied.

Step 1

Let us add the characters of \(S_i\) one by one to an empty string, and whenever the current string ends with one of the \(S_j\)’s, remove that ending. This can be done in \(O(|S_i|)\) time with the Aho-Corasick algorithm. However, if we use failure links to move between nodes, we may perform many moves taking \(\Theta(|S_i|)\) time (when, for example, \(S=\{\) AAA…AAA, C, BD, AAA…ABDCCC…CA\(\}\)). We should precompute the next node for every character at every node. This can be done in \(O(\sigma \sum_{i=1}^{N}|S_i|)\) time in a way similar to finding the failure links.

Step 2

Let us record the \(j\) such that the length-\(n\) suffix of \(S_i\) and the prefix of \(S_j\) match, for each \((i,\ n)\). Those \(j\)’s can be listed in \(O(|S_i|)\) time by traversing down the trie to the end and then tracing failure links. We only need to record one \(j\) because if two or more prefixes of \(S_j\)’s match the same suffix, they will not be the \((S_i, S_j)\) we look for.
When \(j\) is recorded for \((i,\ n)\), we need to check whether \(|S_i|=|S_j|\) and \(i\) is recorded for \((j,\ n-|S_i|)\) to see if we have found the pair.

Summarizing the above, the problem can be solved in \(O(\sigma \sum_{i=1}^{N}|S_i|)\) time.

posted:
last update: