Official

F - 998244353 → 1000000007 Editorial by evima


The problem can basically be solved by using Montgomery multiplication (Wikipedia). See this article for a detailed explanation of what this is. (Below, the usage of variables follows this article.)

Simply speaking, Montgomery multiplication is an algorithm to compute \(A \times B \bmod{N}\) using only multiplication, addition, and a modulo operation \(\text{mod }R\) (without using the modulo operation \(\text{mod }N\)). The following code assigns \(A \times B \bmod{N}\) to \(C\).
(Below, \(N'\) is the smallest positive integer such that \(N' N \equiv -1 \pmod{R}\), and \(R_2 = R^2 \bmod{N}\).)

uint64_t reduce(uint64_t T) {
  uint64_t t = (T + (T % R * Nprime % R) * N) / R;
  if (t >= N) t -= N;
  return t;
}

uint64_t A = a, B = b, C;
void f() {
  A = reduce(A * R_2);
  B = reduce(B * R_2);
  C = A * B;
  C = reduce(C);
  C = reduce(C);
}

Let us consider performing Montgomery multiplication with \(N = 10^9 + 7, R = 998244353\).

The part reduce(T) in the code above is called Montgomery reduction, which is the key to the problem. By reading the code, we can see that everything can be implemented using the given instructions except the following two parts.

  • / R in the first line
  • The if statement in the second line

The division with \(R\) can be relatively easily handled by using the fact that the multiplication takes \(\text{mod }2^{64}\).
Let \(S\) be an integer such that \(S R \equiv 1 \pmod{2^{64}}\). For any 64-bit integer \(X\) that is a multiple of \(R\), it holds that \(X / R\) and \(X S\) are congruent modulo \(2^{64}\), and this value is uniquely determined between \(0\) and \(2^{64}-1\). Therefore, the division by \(R\) may be replaced with multiplication by \(S\) without changing the result.

Now, proceed to the if statement. There are several approaches.

Method 1

Here is a straightforward approach. To simplify the problem, consider the following function \(F(t)\):

\[F(t) = (1 \text{ if t }\geq N\text{ else }0).\]

If \(F(t)\) can be computed using only the given instructions, the if statement can be replaced by \(t + F(t) \times (2^{64} - 1)\).

Let us try to implement \(F(t)\). We need to do a calculation that produces different results when \(0 \leq t \lt N\) and when \(N \leq t \lt 2N\). One way to do so is as follows.

Consider storing the following \(p\) and \(q\) in variables.

\[p = t + 2^{64} - N, q = t + 2^{64} + 2R - N\]

If \(t \lt N\), only \(q\) is greater than or equal to \(2^{64}\), so the value to be stored in \(q\) is the true value minus \(2^{64}\). Thus, \((q - p) \bmod{R}\) is \(2^{64} \bmod{R}\).
On the other hand, if \(N \leq t \lt 2N\), both \(p\) and \(q\) are greater than or equal to \(2^{64}\) and get subtracted by \(2^{64}\). Thus, \((q - p) \bmod{R}\) is \(0\).
Therefore, we see that \(F(t)\) can be computed by performing appropriate multiplication and addition by constant numbers on \((q-p) \bmod{R}\). (Since \(2^{64} \bmod{R}\) does not have a multiplicative inverse modulo \(2^{64}\), the required implementation is more complex than it seems, which we will omit.) With this, one can completely implement the function reduce, which one can use to perform Montgomery multiplication to achieve the objective.

Method 2

Another approach (former model solution) is a “feat of strength.”
Montgomery multiplication maintains \(T R^{-1}\) as a value between \(0\) and \(N - 1\). The if statement in Montgomery reduction is used to normalize the \(t\) in the previous statement, which takes a value between \(0\) and \(2N - 1\), into the range \(\lbrack 0, N)\).
To put it the other way around, one can do without the if statement and still obtain a value congruent to the true value modulo \(N\) if there is no overflow, etc. So, let us do the multiplication without normalization in reduce, that is, consider the operation represented as the following function incomplete_reduce:

uint64_t incomplete_reduce(uint64_t T) {
  return (T + (T % R * Nprime % R) * N) / R;
}

Consider a code where the function reduce is replaced with incomplete_reduce (shown below). If the \(C\) in the fifth line of f is always less than \(N\), this would be correct (it is not).

uint64_t A = a, B = b, C;
void f() {
  A = incomplete_reduce(A * R_2);
  B = incomplete_reduce(B * R_2);
  C = A * B;
  C = incomplete_reduce(C);
  C = incomplete_reduce(C);
}

By evaluating the value assigned to the left side in each line, it can be verified that the C in the fifth line is at most \(N+6\). (Details are omitted, but one can use the fact that the return value of incomplete_reduce is bound by \((T + (R-1)N) / R\). Stricter evaluation is also possible.)

By the way, the return value of incomplete_reduce depends on \(T\), and one can see that a smaller \(T\) allows a stricter evaluation of the output. So, let us perform the reduction once again, that is, add the following to the code as the sixth and seventh lines.

C = incomplete_reduce(C * R_2);
C = incomplete_reduce(C);

Then, it can be verified that \(C \lt N+1\). (Details are again omitted, but the writer thinks it is necessary to use \(R_2 \lt N/3\).) Additionally, it never happens that \(C=N\). (In such a case, at least one of \(a\) and \(b\) should be \(0\), which would make \(C=0\) since incomplete_reduce(0) = 0.) Thus, \(C\) is always less than \(N\).
Therefore, it has been proved that performing the incomplete reduction one extra time ensures that \(C\) becomes less than \(N\), allowing one to compute \(a \times b \bmod{N}\).

From the above, one can compute \(a \times b \pmod{N}\) by implementing Method 1 or 2 in the program. There should be enough lines to implement either method. (Our solutions had \(27\), \(37\), \(40\), and \(40\) lines.) By the way, you can reduce instructions by first computing \(A \times B\) and then performing a reduction on that value.

The explanation of Method 1 was written referring to maspy’s program, and the note on reducing instructions was written referring to maroon’s program. The writer thanks the admin and testers.

posted:
last update: