Optimize for fun: Modular Exponentiation in Montgomery Space

How to compute modular exponentiation c = x^e mod n efficiently, especially when exponentiation part e is large?

A better algorithm is exponentiation by squaring [1], we can express e in binary representation:

/// Modular Exponentiation
/// Inputs: x, e, n 128-bit integer
/// Output: x^e mod n (128-bit integer)
uint128_t modexp(uint128_t x, uint128_t e, uint128_t n) {
  uint128_t base = x;
  x = 1;
  while (e) {
    if (e & 0x1) {
      uint256_t xb = uint256_t::mul(x, base);
      x = xb.mod(n);
    }

    uint256_t xb = uint256_t::mul(base, base);
    base = xb.mod(n);
    e >>= 1;
  }
  return x;
}

And modulo:

  uint128_t mod(const uint128_t b) const {
    uint128_t r = 0;
    for (uint16_t x = bits(); x > 0; x--) {
      r <<= 1;
      if (x > 128) {
        if ((high >> (x - 1U - 128)) & 1)
          ++r;
      } else {
        if ((low >> (x - 1)) & 1)
          ++r;
      }
      if (r >= b)
        r -= b;
    }
    return r;
  }

Even though the mod only use bit operations, but it is actually a very expensive operation, because of the for loop. Better algorithm? In modexp function, we compute double width product x*base, and base*base, then find their smallest nonnegative representation by performing a mod.
base*base ≡ c (mod n)
x*base ≡ d (mod n)

The idea is, we can use different equivalence class for x and base, such that divide and modulo in these classes become very cheap [2][3]. On the computer system, we choose R = 2^k and R > n, transform x and base into Montgomery representation by x*R mod n, base*R mod n, such that modulo R just discards high bits, divide R just shifts bits right [2][3].
Let's look at the code, the code framework is similar to modexp.

  /// Inputs: x, e 128-bit integer
  /// Output: x^e mod n (128-bit integer)
  ///
  /// n is initialized in constructor
  uint128_t montModExp(uint128_t x, uint128_t e) {
    uint128_t mX = toMontSpace(1);
    uint128_t mBase = toMontSpace(x);

    while (e) {
      if (e & 0x1)
        mX = montMul(mX, mBase);
      mBase = montMul(mBase, mBase);
      e >>= 1;
    }
    return montMul(mX, 1);
  }

First, we transform x and base into Montgomery representation.
This is a 128-bit integer implementation, so we choose R = 2¹²⁸

  /// Transform x to Montgomery Representation.
  /// Inputs: x (128-bit integer)
  /// Output: c (128-bit integer) = x * R mod n
  ///
  /// R = 2^128 in 128-bit integer implementation.
  uint128_t toMontSpace(uint128_t x) {
    x %= N;
    for (int i = 0; i < 128; i++) {
      x <<= 1;
      if (x >= N)
        x -= N;
    }
    return x;
  }

Then we perform Montgomery multiplication xR (mod n) * baseR (mod n), and (baseR)² (mod n),

  /// Compute Montgomery multiplication.
  /// cR = aR * bR * R^-1 (mod n)
  ///
  /// mA = aR, mB = bR. mA and mB are in Montgomery space. (128-bit integer)
  uint128_t montMul(uint128_t mA, uint128_t mB) {
    return reduce(uint256_t::mul(mA, mB));
  }

Note:

This is a critical step and is known as Montgomery reduction. [4][3][5][2]
Let's see the mathematics and how to implement it.
Let T = (aR mod n)(bR mod n)
The idea is, add kn to T such that T + kn ≡ 0 (mod R) , therefor TR^-1 ≡ (T + kn)/R (mod n) [2][3]
Why it is correct? If we just multiply two numbers without (mod n) for each number:
(aR)(bR) / R = abR, then perform (mod R), we can have correct result abR (mod n)
But we can't directly T / R, because after modulo operation, T is not divisible by R.
However, T + kn is divisible by R, and T + kn ≡ T (mod n)
So we can add kn to T, such that T + kn ≡ 0 (mod R), then we can directly divide R from it.

How to find k? The algorithm [4][3]:

function reduce(T):
    k = (T mod R) * n' mod R
    a = (T - k * n) / R
    if a < 0:
        a += n
    return a

[cp-algorithm:montgomery_multiplication ]

Note: ∵ R = 2^k , in our implementation, it is R = 2¹²⁸,
∴ (mod R) = drop bits >= 128 bits,
(div R) = shift right 128 bits, in other words, drop lower 128 bits
They are cheap on the computer system.
Also a = [-n, n-1]

  /// Compute Montgomery Reduction.
  /// cR = xR^2 * R^-1 (mod n)
  ///
  /// Algorithm:
  ///   q = (xR^2 mod R) * n' mod R
  ///   a = (xR^2 + q * n) / R
  ///   if a >= n:
  ///     a -= n
  ///   return a
  uint128_t reduce(uint256_t xR2) {
    uint128_t q = xR2.low * Ninv;
    int128_t a = xR2.high - uint256_t::mul(q, N).high;
    if (a < 0)
      a += N;
    return a;
  }

We can also think in this way (wiki Montgomery reduction):

Next question is how to find n' (that is: inverse of n, such that nn' ≡ 1 (mod R) )?
There is a fast inverse trick:
(directly copy from: https://cp-algorithms.com/algebra/montgomery_multiplication.html#toc-tgt-2)

Proof: Let ax = 1 + m * 2^k

    // Compute N^-1 s.t. N * N^-1 ≡ 1 (mod R)
    // Fast inverse trick:
    //    a * x ≡ 1 (mod 2^k) => a * x * (2 - a * x) ≡ 1 (mod 2^(2k))
    //
    // Ninv = 1,
    // (1) N * Ninv ≡ 1 (mod 2^1) => N * Ninv * (2 - N * Ninv) (mod 2^2)
    //     Update Ninv = Ninv * (2 - N * Ninv)
    // (2) N * Ninv ≡ 1 (mod 2^2) => N * Ninv * (2 - N * Ninv) (mod 2^4)
    //     Update Ninv = Ninv * (2 - N * Ninv)
    // (3) ...
    // (7) N * Ninv ≡ 1 (mod 2^64) => N * Ninv * (2 - N * Ninv) (mod 2^128)
    Ninv = 1;
    for (int i = 0; i < 7; ++i)
      Ninv = Ninv * (2 - N * Ninv);

We finished the Montgomery exponentiation implementation.
But the transformation of a number to Montgomery representation is too slow, actually we can optimize it by the fact:

The algorithm for computing R² mod n:

    R2 = -N % N;
    for (int i = 0; i < 4; i++) {
      R2 <<= 1;
      if (R2 >= N)
        R2 -= N;
    }

    for (int i = 0; i < 5; i++) {
      R2 = montMul(R2, R2);
    }

Note -N % N ≡ R - N ≡ R (mod N).
Then modify our montModExp:

  /// Compute: c = x^e mod n
  ///
  /// x, e, and c are 128-bit integers.
  /// n and r^2 is initialized in constructor
  uint128_t montModExp(uint128_t x, uint128_t e) {
    // Result x in Montgomery Space
    uint128_t mX = montMul(1, R2);

    // Transform x to Montgomery Space as base
    // mBase = x * r^2 * r^-1 (mod N)
    uint128_t mBase = montMul(x, R2);
    while (e) {
      if (e & 0x1)
        mX = montMul(mX, mBase);
      mBase = montMul(mBase, mBase);
      e >>= 1;
    }

    // Transform x from Montgomery Space to Normal Space.
    return montMul(mX, 1);
  }

Done! (=ﾟωﾟ)ﾉ

Performance Comparison:

Compile Flags: -std=c++11 -Wall -O3 -DNDEBUG
Machine: i7-4770HQ CPU @ 2.20GHz
n = 0x9e40fd675571e0af74d65da4ea541cf
x = 0xfbeab553608bdf65b2ab09bb910317f9
e = 0x172a202e867b11779604827082342863
Test loop: 10000
modexp takes 3133.34 msec
Montgomery exponential: takes 18.7284 msec.
167 times faster than modexp.

Full Implementation:
https://github.com/cycheng/coding-for-fun/blob/master/crypto/montgomery.cpp

References:
[1] https://en.wikipedia.org/wiki/Modular_arithmetic
[2] (Traditional Chinese) https://ee.ntu.edu.tw/upload/hischool/doc/2011.07.pdf
[3] https://en.wikipedia.org/wiki/Montgomery_modular_multiplication
[4] https://cp-algorithms.com/algebra/montgomery_multiplication.html
[5] https://cryptography.fandom.com/wiki/Montgomery_reduction

Good Tools:
[1] Big number calculator https://www.boxentriq.com/code-breaking/big-number-calculator
[2] Online LaTeX editor https://www.codecogs.com/latex/eqneditor.php

Optimize for fun

Sunday, April 26, 2020

Modular Exponentiation in Montgomery Space

No comments:

Post a Comment