<TeXmacs|1.99.5>

<style|<tuple|tmarticle|cite-sort>>

<\body>
  <\hide-preamble>
    <new-algorithm|cpp-function|Function|>

    <assign|xpseudo-code|<\macro|body>
      <\with|par-first|0fn|par-par-sep|0fn|item-hsep|<macro|1.5fn>|par-left|<value|par-first>>
        <arg|body>
      </with>
    </macro>>

    <assign|xcpp-code|<\macro|body>
      <\pseudo-code>
        <cpp|<arg|body>>
      </pseudo-code>
    </macro>>

    <assign|specified-algorithm|<\macro|intro|body>
      <\surround|<compound|next-algorithm>|>
        <\render-specified-algorithm|<compound|algorithm-text>
        <with|font-shape|right|<compound|the-algorithm>><no-break-here>>
          <arg|intro>
        <|render-specified-algorithm>
          <arg|body>
        </render-specified-algorithm>
      </surround>
    </macro>>
  </hide-preamble>

  <doc-data|<doc-title|Implementing fast carryless
  multiplication>|<doc-author|<author-data|<author-name|Joris van der
  Hoeven>|<\author-affiliation>
    Laboratoire d'informatique de l'École polytechnique

    LIX, UMR 7161 CNRS

    Campus de l'École polytechnique

    1, rue Honoré d'Estienne d'Orves

    Bâtiment Alan Turing, CS35003

    91120 Palaiseau, France
  </author-affiliation>|<author-email|vdhoeven@lix.polytechnique.fr>>>|<doc-author|<author-data|<author-name|Robin
  Larrieu>|<author-email|larrieu@lix.polytechnique.fr>|<\author-affiliation>
    Laboratoire d'informatique de l'École polytechnique

    LIX, UMR 7161 CNRS

    Campus de l'École polytechnique

    1, rue Honoré d'Estienne d'Orves

    Bâtiment Alan Turing, CS35003

    91120 Palaiseau, France
  </author-affiliation>>>|<doc-title-options|cluster-by-affiliation>|<doc-author|<author-data|<author-name|Grégoire
  Lecerf>|<\author-affiliation>
    Laboratoire d'informatique de l'École polytechnique

    LIX, UMR 7161 CNRS

    Campus de l'École polytechnique

    1, rue Honoré d'Estienne d'Orves

    Bâtiment Alan Turing, CS35003

    91120 Palaiseau, France
  </author-affiliation>|<author-email|lecerf@lix.polytechnique.fr>>>|<doc-date|Preliminary
  version of <date>>>

  <abstract-data|<\abstract>
    The efficient multiplication of polynomials over the finite field
    <math|\<bbb-F\><rsub|2>> is a fundamental problem in computer science
    with several applications to geometric error correcting codes and
    algebraic crypto-systems. In this paper we report on a new algorithm that
    leads to a practical speed-up by a factor of about two over previously
    available implementations. Our current implementation assumes a modern
    AVX2 and CLMUL enabled processor.
  </abstract>>

  <section|Introduction>

  Modern algorithms for fast polynomial multiplication are generally based on
  <with|font-shape|italic|evaluation-interpolation> strategies and more
  particularly on the <with|font-shape|italic|discrete Fourier transform>
  (DFT). Taking coefficients in the finite field <math|\<bbb-F\><rsub|2>>
  with two elements, the problem of multiplying in
  <math|\<bbb-F\><rsub|2><around*|[|x|]>> is also known as <em|carryless
  integer multiplication> (assuming binary notation). The aim of this paper
  is to present a practically efficient solution for large degrees.

  One major obstruction to evaluation-interpolation strategies over small
  finite fields is the potential lack of evaluation points. The customary
  remedy is to work in suitable extension fields. The question remains how
  to reduce the incurred overhead as much as possible.

  More specifically, it was shown in<nbsp><cite|vdH:f2kmul> that
  multiplication in <math|\<bbb-F\><rsub|2><around*|[|x|]>> can be done
  efficiently by reducing it to polynomial multiplication over the
  <em|Babylonian field> <math|\<bbb-F\><rsub|2<rsup|60>>>. Part of this
  reduction relied on Kronecker segmentation, which involves an overhead of a
  factor two. In this paper, we present a variant of a new algorithm
  from<nbsp><cite|vdH:ffft> that removes this overhead almost entirely. We
  also report on our <name|Mathemagix> implementation that is roughly twice
  as efficient as before.

  <subsection|Related work>

  For a long time, the best known algorithm for carryless integer
  multiplication was Schönhage's triadic variant<nbsp><cite|Sch77> of
  Schönhage\UStrassen's algorithm<nbsp><cite|SS71> for integer
  multiplication: it achieves a complexity <math|O<around*|(|n*log n*log log
  n|)>> for the multiplication of two polynomials of degree <math|n>.
  Recently<nbsp><cite|vdH:ffmul>, Harvey, van der Hoeven and Lecerf proved
  the sharper bound <math|O<around*|(|n*log n*8<rsup|log<rsup|\<ast\>>
  n>|)>>, but also showed that several of the new ideas could be used for
  faster practical implementations<nbsp><cite|vdH:f2kmul>.

  More specifically, they showed how to reduce multiplication in
  <math|\<bbb-F\><rsub|2><around*|[|x|]>> to DFTs over
  <math|\<bbb-F\><rsub|2<rsup|60>>>, which can be computed efficiently due to
  the existence of many small prime divisors
  of<nbsp><rigid|<math|2<rsup|60>-1>>. Their reduction relies on
  <em|Kronecker segmentation>: given two input polynomials
  <math|A<around*|(|x|)>=<big|sum><rsub|0\<leqslant\>i\<less\>n>a<rsub|i>*x<rsup|i>>
  and <math|B<around*|(|x|)>=<big|sum><rsub|0\<leqslant\>i\<less\>n>b<rsub|i>*x<rsup|i>>
  in <math|\<bbb-F\><rsub|2><around*|[|x|]>>, one cuts them into chunks of 30
  bits and forms <math|<wide|A|~><rigid|<around*|(|y,z|)>>=<big|sum><rsub|i=0><rsup|m-1><big|sum><rsub|j=0><rsup|29>a<rsub|30*i+j>*z<rsup|j>*y<rsup|i>>
  and <math|<wide|B|~><around*|(|y,z|)>=<big|sum><rsub|i=0><rsup|m-1><big|sum><rsub|j=0><rsup|29>b<rsub|30*i+j>*z<rsup|j>*y<rsup|i>>,
  where <math|m=<around*|\<lceil\>|n/30|\<rceil\>>> (the least integer
  <math|\<geqslant\>n/30>). Hence <math|A<around*|(|x|)>=<wide|A|~><around*|(|x<rsup|30>,x|)>>,
  <math|B<around*|(|x|)>=<rigid|<wide|B|~><around*|(|x<rsup|30>,x|)>>>, and
  the product <math|C=A*B> satisfies <math|C<around*|(|x|)>=<wide|C|~><around*|(|x<rsup|30>,x|)>>,
  where <math|<wide|C|~>=<wide|A|~>*<wide|B|~>>. Now <math|<wide|A|~>> and
  <math|<wide|B|~>> are multiplied in <math|\<bbb-F\><rsub|2<rsup|60>><around*|[|y|]>>
  by reinterpreting <math|z> as the generator of
  <math|\<bbb-F\><rsub|2<rsup|60>>>. The recovery of <math|<wide|C|~>> is
  possible since its degree in <math|z> is bounded by
  <math|2\<cdot\>29=58\<less\>60>. However, in terms of input size, half of
  the <math|60> coefficients of <math|<wide|A|~><around*|(|y,z|)>> and
  <math|<wide|B|~><around*|(|y,z|)>> in <math|z> are \Pleft blank\Q, when
  reinterpreted inside <math|\<bbb-F\><rsub|2<rsup|60>>>. Consequently, this
  reduction method based on Kronecker segmentation involves a constant
  overhead of roughly <math|2>. In fact, when considering algorithms with
  asymptotically softly linear costs, comparing relative input sizes gives a
  rough approximation of the relative costs.
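  The chunking step of Kronecker segmentation can be sketched as follows. This is only an illustration under our own assumptions: the coefficient <math|a<rsub|k>> is taken to be bit <math|k mod 64> of word <math|<around*|\<lfloor\>|k/64|\<rfloor\>>>, and the names <cpp|get_bit> and <cpp|kronecker_chunks> are hypothetical, not taken from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed packing: coefficient a_k is bit (k % 64) of a[k / 64]. */
static unsigned get_bit (const uint64_t *a, size_t k)
{
  return (unsigned) ((a[k / 64] >> (k % 64)) & 1u);
}

/* Cut A (n coefficients) into m = ceil(n / 30) chunks.  chunk[i] holds
   a_{30 i}, ..., a_{30 i + 29} in its 30 low bits, that is the coefficient
   of y^i in A~(y, z), seen as a polynomial of degree < 30 in z.  The 30
   remaining bits of a 60-bit field element stay zero ("left blank"). */
size_t kronecker_chunks (const uint64_t *a, size_t n, uint64_t *chunk)
{
  size_t m = (n + 29) / 30;
  for (size_t i = 0; i < m; i++) {
    uint64_t c = 0;
    for (size_t j = 0; j < 30 && 30 * i + j < n; j++)
      c |= (uint64_t) get_bit (a, 30 * i + j) << j;
    chunk[i] = c;
  }
  return m;
}
```

  Each chunk occupies only half of the 60 bits available in an element of <math|\<bbb-F\><rsub|2<rsup|60>>>, which is where the factor of roughly two is lost.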

  Recently van der Hoeven and Larrieu<nbsp><cite|vdH:ffft> have proposed a
  new way to reduce multiplication of polynomials in
  <math|\<bbb-F\><rsub|q><around*|[|x|]>> to the computation of DFTs over an
  extension <math|\<bbb-F\><rsub|q<rsup|\<ell\>>>>. Roughly speaking, they
  have shown that the DFT of a polynomial in
  <math|\<bbb-F\><rsub|q<rsup|\<ell\>>><around*|[|x|]>> could be computed
  almost<nbsp><math|\<ell\>> times faster if its coefficients happen to lie
  in the subfield <math|\<bbb-F\><rsub|q>>. Using their algorithm, called the
  <em|Frobenius FFT>, it is theoretically possible to avoid the overhead of
  Kronecker segmentation, and thereby to gain a factor of two with respect
  to<nbsp><cite|vdH:f2kmul>. However, application of the Frobenius FFT as
  described in<nbsp><cite|vdH:ffft> involves computations in all intermediate
  fields <math|\<bbb-F\><rsub|q<rsup|e>>> between <math|\<bbb-F\><rsub|q>>
  and <math|\<bbb-F\><rsub|q<rsup|\<ell\>>>>. This makes the theoretical
  speed-up of two harder to achieve and practical implementations more
  cumbersome.

  Besides Schönhage\UStrassen type algorithms, let us mention that other
  strategies such as the <em|additive Fourier transform> have been developed
  for <math|\<bbb-F\><rsub|2<rsup|k>><around*|[|x|]>><nbsp><cite|GaoMateer2010|LinChungHan2014>.
  A competitive implementation based on the latter transform has been
  achieved very recently by Chen et al.<nbsp><cite|chen2017faster>\Vnotice
  that their preprint<nbsp><cite|chen2017faster> does not take into account
  our new implementation. For more historical details on the complexity of
  polynomial multiplication we refer the reader to the introductions
  of<nbsp><cite|vdH:ffmul|vdH:f2kmul> and to the book by von zur Gathen and
  Gerhard<nbsp><cite|GaGe2013>.

  <subsection|Results and outline of the paper>

  This paper contains two main results. In section<nbsp><reference|algo-sec>,
  we describe a variant of the Frobenius DFT for the special extension of
  <math|\<bbb-F\><rsub|2<rsup|60>>> over <math|\<bbb-F\><rsub|2>>. Using a
  single rewriting step, this new algorithm reduces the computation of a
  Frobenius DFT to the computation of an ordinary DFT over
  <math|\<bbb-F\><rsub|2<rsup|60>>>, thereby avoiding computations in any
  intermediate fields <math|\<bbb-F\><rsub|2<rsup|e>>> with
  <math|1\<less\>e\<less\>60> and <math|e\<divides\>60>.

  Our second main result is a practical implementation of the new algorithm,
  which indeed gains a factor approaching two with respect to
  our previous work. We underline that in both cases, DFTs over
  <math|\<bbb-F\><rsub|2<rsup|60>>> represent the bulk of the computation,
  but the lengths of the DFTs are halved for the new algorithm. In
  particular, the observed acceleration is due to our new algorithm and not
  the result of <em|ad hoc> code tuning or hardware specific optimizations.

  In section<nbsp><reference|impl-sec>, we present some of the low level
  implementation details concerning the new rewriting step. Our timings are
  presented in section<nbsp><reference|bench-sec>. Our implementation
  outperforms the reference library <name|gf2x> version<nbsp>1.2 developed by
  Brent, Gaudry, Thomé and Zimmermann<nbsp><cite|BrGaThZi2008> for
  multiplying polynomials in <math|\<bbb-F\><rsub|2><around*|[|x|]>>. We also
  outperform the recent implementation by Chen et
  al.<nbsp><cite|chen2017faster>. Finally, the evaluation-interpolation
  strategy used by our algorithm is particularly well suited for multiplying
  matrices of polynomials over <math|\<bbb-F\><rsub|2>>, as reported in
  section<nbsp><reference|bench-sec>.

  <section|Prerequisites><label|sec:prereq>

  <subsubsection*|Discrete Fourier transforms>

  <no-indent>Let <math|\<omega\>> be a primitive root of unity of order
  <math|n> in <math|\<bbb-F\><rsub|q>>. The <em|discrete Fourier transform>
  (DFT) of an<nbsp><math|n><nbhyph>tuple <math|a=<around*|(|a<rsub|0>,\<ldots\>,a<rsub|n-1>|)>\<in\>\<bbb-F\><rsub|q><rsup|n>>
  with respect to <math|\<omega\>> is <math|DFT<rsub|\<omega\>><around*|(|a|)>\<assign\><around*|(|<wide|a|^><rsub|0>,\<ldots\>,<wide|a|^><rsub|n-1>|)>\<in\>\<bbb-F\><rsub|q><rsup|n>>,
  where

  <\eqnarray*>
    <tformat|<table|<row|<cell|<wide|a|^><rsub|i>>|<cell|\<assign\>>|<cell|a<rsub|0>+a<rsub|1>*\<omega\><rsup|i>+\<cdots\>+a<rsub|n-1>*\<omega\><rsup|<around*|(|n-1|)>*i>.>>>>
  </eqnarray*>

  Hence <math|<wide|a|^><rsub|i>> is the evaluation of the polynomial
  <math|A<around*|(|x|)>=a<rsub|0>+a<rsub|1>*x+\<cdots\>+a<rsub|n-1>*x<rsup|n-1>>
  at <math|\<omega\><rsup|i>>. For simplicity we often identify <math|A> with
  <math|a> and we simply write <math|DFT<rsub|\<omega\>><around*|(|A|)>>. The
  inverse transform is related to the direct transform via
  <math|DFT<rsub|\<omega\>><rsup|-1>=n<rsup|-1>*DFT<rsub|\<omega\><rsup|-1>>>,
  which follows from the well known formula

  <\eqnarray*>
    <tformat|<table|<row|<cell|DFT<rsub|\<omega\><rsup|-1>><around*|(|DFT<rsub|\<omega\>><around*|(|a|)>|)>>|<cell|=>|<cell|n*a.>>>>
  </eqnarray*>

  If <math|n> properly factors as <math|n=n<rsub|1>*n<rsub|2>>, then
  <math|\<omega\><rsup|n<rsub|1>>> is an <math|n<rsub|2>>-th primitive root
  of unity and <math|\<omega\><rsup|n<rsub|2>>> is an <math|n<rsub|1>>-th
  primitive root of unity. Moreover, for any
  <math|i<rsub|1>\<in\><around*|{|0,\<ldots\>,n<rsub|1>-1|}>> and
  <math|i<rsub|2>\<in\><around*|{|0,\<ldots\>,n<rsub|2>-1|}>>, we have

  <\eqnarray*>
    <tformat|<table|<row|<cell|<wide|a|^><rsub|i<rsub|1>*n<rsub|2>+i<rsub|2>>>|<cell|=>|<cell|<big|sum><rsub|0\<leqslant\>k<rsub|1>\<less\>n<rsub|1>><big|sum><rsub|0\<leqslant\>k<rsub|2>\<less\>n<rsub|2>>a<rsub|k<rsub|2>*n<rsub|1>+k<rsub|1>>*\<omega\><rsup|<around*|(|k<rsub|2>*n<rsub|1>+k<rsub|1>|)>*<around*|(|i<rsub|1>*n<rsub|2>+i<rsub|2>|)>>>>|<row|<cell|>|<cell|=>|<cell|<big|sum><rsub|0\<leqslant\>k<rsub|1>\<less\>n<rsub|1>>\<omega\><rsup|k<rsub|1>*i<rsub|2>>*<around*|(|<big|sum><rsub|0\<leqslant\>k<rsub|2>\<less\>n<rsub|2>>a<rsub|k<rsub|2>*n<rsub|1>+k<rsub|1>>*<around*|(|\<omega\><rsup|n<rsub|1>>|)><rsup|k<rsub|2>*i<rsub|2>>|)>*<around*|(|\<omega\><rsup|n<rsub|2>>|)><rsup|k<rsub|1>*i<rsub|1>>.<eq-number><label|FFT-dec>>>>>
  </eqnarray*>

  If <math|\<cal-A\><rsub|1>> and <math|\<cal-A\><rsub|2>> are algorithms for
  computing DFTs of length <math|n<rsub|1>> and <math|n<rsub|2>>, we may
  use<nbsp><eqref|FFT-dec> to construct an algorithm for computing DFTs of
  length <math|n> as follows. For each <math|k<rsub|1>\<in\><around*|{|0,\<ldots\>,n<rsub|1>-1|}>>,
  the sum inside the brackets corresponds to the <math|i<rsub|2>>-th
  coefficient of a<nbsp>DFT of the <math|n<rsub|2>>-tuple
  <math|<around*|(|a<rsub|0*n<rsub|1>+k<rsub|1>>,\<ldots\>,a<rsub|<around*|(|n<rsub|2>-1|)>*n<rsub|1>+k<rsub|1>>|)>\<in\>\<bbb-F\><rsub|q><rsup|n<rsub|2>>>
  with respect to <math|\<omega\><rsup|n<rsub|1>>>. Evaluating these
  <with|font-shape|italic|inner DFTs> requires <math|n<rsub|1>> calls to
  <math|\<cal-A\><rsub|2>>. Next, we multiply by the <em|twiddle factors>
  <math|\<omega\><rsup|k<rsub|1>*i<rsub|2>>\<nocomma\>>, at a cost
  of<nbsp><math|n> operations in <math|\<bbb-F\><rsub|q>>. Finally, for each
  <math|i<rsub|2>\<in\><around*|{|0,\<ldots\>,n<rsub|2>-1|}>>, the outer sum
  corresponds to the <math|i<rsub|1>>-th coefficient of a DFT of an
  <math|n<rsub|1>>-tuple in <math|\<bbb-F\><rsub|q><rsup|n<rsub|1>>> with
  respect to <math|\<omega\><rsup|n<rsub|2>>>. These
  <with|font-shape|italic|outer DFTs> require <math|n<rsub|2>> calls to
  <math|\<cal-A\><rsub|1>>. Iterating this decomposition for further
  factorizations of <math|n<rsub|1>> and <math|n<rsub|2>> yields the seminal
  Cooley\UTukey algorithm<nbsp><cite|CT65>.
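  To make the decomposition concrete, here is a small sketch of one Cooley\UTukey step over a toy prime field. The parameters <math|q=13>, <math|n=12=3\<cdot\>4> and <math|\<omega\>=2> are our own choice for illustration (<math|2> has order <math|12> modulo <math|13>), and all function names are hypothetical. Inputs are assumed to be reduced modulo <math|q>.

```c
#include <stdint.h>

/* Toy parameters (assumptions for illustration only). */
#define Q 13u      /* field size, prime */
#define N 12u      /* transform length n = N1 * N2 */
#define N1 3u
#define N2 4u
#define OMEGA 2u   /* primitive N-th root of unity modulo Q */

static uint32_t pow_mod (uint32_t b, uint32_t e)
{
  uint32_t r = 1;
  for (b %= Q; e > 0; e >>= 1, b = b * b % Q)
    if (e & 1) r = r * b % Q;
  return r;
}

/* Naive DFT of length n with respect to w: out[i] = sum_k a[k] w^(i k). */
static void dft_naive (const uint32_t *a, uint32_t *out, uint32_t n, uint32_t w)
{
  for (uint32_t i = 0; i < n; i++) {
    uint32_t s = 0;
    for (uint32_t k = 0; k < n; k++)
      s = (s + a[k] * pow_mod (w, i * k)) % Q;
    out[i] = s;
  }
}

/* One decomposition step: N1 inner DFTs of length N2, twiddle factors,
   then N2 outer DFTs of length N1. */
void dft_cooley_tukey (const uint32_t *a, uint32_t *out)
{
  uint32_t w_inner = pow_mod (OMEGA, N1);  /* root of order N2 */
  uint32_t w_outer = pow_mod (OMEGA, N2);  /* root of order N1 */
  uint32_t inner[N1][N2], t[N2], col[N1], res[N1];
  for (uint32_t k1 = 0; k1 < N1; k1++) {
    for (uint32_t k2 = 0; k2 < N2; k2++)
      t[k2] = a[k2 * N1 + k1];
    dft_naive (t, inner[k1], N2, w_inner);
    for (uint32_t i2 = 0; i2 < N2; i2++)   /* twiddle omega^(k1 i2) */
      inner[k1][i2] = inner[k1][i2] * pow_mod (OMEGA, k1 * i2) % Q;
  }
  for (uint32_t i2 = 0; i2 < N2; i2++) {
    for (uint32_t k1 = 0; k1 < N1; k1++)
      col[k1] = inner[k1][i2];
    dft_naive (col, res, N1, w_outer);
    for (uint32_t i1 = 0; i1 < N1; i1++)
      out[i1 * N2 + i2] = res[i1];
  }
}
```

  In this sketch the inner and outer transforms are still computed naively. A recursive implementation would apply the same step to them again.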

  <no-indent><subsubsection*|Frobenius Fourier transforms><no-break-here>

  <no-indent>Let <math|A> be a polynomial in
  <math|\<bbb-F\><rsub|q><around*|[|x|]>> and let <math|\<omega\>> be a
  primitive root of unity in some extension
  <math|\<bbb-F\><rsub|q<rsup|\<ell\>>>> of <math|\<bbb-F\><rsub|q>>. We
  write <math|\<phi\><rsub|q>> for the Frobenius map
  <math|a\<mapsto\>a<rsup|q>> in <math|\<bbb-F\><rsub|q<rsup|\<ell\>>>> and
  notice that

  <\equation>
    A<around*|(|\<phi\><rsub|q><around*|(|a|)>|)>=\<phi\><rsub|q><around*|(|A<around*|(|a|)>|)>,<label|eqn:fdft>
  </equation>

  for any <math|a\<in\>\<bbb-F\><rsub|q<rsup|\<ell\>>>>. This formula implies
  many nontrivial relations for the DFT of <math|A>: if
  <math|\<omega\><rsup|i>=\<phi\><rsub|q><rsup|\<circ\>k><around*|(|\<omega\><rsup|j>|)>>,
  then we have <math|A<around*|(|\<omega\><rsup|i>|)>=\<phi\><rsub|q><rsup|\<circ\>k><around*|(|A<around*|(|\<omega\><rsup|j>|)>|)>>.
  In other words, some values of the DFT of <math|A> can be deduced from
  others, and the advantage of the Frobenius transform introduced
  in<nbsp><cite|vdH:ffft> is to restrict the bulk of the evaluations to a
  minimum number of points.

  Let <math|n> denote the order of the root <math|\<omega\>>, and consider
  the set <math|\<Omega\>=<around*|{|1,\<omega\>,\<omega\><rsup|2>,\<ldots\>,\<omega\><rsup|n-1>|}>>.
  This set is clearly globally stable under <math|\<phi\><rsub|q>>, so the
  group <math|<around*|\<langle\>|\<phi\><rsub|q>|\<rangle\>>> generated by
  <math|\<phi\><rsub|q>> acts naturally on it. This action partitions
  <math|\<Omega\>> into disjoint orbits. Assume that we have a section
  <math|\<Sigma\>> of <math|\<Omega\>> that contains exactly one element in
  each orbit. Then formula<nbsp>(<reference|eqn:fdft>) allows us to recover
  <math|DFT<rsub|\<omega\>><around*|(|A|)>> from the evaluations of <math|A>
  at each of the points in <math|\<Sigma\>>. The vector
  <math|<around*|(|A<around*|(|\<sigma\>|)>|)><rsub|\<sigma\>\<in\>\<Sigma\>>>
  is called the <em|Frobenius DFT> of <math|A>.
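  On the level of exponents, the Frobenius map sends <math|\<omega\><rsup|i>> to <math|\<omega\><rsup|q*i mod n>>, so a section can be found by walking through the orbits of <math|i\<mapsto\>q*i mod n>. The following sketch does this for small <math|n>. The function name and the restriction <math|n\<leqslant\>64> are our own simplifications, and <math|q> is assumed to be coprime to <math|n>.

```c
#include <stddef.h>
#include <stdint.h>

/* Store the smallest exponent of each orbit of i -> q i mod n in section[]
   and return the number of orbits.
   Assumptions: gcd (q, n) = 1 and 1 <= n <= 64 (a bit mask marks visits). */
size_t frobenius_section (uint32_t q, uint32_t n, uint32_t *section)
{
  uint64_t seen = 0;
  size_t count = 0;
  for (uint32_t i = 0; i < n; i++) {
    if ((seen >> i) & 1) continue;
    section[count++] = i;
    uint32_t j = i;
    do {                                   /* walk the orbit of i */
      seen |= UINT64_C (1) << j;
      j = (uint32_t) ((uint64_t) j * q % n);
    } while (j != i);
  }
  return count;
}
```

  For instance, with <math|q=2> and <math|n=15> the exponents split into five orbits, so five evaluations determine all fifteen values of the DFT.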

  <section|Fast reduction from <math|\<bbb-F\><rsub|2><around*|[|x|]>> to
  <math|\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]>>><label|algo-sec>

  <subsection|Variant of the Frobenius DFT>

  To efficiently reduce a multiplication in
  <math|\<bbb-F\><rsub|2><around*|[|x|]>> into DFTs over
  <math|\<bbb-F\><rsub|2<rsup|60>>>, we use an order <math|n> that divides
  <math|2<rsup|60>-1> and such that <math|n=61*m> for some integer <math|m>.
  We perform the decomposition<nbsp><eqref|FFT-dec> with <math|n<rsub|1>=m>
  and <math|n<rsub|2>=61>. Let <math|\<omega\>> be a primitive
  <math|n><nbhyph>th root of unity in <math|\<bbb-F\><rsub|2<rsup|60>>>. The
  discrete Fourier transform of <math|A\<in\>\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>n>>,
  given by <math|<around*|(|A<around*|(|1|)>,A<around*|(|\<omega\>|)>,A<around*|(|\<omega\><rsup|2>|)>,\<ldots\>,A<around*|(|\<omega\><rsup|n-1>|)>|)>\<in\>\<bbb-F\><rsub|2<rsup|60>><rsup|n>>,
  can be reorganized into <math|61> slices as follows

  <\equation*>
    DFT<rsub|\<omega\>><around*|(|A|)>=<around*|(|<around*|(|A<around*|(|\<omega\><rsup|61*i>|)>|)><rsub|0\<leqslant\>i\<less\>m>,<around*|(|A<around*|(|\<omega\><rsup|61*i+1>|)>|)><rsub|0\<leqslant\>i\<less\>m>,\<ldots\>,<around*|(|A<around*|(|\<omega\><rsup|61*i+60>|)>|)><rsub|0\<leqslant\>i\<less\>m>|)>.
  </equation*>

  The variant of the Frobenius DFT of <math|A> that we introduce in the
  present paper corresponds to computing only the second slice:

  <\eqnarray*>
    <tformat|<table|<row|<cell|E<rsub|\<omega\>>:<space|1spc>\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>60*m>>|<cell|\<rightarrow\>>|<cell|\<bbb-F\><rsub|2<rsup|60>><rsup|m>>>|<row|<cell|A>|<cell|\<mapsto\>>|<cell|<around*|(|A<around*|(|\<omega\><rsup|61*i+1>|)>|)><rsub|0\<leqslant\>i\<less\>m>.>>>>
  </eqnarray*>

  Let us show that this transform is actually a bijection. The following
  lemma shows that the slices <math|<around*|(|A<around*|(|\<omega\><rsup|61*i+2>|)>|)><rsub|0\<leqslant\>i\<less\>m>,\<ldots\>,<around*|(|A<around*|(|\<omega\><rsup|61*i+60>|)>|)><rsub|0\<leqslant\>i\<less\>m>>
  can be deduced from the second slice <math|<around*|(|A<around*|(|\<omega\><rsup|61*i+1>|)>|)><rsub|0\<leqslant\>i\<less\>m>>
  using the action of the Frobenius map<nbsp><math|\<phi\><rsub|2>>.

  <\lemma>
    <label|lm:transitive>Let <math|\<Omega\><rsub|i>=<around*|{|\<omega\><rsup|61*j+i>\<of\>0\<leqslant\>j\<less\>m|}>>
    for <math|1\<leqslant\>i\<less\>61>. Then the action of
    <math|<around*|\<langle\>|\<phi\><rsub|2>|\<rangle\>>> is transitive on
    the pairwise disjoint sets <math|\<Omega\><rsub|1>,\<ldots\>,\<Omega\><rsub|60>>.
  </lemma>

  <\proof>
    Let <math|1\<leqslant\>i\<less\>61> and <math|0\<leqslant\>j\<less\>m>,
    we have <math|\<phi\><rsub|2><around*|(|\<omega\><rsup|61*j+i>|)>=\<omega\><rsup|61*j<rprime|'>+<around*|(|2*i
    mod 61|)>>> for some integer <math|0\<leqslant\>j<rprime|'>\<less\>m>, so
    the action of <math|<around*|\<langle\>|\<phi\><rsub|2>|\<rangle\>>> onto
    <math|\<Omega\><rsub|1>,\<ldots\>,\<Omega\><rsub|60>> is well defined.
    Notice that <math|2> is primitive for the multiplicative group
    <math|\<bbb-F\><rsub|61><rsup|\<times\>>>. This implies that for any
    <math|1\<leqslant\>i\<less\>61> there exists <math|k> such that
    <math|2<rsup|k>=i mod 61>. Consequently we have
    <math|\<phi\><rsub|2><rsup|\<circ\>k><around*|(|\<omega\><rsup|61*j+1>|)>=\<omega\><rsup|61*j<rprime|'>+i>>
    for some <math|0\<leqslant\>j<rprime|'>\<less\>m>, whence
    <math|\<phi\><rsub|2><rsup|\<circ\>k><around*|(|\<Omega\><rsub|1>|)>\<subseteq\>\<Omega\><rsub|i>>.
    Since <math|\<phi\><rsub|2>> is injective the latter inclusion is an
    equality.
  </proof>
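  The key fact used in this proof, namely that <math|2> generates <math|\<bbb-F\><rsub|61><rsup|\<times\>>>, is easy to check by machine. The helper below is a hypothetical sketch: it computes the multiplicative order of <math|2> modulo an odd integer <math|p\<gtr\>1>, assumed smaller than <math|2<rsup|31>>.

```c
#include <stdint.h>

/* Multiplicative order of 2 modulo p.
   Assumptions: p is odd, 1 < p < 2^31 (so that 2 * x does not overflow). */
uint32_t order_of_two (uint32_t p)
{
  uint32_t k = 1, x = 2 % p;
  while (x != 1) {
    x = 2 * x % p;
    k++;
  }
  return k;
}
```

  The element <math|2> is primitive modulo a prime <math|p> exactly when this order equals <math|p-1>, which is the case for <math|p=61>.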

  If we needed the complete <math|DFT<rsub|\<omega\>><around*|(|A|)>>,
  then we would still have to compute the first slice
  <math|<around*|(|A<around*|(|\<omega\><rsup|61*i>|)>|)><rsub|0\<leqslant\>i\<less\>m>>.
  The second main new idea with respect to<nbsp><cite|vdH:ffft> is to discard
  this first slice and to restrict ourselves to input polynomials <math|A> of
  degrees <math|\<less\>60*m>. In this way, <math|E<rsub|\<omega\>>> can be
  inverted, as proved in the following proposition.

  <\proposition>
    <label|pp:Ebijective><math|E<rsub|\<omega\>>> is bijective.
  </proposition>

  <\proof>
    The dimensions of the source and destination spaces of
    <math|E<rsub|\<omega\>>> over <math|\<bbb-F\><rsub|2>> being the same, it
    suffices to prove that <math|E<rsub|\<omega\>>> is injective. Let
    <math|A\<in\>\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>60*m>> be such
    that <math|E<rsub|\<omega\>><around*|(|A|)>=0>. By construction, <math|A>
    vanishes at <math|m> distinct values, namely
    <math|\<omega\><rsup|61*i+1>> for <math|0\<leqslant\>i\<less\>m>. Under
    the action of <math|<around*|\<langle\>|\<phi\><rsub|2>|\<rangle\>>> it
    also vanishes at <math|59*m> other values by
    Lemma<nbsp><reference|lm:transitive>. So <math|A> has <math|60*m>
    distinct roots, whereas its degree is <math|\<less\>60*m>, whence <math|A=0>.
  </proof>

  <\remark>
    The transformation <math|E<rsub|\<omega\>>> being bijective is due to the
    fact that <math|2> is primitive in the multiplicative group
    <math|\<bbb-F\><rsub|61><rsup|\<times\>>>. Among the prime divisors of
    <math|2<rsup|60>-1>, the factors 3, 5, 11 and 13 also have this property,
    but taking <math|n<rsub|2>=61> allows us to divide the size of the
    evaluation-interpolation scheme by 60, which is optimal.
  </remark>

  <subsection|Frobenius encoding>

  We decompose the computation of <math|E<rsub|\<omega\>>> into two routines.
  The first routine is written <math|F<rsub|\<omega\>>> and called the
  <em|Frobenius encoding>:

  <\eqnarray*>
    <tformat|<table|<row|<cell|F<rsub|\<omega\>>:<space|1spc>\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>60*m>>|<cell|\<rightarrow\>>|<cell|\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]><rsub|\<less\>m>>>|<row|<cell|A=<big|sum><rsub|0\<leqslant\>k\<less\>60*m>a<rsub|k>*x<rsup|k>>|<cell|\<mapsto\>>|<cell|<big|sum><rsub|0\<leqslant\>k\<less\>m>\<omega\><rsup|k>*<around*|(|<big|sum><rsub|0\<leqslant\>l\<less\>60>a<rsub|k+m*l>*\<theta\><rsup|l>|)>*x<rsup|k>,<text|
    where >\<theta\>=\<omega\><rsup|m>.<eq-number><label|eqn:encode>>>>>
  </eqnarray*>

  Below, we will choose <math|\<theta\>> in such a way that
  <math|F<rsub|\<omega\>>> is essentially a simple reorganization of the
  coefficients of<nbsp><math|A>.

  We observe that the coefficients of <math|F<rsub|\<omega\>><around*|(|A|)>>
  are part of the values of the inner DFTs of <math|A> in the Cooley\UTukey
  formula<nbsp><eqref|FFT-dec>, applied with <math|n<rsub|1>=m> and
  <math|n<rsub|2>=61>. The second task is the computation of the
  corresponding outer DFT of order<nbsp><math|m>:

  <\eqnarray*>
    <tformat|<table|<row|<cell|DFT<rsub|<wide|\<omega\>|~>>:<space|1spc>\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]><rsub|\<less\>m>>|<cell|\<rightarrow\>>|<cell|\<bbb-F\><rsub|2<rsup|60>><rsup|m>>>|<row|<cell|<wide|A|~>>|<cell|\<mapsto\>>|<cell|<around*|(|<wide|A|~><around*|(|<wide|\<omega\>|~><rsup|i>|)>|)><rsub|0\<leqslant\>i\<less\>m>,<text|
    where ><wide|\<omega\>|~>=\<omega\><rsup|61>.>>>>
  </eqnarray*>

  <\proposition>
    <label|pp:Edecomp><math|E<rsub|\<omega\>>=DFT<rsub|<wide|\<omega\>|~>>\<circ\>F<rsub|\<omega\>>>.
  </proposition>

  <\proof>
    <label|pp:E>This formula follows from<nbsp><eqref|FFT-dec>:

    <\equation*>
      A<around*|(|\<omega\><rsup|61*i+1>|)>=<big|sum><rsub|0\<leqslant\>k\<less\>m>\<omega\><rsup|k>*<around*|(|<big|sum><rsub|0\<leqslant\>l\<less\>60>a<rsub|k+m*l>*\<theta\><rsup|l>|)>*<wide|\<omega\>|~><rsup|k*i>=F<rsub|\<omega\>><around*|(|A|)><around*|(|<wide|\<omega\>|~><rsup|i>|)>.
    </equation*>
  </proof>

  Summarizing, we have reduced the computation of a DFT of size
  <math|60*n/61> over<nbsp><math|\<bbb-F\><rsub|2>> to a DFT of size
  <math|m=n/61> over <math|\<bbb-F\><rsub|2<rsup|60>>>. This reduction
  preserves data size.

  <subsection|Direct transforms>

  The computation of <math|F<rsub|\<omega\>>> involves the evaluation of
  <math|m> polynomials in <math|\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>60>>
  at <math|\<theta\>=\<omega\><rsup|m>\<in\>\<bbb-F\><rsub|2<rsup|60>>>. In
  order to perform these evaluations fast, we fix the representation of
  <math|\<bbb-F\><rsub|2<rsup|60>>=\<bbb-F\><rsub|2><around*|[|z|]>/<around*|(|\<mu\><around*|(|z|)>|)>>
  and the primitive root <math|\<nu\>> of unity of maximal order
  <math|2<rsup|60>-1> to be given by

  <\eqnarray*>
    <tformat|<table|<row|<cell|\<mu\><around*|(|z|)>>|<cell|=>|<cell|<around*|(|z<rsup|61>-1|)>/<around*|(|z-1|)>>>|<row|<cell|\<nu\>>|<cell|=>|<cell|z<rsup|18>+z<rsup|6>+1
    mod \<mu\><around*|(|z|)>.>>>>
  </eqnarray*>

  Setting <math|\<omega\>=\<nu\><rsup|<around*|(|2<rsup|60>-1|)>/n>> and
  <math|\<theta\>=\<nu\><rsup|<around*|(|2<rsup|60>-1|)>/61>>, it can be
  checked that <math|\<theta\>=z mod \<mu\><around*|(|z|)>>. Evaluation of a
  polynomial in <math|\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>60>> at
  <math|\<theta\>> can now be done efficiently.
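  With this representation, a polynomial of degree <math|\<less\>60> in <math|z> is already a reduced element of <math|\<bbb-F\><rsub|2<rsup|60>>>, so evaluating at <math|\<theta\>=z> costs no arithmetic at all. The portable sketch below illustrates the arithmetic modulo <math|\<mu\>>. It assumes that bit <math|j> of a 64-bit word stores the coefficient of <math|z<rsup|j>>, and it uses hypothetical function names. It is a slow bit-by-bit model, not the CLMUL-based code of our implementation.

```c
#include <stdint.h>

/* mu(z) = 1 + z + ... + z^60, stored as a bit mask (bit j <-> z^j). */
#define MU ((UINT64_C (1) << 61) - 1)

/* Reduce a polynomial of degree <= 60 modulo mu: if the coefficient of
   z^60 is set, replace z^60 by 1 + z + ... + z^59. */
static uint64_t reduce_mu (uint64_t a)
{
  return ((a >> 60) & 1) ? (a ^ MU) : a;
}

/* Multiply a reduced element (degree < 60) by z. */
static uint64_t mul_by_z (uint64_t a)
{
  return reduce_mu (a << 1);
}

/* Product of two reduced elements by Horner's rule on the bits of b. */
uint64_t gf_mul (uint64_t a, uint64_t b)
{
  uint64_t r = 0;
  for (int j = 59; j >= 0; j--) {
    r = mul_by_z (r);
    if ((b >> j) & 1) r ^= a;
  }
  return r;
}
```

  Since <math|\<mu\>> divides <math|z<rsup|61>-1>, repeated calls to <cpp|mul_by_z> starting from <math|1> return to <math|1> after exactly <math|61> steps, in accordance with <math|\<theta\>> being a root of unity of order <math|61>.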

  <\specified-algorithm>
    <label|al:encode><strong|Input:> <math|A<around*|(|x|)>=<big|sum><rsub|0\<leqslant\>i\<less\>60*m>a<rsub|i>*x<rsup|i>>.

    <strong|Output:> <math|F<rsub|\<omega\>><around*|(|A|)>>.

    <strong|Assumption:> <math|n=61*m> divides <math|2<rsup|60>-1>.
  <|specified-algorithm>
    <\enumerate>
      <item>For <math|i=0,\<ldots\>,m-1>, build
      <math|P<rsub|i><around*|(|z|)>=<big|sum><rsub|0\<leqslant\>j\<less\>60>a<rsub|i+m*j>*z<rsup|j>
      mod \<mu\><around*|(|z|)><space|0.6spc>\<in\><space|0.6spc>\<bbb-F\><rsub|2<rsup|60>>>.

      <item>Return <math|P<rsub|0>+\<omega\>*P<rsub|1>*x+\<omega\><rsup|2>*P<rsub|2>*x<rsup|2>+\<cdots\>+\<omega\><rsup|m-1>*P<rsub|m-1>*x<rsup|m-1>>.
    </enumerate>
  </specified-algorithm>

  <\proposition>
    <label|pp:encode>Algorithm<nbsp><reference|al:encode> is correct.
  </proposition>

  <\proof>
    This follows immediately from the definition of <math|F<rsub|\<omega\>>>
    in formula<nbsp>(<reference|eqn:encode>), using the fact that
    <math|\<theta\>=z mod \<mu\><around*|(|z|)>> in our representation.
  </proof>
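  The first step of Algorithm<nbsp><reference|al:encode> is a pure reorganization of coefficients, which the following sketch makes explicit. For readability we assume an unpacked input with one coefficient per byte, and the function name is hypothetical. The scaling by <math|\<omega\><rsup|i>> of the second step is not shown, since it requires multiplication in <math|\<bbb-F\><rsub|2<rsup|60>>>.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1 of the Frobenius encoding on unpacked data.
   a[k] in {0, 1} is the coefficient a_k of A, for 0 <= k < 60 m.
   On return, bit j of P[i] is a_{i + m j}, so that P[i] represents
   P_i(z) = sum_j a_{i + m j} z^j, a polynomial of degree < 60. */
void frobenius_gather (const uint8_t *a, size_t m, uint64_t *P)
{
  for (size_t i = 0; i < m; i++) {
    uint64_t p = 0;
    for (size_t j = 0; j < 60; j++)
      p |= (uint64_t) (a[i + m * j] & 1u) << j;
    P[i] = p;
  }
}
```

  Viewing the coefficients as a <math|60\<times\>m> bit matrix stored row by row, this gathering amounts to reading the matrix column by column.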

  <\specified-algorithm>
    <label|al:directtransform><strong|Input:>
    <math|A\<in\>\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>60*m>>.

    <strong|Output:> <math|E<rsub|\<omega\>><around*|(|A|)>>.

    <strong|Assumption:> <math|n=61*m> divides <math|2<rsup|60>-1>.
  <|specified-algorithm>
    <\enumerate>
      <item>Compute the Frobenius encoding
      <math|<wide|A|~><around*|(|x|)>\<in\>\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]><rsub|\<less\>m>>
      of <math|A> by Algorithm<nbsp><reference|al:encode>.

      <item>Compute the DFT of <math|<wide|A|~>> with respect to
      <math|<wide|\<omega\>|~>>.
    </enumerate>
  </specified-algorithm>

  <\proposition>
    <label|pp:directtransform>Algorithm<nbsp><reference|al:directtransform>
    is correct.
  </proposition>

  <\proof>
    The correctness simply follows from Propositions<nbsp><reference|pp:E>
    and<nbsp><reference|pp:encode>.
  </proof>

  <subsection|Inverse transforms>

  By combining Propositions<nbsp><reference|pp:Ebijective>
  and<nbsp><reference|pp:Edecomp>, the map <math|F<rsub|\<omega\>>> is
  invertible and its inverse may be computed by the following algorithm.

  <\specified-algorithm>
    <label|al:decode><strong|Input:> <math|<wide|A|~><around*|(|x|)>=<big|sum><rsub|i\<geqslant\>0><wide|a|~><rsub|i>*x<rsup|i>\<in\>\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]><rsub|\<less\>m>>.

    <strong|Output:> <math|F<rsub|\<omega\>><rsup|-1><around*|(|<wide|A|~>|)>>.

    <strong|Assumption:> <math|n=61*m> divides <math|2<rsup|60>-1>.
  <|specified-algorithm>
    <\enumerate>
      <item>For <math|i=0,\<ldots\>,m-1>, build the preimage
      <math|P<rsub|i><around*|(|z|)>\<assign\><big|sum><rsub|0\<leqslant\>j\<less\>60>p<rsub|i,j>*z<rsup|j>>
      of <math|\<omega\><rsup|-i>*<wide|a|~><rsub|i>>.

      <item>Return <math|<big|sum><rsub|0\<leqslant\>i\<less\>m><big|sum><rsub|0\<leqslant\>j\<less\>60>p<rsub|i,j>*x<rsup|i+m*j>>.
    </enumerate>
  </specified-algorithm>

  <\proposition>
    <label|pp:decode>Algorithm<nbsp><reference|al:decode> is correct.
  </proposition>

  <\proof>
    This is a straightforward inversion of
    Algorithm<nbsp><reference|al:encode>.
  </proof>
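  The second step of Algorithm<nbsp><reference|al:decode> scatters the bits of each preimage back to their positions in <math|A>. The sketch below assumes that the preimages are given as 64-bit words, with bit <math|j> of <cpp|P[i]> equal to <math|p<rsub|i,j>>, and that the output is unpacked with one coefficient per byte. The function name is hypothetical, and the division by <math|\<omega\><rsup|i>> of the first step is assumed to have been done already.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 2 of the Frobenius decoding on unpacked data.
   Bit j of P[i] is p_{i,j}; on return a[i + m j] = p_{i,j},
   for 0 <= i < m and 0 <= j < 60. */
void frobenius_scatter (const uint64_t *P, size_t m, uint8_t *a)
{
  for (size_t i = 0; i < m; i++)
    for (size_t j = 0; j < 60; j++)
      a[i + m * j] = (uint8_t) ((P[i] >> j) & 1u);
}
```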

  <\specified-algorithm>
    <label|al:inversetransform><strong|Input:>
    <math|<wide|a|^>\<in\>\<bbb-F\><rsub|2<rsup|60>><rsup|m>>.

    <strong|Output:> <math|E<rsup|-1><rsub|\<omega\>><around*|(|<wide|a|^>|)>>.

    <strong|Assumption:> <math|n=61*m> divides <math|2<rsup|60>-1>.
  <|specified-algorithm>
    <\enumerate>
      <item>Compute the inverse DFT <math|<wide|A|~>\<in\>\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]><rsub|\<less\>m>>
      of <math|<wide|a|^>> with respect to <math|<wide|\<omega\>|~>>.

      <item>Compute the Frobenius decoding <math|A> of <math|<wide|A|~>> by
      Algorithm<nbsp><reference|al:decode> and return <math|A>.
    </enumerate>
  </specified-algorithm>

  <\proposition>
    <label|pp:inversetransform>Algorithm<nbsp><reference|al:inversetransform>
    is correct.
  </proposition>

  <\proof>
    The correctness simply follows from Propositions<nbsp><reference|pp:E>
    and<nbsp><reference|pp:decode>.
  </proof>

  <subsection|Multiplication in <math|\<bbb-F\><rsub|2><around*|[|x|]>>>

  Using the standard technique of multiplication by evaluation-interpolation,
  we may now compute products in <math|\<bbb-F\><rsub|2><around*|[|x|]>> as
  follows:

  <\specified-algorithm>
    <label|al:productf2x><strong|Input:> <math|A,B\<in\>\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>\<ell\>>>.

    <strong|Output:> <math|A*B>.
  <|specified-algorithm>
    <\enumerate>
      <item><label|choice-m>Let <math|m\<geqslant\><around*|(|2*\<ell\>-1|)>/60>
      be such that <math|n=61*m> divides <math|2<rsup|60>-1>.

      <item>Let <math|\<omega\>=\<nu\><rsup|<around*|(|2<rsup|60>-1|)>/n>> be
      the privileged root of unity of order <math|n>.

      <item>Compute <math|E<rsub|\<omega\>><around*|(|A|)>> and
      <math|E<rsub|\<omega\>><around*|(|B|)>> by
      Algorithm<nbsp><reference|al:directtransform>.

      <item>Compute <math|<wide|c|^>> as the entry-wise product of
      <math|E<rsub|\<omega\>><around*|(|A|)>> and
      <math|E<rsub|\<omega\>><around*|(|B|)>>.

      <item>Compute <math|C<around*|(|x|)>=E<rsup|-1><rsub|\<omega\>><around*|(|<wide|c|^>|)>>
      by Algorithm<nbsp><reference|al:inversetransform> and return <math|C>.
    </enumerate>
  </specified-algorithm>

  <\proposition>
    Algorithm<nbsp><reference|al:productf2x> is correct.
  </proposition>

  <\proof>
    The correctness simply follows from Propositions<nbsp><reference|pp:directtransform>
    and<nbsp><reference|pp:inversetransform>, together with the fact that
    <math|E<rsub|\<omega\>><around*|(|A*B|)>=E<rsub|\<omega\>><around*|(|A|)>*E<rsub|\<omega\>><around*|(|B|)>>,
    since <math|m\<geqslant\><around*|(|2*\<ell\>-1|)>/60> ensures that <math|A*B> has degree <math|\<less\>60*m>.
  </proof>

  For step <reference|choice-m>, the actual determination of <math|m> is
  discussed in<nbsp><cite-detail|vdH:f2kmul|section<nbsp>3>. In fact, it is
  often better to pick not the smallest possible value of <math|m> but a
  slightly larger one that is also very smooth. Since <math|2<rsup|60>-1>
  admits many small prime divisors, such smooth values of <math|m> usually
  exist.
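
  As an illustration of step <reference|choice-m>, the following C sketch
  returns the smallest admissible <math|m> by a plain linear search. The
  function name <cpp|choose_m> is hypothetical and the sketch deliberately
  ignores the smoothness heuristic mentioned above, so it should not be read
  as the selection strategy used in our library.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest m >= (2*ell - 1)/60 such that 61*m divides 2^60 - 1.
   Hypothetical helper: a linear search, adequate for illustration only. */
static uint64_t choose_m(uint64_t ell) {
    /* 61 divides 2^60 - 1, so this quotient is exact. */
    const uint64_t q = ((UINT64_C(1) << 60) - 1) / 61;
    /* Guarantees that the search stops at m = q at the latest. */
    assert(ell >= 1 && ell <= 30 * q);
    /* Ceiling of (2*ell - 1)/60. */
    const uint64_t bound = (2 * ell + 58) / 60;
    for (uint64_t m = bound;; m++)
        if (q % m == 0)     /* 61*m divides 2^60 - 1 iff m divides q */
            return m;
}
```

  A production version would rather enumerate the divisors of
  <math|<around*|(|2<rsup|60>-1|)>/61> and prefer smooth candidates.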

  <section|Implementation details><label|impl-sec>

  We follow <name|Intel>'s terminology and use the term
  <with|font-shape|italic|quad word> to denote a unit of 64 bits of data. In
  the rest of the paper we use the <name|C99> standard for presenting our
  source code. In particular, a quad word representing an unsigned integer
  has type <cpp|uint64_t>.

  Our implementations target an AVX2-enabled processor and an operating
  system compliant with the System V Application Binary Interface. The
  <name|C++> library <name|numerix> of <name|Mathemagix><nbsp><cite|mmx-user-guide>
  (<verbatim|<hlink|http://www.mathemagix.org|http://www.mathemagix.org>>)
  defines wrappers for <name|AVX> types. In particular, <cpp|avx_uint64_t>
  represents a SIMD vector of <math|4> elements of type <cpp|uint64_t>.
  Recall that the platform provides <math|16> AVX registers, which must be
  allocated carefully in order to minimize read and write accesses to
  memory.

  Our new polynomial product is implemented in the <name|justinline> library
  of <name|Mathemagix>. The source code is freely available from
  revision<nbsp>10681 of our SVN server (<verbatim|<hlink|https://gforge.inria.fr/projects/mmx/|https://gforge.inria.fr/projects/mmx/>>).
  Main sources are in <shell|justinline/src/frobenius_encode_f2_60.cpp> for
  the Frobenius encoding and in <shell|justinline/mmx/polynomial_f2_amd64_avx2_clmul.mmx>
  for the top level functions. Related test and bench files are also
  available from dedicated directories of the <name|justinline> library. Let
  us further mention here that our <name|Mathemagix> functions may be easily
  exported to <name|C++><nbsp><cite|HoevenLecerf2013>.

  <subsection|Packed representations>

  Polynomials over <math|\<bbb-F\><rsub|2>> are supposed to be given in
  <em|packed representation>, which means that coefficients are stored as a
  vector of contiguous bits in memory. For the implementation considered in
  this paper, a polynomial of degree <math|\<ell\>-1> is stored into
  <math|<around*|\<lceil\>|\<ell\>/64|\<rceil\>>> quad words, starting with
  the low-degree coefficients: the constant term is the least significant bit
  of the first word. The last word is suitably padded with zeros.

  Reading or writing one coefficient or a range of coefficients of a
  polynomial in packed representation must be done carefully to avoid invalid
  memory accesses. Let <math|A> be such a polynomial of type <cpp|uint64_t*>.
  The coefficient <math|a<rsub|i>> of degree <math|i>
  in<nbsp><math|A> is obtained as <cpp|(<math|A>[i \<gtr\>\<gtr\> 6]
  \<gtr\>\<gtr\> (i & 63)) & 1>. However, reading or writing a single
  coefficient should be avoided as much as possible for efficiency, so we
  prefer handling ranges of 256 bits. In the sequel the function of prototype

  <\cpp-code>
    void <with|color|blue|load> (avx_uint64_t& <math|d>, const uint64_t*
    <math|A>,<no-break-here>

    \ \ const uint64_t& <math|\<ell\>>, const uint64_t& <math|i>, const
    uint64_t& <math|e>);
  </cpp-code>

  stores into <math|d> the <math|e\<leqslant\>256> bits of <math|A> starting
  at position <math|i>. Bits at positions <math|\<ell\>> and beyond are
  considered to be zero.
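
  To make the boundary handling concrete, here is a minimal scalar sketch in
  C that reads a single coefficient and a range of at most 64 bits. The
  helpers <cpp|get_coeff> and <cpp|load_bits64> are hypothetical scalar
  analogues introduced for illustration only; they are not the vectorized
  <cpp|load> function of our library.

```c
#include <assert.h>
#include <stdint.h>

/* Coefficient of degree i of the packed polynomial A (caller ensures i < ell). */
static inline uint64_t get_coeff(const uint64_t *A, uint64_t i) {
    return (A[i >> 6] >> (i & 63)) & 1;
}

/* Returns the e <= 64 bits of A starting at bit position i.
   Bits at positions >= ell read as zero; no word beyond
   ceil(ell/64) quad words is ever accessed. */
static uint64_t load_bits64(const uint64_t *A, uint64_t ell,
                            uint64_t i, uint64_t e) {
    assert(e <= 64);
    if (e == 0 || i >= ell) return 0;
    if (e > ell - i) e = ell - i;            /* clamp the range to the polynomial */
    const uint64_t words = (ell + 63) >> 6;  /* number of allocated quad words */
    const uint64_t q = i >> 6, r = i & 63;
    uint64_t x = A[q] >> r;
    if (r != 0 && q + 1 < words)             /* range straddles two quad words */
        x |= A[q + 1] << (64 - r);
    if (e < 64)                              /* avoid an undefined shift by 64 */
        x &= (UINT64_C(1) << e) - 1;
    return x;
}
```

  The two guards on <cpp|r> and <cpp|e> avoid shifts by 64 bits, which are
  undefined in C.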

  For arithmetic operations in <math|\<bbb-F\><rsub|2<rsup|60>>> we refer the
  reader to<nbsp><cite-detail|vdH:f2kmul|section<nbsp>3.1>. In the sequel we
  only appeal to the function<no-break-here>

  <\cpp-code>
    uint64_t <with|color|blue|f2_60_mul> (const uint64_t& <math|a>, const
    uint64_t& <math|b>);<no-break-here>
  </cpp-code>

  that multiplies the two elements <math|a> and <math|b> of
  <math|\<bbb-F\><rsub|2<rsup|60>>> in packed representation.

  We also use a packed column-major representation for matrices over
  <math|\<bbb-F\><rsub|2>>. For instance, an <math|8\<times\>8> bit matrix
  <math|<around*|(|M<rsub|i,j>|)><rsub|0\<leqslant\>i\<less\>8,<space|1spc>0\<leqslant\>j\<less\>8>>
  is encoded as a quad word whose <rigid|<math|<around*|(|8*j+i|)>>><nbhyph>th
  bit is <math|M<rsub|i,j>>. Similarly, a <math|256\<times\>\<ell\>> matrix
  <math|<around*|(|M<rsub|i,j>|)><rsub|0\<leqslant\>i\<less\>256,<space|1spc>0\<leqslant\>j\<less\>\<ell\>>>
  may be seen as a vector <math|v> of type <cpp|avx_uint64_t*>, so
  <math|M<rsub|i,j>> corresponds to the <math|i>-th bit of
  <cpp|<math|v>[<math|j>]>.

  <subsection|Matrix transposition>

  The Frobenius encoding essentially boils down to matrix transpositions. Our
  main building block is <math|256\<times\>64> bit matrix transposition. We
  decompose this transposition in a suitable way with regard to data
  locality, register allocation and vectorization.

  For the computation of general transpositions, we repeatedly make use of
  the well-known divide and conquer strategy: to transpose an
  <math|n\<times\>\<ell\>> matrix <math|M>, where <math|n> and <math|\<ell\>>
  are even, we decompose <math|M=<matrix|<tformat|<table|<row|<cell|A>|<cell|B>>|<row|<cell|C>|<cell|D>>>>>>,
  where <math|A,B,C,D> are <math|n/2\<times\>\<ell\>/2> matrices; we swap the
  anti-diagonal blocks <math|B> and <math|C> and recursively transpose each
  block <math|A,B,C,D>.

  <no-indent><subsubsection|Transposing packed <math|8\<times\>8> bit
  matrices>

  <no-indent>The basic task we begin with is the transposition of a packed
  <math|8\<times\>8> bit matrix. The solution used here is borrowed from
  <cite-detail|Warren2012|Chapter<nbsp>7, section<nbsp>3>.

  <\specified-cpp-function>
    <label|fun:8x8_transpose><strong|Input:>
    <math|<around*|(|M<rsub|i,j>|)><rsub|0\<leqslant\>i\<less\>8,<space|1spc>0\<leqslant\>j\<less\>8>>
    in packed representation.

    <strong|Output:> The transpose <math|<around*|(|N<rsub|i,j>|)><rsub|0\<leqslant\>i\<less\>8,<space|1spc>0\<leqslant\>j\<less\>8>>
    of <math|M> in packed representation.
  <|specified-cpp-function>
    <\cpp>
      uint64_t<no-break-here>

      <with|color|blue|packed_matrix_bit_8x8_transpose> (const uint64_t&
      <math|M>) {
    </cpp>

    <\with|item-vsep|0fn>
      <\enumerate>
        <item><cpp|uint64_t <math|N> = <math|M>;>

        <item><cpp|static const uint64_t mask_4 = 0x00000000f0f0f0f0;>

        <item><cpp|static const uint64_t mask_2 = 0x0000cccc0000cccc;>

        <item><cpp|static const uint64_t mask_1 = 0x00aa00aa00aa00aa;>

        <item><cpp|uint64_t <math|a>;>

        <item><cpp|<math|a> = ((<math|N> \<gtr\>\<gtr\> 28) ^ <math|N>) &
        mask_4; <math|N> = <math|N> ^ <math|a>;>

        <item><cpp|<math|a> = <math|a> \<less\>\<less\> 28; <math|N> =
        <math|N> ^ <math|a>;>

        <item><cpp|<math|a> = ((<math|N> \<gtr\>\<gtr\> 14) ^ <math|N>) &
        mask_2; <math|N> = <math|N> ^ <math|a>;>

        <item><cpp|<math|a> = <math|a> \<less\>\<less\> 14; <math|N> =
        <math|N> ^ <math|a>;>

        <item><cpp|<math|a> = ((<math|N> \<gtr\>\<gtr\> 7) ^ <math|N>) &
        mask_1; \ <math|N> = <math|N> ^ <math|a>;>

        <item><cpp|<math|a> = <math|a> \<less\>\<less\> 7; \ <math|N> =
        <math|N> ^ <math|a>;>

        <item><cpp|return <math|N>; }>
      </enumerate>
    </with>
  </specified-cpp-function>

  In steps 6 and 7, the anti-diagonal <math|4\<times\>4> blocks are swapped.
  In steps 8 and<nbsp>9, the matrix <math|N> is seen as four
  <math|4\<times\>4> matrices whose anti-diagonal <math|2\<times\>2> blocks
  are swapped. In steps 10 and 11, the matrix <math|N> is seen as sixteen
  <math|2\<times\>2> matrices whose anti-diagonal elements are swapped. All
  in all, 18 instructions, 3 constants and one auxiliary variable are needed
  to transpose a packed <math|8\<times\>8> bit matrix in this way.
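
  The three swap stages can be checked against a naive transposition written
  directly from the column-major convention, in which bit <math|8*j+i> holds
  <math|M<rsub|i,j>>. The following C sketch does so; the names
  <cpp|entry8x8>, <cpp|transpose8x8_naive> and <cpp|transpose8x8_swaps> are
  chosen for this illustration and do not appear in our library.

```c
#include <assert.h>
#include <stdint.h>

/* Entry (i, j) of a packed column-major 8x8 bit matrix. */
static inline unsigned entry8x8(uint64_t M, unsigned i, unsigned j) {
    assert(i < 8 && j < 8);
    return (unsigned)((M >> (8 * j + i)) & 1);
}

/* Reference transposition: N_{i,j} = M_{j,i}, one bit at a time. */
static uint64_t transpose8x8_naive(uint64_t M) {
    uint64_t N = 0;
    for (unsigned i = 0; i < 8; i++)
        for (unsigned j = 0; j < 8; j++)
            N |= (uint64_t)entry8x8(M, j, i) << (8 * j + i);
    return N;
}

/* Same three stages as the function above, each written as one
   masked exchange of bit p with bit p + shift. */
static uint64_t transpose8x8_swaps(uint64_t N) {
    uint64_t a;
    /* swap the anti-diagonal 4x4 blocks */
    a = ((N >> 28) ^ N) & UINT64_C(0x00000000f0f0f0f0); N ^= a ^ (a << 28);
    /* swap the anti-diagonal 2x2 blocks inside each 4x4 block */
    a = ((N >> 14) ^ N) & UINT64_C(0x0000cccc0000cccc); N ^= a ^ (a << 14);
    /* swap the anti-diagonal entries inside each 2x2 block */
    a = ((N >> 7) ^ N) & UINT64_C(0x00aa00aa00aa00aa);  N ^= a ^ (a << 7);
    return N;
}
```

  Since transposition is an involution, applying <cpp|transpose8x8_swaps>
  twice must return the input, which gives a second cheap sanity check.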

  One advantage of the above algorithm is that it admits a straightforward
  AVX vectorization that we will denote<nbsp>by

  <\cpp-code>
    avx_uint64_t

    <with|color|blue|avx_packed_matrix_bit_8x8_transpose> (const
    avx_uint64_t& <math|M>);
  </cpp-code>

  This routine transposes four <math|8\<times\>8> bit matrices
  <math|M<rsub|0>,M<rsub|1>,M<rsub|2>,M<rsub|3>> that are packed successively
  into an AVX register of type <cpp|avx_uint64_t>. We emphasize that this
  task is <em|not> the same as transposing a <math|32\<times\>8> or
  <math|8\<times\>32> bit matrix.

  <\remark>
    <label|rk:bmi2>The BMI2 technology gives another method for transposing
    <math|8\<times\>8> bit matrices:

    <\cpp-code>
      uint64_t mask = 0x0101010101010101;

      uint64_t <math|N>= 0;

      for (unsigned <math|i> = 0; <math|i> \<less\> 8; <math|i>++)

      \ \ <math|N> \|= _pext_u64 (<math|M>, mask \<less\>\<less\> <math|i>)
      \<less\>\<less\> (8 * <math|i>);
    </cpp-code>

    The loop can be unrolled while precomputing the shift amounts and masks,
    which leads to a faster sequential implementation. Unfortunately this
    approach cannot be vectorized with the AVX2 technology. Other sequential
    solutions exist, based on lookup tables or integer arithmetic, but
    their vectorization is again problematic. Practical efficiencies are
    reported in section<nbsp><reference|bench-sec>.
  </remark>

  <no-indent><subsubsection|Transposing four <math|8\<times\>8> byte matrices
  simultaneously><no-break-here>

  <no-indent>Our next task is to design an algorithm that transposes four
  packed <math|8\<times\>8> byte matrices simultaneously. More precisely, it
  performs the following operation on a packed <math|32\<times\>8> byte
  matrix:

  <\equation*>
    <matrix|<tformat|<table|<row|<cell|M<rsub|0>>>|<row|<cell|M<rsub|1>>>|<row|<cell|M<rsub|2>>>|<row|<cell|M<rsub|3>>>>>>\<longrightarrow\><matrix|<tformat|<table|<row|<cell|M<rsub|0><rsup|\<top\>>>>|<row|<cell|M<rsup|\<top\>><rsub|1>>>|<row|<cell|M<rsub|2><rsup|\<top\>>>>|<row|<cell|M<rsub|3><rsup|\<top\>>>>>>>,
  </equation*>

  where the <math|M<rsub|i>> are <math|8\<times\>8> blocks. In the sequel,
  this operation has the following prototype:

  <\cpp-code>
    void <with|color|blue|avx_packed_matrix_byte_8x8_transpose><no-break-here>

    \ \ (avx_uint64_t* dest, const avx_uint64_t* src);
  </cpp-code>

  This function works as follows. First the input <cpp|src> is loaded into
  eight AVX registers <math|r<rsub|0>,\<ldots\>,r<rsub|7>>. Each
  <math|r<rsub|i>> is seen as a vector of four <cpp|uint64_t>: for
  <math|j\<in\><around*|{|0,\<ldots\>,3|}>>,
  <math|r<rsub|0><around*|[|j|]>,\<ldots\>,r<rsub|7><around*|[|j|]>> thus
  represent the <math|8\<times\>8> byte matrix <math|M<rsub|j>>. Then we
  transpose these four matrices simultaneously in-register by means of AVX
  shift and blend operations on 32-, 16- and 8-bit entries, in the spirit of
  the aforementioned divide and conquer strategy.
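
  For a single <math|8\<times\>8> byte matrix, the three stages can be
  sketched in scalar C as follows, assuming that byte <math|b> of word
  <math|c> holds the entry <math|<around*|(|b,c|)>>. The function name
  <cpp|transpose8x8_bytes> is hypothetical, and plain masks and shifts stand
  in for the AVX shift and blend instructions of the actual routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place transposition of an 8x8 byte matrix stored as eight quad words.
   Stage s = 4, 2, 1 exchanges the anti-diagonal s x s byte blocks, i.e.
   groups of 32, 16 and 8 bits between words k and k + s. */
static void transpose8x8_bytes(uint64_t r[8]) {
    static const uint64_t mask[3] = {
        UINT64_C(0x00000000ffffffff),   /* low 32 bits of each 64 */
        UINT64_C(0x0000ffff0000ffff),   /* low 16 bits of each 32 */
        UINT64_C(0x00ff00ff00ff00ff)    /* low  8 bits of each 16 */
    };
    assert(r != NULL);
    unsigned level = 0;
    for (unsigned s = 4; s >= 1; s >>= 1, level++) {
        const unsigned shift = 8 * s;
        const uint64_t m = mask[level];
        for (unsigned k = 0; k < 8; k++) {
            if (k & s) continue;        /* k is the first word of a pair */
            const uint64_t lo = r[k], hi = r[k + s];
            /* keep the low groups of lo, take the low groups of hi */
            r[k]     = (lo & m) | ((hi & m) << shift);
            /* take the high groups of lo, keep the high groups of hi */
            r[k + s] = ((lo >> shift) & m) | (hi & ~m);
        }
    }
}
```

  Running the same stages on all four 64-bit lanes of eight AVX registers
  gives the simultaneous transposition of four matrices described above.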

  <no-indent><subsubsection|Transposing <math|256\<times\>64> bit
  matrices><no-break-here>

  <no-indent>Having the above subroutines at our disposal, we can now present
  our algorithm to transpose a packed <math|256\<times\>64> bit matrix. The
  input bit matrix of type <math|<cpp|avx_uint64_t*>> is written
  <math|<around*|(|M<rsub|i,j>|)><rsub|0\<leqslant\>i\<less\>256,<space|1spc>0\<leqslant\>j\<less\>64>>.
  The transposed output matrix is written
  <math|<around*|(|N<rsub|i,j>|)><rsub|0\<leqslant\>i\<less\>64,<space|1spc>0\<leqslant\>j\<less\>256>>
  and has type <cpp|uint64_t*>. We first compute the auxiliary byte matrix
  <math|T> as follows:

  <\cpp-code>
    <\cpp>
      static avx_uint64_t <math|T>[64];<no-break-here>

      for (int i= 0; i \<less\> 8; i++) {<no-break-here>

      \ \ avx_packed_matrix_byte_8x8_transpose (<math|T> + 8*i, <math|M> +
      8*i);<no-break-here>

      \ \ for (int k= 0; k \<less\> 8; k++)<no-break-here>

      \ \ \ \ T[8*i+k]= avx_packed_matrix_bit_8x8_transpose(T[8*i+k]); }
    </cpp>
  </cpp-code>

  If we write <math|M<rsub|i,k:l>> for the byte representing the packed bit
  vector <math|<around*|(|M<rsub|i,k>,\<ldots\>,M<rsub|i,l>|)>>, then
  <math|T> contains the following <math|32\<times\>64> byte matrix:

  <\flat-size>
    <\equation*>
      <matrix|<tformat|<cwith|3|3|1|-1|cell-tborder|0ln>|<cwith|2|2|1|-1|cell-bborder|0ln>|<cwith|3|3|1|-1|cell-bborder|1ln>|<cwith|4|4|1|-1|cell-tborder|1ln>|<cwith|3|3|1|1|cell-lborder|0ln>|<cwith|3|3|10|10|cell-rborder|0ln>|<cwith|6|6|1|-1|cell-tborder|0ln>|<cwith|5|5|1|-1|cell-bborder|0ln>|<cwith|6|6|1|-1|cell-bborder|1ln>|<cwith|7|7|1|-1|cell-tborder|1ln>|<cwith|6|6|1|1|cell-lborder|0ln>|<cwith|6|6|10|10|cell-rborder|0ln>|<cwith|7|7|1|10|cell-tborder|1ln>|<cwith|9|9|1|10|cell-tborder|0ln>|<cwith|8|8|1|10|cell-bborder|0ln>|<cwith|9|9|1|10|cell-bborder|1ln>|<cwith|9|9|1|1|cell-lborder|0ln>|<cwith|9|9|10|10|cell-rborder|0ln>|<cwith|10|10|1|10|cell-tborder|1ln>|<cwith|12|12|1|-1|cell-tborder|0ln>|<cwith|11|11|1|-1|cell-bborder|0ln>|<cwith|12|12|1|-1|cell-bborder|0ln>|<cwith|12|12|1|1|cell-lborder|0ln>|<cwith|12|12|10|10|cell-rborder|0ln>|<cwith|1|1|3|3|cell-tborder|0ln>|<cwith|12|12|3|3|cell-bborder|0ln>|<cwith|1|-1|3|3|cell-lborder|0ln>|<cwith|1|-1|2|2|cell-rborder|0ln>|<cwith|1|-1|3|3|cell-rborder|1ln>|<cwith|1|-1|4|4|cell-lborder|1ln>|<cwith|1|1|6|6|cell-tborder|0ln>|<cwith|12|12|6|6|cell-bborder|0ln>|<cwith|1|-1|6|6|cell-lborder|0ln>|<cwith|1|-1|5|5|cell-rborder|0ln>|<cwith|1|-1|6|6|cell-rborder|1ln>|<cwith|1|1|8|8|cell-tborder|0ln>|<cwith|12|12|8|8|cell-bborder|0ln>|<cwith|1|-1|8|8|cell-lborder|1ln>|<cwith|1|-1|8|8|cell-rborder|0ln>|<cwith|1|-1|9|9|cell-lborder|0ln>|<table|<row|<cell|M<rsub|0,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|56,0:7>>|<cell|M<rsub|0,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|56,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|0,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|56,56:63>>>|<row|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>>|<row|<cell|M<rsub|7,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|63,0:7>>|<cell|M<rsub|7,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|63,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|7,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|63,56:63>>>|<row|<cell|M<rsub|64,0:7>>|<cell|\<ldots
\>>|<cell|M<rsub|120,0:7>>|<cell|M<rsub|64,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|120,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|64,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|120,56:63>>>|<row|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>>|<row|<cell|M<rsub|71,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|127,0:7>>|<cell|M<rsub|71,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|127,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|71,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|127,56:63>>>|<row|<cell|M<rsub|128,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|184,0:7>>|<cell|M<rsub|128,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|184,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|128,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|184,56:63>>>|<row|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>>|<row|<cell|M<rsub|135,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|191,0:7>>|<cell|M<rsub|135,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|191,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|135,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|191,56:63>>>|<row|<cell|M<rsub|192,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|248,0:7>>|<cell|M<rsub|192,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|248,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|192,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|248,56:63>>>|<row|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>>|<row|<cell|M<rsub|199,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|255,0:7>>|<cell|M<rsub|199,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|255,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|199,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|255,56:63>>>>>>.
    </equation*>
  </flat-size>

  First, for all <math|0\<leqslant\>i\<leqslant\>7>, we load column
  <math|8*i> of <math|T> into the AVX register <math|r<rsub|i>>. We interpret
  these registers as forming a <math|32\<times\>8> byte matrix that we
  transpose in-register. This transposition is again performed in the spirit
  of the aforementioned divide and conquer strategy and makes use of various
  specific AVX2 instructions. We obtain

  <\flat-size>
    <\equation*>
      <matrix|<tformat|<cwith|1|1|4|4|cell-tborder|0ln>|<cwith|4|4|4|4|cell-bborder|0ln>|<cwith|1|-1|4|4|cell-lborder|0ln>|<cwith|1|-1|3|3|cell-rborder|0ln>|<cwith|1|-1|4|4|cell-rborder|1ln>|<cwith|1|1|8|8|cell-tborder|0ln>|<cwith|4|4|8|8|cell-bborder|0ln>|<cwith|1|4|8|8|cell-lborder|0ln>|<cwith|1|4|7|7|cell-rborder|0ln>|<cwith|1|4|8|8|cell-rborder|1ln>|<cwith|1|1|5|5|cell-tborder|0ln>|<cwith|4|4|5|5|cell-bborder|0ln>|<cwith|1|-1|5|5|cell-lborder|1ln>|<cwith|1|-1|5|5|cell-rborder|0ln>|<cwith|1|-1|6|6|cell-lborder|0ln>|<table|<row|<cell|M<rsub|0,0:7>>|<cell|M<rsub|1,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|7,0:7>>|<cell|M<rsub|64,0:7>>|<cell|M<rsub|65,0:7>>|<cell|\<ldots\>>|<cell|M<rsub|71,0:7>>|<cell|\<ldots\>>>|<row|<cell|M<rsub|0,8:15>>|<cell|M<rsub|1,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|7,8:15>>|<cell|M<rsub|64,8:15>>|<cell|M<rsub|65,8:15>>|<cell|\<ldots\>>|<cell|M<rsub|71,8:15>>|<cell|\<ldots\>>>|<row|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|\<vdots\>>|<cell|>|<cell|\<vdots\>>|<cell|>>|<row|<cell|M<rsub|0,56:63>>|<cell|M<rsub|1,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|7,56:63>>|<cell|M<rsub|64,56:63>>|<cell|M<rsub|65,56:63>>|<cell|\<ldots\>>|<cell|M<rsub|71,56:63>>|<cell|\<ldots\>>>>>>.
    </equation*>
  </flat-size>

  More precisely, for <math|i=0,\<ldots\>,7>, the group of four consecutive
  columns from <math|4*i> until <math|4*i+3> is in the register
  <math|r<rsub|i>>. We save the registers
  <math|r<rsub|0>,\<ldots\>,r<rsub|7>> at the addresses
  <math|N,N+4,N+64,N+68,N+128,N+132,N+192> and <math|N+196>.

  For each <math|k=1,\<ldots\>,7>, we build a similar <math|32\<times\>8>
  byte matrix from the columns <math|k,8+k,\<ldots\>,56+k> of <math|T>, and
  transpose this matrix using the same algorithm. This time the result is
  saved at the addresses <math|N<rprime|'>,N<rprime|'>+4,N<rprime|'>+64,N<rprime|'>+68,N<rprime|'>+128,N<rprime|'>+132,N<rprime|'>+192>
  and <math|N<rprime|'>+196>, where <math|N<rprime|'>=N+8*k>. This yields an
  efficient routine for transposing <math|M> into <math|N>, whose prototype
  is given by

  <\cpp-code>
    void <with|color|blue|packed_matrix_bit_256x64_transpose><no-break-here>

    \ \ <cpp|(uint64_t* <math|N>, const avx_uint64_t* <math|M>);>
  </cpp-code>
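
  When testing such a routine, a bit-by-bit reference implementation is
  convenient. The C sketch below is one possible reference, under the
  assumption that each 256-bit column of <math|M> is stored as four
  consecutive quad words with bits 0 to 63 in the first one; the name
  <cpp|transpose_256x64_naive> and this flat layout are choices made for the
  illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference transposition of a packed 256x64 bit matrix.
   M: 64 columns of 256 bits, column j stored in M[4*j .. 4*j + 3].
   N: 256 quad words; bit j of N[i] receives bit i of column j of M. */
static void transpose_256x64_naive(uint64_t *N, const uint64_t *M) {
    assert(N != NULL && M != NULL);
    memset(N, 0, 256 * sizeof(uint64_t));
    for (unsigned j = 0; j < 64; j++)
        for (unsigned i = 0; i < 256; i++) {
            const uint64_t bit = (M[4 * j + (i >> 6)] >> (i & 63)) & 1;
            N[i] |= bit << j;
        }
}
```

  This version performs one memory update per bit, so it is only meant as a
  correctness oracle and not as a competitor to the vectorized routine.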

  <subsection|Frobenius encoding>

  If the input polynomial <math|A> has degree less than
  <math|\<ell\>\<leqslant\>60*m> and is in packed representation, then it can
  also be seen as an <math|m\<times\>60> matrix in packed representation
  (except that padding with zeros may be necessary to adjust the size).

  In this setting, the polynomials <math|P<rsub|i>> of
  Algorithm<nbsp><reference|al:encode> are simply read as the rows of the
  matrix. Therefore, to compute the Frobenius encoding
  <math|F<rsub|\<omega\>><around*|(|A|)>>, we only need to transpose this
  matrix, then add 4 rows of zeros for alignment (because we store one
  element of <math|\<bbb-F\><rsub|2<rsup|60>>> per quad word) and multiply by
  twiddle factors. This leads to the following implementation:

  <\specified-cpp-function>
    <label|fun:encode><strong|Input:> <math|A<around*|(|x|)>=<big|sum><rsub|0\<leqslant\>i\<less\>\<ell\>>a<rsub|i>*x<rsup|i>\<in\>\<bbb-F\><rsub|2><around*|[|x|]>>.

    <strong|Output:> <math|F<rsub|\<omega\>><around*|(|A|)>>, stored in the
    <math|m> allocated quad words starting at pointer <math|d>.

    <strong|Assumptions:> <math|n=61*m> divides <math|2<rsup|60>-1> and
    <math|\<ell\>\<leqslant\>60*m>.
  <|specified-cpp-function>
    <\cpp>
      void <with|color|blue|encode> (uint64_t* <math|d>, const uint64_t&
      <math|m>,

      \ \ \ \ \ \ \ \ \ \ \ \ \ const uint64_t* <math|A>, const uint64_t&
      <math|\<ell\>>) {
    </cpp>

    <\with|item-vsep|0fn>
      <\enumerate>
        <item><cpp|uint64_t <math|c> = 1, <math|i> = 0, <math|e> = 0;>

        <item><cpp|avx_uint64_t <math|v>[64]; uint64_t <math|w>[256];>

        <item><cpp|while (<math|i> \<less\> <math|m>) {>

        <item><cpp| \ <math|e> = min (<math|m> - <math|i>, 256);>

        <item><cpp| \ for (int <math|j> = 0; <math|j> \<less\> 64;
        <math|j>++)>

        <cpp| \ \ \ load (<math|v>[<math|j>], <math|A>, <math|\<ell\>>,
        <math|i> + <math|m> * <math|j>, <math|e>);>

        <item><cpp| \ packed_matrix_bit_256x64_transpose> <cpp|(<math|w>,
        <math|v>);>

        <item><cpp| \ for (int <math|j> = 0; <math|j> \<less\> e; <math|j>++)
        {>

        <cpp| \ \ \ <math|d>[<math|i> + <math|j>] = f2_60_mul
        (<math|w>[<math|j>], <math|c>);>

        <\cpp>
          \ \ \ \ <math|c> = f2_60_mul (<math|c>, <math|\<omega\>>); }
        </cpp>

        <item><cpp| \ <math|i> += <math|e>; }>
      </enumerate>
    </with>
  </specified-cpp-function>
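
  Leaving aside vectorization and the twiddle factors, the loads and the
  transposition compute, for each <math|k\<less\>m>, the quad word whose
  <math|j>-th bit is <math|a<rsub|k+m*j>>. The scalar C sketch below spells
  this out; <cpp|gather_slices> is a hypothetical name, and multiplying
  <cpp|w[k]> by <math|\<omega\><rsup|k>> with <cpp|f2_60_mul> would then give
  the encoding.

```c
#include <assert.h>
#include <stdint.h>

/* For each k < m, gathers the coefficients a_k, a_{k+m}, ..., a_{k+59m}
   of the packed polynomial A into bits 0..59 of w[k]. Coefficients of
   degree >= ell are treated as zero; bits 60..63 of w[k] stay zero. */
static void gather_slices(uint64_t *w, uint64_t m,
                          const uint64_t *A, uint64_t ell) {
    assert(m > 0 && ell <= 60 * m);
    for (uint64_t k = 0; k < m; k++) {
        uint64_t x = 0;
        for (uint64_t j = 0; j < 60; j++) {
            const uint64_t p = k + m * j;   /* degree of the coefficient */
            if (p >= ell) break;            /* all later ones are zero too */
            x |= ((A[p >> 6] >> (p & 63)) & 1) << j;
        }
        w[k] = x;
    }
}
```

  For instance, with <math|m=2> and <math|A=1+x<rsup|2>+x<rsup|3>+x<rsup|4>>,
  the sketch produces the packed values 7 and 2.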

  <\remark*>
    To optimize read accesses, it is better to run the loop of step <math|5>
    for <math|j\<less\><around*|\<lceil\>|\<ell\>/m|\<rceil\>>> only and to
    initialize the remaining <cpp|<math|v>[<math|j>]> to zero. Indeed, for a
    product of degree <math|d>, we typically multiply two polynomials of
    degree <math|\<simeq\>d/2>, which means <math|\<ell\>\<less\>30*m> when
    computing the direct transform.
  </remark*>

  The Frobenius decoding consists in inverting the encoding. The
  implementation issues are the same, so we refer to our source code for
  further details.

  <section|Timings><label|bench-sec>

  The platform considered in this paper is equipped with an <name|Intel>(R)
  <name|Core>(TM) i7-6700 <abbr|CPU> at <math|3.40><nbsp>GHz and 32<nbsp>GB
  of <math|2133><nbsp>MHz DDR4 memory. This CPU features AVX2, BMI2 and CLMUL
  technologies (family number<nbsp><math|6> and model number<nbsp>94). The
  platform runs the <name|Debian GNU/Linux> operating system (Stretch
  release) with a 64<nbsp>bit <name|Linux> kernel, version<nbsp>4.3. We
  compile with <name|GCC><nbsp><cite|gcc> version<nbsp>5.4.

  We use version<nbsp>1.2 of the <name|gf2x> library
  (<verbatim|<hlink|https://gforge.inria.fr/projects/gf2x/|https://gforge.inria.fr/projects/gf2x/>>,
  released in July 2017), which makes use of the CLMUL features of the
  platform. We tuned it to our platform during the installation process for
  up to <math|32 \ 000 000> input quad words. We also compare to the
  implementation of the additive Fourier transform by Chen et
  al.<nbsp><cite|chen2017faster>, using the Git version of
  September<nbsp>1, 2017.

  <no-indent><subsubsection*|Frobenius encoding><no-break-here>

  <no-indent>Concerning the cost of the Frobenius encoding and decoding,
  Function<nbsp><reference|fun:8x8_transpose> takes about <math|20> CPU
  cycles when compiled with the <shell|-O3> option only. With the additional
  options <shell|-mtune=native -mavx2 -mbmi2>, the BMI2 version of
  Remark<nbsp><reference|rk:bmi2> takes about 16 CPU cycles. The vectorized
  version of Function<nbsp><reference|fun:8x8_transpose> transposes four
  packed <math|8\<times\>8> bit matrices simultaneously in about 20 cycles,
  which makes an average of <math|5> cycles per matrix.

  It is interesting to examine the performance of the transpositions alone,
  as performed during the Frobenius encoding and decoding (that is,
  discarding products by twiddle factors in
  <math|\<bbb-F\><rsub|2<rsup|60>>>). From sizes of a few kilobytes onwards,
  the average cost per quad word is about 8 cycles with the AVX2 technology,
  and about 23 cycles without. Unfortunately, the vectorization speed-up is
  not as close to 4 as we would have liked.

  Since the encoding and decoding costs are linear, their relative
  contribution to the total computation time of polynomial products decreases
  for large sizes. For two input polynomials in
  <math|\<bbb-F\><rsub|2><around*|[|x|]>> of <math|2<rsup|16>> quad words,
  the contribution is about <math|15>%; for <math|2<rsup|22>> quad words, it
  is about <math|10>%.

  <no-indent><subsubsection*|Polynomial product><\float|float|tf>
    <\big-figure>
      <with|gr-mode|<tuple|group-edit|edit-props>|gr-frame|<tuple|scale|1pt|<tuple|0.10998gw|0.109972gh>>|gr-geometry|<tuple|geometry|287.82pt|152.32pt|center>|gr-grid|<tuple|cartesian|<point|0|0>|15>|gr-grid-old|<tuple|cartesian|<point|0|0>|15>|gr-edit-grid-aspect|<tuple|<tuple|axes|none>|<tuple|1|none>|<tuple|5|none>>|gr-edit-grid|<tuple|cartesian|<point|0|0>|15>|gr-edit-grid-old|<tuple|cartesian|<point|0|0>|15>|gr-grid-aspect|<tuple|<tuple|axes|light
      grey>|<tuple|1|light grey>|<tuple|5|pastel
      grey>>|gr-grid-aspect-props|<tuple|<tuple|axes|light
      grey>|<tuple|1|light grey>|<tuple|5|pastel
      grey>>|<graphics|<with|color|blue|line-width|1ln|<line|<point|1.500|0.2250>|<point|3.000|0.6300>|<point|4.500|0.9600>|<point|6.000|1.575>|<point|7.500|2.055>|<point|9.000|2.070>|<point|10.50|2.880>|<point|12.00|2.910>|<point|13.50|3.525>|<point|15.00|3.765>|<point|16.50|4.575>|<point|18.00|4.935>|<point|19.50|4.905>|<point|21.00|4.920>|<point|22.50|7.185>|<point|24.00|7.185>|<point|25.50|7.185>|<point|27.00|8.325>|<point|28.50|8.355>|<point|30.00|8.340>|<point|31.50|9.450>|<point|33.00|10.23>|<point|34.50|10.26>|<point|36.00|10.31>|<point|37.50|12.29>|<point|39.00|12.30>|<point|40.50|12.30>|<point|42.00|12.32>|<point|43.50|12.33>|<point|45.00|12.33>|<point|46.50|12.33>|<point|48.00|12.33>|<point|49.50|14.56>|<point|51.00|14.58>|<point|52.50|15.89>|<point|54.00|16.02>|<point|55.50|15.99>|<point|57.00|15.94>|<point|58.50|15.90>|<point|60.00|16.03>|<point|61.50|15.99>|<point|63.00|15.96>|<point|64.50|16.09>|<point|66.00|24.15>|<point|67.50|24.08>|<point|69.00|24.21>|<point|70.50|24.14>|<point|72.00|24.27>|<point|73.50|24.19>|<point|75.00|24.33>|<point|76.50|24.24>|<point|78.00|24.17>|<point|79.50|24.30>|<point|81.00|29.01>|<point|82.50|29.13>|<point|84.00|29.01>|<point|85.50|29.15>|<point|87.00|29.04>|<point|88.50|29.18>|<point|90.00|29.04>|<point|91.50|29.22>|<point|93.00|29.09>|<point|94.50|29.22>|<point|96.00|40.19>|<point|97.50|40.32>|<point|99.00|40.50>|<point|100.5|40.67>|<point|102.0|40.77>|<point|103.5|40.56>|<point|105.0|40.76>|<point|106.5|40.85>|<point|108.0|40.62>|<point|109.5|40.74>|<point|111.0|40.88>|<point|112.5|40.67>|<point|114.0|40.80>|<point|115.5|40.92>|<point|117.0|40.71>|<point|118.5|40.86>|<point|120.0|40.95>|<point|121.5|40.72>|<point|123.0|40.86>|<point|124.5|41.01>|<point|126.0|40.78>|<point|127.5|40.92>|<point|129.0|41.06>|<point|130.5|40.86>|<point|132.0|40.97>|<point|133.5|41.08>|<point|135.0|45.47>|<point|136.5|45.60>|<point|138.0|45.69>|<point|139.5|45.47>|<point|141.0|45.60>|<point|142.5|45.69>|<point|144.0|45.86>|<point|145.5|45.
54>|<point|147.0|45.70>|<point|148.5|45.87>>>|<with|color|red|line-width|1ln|dash-style|1111010|<line|<point|1.500|0.5700>|<point|3.000|1.350>|<point|4.500|2.070>|<point|6.000|2.805>|<point|7.500|3.645>|<point|9.000|4.470>|<point|10.50|5.760>|<point|12.00|6.030>|<point|13.50|7.140>|<point|15.00|7.140>|<point|16.50|7.710>|<point|18.00|9.615>|<point|19.50|9.765>|<point|21.00|11.74>|<point|22.50|11.66>|<point|24.00|12.84>|<point|25.50|12.84>|<point|27.00|14.53>|<point|28.50|14.67>|<point|30.00|14.68>|<point|31.50|18.36>|<point|33.00|18.30>|<point|34.50|19.88>|<point|36.00|19.98>|<point|37.50|22.64>|<point|39.00|22.49>|<point|40.50|25.25>|<point|42.00|24.91>|<point|43.50|25.01>|<point|45.00|25.05>|<point|46.50|27.87>|<point|48.00|29.85>|<point|49.50|29.85>|<point|51.00|30.36>|<point|52.50|30.55>|<point|54.00|31.73>|<point|55.50|32.16>|<point|57.00|31.71>|<point|58.50|31.76>|<point|60.00|31.79>|<point|61.50|45.83>|<point|63.00|45.81>|<point|64.50|45.83>|<point|66.00|45.81>|<point|67.50|45.81>|<point|69.00|45.83>|<point|70.50|45.83>|<point|72.00|45.84>|<point|73.50|45.84>|<point|75.00|45.83>|<point|76.50|51.90>|<point|78.00|51.87>|<point|79.50|51.90>|<point|81.00|52.01>|<point|82.50|52.02>|<point|84.00|52.01>|<point|85.50|52.07>|<point|87.00|52.14>|<point|88.50|52.31>|<point|90.00|52.23>|<point|91.50|71.33>|<point|93.00|71.22>|<point|94.50|71.29>|<point|96.00|71.29>|<point|97.50|71.33>|<point|99.00|71.34>|<point|100.5|71.33>|<point|102.0|71.34>|<point|103.5|71.29>|<point|105.0|71.34>|<point|106.5|71.61>|<point|108.0|71.81>|<point|109.5|71.82>|<point|111.0|71.85>|<point|112.5|71.90>|<point|114.0|71.93>|<point|115.5|71.99>|<point|117.0|72.01>|<point|118.5|72.01>|<point|120.0|72.14>|<point|121.5|80.03>|<point|123.0|80.24>|<point|124.5|80.19>|<point|126.0|80.06>|<point|127.5|80.06>|<point|129.0|80.04>|<point|130.5|80.00>|<point|132.0|80.06>|<point|133.5|80.29>|<point|135.0|80.10>|<point|136.5|95.03>|<point|138.0|95.08>|<point|139.5|94.96>|<point|141.0|95.10>|<point|142.5|95.1
6>|<point|144.0|95.10>|<point|145.5|95.20>|<point|147.0|95.25>|<point|148.5|94.95>>>|<with|color|dark
      green|line-width|1ln|<line|<point|1.500|0.5850>|<point|3.000|1.455>|<point|4.500|1.920>|<point|6.000|2.655>|<point|7.500|3.480>|<point|9.000|4.530>|<point|10.50|4.545>|<point|12.00|6.735>|<point|13.50|7.950>|<point|15.00|7.980>|<point|16.50|9.720>|<point|18.00|9.735>|<point|19.50|11.44>|<point|21.00|11.46>|<point|22.50|11.47>|<point|24.00|11.50>|<point|25.50|13.73>|<point|27.00|14.94>|<point|28.50|14.95>|<point|30.00|14.97>|<point|31.50|15.00>|<point|33.00|22.93>|<point|34.50|22.97>|<point|36.00|23.00>|<point|37.50|23.01>|<point|39.00|23.04>|<point|40.50|27.77>|<point|42.00|27.78>|<point|43.50|27.80>|<point|45.00|27.83>|<point|46.50|27.84>|<point|48.00|38.34>|<point|49.50|38.37>|<point|51.00|38.40>|<point|52.50|38.42>|<point|54.00|38.43>|<point|55.50|38.43>|<point|57.00|38.46>|<point|58.50|38.49>|<point|60.00|38.52>|<point|61.50|38.55>|<point|63.00|38.55>|<point|64.50|38.58>|<point|66.00|38.58>|<point|67.50|42.54>|<point|69.00|42.56>|<point|70.50|42.57>|<point|72.00|42.61>|<point|73.50|42.58>|<point|75.00|42.63>|<point|76.50|42.65>|<point|78.00|51.62>|<point|79.50|51.40>|<point|81.00|51.52>|<point|82.50|51.42>|<point|84.00|51.44>|<point|85.50|51.49>|<point|87.00|51.49>|<point|88.50|51.52>|<point|90.00|51.51>|<point|91.50|66.41>|<point|93.00|66.61>|<point|94.50|66.49>|<point|96.00|66.48>|<point|97.50|66.53>|<point|99.00|66.48>|<point|100.5|66.58>|<point|102.0|66.53>|<point|103.5|66.51>|<point|105.0|66.65>|<point|106.5|66.57>|<point|108.0|66.63>|<point|109.5|66.66>|<point|111.0|66.68>|<point|112.5|87.86>|<point|114.0|87.87>|<point|115.5|87.86>|<point|117.0|88.02>|<point|118.5|88.02>|<point|120.0|88.13>|<point|121.5|87.96>|<point|123.0|88.31>|<point|124.5|88.39>|<point|126.0|88.00>|<point|127.5|88.11>|<point|129.0|88.15>|<point|130.5|88.13>|<point|132.0|88.18>|<point|133.5|88.23>|<point|135.0|88.22>|<point|136.5|88.25>|<point|138.0|88.41>|<point|139.5|88.32>|<point|141.0|88.43>|<point|142.5|88.29>|<point|144.0|124.7>|<point|145.5|124.3>|<point|147.0|124.1>|<point
|148.5|124.4>>>|<with|color|dark
      magenta|line-width|1ln|dash-style|11100|dash-style-unit|2ln|<line|<point|1.500|0.5100>|<point|3.000|1.155>|<point|4.500|2.595>|<point|6.000|2.790>|<point|7.500|2.790>|<point|9.000|5.610>|<point|10.50|5.730>|<point|12.00|5.865>|<point|13.50|5.940>|<point|15.00|6.075>|<point|16.50|12.23>|<point|18.00|12.09>|<point|19.50|12.19>|<point|21.00|12.35>|<point|22.50|12.38>|<point|24.00|12.51>|<point|25.50|12.60>|<point|27.00|12.69>|<point|28.50|12.84>|<point|30.00|12.87>|<point|31.50|25.75>|<point|33.00|25.77>|<point|34.50|25.71>|<point|36.00|25.62>|<point|37.50|26.31>|<point|39.00|26.33>|<point|40.50|26.28>|<point|42.00|26.51>|<point|43.50|26.58>|<point|45.00|26.77>|<point|46.50|26.88>|<point|48.00|27.09>|<point|49.50|27.05>|<point|51.00|27.15>|<point|52.50|27.01>|<point|54.00|26.97>|<point|55.50|26.67>|<point|57.00|26.68>|<point|58.50|26.81>|<point|60.00|26.92>|<point|61.50|27.03>|<point|63.00|53.28>|<point|64.50|52.62>|<point|66.00|52.62>|<point|67.50|52.74>|<point|69.00|52.92>|<point|70.50|52.92>|<point|72.00|53.11>|<point|73.50|53.15>|<point|75.00|53.33>|<point|76.50|53.43>|<point|78.00|53.48>|<point|79.50|53.78>|<point|81.00|53.73>|<point|82.50|53.78>|<point|84.00|53.90>|<point|85.50|54.00>|<point|87.00|54.14>|<point|88.50|54.24>|<point|90.00|54.35>|<point|91.50|54.42>|<point|93.00|54.57>|<point|94.50|54.66>|<point|96.00|54.75>|<point|97.50|54.87>|<point|99.00|54.97>|<point|100.5|55.08>|<point|102.0|55.15>|<point|103.5|55.29>|<point|105.0|55.37>|<point|106.5|55.47>|<point|108.0|55.78>|<point|109.5|56.02>|<point|111.0|55.80>|<point|112.5|55.89>|<point|114.0|56.13>|<point|115.5|56.10>|<point|117.0|56.20>|<point|118.5|56.35>|<point|120.0|56.42>|<point|121.5|56.53>|<point|123.0|56.64>|<point|124.5|57.19>|<point|126.0|110.0>|<point|127.5|110.1>|<point|129.0|110.4>|<point|130.5|110.3>|<point|132.0|110.5>|<point|133.5|110.5>|<point|135.0|110.7>|<point|136.5|110.8>|<point|138.0|110.8>|<point|139.5|111.0>|<point|141.0|111.1>|<point|142.5|111.2>|<point|144.0|111.3>|<point|
145.5|111.4>|<point|147.0|112.3>|<point|148.5|111.6>>>|<with|color|black|line-width|1ln|<cline|<point|0|0>|<point|0|120.0>>>|<with|color|black|line-width|1ln|<cline|<point|0|0>|<point|150.0|0>>>|<with|color|black|line-width|1ln|<cline|<point|0|0>|<point|0|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|15.00|0>|<point|15.00|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|30.00|0>|<point|30.00|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|45.00|0>|<point|45.00|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|60.00|0>|<point|60.00|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|75.00|0>|<point|75.00|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|90.00|0>|<point|90.00|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|105.0|0>|<point|105.0|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|120.0|0>|<point|120.0|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|135.0|0>|<point|135.0|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|150.0|0>|<point|150.0|-2.000>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|0>|<point|0|0>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|15.00>|<point|0|15.00>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|30.00>|<point|0|30.00>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|45.00>|<point|0|45.00>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|60.00>|<point|0|60.00>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|75.00>|<point|0|75.00>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|90.00>|<point|0|90.00>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|105.0>|<point|0|105.0>>>|<with|color|black|line-width|1ln|<cline|<point|-2.000|120.0>|<point|0|120.0>>>|<text-at|<small|<math|10\<times\>10<rsup|6>>>|<point|144.0|-9.0>>|<with|text-at-halign|center|<text-at|<small|<math|0>>|<point|0.0|-9.0>>>|<with|text-at-valign|center|<text-at|<small|<math|8
      000>>|<point|-20.6545882352941|120.0>>>|<with|text-at-valign|center|<text-at|<small|<math|0>>|<point|-9.0|0.0>>>|<text-at|<small|<with|color|dark
      green|Old implementation of<nbsp><cite|vdH:f2kmul>>>|<point|150.0|123.0>>|<text-at|<small|<with|color|red|<name|gf2x>
      version 1.2>>|<point|150.0|93.0>>|<text-at|<small|<with|color|blue|New
      implementation>>|<point|150.0|45.0>>|<with|text-at-valign|center|<text-at|<small|timings
      in ms>|<point|6|120>>>|<text-at|<small|size in quad
      words>|<point|144|9>>|<with|text-at-valign|center|<text-at|<with|color|dark
      magenta|<small|Chen et al.<nbsp><cite|chen2017faster>>>|<point|150.0|111.0>>>|<text-at||<point|240|99>>>>

      \;
    <|big-figure>
      <label|fig:timings>Products in <math|\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>\<ell\>>>,
      input size <math|<around*|\<lceil\>|\<ell\>/64|\<rceil\>>> quad words,
      timings in milliseconds.

      \;
    </big-figure>
  </float><no-break-here>

  <no-indent>In Figure<nbsp><reference|fig:timings> we report timings in
  milliseconds for multiplying two polynomials in
  <math|\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>\<ell\>>>, each of
  input size <math|<around*|\<lceil\>|\<ell\>/64|\<rceil\>>> quad words
  (indicated on the abscissa). The timings are obtained from
  <shell|justinline/bench/polynomial_f2_bench.mmx>. Notice that our
  implementation in<nbsp><cite|vdH:f2kmul> was faster than version<nbsp>1.1
  of <name|gf2x>, but is now comparable in speed to version<nbsp>1.2. The
  additive FFT strategy of<nbsp><cite|chen2017faster> achieves a noticeable
  speed-up in favorable cases, but because of its staircase effect its
  runtime is on average roughly similar to that of <name|gf2x>. With
  respect to our old implementation, the new one finally achieves a speed-up
  that is not far from the factor<nbsp><math|2> predicted by the asymptotic
  complexity analysis. Let us mention that our new implementation becomes
  faster than <name|gf2x> when <math|<around*|\<lceil\>|\<ell\>/64|\<rceil\>>>
  is larger than <math|2048>.
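
  The staircase effect mentioned above can be illustrated by a minimal C++
  sketch. It assumes, purely for illustration and not as a description of the
  code of the cited work, that a transform length is rounded up to the next
  power of two; the name padded_length is hypothetical. Under this assumption
  the cost stays flat between two consecutive powers of two and roughly
  doubles each time one is crossed.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative assumption: the transform length is the smallest power of two
// that is at least n.  Valid for 1 <= n <= 2^63.
std::uint64_t padded_length(std::uint64_t n) {
  assert(n >= 1 && n <= (std::uint64_t{1} << 63));
  std::uint64_t p = 1;
  while (p < n) p <<= 1;  // double until the input fits
  return p;
}
```

  For instance, inputs of 1024 and 1025 coefficients are padded to 1024 and
  2048 respectively, which is the kind of jump that a truncated transform is
  meant to avoid.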

  <no-indent><subsubsection*|Polynomial matrix product><no-break-here>

  <no-indent>As in<nbsp><cite|vdH:f2kmul>, one major advantage of DFTs over
  the Babylonian field <math|\<bbb-F\><rsub|2<rsup|60>>> is the compactness
  of the evaluated FFT-representation of polynomials. This makes linear
  algebra over <math|\<bbb-F\><rsub|2><around*|[|x|]>> particularly
  efficient: instead of multiplying <math|r\<times\>r> matrices over
  <math|\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>\<ell\>>> naively by
  means of <math|r<rsup|3>> polynomial products of
  degree<nbsp><rigid|<math|\<less\>\<ell\>>>, we use the standard
  evaluation-interpolation approach. In our context, this comes down to: (a)
  computing the <math|2*r<rsup|2>> Frobenius encodings of all entries of the
  two matrices to be multiplied, (b) computing the <math|2*r<rsup|2>> direct
  DFTs of these encodings, (c) performing the
  <math|\<approx\>2*\<ell\>/60> products of <math|r\<times\>r> matrices over
  <math|\<bbb-F\><rsub|2<rsup|60>>>, and (d) computing the
  <math|r<rsup|2>> inverse DFTs and Frobenius decodings of the so-computed
  matrix products.
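
  The operation counts of steps (a) to (d) can be summarized by the following
  C++ sketch. It is only a counting model under stated assumptions: the type
  MatrixProductCost and both function names are hypothetical and not part of
  our software, and the number of pointwise products, given above only
  approximately, is rounded up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost record: numbers of transforms and pointwise products for
// one product of r x r matrices with entries of degree less than l.
struct MatrixProductCost {
  std::uint64_t forward_transforms;  // entries going through steps (a) and (b)
  std::uint64_t pointwise_products;  // step (c): r x r products over F_{2^60}
  std::uint64_t inverse_transforms;  // entries going through step (d)
};

// Counts for the evaluation-interpolation strategy.  The text states
// "approximately 2*l/60" pointwise products; rounding up is our assumption.
MatrixProductCost evaluation_interpolation_cost(std::uint64_t r,
                                                std::uint64_t l) {
  assert(r >= 1 && l >= 1);
  return MatrixProductCost{2 * r * r, (2 * l + 59) / 60, r * r};
}

// The naive strategy performs r^3 full polynomial products.
std::uint64_t naive_polynomial_products(std::uint64_t r) {
  return r * r * r;
}
```

  For r = 32, the model counts 2048 forward and 1024 inverse transforms,
  against 32768 full polynomial products for the naive strategy: the number
  of transforms grows like the square of r instead of its cube.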

  Timings for matrices over <math|\<bbb-F\><rsub|2><around*|[|x|]>> are
  obtained from <shell|justinline/bench/matrix_polynomial_f2_bench.mmx> and
  are reported in Table<nbsp><reference|tab:mat_pol_f_2>. The row \Pthis
  paper\Q confirms the practical gain of this fast approach within our
  implementation. For comparison, the row \P<name|gf2x>\Q shows the cost of
  computing the product naively, by doing <math|r<rsup|3>> polynomial
  multiplications using <name|gf2x>. More efficient
  evaluation-interpolation based approaches<nbsp><cite-detail|vdH:fnewton|Section<nbsp>2>
  for matrix multiplication can in principle be combined with Schönhage's
  triadic polynomial multiplication<nbsp><cite|Sch77> as implemented in
  <name|gf2x>. However, this would require an additional
  implementation effort and would also lead to an extra constant overhead with
  respect to our approach.<\float|float|tf>
    <\big-table|<block*|<tformat|<cwith|1|-1|2|-1|cell-halign|c>|<cwith|2|-1|2|-1|cell-halign|r>|<cwith|1|1|2|-1|cell-background|pastel
    grey>|<cwith|2|-1|1|1|cell-background|pastel
    grey>|<cwith|1|1|1|1|cell-background|pastel
    grey>|<cwith|1|-1|1|1|cell-halign|c>|<cwith|2|-1|1|1|cell-halign|l>|<twith|table-hmode|auto>|<table|<row|<cell|<math|r>>|<cell|1>|<cell|2>|<cell|4>|<cell|8>|<cell|16>|<cell|32>>|<row|<cell|this
    paper>|<cell|12>|<cell|51>|<cell|212>|<cell|896>|<cell|3969>|<cell|18953>>|<row|<cell|<name|gf2x>>|<cell|22>|<cell|182>|<cell|1457>|<cell|11856>|<cell|92858>|<cell|745586>>>>><tabular|<tformat|<table|<row|<cell|>>>>>>
      <label|tab:mat_pol_f_2>Products of <math|r\<times\>r> matrices over
      <math|\<bbb-F\><rsub|2><around*|[|x|]>>, for degree
      <math|64\<cdot\>2<rsup|16>>, in milliseconds.

      \;
    </big-table>
  </float>
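
  To make the gain reported in Table<nbsp><reference|tab:mat_pol_f_2> easier
  to read, the following C++ sketch computes the ratio between the two rows.
  The timings are copied from the table; the array and function names are
  hypothetical and only serve this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Timings in milliseconds copied from the table, for r = 1, 2, 4, 8, 16, 32.
const std::array<double, 6> this_paper_ms = {12.0, 51.0, 212.0,
                                             896.0, 3969.0, 18953.0};
const std::array<double, 6> gf2x_ms = {22.0, 182.0, 1457.0,
                                       11856.0, 92858.0, 745586.0};

// Ratio of the naive gf2x timing to the timing of our implementation.
std::array<double, 6> speedup_ratios() {
  std::array<double, 6> ratios{};
  for (std::size_t i = 0; i < ratios.size(); ++i) {
    assert(this_paper_ms[i] > 0.0);  // guard against division by zero
    ratios[i] = gf2x_ms[i] / this_paper_ms[i];
  }
  return ratios;
}
```

  The ratio grows from about 1.8 for r = 1 to about 39 for r = 32, which
  reflects the fact that the naive strategy repeats all transforms for each of
  the polynomial products.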

  <section|Conclusion>

  The present paper describes a major new approach for the efficient
  computation of large carryless products. It confirms the excellent
  arithmetic properties of the Babylonian field
  <math|\<bbb-F\><rsub|2<rsup|60>>> for practical purposes, when compared to
  the fastest previously available strategies.

  Improvements are still possible for our implementation of DFTs over
  <math|\<bbb-F\><rsub|2<rsup|60>>>. First, taking advantage of the more
  recent AVX-512 technologies is an important challenge. This is difficult
  due to the current lack of 256- or 512-bit SIMD counterparts for the
  <cpp|vpclmulqdq> assembly instruction (carryless multiplication of two quad
  words). However, larger vector instructions would be beneficial for matrix
  transposition, all the more so since there are twice as many 512-bit
  registers as 256-bit registers; so we can expect a significant speed-up for
  the Frobenius encoding/decoding stages. The second expected improvement
  concerns the use of truncated Fourier
  transforms<nbsp><cite|Hoeven2004|Larrieu2017> in order to smooth the
  graph from Figure<nbsp><reference|fig:timings>. Finally, we expect that our
  new ideas around the Frobenius transform might be applicable to other small
  finite fields.
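
  For readers unfamiliar with the operation, the carryless multiplication of
  two quad words mentioned above can be modeled by the following portable C++
  sketch. It is a bitwise reference version written for illustration, not the
  code used in our implementation; the names Clmul128 and clmul64 are
  hypothetical. On x86 processors, compilers expose the hardware instruction
  through the intrinsic _mm_clmulepi64_si128, which computes the same product
  in a single instruction.

```cpp
#include <cstdint>

// 128-bit result of a carryless product, split into two 64-bit halves.
struct Clmul128 {
  std::uint64_t lo;  // coefficients of x^0 .. x^63
  std::uint64_t hi;  // coefficients of x^64 .. x^126
};

// Schoolbook product in F_2[x]: bit i of a word is the coefficient of x^i,
// and additions are XORs, so no carries propagate.
Clmul128 clmul64(std::uint64_t a, std::uint64_t b) {
  Clmul128 r{0, 0};
  for (int i = 0; i < 64; ++i) {
    if ((b >> i) & 1u) {
      r.lo ^= a << i;
      if (i != 0) r.hi ^= a >> (64 - i);  // skip i = 0: a shift by 64 is undefined
    }
  }
  return r;
}
```

  As a check, clmul64(3, 3) yields the low word 5, since the square of x + 1
  over the field with two elements is x^2 + 1, whereas the ordinary integer
  product of 3 by 3 is 9.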

  <\bibliography|bib|plain|ff2mul.bib>
    <\bib-list|10>
      <bibitem*|1><label|bib-BrGaThZi2008>R.<nbsp>P. Brent, P.<nbsp>Gaudry,
      E.<nbsp>Thomé, and P.<nbsp>Zimmermann. <newblock>Faster multiplication
      in GF<math|<around|(|2|)><around|[|x|]>>. <newblock>In
      A.<nbsp>van<nbsp>der Poorten and A.<nbsp>Stein, editors,
      <with|font-shape|italic|Algorithmic Number Theory>, volume 5011 of
      <with|font-shape|italic|Lect. Notes Comput. Sci.>, pages 153\U166.
      Springer Berlin Heidelberg, 2008.

      <bibitem*|2><label|bib-chen2017faster>Ming-Shing Chen, Chen-Mou Cheng,
      Po-Chun Kuo, Wen-Ding Li, and Bo-Yin Yang. <newblock>Faster
      multiplication for long binary polynomials.
      <newblock><slink|https://arxiv.org/abs/1708.09746>, 2017.

      <bibitem*|3><label|bib-CT65>J.<nbsp>W. Cooley and J.<nbsp>W. Tukey.
      <newblock>An algorithm for the machine calculation of complex Fourier
      series. <newblock><with|font-shape|italic|Math. Computat.>,
      19:297\U301, 1965.

      <bibitem*|4><label|bib-GaoMateer2010>S.<nbsp>Gao and T.<nbsp>Mateer.
      <newblock>Additive fast Fourier transforms over finite fields.
      <newblock><with|font-shape|italic|IEEE Trans. Inform. Theory>,
      56(12):6265\U6272, 2010.

      <bibitem*|5><label|bib-GaGe2013>J.<nbsp>von<nbsp>zur Gathen and
      J.<nbsp>Gerhard. <newblock><with|font-shape|italic|Modern Computer
      Algebra>. <newblock>Cambridge University Press, 3rd edition, 2013.

      <bibitem*|6><label|bib-gcc>GCC, the GNU Compiler Collection.
      <newblock>Software available at <slink|http://gcc.gnu.org>, from 1987.

      <bibitem*|7><label|bib-vdH:f2kmul>D.<nbsp>Harvey, J.<nbsp>van<nbsp>der
      Hoeven, and G.<nbsp>Lecerf. <newblock>Fast polynomial multiplication
      over <math|<with|math-font|Bbb|F><rsub|2<rsup|60>>>. <newblock>In
      M.<nbsp>Rosenkranz, editor, <with|font-shape|italic|Proceedings of the
      ACM on International Symposium on Symbolic and Algebraic Computation>,
      ISSAC '16, pages 255\U262. ACM, 2016.

      <bibitem*|8><label|bib-vdH:ffmul>D.<nbsp>Harvey, J.<nbsp>van<nbsp>der
      Hoeven, and G.<nbsp>Lecerf. <newblock>Faster polynomial multiplication
      over finite fields. <newblock><with|font-shape|italic|J. ACM>, 63(6),
      2017. <newblock>Article<nbsp>52.

      <bibitem*|9><label|bib-Hoeven2004>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>The truncated Fourier transform and applications.
      <newblock>In J.<nbsp>Schicho, editor,
      <with|font-shape|italic|Proceedings of the 2004 International Symposium
      on Symbolic and Algebraic Computation>, ISSAC '04, pages 290\U296. ACM,
      2004.

      <bibitem*|10><label|bib-vdH:fnewton>J.<nbsp>van<nbsp>der Hoeven.
      <newblock>Newton's method and FFT trading.
      <newblock><with|font-shape|italic|J. Symbolic Comput.>, 45(8):857\U878,
      2010.

      <bibitem*|11><label|bib-vdH:ffft>J.<nbsp>van<nbsp>der Hoeven and
      R.<nbsp>Larrieu. <newblock>The Frobenius FFT. <newblock>In
      M.<nbsp>Burr, editor, <with|font-shape|italic|Proceedings of the 2017
      ACM on International Symposium on Symbolic and Algebraic Computation>,
      ISSAC '17, pages 437\U444. ACM, 2017.

      <bibitem*|12><label|bib-HoevenLecerf2013>J.<nbsp>van<nbsp>der Hoeven
      and G.<nbsp>Lecerf. <newblock>Interfacing Mathemagix with C++.
      <newblock>In M.<nbsp>Monagan, G.<nbsp>Cooperman, and
      M.<nbsp>Giesbrecht, editors, <with|font-shape|italic|Proceedings of the
      2013 ACM on International Symposium on Symbolic and Algebraic
      Computation>, ISSAC '13, pages 363\U370. ACM, 2013.

      <bibitem*|13><label|bib-mmx-user-guide>J.<nbsp>van<nbsp>der Hoeven and
      G.<nbsp>Lecerf. <newblock>Mathemagix User Guide.
      <newblock><slink|https://hal.archives-ouvertes.fr/hal-00785549>, 2013.

      <bibitem*|14><label|bib-Larrieu2017>R.<nbsp>Larrieu. <newblock>The
      truncated Fourier transform for mixed radices. <newblock>In
      M.<nbsp>Burr, editor, <with|font-shape|italic|Proceedings of the 2017
      ACM on International Symposium on Symbolic and Algebraic Computation>,
      ISSAC '17, pages 261\U268. ACM, 2017.

      <bibitem*|15><label|bib-LinChungHan2014>Sian-Jheng Lin, Wei-Ho Chung,
      and S.<nbsp>Yunghsiang<nbsp>Han. <newblock>Novel polynomial basis and
      its application to Reed-Solomon erasure codes. <newblock>In
      <with|font-shape|italic|2014 IEEE 55th Annual Symposium on Foundations
      of Computer Science (FOCS)>, pages 316\U325. IEEE, 2014.

      <bibitem*|16><label|bib-Sch77>A.<nbsp>Schönhage. <newblock>Schnelle
      Multiplikation von Polynomen über Körpern der Charakteristik 2.
      <newblock><with|font-shape|italic|Acta Infor.>, 7:395\U398, 1977.

      <bibitem*|17><label|bib-SS71>A.<nbsp>Schönhage and V.<nbsp>Strassen.
      <newblock>Schnelle Multiplikation großer Zahlen.
      <newblock><with|font-shape|italic|Computing>, 7:281\U292, 1971.

      <bibitem*|18><label|bib-Warren2012>H.<nbsp>S. Warren.
      <newblock><with|font-shape|italic|Hacker's Delight>.
      <newblock>Addison-Wesley, 2nd edition, 2012.
    </bib-list>
  </bibliography>
</body>

<\initial>
  <\collection>
    <associate|font|pagella>
    <associate|font-base-size|11>
    <associate|info-flag|short>
    <associate|large-padding-above|0.25fn>
    <associate|large-padding-below|0.25fn>
    <associate|math-font|math-pagella>
    <associate|padding-above|0.25fn>
    <associate|padding-below|0.25fn>
    <associate|page-medium|paper>
    <associate|preamble|false>
  </collection>
</initial>

<\attachments>
  <\collection>
    <\associate|bib-bibliography>
      <\db-entry|+j83UpRDaGYuaIY|inproceedings|vdH:f2kmul>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|D. <name|Harvey><name-sep>J. van der
        <name|Hoeven><name-sep>G. <name|Lecerf>>

        <db-field|title|Fast polynomial multiplication over
        <math|<with|math-font|Bbb|F><rsub|2<rsup|60>>>>

        <db-field|booktitle|Proceedings of the ACM on International Symposium
        on Symbolic and Algebraic Computation>

        <db-field|pages|255\U262>

        <db-field|year|2016>

        <db-field|editor|M. <name|Rosenkranz>>

        <db-field|series|ISSAC '16>

        <db-field|publisher|ACM>

        <db-field|location|Waterloo, ON, Canada>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIb|inproceedings|vdH:ffft>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|J. van der <name|Hoeven><name-sep>R. <name|Larrieu>>

        <db-field|title|The Frobenius FFT>

        <db-field|booktitle|Proceedings of the 2017 ACM on International
        Symposium on Symbolic and Algebraic Computation>

        <db-field|pages|437\U444>

        <db-field|year|2017>

        <db-field|editor|M. <name|Burr>>

        <db-field|series|ISSAC '17>

        <db-field|publisher|ACM>

        <db-field|location|Kaiserslautern, Germany>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIf|article|Sch77>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|A. <name|Schönhage>>

        <db-field|title|Schnelle Multiplikation von Polynomen über Körpern
        der Charakteristik 2>

        <db-field|journal|Acta Infor.>

        <db-field|year|1977>

        <db-field|volume|7>

        <db-field|pages|395\U398>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIe|article|SS71>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|A. <name|Schönhage><name-sep>V. <name|Strassen>>

        <db-field|title|Schnelle Multiplikation großer Zahlen>

        <db-field|journal|Computing>

        <db-field|year|1971>

        <db-field|volume|7>

        <db-field|pages|281\U292>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIZ|article|vdH:ffmul>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|D. <name|Harvey><name-sep>J. van der
        <name|Hoeven><name-sep>G. <name|Lecerf>>

        <db-field|title|Faster polynomial multiplication over finite fields>

        <db-field|journal|J. ACM>

        <db-field|year|2017>

        <db-field|volume|63>

        <db-field|number|6>

        <db-field|note|Article<nbsp>52>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIc|article|GaoMateer2010>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|S. <name|Gao><name-sep>T. <name|Mateer>>

        <db-field|title|Additive fast Fourier transforms over finite fields>

        <db-field|journal|IEEE Trans. Inform. Theory>

        <db-field|year|2010>

        <db-field|volume|56>

        <db-field|number|12>

        <db-field|pages|6265\U6272>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaId|inproceedings|LinChungHan2014>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|Sian-Jheng <name|Lin><name-sep>Wei-Ho
        <name|Chung><name-sep>S. <name|Yunghsiang Han>>

        <db-field|title|Novel polynomial basis and its application to
        Reed-Solomon erasure codes>

        <db-field|booktitle|2014 IEEE 55th Annual Symposium on Foundations of
        Computer Science (FOCS)>

        <db-field|pages|316\U325>

        <db-field|year|2014>

        <db-field|publisher|IEEE>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIp|unpublished|chen2017faster>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|Ming-Shing <name|Chen><name-sep>Chen-Mou
        <name|Cheng><name-sep>Po-Chun <name|Kuo><name-sep>Wen-Ding
        <name|Li><name-sep>Bo-Yin <name|Yang>>

        <db-field|title|Faster multiplication for long binary polynomials>

        <db-field|note|<slink|https://arxiv.org/abs/1708.09746>>

        <db-field|year|2017>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIh|book|GaGe2013>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|J. von zur <name|Gathen><name-sep>J. <name|Gerhard>>

        <db-field|title|Modern Computer Algebra>

        <db-field|publisher|Cambridge University Press>

        <db-field|year|2013>

        <db-field|edition|3rd>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIi|inproceedings|BrGaThZi2008>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|R. P. <name|Brent><name-sep>P.
        <name|Gaudry><name-sep>E. <name|Thomé><name-sep>P. <name|Zimmermann>>

        <db-field|title|Faster multiplication in
        gf<math|<around|(|2|)><around|[|x|]>>>

        <db-field|booktitle|Algorithmic Number Theory>

        <db-field|pages|153\U166>

        <db-field|year|2008>

        <db-field|editor|A. <name-von|van der> <name|Poorten><name-sep>A.
        <name|Stein>>

        <db-field|volume|5011>

        <db-field|series|Lect. Notes Comput. Sci.>

        <db-field|publisher|Springer Berlin Heidelberg>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIo|unpublished|mmx-user-guide>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|J. van der <name|Hoeven><name-sep>G. <name|Lecerf>>

        <db-field|title|Mathemagix User Guide>

        <db-field|note|<slink|https://hal.archives-ouvertes.fr/hal-00785549>>

        <db-field|year|2013>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIn|inproceedings|HoevenLecerf2013>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|J. van der <name|Hoeven><name-sep>G. <name|Lecerf>>

        <db-field|title|Interfacing Mathemagix with C++>

        <db-field|booktitle|Proceedings of the 2013 ACM on International
        Symposium on Symbolic and Algebraic Computation>

        <db-field|pages|363\U370>

        <db-field|year|2013>

        <db-field|editor|M. <name|Monagan><name-sep>G.
        <name|Cooperman><name-sep>M. <name|Giesbrecht>>

        <db-field|series|ISSAC '13>

        <db-field|publisher|ACM>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIm|book|Warren2012>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|H. S. <name|Warren>>

        <db-field|title|Hacker's Delight>

        <db-field|publisher|Addison-Wesley>

        <db-field|year|2012>

        <db-field|edition|2nd>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIj|misc|gcc>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|key|gcc>

        <db-field|title|GCC, the GNU Compiler Collection>

        <db-field|howpublished|Software available at
        <slink|http://gcc.gnu.org>>

        <db-field|year|from 1987>
      </db-entry>

      <\db-entry|+AnCgvA8Quqf7HO|article|vdH:fnewton>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505402082>
      <|db-entry>
        <db-field|author|J. van der <name|Hoeven>>

        <db-field|title|Newton's method and FFT trading>

        <db-field|journal|J. Symbolic Comput.>

        <db-field|year|2010>

        <db-field|volume|45>

        <db-field|number|8>

        <db-field|pages|857\U878>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIl|inproceedings|Hoeven2004>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|J. van der <name|Hoeven>>

        <db-field|title|The truncated Fourier transform and applications>

        <db-field|booktitle|Proceedings of the 2004 International Symposium
        on Symbolic and Algebraic Computation>

        <db-field|pages|290\U296>

        <db-field|year|2004>

        <db-field|editor|J. <name|Schicho>>

        <db-field|series|ISSAC '04>

        <db-field|publisher|ACM>
      </db-entry>

      <\db-entry|+j83UpRDaGYuaIa|inproceedings|Larrieu2017>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1505305060>
      <|db-entry>
        <db-field|author|R. <name|Larrieu>>

        <db-field|title|The truncated Fourier transform for mixed radices>

        <db-field|booktitle|Proceedings of the 2017 ACM on International
        Symposium on Symbolic and Algebraic Computation>

        <db-field|pages|261\U268>

        <db-field|year|2017>

        <db-field|editor|M. <name|Burr>>

        <db-field|series|ISSAC '17>

        <db-field|publisher|ACM>

        <db-field|location|Kaiserslautern, Germany>
      </db-entry>

      <\db-entry|+ikO8ncBMk21kfa|article|CT65>
        <db-field|contributor|lecerf>

        <db-field|modus|imported>

        <db-field|date|1475843318>
      <|db-entry>
        <db-field|author|J. W. <name|Cooley><name-sep>J. W. <name|Tukey>>

        <db-field|title|An algorithm for the machine calculation of complex
        Fourier series>

        <db-field|journal|Math. Computat.>

        <db-field|year|1965>

        <db-field|volume|19>

        <db-field|pages|297\U301>
      </db-entry>
    </associate>
  </collection>
</attachments>

<\references>
  <\collection>
    <associate|FFT-dec|<tuple|1|3|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|al:decode|<tuple|3|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|al:directtransform|<tuple|2|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|al:encode|<tuple|1|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|al:inversetransform|<tuple|4|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|al:productf2x|<tuple|5|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|algo-sec|<tuple|3|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-1|<tuple|1|1|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-10|<tuple|3.3|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-11|<tuple|3.4|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-12|<tuple|3.5|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-13|<tuple|4|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-14|<tuple|4.1|7|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-15|<tuple|4.2|7|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-16|<tuple|4.2.1|7|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-17|<tuple|4.2.2|8|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-18|<tuple|4.2.3|9|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-19|<tuple|4.3|10|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-2|<tuple|1.1|1|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-20|<tuple|5|10|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-21|<tuple|5|11|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-22|<tuple|5|11|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-23|<tuple|1|11|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-24|<tuple|1|11|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-25|<tuple|1|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-26|<tuple|6|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-27|<tuple|6|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-3|<tuple|1.2|2|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-4|<tuple|2|3|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-5|<tuple|2|3|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-6|<tuple|1|3|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-7|<tuple|3|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-8|<tuple|3.1|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|auto-9|<tuple|3.2|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bench-sec|<tuple|5|10|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-BrGaThZi2008|<tuple|1|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-CT65|<tuple|3|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-GaGe2013|<tuple|5|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-GaoMateer2010|<tuple|4|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-Hoeven2004|<tuple|9|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-HoevenLecerf2013|<tuple|12|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-Larrieu2017|<tuple|14|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-LinChungHan2014|<tuple|15|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-SS71|<tuple|17|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-Sch77|<tuple|16|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-Warren2012|<tuple|18|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-chen2017faster|<tuple|2|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-gcc|<tuple|6|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-mmx-user-guide|<tuple|13|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-vdH:f2kmul|<tuple|7|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-vdH:ffft|<tuple|11|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-vdH:ffmul|<tuple|8|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|bib-vdH:fnewton|<tuple|10|13|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|choice-m|<tuple|1|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|eqn:encode|<tuple|3|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|eqn:fdft|<tuple|2|3|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|fig:timings|<tuple|1|11|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|fun:8x8_transpose|<tuple|1|8|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|fun:encode|<tuple|2|10|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|impl-sec|<tuple|4|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|lm:transitive|<tuple|1|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:E|<tuple|4|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:Ebijective|<tuple|2|4|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:Edecomp|<tuple|4|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:decode|<tuple|7|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:directtransform|<tuple|6|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:encode|<tuple|5|5|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|pp:inversetransform|<tuple|8|6|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|rk:bmi2|<tuple|10|8|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|sec:prereq|<tuple|2|3|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
    <associate|tab:mat_pol_f_2|<tuple|1|12|../../../public/publs/2017/ff2mul/ff2mul-macis.tm>>
  </collection>
</references>

<\auxiliary>
  <\collection>
    <\associate|bib>
      vdH:f2kmul

      vdH:ffft

      Sch77

      SS71

      vdH:ffmul

      vdH:f2kmul

      vdH:ffft

      vdH:f2kmul

      vdH:ffft

      GaoMateer2010

      LinChungHan2014

      chen2017faster

      chen2017faster

      vdH:f2kmul

      vdH:ffmul

      GaGe2013

      BrGaThZi2008

      chen2017faster

      CT65

      vdH:ffft

      vdH:ffft

      vdH:f2kmul

      mmx-user-guide

      HoevenLecerf2013

      vdH:f2kmul

      Warren2012

      gcc

      chen2017faster

      vdH:f2kmul

      chen2017faster

      vdH:f2kmul

      chen2017faster

      vdH:f2kmul

      vdH:fnewton

      Sch77

      Hoeven2004

      Larrieu2017
    </associate>
    <\associate|figure>
      <tuple|normal|<\surround|<hidden|<tuple>>|>
        Products in <with|mode|<quote|math>|\<bbb-F\><rsub|2><around*|[|x|]><rsub|\<less\>\<ell\>>>,
        input size <with|mode|<quote|math>|<around*|\<lceil\>|\<ell\>/64|\<rceil\>>>
        quad words, timings in milliseconds.

        \;
      </surround>|<pageref|auto-23>>
    </associate>
    <\associate|table>
      <tuple|normal|<\surround|<hidden|<tuple>>|>
        Products of <with|mode|<quote|math>|r\<times\>r> matrices over
        <with|mode|<quote|math>|\<bbb-F\><rsub|2><around*|[|x|]>>, for degree
        <with|mode|<quote|math>|64\<cdot\>2<rsup|16>>, in milliseconds.

        \;
      </surround>|<pageref|auto-25>>
    </associate>
    <\associate|toc>
      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|1.<space|2spc>Introduction>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-1><vspace|0.5fn>

      <with|par-left|<quote|1tab>|1.1.<space|2spc>Related work
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-2>>

      <with|par-left|<quote|1tab>|1.2.<space|2spc>Results and outline of the
      paper <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-3>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|2.<space|2spc>Prerequisites>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-4><vspace|0.5fn>

      <with|par-left|<quote|2tab>|Discrete Fourier transforms
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-5>>

      <with|par-left|<quote|2tab>|Frobenius Fourier transforms
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-6>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|3.<space|2spc>Fast
      reduction from <with|mode|<quote|math>|\<bbb-F\><rsub|2><around*|[|x|]>>
      to <with|mode|<quote|math>|\<bbb-F\><rsub|2<rsup|60>><around*|[|x|]>>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-7><vspace|0.5fn>

      <with|par-left|<quote|1tab>|3.1.<space|2spc>Variant of the Frobenius
      DFT <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-8>>

      <with|par-left|<quote|1tab>|3.2.<space|2spc>Frobenius encoding
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-9>>

      <with|par-left|<quote|1tab>|3.3.<space|2spc>Direct transforms
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-10>>

      <with|par-left|<quote|1tab>|3.4.<space|2spc>Inverse transforms
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-11>>

      <with|par-left|<quote|1tab>|3.5.<space|2spc>Multiplication in
      <with|mode|<quote|math>|\<bbb-F\><rsub|2><around*|[|x|]>>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-12>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|4.<space|2spc>Implementation
      details> <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-13><vspace|0.5fn>

      <with|par-left|<quote|1tab>|4.1.<space|2spc>Packed representations
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-14>>

      <with|par-left|<quote|1tab>|4.2.<space|2spc>Matrix transposition
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-15>>

      <with|par-left|<quote|2tab>|4.2.1.<space|2spc>Transposing packed
      <with|mode|<quote|math>|8\<times\>8> bit matrices
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-16>>

      <with|par-left|<quote|2tab>|4.2.2.<space|2spc>Transposing four
      <with|mode|<quote|math>|8\<times\>8> byte matrices simultaneously
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-17>>

      <with|par-left|<quote|2tab>|4.2.3.<space|2spc>Transposing
      <with|mode|<quote|math>|256\<times\>64> bit matrices
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-18>>

      <with|par-left|<quote|1tab>|4.3.<space|2spc>Frobenius encoding
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-19>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|5.<space|2spc>Timings>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-20><vspace|0.5fn>

      <with|par-left|<quote|2tab>|Frobenius encoding
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-21>>

      <with|par-left|<quote|2tab>|Polynomial product
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-22>>

      <with|par-left|<quote|2tab>|Polynomial matrix product
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <no-break><pageref|auto-24>>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|6.<space|2spc>Conclusion>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-26><vspace|0.5fn>

      <vspace*|1fn><with|font-series|<quote|bold>|math-font-series|<quote|bold>|font-shape|<quote|small-caps>|Bibliography>
      <datoms|<macro|x|<repeat|<arg|x>|<with|font-series|medium|<with|font-size|1|<space|0.2fn>.<space|0.2fn>>>>>|<htab|5mm>>
      <pageref|auto-27><vspace|0.5fn>
    </associate>
  </collection>
</auxiliary>

<\links>
  <\collection>
    <id|+8MBFPemd735ILc>
    <target|+RvPymVTKoDnwf9|../../../public/publs/2017/ff2mul/ff2mul.tm>
    <locator|source-authref-1|<id|+RvPymVTKoDnwf9>>
    <locator|dest-authref-1|<id|+RvPymVTKoDnwf9>>
    <locator|source-authref-2|<id|+RvPymVTKoDnwf9>>
    <locator|dest-authref-2|<id|+RvPymVTKoDnwf9>>
    <locator|source-authref-3|<id|+RvPymVTKoDnwf9>>
    <locator|dest-authref-3|<id|+RvPymVTKoDnwf9>>
  </collection>
</links>