Vectorization of integration

A frequently asked question about numerical integration is: how to speed up the process of computing {\int_a^b f(x,p)\,dx} for many values of a parameter {p}? Running an explicit for loop over the values of {p} seems slow… is there a way to vectorize this, performing integration on an array of functions at once (which in practice means pushing the loop down to a lower-level language)?

Usually, the answer is some combination of “no” and “you are worrying about the wrong thing”. Integration routines are typically adaptive, meaning they pick evaluation points based on the function being integrated. Different values of the parameter will require different evaluation points to achieve the desired precision, so vectorization is out of the question. This applies, for example, to the quad method of SciPy’s integration module.

Let’s suppose we insist on vectorization and are willing to accept a non-adaptive method. What are our options, considering SciPy/NumPy? I will compare the trapezoidal rule, Simpson’s rule, and Romberg quadrature.

The test case is {\int_0^1 (p+1)x^p\,dx} with {p=0,0.01,\dots,99.99}. Theoretically all these are equal to {1}. But most of these functions are not analytic at {0}, and some aren’t even differentiable there, so their integration is not a piece of cake.

Keeping in mind that Romberg’s integration requires {2^k} subintervals, let’s use {1024} equal subintervals, hence {1025} equally spaced sample points. Here is the vectorized evaluation of all our functions at these points, resulting in a 2D array “values”:

import numpy as np
import scipy.integrate as si
x = np.linspace(0, 1, 1025).reshape(1, -1)   # 1025 sample points as a row vector
dx = x[0,1] - x[0,0]                         # spacing between sample points
p = np.arange(0, 100, 0.01).reshape(-1, 1)   # 10000 parameter values as a column vector
values = (p+1)*x**p                          # 2D array: one row of samples per value of p

This computation took 0.799 seconds on my AWS instance (t2.nano). Putting the results of evaluation together takes less time:

  • Trapezoidal rule np.trapz(values, dx=dx) took 0.144 seconds and returned values ranging from 0.99955 to 1.00080.
  • Simpson’s rule si.simps(values, dx=dx) took 0.113 seconds and returned values ranging from 0.99970 to 1.0000005.
  • Romberg quadrature si.romb(values, dx=dx) was fastest at 0.0414 seconds, and also most accurate, with output between 0.99973 and 1.000000005.
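For concreteness, here are the three calls from the list above; each one reduces the 2D array “values” along its last axis, returning one integral per value of {p} (the result names are my own):

trap = np.trapz(values, dx=dx)   # trapezoidal rule, one value per row of "values"
simp = si.simps(values, dx=dx)   # Simpson's rule
romb = si.romb(values, dx=dx)    # Romberg quadrature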

Conclusions so far:

  • SciPy’s implementation of Romberg quadrature is surprisingly fast, considering that this algorithm is the result of repeated Richardson extrapolation of the trapezoidal rule (and Simpson’s rule is just the result of the first extrapolation step). It’d be nice to backport whatever makes Romberg so effective to Simpson’s rule.
  • The underestimation errors (the 0.999… values) are greatest when {p} is near zero, where the integrand is very nonsmooth at {0}. The lack of smoothness renders Richardson extrapolation ineffective, so all three rules have about the same error there.
  • The overestimation errors are greatest when {p} is large. The function is pretty smooth then, so upgrading from trapezoidal to Simpson to Romberg brings substantial improvements.

All of this is nice, but the fact remains that non-vectorized adaptive integration is both faster and much more accurate. The following loop with quad, which uses adaptive quadrature from the QUADPACK library, runs in 0.616 seconds (less than it took to evaluate our functions on a grid) and returns values between 0.99999999995 and 1.00000000007. Better to use exponential notation here: {1-5\cdot 10^{-11} } and {1+7\cdot 10^{-11}}.

funcs = [lambda x, p=p_: (p+1)*x**p for p_ in np.arange(0, 100, 0.01)]  # default argument freezes each p_
result = [si.quad(fun, 0, 1)[0] for fun in funcs]                       # one adaptive integration per function

An adaptive algorithm shines when {p} is small, by placing more evaluation points near zero. It controls both over- and underestimates equally well, continuing to add points until the required precision is reached.

The last hope of non-adaptive methods is Gaussian quadrature, which is implemented in SciPy as fixed_quad (“fixed” referring to the number of evaluation points used). With 300 evaluation points, the integration takes 0.263 seconds (excluding the computation of nodes and weights, which can be cached) and the results are between {1-2\cdot 10^{-12}} and {1+2\cdot 10^{-7}}. This is twice as fast as a loop with quad, and more accurate for large {p} — but sadly, not so accurate for small {p}. As said before, {x^p} with {p} near zero is really a showcase for adaptive methods.
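Here is a minimal sketch of this approach (not the exact fixed_quad benchmark above): precompute Gauss-Legendre nodes and weights with NumPy’s leggauss, map them from {[-1,1]} to {[0,1]}, cache them, and integrate the whole family with a single matrix-vector product.

import numpy as np
nodes, weights = np.polynomial.legendre.leggauss(300)  # nodes and weights on [-1, 1]
x = ((nodes + 1) / 2).reshape(1, -1)                   # nodes mapped to [0, 1]
w = weights / 2                                        # weights rescaled to [0, 1]
p = np.arange(0, 100, 0.01).reshape(-1, 1)
results = ((p + 1) * x**p) @ w                         # one integral per value of p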

Trigonometric approximation and the Clenshaw-Curtis quadrature

Trying to approximate a generic continuous function on {[-1,1]} with the Fourier trigonometric series of the form {\sum_n (A_n\cos \pi nx+B_n\sin \pi nx)} is in general not very fruitful. Here’s such an approximation to {f(x)=\exp(x)}, with the sum over {n\le 4}:

Poor approximation to exp(x)

It’s better to make a linear change of variable: consider {f(2x-1)} on the interval {[0,1]}, and use the formula for the cosine series. This results in {\exp(2x-1)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos \pi nx} of its cosine Fourier series as follows.

Better approximation after a change of variable

But one can do much better with a different, nonlinear change of variable. Consider {f(\cos x)} on the interval {[0,\pi]}, and again use the formula for the cosine series. This results in {\exp(\cos x)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos nx} of its cosine Fourier series as follows.

Excellent approximation after nonlinear change of variable

Yes, I can’t see any difference either: the error is less than {10^{-3}}.

The composition with cosine improves approximation because {f(\cos t)} is naturally a periodic function, with no jumps or corners in its graph. Fourier series, which are periodic by nature, follow such functions more easily.


A practical implication of this approximation of {f(\cos t)} is the Clenshaw-Curtis integration method. It can be expressed in one line:

\displaystyle    \int_{-1}^1 f(x)\,dx = \int_0^\pi f(\cos t)\sin t\,dt \approx \int_0^\pi \sum_{n=0}^N a_n \cos nt \sin t\,dt   = \sum_{n=0}^N \frac{(1+(-1)^n) a_n}{1-n^2}

The integral {\int_{-1}^1 f(x)\,dx} is approximated by summing {2a_{2k}/(1-4k^2)}, where {a_{2k}} are the even-numbered cosine coefficients of {f(\cos t)}. In the example with {f(x)=\exp(x)}, using just three coefficients yields

\displaystyle    \frac{2a_0}{1}+\frac{2a_2}{-3}+\frac{2a_4}{-15} \approx 2.350405

while the exact integral is {\approx 2.350402}.


At first this doesn’t look practical at all: after all, the Fourier coefficients are themselves found by integration. But since {f(\cos t)} is so close to a trigonometric polynomial, one can sample it at equally spaced points and apply the Fast Fourier transform to the result, quickly obtaining many coefficients at once. This is what the Clenshaw-Curtis quadrature does (at least in principle; practical implementations may streamline these steps).
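Here is a minimal sketch of that procedure, assuming SciPy’s scipy.fft.dct (type 1) to obtain the cosine coefficients of {f(\cos t)} from equally spaced samples; it illustrates the idea rather than SciPy’s actual implementation.

import numpy as np
from scipy.fft import dct
def clenshaw_curtis(f, N=16):
    t = np.linspace(0, np.pi, N + 1)   # equally spaced points in t
    g = f(np.cos(t))                   # samples of f(cos t)
    a = dct(g, type=1) / N             # cosine coefficients a_n of f(cos t)
    a[0] /= 2                          # the n = 0 coefficient carries half weight
    n = np.arange(0, N + 1, 2)         # odd-numbered coefficients contribute nothing
    return np.sum(2 * a[n] / (1 - n**2))
print(clenshaw_curtis(np.exp))         # about 2.350402, the integral of exp over [-1, 1]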

Unreasonable effectiveness of the left endpoint rule

The left endpoint rule, and its twin the right endpoint rule, are the ugly ducklings of integration methods. The left endpoint rule is just {(b-a)} times the average of the values of the integrand over the left endpoints of {n} equal subintervals:

\displaystyle \int_a^b f(x)\,dx \approx \frac{b-a}{n} \sum_{i=0}^{n-1} f\left(a+i\,\frac{b-a}{n}\right)

Here is its graphical representation with {n=10} on the interval {[-1,1]}: the sample points are marked with vertical lines, the length of each line representing the weight given to that point. Every point has the same weight, actually.

Left endpoint rule

Primitive, ineffective, with error {O(1/n)} for {n} points used.

Simpson’s rule is more sophisticated, with error {O(1/n^4)}. It uses weights of three sizes:

Simpson’s rule

Gaussian quadrature uses specially designed (and difficult to compute) sample points and weights: more points toward the edges, larger weights in the middle.

Gaussian quadrature

Let’s compare these quadrature rules on the integral {\int_{-1}^1 e^x \cos \pi x\,dx}, using {10} points as above. Here is the function:

Nice smooth function
  • The exact value of the integral is {\dfrac{e^{-1}-e}{1+\pi^2}}, about -0.216.
  • Simpson’s rule gets within 0.0007 of the exact value. Well done!
  • Gaussian quadrature gets within 0.000000000000003 of the exact value. Amazing!
  • And the lame left endpoint rule outputs… a positive number, getting even the sign wrong! This is ridiculous. The error is more than 0.22.

Let’s try another integral: {\int_{-1}^1 \sqrt{4+\cos \pi x +\sin \pi x} \,dx}, again using {10} points. The function looks like this:

Another nice function

The integral can be evaluated exactly… sort of. In terms of elliptic integrals. And preferably not by hand:

Maple output, “simplified”
  • Simpson’s rule is within 0.00001 of the exact value, even better than the first time.
  • Gaussian quadrature is within 0.00000003, not as spectacular as in the first example.
  • And the stupid left endpoint rule is … accurate within 0.00000000000000005. What?
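Here is a minimal sketch reproducing this comparison; since the elliptic-integral expression is unwieldy, the reference value below comes from quad (my substitute for the exact value).

import numpy as np
import scipy.integrate as si
f = lambda x: np.sqrt(4 + np.cos(np.pi*x) + np.sin(np.pi*x))
a, b, n = -1.0, 1.0, 10
reference = si.quad(f, a, b)[0]             # stand-in for the exact value
x_left = a + (b - a)*np.arange(n)/n         # left endpoints, equal weights
left = (b - a)/n * f(x_left).sum()
gauss = si.fixed_quad(f, a, b, n=n)[0]      # 10-point Gaussian quadrature
print(left - reference, gauss - reference)  # the left endpoint rule wins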

The integral of a smooth periodic function over its period amounts to integration over a circle. When translated to the circle, the elaborate placement of Gaussian sample points is… simply illogical. There is no reason to bunch them up at any particular point: there is nothing special about (-1,0) or any other point of the circle.

Gaussian nodes on a circle

The only natural approach here is the simplest one: equally spaced points, equal weights. Left endpoint rule uses it and wins.

Equally spaced points

Quadrature rules and quadrature domains

The midpoint rule of numerical integration

\displaystyle \int_a^b f(x)\,dx \approx (b-a)f\left(\frac{a+b}{2}\right)

is approximate in general, but exact for linear functions (polynomials of degree at most one).

Midpoint Rule

With two sample points we can integrate any cubic polynomial exactly. The choice of sample points is not obvious: they are to be placed at distance {\dfrac{1}{2\sqrt{3}}(b-a)} from the midpoint. In the example below, {a=0} and {b=1}, so the sample points are {x_{1,2} = \dfrac12 \pm \dfrac{1}{2\sqrt{3}}}. The integral is equal to {\dfrac12 f(x_1)+\dfrac12 f(x_2)}. One can say that each sample point has weight {1/2} in this rule.

Two-point quadrature: areas of yellow rectangles add up to the integral

Three sample points are enough for polynomials of degrees up to and including five. This time, the weights attached to the sample points are not equal. The midpoint is used again, this time with the weight of {4/9}. Two other sample points are at distance {\dfrac{\sqrt{3}}{2\sqrt{5}}(b-a)} from the midpoint, and their weights are {5/18} each.

Three-point quadrature: rectangles of unequal width

Compare this with Simpson’s rule, which also uses three sample points but is exact only up to degree three.

The above are examples of Gaussian quadrature: for each positive integer {n}, one can integrate polynomials of degree up to {2n-1} by taking {n} samples at the right places, and weighting them appropriately.
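As a quick check of the three-point rule above, here is a sketch on {[0,1]} with an assumed test function {f(x)=x^5}, whose integral is {1/6}.

import numpy as np
a, b = 0.0, 1.0
mid = (a + b)/2
d = np.sqrt(3)/(2*np.sqrt(5))*(b - a)          # offset of the two outer nodes
nodes = np.array([mid - d, mid, mid + d])
weights = (b - a)*np.array([5/18, 4/9, 5/18])
f = lambda x: x**5
print(np.dot(weights, f(nodes)))               # 0.16666..., i.e. 1/6 up to rounding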


Let’s move from the real line to the complex plane. If one accepts that the analog of the interval {(a,b)} is a disk in the plane, then quadrature becomes very simple: for any disk {D} and any complex polynomial {p},

\displaystyle   \iint_D p = \pi r^2 p(c)

where {c} is the center of the disk and {r} is its radius. One sample point is enough for all degrees! The proof is easy: rewrite {p} in terms of powers of {(z-c)} and integrate them in polar coordinates. The same works for any holomorphic function, as long as it is integrable in {D}.

Disk: a quadrature domain with one node
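As a numeric sanity check (with a disk and a polynomial of my own choosing), one can compare a brute-force double integral in polar coordinates against {\pi r^2 p(c)}:

import numpy as np
from scipy import integrate
c, r = 1 + 1j, 2.0                    # center and radius of the disk
p = lambda z: z**3 - 2*z + 5          # an arbitrary complex polynomial
def integrand(rho, theta, part):
    z = c + rho*np.exp(1j*theta)
    return getattr(p(z)*rho, part)    # rho is the polar-coordinates Jacobian
re, _ = integrate.dblquad(integrand, 0, 2*np.pi, 0, r, args=('real',))
im, _ = integrate.dblquad(integrand, 0, 2*np.pi, 0, r, args=('imag',))
print(complex(re, im), np.pi*r**2*p(c))   # the two agree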

But maybe the disk is a unique such shape? Not at all: there are other such quadrature domains. A simple family of examples is Neumann ovals (Neumann as in “boundary condition”). Geometrically, they are ellipses inverted in a concentric circle. Analytically, they are (up to linear transformations) images of the unit disk {\mathbb{D}} under

\displaystyle  \varphi(z)=\frac{z}{1-c^2 z^2}\quad (0\le c<1)

This image, denoted {\Omega } below, looks much like an ellipse when {c} is small:

Neumann oval with c=0.3

Then it becomes peanut-shaped:

Neumann oval with c=0.6

For {c\approx 1} it looks much like the union of two disks (but the boundary is smooth, contrary to what the plot suggests):

Neumann oval with c=0.95

In each of these images, the marked points are the quadrature nodes. Let’s find out what they are and how they work.

Suppose {f} is holomorphic and integrable in {\Omega}. By a change of variables,

\displaystyle  \iint_{\Omega } f = \iint_{\mathbb D} (f\circ \varphi)\, \varphi' \,\overline{\varphi'}

Here { (f\circ \varphi) \varphi'} is holomorphic, but { \overline{\varphi'} } is anti-holomorphic. We want to know what { \overline{\varphi'} } does when integrated against something holomorphic. Power series to the rescue:

\displaystyle    \varphi(z) = z\sum_{n=0}^\infty c^{2n} z^{2n} = \sum_{n=0}^\infty c^{2n} z^{2n+1}

hence

\displaystyle  \overline{\varphi'(z)} = \sum_{n=0}^\infty c^{2n} (2n+1) \bar z^{2n}

Multiply this by {z^{k}} and integrate over {\mathbb D} in polar coordinates: the result is {0} if {k} is odd and

\displaystyle    2\pi c^{k} (k+1) \int_0^1 r^{2k} r\,dr= \pi c^{k}

if {k} is even. So, integration of {\overline{\varphi'(z)}} against a power series {g(z) = \sum a_k z^k} produces the sum of {a_{k} c^{k}} over even powers {k} only (with the factor of {\pi}). The process of dropping odd powers amounts to taking the even part of {g}:

\displaystyle  \iint_{\mathbb D} g\, \overline{\varphi'} = \frac{\pi}{2} ( g(c)+ g(-c))

Yes, there’s something about { \overline{\varphi'} } that’s magic (key words: reproducing kernel, specifically the Bergman kernel). Plug in {g= (f\circ \varphi)\, \varphi' } to conclude

\displaystyle  \iint_{\Omega } f = \frac{\pi}{2} \Big\{f(\varphi(c))\, \varphi'(c) + f(\varphi(-c) ) \,\varphi'(-c)\Big\}

So, the nodes are {\varphi(\pm c) = \pm c/(1-c^4)}. They have equal weight, because {\varphi'(\pm c) = \dfrac{1+c^4}{(1-c^4)^2}}. Final result:

\displaystyle  \iint_{\Omega } f = \frac{\pi (1+c^4)}{2 (1-c^4)^2}\, \left\{f \left( \frac{c}{1-c^4} \right) + f \left(- \frac{c}{1-c^4} \right) \right\}

Again, this is a two-point quadrature formula that is exact for all complex polynomials.

As a bonus, put {f\equiv 1} to find that the area of {\Omega} is {\dfrac{\pi (1+c^4)}{ (1-c^4)^2}}.
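As a sanity check of this formula (with {c=0.6} and a test polynomial chosen by me), one can compare it against a brute-force integral pulled back to the unit disk, where the area element is {|\varphi'|^2} times the usual one.

import numpy as np
from scipy import integrate
c = 0.6
f = lambda z: z**4 + 3*z + 2                # an arbitrary test polynomial
phi = lambda z: z/(1 - c**2*z**2)
dphi = lambda z: (1 + c**2*z**2)/(1 - c**2*z**2)**2
def pullback(rho, theta):
    # real part of f(phi(z)) |phi'(z)|^2 in polar coordinates on the unit disk;
    # the imaginary part vanishes by symmetry for this f
    z = rho*np.exp(1j*theta)
    return (f(phi(z))*abs(dphi(z))**2*rho).real
numeric, _ = integrate.dblquad(pullback, 0, 2*np.pi, 0, 1)
node = c/(1 - c**4)
weight = np.pi*(1 + c**4)/(2*(1 - c**4)**2)
print(numeric, weight*(f(node) + f(-node)))  # the two agree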

Integrate by parts twice and solve for the integral

The dreaded calculus torture device that works for exactly two integrals, {\int e^{ax}\sin bx\,dx} and {\int e^{ax}\cos bx\,dx}.

Actually, no. A version of it (with one integration by parts) works for {\int x^n\,dx}:

\displaystyle    \int x^n\,dx = x^n x - \int x\, d(x^n) = x^{n+1} - n \int x^n\,dx

hence (assuming {n\ne -1})

\displaystyle  \int x^n\,dx = \frac{x^{n+1}}{n+1} +C

Yes, this is more of a calculus joke. A more serious example comes from Fourier series.

The functions {\sin nx}, {n=1,2,\dots}, are orthogonal on {[0,\pi]}, in the sense that

\displaystyle    \int_0^\pi \sin nx \sin mx \,dx =0 , \quad m\ne n

This is usually proved using a trigonometric identity that converts the product to a sum. But double integration by parts gives a nicer proof, because no obscure identities are needed. No boundary terms appear because the sines vanish at both endpoints:

\displaystyle    \int_0^\pi \sin nx \sin mx \,dx = \frac{n}{m} \int_0^\pi \cos nx \cos mx \,dx = \frac{n^2}{m^2} \int_0^\pi \sin nx \sin mx \,dx

All integrals here must vanish because {n^2/m^2\ne 1}. As a bonus, we get the orthogonality of cosines, {\int_0^\pi \cos nx \cos mx \,dx=0}, with no additional effort.

The double integration by parts is also a more conceptual proof, because it gets to the heart of the matter: eigenvectors of a symmetric matrix (operator) that correspond to different eigenvalues are orthogonal. The trigonometric form is incidental; the eigenfunction property is essential. Let’s try this one more time, for the mixed boundary value problem {f(a)=0}, {f'(b)=0}. Suppose that {f} and {g} satisfy the boundary conditions, {f''=\lambda f}, and {g''=\mu g}. Since {fg'} and {f'g} vanish at both endpoints, we can move the derivatives from one factor to the other without picking up boundary terms:

\displaystyle    \int_a^b fg= \frac{1}{\mu}\int_a^b fg'' = -\frac{1}{\mu}\int_a^b f'g' = \frac{1}{\mu}\int_a^b f''g = \frac{\lambda}{\mu} \int_a^b fg

If {\lambda\ne \mu}, all integrals must vanish.
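For a concrete illustration (assuming {[a,b]=[0,1]}): the eigenfunctions of this mixed problem are {\sin\big((k+\tfrac12)\pi x\big)}, and a quick numerical check confirms that they are pairwise orthogonal.

import numpy as np
from scipy import integrate
def eig(k):
    return lambda x: np.sin((k + 0.5)*np.pi*x)
for k in range(3):
    for m in range(k + 1, 4):
        val, _ = integrate.quad(lambda x: eig(k)(x)*eig(m)(x), 0, 1)
        print(k, m, val)   # all of these are numerically zero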

Scaling and oscillation

A function {f\colon \mathbb R\rightarrow\mathbb R} can be much larger than its derivative. Take the constant function {f(x)=10^{10}}, for example. Or {f(x)=10^{10}+\sin x} to make it nonconstant. But if one subtracts the average (mean) from {f}, the residual is nicely estimated by the derivative:

\displaystyle    \frac{1}{b-a}\int_a^b |f(x)-\overline{f}|\,dx \le \frac12 \int_a^b |f'(x)|\,dx    \ \ \ \ \ (1)

Here {\overline{f}} is the mean of {f} on {[a,b]}, namely {\overline{f}=\frac{1}{b-a}\int_a^b f(t)\,dt}. Indeed, what’s the worst that could happen? Something like this:

Deviation from the mean

Here {H} is at most the integral of {|f'|}, and the shaded area is at most {\frac12 H(b-a)}. This is what the inequality (1) says.
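A quick numerical spot check of (1), using the function {10^{10}+\sin x} mentioned above on an interval of my choosing:

import numpy as np
a, b = 0.0, 10.0
x = np.linspace(a, b, 100001)
f = 1e10 + np.sin(x)
df = np.cos(x)                                  # f'(x)
fbar = np.trapz(f, x)/(b - a)                   # mean of f
lhs = np.trapz(np.abs(f - fbar), x)/(b - a)     # mean deviation from the mean
rhs = 0.5*np.trapz(np.abs(df), x)               # half the integral of |f'|
print(lhs, rhs)                                 # lhs stays below rhs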

An appealing feature of (1) is that it is scale-invariant: if we rescale the variable, replacing {f(x)} with {f(2x)} on an interval of half the length, both sides remain the same. The derivative becomes greater by a factor of {2}, but it is integrated over the shorter interval. And on the left we have averages upon averages, which do not change under scaling.

What happens in higher dimensions? Let’s stick to two dimensions and consider a smooth function {f\colon\mathbb R^2\rightarrow\mathbb R}. Instead of an interval we now have a square, denoted {Q}. It makes sense to denote squares by {Q}, because it’s natural to call a square a cube, and “Q” is the first letter of “cube”. Oh wait, it isn’t. Moving on…

The quantity {b-a} was the length of interval of integration. Now we will use the area of {Q}, denoted {|Q|}. And {\overline{f}=\frac{1}{|Q|}\iint_Q f} is now the mean value of {f} on {Q}. At first glance one might conjecture the following version of (1):

\displaystyle    \frac{1}{|Q|}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le C \int_Q |\nabla f(x,y)|\,dx\,dy   \ \ \ \ \ (2)

But this can’t be true because of inconsistent scaling. The left side of (2) is scale-invariant as before. The right side is not. If we shrink the cube by a factor of {2}, the gradient {|\nabla f|} goes up by {2}, but the area goes down by {4}. This suggests that the correct inequality should be

\displaystyle    \frac{1}{|Q|}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le C \left(\int_Q |\nabla f(x,y)|^2\,dx\,dy\right)^{1/2}   \ \ \ \ \ (3)

We need the square root so that the right side of (3) scales correctly with {f}: to first power.

And here is the proof. Let {f(*,y)} denote {f} averaged over {x}. Applying (1) to every horizontal segment in {Q}, we obtain

\displaystyle    \frac{1}{h}\iint_Q |f(x,y)-f(*,y)|\,dx\,dy \le \frac12 \int_Q |f_x(x,y)|\,dx\,dy    \ \ \ \ \ (4)

where {h} is the sidelength of {Q}. Now work with {f(*,y)}, using (1) along vertical segments:

\displaystyle  \frac{1}{h}\iint_Q |f(*,y)-f(*,*)|\,dx\,dy \le \frac12 \int_Q |f_y(*,y)|\,dx\,dy    \ \ \ \ \ (5)

Of course, {f(*,*)} is the same as {\overline{f}}. The derivative on the right can be estimated: the absolute value of the derivative of an average does not exceed the average of the absolute value of the derivative. To keep the estimates clean, simply bound both partial derivatives by {|\nabla f|}. From (4) and (5) taken together it follows that

\displaystyle    \frac{1}{h}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le \int_Q |\nabla f(x,y)|\,dx\,dy    \ \ \ \ \ (6)

This is an interesting result (a form of the Poincaré inequality), but in the present form it’s not scale-invariant. Remember that we expect the {L^2} norm of the gradient on the right. Cauchy-Schwarz to the rescue:

\displaystyle    \int_Q 1\cdot |\nabla f| \le \left( \int_Q 1 \right)^{1/2} \left( \int_Q |\nabla f|^2 \right)^{1/2}

The first factor on the right is simply {h}. Move it to the left and we are done:

\displaystyle    \frac{1}{|Q|}\iint_Q |f(x,y)-\overline{f}|\,dx\,dy \le \left(\int_Q |\nabla f(x,y)|^2\,dx\,dy\right)^{1/2}   \ \ \ \ \ (7)

In higher dimensions we would of course have {n} instead of {2}. Which is one of many reasons why analysis in two dimensions is special: {L^n} is a Hilbert space only when {n=2}.

The left side of (7) is the mean oscillation of {f} on the square {Q}. The integrability of {|\nabla f|^n} in {n} dimensions ensures that {f} is a function of bounded mean oscillation, known as BMO. Actually, it is even in the smaller space VMO, because the right side of (7) tends to zero as the square shrinks. But it need not be continuous or even bounded: for {f(x)=\log\log (1/|x|)} the integral of {|\nabla f|^n} converges in a neighborhood of the origin (just barely, thanks to {\log^n (1/|x|)} in the denominator). This is unlike the one-dimensional situation, where the integrability of {|f'|} guarantees that the function is bounded.