Vectorization of integration

A frequently asked question about numerical integration is: how to speed up the computation of {\int_a^b f(x,p)\,dx} for many values of a parameter {p}? Running an explicit for loop over the values of {p} seems slow… is there a way to vectorize this, performing integration on an array of functions at once (which in practice means pushing the loop down to a lower-level language)?

Usually, the answer is some combination of “no” and “you are worrying about the wrong thing”. Integration routines are typically adaptive, meaning they pick evaluation points based on the function being integrated. Different values of the parameter will require different evaluation points to achieve the desired precision, so vectorization is out of the question. This applies, for example, to the quad method of SciPy’s integrate module.

Let’s suppose we insist on vectorization and are willing to accept a non-adaptive method. What are our options within NumPy/SciPy? I will compare the trapezoidal rule, Simpson’s rule, and Romberg quadrature.

The test case is {\int_0^1 (p+1)x^p\,dx} with {p=0,0.01,\dots,99.99}. Theoretically all these are equal to {1}. But most of these functions are not analytic at {0}, and some aren’t even differentiable there, so their integration is not a piece of cake.

Keeping in mind that Romberg’s integration requires {2^k} subintervals, let’s use {1024} equal subintervals, hence {1025} equally spaced sample points. Here is the vectorized evaluation of all our functions at these points, resulting in a 2D array “values”:

import numpy as np
import scipy.integrate as si
x = np.linspace(0, 1, 1025).reshape(1, -1)    # 1025 sample points as a row vector
dx = x[0,1] - x[0,0]                          # spacing between sample points
p = np.arange(0, 100, 0.01).reshape(-1, 1)    # 10000 parameter values as a column vector
values = (p+1)*x**p                           # broadcasting gives a 10000 by 1025 array

This computation took 0.799 seconds on my AWS instance (t2.nano). Putting the results of evaluation together takes less time:

  • Trapezoidal rule np.trapz(values, dx=dx) took 0.144 seconds and returned values ranging from 0.99955 to 1.00080.
  • Simpson’s rule si.simps(values, dx=dx) took 0.113 seconds and returned values ranging from 0.99970 to 1.0000005.
  • Romberg quadrature si.romb(values, dx=dx) was fastest at 0.0414 seconds, and also most accurate, with output between 0.99973 and 1.000000005.
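
For reference, here is a minimal timing sketch, reusing the arrays values and dx defined above. Exact timings depend on the machine, and newer SciPy/NumPy releases rename simps and trapz to simpson and trapezoid.

import time

for name, rule in [("trapz", np.trapz), ("simps", si.simps), ("romb", si.romb)]:
    start = time.perf_counter()
    out = rule(values, dx=dx)   # each rule integrates along the last axis
    print(name, time.perf_counter() - start, out.min(), out.max())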

Conclusions so far:

  • SciPy’s implementation of Romberg quadrature is surprisingly fast, considering that this algorithm is the result of repeated Richardson extrapolation of the trapezoidal rule (and Simpson’s rule is just the result of the first extrapolation step). It’d be nice to port whatever makes Romberg so effective over to Simpson’s rule.
  • The underestimation errors 0.999… are greatest when {p} is near zero, where the integrand is very nonsmooth at {0}. The lack of smoothness renders Richardson extrapolation ineffective, hence all three rules have about the same error here.
  • The overestimation errors are greatest when {p} is large. The function is pretty smooth then, so upgrading from trapezoidal to Simpson to Romberg brings substantial improvements.

All of this is nice, but the fact remains that non-vectorized adaptive integration is both faster and much more accurate. The following loop with quad, which wraps the adaptive routines of the QUADPACK library, runs in 0.616 seconds (less than it took to evaluate our functions on a grid) and returns values between 0.99999999995 and 1.00000000007. Better to use exponential notation here: {1-5\cdot 10^{-11}} and {1+7\cdot 10^{-11}}.

funcs = [lambda x, p=p_: (p+1)*x**p for p_ in np.arange(0, 100, 0.01)]  # p=p_ binds each parameter value
result = [si.quad(fun, 0, 1)[0] for fun in funcs]                       # quad returns (value, error estimate)

An adaptive algorithm shines when {p} is small, by placing more evaluation points near zero. It controls over- and underestimation errors equally well, continuing to add points until the required precision is reached.

The last hope of non-adaptive methods is Gaussian quadrature, which is implemented in SciPy as fixed_quad (“fixed” referring to the number of evaluation points used). With 300 evaluation points, the integration takes 0.263 seconds (excluding the computation of nodes and weights, which can be cached) and the results are between {1-2\cdot 10^{-12}} and {1+2\cdot 10^{-7}}. This is twice as fast as the loop with quad, and more accurate for large {p} — but sadly, not so accurate for small {p}. As noted above, {x^p} with {p} near zero is really a showcase for adaptive methods.
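
Here is a minimal sketch of one way to carry out this vectorized computation, reusing the array p from above: the Gauss-Legendre nodes and weights from leggauss are computed once, mapped from {[-1,1]} to {[0,1]}, and applied to all parameter values with a single matrix product.

nodes, weights = np.polynomial.legendre.leggauss(300)  # nodes and weights on [-1, 1]
t = ((nodes + 1) / 2).reshape(1, -1)                   # map the nodes to [0, 1]
w = weights / 2                                        # rescale the weights accordingly
gauss_results = ((p + 1) * t**p) @ w                   # one integral per value of p
print(gauss_results.min(), gauss_results.max())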

Trigonometric approximation and the Clenshaw-Curtis quadrature

Trying to approximate a generic continuous function on {[-1,1]} with the Fourier trigonometric series of the form {\sum_n (A_n\cos \pi nx+B_n\sin \pi nx)} is in general not very fruitful. Here’s such an approximation to {f(x)=\exp(x)}, with the sum over {n\le 4}:

Poor approximation to exp(x)

It’s better to make a linear change of variable: consider {f(2x-1)} on the interval {[0,1]}, and use the formula for the cosine series. This results in {\exp(2x-1)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos \pi nx} of its cosine Fourier series as follows.

Better approximation after a change of variable

But one can do much better with a different, nonlinear change of variable. Consider {f(\cos x)} on the interval {[0,\pi]}, and again use the formula for the cosine series. This results in {\exp(\cos x)}, which is approximated by the partial sum {\sum_{n=0}^4 A_n\cos nx} of its cosine Fourier series as follows.

Excellent approximation after nonlinear change of variable

Yes, I can’t see any difference either: the error is less than {10^{-3}}.

The composition with cosine improves approximation because {f(\cos t)} is naturally a periodic function, with no jumps or corners in its graph. Fourier series, which are periodic by nature, follow such functions more easily.


A practical implication of this approximation of {f(\cos t)} is the Clenshaw-Curtis integration method. It can be expressed in one line:

\displaystyle    \int_{-1}^1 f(x)\,dx = \int_0^\pi f(\cos t)\sin t\,dt \approx \int_0^\pi \sum_{n=0}^N a_n \cos nt \sin t\,dt   = \sum_{n=0}^N \frac{(1+(-1)^n) a_n}{1-n^2}
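
The last equality rests on an elementary integral, computed via a product-to-sum identity (valid for {n\ne 1}; the {n=1} integral is zero, consistent with the vanishing of the odd-{n} terms):

\displaystyle    \int_0^\pi \cos nt\,\sin t\,dt = \frac12\int_0^\pi \big(\sin (n+1)t - \sin (n-1)t\big)\,dt = \frac{1+(-1)^n}{1-n^2}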

The integral {\int_{-1}^1 f(x)\,dx} is approximated by summing {2a_{2k}/(1-4k^2)}, where {a_{2k}} are the even-numbered cosine coefficients of {f(\cos t)}. In the example with {f(x)=\exp(x)}, using just three coefficients yields

\displaystyle    \frac{2a_0}{1}+\frac{2a_2}{-3}+\frac{2a_4}{-15} \approx 2.350405

while the exact integral is {\approx 2.350402}.


At first this doesn’t look practical at all: after all, the Fourier coefficients are themselves found by integration. But since {f(\cos t)} is so close to a trigonometric polynomial, one can sample it at equally spaced points and apply the fast Fourier transform to the result, quickly obtaining many coefficients at once. This is what Clenshaw-Curtis quadrature does (at least in principle; a practical implementation may streamline these steps).
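
For illustration, here is a minimal sketch along these lines (not the production algorithm): sample {f(\cos t)} at equally spaced points, obtain the cosine coefficients with a type-I discrete cosine transform, and sum the even-numbered ones as above.

import numpy as np
from scipy.fft import dct

def clenshaw_curtis(f, N=8):
    """Approximate the integral of f over [-1, 1] using N+1 sample points."""
    t = np.linspace(0, np.pi, N + 1)   # equally spaced t, so x = cos(t) are Chebyshev points
    g = f(np.cos(t))                   # samples of the periodic function f(cos t)
    a = dct(g, type=1) / N             # cosine coefficients a_n of f(cos t)
    a[0] /= 2                          # a_0 is the mean value
    k = np.arange(N // 2 + 1)
    return np.sum(2 * a[2 * k] / (1 - 4 * k**2))

print(clenshaw_curtis(np.exp))         # about 2.3504; the exact value is e - 1/e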

Unreasonable effectiveness of the left endpoint rule

The left endpoint rule, and its twin the right endpoint rule, are the ugly ducklings of integration methods. The left endpoint rule is just the average of the values of the integrand over the left endpoints of {n} equal subintervals, multiplied by the length of the interval:

\displaystyle \int_a^b f(x)\,dx \approx \frac{b-a}{n} \sum_{i=0}^{n-1} f\left(a+i\,\frac{b-a}{n}\right)

Here is its graphical representation with {n=10} on the interval {[-1,1]}: the sample points are marked with vertical lines, the length of each line representing the weight given to that point. Every point has the same weight, actually.

Left endpoint rule

Primitive, ineffective, with error {O(1/n)} for {n} points used.

Simpson’s rule is more sophisticated, with error {O(1/n^4)}. It uses weights of three sizes:

Simpson’s rule

Gaussian quadrature uses specially designed (and difficult to compute) sample points and weights: more points toward the edges, larger weights in the middle.

Gaussian quadrature

Let’s compare these quadrature rules on the integral {\int_{-1}^1 e^x \cos \pi x\,dx}, using {10} points as above. Here is the function:

Nice smooth function
  • The exact value of the integral is {\dfrac{e^{-1}-e}{1+\pi^2}}, about -0.216.
  • Simpson’s rule gets within 0.0007 of the exact value. Well done!
  • Gaussian quadrature gets within 0.000000000000003 of the exact value. Amazing!
  • And the lame left endpoint rule outputs… a positive number, getting even the sign wrong! This is ridiculous. The error is more than 0.22.

Let’s try another integral: {\int_{-1}^1 \sqrt{4+\cos \pi x +\sin \pi x} \,dx}, again using {10} points. The function looks like this:

Another nice function

The integral can be evaluated exactly… sort of. In terms of elliptic integrals. And preferably not by hand:

Maple output, “simplified”
  • Simpson’s rule is within 0.00001 of the exact value, even better than the first time.
  • Gaussian quadrature is within 0.00000003, not as spectacular as in the first example.
  • And the stupid left endpoint rule is… accurate within 0.00000000000000005. What?
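
These comparisons are easy to reproduce. Here is a minimal sketch, with quad supplying the reference values; the node counts (10 for the left endpoint rule and Gaussian quadrature, 11 nodes, i.e. 10 subintervals, for Simpson’s rule) are my own choice and may differ slightly from the numbers quoted above. In older SciPy, simpson is spelled simps.

import numpy as np
from scipy.integrate import quad, fixed_quad, simpson

def left_endpoint(f, a, b, n):
    x = a + (b - a) * np.arange(n) / n   # left endpoints of n equal subintervals
    return (b - a) / n * f(x).sum()

for f in [lambda x: np.exp(x) * np.cos(np.pi * x),
          lambda x: np.sqrt(4 + np.cos(np.pi * x) + np.sin(np.pi * x))]:
    exact, _ = quad(f, -1, 1)            # adaptive reference value
    xs = np.linspace(-1, 1, 11)          # 11 equally spaced nodes for Simpson
    print(left_endpoint(f, -1, 1, 10) - exact,
          simpson(f(xs), x=xs) - exact,
          fixed_quad(f, -1, 1, n=10)[0] - exact)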

The integral of a smooth periodic function over its period amounts to integration over a circle. When translated to the circle, the elaborate placement of Gaussian sample points is… simply illogical. There is no reason to bunch them up at any particular point: there is nothing special about (-1,0) or any other point of the circle.

Gaussian nodes on a circle

The only natural approach here is the simplest one: equally spaced points, equal weights. The left endpoint rule uses it and wins.

Equally spaced points

Unreasonable effectiveness of Runge-Kutta 2

I used to give this example of an initial-value problem whenever numerical solutions of ODEs came up in class:

\displaystyle    y' = -y^2, \quad y(0)=1

The exact solution is {y(t)=1/(1+t)}. Using Euler’s method with step {h=1} (admittedly large) we get, starting with {y_0=1}:

\displaystyle    y_1 = y_0 + h (-y_0^2) = 0,\quad y_2 = y_1 + h (-y_1^2) = 0, \dots

The numerical solution drops down to zero at once and stays there. Not good at all.

Euler’s method

The trapezoidal version of RK2 (also known as improved Euler, or Heun’s method) treats the above value of {y_1} merely as a prediction {\tilde y_1}, and computes the actual {y_1} using the average of the slopes at {y_0} and {\tilde y_1}:

\displaystyle    y_1 = y_0 + \frac{h}{2} (-y_0^2 - \tilde y_1^2) = 1 + \frac12 (-1 - 0 ) = \frac12

… which is right on target. Let’s do it again: the prediction {\tilde y_2 = y_1 + h (-y_1^2) = 1/4} and correction

\displaystyle    y_2 = y_1 + \frac{h}{2} (-y_1^2 - \tilde y_2^2) = \frac12 + \frac12 \left(-\frac{1}{4}-\frac{1}{16} \right) =    \frac{11}{32}

… which is {\approx 0.34}, compared to the exact solution {1/3}. Nice!

Trapezoidal RK2

I like the example because all the numbers are manageable with mental arithmetic for two steps. And it’s clear that {11/32} is pretty close to {1/3 = 11/33}.


But the fact that the relative error stays below {3.7\%} throughout the computation, despite the large step size {h=1}, seems… too good to be true. Consider that for the exact solution, the expression {y^2} can differ by a factor of {4} between two {t}-values at distance {1}.

The midpoint version of RK2, which generally performs about as well as the trapezoidal version, is nowhere near as close in this example. Indeed, using the same data as above, we get

\displaystyle    y_1 = y_0 - h \left(\frac{y_0+\tilde y_1}{2}\right)^2 = \frac{3}{4}

and so forth: the relative error reaches {59\%} at the second step. That’s 16 times worse than the trapezoidal version.

Midpoint RK2

What happens here has less to do with numerical analysis than with the algebra of rational functions. Using {h=1} in trapezoidal RK2, we are in effect iterating the function

\displaystyle \varphi(y) = y + \frac12 (-y^2-(y-y^2)^2) = y-y^2+y^3-\frac12 y^4

The exact solution would be obtained by iterating

\displaystyle \Phi (y) = \frac{y}{1+y} = y-y^2+ y^3 - y^4 + \dots

The two functions just happen to coincide at {y=1}, which is our starting point here.

Two iterands

From there we get to {0.5}, and on {[0,0.5]} they are really close anyway, due to another coincidence: the truncation error {|\varphi(y)-\Phi(y)|} is {O(y^4)} instead of {O(y^3)} as it is normally for second-order methods.

The midpoint method with {h=1} amounts to iterating

\displaystyle \psi(y) = y - \left(\frac{y+y-y^2}{2} \right)^2 = y-y^2+y^3-\frac14 y^4

which is not substantially further away from {\Phi}, but does not enjoy the lucky coincidence at {y=1}.


The tables are turned with the initial value {y_0=3}. The exact solution is {y(t) = 3/(1+3t)}, which drops sharply from {t=0} to {t=1}; its slope decreases by a factor of {16} during one step. Still, midpoint RK2 does a decent job with {h=1}:

Midpoint RK2 starting with y=3

while trapezoidal RK2 outputs {y_1=-19.5}, {y_2=-80110}, {y_3 = -2\cdot 10^{19}} and promptly overflows.


With a reasonable step size, like {h=0.1}, normality is restored: both methods perform about equally well, with {0.13\%} error for trapezoidal and {0.21\%} for midpoint.
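
Here is a minimal sketch of my own (the two RK2 variants written as generic one-step formulas); running it with y0 = 1 and h = 1 reproduces the values above, and changing y0 and h reproduces the other experiments.

def rk2_trapezoidal(f, y, t, h):
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)      # slope at the Euler prediction
    return y + h * (k1 + k2) / 2

def rk2_midpoint(f, y, t, h):
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2 * k1)  # slope at the predicted midpoint
    return y + h * k2

f = lambda t, y: -y**2
y0, h, steps = 1.0, 1.0, 3
y_trap = y_mid = y0
for i in range(steps):
    t = i * h
    y_trap = rk2_trapezoidal(f, y_trap, t, h)
    y_mid = rk2_midpoint(f, y_mid, t, h)
    print(t + h, y_trap, y_mid, y0 / (1 + y0 * (t + h)))  # both variants vs the exact solution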

Calculating 2*2 with high precision

The definition of derivative,

\displaystyle    f'(x) = \lim_{h\rightarrow 0}\frac{f(x+h) - f(x)}{h}   \ \ \ \ \ \ \ \ \ \ (1)

is not such a great way to actually find the derivative numerically. Its symmetric version,

\displaystyle    f'(x) = \lim_{h\rightarrow 0}\frac{f(x+h) - f(x-h)}{2h}   \ \ \ \ \ \ \ \ \ \ (2)

performs much better in computations. For example, consider the derivative of {f(x)=e^x} at the point {x=1}. We know that {f'(1)=2.718281828\dots}. Numerically, with {h=0.001}, we get

\displaystyle    \frac{f(1+h) - f(1)}{h} \approx \mathbf{2.71}9 \dots

(error {>10^{-3}}) versus

\displaystyle    \frac{f(1+h) - f(1-h)}{2h} \approx \mathbf{2.71828}2 \dots

(error {<10^{-6}}).
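
A quick numerical check of these error estimates (a minimal sketch; the exact digits depend slightly on rounding):

import math

f, x, h = math.exp, 1.0, 0.001
forward   = (f(x + h) - f(x)) / h            # one-sided difference quotient (1)
symmetric = (f(x + h) - f(x - h)) / (2 * h)  # symmetric difference quotient (2)
print(forward - math.e, symmetric - math.e)  # roughly 1.4e-3 versus 4.5e-7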

Considering this, why don’t we ditch (1) altogether and adopt (2) as the definition of derivative? Just say that by definition,

\displaystyle    f'(x) = \lim_{h\rightarrow 0}\frac{f(x+h)-f(x-h)}{2h}

whenever the limit exists.

This expands the class of differentiable functions: for example, {f(x)=|x|} becomes differentiable with {f'(0)=0}. Which looks more like a feature than a bug: after all, {f} has a minimum at {0}, and the horizontal line through the minimum is the closest thing to the tangent line that it has.

Another example: the function

\displaystyle     f(x) = \begin{cases} x ,\quad & x\le 0 \\ 3x,\quad & x>0 \end{cases}

has {f'(0)=2} under this definition, because

\displaystyle    \lim_{h\rightarrow 0^+}\frac{f(0+h)-f(0-h)}{2h} = \lim_{h\rightarrow 0^+}\frac{3h-(-h)}{2h} = 2

and

\displaystyle    \lim_{h\rightarrow 0^-}\frac{f(0+h)-f(0-h)}{2h} = \lim_{h\rightarrow 0^-}\frac{h-3(-h)}{2h} = 2

This example also makes sense: since {f(x)=|x|+2x}, getting {f'(0)=0+2} is expected. In fact, with the new definition we still have basic derivative rules: if {f,g} are differentiable, then {f+g}, {f-g}, {fg}, {f/g} are also differentiable (with the usual caveat about {g\ne 0}) and the familiar formulas hold.

Let’s test the chain rule on the function {g = f\circ f}. The rule says

\displaystyle     g'(0) = f'(f(0)) f'(0)    \ \ \ \ \ \ \ \ \ \ (3)

Since {f(0)=0}, the product on the right is {2\cdot 2}. On the other hand,

\displaystyle     g(x) = \begin{cases} x ,\quad & x\le 0 \\ 9x,\quad & x>0 \end{cases}

which implies, by a computation similar to the above, that {g'(0)=5}. So, if we want to have the chain rule (3), we must accept that

\displaystyle     \mathbf{2\cdot 2 = 5}

This is where the desire for high numerical precision leads.

Plenty of other things go wrong with the symmetric definition:

  • Maximum or minimum of {f} may be attained where {f'} exists and is nonzero.
  • A differentiable function may be discontinuous.
  • Having {f'>0} everywhere does not imply that {f} is increasing.
  • Mean Value Theorem fails.