Experiments with the significance of autocorrelation

Given a sequence of numbers {x_j} of length {L}, one may want to look for evidence of its periodic behavior. One way to do this is by computing autocorrelation, the correlation of the sequence with a shift of itself. Here is one reasonable way to do so: for lag values {\ell=1,\dots, \lfloor L/2 \rfloor} compute the correlation coefficient of {(x_1,\dots, x_{L-\ell})} with {(x_{\ell+1},\dots, x_L)}. That the lag does not exceed {L/2} ensures the entire sequence participates in the computation, so we are not drawing a conclusion about its periodicity from a handful of terms at the beginning and the end. In other words, we are not going to detect periodicity if the period is more than half of the observed time span.
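
Spelled out in NumPy, this definition looks as follows (an unoptimized sketch; the function name is arbitrary, and a faster computation appears at the end of the post):

import numpy as np

def lag_correlations(x):
    # correlation coefficient of (x_1,...,x_{L-l}) with (x_{l+1},...,x_L)
    # for each lag l = 1, ..., floor(L/2)
    L = len(x)
    return np.array([np.corrcoef(x[:L - l], x[l:])[0, 1]
                     for l in range(1, L // 2 + 1)])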

Having obtained the correlation coefficients, pick the one with the largest absolute value; call it {R}. How large does {R} have to be for us to conclude that the correlation is not a fluke? The answer depends on the distribution of the data, but an experiment can be used to get some idea of the likelihood of large values of {R}.

I picked {x_j} independently from the standard normal distribution and computed {R} as above. After 5 million trials with sequences of length 100, the distribution of {R} was as follows:
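
A scaled-down version of this experiment, reusing the sketch above (far fewer trials than 5 million, so the tail estimates are rough):

rng = np.random.default_rng()
trials, L = 10_000, 100

R = np.empty(trials)
for t in range(trials):
    c = lag_correlations(rng.standard_normal(L))
    R[t] = c[np.abs(c).argmax()]    # extremal coefficient, sign preserved

print("P(|R| > 0.5) =", np.mean(np.abs(R) > 0.5))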

[Figure: Extremal correlation coefficient in a sequence of length 100]

Based on this experiment, the probability of obtaining {|R| > 0.5} is less than 0.0016. So, 0.5 is pretty solid evidence. The probability of {|R| > 0.6} is two orders of magnitude smaller, etc. Also, {|R|} is unlikely to be very close to zero unless the data is structured in some strange way; some kind of correlation ought to be present even in white noise.

Aside: it’s not easy to construct perfectly non-autocorrelated sequences for the above test. For length 5 an example is 1,2,3,2,3. Indeed, (1,2,3,2) is uncorrelated with (2,3,2,3) and (1,2,3) is uncorrelated with (3,2,3). For length 6 and more I can’t construct these without filling them with a bunch of zeros.
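
A quick numerical check of the length-5 example (both correlations come out to zero):

import numpy as np
x = np.array([1, 2, 3, 2, 3])
print(np.corrcoef(x[:4], x[1:])[0, 1])   # lag 1
print(np.corrcoef(x[:3], x[2:])[0, 1])   # lag 2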

Repeating the experiment with sequences of length 1000 shows a tighter distribution of {R}: now {|R|} is unlikely to be above 0.2. So, if a universal threshold is to be used, {R} has to be rescaled according to the sequence length.

[Figure: Extremal correlation coefficient in a sequence of length 1000]

I did not look hard for statistical studies of this subject, resorting to an experiment instead. The experimentally obtained p-values are pretty consistent for the criterion {L^{0.45}|R| > 4}. The number of trials (10000 for each length) was not very large, so there is some fluctuation, but the pattern is clear:
 

Length L    P(L^{0.45}|R| > 4)
100         0.002
300         0.0028
500         0.0022
700         0.0028
900         0.0034
1100        0.0036
1300        0.0039
1500        0.003
1700        0.003
1900        0.0042
2100        0.003
2300        0.0036
2500        0.0042
2700        0.0032
2900        0.0043
3100        0.0042
3300        0.0025
3500        0.0031
3700        0.0027
3900        0.0042

Naturally, all this depends on the assumption of independent normal variables.

And this is the approach I took to computing {R} in Python:

import numpy as np
n = 1000
x = np.random.normal(size=(n,))
# Raw lagged products: with mode='same' the entries after position n//2
# correspond to lags 1, 2, ..., n//2 - 1.
acorr = np.correlate(x, x, mode='same')
# Divide by the number of overlapping terms at each lag (n-1 down to n//2+1)
# and by the variance; since the mean of x is near 0, this approximates the
# correlation coefficient at each lag.
acorr = acorr[n//2+1:]/(x.var()*np.arange(n-1, n//2, -1))
r = acorr[np.abs(acorr).argmax()]   # R: coefficient of largest absolute value
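
Wrapping this computation in a loop gives a rough Monte Carlo estimate of the p-values in the table above (a sketch; the helper name is arbitrary, and the trial count here is smaller than the 10000 used for the table):

import numpy as np
rng = np.random.default_rng()

def extremal_R(x):
    # the same approximate computation as above, packaged as a function
    n = len(x)
    acorr = np.correlate(x, x, mode='same')
    acorr = acorr[n//2 + 1:] / (x.var() * np.arange(n - 1, n//2, -1))
    return acorr[np.abs(acorr).argmax()]

L, trials = 1000, 2000
exceed = sum(L**0.45 * abs(extremal_R(rng.standard_normal(L))) > 4
             for _ in range(trials))
print("P(L^0.45 |R| > 4) =", exceed / trials)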

When the digits of pi go to 11

There is an upward trend in the digits of {\pi}. I just found it using Maple.

with(Statistics):
X := [0, 1, 2, 3, 4, 5, 6, 7, 8]:
Y := [3, 1, 4, 1, 5, 9, 2, 6, 5]:
LinearFit([1, n], X, Y, n);

2.20000000000000+.450000000000000*n

Here the digits are enumerated beginning with the {0}th, which is {3}. The regression line {y = 2.2 + 0.45n} predicts that the {20}th digit of {\pi} is approximately {11}.
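
The same fit is easy to reproduce in NumPy, for anyone without Maple at hand (a sketch using polyfit in place of LinearFit):

import numpy as np
X = np.arange(9)                  # positions of the digits, 0 through 8
Y = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5])
slope, intercept = np.polyfit(X, Y, 1)
print(intercept, slope)           # approximately 2.2 and 0.45
print(intercept + slope * 20)     # the "predicted" 20th digit: about 11.2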

[Figure: It goes to 11]

But maybe my data set is too small. Let’s throw in one more digit; that ought to be enough. The next digit turns out to be {3}, and this hurts my trend. The new regression line {y=2.67+0.27n} has a smaller slope, and it crosses the old one at {n\approx 2.7}.

[Figure: Next digit, not as good]

But we all know that {3} can be easily changed to {8}. The old “professor, you totaled the scores on my exam incorrectly” trick. Finding a moment when none of the {\pi}-obsessed people are looking, I change the decimal expansion of {\pi} to {3.141592658\dots}. The new trend looks even better than the old one: the regression line became steeper, and it crosses the old one at the point {n\approx 2.7}.

[Figure: Much better!]

What, {2.7} again? Is this a coincidence? I try changing the {9}th digit to other numbers, and plot the resulting regression lines.
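
Here is a sketch of that experiment in NumPy rather than Maple (variable names are arbitrary): refit the first ten digits with every possible value of the 9th digit and plot all the lines together with the original one.

import numpy as np
import matplotlib.pyplot as plt

X9 = np.arange(9)
Y9 = [3, 1, 4, 1, 5, 9, 2, 6, 5]
n = np.linspace(0, 10, 2)

# the original fit to the first 9 digits
b0, a0 = np.polyfit(X9, Y9, 1)          # slope, intercept
plt.plot(n, a0 + b0 * n, 'k--', label='first 9 digits')

# fits with each candidate value of the 9th digit appended
X10 = np.arange(10)
for d in range(10):
    b, a = np.polyfit(X10, Y9 + [d], 1)
    plt.plot(n, a + b * n, alpha=0.5)

plt.legend()
plt.show()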

[Figure: What is going on?]

All intersect at the same spot. The hidden magic of {\pi} is uncovered.

(Thanks to Vincent Fatica for the idea of this post.)