35 categories of Stack Overflow comments

Google’s BigQuery dataset now includes Stack Overflow data dump, including the text of over 50 million comments posted on the site. What do these comments say? I picked the most frequent ones and grouped them by topic. The counts are an underestimate: there is only so much time I was willing to spend organizing synonymous comments.

  1. Thank you” comments (128960 in total) are the most common by far. Typical forms: Thank you very/so much!, Thanks a lot :), Perfect, thanks! The popularity of the emoticon in the second version is attributable to the minimal length requirement for comments: they must contain at least 15 characters. The laziest way to pad the text is probably Thank you……
  2. You are welcome” (50090), presumably posted in response to group 1 comments. You’re welcome. You’re welcome! You’re welcome 🙂 Users need that punctuation or emoticon to reach 15 characters. Although those not contracting “you are” don’t have this problem.
  3. Updated answer” (30979) invites whoever raised objections about the previous version of the answer to read it again.
  4.  “What is your question?” (20830) is the most common type of critical comments toward questions.
  5. This is not an answer” (17306) is the most common criticism for answers; usually posted automatically by reviewers. Typical form: This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. Another one is for questions posted in the answer box: If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context.
  6. What error are you getting?” (13439) is a request for debugging information.
  7. What have you tried?” (12640) often comes with the link whathaveyoutried.com and is a sufficiently notorious kind of comments that Stack Overflow software deletes them if anyone “flags” the comment. And it’s easy to cast flags automatically, so I substantially reduced the number of such comments since this data dump was uploaded. Further context: Should Stack Overflow (and Stack Exchange in general) be awarding “A”s for Effort?
  8. Post your code” (11486) can sometimes be a form of “what have you tried”; in other times it’s a logical response to someone posting an error message without the code producing it. Can you post your code? Post your code. Please post your code. Show your code. Where is your code? And so on.
  9. It does not work” (9674) — either the question author, or someone else with the same issue did not benefit from the solution. Maybe it’s wrong, maybe they used it wrong.
  10. This is a link-only answer” (9501) usually comes from reviewers in the standard form While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
  11. I updated the question” (8060), presumably in response to critical comments.
  12. Why the downvote(s)?” (6005) is asking whoever voted down the post to explain their position. Usually fruitless; if the voter wanted to say something, they would already.
  13. This is a duplicate question” (3859) is inserted automatically when someone moves for a question to be marked as a duplicate. Such comments are normally deleted automatically when the required number of close-votes is reached; but some remain. The most common by far is possible duplicate of What is a Null Pointer Exception, and how do I fix it? 
  14. I edited your title” (3795) is directed at users who title their questions like “Java: How to read a CSV file?”, using a part of the title as a tag. Standard form: I have edited your title. Please see, “Should questions include “tags” in their titles?“, where the consensus is “no, they should not”.
  15. Post a MCVE” (3775) – the line on which the error is thrown is probably not enough to diagnose the problem; on the other hand, a wall of code with an entire program is too much. One of standard forms: Questions seeking debugging help (“why isn’t this code working?”) must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example.
  16. That is correct” (3158)  usually refers to a statement made in another comment.
  17. It works” (3109) is the counterpart of group 9 above. Often used with “like a charm” but do charms actually work?
  18. What do you mean?” (2998) – for when an exchange in comments leads to more confusion.
  19. What tool are you using?” (2649) indicates that the question author forgot to specify either the language, OS, or the DBMS they are using.
  20. Good answer” (2607) – various forms of praise, This should be the accepted answer. This is the correct answer. Excellent answer! The first form additionally indicates that the question author did not pick the best answer as “accepted”.
  21. This question is off-topic” (2377) is a template for close votes with a custom explanation. For some years Stack Overflow used This question appears to be off-topic because… but then switched to the more assertive I’m voting to close this question as off-topic because…
  22. This is a low quality answer” (2003) is a response to answers that contain nothing but code, perhaps preceded by “try this”. Example: While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
  23. Is this homework?” (1995) is not a particularly fruitful type of comments.
  24. Does this work?” (1916) is meant to obtain some response from question asker who has not yet acknowledged the answer.
  25. The link is dead” (1250) is a major reason why group 10 comments exist.
  26. http://stackoverflow.com/help/ . . .” (1117) and nothing but the link. Directs to one of Help Center articles such as “How to ask”. Maybe there should also be “How to Comment”
  27. Thanks are discouraged” (1046) … so all those group 1 comments aren’t meant to be. But this is mostly about posts rather than comments. Unlike forum sites, we don’t use “Thanks”, or “Any help appreciated”, or signatures on Stack Overflow. See “Should ‘Hi’, ‘thanks,’ taglines, and salutations be removed from posts?.
  28. Format your code” (967) – yes, please. Select the code block and press Ctrl-K. Thanks in advance. Oops, forgot about the previous group.
  29. What doesn’t work?” (926) is a response to vague comments of group 9.
  30. Don’t use mysql_* functions” (693) or Russian hackers will pwn your site. Comes with a link-rich explanation: Please, don’t use `mysql_*` functions in new code. They are no longer maintained and are officially deprecated. See the red box? Learn about prepared statements instead, and use PDO or MySQLithis article will help you decide which. If you choose PDO, here is a good tutorial.
  31. Add tags” (625) often comes up in the context of database questions. Which RDBMS is this for? Please add a tag to specify whether you’re using `mysql`, `postgresql`, `sql-server`, `oracle` or `db2` – or something else entirely.
  32. Improve title” (585) is like group 14, but invites the user to edit the title instead of doing it for them.
  33. Use modern JOIN syntax” (301) bemoans obsolete ways of dealing with databases. Bad habits to kick : using old-style JOINs – that old-style *comma-separated list of tables* style was replaced with the *proper* ANSI `JOIN` syntax in the ANSI-92 SQL Standard (more than 20 years ago) and its use is discouraged.
  34. More SQL woes” (272) is another template: Side note: you should not use the `sp_` prefix for your stored procedures. Microsoft has reserved that prefix for its own use (see *Naming Stored Procedures*, and you do run the risk of a name clash sometime in the future. It’s also bad for your stored procedure performance. It’s best to just simply avoid `sp_` and use something else as a prefix – or no prefix at all!
  35. I have the same problem” (241) is a kind of comments that really should not exist.

Vectorization of integration

A frequently asked question about numerical integration is: how to speed up the process of computing {\int_a^b f(x,p)\,dx} for many values of parameter {p}? Running an explicit for over the values of {p} seems slow… is there a way to vectorize this, performing integration on an array of functions at once (which in reality means, pushing the loop to a lower level language)?

Usually, the answer is some combination of “no” and “you are worrying about the wrong thing”. Integration routines are typically adaptive, meaning they pick evaluation points based on the function being integrated. Different values of the parameter will require different evaluation points to achieve the desired precision, so vectorization is out of question. This applies, for example, to quad method of SciPy integration module.

Let’s suppose we insist on vectorization and are willing to accept a non-adaptive method. What are our options, considering SciPy/NumPy? I will compare

The test case is {\int_0^1 (p+1)x^p\,dx} with {p=0,0.01,\dots,99.99}. Theoretically all these are equal to {1}. But most of these functions are not analytic at {0}, and some aren’t even differentiable there, so their integration is not a piece of cake.

Keeping in mind that Romberg’s integration requires {2^k} subintervals, let’s use {1024} equal subintervals, hence {1025} equally spaced sample points. Here is the vectorized evaluation of all our functions at these points, resulting in a 2D array “values”:

import numpy as np
import scipy.integrate as si
x = np.linspace(0, 1, 1025).reshape(1, -1)
dx = x[0,1] - x[0,0]
p = np.arange(0, 100, 0.01).reshape(-1, 1)
values = (p+1)*x**p

This computation took 0.799 seconds on my AWS instance (t2.nano). Putting the results of evaluation together takes less time:

  • Trapezoidal rule np.trapz(values, dx=dx) took 0.144 seconds and returned values ranging from 0.99955 to 1.00080.
  • Simpson’s rule si.simps(values, dx=dx) took 0.113 seconds and returned values ranging from 0.99970 to 1.0000005.
  • Romberg quadrature si.romb(values, dx=dx) was fastest at 0.0414 seconds, and also most accurate, with output between 0.99973 and 1.000000005.

Conclusions so far:

  • SciPy’s implementation of Romberg quadrature is surprisingly fast, considering that this algorithm is the result of repeated Richardson extrapolation of the trapezoidal rule (and Simpson’s rule is just the result of the first extrapolation step). It’d be nice to backport whatever makes Romberg so effective back to Simpson’s rule.
  • The underestimation errors 0.999… are greatest when {p} is near zero, so the integrand is very nonsmooth at {0}. The lack of smoothness renders Richardson extrapolation ineffective, hence all three rules have about the same error here.
  • The overestimation errors are greatest when {p} is large. The function is pretty smooth then, so upgrading from trapezoidal to Simpson to Romberg brings substantial improvements.

All of this is nice, but the fact remains that non-vectorized adaptive integration is both faster and much more accurate. The following loop with quad, which uses adaptive Clenshaw-Curtis quadrature, runs in 0.616 seconds (less than it took to evaluate our functions on a grid) and returns values between 0.99999999995 and 1.00000000007. Better to use exponential notation here: {1-5\cdot 10^{-11} } and {1+7\cdot 10^{-11}}.

funcs = [lambda x, p=p_: (p+1)*x**p for p_ in np.arange(0, 100, 0.01)]
result = [si.quad(fun, 0, 1)[0] for fun in funcs]

An adaptive algorithm shines when {p} is small, by placing more evaluation points near zero. It controls both over- and under- estimates equally well, continuing to add points until the required precision is reached.

The last hope of non-adaptive methods is Gaussian quadrature, which is implemented in SciPy as fixed_quad (“fixed” referring to the number of evaluation points used). With 300 evaluation points, the integration takes 0.263 seconds (excluding the computation of nodes and weights, which can be cached) and the results are between {1-2\cdot 10^{-12}} and {1+2\cdot 10^{-7}}. This is twice as fast as a loop with quad, and more accurate for large {p} — but sadly, not so accurate for small {p}. As said before, {x^p} with {p} near zero is really a showcase for adaptive methods.

Kolakoski turtle curve

Let’s take another look at the Kolakoski sequence (part 1, part 2) which, by definition, is the sequence of 1s and 2s in which the nth term is equal to the length of the nth run of consecutive equal numbers in the same sequence. When a sequence has only two distinct entries, it can be visualized with the help of a turtle that turns left (when the entry is 1) or right (when the entry is 2). This visualization method seems particularly appropriate for the  Kolakoski sequence since there are no runs of 3 equal entries, meaning the turtle will never move around a square of sidelength equal to its step. In particular, this leaves open the possibility of getting a simple curve… Here are the first 300 terms; the turtle makes its first move down and then goes left-right-right-left-left-right-left-… according to the terms 1,2,2,1,1,2,1,…

300 terms of the sequence
300 terms of the sequence

No self-intersections yet… alas, at the 366th term it finally happens.

First self-intersection: 366 terms
First self-intersection: 366 terms

Self-intersections keep occurring after that:

1000 terms
1000 terms

again and again…

5000 terms
5000 terms

Okay, the curve obviously doesn’t mind intersecting self. But it can’t be periodic since the Kolakoski sequence isn’t. This leaves two questions unresolved:

  • Does the turtle ever get above its initial position? Probably… I haven’t tried more than 5000 terms
  • Is the curve bounded? Unlikely, but I’ve no idea how one would dis/prove that. For example, there cannot be a long diagonal run (left-right-left-right-left) because having 1,2,1,2,1 in the sequence implies that elsewhere, there are three consecutive 1s, and that doesn’t happen.

Here’s the Python code used for the above. I represented the sequence as a Boolean array with 1 = False, 2 = True.

import numpy as np
import turtle
n = 500                   # number of terms to compute
a = np.zeros(n, dtype=np.bool_)
j = 0                     # the index to look back at 
same = False   # will next term be same as current one?
for i in range(1, n):
    if same:
        a[i] = a[i-1]     # current run continues
        same = False
        a[i] = not a[i-1] # the run is over
        j += 1            # another run begins
        same = a[j]       # a[j] determines its length

for i in range(n):
    turtle.forward(10)    # used steps of 10 or 5 pixels 
    if a[i]:

Wild power pie

Many people are aware of {\pi} being a number between 3 and 4, and some also know that {e} is between 2 and 3. Although the difference {\pi-e} is less than 1/2, it’s enough to place the two constants in separate buckets on the number line, separated by an integer.

When dealing with powers of {e}, using {e>2} is frequently wasteful, so it helps to know that {e^2>7}. Similarly, {\pi^2<10} is way more precise than {\pi<4}. To summarize: {e^2} is between 7 and 8, while {\pi^2} is between 9 and 10.

Do any two powers of {\pi} and {e} have the same integer part? That is, does the equation {\lfloor \pi^n \rfloor = \lfloor e^m \rfloor} have a solution in positive integers {m,n}?

Probably not. Chances are that the only pairs {(m,n)} for which {|\pi^n - e^m|<10} are {m,n\in \{1,2\}}, the smallest difference attained by {m=n=1}.

Indeed, having {|\pi^n - e^m|<1} implies that {|n\log \pi - m|<e^{-m}}, or put differently, {\left|\log \pi - \dfrac{m}{n}\right| < \dfrac{1}{n \,\pi^n}}. This would be an extraordinary rational approximation… for example, with {n=100} it would mean that {\log \pi = 1.14\ldots} with the following {50} digits all being {0}. This isn’t happening.

Looking at the continued fraction expansion of {\log \pi} shows the denominators of modest size {[1; 6, 1, 10, 24, \dots]}, indicating the lack of extraordinarily nice rational approximations. Of course, can use them to get good approximations, {\left|\log \pi - \dfrac{m}{n}\right| < \dfrac{1}{n^2}}, which leads to {\pi^n\approx e^m} with small relative error. For example, dropping {24} and subsequent terms yields the convergent {87/76}, and one can check that {\pi^{76} = 6.0728... \cdot 10^{37}} while {e^{87} = 6.0760...\cdot 10^{37}}.

Trying a few not-too-obscure constants with the help of mpmath library, the best coincidence of integer parts that I found is the following: the 13th power of the golden ratio {\varphi = (\sqrt{5}+1)/2} and the 34th power of Apèry’s constant {\zeta(3) = 1^{-3}+2^{-3}+3^{-3}+4^{-4}+\dots} both have integer part 521.

A necklace of tears

The problem is: given a set of objects drawn from several groups, put all of them in a row in a “uniform” way. Whatever that means.

For example, suppose we have 21 gemstones: 9 red, 5 blue, 3 green, 2 cyan, 1 magenta and 1 yellow. How to place them on a string to make a reasonably looking necklace? The criterion is subjective, but we can probably agree that

Arrangement #1
Arrangement #1

looks better than

Arrangement #2
Arrangement #2

(referring to the uniformity of distributions, not the aesthetics of color.)

The approach I took was to repeatedly add the “most underrepresented” gem, defined by maximal difference between projected frequency of appearance (e.g., 5/21 for blue) and the frequency of appearance so far in the string. (Taking the convention 0/0=0 here.) Here is this algorithm in Python — not in Ruby, which would be more topical.

count = {'r': 9, 'b': 5, 'g': 3, 'c': 2, 'm': 1, 'y': 1}
length = sum(count.values())
str = ''
while len(str) < length:
    deficit = {}
    for char in count:
        deficit[char] = count[char]/length - (str.count(char)/len(str) if str else 0)
    str += max(deficit, key=deficit.get)

The output, “rbgcryrbrmgrbrcbrgrbr”, is what the first image in this post represents. The second image is the result of an ill-fated attempt to replace difference by ratio when determining the under-representation.

I initially thought that two unique gems (yellow and magenta) would end up together, but this hasn’t happened: after one is added, the frequency of more common gems drops, allowing them to come back into play for a while. Still, the left half of the string is noticeably more diverse than the right half. It’d be better if two unique gems were in roughly symmetrical position, and generally there would be no preference between left-right and right-left directions.

Perhaps the new character should be added to the string either on the right or on the left, in alternating fashion. That should make things nice and symmetric, right?


Alternating concatenation
Alternating concatenation

The search continues…

Update: Rahul suggested in a comment to adjust the deficit computation to

deficit[char] = count[char]/length - str.count(char)/(len(str) + 1)

This has some advantages but on the other hand, two unique gems (magenta and yellow) are placed next to each other, which is something I wanted to avoid.

Dividing by len(str) + 1

Bankers’ rounding and floating point arithmetics

Bankers’ rounding means preferring even digits in edge cases: both 3.5 and 4.5 are rounded to 4. When rounding a random amount of money to the nearest dollar in this way, the expected value added or subtracted is zero. Indeed: 0 cents don’t get rounded, 1 through 49 get rounded down, 51 through 99 get rounded up (and the amounts match 1-49), so 50 has to be split between the two directions.

Python follows this convention: rounding the numbers 1/2, 3/2, 5/2, … via

[round((2*k+1)/2) for k in range(10)]

results in [0, 2, 2, 4, 4, 6, 6, 8, 8, 10].

However, when rounding 0.05 = 1/20, 0.15 = 3/20, … to 1 digit after the decimal dot, we get a surprise:

[round((2*k+1)/20, 1) for k in range(10)]

returns [0.1, 0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.8, 0.9]. An irregularly spaced array, in which odd digits win 6 to 4…

This is because the numbers like 0.05 cannot be precisely represented in the double-precision format used for floating point numbers in Python (and other languages): the denominator of 1/20 is not a power of 2. And so, 0.05 gets represented by 3602879701896397/256. Since 0.05*256 = 3602879701896396.8, this representation is greater than the real number 0.05. And this pushes the rounding routine over the edge, toward 0.1 rather than 0.0.

Out of the numbers tested above, only 0.25 = 1/4 and 0.75 = 3/4 get a precise binary representation and are therefore rounded predictably. For the other 8 the direction of rounding has nothing to do with bankers… oddly enough, 6 out of these 8 are rounded toward an odd digit.

This isn’t the end of surprises, though. Recalling that numerical manipulations on arrays are best done with NumPy library, we may want to import it and try

numpy.around([(2*k+1)/20 for k in range(10)], 1)

The output ought to make any banker happy: [0., 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.8, 0.8, 1.]

Why? NumPy takes a different approach to rounding: when asked to round to 1 digit after the dot, it multiplies the numbers by 10, then rounds to the nearest integer, and divides by 10. Multiplication by 10 restores sanity: 0.05 * 10 = 0.5 is represented exactly and gets rounded to 0; similarly for the others.

This algorithm isn’t perfect, however. When rounding 0.005, 0.015, …, 0.995 to two decimal digits, it misplaces 0.545 (to 0.55) and 0.575 (to 0.57). The other 98 numbers are rounded as expected; tested with

numpy.around([(2*k+1)/200 for k in range(100)], 2)

Out of 1000 numbers of the form (2k+1)/2000, k=0,\dots, 999, there are six deviations from the bankers’ rule. They all appear together, in increments of 0.002, from 0.5015 to 0.5115.

Is there a rounding algorithm that follows bankers’ convention for all (not too long) finite decimal expansions?

Autogenerated numbers 2

Although this is a follow up to an earlier post, it can be read independently.

Repeatedly squaring the number 5, we get the following.

5^1 =                        5
5^2 =                       25
5^4 =                      625
5^8 =                   390625
5^16 =            152587890625
5^32 = 23283064365386962890625

There is a pattern here, which becomes more evident if the table is truncated to a triangle:

5^1 =                        5
5^2 =                       25
5^4 =                      625
5^8 =                  ...0625
5^16 =                ...90625
5^32 =               ...890625
5^64 =              ...2890625
5^128 =            ...12890625
5^256 =           ...212890625

What other digits, besides 5, have this property? None in the decimal system (ignoring the trivial 0 and 1). But the same pattern is shown by 3 in base 6 (below, the exponent is still in base 10, for clarity).

3^1 =                        3
3^2 =                       13
3^4 =                      213
3^8 =                  ...0213
3^16 =                ...50213
3^32 =               ...350213
3^64 =              ...1350213

and so on.

Here is a proof that the pattern appears when the base is 2 mod 4 (except 2), and the starting digit is half the base.

Let {x} be the starting digit, and {b=2x} the base. The claim is that {x^{2^k}\equiv x^{2^{k-1}} \bmod b^k} for all {k=1,2,\dots}.

Base of induction: {x^2-x = x(x-1)} is divisible by {x} and also by {2}, since {x-1} is even. Hence it is divisible by {2x=b}.

Step of induction: Suppose {x^{2^k} - x^{2^{k-1}}} is divisible by {b^k}. Then
{x^{2^{k+1}} - x^{2^{k}} = (x^{2^k} - x^{2^{k-1}})(x^{2^k} + x^{2^{k-1}})}
The second factor is divisible by {x} and also by {2}, being the sum of two odd numbers. So, it is divisible by {b}. Since the first factor is divisible by {b^k}, the claim follows: {x^{2^{k+1}} - x^{2^{k}} } is divisible by {b^{k+1}}.

So that was easy. The part I can’t prove is the converse. Numerical evidence strongly suggests that {b=2\bmod 4}, {x=b/2}
is also the necessary condition for the triangular pattern to appear.