35 categories of Stack Overflow comments

Google’s BigQuery dataset now includes Stack Overflow data dump, including the text of over 50 million comments posted on the site. What do these comments say? I picked the most frequent ones and grouped them by topic. The counts are an underestimate: there is only so much time I was willing to spend organizing synonymous comments.

  1. Thank you” comments (128960 in total) are the most common by far. Typical forms: Thank you very/so much!, Thanks a lot :), Perfect, thanks! The popularity of the emoticon in the second version is attributable to the minimal length requirement for comments: they must contain at least 15 characters. The laziest way to pad the text is probably Thank you……
  2. You are welcome” (50090), presumably posted in response to group 1 comments. You’re welcome. You’re welcome! You’re welcome 🙂 Users need that punctuation or emoticon to reach 15 characters. Although those not contracting “you are” don’t have this problem.
  3. Updated answer” (30979) invites whoever raised objections about the previous version of the answer to read it again.
  4.  “What is your question?” (20830) is the most common type of critical comments toward questions.
  5. This is not an answer” (17306) is the most common criticism for answers; usually posted automatically by reviewers. Typical form: This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. Another one is for questions posted in the answer box: If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context.
  6. What error are you getting?” (13439) is a request for debugging information.
  7. What have you tried?” (12640) often comes with the link whathaveyoutried.com and is a sufficiently notorious kind of comments that Stack Overflow software deletes them if anyone “flags” the comment. And it’s easy to cast flags automatically, so I substantially reduced the number of such comments since this data dump was uploaded. Further context: Should Stack Overflow (and Stack Exchange in general) be awarding “A”s for Effort?
  8. Post your code” (11486) can sometimes be a form of “what have you tried”; in other times it’s a logical response to someone posting an error message without the code producing it. Can you post your code? Post your code. Please post your code. Show your code. Where is your code? And so on.
  9. It does not work” (9674) — either the question author, or someone else with the same issue did not benefit from the solution. Maybe it’s wrong, maybe they used it wrong.
  10. This is a link-only answer” (9501) usually comes from reviewers in the standard form While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
  11. I updated the question” (8060), presumably in response to critical comments.
  12. Why the downvote(s)?” (6005) is asking whoever voted down the post to explain their position. Usually fruitless; if the voter wanted to say something, they would already.
  13. This is a duplicate question” (3859) is inserted automatically when someone moves for a question to be marked as a duplicate. Such comments are normally deleted automatically when the required number of close-votes is reached; but some remain. The most common by far is possible duplicate of What is a Null Pointer Exception, and how do I fix it? 
  14. I edited your title” (3795) is directed at users who title their questions like “Java: How to read a CSV file?”, using a part of the title as a tag. Standard form: I have edited your title. Please see, “Should questions include “tags” in their titles?“, where the consensus is “no, they should not”.
  15. Post a MCVE” (3775) – the line on which the error is thrown is probably not enough to diagnose the problem; on the other hand, a wall of code with an entire program is too much. One of standard forms: Questions seeking debugging help (“why isn’t this code working?”) must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: How to create a Minimal, Complete, and Verifiable example.
  16. That is correct” (3158)  usually refers to a statement made in another comment.
  17. It works” (3109) is the counterpart of group 9 above. Often used with “like a charm” but do charms actually work?
  18. What do you mean?” (2998) – for when an exchange in comments leads to more confusion.
  19. What tool are you using?” (2649) indicates that the question author forgot to specify either the language, OS, or the DBMS they are using.
  20. Good answer” (2607) – various forms of praise, This should be the accepted answer. This is the correct answer. Excellent answer! The first form additionally indicates that the question author did not pick the best answer as “accepted”.
  21. This question is off-topic” (2377) is a template for close votes with a custom explanation. For some years Stack Overflow used This question appears to be off-topic because… but then switched to the more assertive I’m voting to close this question as off-topic because…
  22. This is a low quality answer” (2003) is a response to answers that contain nothing but code, perhaps preceded by “try this”. Example: While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
  23. Is this homework?” (1995) is not a particularly fruitful type of comments.
  24. Does this work?” (1916) is meant to obtain some response from question asker who has not yet acknowledged the answer.
  25. The link is dead” (1250) is a major reason why group 10 comments exist.
  26. http://stackoverflow.com/help/ . . .” (1117) and nothing but the link. Directs to one of Help Center articles such as “How to ask”. Maybe there should also be “How to Comment”
  27. Thanks are discouraged” (1046) … so all those group 1 comments aren’t meant to be. But this is mostly about posts rather than comments. Unlike forum sites, we don’t use “Thanks”, or “Any help appreciated”, or signatures on Stack Overflow. See “Should ‘Hi’, ‘thanks,’ taglines, and salutations be removed from posts?.
  28. Format your code” (967) – yes, please. Select the code block and press Ctrl-K. Thanks in advance. Oops, forgot about the previous group.
  29. What doesn’t work?” (926) is a response to vague comments of group 9.
  30. Don’t use mysql_* functions” (693) or Russian hackers will pwn your site. Comes with a link-rich explanation: Please, don’t use `mysql_*` functions in new code. They are no longer maintained and are officially deprecated. See the red box? Learn about prepared statements instead, and use PDO or MySQLithis article will help you decide which. If you choose PDO, here is a good tutorial.
  31. Add tags” (625) often comes up in the context of database questions. Which RDBMS is this for? Please add a tag to specify whether you’re using `mysql`, `postgresql`, `sql-server`, `oracle` or `db2` – or something else entirely.
  32. Improve title” (585) is like group 14, but invites the user to edit the title instead of doing it for them.
  33. Use modern JOIN syntax” (301) bemoans obsolete ways of dealing with databases. Bad habits to kick : using old-style JOINs – that old-style *comma-separated list of tables* style was replaced with the *proper* ANSI `JOIN` syntax in the ANSI-92 SQL Standard (more than 20 years ago) and its use is discouraged.
  34. More SQL woes” (272) is another template: Side note: you should not use the `sp_` prefix for your stored procedures. Microsoft has reserved that prefix for its own use (see *Naming Stored Procedures*, and you do run the risk of a name clash sometime in the future. It’s also bad for your stored procedure performance. It’s best to just simply avoid `sp_` and use something else as a prefix – or no prefix at all!
  35. I have the same problem” (241) is a kind of comments that really should not exist.

String matching in Google Sheets/Docs/Apps Script

Here I summarize various ways of matching string patterns in selected Google products: Sheets, Docs, and Apps Script. They are ordered from least to most powerful.

  1. String comparison in Sheets, =if(A1 = "cat", "is", "is not") Perhaps surprisingly, this is not literal comparison, but case-insensitive match: the condition also holds true if A1 has “Cat” or “CAT”. This makes one wonder how to actually test strings for equality; I’ll return to this later.
  2. Wildcards in Sheets functions like countif and sumif: ? matches any single character, * matches any number of any characters.
    Example: =countif(A1:A9, "a?b*")
  3. like comparison in query, the Sheets function using the Google Visualization API Query Language. This is similar to the previous: underscore _ matches any single character, percentage symbol % any number of any characters.
    Example: =query(A1:B9, "select A where B like 'a_b%'").
  4. findText method of objects of class Text in Apps Script, extending Google Docs. Documentation says “A subset of the JavaScript regular expression features are not fully supported, such as capture groups and mode modifiers.” In particular, it does not support lookarounds.
    Example:
  5. var body = DocumentApp.getActiveDocument().getBody().editAsText();
    var found = body.findText("ar{3,}gh").getElement().asText().getText();

    Yeah… I find Google Docs API clumsy, verbose and generally frustrating; unlike Google Sheets API.

  6. regexmatch function in Sheets and its relatives regexextract and regexreplace. Uses RE2 regex library, which is performance-oriented but somewhat limited, for example it does not support lookarounds. It does support capture groups.
    Example: =regexmatch(A1, "^h[ao]t+\b")
  7. matches comparison in query, the Sheets function using the Google Visualization API Query Language. Supports regular expressions (matching an entire string), including lookaheads but apparently not lookbehinds. Not clear what exactly is supported.
    Example: =query(A1:B9, "select A where B matches 'a.b(?!c).*'").
  8. JavaScript regular expression methods are supported in Apps Script… to the extent that they are supported in whatever version of Rhino JavaScript engine that GAS runs on. For example, non-capturing groups are broken and won’t be fixed. Be sure to test your regexes in Apps Script itself, not in a regular JS environment.

So what about the literal, case-sensitive string comparison in Google Sheets? Apparently, the way to do it is to use regexmatch… Example: =regexmatch(A1, "^cat$") returns True when A1 contains “cat” in lower case.

Maintaining restyled Google Calendar

RESTYLEgc is a pretty old (2008-11) MIT-licensed project by Brian Gibson, which enables, among other things, replacing the default stylesheet of embedded Google Calendar with a custom one.

Due to various changes on Google’s side made since 2011, the version available from Google Code link above no longer works. So I put up a GitHub repo with slightly modified files; only the two PHP scripts and the CSS file are included, the optional resources are not. (February 2017 update: a JS file is added.)

Keeping up with the changes in Google URL structure

Line 120 of restylegc.php changed from

$url = "https://www.google.com/calendar/embed?" . $_SERVER['QUERY_STRING'];

to

$url = "https://calendar.google.com/calendar/embed?" . $_SERVER['QUERY_STRING'];

Line 19 of restylegc-js.php changed from

$url = "https://www.google.com/calendar/" . $_SERVER['QUERY_STRING'];

to

$url = "https:" . $_SERVER['QUERY_STRING'];

Line 42 of restylegc-js.php changed from

$replacement = '"https://www.google.com/calendar';

to

$replacement = '"https://calendar.google.com/calendar';

February 2017 update

Google changed the link to the JavaScript file that runs the calendar; it is no longer a direct link to a .js file. (This resulted in the iframe erroring out with “window._init is not a function” and such; the function was defined in the JS file that failed to load.)

To correct the issue, I did the following:

Lines 137-138 of restylegc.php changed from

$pattern = '/src="(.*js)"/';
$replacement = 'src="restylegc-js.php?$1"';

to

$pattern = '/javascript" src="(\/calendar\S*)"/';
$replacement = 'javascript" src="restylegc-js.php?$1"';

Uncommented line 29 of restylegc-js.php, changing it from

//$url = "http://myserver.tld/path/to/archive/e0437df6468589031e718f3606b03917embedcompiled__en.js";

to

$url = "gcal.js";

Downloaded the JavaScript file used by the calendar, so that it’s served locally. (This is the gcal.js file referred to above.)


The rest of the post describes cosmetic changes made in the GitHub repo, which are not required for the calendar to work.

Better wrapping of text in narrow calendars

Line 1181 of restylegc.css (rules for .agenda .event-summary, .agenda .event-summary-expanded) changed

  white-space:nowrap;

to

  white-space:pre-wrap;

Line 1263 of restylegc.css (rules for .agenda .event-summary-expanded) changed

  font-weight:bold;

to

  font-weight:normal;

Also (line 1323) added

white-space:normal;

to the rule

.ie6 .agenda .event-title

although in 2015, IE6 isn’t quite as big a deal anymore.

Stack Exchange sites: questions and visits

After a year, it is time to update the Hertzsprung–Russell diagram of Stack Exchange sites. Here is the new and much improved version: a bubble chart made with Google Spreadsheets. The data is imported with ImportXML function, which apparently means it will be automatically refreshed every two hours. Just in case, here is a (heavily zoomed out) static image. It’s not as usable as the interactive chart you see above, because one can’t read the text on obscured bubbles by hovering over them.

Static snapshot of the diagram
Static snapshot of the diagram

As a rule of thumb, the sites with at least 10000 visits/day and 10 questions/day are those that “made it”: they either graduated from beta or will probably graduate in foreseeable future (assuming the quality of content is decent and the community is somewhat capable of self-organization).


A year ago, I joked about Beer, Italian Language, and Expatriates being either closed or merged into one site. Neither thing happened yet. Of the three, Expatriates site now has the largest numbers of visits and questions, though the quality is often an issue. Italian Language caught up with Russian Language (not to be confused with Русский язык, which is a recently imported site from Hashcode.ru and differentiates itself by targeting native speakers rather than language learners). Of the three, Beer appears to have the least amount of energy at present.

Some of the sites at the bottom, around 100 visits/day, are very new and will not stay there for long. The future of Community Building seems less certain.


Here’s the technical part. The data is/are obtained from this list of sites with the command

=IMPORTXML("http://stackexchange.com/sites?view=list", "//div[contains(@class,'category')]/@class|//input/@value")

The XPath parameter consists of two: //div[contains(@class,'category')]/@class picks up the class of every div element with “category” in its list of classes. This is later used to categorize as Technology, Science, Life/Arts, etc. The other part is //input/@value, which grabs all other data: number of questions, number of answers, number of users… most of which I have no use for. To sort this column by data type I put MOD(ROW()-2,9) next to it:

+------------------------------------------------+------+
|                      Raw                       | Item |
+------------------------------------------------+------+
| lv-item category-technology site-stackoverflow |    0 |
| 1217462400                                     |    1 |
| 9410733                                        |    2 |
| 15707906                                       |    3 |
| 73.95805406                                    |    4 |
| 4258590                                        |    5 |
| 7232805                                        |    6 |
| 7956.285714                                    |    7 |
| Stack Overflow                                 |    8 |
| lv-item category-technology site-serverfault   |    0 |
| 1241049600                                     |    1 |
+------------------------------------------------+------+

The columns C-F filter the content of A by item type. For example, D is the site name: =FILTER(A:A,B:B=8), etc. The Category column needs a bit of extra (routine) work to make it user-friendly. Otherwise, the table is finished:

+---------------------+-------------------+----------------+
|        Name         | Questions per day | Visits per day |
+---------------------+-------------------+----------------+
| Stack Overflow      | 7956.3            |        7232805 |
| Server Fault        | 103.1             |         353739 |
| Super User          | 162.9             |         734077 |
| Meta Stack Exchange | 21.8              |           8680 |
| Web Applications    | 12.7              |          44341 |
| Arqade              | 36.5              |         247448 |
| Webmasters          | 13.6              |          12433 |
| Seasoned Advice     | 6.9               |          80308 |
| Game Development    | 21.1              |          25836 |
| Photography         | 8.5               |          26560 |
| Cross Validated     | 84.4              |          63317 |
| Mathematics         | 616.4             |         150744 |
| Home Improvement    | 17.3              |          71148 |
+---------------------+-------------------+----------------+

The bubble chart uses this table for the log-log scale plot that you see above.

Parsing calculus with regular expressions

As a service* to math students everywhere (especially those taking calculus), I started Mathematics.SE Index. The plan is to have a thematic catalog of common exercises in college-level mathematics, each linked to a solution posted on Math.SE.

As of now, the site has reasonably complete sections on Limits and Series, with a rudimentary section on binomial sums. All lists are automatically generated. Initial filtering was done with Data Explorer SQL query, using tags and keywords in question body. The query also took into account the view count (i.e., how often the problem is searched for), and the existence of upvoted answers.

The results of the query were processed with a Google Sheets script: a bunch of regular expressions extracted LaTeX markup with a desired pattern, checked its integrity [not of academic kind] and transformed into WordPress-compatible markup.

Plans for near future: integrals (especially improper), basic proofs by induction, {\epsilon-\delta}, maybe some group theory and differential equations… depends on how easy it is to teach these topics to regular expressions.

(*) Or disservice, I wouldn’t know.

Hertzsprung–Russell diagram of Stack Exchange sites

The Hertzsprung–Russell diagram is a scatter plot of stars by temperature and luminosity.

Hertzsprung–Russell diagram*

Traditionally, it is shown with temperature axis pointing from right to left, which I don’t really like.

Stack Exchange family of sites is not (yet) as numerous as stars in the Universe; there are only 120 or so of them. But they can also be organized on a two-parameter log-log scatterplot. The two parameters are: total number of questions (intrinsic characteristic, like surface temperature) and the number of daily visits (luminosity in Internet terms).

The linear scale chart was not going to work, due to the supergiant size of Stack Overflow:

Linear scale
Linear scale

Even on the log-log scale Stack Overflow is an outlier, but within reason:

log-log scale
log-log scale

The colors follow Stack Exchange classification: Technology, Science, Culture & Recreation, Life & Arts, Business, Professional. The largest of science sites, and the second largest overall, is Mathematics, although it trails several non-science sites in luminosity.

Annotated version of the above diagram:

Annotated diagram
Annotated diagram

Both Mathematics and MathOverflow have low traffic compared to their size: perhaps the Internet audience is just not that into math? On the other hand, the young Mathematics Educators site is way up in the traffic category.

Ironically, Astronomy is a low-luminosity site, trailing in traffic the much smaller Earth Science. Other annotated science sites are well established by now: Physics, Statistics, and (slow-moving) Theoretical Computer Science.

The only two non-technology sites that beat Mathematics in traffic are Gaming and English. Gaming isn’t really an exception here, since all large “technology” sites revolve around computers.

Three of the sites have extremely low traffic currently: Beer, Italian, and Expatriates. If this does not improve, they may be shut down… or perhaps merged into one: Beer for Italian Expatriates.


(*) “Hertzsprung-Russel StarData” by ESOhttp://www.eso.org/public/images/. Licensed under CC BY 3.0 via Wikimedia Commons.

Life after Google Reader, day 2

Having exported the Shared/Starred items from Google Reader, I considered several ways of keeping them around, and eventually decided to put them right here (see “Reading Lists” in the main navigation bar). My reasons were:

  • This site is unlikely to disappear overnight, since I am paying WordPress to host it.
  • Simple, structured HTML is not much harder for computers to parse than JSON, and is easier for humans to use.
  • Someone else might be interested in the stuff that I find interesting.

First step was to trim down shared.json and starred.json, and flatten them into a comma-separated list. For this I used Google Refine, which, coincidentally, is another tool that Google no longer develops. (It is being continued by volunteers as an open-source project OpenRefine). The only attributes I kept were

  • Title
  • Hyperlink
  • Author(s)
  • Feed name
  • Timestamp
  • Summary

Springer feeds gave me more headache than anything else in this process. They bundled the authors, abstract, and other information into a single string, within which they were divided only by presentational markup. I cleaned up some of Springer-published entries, but also deleted many more than I kept. “Sorry, folks – your paper looks interesting, but it’s in Springer.”

Compared to the clean-up, conversion to HTML (in a Google Spreadsheet) took no time at all. I split the list into chunks 2008-09, 2010-11, 2012, and 2013, the latter being updated periodically. Decided not to do anything about LaTeX markup; much of it would not compile anyway.

Last question: how to keep the 2013 list up-to-date, given that Netvibes offers no export feature for marked items? jQuery to the rescue:

Now with a freehand circle!
Now with a freehand circle!

The script parses the contents of Read Later tab, extracts the items mentioned above, and wraps them into the same HTML format as the existing reading list.

$(document).ready(function() {
  function exportSaved() {
  var html='';
  $('.hentry.readlater').each(function() {
    var title = $('.entry-innerTitle', this).html().split('<span')[0];
    var hyperref = $('.entry-innerTitle', this).attr('href');
    var summary = ($('.entry-content', this).html()+'').replace('\n',' ');
    var timestamp = $('.entry-innerPublished', this).attr('time');
    var feed = $('.entry-innerFeedName', this).html();
    var author = ($('.author', this).html()+'').slice(5);
    var date = new Date(timestamp*1000);
    var month = ('0'+(date.getMonth()+1)).slice(-2);
    var day = ('0'+date.getDate()).slice(-2);
    var readableDate = date.getFullYear()+'-'+month+'-'+day;
        html = html+"<div class='list-item'><h4 class='title'><strong><a href='"+hyperref+"'>"+title+"</a></strong></h4><h4 class='author'>"+author+"</h4><h5 class='source'>"+feed+" : "+readableDate+"</h5><div class='summary' style='font-size:80%;line-height:140%;'>"+summary+"</div></div>";  
    });
 console.log(html);
   }
   $('export').insertAfter($('#switch-view')).click(function() { exportSaved(); });
}); 

The script accesses the page via an extension in Google Chrome. It labels HTML elements with classes for the convenience of future parsing, not for styling. I had to use inline styles to keep WordPress happy. Dates are formatted according to ISO 8601, making it possible to quickly jump to any month using in-page search for “yyyy-mm”. Which would not work with mm/dd/yyyy, of course.

Unfortunately, some publishers still insert the authors’ names in the summary. Unlike arXiv, which serves reasonably structured feeds, albeit with unnecessary hyperlinks in the author field. While I am unhappy with such things (and the state of the web in general), my setup works for me, and this concludes my two-post detour from mathematics.

xkcd 743: Infrastructures