Query StackOverflow with SQL and StackExchange Data Explorer.

Thanks to my friend Glenn, for the tip that StackExchange has a UI where you can query all SE sites like StackOverflow, Ask Ubuntu, and more (about 100 sites).

  1. Create an account with StackExchange Data Explorer here using OpenID.
  2. Click on Compose Query to see the UI where to enter a query. It shows the database schema on the right.
  3. Enter a name and description (optional) for the query
  4. Compose the SQL query and click Run Query

Database Schema

It’s not very intuitive to understand the schema, even as a StackOverflow user. I assumed that all answers belonged to the Comments table.

  • Database schema documentation here
  • Schema ERD here

This the structure of Posts:

  • All questions and answers are in the Posts table
    • These are identified using PostTypeId (see table PostTypes)
  • In StackOverflow, the comments that you see under questions or answers are in the Comments table

It’s easier to understand the schema by looking at the ERD and querying individual tables.

Find all PostTypes

SELECT * FROM PostTypes

Questions have a PostTypeId of 1 and answers are 2.

Find all Tags

There are a lot of tags

SELECT COUNT(Tags.Id) FROM Tags

Find top 20 Tags with most count

SELECT TOP 20 Tags.Name FROM Tags ORDER BY Tags.Count DESC

The UI notifies if you made a SQL or schema mistake. The above query shows Invalid column name 'Name'. Correct to:

SELECT TOP 20 Tags.TagName, Tags.Count FROM Tags ORDER BY Tags.Count DESC

These are the first 5 rows:

javascript | 2470759
python | 2097773
java | 1885215
c# | 1579042
php | 1454705

What are badges?

SELECT DISTINCT Badges.Name, COUNT(Badges.Name) AS Total FROM Badges GROUP BY Badges.Name ORDER BY Total DESC

If the queries start to get longer, then use your preferred best practice for indentation.

SELECT
  DISTINCT TOP 10 Badges.Name
  , COUNT(Badges.Name) AS Total
FROM
  Badges
GROUP BY
  Badges.Name
ORDER BY
  Total DESC