Query StackOverflow with SQL
Query StackOverflow with SQL and StackExchange Data Explorer.
Thanks to my friend Glenn, for the tip that StackExchange has a UI where you can query all SE sites like StackOverflow, Ask Ubuntu, and more (about 100 sites).
- Create an account with StackExchange Data Explorer here using OpenID.
- Click on
Compose Query
to see the UI where to enter a query. It shows the database schema on the right. - Enter a name and description (optional) for the query
- Compose the SQL query and click
Run Query
Database Schema
It’s not very intuitive to understand the schema, even as a StackOverflow user. I assumed that all answers belonged to the Comments table.
This the structure of Posts:
- All questions and answers are in the Posts table
- These are identified using
PostTypeId
(see tablePostTypes
)
- These are identified using
- In StackOverflow, the comments that you see under questions or answers are in the Comments table
It’s easier to understand the schema by looking at the ERD and querying individual tables.
Find all PostTypes
SELECT * FROM PostTypes
Questions have a PostTypeId
of 1
and answers are 2
.
Find all Tags
There are a lot of tags
SELECT COUNT(Tags.Id) FROM Tags
Find top 20 Tags with most count
SELECT TOP 20 Tags.Name FROM Tags ORDER BY Tags.Count DESC
The UI notifies if you made a SQL or schema mistake. The above query shows Invalid column name 'Name'
. Correct to:
SELECT TOP 20 Tags.TagName, Tags.Count FROM Tags ORDER BY Tags.Count DESC
These are the first 5 rows:
javascript | 2470759
python | 2097773
java | 1885215
c# | 1579042
php | 1454705
What are badges?
SELECT DISTINCT Badges.Name, COUNT(Badges.Name) AS Total FROM Badges GROUP BY Badges.Name ORDER BY Total DESC
If the queries start to get longer, then use your preferred best practice for indentation.
SELECT
DISTINCT TOP 10 Badges.Name
, COUNT(Badges.Name) AS Total
FROM
Badges
GROUP BY
Badges.Name
ORDER BY
Total DESC