In a recent exploration of Hacker News (HN), a developer set out to download and analyze the platform's entire dataset using DuckDB. The project examined historical trends in comments and stories, focusing on how key programming topics have fared over time.

The journey began with the developer's desire to build a robust HN API client using recent features of the Go programming language. Although several other clients already existed, writing a new one allowed the developer to experiment with Go's newer capabilities while addressing personal project needs. The result was a versatile client capable of retrieving active items and lists of items, as posts and comments are called in the HN API.

Initially, the developer intended to access only recent items, but curiosity led to downloading the entire archive of Hacker News posts, estimated at tens of GiB of JSON. Despite a few stalled downloads that had to be interrupted and restarted, the operation eventually completed with the command:

hn scan --no-cache --asc -c- -o full.json

After a few hours, the developer was left with a 20 GiB JSON file encapsulating the history of Hacker News. The file could be topped up with the latest items at any time simply by re-running the download command. The question then arose: what insights could be derived from such a comprehensive dataset?

The first step was to run basic searches over the data using simple text queries. For instance, the phrase "correct horse battery staple" appeared 231 times, showing how certain memes and popular phrases resonate within the community. However, merely grepping through the data felt inadequate for the depth of analysis the developer desired. It was time to put DuckDB to the test.

DuckDB stands out in the database landscape as a fast, embeddable analytics engine that is approachable even for novices. The developer, who spends much of the working day managing other databases, was eager to try DuckDB for this task. After importing the vast dataset, the developer crafted SQL queries to analyze the frequency of discussions around programming languages such as Python, JavaScript, Java, Ruby, and Rust.
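The import step itself is not shown, but with DuckDB it can be as simple as pointing `read_json_auto` at the dump. The sketch below assumes the file name from the download command above and an `items` table matching the query that follows; the exact options needed may vary with the shape of the JSON.

```sql
-- Load the full dump into a table; read_json_auto infers the schema
-- from the JSON itself. 'full.json' is the file produced by hn scan.
CREATE TABLE items AS
SELECT * FROM read_json_auto('full.json');

-- Quick sanity check: how many items were imported?
SELECT COUNT(*) FROM items;
```

DuckDB streams the file during import rather than buffering it all in memory, which is what makes working with a 20 GiB dump on a single machine practical.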

To track the trends over time, a 12-week moving average was applied, revealing the relative popularity of these programming languages within the Hacker News community. The SQL query executed was as follows:

WITH weekly AS (
    SELECT
        DATE_TRUNC('week', TO_TIMESTAMP(time)) AS week_start,
        COUNT(*) FILTER (WHERE text ILIKE '%python%')::float / NULLIF(COUNT(*), 0) AS python_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%javascript%')::float / NULLIF(COUNT(*), 0) AS javascript_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%java%')::float / NULLIF(COUNT(*), 0) AS java_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%ruby%')::float / NULLIF(COUNT(*), 0) AS ruby_prop,
        COUNT(*) FILTER (WHERE text ILIKE '%rust%')::float / NULLIF(COUNT(*), 0) AS rust_prop
    FROM items
    GROUP BY week_start
)
SELECT
    week_start,
    AVG(python_prop) OVER (ORDER BY week_start ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS avg_python_12w,
    AVG(javascript_prop) OVER (ORDER BY week_start ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS avg_javascript_12w,
    AVG(java_prop) OVER (ORDER BY week_start ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS avg_java_12w,
    AVG(ruby_prop) OVER (ORDER BY week_start ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS avg_ruby_12w,
    AVG(rust_prop) OVER (ORDER BY week_start ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS avg_rust_12w
FROM weekly
ORDER BY week_start;

This query effectively illustrated the evolution of programming language discussions over time on Hacker News. The developer concluded that DuckDB proved to be an excellent choice for analyzing such extensive datasets and vowed to continue exploring its capabilities.

Looking ahead, the developer contemplated the implications of having a local copy of all Hacker News content. One intriguing idea involved training a myriad of LLM-based bots on the data, potentially automating contributions to the platform. However, the developer acknowledged having reached a suitable stopping point for this project and left the next steps to other enthusiastic developers willing to push the boundaries further.

For those interested in this unique approach to analyzing Hacker News, further insights and articles can be found at