Nolan Shah

Lost Data

I’ve been working on Sodha for a couple of weeks now, indexing thousands of pieces of technical discussion, blogs, and references, and have found a treasure trove of historical content dating as far back as the early 2000s. Sodha has nothing to do with (lost) historical data specifically, but in finding this content, I’ve had some thoughts about the lost pieces of history we'll never recover.

Some of that old content still exists on the original sites that hosted it, typically Blogspot or WordPress.com. Some of it can now only be found on the Internet Archive, because the domain is no longer owned by anyone, DNS resolution fails, or the server is down. Sadly, some of it exists only as dead links on other people's websites.

All those dead links bother me. We’ve lost a piece of our (recent) history and are left only with the knowledge that it did, at some point, exist. It's now something for the historians of 100 years from now to sift through and make conjectures about.

One site that is luckily still up is Harj Taggar’s old blog, including an old post on Lessons Learned through YCombinator dated March 11, 2007. Every word of that post will resonate with anyone building a company. I’m sure the same advice was given a thousand times before this post and has been given a thousand times since. And yet, it still tells a story of Taggar's experiences that would otherwise be lost.

The usefulness and relevance of all that lost data are not something we can even begin to comprehend. How many technical designs, personal stories, philosophical discussions, and fan fictions are baking in overheated attics? Or, more likely, buried underneath a heap of junk in a landfill, never to be recovered?

It’s a shred of luck that the Internet Archive still exists. It is a non-profit institution, almost 20 years old, funded by donations and government support. Its “Current status” on Wikipedia is listed as “Active”. It’s almost a taunt, as if we're waiting for the day it's edited to "Inactive". After all, where’s the money in history?

Moreover, in the wake of Google tearing down old services, it’s a surprise that Blogspot hasn’t been blown away. Google migrated most of its old tech blogs off Blogspot a while ago, and I can’t imagine Google makes much money off it otherwise. Where’s the money if not in advertising?

Luckily, WordPress is still the most widely used website engine in the world. And a large number of other services like GitHub Pages, Render.com, and Medium are in the mix now, none of which look likely to die anytime soon. So most of the data generated in the late 2010s, and certainly the 2020s, should be safe for now.

On the flip side, the default action for a cloud provider is to delete a customer's data when that customer fails to pay the bills. I can imagine the day a Google engineer types the command to delete the final archive of Blogspot off Google Cloud Storage.

The same kinds of problems show up in other places, albeit in different shapes.

With DRM for streaming and games, we rent access to media instead of owning copies of it, because renting is easier to monetize, especially with the ubiquity of the internet.

With academic knowledge, we’ve come to rely on closed but increasingly irrelevant institutions like Elsevier, or open but non-profit and barely funded institutions like arXiv, to keep the continuity of knowledge.

With social media, we’ve made it a point to instill "the right to be forgotten", which gives people the choice to not be a part of history. Ironically, that’s a right we’ve denied with every tomb explored or body exhumed at archeological sites.

There’s actually a Wikipedia page dedicated to "Lost television broadcast". What even happens when a TV show stops running in syndication?

Digital historical preservation is underrated.

Part of the problem I’ve described is money, and the other part is design. Despite the relevance of all this data and the gold mine within it, we’ve made choices over the last 20 years that led us to lose it. The sad truth is that we won’t learn our lesson.