Nolan Shah

Lost Data

I’ve been working on Sodha for a couple of weeks now, indexing thousands of technical discussions, blog posts, and references, and have found a treasure trove of historical content dating as far back as the early 2000s. Sodha has nothing to do with (lost) historical data specifically, but in finding this content, I’ve been thinking about the "lost" pieces of history we'll never recover.

Some old content still exists on its original hosting platform, typically Blogspot or WordPress. Some can only be found on the Internet Archive because the website's domain or server is no longer active. Sadly, some exists only as dead links on other people's websites.

All those dead links bother me. A piece of our (recent) history is gone. All that's left is the knowledge that it did, at some point, exist. Future historians and digital archeologists will pick through what remains and make conjectures.

One site that is luckily still up is the old blog of Harj Taggar, a tech entrepreneur, which includes a post on Lessons Learned through YCombinator dated March 11, 2007. Every word of that post resonates with someone building a company. The same advice has been told a thousand times before and a thousand times since. And yet, it still tells a story of Taggar's experiences that would otherwise be lost.

The usefulness and relevance of all that lost data are not something we can even begin to comprehend. How many technical designs, personal stories, philosophical discussions, and fan fiction pieces are baking in overheated attics? Or, perhaps more likely, buried underneath a heap of junk in a landfill, never to be recovered?

It is a stroke of luck that the Internet Archive still exists. It is a non-profit institution almost 20 years old, funded through donations and government support, and its "Current Status" on Wikipedia is listed as "Active". That reads almost like a taunt, as if we are waiting for the day someone updates it to "Inactive". After all, where is the money in history?

In the wake of Google tearing down old services, it’s a surprise that Blogspot has not been blown away. Google migrated most of their old tech blogs off Blogspot a while ago, and I can’t imagine Google makes much money on it otherwise. After all, where is the money if not in advertising?

Luckily, WordPress is still the most used website hosting engine in the world. Other services like GitHub Pages, Render.com, and Medium are in the mix now, and none looks likely to die anytime soon. Most of the data generated in the late 2010s, and certainly the 2020s, should be safe for now.

On the flip side, the default action for a cloud provider is to delete a customer's data if the customer fails to pay the bills. I imagine the day a Google engineer types the command to delete the final archive of Blogspot off Google Cloud Storage.

The same kinds of problems exist in other places, in different forms.

With DRM for streaming and games, we rent access to media instead of owning copies because the ubiquity of fast internet makes renting better for monetization.

With academic knowledge, we increasingly rely on closed but irrelevant publishers like Elsevier or open but non-profit (and scarcely funded) institutions like arXiv to maintain the continuity of academic knowledge.

With social media, we made a point to let people opt out of history through the right to be forgotten. Ironically, that is a right denied with every tomb explored or body exhumed in archeological sites.

There is a Wikipedia page dedicated to "Lost television broadcast". What even happens when a TV show stops running in syndication?

Digital historical preservation is underrated.

Part of the problem is money, but the other part is design. Despite the relevance of this data and the gold in it, we made choices over the last 20 years that led to its loss. The sad truth is that we won’t learn our lesson.