Use the Wayback Machine API to quickly find the most recent snapshot for a URL.
An NGO client asked me to rebuild a site that had been deleted when someone in the organisation forgot to pay the annual hosting invoice. I used the Wayback Machine to restore a significant number of the pages.
Earlier this year Uptime Robot alerted me that the NGO’s site was down. By the time I opened the alert email the site had been down for a few hours, and there was no follow-up email to say it was back up. I visited the site and saw that it was showing a generic domain landing page because the hosting invoice had not been paid. The hosting company generally leaves a customer’s site online for a week or two past the invoice due date before deactivating it.
I immediately contacted someone in the organisation to advise that prompt payment would ensure the site files and database were not deleted. Unfortunately the people responsible were slow to act, and the files and data were lost.
A lucky break
I had developed the site with the WooThemes Canvas theme, which has since been retired. Some time after completing the site I had started using the StudioPress Genesis theme framework, so I copied the data to a staging site, created a child theme and ported the Canvas version to Genesis. I never did any more work on the adapted site.
I was able to export the database of this staging site and use it as a starting point for the new site. The data was four years old, but it contained all the pages, which was a big help.
Wayback Archive and its API
After restoring the site and making it live again, I decided to have a quick look at the Wayback Archive to see if it had any newer copies of the pages. I checked a few URLs and it had snapshots from about two years ago.
Manually checking all the site’s URLs (about 120 of them) would be tedious, so I browsed the Wayback Archive API page and experimented with the Wayback Availability JSON API, which returns the URL of the most recent snapshot, if one exists.
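The Availability API takes a url query parameter (e.g. http://archive.org/wayback/available?url=example.com) and returns a small JSON document; when no snapshot exists, the archived_snapshots object is empty. A minimal Python sketch of picking the snapshot URL out of that response (the sample payload below is illustrative, not a live API result):

```python
import json

# Illustrative response in the shape returned by
# http://archive.org/wayback/available?url=example.com
sample = '''{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20230101000000/http://example.com/",
      "timestamp": "20230101000000"
    }
  }
}'''

def closest_snapshot(payload):
    """Return the closest snapshot URL, or None when no snapshot exists."""
    data = json.loads(payload)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

Calling closest_snapshot(sample) returns the web.archive.org URL; with an empty archived_snapshots object it returns None, which is how you tell the archive has no copy of that page.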
At the command line I ran a simple loop over each URL to wget the JSON from the Wayback Archive. From the size of the returned file I could tell whether a snapshot was available. I then wrote a PHP script to loop through each file, read and decode the JSON data, and produce an HTML page with a link to the live URL and a link to the snapshot URL. I used this to edit each live page and update it with information from the snapshot.
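The process-the-saved-files step could be sketched like this (in Python rather than the PHP I used; the directory layout, file naming and helper name are hypothetical). It reads each saved API response and emits one HTML table row per URL that has a snapshot:

```python
import json
import os

def snapshot_report_rows(json_dir):
    """Build HTML table rows pairing each live URL with its archived snapshot.

    Expects one file per URL in json_dir, each holding a saved Wayback
    Availability API response; URLs with no snapshot are skipped.
    """
    rows = []
    for name in sorted(os.listdir(json_dir)):
        with open(os.path.join(json_dir, name)) as f:
            data = json.load(f)
        closest = data.get("archived_snapshots", {}).get("closest")
        if not closest:
            continue  # the archive had no snapshot for this URL
        rows.append(
            '<tr><td><a href="http://{0}">{0}</a></td>'
            '<td><a href="{1}">archived copy</a></td></tr>'.format(
                data["url"], closest["url"]))
    return rows
```

Wrapping the returned rows in a table element gives a simple checklist page: open the live URL in one tab, the archived copy in another, and copy across whatever is newer.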
Checking for snapshots
I thought about querying all posts and pages and checking each one, but felt it would be a bit rude to overuse the Wayback Availability API service. Instead I wrote code to query from a static array of URLs (only two URLs in the code below). Readers can easily expand it as necessary.
The results are in a table, with a link to the archived snapshot.