written May 2018, updated May 2021
Note: I got back on Twitter in 2020.
I recently exited Twitter, but I wanted to take my witty tweets with me. Fortunately, Twitter lets you download all of your tweets as a Zip file of HTML, CSS, and JavaScript.
Now, setting aside my concerns about the longevity of the circa-2018 single-page JavaScript app Twitter provides, I was even more surprised to find that none of the media (images, videos, etc.) is inside the archive: the images are hotlinked from Twitter's servers! Facepalm.
I'm not saying my old tweets are proverbs of wisdom or my memes worthy of the Smithsonian, but I prefer not to worry about Twitter's investors passing down my memories to the next generation. (I'll leave that up to Dropbox, if my kids can find my password and figure out the 2FA.)
So I wrote a little Ruby script to download the media files from the web and replace the references in the HTML with local links. Here is the code:
require 'http'
require 'fileutils'
require 'digest'

FileUtils.mkdir_p('media')
paths = Dir['data/**/*.js'] + ['index.html']
paths.each_with_index do |path, index|
  puts "#{index + 1} of #{paths.size}"
  data = File.read(path)
  # Match quoted URLs that end in a known media extension.
  data.gsub!(/"(http[^"]+)(\.(ico|png|gif|jpg|jpeg|mov|mp4|mpg|mpeg))"/i) do
    print '.'
    # Save the match first: the gsub below overwrites Regexp.last_match.
    match = Regexp.last_match
    ext = match[2]
    # Undo escaped slashes (\/) in the URL.
    url = match[1].gsub(%r{\\/}, '/')
    name = Digest::MD5.hexdigest(url) + ext
    asset_path = 'media/' + name
    unless File.exist?(asset_path)
      begin
        raw = HTTP.get(url + ext).to_s
        # Binary write so images and videos are not mangled on Windows.
        File.binwrite(asset_path, raw)
      rescue HTTP::ConnectionError
        puts url + ext + ' could not be downloaded'
        # Keep the original remote reference rather than blanking it out.
        next match[0]
      end
    end
    '"' + asset_path + '"'
  end
  File.write(path, data)
  puts
end
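To see what the substitution does without touching the network, here is a minimal sketch of just the rewriting step. The helper name localize_media and the sample line are made up for illustration; the regex and the MD5 naming scheme are the same as in the script.

```ruby
require 'digest'

# Same pattern as the script: group 1 is the URL up to the extension,
# group 2 is the extension including its leading dot.
MEDIA_URL = /"(http[^"]+)(\.(ico|png|gif|jpg|jpeg|mov|mp4|mpg|mpeg))"/i

# Hypothetical helper (not part of the original script): returns the
# rewritten text plus a map of local path => remote URL, downloading nothing.
def localize_media(text)
  downloads = {}
  rewritten = text.gsub(MEDIA_URL) do
    match = Regexp.last_match
    ext = match[2]
    url = match[1].gsub(%r{\\/}, '/') # undo escaped slashes
    asset_path = 'media/' + Digest::MD5.hexdigest(url) + ext
    downloads[asset_path] = url + ext
    '"' + asset_path + '"'
  end
  [rewritten, downloads]
end

# Made-up sample line with escaped slashes and a placeholder host.
sample = '{"media_url":"https:\/\/example.com\/media\/abc123.jpg","text":"hello"}'
rewritten, downloads = localize_media(sample)
puts rewritten
downloads.each { |path, url| puts "#{path} <= #{url}" }
```

The quoted URL is swapped for a media/ path named after the MD5 of the URL, and everything else on the line is left alone.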
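To use it, save the download script as archive.rb and install the http gem:
gem install http
Then unzip your Twitter archive and run the script from inside the extracted folder:
unzip archive.zip -d archive
cd archive
ruby path/to/archive.rb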
And that's it! Let it run, and at the end you should have slightly less disk space and slightly more peace of mind!
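If you want to double-check the result, one option is to scan the same files for any remote media references that are still there, for example because a download failed. This is only a sketch: the helper name remaining_media_urls is made up, and it assumes you run it from inside the extracted archive folder, like the script itself.

```ruby
# Same pattern the download script uses to find quoted media URLs.
MEDIA_REF = /"(http[^"]+)(\.(ico|png|gif|jpg|jpeg|mov|mp4|mpg|mpeg))"/i

# Hypothetical helper: returns every remote media URL still present in text.
def remaining_media_urls(text)
  text.scan(MEDIA_REF).map { |url, ext, _| url.gsub(%r{\\/}, '/') + ext }
end

if $PROGRAM_NAME == __FILE__
  paths = Dir['data/**/*.js'] + ['index.html']
  # Skip files that are missing so the check can run on a partial archive.
  paths.select { |path| File.exist?(path) }.each do |path|
    remaining_media_urls(File.read(path)).each { |url| puts "#{path}: #{url}" }
  end
end
```

No output means every matching reference now points at a local file. Anything it prints can be retried by running archive.rb again, since files that were already downloaded are skipped.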