Help scrape Google Video before it’s gone forever!

Update: Google About-Turn

Google have capitulated to feedback and decided to keep Google Video content alive by migrating the videos to YouTube – by which point Archive Team had already saved 40% of the content and were well on track to rescue the rest.


Google Video will be shutting down within the next few weeks, and for some stupid reason they’re not just transferring the videos to YouTube (as Google owns both) – instead they’re pulling the plug and it’s all going to be lost. To fix this rubbish state of affairs, Archive Team are in a race to scrape as much Google Video content as they can before the viewing deadline (29/04/2011) and the download deadline (13/05/2011) – and you can help! Archive.org have kindly donated 100TB of storage, but first the videos need to be indexed and then scraped.

If you have a lot of bandwidth you can help scrape the videos themselves, but even if you don’t, you can help with the indexing effort by running a simple, resource- and bandwidth-light Linux script and just leaving it running!

Why save it?

YouTube has a 15-minute video length limit – and Google Video doesn’t. This means there are large amounts of video that might be on Google Video and nowhere else, so when they’re gone – they’re gone. A lot of this might not be fantastic material, but a lot of it will be unique and the only copy on the Net. There are documentaries, films, and all sorts of good stuff, and even personal video blogs will be a snapshot of the times we live in. In short, it’s stuff we, as a species, should not throw away.

It’s like the BBC scrapping their archives when they didn’t want to pay to save them – we won’t know what’s been lost until it’s gone, and by then it’ll be too late. So let’s not let that happen, eh?

How you can help if you have ~200GB bandwidth/storage or more: Scrape the videos

Update: pentium ported the video download script to Windows (you still need python and aria2, which can be downloaded separately). Script location: http://www.pentium100.com/gg_windows.zip

Head on over to the ArchiveTeam Google Video wiki, sign up, pick an un-taken section of videos and add your initials/handle to it, then go for your life! Full instructions are at the site.

How you can help if you don’t have a lot of bandwidth/storage

Update: nstrom ported the related.sh indexing script to Windows – all you now need is a pre-compiled version of the phantomjs browser. Full instructions come with the script, which is located here:
http://nstrom.chaosnet.org/google_video_related_win.zip

Note: This will only work on Linux machines with X running – you can’t run it on headless servers, because phantomjs (at this version) needs an X server. Instructions are for Ubuntu 10.10 or later and might need a little modification if you’re running an older or non-Debian-based distro.
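(If headless boxes are all you have, one possible workaround – untested by me – is wrapping the script in a virtual X server via xvfb-run, which ships in Ubuntu’s xvfb package:

    sudo apt-get install xvfb
    xvfb-run ./related.sh   # related.sh is the indexing script set up below

No promises, but it may save a machine from sitting idle.)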

  1. Get and build phantomjs (a headless web browser) by doing the following (an example terminal session covering all of these steps is sketched after this list):
    • Install build-essential, curl, git, libqtwebkit4 and libqtwebkit-dev if necessary, for example via apt-get.
    • Create a directory called phantomjs.
    • In the terminal, go into your new directory and fetch the phantomjs source code.
    • Build phantomjs.
    • Move the phantomjs binary somewhere in your PATH.
  2. Create a folder called gvscript or such and download the file with the list of Google Video related pages to scrape: google_video_related.tar.gz
    • Extract the above downloaded file (right-click and Extract To…, or use tar -zxvf ./google_video_related.tar.gz)
  3. In a terminal, navigate to the folder where you extracted the google_video_related file (above) and run the indexing script (again, see the example session below) to help scrape Google Video.
  4. Simply leave the script running, and head on over to #ggtesting on EFnet (IRC) if you need any assistance or in case the script has any issues (p.s. kill the script with Ctrl+C if it misbehaves – though mine’s been running for about 7 hours solid with no complaints, so I doubt you’ll have any).
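Putting those steps together, the whole session looks roughly like this. Treat it as a sketch rather than gospel: the git URL, the qmake-qt4/make build, the bin/phantomjs path and the extracted folder name are my reconstruction of how phantomjs built and the archive unpacked at the time, and the script name related.sh is taken from the Windows port note above – adjust as needed:

    # Step 1: dependencies and the phantomjs build (Ubuntu 10.10+)
    sudo apt-get install build-essential curl git libqtwebkit4 libqtwebkit-dev
    mkdir phantomjs && cd phantomjs
    git clone git://github.com/ariya/phantomjs.git .
    qmake-qt4 && make
    sudo cp bin/phantomjs /usr/local/bin/

    # Steps 2-3: grab the page list and start indexing
    mkdir ~/gvscript && cd ~/gvscript
    # (first download google_video_related.tar.gz from the link above into this folder)
    tar -zxvf ./google_video_related.tar.gz
    cd google_video_related
    ./related.sh   # Step 4: leave this running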

The script scrapes each page for related videos and sends the results off to an Archive Team server. It takes very little processing power and bandwidth on your end (a couple of KB/sec, if that) and seems to work just fine.
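Conceptually, what the script is doing boils down to a loop like the one below. This is purely illustrative – extract_related.js, video_pages.txt and the tracker URL are hypothetical stand-ins rather than the real Archive Team endpoints, and the real related.sh also handles claiming work items, retries and so on:

    # for each Google Video page in the downloaded list...
    while read -r video_url; do
        # render the page in phantomjs and dump the "related videos" links it finds
        phantomjs extract_related.js "$video_url" > related_ids.txt
        # report the discovered IDs back to the tracker
        curl -s --data-binary @related_ids.txt "http://tracker.example.org/submit"
    done < video_pages.txt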

Every little helps

I’m sure anything you can do to pitch in will be appreciated by Archive Team, the Internets, your future self, your kids, your kids’ kids, your kids’ kids’ kids… you get the picture ;)

Cheers!
