Help scrape Google Video before it’s gone forever!

Update: Google About-Turn

Google have capitulated to feedback and decided to keep Google Video content alive by migrating the videos to YouTube – by which point Archive Team had already grabbed 40% of the content and were well on track to save it all.


Google Video will be shutting down within the next few weeks, and for some stupid reason they’re not transferring the videos to YouTube (even though Google owns both) – they’re just pulling the plug, and it’s all going to be lost. To fix this rubbish state of affairs, Archive Team are in a race to scrape as much Google Video content as they can before the viewing deadline (29/04/2011) and the download deadline (13/05/2011) – and you can help! Archive.org have kindly donated 100TB of storage, but first we need to index the videos and scrape them.

If you have a lot of bandwidth you can help scrape the videos themselves, but even if you don’t, you can help with the indexing effort by running a simple Linux script that’s light on resources and bandwidth – just leave it running!

Why save it?

YouTube has a 15-minute video length limit – and Google Video doesn’t. This means there’s a large amount of video that exists on Google Video and nowhere else, so when it’s gone – it’s gone. A lot of this might not be fantastic material, but plenty of it will be unique, the only copy on the Net. There are documentaries, films, and all sorts of good stuff, and even the personal video blogs will be a snapshot of the times we live in. In short, it’s stuff we, as a species, should not throw away.

It’s like the BBC scrapping their archives when they didn’t want to pay to save them – we won’t know what’s been lost until it’s gone, and by then it’ll be too late. So let’s not let that happen, eh?

How you can help if you have ~200GB bandwidth/storage or more: Scrape the videos

Update: pentium ported the video download script to Windows (you still need python and aria2, which can be downloaded separately). Script location: http://www.pentium100.com/gg_windows.zip
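
If you’re not sure whether you’ve already got python and aria2 set up, a quick sanity check from a command prompt (assuming both are on your PATH) looks like:

    python --version
    aria2c --version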

Head on over to the ArchiveTeam Google Videos wiki, sign up, pick an un-taken section of videos and add your initials/handle to it, then go for your life! Full instructions at the site.

How you can help if you don’t have a lot of bandwidth/storage

Update: nstrom ported the related.sh indexing script to Windows – all you now need is a pre-compiled version of the phantomjs browser. Full instructions come with the script, which is located here:
http://nstrom.chaosnet.org/google_video_related_win.zip

Note: This will only work on Linux machines with X running – you can’t run it on headless servers due to phantomjs’s requirements. Instructions are for Ubuntu 10.10 or later and might need a little modification if you’re running an older or non-Debian-based distro.
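
If you’re determined to try it on a headless box anyway, wrapping the script in a virtual framebuffer might do the trick – that’s an untested guess on my part, not something from the official instructions:

    sudo apt-get install xvfb
    xvfb-run ./related.sh    # xvfb-run fakes an X display - no promises phantomjs stays happy (script name assumed, per the Windows port note above)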

  1. Get and build phantomjs (a headless web browser) by doing the following – the full command sequence for steps 1–3 is sketched after this list:
    • Install build-essential, curl, git, libqtwebkit4 and libqtwebkit-dev if necessary
    • Create a directory called phantomjs
    • In the terminal, go into your new directory and fetch the phantomjs source code
    • Build phantomjs
    • Move the phantomjs binary somewhere in your PATH
  2. Create a folder called gvscript or such and download the file with the list of Google Video related pages to scrape: google_video_related.tar.gz
    • Extract the above downloaded file (Right-click and Extract To.. or use tar -zxvf ./google_video_related.tar.gz)
  3. In a terminal, navigate to the folder where you extracted the google_video_related file (above) and run the indexing script to help scrape Google Video (again, see the sketch below).
  4. Simply leave the script running, and head on over to #ggtesting on EFnet (IRC) if you need any assistance or in case the script has any issues (p.s. kill the script with Ctrl+C if it misbehaves – Ctrl+Z only suspends it – though mine’s been running for about 7 hours solid with no complaints, so I doubt you’ll have any).
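
Putting it all together, the whole sequence looks roughly like this. Treat it as a hedged sketch rather than gospel: it assumes the phantomjs source lives at github.com/ariya/phantomjs, that the 1.x source builds with qmake-qt4, and that the script inside the tarball is the related.sh mentioned above.

    # Step 1: dependencies and the phantomjs build
    sudo apt-get install build-essential curl git libqtwebkit4 libqtwebkit-dev
    git clone git://github.com/ariya/phantomjs.git    # assumed repo location; the clone creates the phantomjs directory for you
    cd phantomjs
    qmake-qt4 && make                                 # assumed build incantation for the 1.x source
    sudo cp bin/phantomjs /usr/local/bin/             # check where make actually put the binary

    # Steps 2-3: fetch the page list and run the indexing script
    mkdir ~/gvscript && cd ~/gvscript
    # download google_video_related.tar.gz here, using the link in step 2 above
    tar -zxvf google_video_related.tar.gz
    ./related.sh                                      # assumed script name, per the Windows port note above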

The script scrapes each page for related videos and sends them off to an Archive Team server. It takes very little processing power and bandwidth on your end (a couple of KB/sec, if that) and seems to work just fine.
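
If you’re curious what it’s doing under the hood, the logic boils down to something like this – illustrative only, as the file names, helper script and tracker URL below are all invented for the sake of the sketch:

    # every name here is made up - it just shows the shape of the loop
    while read page_url; do
        phantomjs get_related.js "$page_url" > ids.txt                      # render the page, pull out the related-video links
        curl -s --data-binary @ids.txt http://tracker.example.org/submit    # report the IDs back
    done < pages_to_scrape.txt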

Every little helps

I’m sure anything you can do to pitch in will be appreciated by Archive Team, the Internets, your future self, your kids, your kids’ kids, your kids’ kids’ kids… you get the picture ;)

Cheers!

How To: Monitor Which Processes are Using Bandwidth in Linux

I was listening to some music from my (aging) NAS earlier via Rhythmbox when it started to skip and stutter, so I had a look at the System Monitor: my laptop was sending and receiving around 100KB/sec. WTF? That seemed a bit high just to stream an MP3, and I wished I could see which processes were using the network card… so I did a little bit of searching and found that I could, via a nifty little command-line program called nethogs.

You can install nethogs from the standard universe repositories (in Ubuntu at least), and then fire it up with the line:
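
    sudo nethogs <interface>    # nethogs needs root - swap <interface> for your network device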

So for me, streaming over wireless, this meant running something like:
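
    sudo nethogs wlan0    # wlan0 is an assumption - use whichever interface your machine is actually on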

Which should give you output that looks something like this:
[Screenshot: NetHogs example output]

Now unfortunately, the above output states that the kernel/scheduler (PID 0) is using the bandwidth – fair enough, since it’s the thing fetching the data from the CIFS share – but at least I can see what is NOT using the bandwidth. From there it’s a process of elimination: close down apps one by one and keep an eye on the output until only the bandwidth-eating process is left, then either accept the fact that it’s a hungry, hungry hippo or shut it down.

The version of Rhythmbox that ships with Ubuntu 10.10 was the culprit in my case – it was just nom-ing away, checking that all the files still exist and stuff. After being left to “check” for some time, it finally quietened down to negligible levels and the music stopped skipping – Yay! =D

Linkage #2 – It’s Been a While…

I used to post a bunch of sites and stuff I’d found interesting every now and then back when the site was on PHPNuke, but I’ve kinda forgotten to do that of late… Well – no more!

Linkage #2 be down!

  • Fancy brewing your own booze? Me neither, really. But it looks like a fun project…
  • Consistently awesome image-based goodness abounds at 9gag.
  • The next router I get is very likely to be a NETGEAR WNR3500L Wireless-N Gigabit Open Source Router. Gigabit Ethernet! Wireless N! OpenWRT or DD-WRT or Tomato! Fo’ shizzle!
  • UnixPorn is not really porn. It’s screenshots of people’s Linux desktops. Honest! I’ve even submitted my own desktop! Cos it’s not like I’m procrastinating and avoiding making database slides… Hell no! :)
  • How to position home theater speakers properly.
  • An Engineer’s Guide to Bandwidth is a pretty good read – about how bandwidth, latency and packet reliability affect the usability of web services.
  • Red Remover is a physics-based flash game – keep the green blocks, get rid of the red ones, and we aren’t fussy about the blues. Yaaayy!
  • Considering building your own NAS? Me too. Am thinking FreeNAS might be the way and the path.
  • Create a web layout with buttons, header graphics, dividers and all that jazz by knocking it up in a graphics package, then slicing it into sections and exporting the HTML and sliced images. You can do the same thing using GIMP, but watching the video makes it easier to get your head around it all first.

That’ll do, pig. That’ll do…