Using Python and wget to download certain file types from a web-page

Recently I wanted to download all the movies in MacKay’s information theory course. That was a great many files, and I thought it would be too tedious to do it through the browser or even with manual wget calls. So, digging around, I found this decent tutorial on Python HTML processing. I also dug up some information on parsing URLs. I combined that with my knowledge of Python’s subprocess module and put together this nice piece of code.

 

#!/usr/bin/python

import sgmllib  # Python 2 standard library; removed in Python 3

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

import urllib

# Get something to work with.
webPage="http://www.inference.phy.cam.ac.uk/itprnn_lectures/"
f = urllib.urlopen(webPage)
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
links=myparser.get_hyperlinks()
print links

movies=[x for x in links if x.endswith('mp4')]
print movies

import urlparse
movieURLs=[urlparse.urljoin(webPage,x) for x in movies]
print movieURLs

from subprocess import call

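for movieURL in movieURLs:
    # Pass the arguments as a list, without a shell, so that characters
    # such as '&' or '?' in a URL are not interpreted by the shell.
    # The -c flag lets wget resume a partially downloaded file.
    call(["wget", "-c", movieURL])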
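
The script above is Python 2 (sgmllib and the print statement are gone in Python 3). For anyone on Python 3, here is a rough sketch of the same idea, not a drop-in replacement. It swaps sgmllib for the standard html.parser module and compares the extension against the URL path, so a link with a query string still matches. The names LinkParser, find_links and wget_command, as well as the sample HTML string, are made up for illustration; the sketch only prints the wget commands rather than running them.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this method.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hyperlinks.append(value)


def find_links(html, base_url, extension=".mp4"):
    """Return absolute URLs of links whose path ends with `extension`."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    # Resolve relative links against the page they came from.
    absolute = [urljoin(base_url, link) for link in parser.hyperlinks]
    # Test the path only, so "movie.mp4?dl=1" still counts as an mp4.
    return [u for u in absolute
            if urlsplit(u).path.lower().endswith(extension)]


def wget_command(url):
    """Build the argument list for a resumable wget download."""
    return ["wget", "-c", url]


# Hypothetical sample markup standing in for the real course page.
sample = ('<a href="lecture01.mp4">1</a> '
          '<a href="notes.pdf">notes</a> '
          '<a href="/v/lecture02.mp4?dl=1">2</a>')
base = "http://www.inference.phy.cam.ac.uk/itprnn_lectures/"

movie_urls = find_links(sample, base)
for url in movie_urls:
    print(wget_command(url))
```

To actually download, each list returned by wget_command could be handed to subprocess.call, and the sample string could be replaced by the page body fetched with urllib.request.urlopen.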
