{"id":328,"date":"2012-10-02T09:11:00","date_gmt":"2012-10-02T09:11:00","guid":{"rendered":""},"modified":"2017-02-07T14:07:22","modified_gmt":"2017-02-07T14:07:22","slug":"using-python-and-wget-to-download-certain-file-types-from-a-web-page","status":"publish","type":"post","link":"https:\/\/perfectionatic.org\/?p=328","title":{"rendered":"Using Python and wget to download certain file types from a web-page"},"content":{"rendered":"<div dir=\"ltr\" style=\"text-align: left;\">Recently I wanted to download all the movies in MacKay&#8217;s <a href=\"http:\/\/www.inference.phy.cam.ac.uk\/itprnn_lectures\/\">information theory<\/a> course. That was a great many files, and I thought it would be too tedious to do it using the browser or even manual <a href=\"http:\/\/www.thegeekstuff.com\/2009\/09\/the-ultimate-wget-download-guide-with-15-awesome-examples\/\"><code>wget<\/code><\/a>. So digging around, I found this <a href=\"http:\/\/www.boddie.org.uk\/python\/HTML.html\">decent tutorial<\/a> on Python HTML processing. I also dug around for some information <a href=\"http:\/\/docs.python.org\/library\/urlparse.html#urlparse.urljoin\">on parsing URLs<\/a>. I combined that with my knowledge of Python&#8217;s <a href=\"http:\/\/jimmyg.org\/blog\/2009\/working-with-python-subprocess.html\"><code>subprocess<\/code><\/a> module and put together this nice piece of code. 
<\/p>\n<p>&nbsp;<\/p>\n<h2><\/h2>\n<div>\n<pre><span>#!\/usr\/bin\/python<\/span><br \/><br \/><span>import<\/span> <span>sgmllib<\/span><br \/><br \/><span>class<\/span> <span>MyParser<\/span><span>(<\/span><span>sgmllib<\/span><span>.<\/span><span>SGMLParser<\/span><span>):<\/span><br \/>    <span>\"A simple parser class.\"<\/span><br \/><br \/>    <span>def<\/span> <span>parse<\/span><span>(<\/span><span>self<\/span><span>,<\/span> <span>s<\/span><span>):<\/span><br \/>        <span>\"Parse the given string 's'.\"<\/span><br \/>        <span>self<\/span><span>.<\/span><span>feed<\/span><span>(<\/span><span>s<\/span><span>)<\/span><br \/>        <span>self<\/span><span>.<\/span><span>close<\/span><span>()<\/span><br \/><br \/>    <span>def<\/span> <span>__init__<\/span><span>(<\/span><span>self<\/span><span>,<\/span> <span>verbose<\/span><span>=<\/span><span>0<\/span><span>):<\/span><br \/>        <span>\"Initialise an object, passing 'verbose' to the superclass.\"<\/span><br \/><br \/>        <span>sgmllib<\/span><span>.<\/span><span>SGMLParser<\/span><span>.<\/span><span>__init__<\/span><span>(<\/span><span>self<\/span><span>,<\/span> <span>verbose<\/span><span>)<\/span><br \/>        <span>self<\/span><span>.<\/span><span>hyperlinks<\/span> <span>=<\/span> <span>[]<\/span><br \/><br \/>    <span>def<\/span> <span>start_a<\/span><span>(<\/span><span>self<\/span><span>,<\/span> <span>attributes<\/span><span>):<\/span><br \/>        <span>\"Process a hyperlink and its 'attributes'.\"<\/span><br \/><br \/>        <span>for<\/span> <span>name<\/span><span>,<\/span> <span>value<\/span> <span>in<\/span> <span>attributes<\/span><span>:<\/span><br \/>            <span>if<\/span> <span>name<\/span> <span>==<\/span> <span>\"href\"<\/span><span>:<\/span><br \/>                <span>self<\/span><span>.<\/span><span>hyperlinks<\/span><span>.<\/span><span>append<\/span><span>(<\/span><span>value<\/span><span>)<\/span><br \/><br \/>    <span>def<\/span> 
<span>get_hyperlinks<\/span><span>(<\/span><span>self<\/span><span>):<\/span><br \/>        <span>\"Return the list of hyperlinks.\"<\/span><br \/><br \/>        <span>return<\/span> <span>self<\/span><span>.<\/span><span>hyperlinks<\/span><br \/><br \/><span>import<\/span> <span>urllib<\/span><span>,<\/span> <span>sgmllib<\/span><br \/><br \/><span># Get something to work with.<\/span><br \/><span>webPage<\/span><span>=<\/span><span>\"http:\/\/www.inference.phy.cam.ac.uk\/itprnn_lectures\/\"<\/span><br \/><span>f<\/span> <span>=<\/span> <span>urllib<\/span><span>.<\/span><span>urlopen<\/span><span>(<\/span><span>webPage<\/span><span>)<\/span><br \/><span>s<\/span> <span>=<\/span> <span>f<\/span><span>.<\/span><span>read<\/span><span>()<\/span><br \/><br \/><span># Try and process the page.<\/span><br \/><span># The class should have been defined first, remember.<\/span><br \/><span>myparser<\/span> <span>=<\/span> <span>MyParser<\/span><span>()<\/span><br \/><span>myparser<\/span><span>.<\/span><span>parse<\/span><span>(<\/span><span>s<\/span><span>)<\/span><br \/><br \/><span># Get the hyperlinks.<\/span><br \/><span>links<\/span><span>=<\/span><span>myparser<\/span><span>.<\/span><span>get_hyperlinks<\/span><span>()<\/span><br \/><span>print<\/span> <span>links<\/span><br \/><br \/><span>movies<\/span><span>=<\/span><span>[<\/span><span>x<\/span> <span>for<\/span> <span>x<\/span> <span>in<\/span> <span>links<\/span> <span>if<\/span> <span>x<\/span><span>.<\/span><span>endswith<\/span><span>(<\/span><span>'mp4'<\/span><span>)]<\/span><br \/><span>print<\/span> <span>movies<\/span><br \/><br \/><span>import<\/span> <span>urlparse<\/span><br \/><span>movieURLs<\/span><span>=<\/span><span>[<\/span><span>urlparse<\/span><span>.<\/span><span>urljoin<\/span><span>(<\/span><span>webPage<\/span><span>,<\/span><span>x<\/span><span>)<\/span> <span>for<\/span> <span>x<\/span> <span>in<\/span> <span>movies<\/span><span>]<\/span><br \/><span>print<\/span> 
<span>movieURLs<\/span><br \/><br \/><span>from<\/span> <span>subprocess<\/span> <span>import<\/span> <span>call<\/span><br \/><br \/><span># Pass wget its arguments as a list so the URL is never interpreted by a shell.<\/span><br \/><span>for<\/span> <span>movieURL<\/span> <span>in<\/span> <span>movieURLs<\/span><span>:<\/span><br \/>    <span>call<\/span><span>([<\/span><span>\"wget\"<\/span><span>,<\/span> <span>\"-c\"<\/span><span>,<\/span> <span>movieURL<\/span><span>])<\/span><br \/><\/pre>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Recently I wanted to download all the movies in MacKay&#8217;s information theory course. That was a great many files, and I thought it would be too tedious to do it using the browser or even manual wget. So digging around, I found this decent tutorial on Python HTML processing. I also dug around [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[84,85,1,87,86],"tags":[],"class_list":["post-328","post","type-post","status-publish","format-standard","hentry","category-linux","category-python","category-uncategorized","category-url-parsing","category-wget"],"_links":{"self":[{"href":"https:\/\/perfectionatic.org\/index.php?rest_route=\/wp\/v2\/posts\/328","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/perfectionatic.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/perfectionatic.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/perfectionatic.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/perfectionatic.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=328"}],"version-history":[{"count":1,"href":"https:\/\/perfectionatic.org\/index.php?rest_route=\/wp\/v2\/posts\/328\/revisions"}],"predecessor-version":[{"id":331,"href":"https:\/\/perfectionatic.org\/index.php?rest_route=\/wp\/v2\/po
sts\/328\/revisions\/331"}],"wp:attachment":[{"href":"https:\/\/perfectionatic.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=328"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/perfectionatic.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=328"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/perfectionatic.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=328"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}