Override Robots.txt With wget

I find myself downloading lots of files from the web when converting sites into my company’s CMS. Whether the source is a static site or another CMS platform, doing this manually sucks. But thanks to wget’s recursive download feature, I can rip through a site and grab all of the images I need, while even preserving the folder structure.
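
For reference, here is the kind of recursive command I mean. It’s just a sketch: the URL, the output folder, and the file extensions are placeholders you would swap for your own.

wget -r -np -A jpg,jpeg,png,gif --wait 1 -P ./site-images http://example.com/

Here -r turns on recursion, -np keeps wget from climbing above the starting directory, -A limits the download to the listed extensions, --wait 1 adds a one-second pause between requests, and -P drops everything into a local folder.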

One thing I found out is that wget respects robots.txt files, so if the site you are trying to copy has one with restrictive settings, wget will only grab what is allowed. This can be overridden with a few tweaks, which I gladly used and decided to pass along. See the instructions at the site below.

Ignoring robots restrictions with wget — bitbucket.org

UPDATE:
Thanks to @jcheshire, who pointed out that wget actually has a built-in setting to ignore robots.txt. The documentation isn’t the greatest, but it makes for a much simpler process.


wget -e robots=off --wait 1 http://your.site.here
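
Combine that with the recursive flags from earlier and a full pull that ignores robots.txt looks something like this (again, the URL and extensions are placeholders):

wget -e robots=off -r -np -A jpg,jpeg,png,gif --wait 1 http://your.site.here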

Link to the actual documentation