How to non-interactively retrieve files from the Web


Overview

Wget is a network utility that retrieves files from the Web using HTTP and FTP, the two most widely used Internet protocols. It works non-interactively, so it can keep running in the background after you have logged off. The program supports recursive retrieval of web pages as well as FTP sites. You can use Wget to mirror archives and home pages, or to traverse the Web like a WWW robot.

Examples

The examples are divided into three sections for clarity. The first section is a tutorial for beginners. The second section explains some of the more complex program features. The third section contains advice for mirror administrators, as well as even more complex features (that some would call perverted).

Simple Usage

  • Say you want to download a URL. Just type:

wget http://foo.bar.com/

  • But what will happen if the connection is slow and the file is lengthy? The connection will probably fail before the whole file is retrieved, perhaps more than once. In that case, Wget will keep trying until it either retrieves the whole file or exceeds the default number of retries (which is 20). It is easy to change the number of tries to 45, to ensure that the whole file arrives safely:

wget --tries=45 http://foo.bar.com/jpg/flyweb.jpg

  • Now let's leave Wget to work in the background and write its progress to the log file 'log'. It is tiring to type '--tries', so we shall use '-t'.

wget -t 45 -o log http://foo.bar.com/jpg/flyweb.jpg &

The ampersand at the end of the line makes sure that Wget works in the background. To allow an unlimited number of retries, use '-t inf'.
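
For example, the previous command with unlimited retries would look like this:

wget -t inf -o log http://foo.bar.com/jpg/flyweb.jpg &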

  • Using FTP is just as simple. Wget takes care of the login and password:

wget ftp://foo.bar.com/welcome.msg

ftp://foo.bar.com/welcome.msg
=> 'welcome.msg'
Connecting to foo.bar.com:21... connected!
Logging in as anonymous ... Logged in!
==> TYPE I ... done. ==> CWD not needed.
==> PORT ... done. ==> RETR welcome.msg ... done.

  • Download the Oracle 9i manuals to your local directory using Cygwin's wget:

wget -q --tries=45 -r \
http://download-east.oracle.com/otndoc/oracle9i/901_doc

Advanced Usage

  • Would you like to read the list of URLs from a file? No problem:

wget -i file

If you specify '-' as the file name, the URLs will be read from standard input.
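
For example, assuming a file named 'urls.txt' (a hypothetical name) containing one URL per line, you could pipe it to Wget like this:

cat urls.txt | wget -i -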

  • Create a mirror image of a WWW site (with the same directory structure the original has) with only one try per document, saving the log of the activities to 'gnulog':

wget -r -t1 http://foo.bar.com/ -o gnulog

  • Retrieve the first level of links from the Yahoo home page:

wget -r -l1 http://www.yahoo.com/

  • Retrieve the index.html of 'www.lycos.com', showing the original server headers:

wget -S http://www.lycos.com/

  • You want to download all the GIFs from an HTTP directory. The command 'wget http://host/dir/*.gif' doesn't work, since HTTP retrieval does not support globbing. In that case, use:

wget -r -l1 --no-parent -A.gif http://host/dir/

It is a bit of a kludge, but it works perfectly. '-r -l1' means to retrieve recursively, with a maximum depth of 1. '--no-parent' means that references to the parent directory are ignored, and '-A.gif' means to download only GIF files. '-A "*.gif"' would have worked too.
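
Spelled out with the quoted pattern, the same command would read:

wget -r -l1 --no-parent -A "*.gif" http://host/dir/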

  • Suppose you were in the middle of a download when Wget was interrupted, and now you do not want to clobber the files already present. Use:

wget -nc -r http://foo.bar.com/

  • If you want to encode your own username and password into an HTTP or FTP request, use the appropriate URL syntax:

wget ftp://name:password@foo.bar.com/myfile
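
The same URL syntax works for HTTP as well, for example:

wget http://name:password@foo.bar.com/myfile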

Special Usage

  • If you wish Wget to keep a mirror of a page (or FTP subdirectories), use '--mirror', which is shorthand for '-r -N -l inf --no-remove-listing'. You can put Wget in the crontab file, asking it to recheck a site each Sunday:

0 0 * * 0 wget --mirror ftp://x.y.z/pub -o /var/weeklog

  • You may wish to do the same with someone's home page, but you do not want to download all the images; you are only interested in the HTML files:

wget --mirror -A.html http://www.w3.org/

More Information

You can find the wget sources and full documentation at the following links:

http://www.gnu.org/software/wget/wget.html
http://www.lns.cornell.edu/public/COMP/info/wget/wget_toc.html
http://www.interlog.com/~tcharron/wgetwin.html