Read from http://billposer.org/Linguistics/Computation/Resources.html
Obtaining Data from Web Sites
Much useful data can be obtained from web sites. In a way, the web is a gigantic collection of electronic corpora. In general, material published on a web site may be used for research, since such use typically qualifies as “fair use” under the law of the United States and has counterparts in many other jurisdictions. There are, however, legal and ethical issues concerning redistribution of material obtained on the web. There are also legal and ethical issues involved in obtaining material that is available on the web but that the owner did not intend to be stored or converted to another format. For example, if something is made available only as streaming audio, this may be because the provider does not want you to be able to store a copy, which raises the question of whether you may legally and ethically do so.
Some useful sources of information are:
Bitlaw
A site created by an intellectual property lawyer containing over 1,800 pages dealing with all aspects of intellectual property law.
Copyright and Fair Use
Lots of information on fair use of copyrighted material provided by the Stanford University Library system.
The Electronic Frontier Foundation
An organization dedicated to the preservation of freedom on the web. Its web site contains information about such topics as copyright law, file-sharing, and digital rights management.
Here are some useful tools:
curl
A tool for transferring files with URL syntax.
DataparkSearch
A web indexing and search tool.
Getleft
Similar to curl but with a graphical interface.
H2Text
Strips HTML from a file, leaving plain text. To do this, use the command line h2text -nc -t < <input file name> > <output file name>
wget
Automatically downloads web sites, recursively if so desired. Has oodles of options allowing detailed specifications of what links to follow and what kinds of files to download. It is possible, for example, to specify that text files should be downloaded but image files should not be.
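Once pages have been downloaded, the markup usually has to be removed before the text can be used as a corpus. The following is a minimal sketch of that step using only Python's standard library. It is not how H2Text works internally; the names TextExtractor and strip_html are invented for this illustration, and the list of block-level tags is an assumption that may need extending for real pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, ignoring script and style content."""

    # Elements whose content is not visible text (illustrative, not exhaustive).
    SKIPPED = {"script", "style"}
    # Elements treated as word boundaries (an assumption; extend as needed).
    BLOCK = {"p", "div", "br", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of markup with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

To process a downloaded file, read it into a string and pass it to strip_html, for example strip_html(open("page.html", encoding="utf-8").read()), where page.html is a placeholder file name. Note that this sketch collapses all whitespace, so paragraph breaks are not preserved.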
Problems sometimes arise in obtaining audio data from the web. You may find these lecture notes on audio data helpful, especially this section.