

Obtaining Data from Web Sites

Read from http://billposer.org/Linguistics/Computation/Resources.html

Much useful data can be obtained from web sites. In a way, the web is a gigantic collection of electronic corpora. In general, material published on a web site may be used for research, since such use typically qualifies as “fair use” under the law of the United States and many other jurisdictions. There are, however, legal and ethical issues concerning redistribution of material obtained on the web. Further legal and ethical issues arise when you obtain material that is available on the web but that the owner did not intend to be stored or converted to another format. For example, if something is made available only as streaming audio, this may be because the provider does not want you to be able to store a copy. This raises the question of whether you may legally and ethically do so.

Some useful sources of information are:

Bitlaw

  A site created by an intellectual property lawyer containing over 1,800 pages dealing with all aspects of intellectual property law. 

Copyright and Fair Use

  Lots of information on fair use of copyrighted material provided by the Stanford University Library system. 

The Electronic Frontier Foundation

  An organization dedicated to the preservation of freedom on the web. Its web site contains information about such topics as copyright law, file-sharing, and digital rights management. 

Here are some useful tools:

curl

  A tool for transferring files with URL syntax.

DataparkSearch

  A web indexing and search tool.

Getleft

  A tool for downloading web sites; it serves a purpose similar to that of curl and wget but is operated through a graphical interface.

H2Text

  Strips HTML from a file, leaving pure text. (To do this, use the command line: h2text -nc -t < <input file name> > <output file name>.) 

wget

  Automatically downloads web sites, recursively if so desired. Has oodles of options allowing detailed specifications of what links to follow and what kinds of files to download. It is possible, for example, to specify that text files should be downloaded but image files should not be. 
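Two of the ideas above, choosing which links to download by file type and reducing HTML to plain text, can be sketched in a few lines of Python. This is only an illustration using the standard library, not how wget or H2Text work internally. The names `PageScanner`, `wanted`, `scan`, and `ACCEPT` are hypothetical, as is the choice of extensions, and nothing here fetches anything from the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Hypothetical policy: only these extensions are worth downloading.
# Images (.png, .jpg, ...) are skipped simply because they are not listed.
ACCEPT = {".html", ".htm", ".txt"}


class PageScanner(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            # Resolve relative references against the page's own URL.
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when it is not script or stylesheet content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def wanted(url):
    """True if the URL path ends in an accepted extension.

    URLs with no extension (for example a bare directory) are rejected
    in this sketch; a real crawler would probably treat them as HTML.
    """
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in ACCEPT


def scan(html, base_url):
    """Return (links worth downloading, plain text) for one page."""
    scanner = PageScanner(base_url)
    scanner.feed(html)
    scanner.close()
    links = [u for u in scanner.links if wanted(u)]
    return links, " ".join(scanner.text_parts)
```

Calling `scan(page_source, page_url)` on the source of a page returns the absolute URLs that pass the extension filter together with the page's text with markup removed. A real download tool would add to this the fetching itself, a limit on recursion depth, and polite handling of the site's robots.txt.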

Problems sometimes arise in obtaining audio data from the web. You may find these lecture notes on audio data helpful, especially this section.

 
misc/obtainingdatafromweb.txt · Last modified: 2010/05/13 12:35 (external edit)
 