Text Analysis Toolkit

What's in the box

character encoding recognition (ISO…) and conversion to UTF-8
document format detection (plain text, HTML, PDF, XML-TEI…) and conversion to a tagged format (one that accepts multiple annotations)
- internal anchors and external tags describing the features (name-value pairs) of annotated units
language detection
text normalization/homogenization (depending on the resources and processing wanted), according to the processing level (from character to word, from syntax to semantics)
- character case (UPPERCASE to lowercase),
- diacritic characters (cedilla, accents, …) to their equivalents without diacritics,
- punctuation marks to a single sign,
- certain words to their grammatical category (determiner, preposition), the representative of each lexical chain
pre-processing
- abbreviations, initialisms, acronyms (an initialism pronounced as an ordinary word);
- indefinite pronouns
word tokenization
sentence tokenization (splitter)
morpho-syntactic analysis
- lemmatizer
- part-of-speech tagging (tagger)
- syntactic function (chunker/parser)
text classification
word-sense disambiguation; construction of lexical chains
named entity recognition
anaphora resolution
segmentation into “thematic”/functional discourse segments
semantic-rhetorical tagger for utterances and for the relations between utterances
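
As a rough illustration of two of the simpler steps listed above (case normalization and word tokenization), here is a minimal sketch using only tr. The function names lowercase_ascii and tokenize_words are hypothetical, and the approach is deliberately crude: it lowercases ASCII letters only and splits on ASCII whitespace and punctuation, so it is a stand-in for a real tokenizer, not a replacement for one.

```shell
# Hypothetical helper names; tr is the only external tool used.

# Lowercase ASCII letters only. With LC_ALL=C, tr works byte by byte,
# so accented capitals such as É are left unchanged.
lowercase_ascii() {
    LC_ALL=C tr '[:upper:]' '[:lower:]'
}

# One token per line: every run of ASCII whitespace or punctuation
# becomes a single newline. The bytes of multi-byte UTF-8 characters are
# neither whitespace nor punctuation in the C locale, so they pass through intact.
tokenize_words() {
    LC_ALL=C tr -s '[:space:][:punct:]' '\n'
}

printf 'The cat sat. The CAT ran!\n' | lowercase_ascii | tokenize_words
```

The last line prints six tokens, one per line (the, cat, sat, the, cat, ran). Because the apostrophe counts as punctuation, French clitics such as "l'encodage" are split into "l" and "encodage", which may or may not be what a later processing step expects.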

recognition

file -i <file>

reports the file's encoding; see also utrac -P, which lists the candidate encodings in order of relevance

utrac -p <file>
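
When neither tool gives a conclusive answer, one common fallback is to check whether the file at least decodes as UTF-8. The sketch below assumes an iconv binary is on the PATH; is_valid_utf8 is a hypothetical helper name, and the two sample files are created only for the demonstration.

```shell
# Hypothetical helper: succeeds (exit status 0) when the file decodes as UTF-8.
# iconv stops with a non-zero status on the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/utf8.txt"     # é as the two bytes C3 A9
printf 'caf\351\n'     > "$tmpdir/latin1.txt"   # é as the single byte E9

for f in "$tmpdir/utf8.txt" "$tmpdir/latin1.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not UTF-8 (possibly ISO-8859-1 or another 8-bit encoding)"
    fi
done
rm -r "$tmpdir"
```

A pure ASCII file also passes this check, since ASCII is a subset of UTF-8. A failure only says that the file is not UTF-8; it does not say which 8-bit encoding it actually uses.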

conversion

UTRAC stands for Universal Text Recognizer and Converter. It is a command-line tool and library that recognizes the encoding of an input file (e.g. UTF-8, ISO-8859-1, CP437…) and its end-of-line type (CR, LF, CRLF). http://utrac.sourceforge.net/

utrac -f ISO-8859-1 -t UTF-8  fichier.iso.txt > fichier.utf.txt

iconv: the original GNU encoding conversion tool, a command-line program based on libiconv.

Converting a file to UTF-8 and back

iconv -f iso-8859-1 -t utf-8 <in >out    # to UTF-8
iconv -f utf-8 -t iso-8859-1 <in >out    # to Latin-1
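
To see what the two commands above do at the byte level, here is a small round-trip sketch. It assumes iconv, od, and tr are available; the wrapper names latin1_to_utf8, utf8_to_latin1, and hex are hypothetical and exist only to keep the demonstration readable.

```shell
# Hypothetical wrappers around the two iconv invocations (stdin to stdout).
latin1_to_utf8() { iconv -f ISO-8859-1 -t UTF-8; }
utf8_to_latin1() { iconv -f UTF-8 -t ISO-8859-1; }

# Print stdin as a compact lowercase hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# "café" in Latin-1: é is the single byte E9.
printf 'caf\351\n' | hex; echo                                     # 636166e90a
# After conversion, é becomes the two bytes C3 A9.
printf 'caf\351\n' | latin1_to_utf8 | hex; echo                    # 636166c3a90a
# Converting back restores the original bytes.
printf 'caf\351\n' | latin1_to_utf8 | utf8_to_latin1 | hex; echo   # 636166e90a
```

The round trip is lossless here because every character in the sample exists in ISO-8859-1. Going from UTF-8 to Latin-1 fails for characters outside that repertoire.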

recode: a successor to iconv, but with a somewhat peculiar command-line syntax.
siconv: a stream-oriented counterpart to iconv, using libiconv, the same library that underlies iconv. It can handle larger amounts of data than iconv.

detection

file -i <file>
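
Tools such as file recognize many formats from the signature bytes at the start of a file. The sketch below illustrates that idea for a few of the formats covered in this section. sniff_format is a hypothetical helper, not part of any tool named here, and it checks only the first five bytes, so it is far less thorough than file.

```shell
# Hypothetical helper: guess a format from the first bytes of a file.
sniff_format() {
    # Read the first 5 bytes; NUL bytes are dropped because shell
    # variables cannot hold them.
    magic=$(LC_ALL=C head -c 5 "$1" | LC_ALL=C tr -d '\000')
    case $magic in
        '%PDF-') echo pdf ;;          # PDF files start with %PDF-
        '%!PS'*) echo postscript ;;   # PostScript files start with %!PS
        '{\rtf') echo rtf ;;          # RTF files start with {\rtf
        '<?xml') echo xml ;;          # XML declaration, when present
        *)       echo unknown ;;
    esac
}

tmpdir=$(mktemp -d)
printf '%%PDF-1.4\n'       > "$tmpdir/a"
printf '%%!PS-Adobe-3.0\n' > "$tmpdir/b"
printf '{\\rtf1\\ansi}'    > "$tmpdir/c"
printf 'plain text\n'      > "$tmpdir/d"
for f in a b c d; do
    echo "$f: $(sniff_format "$tmpdir/$f")"
done
rm -r "$tmpdir"
```

HTML and plain text have no fixed signature, which is one reason a dedicated detector remains preferable for anything beyond a quick check.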

conversion

Any to text – The Multivalent Document Tools can extract text from a number of formats, including PDF and HTML. http://multivalent.sourceforge.net/Tools/index.html
Portable Document Format (PDF)
- to text/ps – PDFtotext extracts plain text from PDF files. It is part of the xpdf package, which also provides a PDF file viewer and some other tools. http://www.foolabs.com/xpdf/
- to html – pdftohtml, based on xpdf 2.02 by Derek Noonburg. http://pdftohtml.sourceforge.net/
Open Document Format (ODF) to text – odt2txt is a simple command-line tool that extracts plain text from ODF. http://stosberg.net/odt2txt/
Postscript – PSToText extracts plain text (in the ISO-8859-1 extended ASCII encoding) from Postscript files. http://pages.cs.wisc.edu/~ghost/doc/pstotext.htm
TeX and LaTeX – TeX TTH translates TeX and LaTeX into HTML, from which it can be converted to plain text by an HTML converter. http://hutchinson.belmont.ma.us/tth/
Rich Text Format (RTF) – Rtfeeder converts RTF to HTML. Since it is a Perl script, it should run on any platform. http://www.theory.org/%7Ematt/perl/rtfeeder/
Microsoft Word – Antiword is able to convert Word documents to plain text, to PostScript, to PDF and to XML/DocBook. http://www.winfield.demon.nl/

Manipulation

cat input/sample.xml | sgml2token.pl

References

Converting DocBook XML/SGML documents with OpenJade http://www.ibiblio.org/pub/Linux/docs/HOWTO/translations/fr/html-1page/DocBook-OpenJade-SGML-XML-HOWTO.html
UTF-8 and Linux http://www.haypocalc.com/wiki/UTF-8_et_Linux
Discussion and comparison of Utrac and other detectors/converters http://linuxfr.org/~calandoa/16251.html
Computational Resources for Linguistic Research, including material on encodings and format conversions (RTF, PS… to text or HTML)… http://billposer.org/Linguistics/Computation/Resources.html
File headers (HTML, Email, HTTP, LINK) http://www.cis.uni-muenchen.de/~wastl/kurse/kut/locale.html
SGML handling http://www.jclark.com/sp/