
Text Analysis Toolkit

Toolbox contents

character encoding recognition (iso...) and conversion

recognition

file -i <file>
utrac -p <file>
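
Both commands above guess the encoding heuristically. When the only question is whether a file is valid UTF-8, a stricter check is to let iconv try to decode it. The sketch below is one possible way to do that; the helper name is_utf8 is hypothetical and is not part of file, utrac or iconv. Note that pure ASCII files also pass, since ASCII is a subset of UTF-8.

```shell
#!/bin/sh
# is_utf8 FILE: succeed if FILE decodes cleanly as UTF-8.
# (is_utf8 is a hypothetical helper name chosen for this example.)
is_utf8() {
    # iconv exits with a non-zero status as soon as it meets a byte
    # sequence that is invalid in the declared source encoding.
    iconv -f utf-8 -t utf-8 <"$1" >/dev/null 2>&1
}

# Report on every file given on the command line.
for f in "$@"; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not UTF-8"
    fi
done
```

Saved as, say, check_utf8.sh (an assumed file name), it can be run as: sh check_utf8.sh *.txt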

conversion

utrac -f ISO-8859-1 -t UTF-8  fichier.iso.txt > fichier.utf.txt

Convert a file to UTF-8 and back

iconv -f iso-8859-1 -t utf-8 <in >out    # to UTF-8
iconv -f utf-8 -t iso-8859-1 <in >out    # to Latin-1
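
The two directions are not symmetric: any Latin-1 file can be converted to UTF-8, but the reverse fails when the text contains characters that Latin-1 lacks (for instance € or œ), and iconv then stops with an error and leaves a truncated output file. The sketch below wraps the second command so that the destination is only written when the conversion succeeds. The helper name to_latin1 is hypothetical.

```shell
#!/bin/sh
# to_latin1 SRC DST: convert SRC (UTF-8) to DST (ISO-8859-1).
# DST is only created if the whole file could be converted.
# (to_latin1 is a hypothetical helper name chosen for this example.)
to_latin1() {
    tmp=$(mktemp) || return 1
    if iconv -f utf-8 -t iso-8859-1 <"$1" >"$tmp"; then
        # Conversion succeeded: move the result into place.
        mv "$tmp" "$2"
    else
        # Invalid UTF-8 input, or a character with no Latin-1 equivalent.
        rm -f "$tmp"
        echo "$1: cannot be converted to Latin-1" >&2
        return 1
    fi
}
```

Usage, with the in and out file names from the commands above: to_latin1 in out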

document format detection (text raw, HTML, pdf, XML-TEI...) and conversion

detection

file -i <file> 
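
To sort a directory of documents by format, the MIME type reported by file can be mapped to a coarse label. This is a minimal sketch, assuming a version of file that supports the --mime-type long option; the helper name doc_format and the labels are hypothetical. file only inspects the start of the content, so it cannot tell XML-TEI from other XML: that would require looking at the root element.

```shell
#!/bin/sh
# doc_format FILE: print a coarse format label for FILE.
# (doc_format and its labels are hypothetical, chosen for this example.)
doc_format() {
    # -b omits the file name from the output; --mime-type prints only the type.
    mime=$(file -b --mime-type "$1") || return 1
    case "$mime" in
        text/html)                echo html ;;
        text/xml|application/xml) echo xml ;;
        application/pdf)          echo pdf ;;
        text/*)                   echo text ;;
        *)                        echo "other ($mime)" ;;
    esac
}

# Label every file given on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(doc_format "$f")"
done
```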

conversion

Manipulation

cat input/sample.xml | sgml2token.pl 

References

Nicolas Hernandez