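SGML-SK is a Perl script for handling SGML / XML / HTML documents. It covers common tasks such as tokenizing, filtering tags by regular expression, indexing tags by word or line position, merging several descriptions of the same text, and making a merged document well-formed.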
SK stands for Swiss Knife.
Please feel free to send me comments or to tell me how you use it: nicolas /dot/ hernandez /at/ univ-nantes /dot/ fr
Copyright or © or Copr. Nicolas Hernandez, (2007)
This software is governed by the CeCILL license under French law and abides by the rules of distribution of free software. You can use, modify and/or redistribute the software under the terms of the CeCILL license as circulated by CEA, CNRS and INRIA at the following URL “http://www.cecill.info”.
The CeCILL license is GNU GPL compatible. Downloading or using this software implies acceptance of the license agreement.
The latest version is sgmlsk-0.10.4.tgz
Right now the project is not hosted on a forge. Let me know if you are interested in participating in Sgmlsk development or if you make personal adaptations.
sudo perl -MCPAN -e 'install HTML::TagReader'
tar -xvzf sgmlsk-<version>.tgz
In your .bashrc, add the following lines:
export SGMLSK_HOME=...
export PERL5LIB=${SGMLSK_HOME}/lib:${PERL5LIB}
export PATH=${SGMLSK_HOME}:${PATH}
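Before running the examples, it can help to confirm that the variables above are in effect. The following is a minimal sketch of such a check. The function name `check_sgmlsk_env` is hypothetical and not part of the SGML-SK distribution, and the sketch assumes that `sgmlsk.pl` sits directly in `$SGMLSK_HOME` with its modules under `$SGMLSK_HOME/lib`, as the `PATH` and `PERL5LIB` lines suggest.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SGML-SK): sanity-check the
# environment variables that the .bashrc lines are meant to set.
check_sgmlsk_env() {
    # SGMLSK_HOME must be set and point at an existing directory.
    if [ -z "${SGMLSK_HOME:-}" ] || [ ! -d "${SGMLSK_HOME}" ]; then
        echo "SGMLSK_HOME is unset or not a directory"
        return 1
    fi
    # Assumption: the script sits directly in SGMLSK_HOME.
    if [ ! -f "${SGMLSK_HOME}/sgmlsk.pl" ]; then
        echo "sgmlsk.pl not found in ${SGMLSK_HOME}"
        return 1
    fi
    # PERL5LIB should list ${SGMLSK_HOME}/lib so Perl can find the modules.
    case ":${PERL5LIB:-}:" in
        *":${SGMLSK_HOME}/lib:"*) ;;
        *) echo "PERL5LIB does not contain ${SGMLSK_HOME}/lib"; return 1 ;;
    esac
    # PATH should list SGMLSK_HOME so sgmlsk.pl can be found.
    case ":${PATH}:" in
        *":${SGMLSK_HOME}:"*) ;;
        *) echo "PATH does not contain ${SGMLSK_HOME}"; return 1 ;;
    esac
    echo "SGML-SK environment looks fine"
}
```

Calling `check_sgmlsk_env` in a new shell prints the first problem it finds, if any. Separately, `perl -MHTML::TagReader -e 1` exits silently when the HTML::TagReader module is installed and reports an error otherwise.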
Syntax:
[cat STDIN |] \
./sgmlsk.pl \
[--input inputFile[,inputFile]*] \
[--load index2LoadFile[,index2LoadFile]* [--criteria (word|line)]] \
[--(positivefilter|negativefilter) regexp[,regexp]* [--save index2SaveFile]] \
[--output outputFile] \
[--help] [--version] [--verbose]* [--nocritical]
--input uri[,uri]* : input file uri
--load index2LoadFile[,index2LoadFile]* : load tags
--criteria (word|line) : indexing SGML/XML/HTML tags on words (by default) or lines criteria
--output uri : output file or directory uri
--positivefilter|negativefilter regexp[,regexp]* : respectively
    * keep tags from input which match one regexp and do not dump them if index required ; (case-insensitive)
    * remove tags from input which match one regexp and dump them if index required ; (case-insensitive)
--save index2SaveFile : trace of the filtered tags
-h|--help : this (help) message
--version : version
--verbose : --verbose) main algorithm steps ; --verbose --verbose) with debug messages concerning main process ; other depending on wanted details
--nocritical : do not die on critical warning (used to debug)
Examples:
# Some help
./sgmlsk.pl -h
# to tokenize
cat examples/sample.xml | ./sgmlsk.pl
# to filter by removing all tags (either specify all you do not want or that you want empty)
cat examples/sample.xml | ./sgmlsk.pl --pos ""
# to filter by removing all tags and index them on word position (.dd stands for data dump)
cat examples/sample.xml | ./sgmlsk.pl --pos "" -s examples/sample.dd
# to filter by removing all tags and index them on line position (.dd stands for data dump)
cat examples/sample.xml | ./sgmlsk.pl --pos "" -s examples/sample.dd -c line
# to filter in order to keep para and sect tags
cat examples/sample.xml | ./sgmlsk.pl --pos para,sect -s examples/sample.dd
# to filter by removing some selected tags of HTML documents because of their name or attributes...
cat examples/sample.xml | ./sgmlsk.pl --neg xml,DOCTYPE,']]>',HTML,meta,head,link,style,body,'!--',tr,td
# to filter by removing some selected tags of HTML documents because of their attribute names and/or values...
cat examples/sample.xml | ./sgmlsk.pl --neg 'href\s*=\s*"http'
# to work (here simply as a tokenizer) on a specific input file, producing a specific output file and index
./sgmlsk.pl -i examples/sample.xml -o examples/sample.xml -s examples/sample.dd
# to load an index previously made (for instance in order to temporarily remove some tags for some processing)
./sgmlsk.pl -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
./sgmlsk.pl -i examples/doc.1-1.txt -l examples/doc.1-1.dd
# merge several descriptions of the same text content by parsing several input files and loading several indexes describing the same document
./sgmlsk.pl -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
./sgmlsk.pl -i examples/doc.1-2.sgml -pos "" -s examples/doc.1-2.dd -o examples/doc.1-2.txt
./sgmlsk.pl -i examples/doc.1-4.sgml -pos "" -s examples/doc.1-4.dd -o examples/doc.1-4.txt
# then the following give equivalent results
./sgmlsk.pl -i examples/doc.1-1.sgml,examples/doc.1-2.sgml,examples/doc.1-4.sgml
# or
./sgmlsk.pl -i examples/doc.1-1.sgml -l examples/doc.1-2.dd,examples/doc.1-4.dd
# or
cat examples/doc.1-1.sgml | ./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd
# filter and index work on the resulting merge
cat examples/doc.1-1.sgml | ./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -pos "" -s examples/doc.1-1-2-4.dd
# make the merged document well-formed
cat examples/doc.1-1.sgml | ./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -w
# to debug and help development
./sgmlsk.pl -i examples/sample.xml -o examples/sample.xml -verb -verb -verb
./sgmlsk.pl -i examples/sample.xml -o examples/sample.xml -verb -verb -verb 1>/dev/null 2>&1
Ordered by priority
Use the following script to perform integration tests:
perl test_sgmlsk.pl
On success, it should display output such as:
Ok -- to tokenize
Ok -- to filter by removing all tags (either specify all you do not want or that you want empty)
Ok -- to filter by removing all tags and index them on word position (.dd stands for data dump)
Ok -- to filter by removing all tags and index them on line position (.dd stands for data dump)
Ok -- to filter in order to keep para and sect tags
Ok -- to load index previously made (for instance in order to temporarly remove some tags for some processing)
Ok -- merge description from several input file
Not Ok But Sort Ok-- merge description from several load with input file
Not Ok But Sort Ok-- merge description from several load with input file and stdin
Ok -- filter and index works on resulting merging
Ok -- make the merged document well-formed
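To get a quick tally from such a report, the lines can be counted by their prefix. The sketch below is one way to do it. The function name `summarize_report` is hypothetical, and it assumes that every result line starts with either `Ok` or `Not Ok`, as in the sample above, and that the report is written to standard output. Lines labelled `Not Ok But Sort Ok` are counted as failures here; the label presumably means the output matches the reference only after sorting, but that is a reading of the label, not something stated on this page.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SGML-SK): count passing and
# failing lines in a test report read from standard input.
summarize_report() {
    awk '
        /^Not Ok/ { failed++; next }   # includes "Not Ok But Sort Ok" lines
        /^Ok/     { passed++ }
        END       { printf "passed=%d failed=%d\n", passed, failed }
    '
}
```

Used as `perl test_sgmlsk.pl | summarize_report`, it would print `passed=9 failed=2` for the sample report shown above.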