SGML-SK (Swiss Knife) tokenize, filter, index, merge and make well-formed SGML / XML / HTML documents

url: http://e.nicolas.hernandez.free.fr/pro/doku.php?id=misc:software:sgmlsk

What is SGML-SK?

SGML-SK is a Perl script for handling SGML / XML / HTML documents. It covers common tasks such as tokenizing, filtering, indexing, merging and making documents well-formed.

SK stands for Swiss Knife.

Contact

Please feel free to send me any comments, or to let me know how you use the script: nicolas /dot/ hernandez /at/ univ-nantes /dot/ fr

License

Copyright or © or Copr. Nicolas Hernandez, (2007)

This software is governed by the CeCILL license under French law and abiding by the rules of distribution of free software. You can use, modify and/or redistribute the software under the terms of the CeCILL license as circulated by CEA, CNRS and INRIA at the following URL "http://www.cecill.info".

The CeCILL license is GNU GPL compatible. Downloading or using this software constitutes acceptance of the license.

Download

The latest version is sgmlsk-0.10.4.tgz

Right now the project is not hosted on a forge. Let me know if you are interested in participating in SGML-SK development, or if you make personal adaptations.

Changelog

How to install it?

Requirements

SGML-SK requires the HTML::TagReader Perl module, which can be installed from CPAN:

sudo perl -MCPAN -e 'install HTML::TagReader'

Unpack

tar -xvzf sgmlsk-<version>.tgz

Environment settings

Add the following lines to your .bashrc, setting SGMLSK_HOME to the directory where you unpacked the archive:

export SGMLSK_HOME=...
export PERL5LIB=${SGMLSK_HOME}/lib:${PERL5LIB}
export PATH=${SGMLSK_HOME}:${PATH}
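
A quick way to confirm that these settings took effect is a small shell check such as the one below. This is only a sketch and not part of the SGML-SK distribution: the function name sgmlsk_check_env is hypothetical, and it assumes nothing beyond the three exports above.

```shell
# Hypothetical helper (not shipped with SGML-SK): checks the three
# settings described above and reports the first one that is missing.
sgmlsk_check_env() {
    # SGMLSK_HOME must be set and point to an existing directory
    if [ -z "${SGMLSK_HOME:-}" ] || [ ! -d "$SGMLSK_HOME" ]; then
        echo "SGMLSK_HOME is unset or not a directory"
        return 1
    fi
    # PERL5LIB must contain ${SGMLSK_HOME}/lib as one of its entries
    case ":${PERL5LIB:-}:" in
        *":${SGMLSK_HOME}/lib:"*) ;;
        *) echo "PERL5LIB does not contain ${SGMLSK_HOME}/lib"; return 1 ;;
    esac
    # PATH must contain ${SGMLSK_HOME} as one of its entries
    case ":${PATH}:" in
        *":${SGMLSK_HOME}:"*) ;;
        *) echo "PATH does not contain ${SGMLSK_HOME}"; return 1 ;;
    esac
    echo "environment looks consistent"
}
```

Open a new shell (or source your .bashrc) and call sgmlsk_check_env; a non-zero return status indicates which setting to revisit.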

How to use it?

Syntax:  
[cat STDIN |] \
  ./sgmlsk.pl \
     [--input inputFile[,inputFile]*]  \
        [--load index2LoadFile[,index2LoadFile]* [--criteria (word|line)]] \
           [--(positivefilter|negativefilter) regexp[,regexp]* [--save index2SaveFile]] \
              [--output outputFile]    \
    [--help] [--version] [--verbose]* [--nocritical] 

--input uri[,uri]*                                      : input file uri
--load index2LoadFile[,index2LoadFile]*                 : load tags 
--criteria (word|line)                                  : indexing SGML/XML/HTML tags on words (by default) or lines criteria
--output uri                                            : output file or directory uri
--positivefilter|negativefilter regexp[,regexp]*        : respectively
      * keep tags from input which match one regexp and do not dump them if index required ; (case-insensitive)
      * remove tags from input which match one regexp and dump them if index required ; (case-insensitive)
--save index2SaveFile                                   : trace of the filtered tags             
-h|--help            : this (help) message
--version            : version
--verbose            : --verbose) main algorithm steps ; --verbose --verbose) also debug messages about the main process ; further repetitions add more detail
--nocritical         : do not die on critical warning (used to debug)

Examples:
  # Some help
  ./sgmlsk.pl  -h
  # to tokenize
  cat examples/sample.xml | ./sgmlsk.pl 
  # to filter by removing all tags (either list every tag you do not want, or give an empty list of tags to keep)
  cat examples/sample.xml | ./sgmlsk.pl --pos ""
  # to filter by removing all tags and index them on word position (.dd stands for data dump)
  cat examples/sample.xml | ./sgmlsk.pl  --pos "" -s examples/sample.dd 
  # to filter by removing all tags and index them on line position (.dd stands for data dump)
  cat examples/sample.xml | ./sgmlsk.pl  --pos "" -s examples/sample.dd -c line
  # to filter in order to keep para and sect tags 
  cat examples/sample.xml | ./sgmlsk.pl  --pos para,sect -s examples/sample.dd 
  # to filter by removing some selected tags of HTML documents because of their name or attributes...
  cat examples/sample.xml | ./sgmlsk.pl  --neg xml,DOCTYPE,']]>',HTML,meta,head,link,style,body,'!--',tr,td
  # to filter by removing some selected tags of HTML documents because of their attributes name or/and value...
  cat examples/sample.xml | ./sgmlsk.pl  --neg 'href\s*=\s*"http'
  # to work (here simply as a tokenizer) on a specific input file, producing a specific output file and index
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -s examples/sample.dd
  # to load a previously made index (for instance in order to temporarily remove some tags for some processing)
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
  ./sgmlsk.pl  -i examples/doc.1-1.txt -l examples/doc.1-1.dd 
  # to merge several descriptions of the same text content, by parsing several input files and loading several indexes describing the same document
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
  ./sgmlsk.pl  -i examples/doc.1-2.sgml -pos "" -s examples/doc.1-2.dd -o examples/doc.1-2.txt
  ./sgmlsk.pl  -i examples/doc.1-4.sgml -pos "" -s examples/doc.1-4.dd -o examples/doc.1-4.txt
  # then the following give equivalent results
  ./sgmlsk.pl  -i examples/doc.1-1.sgml,examples/doc.1-2.sgml,examples/doc.1-4.sgml
  # or
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -l examples/doc.1-2.dd,examples/doc.1-4.dd
  # or 
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd
  # filtering and indexing also work on the merged result
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -pos "" -s examples/doc.1-1-2-4.dd
  # make the merged document well-formed
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -w
  # to debug and help development
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -verb -verb -verb 
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -verb -verb -verb  1>/dev/null 2>&1
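
When many documents need the same treatment, the commands above can be wrapped in a loop. The sketch below only prints the command lines it would run (a dry run), so it assumes nothing about the tool beyond the options already shown: -i, --pos "", -s and -o. The function name sgmlsk_strip_cmds is hypothetical, and the .sgml / .dd / .txt naming scheme simply mirrors the doc.1-1 example.

```shell
# Hypothetical helper: for each *.sgml file in a directory, print the
# sgmlsk.pl command that removes all tags, saves them in an index (.dd)
# and writes the remaining text (.txt), as in the doc.1-1 example above.
# Assumes file names without spaces or shell metacharacters.
sgmlsk_strip_cmds() {
    dir=$1
    for f in "$dir"/*.sgml; do
        [ -e "$f" ] || continue          # no match: skip the literal pattern
        base=${f%.sgml}                  # drop the .sgml extension
        printf './sgmlsk.pl -i %s --pos "" -s %s.dd -o %s.txt\n' "$f" "$base" "$base"
    done
}

# Review the commands first, then pipe them to sh to actually run them:
#   sgmlsk_strip_cmds examples
#   sgmlsk_strip_cmds examples | sh
```

Printing the commands before running them makes it easy to check the generated file names, since each .dd index is only meaningful alongside the .txt file it was produced with.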

TODO, FIXME, BEAWARE lists for future release

Ordered by priority

TODO

FIXME

BEAWARE

Development

Run the following script to perform the integration tests:

perl test_sgmlsk.pl 

On success, it should display output like the following:

Ok -- to tokenize
Ok -- to filter by removing all tags (either specify all you do not want or that you want empty)
Ok -- to filter by removing all tags and index them on word position (.dd stands for data dump)
Ok -- to filter by removing all tags and index them on line position (.dd stands for data dump)
Ok -- to filter in order to keep para and sect tags 
Ok -- to load index previously made (for instance in order to temporarly remove some tags for some processing)
Ok -- merge description from several input file
Not Ok But Sort Ok-- merge description from several load with input file
Not Ok But Sort Ok-- merge description from several load with input file and stdin
Ok -- filter and index works on resulting merging
Ok -- make the merged document well-formed
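
Since each line of this report starts with either "Ok" or "Not Ok", the result can be tallied with standard tools. The following is a minimal sketch that assumes exactly this line format; the function name sgmlsk_test_summary is hypothetical and not part of the distribution. Lines marked "Not Ok But Sort Ok" are counted as failures here; adjust the patterns if you consider them acceptable.

```shell
# Hypothetical helper: count passing and failing lines of the test report
# read from standard input. Lines starting with "Ok" count as passes,
# lines starting with "Not Ok" as failures; anything else is ignored.
sgmlsk_test_summary() {
    awk '
        /^Not Ok/ { failed++; next }
        /^Ok/     { passed++ }
        END       { printf "passed=%d failed=%d\n", passed, failed }
    '
}

# Usage: perl test_sgmlsk.pl | sgmlsk_test_summary
```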

How does it work?

Nicolas Hernandez