

SGML-SK (Swiss Knife) tokenize, filter, index, merge and make well-formed SGML / XML / HTML documents

What is SGML-SK ?

SGML-SK is a Perl script for handling SGML / XML / HTML documents in common tasks such as

  • tokenizing tags and text,
  • filtering tags and optionally indexing them,
  • merging tags coming from several documents or index files (resulting from previous filtering),
  • and making them well-formed

SK stands for Swiss Knife.

Contact

Please feel free to send me any comments or to tell me how you use it: nicolas /dot/ hernandez /at/ univ-nantes /dot/ fr

License

Copyright or © or Copr. Nicolas Hernandez, (2007)

This software is governed by the CeCILL license under French law and abiding by the rules of distribution of free software. You can use, modify and/or redistribute the software under the terms of the CeCILL license as circulated by CEA, CNRS and INRIA at the following URL "http://www.cecill.info".

The CeCILL license is GNU GPL compatible. Downloading or using this software implies acceptance of the license.

Download

The latest version is sgmlsk-0.10.4.tgz

Right now the project is not hosted on a forge. Let me know if you are interested in participating in Sgmlsk development or if you make personal adaptations.

Changelog

  • [WHEN] [WHO] [WHAT]
  • 0710 hernandez
    • 0.10.1 documentation POD for modules and CeCILL License
    • 0.10.0 refactor to include load and filter functions into parseLoadAndIndex.pm
    • 0.9.1 refactor to include determine inclusion order and make it well-formed functions into WellFormed.pm
    • 0.9.0 add function determine inclusion order of a given tagged file
    • 0.8.0 add function make it well-formed
    • 0.7.0 add merging functions for input files and load files
    • 0.6.0 refactor to deal with multiple input files and load files
    • 0.5.0 refactor (a better, more generic fix) to deal with HTML::TagReader parser bugs and with loaded index files, which share the property of potentially yielding several tags or words where a single word is expected: save/loadState
    • 0.4.0 add function loading index
    • 0.3.0 add function filtering and generating index
    • 0.2.0 refactor distribute into SGML.pm, text.pm, common.pm the various existing
    • 0.1.3 add function to handle I/O: input from STDIN or from a file given as an option, output to STDOUT or to a file
    • 0.1.2 add private debug module, and command line options handler
    • 0.1.1 fix HTML::TagReader parser by post-processing: handling starting CDATA tags that include tags, and ending CDATA tags
    • 0.1.0 add SGML / XML / HTML tokenizer using the Perl module HTML::TagReader

How to install it ?

Requirements

  • Developed with: perl, v5.8.8 built for i486-linux-gnu-thread-multi
  • Standard Perl packages: File::Basename, Time::Local, Getopt::Long
  • Less standard Perl package: HTML::TagReader; Copyright © Guido Socher. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
sudo perl -MCPAN -e 'install HTML::TagReader'
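
Before unpacking, it can help to confirm that the modules listed above actually load. The following is a minimal sketch and is not part of SGML-SK; the function name check_perl_module is made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SGML-SK): succeed only if the
# given Perl module can be loaded by the perl found in PATH.
check_perl_module() {
  perl -M"$1" -e 1 >/dev/null 2>&1
}

# Modules named in the requirements list above.
for module in File::Basename Time::Local Getopt::Long HTML::TagReader; do
  if check_perl_module "$module"; then
    echo "found:   $module"
  else
    echo "missing: $module"
  fi
done
```

Any module reported as missing can then be installed through CPAN as shown above.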

Unpack

tar -xvzf sgmlsk-<version>.tgz

Environment settings

In your .bashrc, add the following lines

export SGMLSK_HOME=...
export PERL5LIB=${SGMLSK_HOME}/lib:${PERL5LIB}
export PATH=${SGMLSK_HOME}:${PATH}
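
As an illustration, the settings could look like the sketch below. The install location $HOME/opt/sgmlsk is only an assumption made for this example, as is the idea that sgmlsk.pl sits directly in SGMLSK_HOME. Unlike the lines above, this variant avoids leaving a trailing ':' when PERL5LIB is initially empty.

```shell
#!/bin/sh
# Assumption for illustration only: the archive was unpacked into
# $HOME/opt/sgmlsk. Replace this with your actual location.
SGMLSK_HOME="$HOME/opt/sgmlsk"
export SGMLSK_HOME

# ${VAR:+...} appends the old value only when it is non-empty, which avoids
# a trailing ':' (an empty PATH entry means the current directory).
export PERL5LIB="${SGMLSK_HOME}/lib${PERL5LIB:+:${PERL5LIB}}"
export PATH="${SGMLSK_HOME}${PATH:+:${PATH}}"

# Report whether the script is where the settings expect it.
if [ -f "${SGMLSK_HOME}/sgmlsk.pl" ]; then
  echo "sgmlsk.pl found in ${SGMLSK_HOME}"
else
  echo "sgmlsk.pl not found in ${SGMLSK_HOME}; check SGMLSK_HOME"
fi
```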

How to use it ?

Syntax:  
[cat STDIN |] \
  ./sgmlsk.pl \
     [--input inputFile[,inputFile]*]  \
        [--load index2LoadFile[,index2LoadFile]* [--criteria (word|line)]] \
           [--(positivefilter|negativefilter) regexp[,regexp]* [--save index2SaveFile]] \
              [--output outputFile]    \
    [--help] [--version] [--verbose]* [--nocritical] 

--input uri[,uri]*                                      : input file uri
--load index2LoadFile[,index2LoadFile]*                 : load tags 
--criteria (word|line)                                  : indexing SGML/XML/HTML tags on words (by default) or lines criteria
--output uri                                            : output file or directory uri
--positivefilter|negativefilter regexp[,regexp]*        : respectively
      * keep tags from input which match one regexp and do not dump them if index required ; (case-insensitive)
      * remove tags from input which match one regexp and dump them if index required ; (case-insensitive)
--save index2SaveFile                                   : trace of the filtered tags             
-h|--help            : this (help) message
--version            : version
--verbose            : repeatable; once shows the main algorithm steps, twice adds debug messages about the main process, further repetitions add more detail
--nocritical         : do not die on critical warning (used to debug)

Examples:
  # Some help
  ./sgmlsk.pl  -h
  # to tokenize
  cat examples/sample.xml | ./sgmlsk.pl 
  # to filter by removing all tags (either list every tag you do not want, or give an empty list of tags you want)
  cat examples/sample.xml | ./sgmlsk.pl --pos ""
  # to filter by removing all tags and index them on word position (.dd stands for data dump)
  cat examples/sample.xml | ./sgmlsk.pl  --pos "" -s examples/sample.dd 
  # to filter by removing all tags and index them on line position (.dd stands for data dump)
  cat examples/sample.xml | ./sgmlsk.pl  --pos "" -s examples/sample.dd -c line
  # to filter in order to keep para and sect tags 
  cat examples/sample.xml | ./sgmlsk.pl  --pos para,sect -s examples/sample.dd 
  # to filter by removing some selected tags of HTML documents because of their name or attributes...
  cat examples/sample.xml | ./sgmlsk.pl  --neg xml,DOCTYPE,']]>',HTML,meta,head,link,style,body,'!--',tr,td
  # to filter by removing some selected tags of HTML documents because of their attribute names and/or values...
  cat examples/sample.xml | ./sgmlsk.pl  --neg 'href\s*=\s*"http'
  # to work (here as simply a tokenizer) on specific file input and producing a specific file output and index
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -s examples/sample.dd
  # to load a previously made index (for instance in order to temporarily remove some tags for some processing)
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
  ./sgmlsk.pl  -i examples/doc.1-1.txt -l examples/doc.1-1.dd 
  # merge several descriptions of the same text content by parsing several input files and loading several indexes describing the same document
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
  ./sgmlsk.pl  -i examples/doc.1-2.sgml -pos "" -s examples/doc.1-2.dd -o examples/doc.1-2.txt
  ./sgmlsk.pl  -i examples/doc.1-4.sgml -pos "" -s examples/doc.1-4.dd -o examples/doc.1-4.txt
  # then the following give equivalent results
  ./sgmlsk.pl  -i examples/doc.1-1.sgml,examples/doc.1-2.sgml,examples/doc.1-4.sgml
  # or
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -l examples/doc.1-2.dd,examples/doc.1-4.dd
  # or 
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd
  # filter and index also work on the merged result
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -pos "" -s examples/doc.1-1-2-4.dd
  # make the merged document well-formed
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -w
  # to debug and help development
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -verb -verb -verb 
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -verb -verb -verb  1>/dev/null 2>&1
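
The semantics of --positivefilter and --negativefilter can be illustrated with a small stand-alone sketch. It does not call sgmlsk.pl; it only mimics the documented behaviour with grep on a list of tags, one per line. The function name filter_tags is hypothetical, and treating an empty list as "keep nothing" for a positive filter is inferred from the --pos "" examples above.

```shell
#!/bin/sh
# Illustration only: mimics the documented filter semantics with grep,
# it does not call sgmlsk.pl. Reads one tag per line on stdin.
#   $1 = "pos" (keep matching tags) or "neg" (remove matching tags)
#   $2 = comma-separated list of regexps, possibly empty
filter_tags() {
  mode=$1
  # Turn "a,b,c" into the alternation "a|b|c" for grep -E.
  pattern=$(printf '%s' "$2" | tr ',' '|')
  if [ -z "$pattern" ]; then
    # Empty list: a positive filter keeps nothing, a negative one keeps all.
    if [ "$mode" = "neg" ]; then cat; else cat >/dev/null; fi
    return 0
  fi
  if [ "$mode" = "pos" ]; then
    grep -i -E -e "$pattern" || true      # keep tags matching any regexp
  else
    grep -v -i -E -e "$pattern" || true   # drop tags matching any regexp
  fi
}

tags='<sect id="1">
<para>
<B>
</b>
</para>
</sect>'

echo "positive filter on para,sect keeps:"
printf '%s\n' "$tags" | filter_tags pos 'para,sect'
echo "negative filter on para,sect keeps:"
printf '%s\n' "$tags" | filter_tags neg 'para,sect'
```

Matching is case-insensitive, as stated in the option descriptions, so a regexp such as SECT selects both the opening and the closing sect tag.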

TODO, FIXME, BEAWARE lists for future release

Ordered by priority

TODO

  • install script to assist HTML::TagReader install and SGMLSK
    • see linux mag for that
  • handling namespace for document merging
    • see element and namespace pattern in Expat
  • option for config file instead of command line
  • remove the references to opts and separator from the libraries
    • pass them as function parameters instead
  • relax instead of critical option
  • uniform $VERSION and svn $revision$ property if possible…
  • if well-formed required, offer the possibility to turn into empty elements with attributes start and end, identifiers and namespace for them
  • main description of sgmlsk as DATA for the usage message (__END__), displayed below the program, if the variables can be resolved
  • Log4perl instead of debug
  • split into modules and make it faster (by skipping unnecessary repeated steps)
  • make it object-oriented
  • use Storable instead of Data::Dumper ? Because Storable is faster, better, cheaper. But less readable…
    • to discuss
  • option htmltidy by interfacing the lib (specially for HTML documents)
  • simple text tokenizer

FIXME

  • check whether all inputs describe the same document and whether all loaded tag indexes are potentially applicable to it (I don't know how to do the latter)
  • column and line are not correct since content text has several words
  • should not accept directories as --input

BEAWARE

  • HTML::TagReader does not recognize '<![CDATA[ … ]]>' as a tag when it contains tags…
    • as a result, '<![CDATA[ <myTag>' is recognized as a tag and ']]>' as text:
      • fixed at parse time:
        • '<![CDATA[' and '<myTag>' are processed separately as tags
        • ']]>' is considered a tag
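
The kind of post-processing described above can be sketched as follows. This is only an illustration of the idea, not the code used in SGML-SK: it assumes one token per line and splits a token that starts with '<![CDATA[' and continues with another tag. The function name split_cdata is hypothetical.

```shell
#!/bin/sh
# Illustration only (not the SGML-SK implementation): read one token per
# line and split a token such as "<![CDATA[ <myTag>" into "<![CDATA["
# followed by "<myTag>". Other tokens are passed through unchanged.
split_cdata() {
  awk '
    /^<!\[CDATA\[[ \t]*</ {
      rest = substr($0, length("<![CDATA[") + 1)
      sub(/^[ \t]+/, "", rest)
      print "<![CDATA["
      print rest
      next
    }
    { print }
  '
}

printf '%s\n' '<![CDATA[ <myTag>' 'some' 'text' ']]>' | split_cdata
```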

Development

Use the following script to perform integration tests

perl test_sgmlsk.pl 

On success, it should display output like the following

Ok -- to tokenize
Ok -- to filter by removing all tags (either specify all you do not want or that you want empty)
Ok -- to filter by removing all tags and index them on word position (.dd stands for data dump)
Ok -- to filter by removing all tags and index them on line position (.dd stands for data dump)
Ok -- to filter in order to keep para and sect tags 
Ok -- to load index previously made (for instance in order to temporarly remove some tags for some processing)
Ok -- merge description from several input file
Not Ok But Sort Ok-- merge description from several load with input file
Not Ok But Sort Ok-- merge description from several load with input file and stdin
Ok -- filter and index works on resulting merging
Ok -- make the merged document well-formed
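
With a longer list of tests, a small filter can make such output easier to scan. The sketch below is a hypothetical helper, not part of the distribution; it assumes that test_sgmlsk.pl prints one result per line on standard output in the format shown above.

```shell
#!/bin/sh
# Hypothetical helper: summarize the output of test_sgmlsk.pl read from
# stdin. Lines starting with "Ok" count as passed; lines starting with
# "Not Ok" are echoed for review.
summarize_results() {
  awk '
    /^Ok/     { passed++ }
    /^Not Ok/ { flagged++; print "review: " $0 }
    END       { printf "%d passed, %d to review\n", passed, flagged }
  '
}

# Typical use (assumes test_sgmlsk.pl is in the current directory):
#   perl test_sgmlsk.pl | summarize_results
```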

How does it work ?

  • Filtering is done with regexps applied to tags.
  • Indexing and merging rely on word positions (by default) or line positions.
  • It is fast when dealing with a single document and/or a single loaded index.
  • Above all, the options can be combined in very expressive ways.
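
The idea of indexing tags on word positions can be sketched with a toy example. It is a simplified illustration under stated assumptions, not the actual algorithm: the input is one token per line, the function names strip_tags and restore_tags are hypothetical, and the tab-separated index is made up for this sketch (it is not the .dd data dump format).

```shell
#!/bin/sh
# Illustration only: a toy version of "index tags on word position".

# strip_tags INDEX_FILE: read one token per line on stdin (a tag or a word),
# print the words, and record each tag in INDEX_FILE together with the
# number of words that precede it.
strip_tags() {
  awk -v index_file="$1" '
    BEGIN  { words = 0; printf "" > index_file }
    /^</   { print words "\t" $0 > index_file; next }
           { words++; print }
  '
}

# restore_tags INDEX_FILE: read words on stdin (one per line) and re-insert
# the recorded tags at their word positions.
restore_tags() {
  awk -v index_file="$1" '
    BEGIN {
      while ((getline line < index_file) > 0) {
        tab = index(line, "\t")
        pos = substr(line, 1, tab - 1)
        tags[pos] = tags[pos] substr(line, tab + 1) "\n"
      }
      words = 0
    }
    { printf "%s", tags[words]; words++; print }
    END { printf "%s", tags[words] }
  '
}

tokens='<sect>
<para>
Hello
<b>
world
</b>
</para>
</sect>'

tmp_index=$(mktemp)
text=$(printf '%s\n' "$tokens" | strip_tags "$tmp_index")
echo "text without tags:"
printf '%s\n' "$text"
echo "tags restored from the index:"
printf '%s\n' "$text" | restore_tags "$tmp_index"
rm -f "$tmp_index"
```

In this toy format, concatenating two index files before calling restore_tags inserts the tags from both, but tags that share a position may then come out in an order that does not nest properly, which is the kind of situation the well-formedness step is meant to address.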

 
misc/software/sgmlsk.txt · Last modified: 2010/05/13 12:35 (external edit)
 