SGML-SK (Swiss Knife) tokenize, filter, index, merge and make well-formed SGML / XML / HTML documents

url: http://e.nicolas.hernandez.free.fr/pro/doku.php?id=misc:software:sgmlsk

What is SGML-SK ?

SGML-SK is a Perl script for handling SGML / XML / HTML documents. It covers common tasks such as

tokenizing tags and text,
filtering tags and optionally indexing them,
merging tags coming from several documents or index files (resulting from previous filtering),
and making the documents well-formed.

SK stands for Swiss Knife.

Contact

Please feel free to send me any comments, or to let me know how you use the software: nicolas /dot/ hernandez /at/ univ-nantes /dot/ fr

License

Copyright or © or Copr. Nicolas Hernandez (2007)

This software is governed by the CeCILL license under French law and abiding by the rules of distribution of free software. You can use, modify and/or redistribute the software under the terms of the CeCILL license as circulated by CEA, CNRS and INRIA at the following URL “http://www.cecill.info”.

The CeCILL license is GNU GPL compatible. Downloading or using this software implies acceptance of the license.

Download

The latest version is sgmlsk-0.10.4.tgz

Right now the project is not hosted on a forge. Let me know if you are interested in participating in Sgmlsk development or if you make personal adaptations.

Changelog

[WHEN] [WHO] [WHAT]

0710 hernandez
- 0.10.1 documentation POD for modules and CeCILL License
- 0.10.0 refactor: move the load and filter functions into parseLoadAndIndex.pm
- 0.9.1 refactor: move the "determine inclusion order" and "make it well-formed" functions into WellFormed.pm
- 0.9.0 add function to determine the inclusion order of a given tagged file
- 0.8.0 add function to make a document well-formed
- 0.7.0 add merging functions for input files and load files
- 0.6.0 refactor to deal with multiple input files and load files
- 0.5.0 refactor (a better fix that allows a generic approach) to deal with HTML::TagReader parser bugs and with loaded index files, which share the property of potentially yielding several tags or words for a single expected word: save/loadState
- 0.4.0 add function loading index
- 0.3.0 add function filtering and generating index
- 0.2.0 refactor: distribute the various existing functions into SGML.pm, text.pm, common.pm
- 0.1.3 add functions to handle I/O: read from STDIN or from an input file given as an option, write to STDOUT or to an output file
- 0.1.2 add private debug module, and command line options handler
- 0.1.1 fix HTML::TagReader parser by post-processing: handle starting CDATA tags that include tags, and ending CDATA tags
- 0.1.0 add SGML / XML / HTML tokenizer using the Perl module HTML::TagReader

How to install it ?

Requirements

Developed with: perl, v5.8.8 built for i486-linux-gnu-thread-multi
Standard Perl packages: File::Basename, Time::Local, Getopt::Long
Less standard Perl package: HTML::TagReader. Copyright © Guido Socher. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
- Direct link: http://main.linuxfocus.org/~guido/HTML-TagReader-1.10.tar.gz

sudo perl -MCPAN -e 'install HTML::TagReader'

Unpack

tar -xvzf sgmlsk-<version>.tgz

Environment settings

In your .bashrc, add the following lines

export SGMLSK_HOME=...
export PERL5LIB=${SGMLSK_HOME}/lib:${PERL5LIB}
export PATH=${SGMLSK_HOME}:${PATH}
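
The settings above can be sanity-checked with a small shell function. This is only a sketch: check_sgmlsk_env is a hypothetical helper, not part of SGML-SK, and it assumes that sgmlsk.pl sits directly in SGMLSK_HOME (as the PATH line suggests).

```shell
# check_sgmlsk_env: hypothetical helper (not part of SGML-SK) that reports
# whether the environment described above looks usable.
# Assumption: sgmlsk.pl is located directly in $SGMLSK_HOME.
check_sgmlsk_env() {
  if [ -z "${SGMLSK_HOME:-}" ]; then
    echo "SGMLSK_HOME: not set"
    return 1
  fi
  if [ -f "${SGMLSK_HOME}/sgmlsk.pl" ]; then
    echo "sgmlsk.pl: found"
  else
    echo "sgmlsk.pl: not found in ${SGMLSK_HOME}"
    return 1
  fi
  # perl exits with a non-zero status if the module cannot be loaded
  if perl -MHTML::TagReader -e 1 2>/dev/null; then
    echo "HTML::TagReader: found"
  else
    echo "HTML::TagReader: not installed"
    return 1
  fi
}
```

After opening a new shell (so that .bashrc is read), calling check_sgmlsk_env prints one line per check and stops at the first problem.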

How to use it ?

Syntax:  
[cat STDIN |] \
  ./sgmlsk.pl \
     [--input inputFile[,inputFile]*]  \
        [--load index2LoadFile[,index2LoadFile]* [--criteria (word|line)]] \
           [--(positivefilter|negativefilter) regexp[,regexp]* [--save index2SaveFile]] \
              [--output outputFile]    \
    [--help] [--version] [--verbose]* [--nocritical] 

--input uri[,uri]*                                      : input file uri
--load index2LoadFile[,index2LoadFile]*                 : load tags 
--criteria (word|line)                                  : index SGML/XML/HTML tags by word (default) or by line
--output uri                                            : output file or directory uri
--positivefilter|negativefilter regexp[,regexp]*        : respectively
      * keep tags from input which match one regexp and do not dump them if index required ; (case-insensitive)
      * remove tags from input which match one regexp and dump them if index required ; (case-insensitive)
--save index2SaveFile                                   : trace of the filtered tags             
-h|--help            : this (help) message
--version            : version
--verbose            : repeatable; once: main algorithm steps ; twice: adds debug messages concerning the main process ; more: further details as needed
--nocritical         : do not die on critical warning (used to debug)

Examples:
  # Some help
  ./sgmlsk.pl  -h

  # to tokenize
  cat examples/sample.xml | ./sgmlsk.pl

  # to filter by removing all tags (either list every tag you do not want, or give an empty list of tags to keep)
  cat examples/sample.xml | ./sgmlsk.pl --pos ""

  # to filter by removing all tags and index them on word position (.dd stands for data dump)
  cat examples/sample.xml | ./sgmlsk.pl  --pos "" -s examples/sample.dd

  # to filter by removing all tags and index them on line position (.dd stands for data dump)
  cat examples/sample.xml | ./sgmlsk.pl  --pos "" -s examples/sample.dd -c line

  # to filter in order to keep para and sect tags 
  cat examples/sample.xml | ./sgmlsk.pl  --pos para,sect -s examples/sample.dd

  # to filter by removing some selected tags of HTML documents because of their name or attributes...
  cat examples/sample.xml | ./sgmlsk.pl  --neg xml,DOCTYPE,']]>',HTML,meta,head,link,style,body,'!--',tr,td

  # to filter by removing some selected tags of HTML documents because of their attributes name or/and value...
  cat examples/sample.xml | ./sgmlsk.pl  --neg 'href\s*=\s*"http'

  # to work (here as simply a tokenizer) on specific file input and producing a specific file output and index
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -s examples/sample.dd

  # to load a previously made index (for instance in order to temporarily remove some tags for some processing)
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
  ./sgmlsk.pl  -i examples/doc.1-1.txt -l examples/doc.1-1.dd

  # merge several descriptions of the same text content by parsing several input files and loading several indexes describing the same document
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -pos "" -s examples/doc.1-1.dd -o examples/doc.1-1.txt
  ./sgmlsk.pl  -i examples/doc.1-2.sgml -pos "" -s examples/doc.1-2.dd -o examples/doc.1-2.txt
  ./sgmlsk.pl  -i examples/doc.1-4.sgml -pos "" -s examples/doc.1-4.dd -o examples/doc.1-4.txt
  # then the following give equivalent results
  ./sgmlsk.pl  -i examples/doc.1-1.sgml,examples/doc.1-2.sgml,examples/doc.1-4.sgml
  # or
  ./sgmlsk.pl  -i examples/doc.1-1.sgml -l examples/doc.1-2.dd,examples/doc.1-4.dd
  # or 
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd

  # filtering and indexing also work on the merged result
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -pos "" -s examples/doc.1-1-2-4.dd

  # make the merged document well-formed
  cat examples/doc.1-1.sgml |./sgmlsk.pl -l examples/doc.1-2.dd,examples/doc.1-4.dd -w

  # to debug and help development
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -verb -verb -verb 
  ./sgmlsk.pl  -i examples/sample.xml  -o examples/sample.xml -verb -verb -verb  1>/dev/null 2>&1
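
The "remove tags, process the text, load the tags back" use case shown above can be wrapped in a shell function. This is a sketch under assumptions, not something shipped with SGML-SK: the name strip_process_restore and the SGMLSK variable are hypothetical, only options listed in the syntax section are used, and the text command is assumed to keep the number and order of words so that the saved positions still apply.

```shell
# strip_process_restore: hypothetical wrapper, not shipped with SGML-SK.
# Usage: strip_process_restore tagged-file command [args...]
# Set SGMLSK to the script location if sgmlsk.pl is not on your PATH.
strip_process_restore() {
  doc=$1; shift                 # remaining arguments: the text-only command
  work=$(mktemp -d) || return 1
  status=0

  # 1. Remove every tag and save them in an index (word positions by default).
  "${SGMLSK:-sgmlsk.pl}" --input "$doc" --positivefilter "" \
      --save "$work/tags.dd" --output "$work/plain.txt" || status=$?

  # 2. Run the command on the tag-free text (stdin to stdout).
  #    Assumption: it keeps the number and order of words, otherwise the
  #    saved positions no longer match the text.
  if [ "$status" -eq 0 ]; then
    "$@" < "$work/plain.txt" > "$work/processed.txt" || status=$?
  fi

  # 3. Load the saved tags back; the result goes to STDOUT.
  if [ "$status" -eq 0 ]; then
    "${SGMLSK:-sgmlsk.pl}" --input "$work/processed.txt" \
        --load "$work/tags.dd" || status=$?
  fi

  rm -rf "$work"
  return "$status"
}
```

For instance, strip_process_restore examples/doc.1-1.sgml tr 'a-z' 'A-Z' would upper-case the text while leaving the tags untouched, provided the reload step behaves as in the example above.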

TODO, FIXME, BE AWARE lists for future releases

Ordered by priority

TODO

install script to assist with installing HTML::TagReader and SGMLSK
- see linux mag for that
handling namespace for document merging
- see element and namespace pattern in Expat
option for config file instead of command line
- http://articles.mongueurs.net/magazines/perles/perles-12.html
remove references to opts and separator from the libraries
- pass them as function parameters
relax instead of critical option
uniform $VERSION and svn $revision$ property if possible…
if well-formed required, offer the possibility to turn into empty elements with attributes start and end, identifiers and namespace for them
Main description of sgmlsk as DATA for the usage message (__END__), displayed below the program, if this resolves the variables
- use POD (Plain Old Documentation) format http://perldesignpatterns.com/?PerlDoc
Log4perl instead of debug
- http://log4perl.sourceforge.net/releases/Log-Log4perl/docs/html/Log/Log4perl.html#06b23
make it modular and faster (by skipping unnecessary recurrent steps)
make it object-oriented
use Storable instead of Data::Dumper ? Because Storable is faster, better, cheaper. But less readable…
- to discuss
option htmltidy by interfacing the lib (specially for HTML documents)
simple text tokenizer

FIXME

check whether all inputs describe the same document and whether all loaded tag indexes are applicable to it (I don't know how to do the latter)
column and line numbers are not correct when a text content contains several words
directories should not be accepted as --input

BE AWARE

HTML::TagReader does not recognize '<![CDATA[ … ]]>' as a single tag when it contains tags.
- As a result, '<![CDATA[ <myTag>' is recognized as a tag and ']]>' as text.
  - This is fixed by post-processing at parse time:
    - '<![CDATA[' and '<myTag>' are processed separately as tags
    - ']]>' is considered a tag
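
The effect of this fix can be illustrated with a few lines of awk. This is a hypothetical sketch, not the code used in SGML-SK: the name fix_cdata_tokens, the one-token-per-line input and the "tag"/"text" output labels are all assumptions made for the illustration.

```shell
# fix_cdata_tokens: hypothetical sketch, not SGML-SK code.
# Input: one raw token per line, as the parser is described to produce them.
# Output: "tag" or "text", a tab, then the token.
fix_cdata_tokens() {
  awk '
    # A CDATA opener fused with a following tag: emit two separate tags.
    /^<!\[CDATA\[[ \t]*</ {
      print "tag\t<![CDATA["
      sub(/^<!\[CDATA\[[ \t]*/, "")
      print "tag\t" $0
      next
    }
    # The CDATA terminator, reported as text by the parser, counts as a tag.
    $0 == "]]>" { print "tag\t]]>"; next }
    # Any other token starting with "<" is a tag, the rest is text.
    /^</        { print "tag\t" $0; next }
                { print "text\t" $0 }
  '
}
```

Feeding it the three lines '<![CDATA[ <myTag>', 'some text' and ']]>' yields four tokens, of which only 'some text' is labelled as text.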

Development

Use the following script to perform integration tests

perl test_sgmlsk.pl

On success, it should display something like

Ok -- to tokenize
Ok -- to filter by removing all tags (either specify all you do not want or that you want empty)
Ok -- to filter by removing all tags and index them on word position (.dd stands for data dump)
Ok -- to filter by removing all tags and index them on line position (.dd stands for data dump)
Ok -- to filter in order to keep para and sect tags 
Ok -- to load index previously made (for instance in order to temporarly remove some tags for some processing)
Ok -- merge description from several input file
Not Ok But Sort Ok-- merge description from several load with input file
Not Ok But Sort Ok-- merge description from several load with input file and stdin
Ok -- filter and index works on resulting merging
Ok -- make the merged document well-formed

How does it work ?

Filtering relies on regexps applied to tags.
Indexing and merging rely on word (default) or line positions.
It is fast when dealing with a single document and/or a single loaded index.
Above all, you can use it in a very expressive way.
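
The idea of indexing tags on word positions can be sketched as follows. This only illustrates the principle: index_tags_by_word is a hypothetical name, the one-token-per-line input is an assumption, and the output is not the actual .dd index format. "Word position" is taken here to mean the number of words preceding the tag.

```shell
# index_tags_by_word: hypothetical sketch, not the .dd format of SGML-SK.
# Input: one token per line; a line starting with "<" is a tag, any other
# line is text. Output: for each tag, the number of words that precede it.
index_tags_by_word() {
  awk '
    BEGIN { words = 0 }
    /^</  { print words "\t" $0; next }   # tag: record its word position
          { words += NF }                 # text: count whitespace-separated words
  '
}
```

With such positions, tags can be removed from a text and put back later, or tags from several descriptions of the same text can be interleaved, as long as the words themselves do not change.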

Nicolas Hernandez