


Third-party developers

In my effort to list the various components, I first tried to identify “who made it” and then “what was made” (as a component).

In this second step, it is important to know what we will actually be able to use. The characteristics to record for each component are:

  • its function
  • standalone (command line) or UIMA component
  • its UIMA version: IBM/Apache (version number)
  • the language it applies to
  • can it be trained on French?
  • availability? in what form (PEAR)? license?
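
A minimal sketch of how one directory entry could be recorded with the characteristics listed above. The class and field names are illustrative assumptions, not part of any UIMA schema; the sample values come from the LuCAS entry further down this page, and fields left as None mark questions that are still open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentRecord:
    """One directory entry; field names are hypothetical, not a UIMA schema."""
    name: str
    function: str
    uima_component: bool                        # usable as a UIMA component
    uima_flavor: Optional[str]                  # "IBM" or "Apache"
    uima_version: Optional[str]
    standalone: Optional[bool] = None           # usable from the command line
    languages: tuple = ()                       # languages it applies to
    trainable_on_french: Optional[bool] = None  # None = not yet checked
    packaging: Optional[str] = None             # e.g. "PEAR"
    license: Optional[str] = None

    def open_questions(self):
        """Return the names of the fields that still have to be filled in."""
        return [key for key, value in vars(self).items() if value is None]


lucas = ComponentRecord(
    name="LuCAS",
    function="Lucene CAS indexer",
    uima_component=True,
    uima_flavor="Apache",
    uima_version="2.2",
    packaging="PEAR",
    license="Common Public License 1.0",
)
print(lucas.open_questions())
```

Keeping unknown values as None rather than guessing makes it easy to list what remains to be checked for each component.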

Correct, complete, and add references to the various official guides

Put UIMA in context

Developer's guide

Component directories

UIMA Sandbox Suggested Analysis Components

Apache UIMA directory

Jena University Language & Information Engineering (JULIE) Lab

http://www.julielab.de/component/option,com_frontpage/Itemid,1/

  • its JULIE Lab NLP Toolsuite consists of a collection of NLP components, some of which are provided as freely available UIMA components. There is also a useful type system you can use as basis for your own NLP applications.
  • License: The tools provided by Jena University Language & Information Engineering Lab are licensed under the terms of the Common Public License, Version 1.0 or (at your option) any subsequent version. (See http://www.opensource.org/licenses/cpl1.0.php)
  • List: http://www.julielab.de/content/view/117/186/ (each entry comes with references to papers)
    • JULIE Sentence Boundary Detector (JSBD) and the JULIE Token Boundary Detector (JTBD) : IBM?, standalone and UIMA component
    • JULIE Named Entity Tagger (JNET) : IBM?, standalone and UIMA component
    • JULIE Acronym Annotator (JACRO) : IBM?, standalone and UIMA component
    • Lucene CAS Indexer of the JULIE Lab (LuCAS) APACHE!
    • JULIE PUBMED Reader (a UIMA Collection Reader) reads PUBMED (the major bibliographic database for the biomedical domain) abstracts in XML format IBM?, standalone and UIMA component
    • UIMA wrappers for some of the OpenNLP tools. IBM?, standalone and UIMA component
  • an annotation type system (JulieLab type system) which covers various levels of text analysis (e.g. document, linguistic and semantic analysis).

Language Technologies Institute of the School of Computer Science at Carnegie Mellon University

Component Library http://uima.lti.cs.cmu.edu:8080/UCR/Welcome.do

    • Chunker LanguageWare Annotator 6.0, negator, LuceneConsumer, pos, treebank, bio, ConceptMapper…
  • OpenNLP - A collection of open source Java projects related to natural language processing. The main projects include:
    • Maxent - A Java package for training and using maximum entropy models
    • OpenNLP Tools - A collection of Java NLP tools based on the Maxent package. The tools include a sentence detector, tokenizer, POS tagger, Shallow parser, named-entity detector, and co-reference resolver. This package of tools has already been wrapped as UIMA components, and the wrappers are included in the UIMA SDK as sample code. More information about OpenNLP wrappers is found here.
  • LingPipe - LingPipe is a suite of Java libraries for the linguistic analysis of human language. LingPipe Tools - track mentions of entities (e.g. people or proteins); link entity mentions to database entries; uncover relations between entities and actions; classify text passages by language, character encoding, genre, topic, or sentiment; correct spelling with respect to a text collection; cluster documents by implicit topic and discover significant trends over time; and provide part-of-speech tagging and phrase chunking. Two examples are listed below to show how to wrap LingPipe annotators as UIMA components; LingPipe Classifier and LingPipe Named-Entity detector.

The Center for Computational Pharmacology at the University of Colorado

bio-nlp : has wrapped a number of popular bio-informatic annotators as UIMA components http://bionlp-uima.sourceforge.net/

  • Analysis Engines:
    • Gene Identification: ABGene, ABNER, LingPipe, KeX
    • Mutation Identification: MutationFinder
    • Semantic Parsing: OpenDMAP
    • Sentence Detection: KeX LingPipe OpenNLP
    • Tokenization: Genia Tagger, LingPipe, Penn BioTokenizer
  • Collection Readers for corpora: Bio1 BioIE, Texas, Yapex

RASP is a domain-independent, robust parsing system for English

http://www.digitalpebble.com/rasp4uima/index.html : processes of tokenisation, tagging, lemmatization and parsing

IBM

http://www.alphaworks.ibm.com/tech/uima/download

  • Semantic Search for Apache UIMA : SemanticSearch 2.1: The SemanticSearch package is based on Apache UIMA and provides a full-featured semantic search engine. SemanticSearch_2.1.zip
  • IBM UIMA Adapter Wrapper for Apache UIMA: The IBM UIMA wrapper package enables you to run IBM UIMA components using Apache UIMA 2.2 or above. This package is designed for projects and products that migrate to Apache UIMA but also still need to be able to run older IBM UIMA components.

IBM-UIMA-Adapter-2.2.zip

Projects at IBM

A variety of advanced IBM research projects focusing on developing and applying UIMA http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.researchProjects.html

Not all of them are still active, and they appear to be tied to the IBM version of UIMA.

  • The Avatar project provides an easy-to-use web framework for constructing and configuring UIMA annotators to solve particular annotation tasks.
  • TALES - Multimedia mining and translation of broadcast news (TV) and news Web sites.
  • ProAct - automating customer satisfaction analysis
  • Text Mining projects at IBM's Tokyo Research Lab
  • IBM Research is participating as a partner in the SAPIR project (Search in Audio Visual Content Using Peer-to-peer Information Retrieval). This European Union project is using UIMA as an integrating platform.

Persistence

Persistence - XMI and MOF by the OMG

XMI and MOF by the Object Management Group (OMG), an international, open-membership, not-for-profit computer industry consortium

XML Metadata Interchange (XMI) is a model driven XML Integration framework for defining, interchanging, manipulating and integrating XML data and objects. XMI-based standards are in use for integrating tools, repositories, applications and data warehouses. XMI provides rules by which a schema can be generated for any valid XMI-transmissible MOF-based metamodel.

XMI provides a mapping from MOF to XML. As MOF and XML technology evolved, the XMI mapping is being updated to comply with the latest versions of these specifications. Updates to the XMI mapping have tracked these version changes in a manner consistent with the existing XMI Production of XML Schema specification (XMI Version 2).

Meta-Object Facility (MOF) is an extensible model driven integration framework for defining, manipulating and integrating metadata and data in a platform independent manner. MOF-based standards are in use for integrating tools, applications and data.

Persistence - Database

With Apache Derby

  • In July 2005, the Apache Software Foundation promoted Derby from incubator status to a subproject of the Apache DB project, making it a full-fledged open-source database. Derby, which originated from IBM's donation of the Cloudscape code in 2004, needed non-IBM support to graduate. To date, five software vendors and three Apache projects have included Derby in their work.
  • UIMA Overview 2.6 (CAS2DB) /uimaj-examples/src/org/apache/uima/examples/cpe/PersonTitleDBWriterCasConsumer.java
  • For details, go to http://db.apache.org/derby.
    • Derby has a small footprint – about 2 megabytes for the base engine and embedded JDBC driver.
    • Derby is based on the Java, JDBC, and SQL standards.
    • Derby provides an embedded JDBC driver that lets you embed Derby in any Java-based solution.
    • Derby also supports the more familiar client/server mode with the Derby Network Client JDBC driver and Derby Network Server.
    • Derby is easy to install, deploy, and use.
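
The UIMA example cited above writes annotations to a database through JDBC. As a rough illustration of the same idea (one row per annotation in an embedded, in-process database), here is a sketch that swaps in Python's built-in sqlite3 module in place of Derby and JDBC. The table layout, the document id and the sample annotations are assumptions made up for the example; they are not taken from PersonTitleDBWriterCasConsumer.

```python
import sqlite3

# Hypothetical analysis result: (type name, begin offset, end offset).
text = "Dr. Smith met Prof. Jones."
annotations = [("PersonTitle", 0, 3), ("PersonTitle", 14, 19)]

# An in-memory SQLite database stands in for an embedded Derby instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation ("
    " doc_id TEXT, type TEXT, begin_pos INTEGER, end_pos INTEGER, covered TEXT)"
)

# Store one row per annotation, together with the text it covers.
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
    [("doc1", t, b, e, text[b:e]) for t, b, e in annotations],
)
conn.commit()

# Read the annotations back in document order.
rows = conn.execute(
    "SELECT covered FROM annotation WHERE type = ? ORDER BY begin_pos",
    ("PersonTitle",),
).fetchall()
print(rows)
```

Storing the covered text next to the offsets is a convenience for querying; the offsets alone are enough if the document text is kept elsewhere.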

Interoperability

Interoperability - C++ API, etc.

See the mailing list and SVN.

Interoperability - Web Service

  • Simple Server (UIMA REST Service)

The UIMA Simple Server makes results of UIMA processing available in a simple, XML-based format. The intended use of the Simple Server is to provide UIMA analysis as a REST service. The Simple Server is implemented as a Java Servlet, and can be deployed into any Servlet container (such as Apache Tomcat or Jetty). User documentation of the Simple Server: http://incubator.apache.org/uima/sandbox.html#simple-server

Interoperability - Bean Scripting Framework

The Bean Scripting Framework (BSF) Annotator is an Apache UIMA analysis engine that provides a link between the UIMA framework and the scripting languages that are supported by Apache BSF (http://jakarta.apache.org/bsf). The current implementation comes with examples in Beanshell (http://www.beanshell.org) and Rhino Javascript (http://www.mozilla.org/rhino). Simple tests have also been conducted successfully with Jython (http://jython.sourceforge.net/Project/index.html) and JRuby (http://jruby.codehaus.org). http://incubator.apache.org/uima/sandbox.html#bsf.annotator

Packaging components as PEAR

NLP components

NLP component - Word Tokenization

Sandbox - Whitespace tokenizer annotator - http://incubator.apache.org/uima/sandbox.html#whitespace.tokenizer

bio-nlp http://bionlp-uima.sourceforge.net/ with Genia Tagger, LingPipe, Penn BioTokenizer

OpenNLPTokenizer tokenizes the text and creates token annotations that span the tokens - Apache UIMA Example Wrappers for the OpenNLP Tools - http://uima.lti.cs.cmu.edu:8080/UCR/pages/static/osnlp/OpenNLPReadme.html (English)
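
To make "token annotations that span the tokens" concrete, here is a minimal sketch of a whitespace tokenizer that returns character offsets. It only illustrates the principle and is not the Sandbox annotator or the OpenNLP tokenizer; the function name is hypothetical.

```python
import re


def whitespace_tokens(text):
    """Return (begin, end, covered_text) for each run of non-whitespace."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


print(whitespace_tokens("UIMA  rocks"))
```

Each token is described by its begin and end offsets in the original text, which is how annotations refer to the document without modifying it.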

NLP component - Word Stemming

NLP component - POS

OpenNLPPOSTagger assigns part-of-speech tags to tokens - Apache UIMA Example Wrappers for the OpenNLP Tools - http://uima.lti.cs.cmu.edu:8080/UCR/pages/static/osnlp/OpenNLPReadme.html (English)

NLP component - Sentence Splitter

bio-nlp http://bionlp-uima.sourceforge.net/ with KeX LingPipe OpenNLP

OpenNLPSentenceDetector detects sentence boundaries and creates Sentence annotations that span these boundaries - Apache UIMA Example Wrappers for the OpenNLP Tools - http://uima.lti.cs.cmu.edu:8080/UCR/pages/static/osnlp/OpenNLPReadme.html (English)
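
As a point of comparison, the sketch below produces sentence spans with a naive punctuation rule. This is deliberately not what the OpenNLP detector does (it relies on a trained maximum entropy model); the rule and the function name are assumptions chosen only to show what a sentence annotation looks like as a pair of offsets.

```python
import re


def split_sentences(text):
    """Return (begin, end) spans, cutting after '.', '!' or '?' before a space."""
    spans = []
    start = 0
    for m in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = m.end()
        spans.append((start, end))
        # The next sentence starts at the first non-whitespace character.
        nxt = end
        while nxt < len(text) and text[nxt].isspace():
            nxt += 1
        start = nxt
    if start < len(text):
        # Trailing text without final punctuation is kept as a sentence.
        spans.append((start, len(text)))
    return spans


sample = "It works. Does it? Yes"
print([sample[b:e] for b, e in split_sentences(sample)])
```

A rule like this breaks on abbreviations such as "Dr.", which is one reason to prefer a trained detector.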

NLP component - Phrasal and Clause Parsing

English

NLP component - Named Entity and acronyms

English

NLP component - Semantic Parsing

Tool

Tool - Machine Learning

Tool - Analyser

Sandbox - Regular Expression Annotator http://incubator.apache.org/uima/sandbox.html#regex.annotator
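
The idea behind a regular expression annotator can be sketched as a table of named patterns applied to the text, each match becoming an annotation. The rule table, the two patterns and the function name below are assumptions for illustration; the Sandbox annotator has its own rule file format, which is not reproduced here.

```python
import re

# Hypothetical rule set: annotation type name -> compiled pattern.
RULES = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def regex_annotate(text, rules=RULES):
    """Return (begin, end, type_name) for every match, in document order."""
    found = []
    for type_name, pattern in rules.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), type_name))
    return sorted(found)


print(regex_annotate("Released in 2008 by dev@example.org"))
```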

Tool - Annotation Editor

Sandbox - CAS Editor is an annotation tool which supports manual and automatic annotation of CAS files. http://incubator.apache.org/uima/sandbox.html#CAS%20Editor

Tool - Dictionary Annotator

Sandbox - Dictionary Annotator is an Apache UIMA analysis engine that creates annotations based on word lists that are compiled to simple dictionaries. http://incubator.apache.org/uima/sandbox.html#dict.annotator
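
A minimal sketch of the principle, assuming a case-insensitive match on single words: the word list is compiled into a lookup set, and every word of the text found in it yields an annotation. The function names and the default type name are hypothetical, and the Sandbox annotator's own dictionary format is not shown.

```python
import re


def compile_dictionary(words):
    """Compile a word list into a simple case-insensitive lookup set."""
    return {word.lower() for word in words}


def dictionary_annotate(text, dictionary, type_name="DictionaryEntry"):
    """Return (begin, end, type_name) for each word found in the dictionary."""
    return [
        (m.start(), m.end(), type_name)
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in dictionary
    ]


tools = compile_dictionary(["Derby", "Lucene"])
print(dictionary_annotate("Lucene and derby", tools))
```

Multi-word entries would need a different matching strategy (for example matching over token sequences), which this sketch leaves out.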

Tool - OpenNLP

Apache UIMA; wrappers; NLP processing; English models (can they be built for French?)

OpenNLP Tools is an open source package of natural language processing components written in pure Java. The tools are based on Adwait Ratnaparkhi's Ph.D. dissertation (UPenn, 1998), which shows how to apply Maximum Entropy models to various language ambiguity problems. The OpenNLP Tools rely on the OpenNLP MAXENT package, a mature Java package for training and using maximum entropy models.

The OpenNLP Tools package (as of Version 1.3) includes a sentence detector, tokenizer, part-of-speech tagger, noun phrase chunker, shallow parser, named entity detector, and co-reference resolver. All together these tools provide a rich and powerful set of text analysis capabilities.

The Apache UIMA Example Wrappers for OpenNLP provide UIMA annotators for most of the OpenNLP Tools components, allowing you to run the OpenNLP Tools as UIMA annotators.

Tool - GATE

UIMA-GATE interoperability layer is based on the UIMA SDK version 1.2.3 (i.e. IBM alpha) http://gate.ac.uk/sale/tao/#chap:uima

GATE vs UIMA

UIMA (http://www.research.ibm.com/UIMA/) is a language processing framework developed by IBM. UIMA and GATE share some functionality but are complementary in most respects.  GATE now provides an interoperability layer to allow UIMA applications to include GATE components in their processing and vice-versa. 
It has many similarities to the GATE architecture – it represents documents as text plus annotations, and allows users to define pipelines of analysis engines that manipulate the document (or Common Analysis Structure in UIMA terminology) in much the same way as processing resources do in GATE. 
Clearly, it would be useful to be able to include UIMA components in GATE applications and vice-versa, letting GATE users take advantage of UIMA’s flexible deployment options and UIMA users access JAPE and the many useful plugins already available in GATE. 

There are some components in GATE (particularly Annie and JAPE) which I would like to use in UIMA.
UIMA has many features in common with other software architectures for language engineering such as GATE and ATLAS. Each of these systems isolates the core algorithms that perform language processing from system services such as storing of data, communication between components, and visualization of results. However, UIMA's emphasis on transferring UIM technologies to products has led to a richer architecture that allows integrating applications with a host of enterprise products (e.g., WebSphere Portal Server, Lotus Workplace) and a variety of middleware and platform options.
We (Temis) have built our new corporate product on top of UIMA. We made this decision one year ago now. The choice was mainly between using UIMA or doing it ourselves. We resisted the latter option! We did a quick survey of other frameworks (GATE ...) but UIMA was more appropriate for our need of a core framework platform. We liked its homogeneity, the quality of the code, the documentation, the quick evolution, the planned move to open source and to a commercial friendly license. Send me a private message if you want to have a talk about this. (pascal.coupet <at> temis.com).

Dear Ekaterina,
I cannot directly answer your question as I am not an UIMA or GATE  wizard. I can tell you why I elected  UIMA rather than GATE and OpenNLP.
 1. I use a finite state machine toolbox of my own written in Java but
    I did not want to close the door to other applications written in
    C or in Perl and for what I read when I made my decision only UIMA
    offered a clear and clean way to integrate  C or  Perl apps via 
    the descriptors fence.
 2. I know UIMA has been used in heavy industrial applications by IBM
    like Business Insight,
 3. I did not find any major differences in the documentations
    concerning the annotation scheme. In fact for my own purpose a
    list of labels, a start and an end position in the text was just
    fine for me.
 4. I did not need any other linguistics tools than mine.
 5. and last but not the least it was crucial to me to have the
    possibility to integrate easily the unstructured information part
    in Eclipse and only UIMA offered an easy way to do it.
 6. Besides, it is not too hard to integrate external applications in
    Eclipse so if someday I need some other tools I know it will be
    easier to integrate them in eclipse than in any other environment.

In short, UIMA and Eclipse are solid and complementary. In the long term I believe they have better odds than GATE and OpenNLP, even if in terms of implemented algorithms and programs UIMA is poorer than GATE and OpenNLP.
Ekaterina Buyko wrote:
> Hi,
>
> I am looking for a comparison of UIMA and GATE systems.
>
>  What does UIMA offer more or less as GATE does it? I am interested in 
> the general contrast between the UIMA and GATE and in particular in 
> the comparison of type systems and GATE annotation schemata. Can we 
> convert all UIMA types to GATE types without any restrictions or does 
> UIMA offers more features in implementation of annotation schemata as 
> GATE?
>
> Thanks,
>
> Katja
>
-- 
Cordialement/Regards
Christian Mauceri
http://hermeneute.com/Christian

* http://article.gmane.org/gmane.comp.apache.uima.general/348/match=perl

Application

Application - Search Engine and Semantic Search

UIMA principles for semantic search

http://www-306.ibm.com/software/data/enterprise-search/omnifind-enterprise/

UIMA Lucene CAS Indexer (LuCAS) http://www.julielab.de/content/view/117/186/

  • Description: The Lucene CAS Indexer of the JULIE Lab (LuCAS), a UIMA CAS consumer, takes the information that can be found in a CAS and indexes them in Lucene index fields. To determine the tokens we rely on our JULIE tokenizer (the information can be found in the CAS, as well) and ignore the tokenizers coming with Lucene. Furthermore, the indexing specification for the CAS-Lucene Indexer is similar to the one for the Semantic Search CAS Indexer (see UIMA SDK V2, Chapter 6.5). Hint: If the PEAR file doesn't work in your pipeline, please let me know!
  • Requirements: Java 5.0 (1.5), Lucene 2.3
  • LuCAS new Version 2.2 is out now
  • works with UIMA 2.2
  • works with Lucene 2.3
  • works with TypeSystem 2.0
  • ANNOUNCEMENT: We are currently working on a new version of LuCAS with a new mapping file format that allows configuring more indexing options. It will be released by March 2008!

Application - bio

  • The Center for Computational Pharmacology at the University of Colorado has wrapped a number of popular bio-informatic annotators as UIMA components

http://bionlp-uima.sourceforge.net/

Application - Semantic Web

  • Can UIMA reconcile text mining and semantic tools? (blog post in French)

http://mondeca.wordpress.com/2007/09/11/uima-peut-il-reconcilier-le-text-mining-et-les-outils-semantiques/

Bibliography

D. Ferrucci and A. Lally. “UIMA: an architectural approach to unstructured information processing in the corporate research environment,” Natural Language Engineering 10, No. 3-4, 327-348 (2004). www.research.ibm.com/UIMA/

D. Ferrucci and A. Lally, “Building an example application with the Unstructured Information Management Architecture,” IBM Systems Journal 43, No. 3, 455-475 (2004). http://www.research.ibm.com/journal/sj43-3.html

T. Goetz and O. Suhre, “Design and implementation of the UIMA Common Analysis System,” IBM Systems Journal 43, No. 3, 476-489 (2004). http://www.research.ibm.com/journal/sj43-3.html

R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, and L. V. Subramaniam, “Text analytics for life science using the Unstructured Information Management Architecture,” IBM Systems Journal 43, No. 3, p. 490 (2004). http://www.research.ibm.com/journal/sj43-3.html

N. Uramoto, H. Matsuzawa, T. Nagano, A. Murakami, H. Takeuchi, and K. Takeda, “A text-mining system for knowledge discovery from biomedical documents,” IBM Systems Journal 43, No. 3, p. 516 (2004). http://www.research.ibm.com/journal/sj43-3.html

Towards Declarative Information Extraction: The Almaden Story, Industrial KeyNote talk by Shivakumar Vaithyanathan at Web Intelligence, 2007. http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.projectUimaArchitectureFramework.html

Anthony Levas, Eric Brown, J. William Murdock, and David Ferrucci. “The Semantic Analysis Workbench (SAW): Towards a Framework for Knowledge Gathering and Synthesis.” Proceedings of the International Conference on Intelligence Analysis. McLean, VA, May 2-6, 2005.

The Linguistic Annotation Workshop A Merger of NLPXML 2007 and FLAC 2007 The LAW ACL 2007 Prague, Czech Republic, June 28-29, 2007 http://www.ling.uni-potsdam.de/acl-lab/LAW-07.html

Udo Hahn, Ekaterina Buyko, Katrin Tomanek, Scott Piao, John McNaught, Yoshimasa Tsuruoka, and Sophia Ananiadou. An annotation type system for a data-driven NLP pipeline. In The LAW at ACL 2007 – Proceedings of the Linguistic Annotation Workshop, pages 33–40. Prague, Czech Republic, June 28-29, 2007. Stroudsburg, PA: Association for Computational Linguistics, 2007. http://www.ling.uni-potsdam.de/acl-lab/LAW-07.html

Scott Piao, Ekaterina Buyko, Yoshimasa Tsuruoka, Katrin Tomanek, Jin Dong Kim, John McNaught, Udo Hahn, Jian Su, and Sophia Ananiadou. Bootstrep annotation scheme: Encoding information for text mining. In Corpus Linguistics 2007 – Proceedings of the 4th Corpus Linguistics Conference. Birmingham, England, U.K., July 27-30, 2007.

Other

  • with search and knowledge management technologies.
  • UIMA Lucene CAS Indexer (LuCAS)

To do

  • graphical workflow generator
  • all the UIMA tools in a single interface
  • web application
  • web service for submitting data

Data format

.xmi

  • annotated text elements are marked with HTML entities
  • the annotation is external, at the bottom of the document: the type is marked by a namespace, a character offset is given, along with an identifier and another attribute (?)
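
The layout described above (document text in one place, annotations stored separately as namespaced elements carrying an identifier and character offsets) can be sketched as follows. This is a simplified stand-off format written for illustration only: the element names, the attribute names and the namespace URI are assumptions and do not reproduce the actual XMI schema.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/types"  # hypothetical namespace for annotation types
ET.register_namespace("ex", NS)


def to_standoff_xml(text, annotations):
    """Serialize text plus (type_name, begin, end) annotations as stand-off XML."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = text
    for ident, (type_name, begin, end) in enumerate(annotations, start=1):
        # Each annotation is a separate element placed after the text.
        ET.SubElement(
            root,
            "{%s}%s" % (NS, type_name),
            {"id": str(ident), "begin": str(begin), "end": str(end)},
        )
    return ET.tostring(root, encoding="unicode")


xml_doc = to_standoff_xml("Dr. Smith", [("PersonTitle", 0, 3)])
print(xml_doc)
```

Because the annotations only point into the text through offsets, several layers of analysis can coexist without touching the document itself.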

input/output dir .txt

simply handles language and charset

Help on

  • Retrieving parameters declared in descriptor.xml (1.2.1; tutorial.ex1)
  • Logging (1.2.2; tutorial.ex2)
  • Exception handling (1.5.3; tutorial.ex5)
  • XML handling: XMLDetagger.xml works on data/xml by retrieving the current view and specifying the parameter setting in Annotator.xml, but not in documentAnalyser
  • Multiple views: 6 and the example in 6.6; descriptors/analysis_engine/SofaExampleAnnotator.xml; org.apache.uima.examples.SofaExampleAnnotator.java
  • Handling persistence: CAS to XML files (XMI, formerly XCAS), or to a database using JDBC (see UIMA SDK)

As far as NLP proper is concerned, Carnegie Mellon University's Language Technology Institute is hosting an UIMA Component Repository web site (http://uima.lti.cs.cmu.edu), where developers can post information about their analytics components and anyone can find out more about free and commercially available UIMA-compliant analytics. Additionally, free analytic tools that can work with UIMA include those from the General Architecture for Text Engineering (GATE - http://gate.ac.uk/) and OpenNLP (http://opennlp.sourceforge.net/) communities, as well as Jena University’s Language & Information Engineering (JULIE) (http://www.julielab.de) Lab. Commercial analytics are available from IBM, as well as from other software vendors such as Attensity, ClearForest, Temis and Nstein.

Besides IBM, several academic and industrial organizations use UIMA to develop analyzers and UIM solutions.

Active user and developer communities, as the dedicated mailing lists attest.

There are still few universities and companies using UIMA.

Integration work that shows a definite interest in the platform: a two-way wrapper for GATE, wrappers for tools from the OpenNLP suite, Lucene, Weka…

Current activity centers on a deep effort on the platform itself rather than on bridges; a component may require adaptation depending on the SDK version it was originally developed for; in connection with the platform's development, however, an adapter for components developed for the IBM alpha version is still being worked on.

including statistical and rule-based Natural Language Processing (NLP), Information Retrieval (IR), machine learning, and ontologies, semantic search, information extraction and text mining

 
research/plateforme/uima/annuaire.txt · Last modified: 2010/05/13 12:35 (external edit)