ESpotter
ESpotter is supported by the Dot.Kom project.
Team:
Related Projects:
Talks:
KMi Internal Talk: (June 14th)
ESpotter: A Domain and User Adaptation Approach for Named Entity Recognition on the Web
Abstract: Named entity recognition (NER) systems are commonly designed with a "one-size-fits-all" philosophy. Lexicons and patterns, manually crafted or learned from a training set of documents, are applied to any other document without taking into account that document's background or the user's needs. However, when NER is applied to Web pages, the diversity of those pages and of user needs means that one size frequently does not fit all. In this talk, I present a system called ESpotter, which improves NER on the Web by adapting lexicons and patterns to domains on the Web and to user preferences. My results show that ESpotter provides more accurate and efficient NER on Web pages from various domains than current NER systems. ESpotter is implemented as a browser plug-in that helps address the information overload problem on the Web by discovering relevant information on the user's behalf. Further work on integrating ESpotter with the ontology-based semantic browsing tool Magpie and with the KMi semantic Web site is also explored.
Keywords: Named entity recognition, information extraction, hierarchies.
Papers:
Demos:
Download ESpotter as a .NET Windows Application:
With a single click you can extract entities of various types from documents,
e.g., "Open University" as an organization and "Enrico Motta" as a person.
You can select one or more documents in plain text or HTML format and save
the recognized entities in an XML file for further processing.
The tool is based on the .NET framework and can be downloaded.
Run the ESpotter.msi file to install it (you may need to install .NET
Framework 1.0 first). The installation will create a shortcut to the ESpotter
executable file on your desktop. The following example XML output shows
entities of various types and their word offsets in a document.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<ESpotter-Processed-Documents corpusSize="284">
  <Document id="0">
    <has-directory>D:\test.xml</has-directory>
    <has-url>D:\test.xml</has-url>
    <has-document-size>284</has-document-size>
    <mentions-location>
      <instance content="
    </mentions-location>
    <mentions-organization>
      <instance content="
    </mentions-organization>
    <mentions-person>
      <instance content="Larry Stillman" pos="130" />
    </mentions-person>
    <mentions-research-area>
      <instance content="network" pos="238" alias="TechnologiesCommunity Informatics Research Network" />
    </mentions-research-area>
    <pn>
      <instance content="ICT" pos="22" />
    </pn>
  </Document>
</ESpotter-Processed-Documents>
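For further processing of this output, the following is a minimal Python sketch (not part of the ESpotter distribution) that reads such a file and lists the recognized entities. It assumes that every entity group is a direct child of Document and contains instance elements with content, pos, and optional alias attributes, as in the example above. The function name read_entities and the trimmed SAMPLE string are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed excerpt of the example output shown above.
SAMPLE = """\
<ESpotter-Processed-Documents corpusSize="284">
  <Document id="0">
    <has-document-size>284</has-document-size>
    <mentions-person>
      <instance content="Larry Stillman" pos="130" />
    </mentions-person>
    <mentions-research-area>
      <instance content="network" pos="238"
                alias="TechnologiesCommunity Informatics Research Network" />
    </mentions-research-area>
    <pn>
      <instance content="ICT" pos="22" />
    </pn>
  </Document>
</ESpotter-Processed-Documents>
"""


def read_entities(xml_text):
    """Return one dict per <instance> element found under each <Document>."""
    root = ET.fromstring(xml_text)
    entities = []
    for doc in root.iter("Document"):
        for group in doc:
            # Elements such as <has-url> have no <instance> children
            # and therefore contribute nothing.
            for inst in group.findall("instance"):
                tag = group.tag
                prefix = "mentions-"
                entity_type = tag[len(prefix):] if tag.startswith(prefix) else tag
                entities.append({
                    "doc_id": doc.get("id"),
                    "type": entity_type,
                    "content": inst.get("content"),
                    "pos": int(inst.get("pos")),
                    "alias": inst.get("alias"),  # None when absent
                })
    return entities


for entity in read_entities(SAMPLE):
    print(entity["type"], entity["content"], entity["pos"])
```

To read a saved output file instead of a string, ET.parse(path).getroot() can replace the ET.fromstring call.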
ESpotter uses an MS Access database file, ESpotterResources.mdb, to store
lexicon and pattern information. Currently ESpotter recognizes People,
Organization, Location, Research Area, Email, Telephone, Postal Code, and
other Proper Names. You can customize the lexicon and patterns in the
ESpotterResources.mdb file to recognize any type of entity you are interested
in by adding new lexicon entries and patterns. Lexicon entries and patterns
are grouped into different tables. When you add new lexicon entries or
patterns, you can create a new table and register it in the TableSchema
table. New entity types need to be registered in the TypeSchema table.
Precision for domain adaptation is not used in this version of ESpotter, and
it can be ignored in the database file.
For developers interested in ESpotter, the installation includes a DLL file,
ESpotterClass.dll, for easy inclusion in a .NET application for language
engineering. An example is given in the Class1.cs file. More information on
using ESpotter for development is coming soon.
An ESpotterExe.exe file that can be called from a program or from a shell on
Windows is also included. Its parameters are as follows:
arg [0]: source file location
arg [1]: output file location
arg [2]: database file location
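As an illustration of calling the executable from a program, here is a minimal Python sketch that passes the three parameters in the order listed above. The helper names build_command and run_espotter and all example paths are hypothetical, and the sketch assumes the executable reports failure through a non-zero exit code, which this page does not state.

```python
import subprocess


def build_command(exe_path, source_path, output_path, database_path):
    """Build the argument list: source, output, then database location."""
    return [exe_path, source_path, output_path, database_path]


def run_espotter(exe_path, source_path, output_path, database_path):
    """Run ESpotterExe.exe on Windows and wait for it to finish."""
    cmd = build_command(exe_path, source_path, output_path, database_path)
    # check=True raises CalledProcessError on a non-zero exit code
    # (assumed, not documented, to indicate failure).
    return subprocess.run(cmd, check=True)


# Hypothetical paths; adjust them to your own installation and files.
example = build_command(
    r"C:\ESpotter\ESpotterExe.exe",
    r"D:\test.txt",
    r"D:\test-entities.xml",
    r"C:\ESpotter\ESpotterResources.mdb",
)
print(example)
```

Passing the arguments as a list rather than a single string lets subprocess handle quoting of paths that contain spaces.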
Content last modified: October 12th, 2004, maintained by Jianhan Zhu (j.zhu@open.ac.uk).