Asma Aldhahri CS330 Artificial Intelligence
General Architecture: a software development environment and framework, plus a macro-level organizational structure for families of software systems.
Text Engineering: performing tasks that involve processing human language. It relates to the application of computational linguistics, natural language processing, and artificial intelligence.
Computational Linguistics (CL): science of language that uses computation as an investigative tool.
Natural Language Processing (NLP): science of computation whose subject matter is data structures and algorithms for computer processing of human language.
Basically, GATE (General Architecture for Text Engineering) is an environment that lets you build every part of a software system that draws information out of text.
Develop & deploy software components
Text analysis, language processing
Language processing tasks:
Parsers
Morphology
Tagging
Language retrieval tools
Language extraction tools
All for various languages
Combines application development environment with GUI
Resources pane:
applications: groups of processes that run on a single document or a group of documents
language resources: documents or document collections to be annotated (a collection is called a CORPUS in GATE)
processing resources: annotation tools that operate on the unstructured text within documents
data stores: specialized files on the hard disk or in a database where documents and processing resources are kept for future use
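The relationship between these resource types can be sketched in a few lines of Python. This is only an illustration of the concepts, not GATE's own API: the class names, the `mark_whole_text` processing resource, and the sample texts are all assumptions, and data stores are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Language resource: raw text plus the annotations added to it."""
    name: str
    text: str
    annotations: List[dict] = field(default_factory=list)


@dataclass
class Corpus:
    """Language resource: a named collection of documents."""
    name: str
    documents: List[Document] = field(default_factory=list)


@dataclass
class Application:
    """A group of processing resources that runs over a corpus."""
    name: str
    processing_resources: List[Callable[[Document], None]] = field(default_factory=list)

    def run(self, corpus: Corpus) -> None:
        # Every processing resource is applied to every document, in order.
        for document in corpus.documents:
            for resource in self.processing_resources:
                resource(document)


def mark_whole_text(document: Document) -> None:
    # Hypothetical processing resource: annotates the full span of the text.
    document.annotations.append(
        {"type": "WholeText", "start": 0, "end": len(document.text)}
    )


corpus = Corpus(
    "demo corpus",
    [Document("doc1", "Maria flew to Heathrow."), Document("doc2", "Hello")],
)
application = Application("demo application", [mark_whole_text])
application.run(corpus)
```

After the run, each document in the corpus carries one `WholeText` annotation covering its text.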
Annotating a Document Example: Step 1 Load the Language Resource (document)
To add a document, right-click Language Resources and select New > GATE Document
Set the encoding and the URL for the document (you can browse your files or take a URL straight from the web)
Step 2 Load Processing Resources (ANNIE)
Document Reset – removes old annotations
Tokenizer – splits text into tokens and classifies each one (word, number, symbol, punctuation, space)
Gazetteer – matches names against lists and links them to more general concepts (e.g. Heathrow = airport)
Sentence Splitter – recognizes sentences
Tagger – produces a part-of-speech (POS) tag for each token
Transducer – adapts the tokenizer output to the POS tagger's requirements
OrthoMatcher – adds identity relations between named entities
ANNIE: A Nearly-New Information Extraction system
The plugin-based architecture separates processing components from document-based resources and helps minimize memory usage
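To make the tokenizer and gazetteer roles concrete, here is a minimal Python sketch. It is not ANNIE's implementation: the regular expression, the two-entry `GAZETTEER` dictionary, and the annotation field names are assumptions chosen for illustration.

```python
import re

# One alternative per token kind; the first alternative that matches wins.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[^\W\d_]+)"
    r"|(?P<number>\d+)"
    r"|(?P<space>\s+)"
    r"|(?P<punctuation>[.,;:!?()'-])"
    r"|(?P<symbol>.)",
    re.DOTALL,
)

# Hypothetical gazetteer list: surface form -> more general concept.
GAZETTEER = {"heathrow": "airport", "london": "city"}


def tokenize(text):
    """Split text into tokens and record each token's kind and offsets."""
    return [
        {"kind": m.lastgroup, "string": m.group(), "start": m.start(), "end": m.end()}
        for m in TOKEN_PATTERN.finditer(text)
    ]


def lookup(tokens):
    """Create a Lookup annotation for every word token found in the gazetteer."""
    return [
        {
            "type": "Lookup",
            "concept": GAZETTEER[t["string"].lower()],
            "start": t["start"],
            "end": t["end"],
        }
        for t in tokens
        if t["kind"] == "word" and t["string"].lower() in GAZETTEER
    ]


tokens = tokenize("Flight 42 to Heathrow!")
lookups = lookup(tokens)
```

Here the gazetteer works on the tokenizer's output, so "Heathrow" is found as a word token first and only then linked to the concept "airport".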
Step 3 Before Running an Application, Add Document to a Corpus
Right-click the document you want to add to a corpus
Select multiple documents to build a larger corpus
Step 4 Create Corpus Pipeline
First add the document to the corpus
Press “Run this Application”
GATE shows the time taken to run the application
The order of the processing resources (PRs) is important: each PR takes the output of the previous PR as its input
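Why the order matters can be shown with a small Python sketch. The two processing resources below (`tokenizer` and `capitalized_word_finder`) are hypothetical stand-ins, not GATE components; the second one reads the annotations that the first one writes.

```python
def tokenizer(doc):
    # Needs only the raw text; writes Token annotations as offset pairs.
    tokens, offset = [], 0
    for word in doc["text"].split():
        start = doc["text"].index(word, offset)
        end = start + len(word)
        tokens.append({"start": start, "end": end})
        offset = end
    doc["annotations"]["Token"] = tokens


def capitalized_word_finder(doc):
    # Depends on Token annotations; with none present it finds nothing.
    tokens = doc["annotations"].get("Token", [])
    doc["annotations"]["Capitalized"] = [
        t for t in tokens if doc["text"][t["start"]].isupper()
    ]


def run_pipeline(processing_resources, corpus):
    # Apply each processing resource, in the given order, to each document.
    for doc in corpus:
        for resource in processing_resources:
            resource(doc)


def new_doc(text):
    return {"text": text, "annotations": {}}


right_order = [new_doc("Maria lives in Yanbu")]
run_pipeline([tokenizer, capitalized_word_finder], right_order)

wrong_order = [new_doc("Maria lives in Yanbu")]
run_pipeline([capitalized_word_finder, tokenizer], wrong_order)
```

With the tokenizer first, two capitalized words are found; with the order swapped, the finder runs before any tokens exist and its result stays empty.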
Step 5 View Annotations & Add Annotation
Highlight the text you want to annotate and right-click it
The original document before and after adding the “Arabic” and “Person” annotations
Step 6 Change Unknown Annotation
Open the drop-down list and select “Location”
“Yanbu” will now be highlighted as a “Location”
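Conceptually, an annotation is a typed label over a span of text, so changing its type leaves the span untouched. A minimal Python sketch, assuming a made-up sentence and a hypothetical `Annotation` class (not GATE's own data structure):

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    type: str   # label shown in the annotation list
    start: int  # offset of the first character of the span
    end: int    # offset just past the last character of the span


def annotate(text, phrase, annotation_type):
    """Annotate the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} does not occur in the text")
    return Annotation(annotation_type, start, start + len(phrase))


text = "Maria travelled to Yanbu."  # sample sentence, assumed for illustration
annotation = annotate(text, "Yanbu", "Unknown")

# Changing the annotation type keeps the same span of text.
annotation.type = "Location"
```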
Step 7
Tokens & Spaces Highlighted
Annotations List
Step 8
Sentences & Splits Highlighted
Step 9
Annotations List
Step 10
Create Arabic Pipeline
Step 11
Arabic Annotation
GATE now recognizes the Arabic word ماريا as a “Person” and annotates it as such
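The same idea can be sketched as a list lookup in Python. This is not how GATE's Arabic resources are implemented; the one-entry `PERSON_NAMES` list and the sample text are assumptions. The point is only that a Unicode-aware tokenizer lets the lookup work on Arabic script.

```python
import re

# Hypothetical one-entry name list.
PERSON_NAMES = {"ماريا"}


def find_people(text):
    """Annotate every word that appears in the name list as a Person."""
    annotations = []
    # In Python 3, \w matches letters of any script, including Arabic.
    for match in re.finditer(r"\w+", text):
        if match.group() in PERSON_NAMES:
            annotations.append(
                {"type": "Person", "start": match.start(), "end": match.end()}
            )
    return annotations


text = "Maria = ماريا"
people = find_people(text)
```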
Annotation List
(closer look)
Another Annotation Example with more token types
Application Pipelines, Corpora, Documents and Processing Resources can all be saved
Annotated documents can be saved for use outside of the GATE environment
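One way to picture an annotated document that is usable outside the tool is a file that stores the text and its annotations side by side. The Python sketch below uses plain JSON for this; the file layout, the file name `annotated.json`, and the sample document are assumptions and do not reproduce GATE's own export formats.

```python
import json
import os
import tempfile


def save_document(document, path):
    """Write the text and its annotations to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(document, handle, ensure_ascii=False, indent=2)


def load_document(path):
    """Read a document previously written by save_document."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


document = {
    "text": "Maria visited Yanbu.",  # sample text, assumed for illustration
    "annotations": [
        {"type": "Person", "start": 0, "end": 5},
        {"type": "Location", "start": 14, "end": 19},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "annotated.json")
save_document(document, path)
restored = load_document(path)
```

Because the annotations are stored as offsets into the text, any program that reads the file can recover the annotated spans.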
Thank You