| Office: 231 Rush Building |
| Tel(O): (215)895-6360 |
| Tel(C): (215)840-6193 |
Research in Natural Language Understanding
|
|
Why This Seminar
Prerequisite and Audience
Course Format
Exit Competencies
Schedule and Content
Research Questions
|
| |
| Why This Seminar |
A group of faculty members in our college (Drs. Song, Han, Hu, Weber,
Lin and Kaplan) share an interest in system-oriented research.
During the past year and a half of study, I have worked as an RA for
these faculty members or attended their lectures, and I have come to
see that their research more or less targets natural language
understanding. Meanwhile, a number of Ph.D. students in our college
are also interested in this line of research. This seminar is
therefore designed not only to help Ph.D. students on this technical
track build a solid foundation, but also to give interested faculty
members and Ph.D. students a platform for sharing research ideas.
|
| Prerequisite and Audience |
This seminar-style course is intended for Ph.D. students on the
technical track in IST, especially those who are interested in research
in natural language understanding. If you are still unsure what
this course is about, please read the research questions
it is going to address. Because almost all topics in this
course involve machine learning, students should have completed the
following prerequisites or have acquired equivalent skills.
- Data Mining (INFO634)
- Concepts in Artificial Intelligence (INFO629)
- Research Statistics (INFO695)
|
| Course Format |
|
Each class consists of four components: area overview, technical
question discussion, proposal presentation, and proposal critique.
The class begins with a topic overview provided by the instructor.
After that, all students participate in the discussion of technical
questions. Then half of the students in the class present their solutions
to the research questions designated for that class. The presentations
are followed by critiques from classmates on any aspect, for
example the technical feasibility, strengths and weaknesses of the
solution.
The whole class will be divided into two groups, A and B. In odd
weeks, each student in Group A answers the technical questions while
each student in Group B writes a proposal for a designated research
question and then makes a presentation. In even weeks, the
assignments are reversed. The first week is free of any assignment.
The last week is for presentations of term projects. Each student
needs to pick one of the eight listed research questions
(research questions used for homework during Weeks 2-9 cannot
be chosen again) for the term project. The outcome of the
project should be a formal research paper.
Course grading will be based on class participation (20%),
technical questions (20%), four proposals and presentations (30%),
and the term project (30%).
|
| Exit Competencies |
- Machine Learning and Natural Language Processing Techniques
Students are expected to learn a range of ML and NLP techniques
by solving real-world problems. Even if you already know the principles
of these techniques well, you will see how much better they work in
conjunction with other techniques such as ontologies.
- Skills for Proposal Writing, Presentation and Review
Each student will write five proposals, one for each of five research
questions (four for homework and one for the term project), and
present them in class. Each student is also encouraged to critique
proposals from classmates.
- The State of the Art of Research Areas
By reading papers and lecture notes as well as participating
in class discussion, students will not only grasp the state
of the art of research areas such as information extraction,
text mining, and effective information retrieval, but also get a sense
of the typical way to do research in NLU and related areas.
- Starting Your Own Research
For Ph.D. study, the ultimate goal is not to learn domain
knowledge, but to contribute novel ideas to the research
community you are in. When you propose solutions
to the research questions, you are already putting yourself on the
right track!
|
| Schedule and Content |
| |
| Week 1: Course Overview |
|
Coverage:
(1) brief introduction to all research topics and questions.
(2) machine learning techniques overview:
a. probabilistic models
b. instance-based learning
c. symbolic rules
d. Markov models and expectation maximization
Readings:
A. Berger, S. A. Della Pietra, and V. J. Della Pietra, A
maximum entropy approach to natural language processing
Rebecca F. Bruce and Janyce M. Wiebe, Decomposable
modeling in natural language processing
Roland Kuhn and Renato De Mori, The
Application of Semantic Classification Trees to Natural
Language Understanding
Ronald L. Rivest, Learning
Decision Lists
Walter Daelemans et al., TiMBL:
Tilburg Memory Based Learner V2.0 Reference Guide
Lawrence Rabiner, A
tutorial on hidden Markov models
Michael Collins, The EM
Algorithm
|
| Week 2: Part of Speech Tagger and Sentence Parser |
Coverage:
(1) Link Grammar Parser
(2) Markov Model Based Parser
(3) Rule-based POS Tagger
(4) Markov Model Based POS Tagger
(5) Morphology Based POS Tagger
Readings:
Eugene Charniak, Statistical
techniques for natural language parsing
Daniel Sleator and Davy Temperley, Parsing
English with a Link Grammar
Peter Szolovits, Adding
a Medical Lexicon to an English Parser
Eric Brill, A
simple rule-based part of speech tagger
Evan L. Antworth, Morphological
Parsing with a Unification-based Word Grammar
Boris Katz, From Sentence Processing
to Information Access on the World Wide Web
Research Questions: #2 or #6
Technical Questions:
(1) How can a complex sentence (with subordinate clauses) be
converted into several simple clauses?
(2) Name one application of a parser and one application
of a POS tagger.
|
| Week 3: Information Extraction |
Coverage:
(1) Named Entity Recognition
(2) Pronominal Reference
(3) Concept Identification
(4) Pattern-based Information Extraction
Readings:
Douglas Appelt et al., Introduction
to Information Extraction Technology
Diana Maynard et al., MUlti-Source
Entity recognition
Razvan Bunescu et al., Learning
to Extract Proteins and their Interactions from Medline Abstracts
Harith Alani et al., Automatic
extraction of knowledge from web documents
Research Questions: #1 or #7
Technical Questions:
(1) Summarize all IE tasks and list at least two methods for
each task.
(2) Write a JAPE-compliant pattern that extracts numbers written out
in English words (e.g. one hundred and fifty-six).
|
| Week 4: Linguistic Patterns |
Coverage: pattern representation, learning and
applications
Readings:
Ion Muslea, Extraction
Patterns for Information Extraction Tasks: A Survey
Ellen Riloff, Automatically
Constructing a Dictionary for Information Extraction Tasks
Mary Califf and Raymond Mooney, Bottom-up
Relational Learning of Pattern-Match Rules for Information
Extraction
Stephen Soderland, Learning
Information Extraction Rules for Semi-Structured and Free
Text
Xiaohua Hu et al., Extracting
and Mining Protein-Protein Interaction Network from Biomedical
Literature
Dayne Freitag, Information
extraction from HTML: Application of a general machine learning
approach
Research Questions: #1 or #5
Technical Questions:
(1)
|
| Week 5: Word Sense Disambiguation |
|
Coverage:
(1) knowledge sources for WSD.
(2) unsupervised and supervised WSD approaches.
(3) WSD applications.
Readings:
Oi Yee Kwong, Word Sense Disambiguation
with an Integrated Lexical Resource
Rada Mihalcea and Dan I. Moldovan, An
Iterative Approach to Word Sense Disambiguation
Hoa Trang Dang and Martha Palmer, Combining
Contextual Features for Word Sense Disambiguation
Mark Stevenson and Yorick Wilks, The
Interaction of Knowledge Sources in Word Sense Disambiguation
Raymond Mooney, Comparative Experiments
on Disambiguating Word Senses: An Illustration of the Role of Bias
in Machine Learning
Nancy Ide and Jean Veronis,
Word sense disambiguation: The state of the art
Xiaohua Zhou and Hyoil Han, Survey
of WSD Approaches
Research Questions: #4 or #8
Technical Questions:
(1) Please enumerate three applications of WSD, state
which WSD approach you would apply to each application,
and explain why.
|
| Week 6: Ontology |
Coverage:
(1) Ontology Representation
(2) Ontology Learning
(3) Applications of Ontology
Readings:
Christopher Brewster and Kieron O'Hara, Knowledge
representation with ontologies: the present and future
H. Wache et al., Ontology-based
integration of information: a survey of existing approaches
Eneko Agirre, Olatz Ansa, Eduard Hovy and David Martinez,
Enriching
very large ontologies using the WWW
Alexander Maedche and Steffen Staab, Ontology
learning for the Semantic Web
Mihai Barbuceanu, Mark S. Fox, Lei Hong, Yannick Lallement
and Zhongdong Zhang, Building
Agents for the Customer Service Front
Aykut Firat, Stuart Madnick, and Benjamin N. Grosof, Knowledge
Integration to Overcome Ontological Heterogeneity: Challenges
from Financial Information Systems
Research Questions: #3 or #7
Technical Questions:
|
| Week 7: Effective Information Retrieval |
|
Coverage:
(1) Sense-based IR Model
(2) Query Expansion
(3) Graph-based IR Model
(4) Vector-based IR Model
(5) Relationship-based IR Model
Readings:
Christopher Stokoe and John Tait, Towards
a Sense Based Document Representation for Internet Information
Retrieval
Roberto Navigli and Paola Velardi, An
Analysis of Ontology-based Query Expansion Strategies
J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and
A. Tomkins, The web
as a graph: Measurements, models and methods
Schenker, A., Last, M., Bunke, H., and Kandel, A., Clustering
of Web Documents Using a Graph Model
Scott Deerwester et al., Indexing
by Latent Semantic Analysis
Chirag Shah and Pushpak Bhattacharyya, Improving
Document Vectors Representation using Semantic Links and
Attributes
Research Questions: #4 or #5
Technical Questions:
(1) How is NLP helpful for document indexing or searching?
(2) Please devise at least one approach, different from
(Schenker 2003), to building a graph for a document.
|
| Week 8: Text Mining |
|
Coverage: Typical text mining research
Readings:
Marti A. Hearst, Untangling
Text Data Mining
Anonymous, A Survey of Text
Data Mining
Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, and Wayne
Niblack, Sentiment
Analyzer: Extracting Sentiments about a Given Topic using
Natural Language Processing Techniques
Rayid Ghani and Andrew E. Fano, Using
Text Mining to Infer Semantic Attributes for Retail Data
Mining
Xiaohua Hu et al., Extracting
and Mining Protein-Protein Interaction Network from Biomedical
Literature
Michael Gordon and Susan Dumais, Using
Latent Semantic Indexing for Literature-based Discovery
Research Questions: #2 or #3
Technical Questions:
(1) Discuss the differences among IE, TM and DM.
|
| Week 9: Semantic Web |
Coverage: overview of Semantic Web
Readings:
Jeff Heflin and James Hendler,
A Portrait of the Semantic Web in Action
James Hendler, Agents and
the Semantic Web
Sean B. Palmer, The
Semantic Web: An Introduction
Shiyong Lu, Ming Dong and Farshad Fotouhi, The
Semantic Web-opportunities and challenges for next-generation
Web applications
Alexander Maedche and Steffen Staab, Ontology
learning for the Semantic Web
R. Guha, R. McCool and R. Fikes, Contexts
for the Semantic Web
Stuart E. Madnick, Metadata
Jones and the Tower of Babel: The Challenge of Large-Scale
Semantic Heterogeneity
Research Questions: #6 or #8
Technical Questions:
|
| Week 10: Final Presentation |
| |
|
| Research Questions |
- #1 Auto-generation of Class Diagram
Object-oriented analysis requires building class diagrams
from problem statements or functional specification documents
written in natural language. Can you build a system that
automatically generates a class diagram from the requirements documents?
- #2 Sentiment Analysis on Product Reviews
A marketing manager at an MP3 player
manufacturer has collected thousands of online product reviews from
customers. He is looking for a system that can summarize
those reviews (negative and positive aspects). Can you build
such a system for automated sentiment analysis?
- #3 Math Problem Routing (Contributed by Xiaodan Zhang)
Volunteers (Math Doctors) at Math Forum answer math questions
submitted online by kids. So far, Math Forum has accumulated
a large number of Q&A pairs. The Math Doctors have found that a good
portion of the submitted questions have actually been answered before.
Can you build a system to relieve the Math Doctors' workload (i.e. process
questions submitted online and find the n most similar questions
in the Q&A database)?
- #4 Semantics-based Information Retrieval
Most present information retrieval (IR) systems use keywords
to query, match and index documents. However, keywords are not
effective at representing semantics. First, a keyword
usually has several senses, and its meaning is ambiguous without
context. Second, one meaning can be expressed by many keywords,
and it is difficult for users to enumerate all of them.
Last and most problematic, the co-occurrence of keywords
in a document does not mean that they are semantically related.
Can you propose new IR models to address these problems?
- #5 Web Image Indexer
Can you build a web image indexer by getting clues from surrounding
text of the image?
- #6 Automated Customer Support
A grocery store is looking for an intelligent customer support
system that can interact with customers through email or online
conversation. The scope of possible customer questions
is limited to complaints, returns, inquiries, and FAQs. Can you build
such a system?
- #7 Ontology Acquisition (Contributed by Dr. Han)
Ontologies are now used extensively, but building one is extremely
expensive. Can you propose an approach to automating
ontology acquisition from a collection of domain-specific documents?
For example, can you build a university ontology from all the web
pages of drexel.edu?
- #8 Knowledge Integration
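As a possible starting point for research question #3 (Math Problem Routing), the sketch below ranks archived questions by TF-IDF cosine similarity to a newly submitted one. It is only a keyword-based baseline under stated assumptions, not a proposed solution: the function names (tokenize, build_index, most_similar), the smoothed IDF formula and the sample questions are all illustrative and do not come from Math Forum. It also suffers from the keyword problems described in question #4.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, idf):
    """Weight raw term counts by IDF; terms unseen in the archive are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(questions):
    """Return (idf, vectors): IDF weights plus one TF-IDF vector per archived question."""
    token_lists = [tokenize(q) for q in questions]
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per question
    n_docs = len(questions)
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, questions, n=3):
    """Return the n archived questions most similar to the query, with their scores."""
    idf, vectors = build_index(questions)
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(vectors)]
    # Highest score first; ties are broken by original archive order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(questions[i], score) for score, i in scored[:n]]


if __name__ == "__main__":
    # Hypothetical archive entries, invented for illustration only.
    archive = [
        "How do I add two fractions with different denominators?",
        "What is the area of a circle?",
        "Why is a negative times a negative positive?",
    ]
    for question, score in most_similar("how to add fractions", archive, n=2):
        print(f"{score:.3f}  {question}")
```

A proposal for question #3 would likely need to go beyond this baseline, for example by handling paraphrases and mathematical notation, which plain keyword overlap cannot capture.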
|