Contact Information
    Office:  231 Rush Building
    Tel(O): (215)895-6360
    Tel(C): (215)840-6193
 

Research in Natural Language Understanding

Why This Seminar
Prerequisite and Audience
Course Format
Exit Competencies
Schedule and Content
Research Questions

 
Why This Seminar
A group of faculty members in our college (Drs. Song, Han, Hu, Weber, Lin, and Kaplan) share an interest in system-oriented research. Over the past year and a half, I have either worked as a research assistant for these faculty members or attended their lectures, and I have found that their research more or less targets natural language understanding. Meanwhile, a number of Ph.D. students in our college are also interested in this line of research. This seminar is therefore designed not only to help Ph.D. students on the technical track build a solid foundation, but also to give interested faculty and Ph.D. students a platform for sharing research ideas.

Prerequisite and Audience
This seminar-style course is intended for Ph.D. students on the technical track in IST, especially those interested in research in natural language understanding. If you are still unsure what this course is about, please read the research questions it is going to address. Because almost all topics in this course involve machine learning, students should have completed the following prerequisites or acquired equivalent skills.
  • Data Mining (INFO634)
  • Concepts in Artificial Intelligence (INFO629)
  • Research Statistics (INFO695)
Course Format

Each class comprises four components: an area overview, a discussion of technical questions, proposal presentations, and proposal critiques. The class begins with a topic overview given by the instructor. After that, all students take part in the discussion of technical questions. Then half of the students present their solutions to the research questions designated for that class. Each presentation is followed by a critique from classmates on any aspect of the solution, for example its technical feasibility, strengths, and weaknesses.

The class will be divided into two groups, A and B. In odd weeks, each student in Group A answers the technical questions, while each student in Group B writes a proposal for one of the designated research questions and then presents it. In even weeks, the assignments are reversed. The first week has no assignment, and the last week is reserved for term project presentations. For the term project, every student must pick one of the eight listed research questions (a research question already used for homework during Weeks 2-9 cannot be chosen again). The outcome of the project should be a formal research paper.

The course grade is based on class participation (20%), technical questions (20%), four proposals and presentations (30%), and the term project (30%).

Exit Competencies
  • Machine Learning and Natural Language Processing Techniques
    Students are expected to learn a range of ML and NLP techniques by solving real-world problems. Even if you already know the principles behind these techniques well, you will see how much better they work in conjunction with other techniques, such as ontologies.
  • Skills for Proposal Writing, Presentation and Review
    Each student will write five proposals, one for each of five research questions (four for homework and one for the term project), and present them in class. Everyone is also encouraged to critique classmates' proposals.
  • The State of the Art of Research Areas
    By reading papers and lecture notes and participating in class discussion, students will not only grasp the state of the art in research areas such as information extraction, text mining, and effective information retrieval, but also get a sense of how research is typically done in NLU and related areas.
  • Starting Your Own Research
    The ultimate goal of Ph.D. study is not to learn domain knowledge, but to contribute novel ideas to your research community. When you propose solutions to the research questions, you are already putting yourself on the right track!
Schedule and Content
 
Week 1: Course Overview

Coverage:
(1) brief introduction to all research topics and questions.
(2) machine learning techniques overview:
a. probabilistic models
b. instance-based learning
c. symbolic rules
d. Markov models and expectation maximization

Readings:
A. Berger, S. A. Della Pietra, and V. J. Della Pietra, A maximum entropy approach to natural language processing
Rebecca F. Bruce and Janyce M. Wiebe, Decomposable modeling in natural language processing
Roland Kuhn and Renato De Mori, The Application of Semantic Classification Trees to Natural Language Understanding
Ronald L. Rivest, Learning Decision Lists
Walter Daelemans et al., TiMBL: Tilburg Memory Based Learner V2.0 Reference Guide
Lawrence Rabiner, A tutorial on hidden Markov models
Michael Collins, The EM Algorithm

Week 2: Part of Speech Tagger and Sentence Parser
Coverage:
(1) Link Grammar Parser
(2) Markov Model Based Parser
(3) Rule-based POS Tagger
(4) Markov Model Based POS Tagger
(5) Morphology Based POS Tagger

Readings:
Eugene Charniak, Statistical techniques for natural language parsing
Daniel Sleator and Davy Temperley, Parsing English with a Link Grammar
Peter Szolovits, Adding a Medical Lexicon to an English Parser
Eric Brill, A simple rule-based part of speech tagger
Evan L. Antworth, Morphological Parsing with a Unification-based Word Grammar
Boris Katz, From Sentence Processing to Information Access on the World Wide Web

Research Questions: #2 or #6
Technical Questions:
(1) How can one complex sentence (with subordinate clauses) be converted into several simple clauses?
(2) Name one application for a parser and one application for a POS tagger.
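
To make the Markov-model-based POS tagger in this week's coverage more concrete, here is a minimal sketch of Viterbi decoding for a first-order hidden Markov model. Everything specific in it is an assumption made for illustration: the three-tag set, the hand-picked probability tables, the `floor` value for unseen events, and the function name `viterbi`. A real tagger would estimate these probabilities from a tagged corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence for `words` under a first-order HMM."""
    if not words:
        return []

    def logp(table, key):
        # Unseen events get a tiny floor probability instead of zero.
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (logp(start_p, t) + logp(emit_p[t], words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + logp(trans_p[p], t)) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + logp(emit_p[t], words[i]), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Toy model with made-up numbers (illustration only).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
    "NOUN": {"VERB": 0.6, "NOUN": 0.3, "DET": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "park": 0.3},
    "VERB": {"walk": 0.6, "dog": 0.1, "barks": 0.3},
}

print(viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT))
# → ['DET', 'NOUN', 'VERB']
```

Note how "walk" is ambiguous in the toy emission table: after a determiner the transition probabilities favor the noun reading, which is the kind of context effect a rule-based tagger has to encode by hand.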

Week 3: Information Extraction
Coverage:
(1) Named Entity Recognition
(2) Pronominal Reference
(3) Concept Identification
(4) Pattern-based Information Extraction

Readings:
Douglas Appelt et al., Introduction to Information Extraction Technology
Diana Maynard et al., MUlti-Source Entity recognition
Razvan Bunescu et al., Learning to Extract Proteins and their Interactions from Medline Abstracts
Harith Alani et al., Automatic extraction of knowledge from web documents

Research Questions: #1 or #7
Technical Questions:
(1) Summarize all IE tasks and list at least two methods for each task.
(2) Write a JAPE-compliant pattern for extracting numbers written out in English words (e.g., one hundred and fifty-six).
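
As a warm-up for pattern-based extraction, the sketch below shows the idea behind such a number pattern using a plain Python regular expression. It is not JAPE and makes no claim about JAPE syntax; the coverage (1 to 999 only), the constant names, and the function `find_number_words` are assumptions chosen for illustration.

```python
import re

UNITS = r"(?:one|two|three|four|five|six|seven|eight|nine)"
TEENS = (
    r"(?:ten|eleven|twelve|thirteen|fourteen|fifteen|"
    r"sixteen|seventeen|eighteen|nineteen)"
)
TENS = r"(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)"

# 1-99: "fifty-six", "fifty six", "fifty", "twelve", "six"
BELOW_100 = rf"(?:{TENS}(?:[- ]{UNITS})?|{TEENS}|{UNITS})"
# 1-999: "<unit> hundred", optionally followed by "and" and a 1-99 part
BELOW_1000 = rf"(?:{UNITS} hundred(?:(?: and)? {BELOW_100})?|{BELOW_100})"

NUMBER = re.compile(rf"\b{BELOW_1000}\b", re.IGNORECASE)


def find_number_words(text):
    """Return every spelled-out number (1-999) found in `text`, left to right."""
    return [m.group(0) for m in NUMBER.finditer(text)]


print(find_number_words("She paid one hundred and fifty-six dollars for twelve books."))
# → ['one hundred and fifty-six', 'twelve']
```

A pure surface pattern like this also matches "one" used as a pronoun, which is one reason IE systems usually combine patterns with POS or gazetteer information.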

Week 4: Linguistic Patterns
Coverage: pattern representation, learning, and applications

Readings:
Ion Muslea, Extraction Patterns for Information Extraction Tasks: A Survey
Ellen Riloff, Automatically Constructing a Dictionary for Information Extraction Tasks
Mary Califf and Raymond Mooney, Bottom-up Relational Learning of Pattern-Match Rules for Information Extraction
Stephen Soderland, Learning Information Extraction Rules for Semi-Structured and Free Text
Xiaohua Hu et al., Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature
Dayne Freitag, Information extraction from HTML: Application of a general machine learning approach

Research Questions: #1 or #5
Technical Questions:
(1)

Week 5: Word Sense Disambiguation

Coverage:
(1) knowledge sources for WSD.
(2) unsupervised and supervised WSD approaches.
(3) WSD applications.

Readings:
Oi Yee Kwong, Word Sense Disambiguation with an Integrated Lexical Resource
Rada Mihalcea and Dan I. Moldovan, An Iterative Approach to Word Sense Disambiguation
Hoa Trang Dang and Martha Palmer, Combining Contextual Features for Word Sense Disambiguation
Mark Stevenson and Yorick Wilks, The Interaction of Knowledge Sources in Word Sense Disambiguation
Raymond Mooney, Comparative Experiments on Disambiguating Word Senses: An Illustration of the Role of Bias in Machine Learning
Nancy Ide and Jean Veronis, Word sense disambiguation: The state of the art
Xiaohua Zhou and Hyoil Han, Survey of WSD Approaches

Research Questions: #4 or #8
Technical Questions:
(1) Name three applications of WSD. For each one, state which WSD approach you would apply and why.
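
To illustrate how a knowledge source can drive disambiguation, here is a minimal sketch of the simplified Lesk idea: choose the sense whose dictionary gloss shares the most words with the context. The two-sense inventory for "bank", the stopword list, and the function names are hypothetical and exist only for this example; a real system would draw glosses from a lexical resource and would normalize word forms.

```python
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "that",
    "and", "or", "for", "by", "is", "i", "my", "we",
}

# Hypothetical sense inventory (illustration only).
SENSES = {
    "bank": {
        "bank#finance": "a financial institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a body of water such as a river",
    }
}


def content_words(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(word, context, senses=SENSES):
    """Return the sense id whose gloss overlaps most with the context words."""
    context_set = content_words(context)

    def overlap(sense_id):
        return len(context_set & content_words(senses[word][sense_id]))

    # On a tie, max() keeps the first sense listed in the inventory.
    return max(senses[word], key=overlap)


print(simplified_lesk("bank", "I went to the bank to deposit my money"))
# → bank#finance
print(simplified_lesk("bank", "we sat on the bank of the river and watched the water"))
# → bank#river
```

In the first call only "money" overlaps, because "deposit" and "deposits" are treated as different words; this shows how sensitive gloss overlap is to morphology.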

Week 6: Ontology
Coverage:
(1) Ontology Representation
(2) Ontology Learning
(3) Applications of Ontology

Readings:
Christopher Brewster and Kieron O'Hara, Knowledge representation with ontologies: the present and future
H. Wache et al., Ontology-based integration of information: a survey of existing approaches
Eneko Agirre, Olatz Ansa, Eduard Hovy and David Martinez, Enriching very large ontologies using the WWW
Alexander Maedche and Steffen Staab, Ontology learning for the Semantic Web
Mihai Barbuceanu, Mark S. Fox, Lei Hong, Yannick Lallement and Zhongdong Zhang, Building Agents for the Customer Service Front
Aykut Firat, Stuart Madnick, and Benjamin N. Grosof, Knowledge Integration to Overcome Ontological Heterogeneity: Challenges from Financial Information Systems

Research Questions: #3 or #7
Technical Questions:

Week 7: Effective Information Retrieval

Coverage:
(1) Sense-based IR Model
(2) Query Expansion
(3) Graph-based IR Model
(4) Vector-based IR Model
(5) Relationship-based IR Model

Readings:
Christopher Stokoe and John Tait, Towards a Sense Based Document Representation for Internet Information Retrieval
Roberto Navigli and Paola Velardi, An Analysis of Ontology-based Query Expansion Strategies
J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, The web as a graph: Measurements, models and methods
Schenker, A., Last, M., Bunke, H., and Kandel, A., Clustering of Web Documents Using a Graph Model
Scott Deerwester et al., Indexing by Latent Semantic Analysis
Chirag Shah and Pushpak Bhattacharyya, Improving Document Vectors Representation using Semantic Links and Attributes

Research Questions: #4 or #5
Technical Questions:
(1) How is NLP helpful for document indexing or searching?
(2) Propose at least one approach, different from that of (Schenker 2003), for building a graph for a document.
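
The query expansion item in this week's coverage can be sketched in a few lines: each query term is kept and followed by related terms from a knowledge source. The synonym table and the function name `expand_query` below are hypothetical; in practice the related terms might come from an ontology or thesaurus, and choosing which ones to add is the hard part.

```python
# Hypothetical hand-written synonym table (illustration only).
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "buy": ["purchase"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query terms, each followed by its listed synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("buy car"))
# → ['buy', 'purchase', 'car', 'automobile', 'vehicle']
```

Expanding every term blindly tends to hurt precision when a term is ambiguous, which is where the sense-based models in the coverage list come in.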

Week 8: Text Mining

Coverage: Typical text mining research

Readings:
Marti A. Hearst, Untangling Text Data Mining
Anonymous, A Survey of Text Data Mining
Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, and Wayne Niblack, Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques
Rayid Ghani and Andrew E. Fano, Using Text Mining to Infer Semantic Attributes for Retail Data Mining
Xiaohua Hu et al., Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature
Michael Gordon and Susan Dumais, Using Latent Semantic Indexing for Literature-based Discovery

Research Questions: #2 or #3
Technical Questions:
(1) Discuss the differences among information extraction (IE), text mining (TM), and data mining (DM).

Week 9: Semantic Web
Coverage: overview of Semantic Web

Readings:
Jeff Heflin and James Hendler, A Portrait of the Semantic Web in Action
James Hendler, Agents and the Semantic Web
Sean B. Palmer, The Semantic Web: An Introduction
Shiyong Lu, Ming Dong and Farshad Fotouhi, The Semantic Web-opportunities and challenges for next-generation Web applications
Alexander Maedche and Steffen Staab, Ontology learning for the Semantic Web
R. Guha, R. McCool and R. Fikes, Contexts for the Semantic Web
Stuart E. Madnick, Metadata Jones and the Tower of Babel: The Challenge of Large-Scale Semantic Heterogeneity

Research Questions: #6 or #8
Technical Questions:

Week 10: Final Presentation
 
Research Questions
  • #1 Auto-generation of Class Diagram
    Object-oriented analysis requires building class diagrams from problem statements or functional specification documents written in natural language. Can you build a system that automatically generates a class diagram from the requirement documents?

  • #2 Sentiment Analysis on Product Reviews
    A marketing manager at an MP3 player manufacturer has collected thousands of online product reviews from customers. He is looking for a system that can summarize those reviews (their negative and positive aspects). Can you build such a system for automated sentiment analysis?

  • #3 Math Problem Routing (Contributed by Xiaodan Zhang)
    Volunteers (Math Doctors) at the Math Forum answer math questions submitted online by kids. So far, the Math Forum has accumulated a large number of Q&A pairs. The Math Doctors have found that a good portion of the submitted questions have actually been answered before. Can you build a system to relieve the Math Doctors' workload (i.e., process questions submitted online and find the n most similar questions in the Q&A database)?

  • #4 Semantics-based Information Retrieval
    Most current information retrieval (IR) systems use keywords to query, match, and index documents. However, keywords are not effective at representing semantics. First, a keyword usually has several senses, so its meaning is ambiguous without context. Second, one meaning can be expressed by many keywords, and it is difficult for users to enumerate all of them. Last, and most problematic, the co-occurrence of keywords in a document does not mean that they are semantically related. Can you propose new IR models that address these problems?

  • #5 Web Image Indexer
    Can you build a web image indexer that draws clues from the text surrounding each image?

  • #6 Automated Customer Support
    A grocery store is looking for an intelligent customer support system that can interact with customers through email or online conversation. The scope of possible customer questions is limited to complaints, returns, inquiries, and FAQs. Can you build such a system?

  • #7 Ontology Acquisition (Contributed by Dr. Han)
    Ontologies are now used extensively, but building one is extremely expensive. Can you propose an approach to automate ontology acquisition from a collection of domain-specific documents? For example, can you build a university ontology from all the web pages of drexel.edu?

  • #8 Knowledge Integration
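
As a starting point for research question #3 (finding the n most similar questions in a Q&A archive), here is a keyword-level baseline sketch using TF-IDF weights and cosine similarity. The three archive questions, the smoothed IDF formula, and all function names are assumptions made for illustration, not a description of any existing system. Research question #4 explains why such a keyword baseline is limited.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    tokens = (w.strip(".,?!").lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, chosen here so that every known term has a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, docs, n=2):
    """Return the n documents most similar to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the archive get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [docs[i] for i in ranked[:n]]


# Made-up archive questions (illustration only).
ARCHIVE = [
    "How do I add fractions with different denominators?",
    "What is the area of a circle?",
    "How do I find the area of a triangle?",
]

print(most_similar("How can I add two fractions?", ARCHIVE, n=1))
# → ['How do I add fractions with different denominators?']
```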
 

©2006 Davis Zhou, All Rights Reserved. Last Update on December 8, 2007.