Davis's Software Factory
|
For Academia
|
| Semantic Profiling of
Human Tags@CiteULike (2007/12) |
| Social bookmarking has become a popular service
on the Internet. The co-occurrence of a human-assigned tag and the text it annotates
gives us a way to explore the meaning of the tag. Such semantic
knowledge would be useful for understanding user information needs
in IR. Here is a demo based on the CiteULike dataset from Nov.
2004 to Feb. 2007. |
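As a rough illustration of the co-occurrence idea, the sketch below counts which words appear in the text annotated by each tag. It is not the code behind the demo; the `(tags, text)` input layout and the function names `build_tag_profiles` and `top_words` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_tag_profiles(bookmarks):
    """Count how often each word co-occurs with each tag.

    `bookmarks` is an iterable of (tags, text) pairs. This layout is assumed
    for illustration and is not the CiteULike data format.
    """
    profiles = defaultdict(Counter)
    for tags, text in bookmarks:
        words = set(text.lower().split())  # count each word once per bookmark
        for tag in tags:
            profiles[tag].update(words)
    return profiles


def top_words(profiles, tag, k=3):
    """Return the k words that most often co-occur with `tag`."""
    # Sort by count (descending), then alphabetically so ties are deterministic.
    ranked = sorted(profiles[tag].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]
```

Calling `top_words(build_tag_profiles(data), "ir", 2)` on a few tagged abstracts returns the two words seen most often under the tag `ir`, which gives a crude semantic profile of that tag.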
| Web-based Factoid Question Answering
(2007/10) |
| The current version can answer entity, number, and
abbreviation-related factoid questions (description
and definition questions are not supported yet). We developed a novel scoring algorithm
that takes into account the context of the candidate answers and
significantly improves the accuracy. In general, questions beginning
with a question word (e.g., how, when, who, where, what, and which)
yield higher accuracy, but a question word is not required. |
| Collective Wisdom based Entity
Classification (2007/09) |
| The assignment of a semantic category to an entity
(concept) is a challenging problem for machines. Traditional approaches
extract features from either surface forms or local contexts (surrounding
texts) and then apply machine learning methods or human-coded rules
for entity classification. Such classifiers usually require a large
number of training examples, domain-specific tuning, and even
human-created ontologies (dictionaries). Instead, this tool utilizes
the wisdom of crowds for entity classification. It builds the semantic
context for each entity through web search engines such as Google.
The top-ranked documents returned by a search engine give a sense
of what people think of the entity. The new approach is simple,
robust, and powerful: no tuning, no external dictionaries, applicable
to any domain, and, most importantly, good accuracy! Click
here to see a demo. |
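The following sketch shows one way a search-derived context could be turned into a category decision. The page does not say how the tool maps a context to a category, so the cosine-similarity comparison, the idea of retrieving documents for each category name, and all function names here are assumptions. The retrieved documents are passed in as plain strings; no search engine API is called.

```python
import math
from collections import Counter


def context_vector(documents):
    """Build a bag-of-words vector from retrieved documents (plain strings)."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(entity_docs, category_docs):
    """Pick the category whose context is most similar to the entity's context.

    `entity_docs` is a list of documents retrieved for the entity;
    `category_docs` maps each category name to documents retrieved for it.
    Both are hypothetical inputs supplied by the caller.
    """
    entity_vec = context_vector(entity_docs)
    scores = {
        name: cosine(entity_vec, context_vector(docs))
        for name, docs in category_docs.items()
    }
    return max(scores, key=scores.get)
```

A real system would at least remove stop words and weight terms, which this sketch leaves out to stay short.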
| The Dragon Pinyin [龙拼] (2007/04) |
| A smart pinyin for Chinese text input. This tool
features high accuracy on long Chinese text input and
flexible personalized startup training from one's personal documents
such as emails, chat logs, blogs, and other written essays. Click
here (or this
address) to see the online demo. |
| The Dragon
Toolkit (2006/07) |
| The Dragon Toolkit is a Java-based development package for academic use in language modeling (LM), information retrieval (IR), and text mining (TM, including text classification, text clustering, text summarization, and topic modeling). Language modeling has recently emerged as an attractive new framework for text information retrieval and text mining. However, most free Java-based search engines, such as Lucene, do not support LM very well. The Lemur toolkit is designed for LM and IR but is written in C and C++, which may be a hindrance to people who prefer Java. The Dragon Toolkit is therefore tailored for researchers who work on large-scale LM, IR, and TM and prefer Java programming. Moreover, unlike Lucene and Lemur, it provides built-in support for semantic-based IR and TM. The Dragon Toolkit seamlessly integrates a set of NLP tools, which enables it to index text collections with various representation schemes, including words, phrases, ontology-based concepts, and relationships. To minimize the learning time, however, we intentionally keep the package small and simple. The toolkit lacks some features, such as distributed IR and cross-language IR, that are part of the Lemur toolkit.
|
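To make the language-modeling approach to retrieval concrete, here is a minimal query-likelihood ranker with Dirichlet smoothing, a standard LM retrieval formulation. It is written in Python for brevity and does not use or imitate the Dragon Toolkit API; the function name and the default `mu` are assumptions for this example.

```python
import math
from collections import Counter


def rank_by_query_likelihood(query, docs, mu=2000.0):
    """Rank documents by log p(query | document) with Dirichlet smoothing.

    p(w | d) = (count(w, d) + mu * p(w | collection)) / (len(d) + mu)
    Returns document indices, best match first.
    """
    tokenized = [d.lower().split() for d in docs]
    collection = Counter(w for tokens in tokenized for w in tokens)
    total = sum(collection.values())
    scores = []
    for index, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for w in query.lower().split():
            p_coll = collection[w] / total
            if p_coll == 0.0:
                continue  # skip query words unseen in the whole collection
            score += math.log((tf[w] + mu * p_coll) / (len(tokens) + mu))
        scores.append((score, index))
    # Higher score first; ties broken by document index.
    return [index for _, index in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Smoothing with the collection model keeps a document from scoring zero just because it lacks one query word, which is the main practical difference from plain term matching.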
|
Ontology-based Biomedical Text Annotation
(2006/04)
|
| Dictionary-based biological concept extraction is still the state-of-the-art
approach to large-scale biomedical literature annotation and indexing.
Exact dictionary lookup is a very simple approach, but it typically
achieves low extraction recall because a biological term often has
many variants and no dictionary can collect all of
them. We propose a generic extraction approach, referred to as
approximate dictionary lookup, to cope with term variations and
implement it as an extraction system called MaxMatcher. The basic
idea of this approach is to capture the significant words of a particular
concept instead of all of its words. The new approach dramatically
improves extraction recall while maintaining precision.
|
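The idea of matching on significant words rather than all words can be sketched as follows. This is not the MaxMatcher implementation: the weighting scheme (a word is more significant when fewer concepts contain it), the threshold value, and the function names are assumptions chosen for illustration, and word order and proximity are ignored.

```python
from collections import defaultdict


def word_weights(dictionary):
    """Weight each word by 1 / (number of concepts whose terms contain it)."""
    concepts_with = defaultdict(set)
    for concept_id, terms in dictionary.items():
        for term in terms:
            for word in term.lower().split():
                concepts_with[word].add(concept_id)
    return {word: 1.0 / len(ids) for word, ids in concepts_with.items()}


def match_concepts(text, dictionary, threshold=0.7):
    """Return {concept_id: score} for concepts whose significant words occur in text.

    `dictionary` maps a concept id to a list of term variants. A concept's score
    is the best weighted fraction of a variant's words found in the text.
    """
    weights = word_weights(dictionary)
    tokens = set(text.lower().split())
    hits = {}
    for concept_id, terms in dictionary.items():
        best = 0.0
        for term in terms:
            words = term.lower().split()
            total = sum(weights[w] for w in words)
            matched = sum(weights[w] for w in words if w in tokens)
            best = max(best, matched / total)
        if best >= threshold:
            hits[concept_id] = best
    return hits
```

With the toy dictionary `{"C1": ["gene expression profile"], "C2": ["gene therapy"], "C3": ["protein expression"]}`, the text "the expression profile was measured" still matches `C1` even though "gene" is missing, whereas exact lookup would find nothing.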
| Queuing System Component (2001/11) |
| This component, written in Visual Basic, is middleware for developing
queuing system simulation applications. It consists of five
objects: event, customer, queue, distribution, and statistics. Because
of its flexibility and simplicity, it has been chosen by a research
team at Shanghai Maritime University to design software for simulating
the traffic of Shanghai Container Harbor, and by another research team
at BaoSteel, one of the largest steel companies in China. You
can now download the middleware for free. The package includes the component,
a help file written in Chinese, and sample projects. Download the
package. |
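For readers unfamiliar with queuing simulation, the sketch below simulates a single-server, first-come-first-served queue and reports the mean waiting time. It is written in Python, not Visual Basic, and does not reproduce the component's interface; the names `simulate_queue` and `exponential` are assumptions. The two callables loosely play the role of the "distribution" object and the returned average that of the "statistics" object.

```python
import random


def exponential(mean, seed=None):
    """Return a zero-argument sampler for an exponential distribution."""
    rng = random.Random(seed)
    return lambda: rng.expovariate(1.0 / mean)


def simulate_queue(n_customers, interarrival, service):
    """Simulate a single-server FIFO queue and return the mean waiting time.

    `interarrival` and `service` are zero-argument callables that return the
    next inter-arrival time and the next service time.
    """
    arrival = 0.0         # arrival time of the current customer
    server_free_at = 0.0  # time at which the server finishes its current job
    total_wait = 0.0
    for _ in range(n_customers):
        arrival += interarrival()
        start = max(arrival, server_free_at)  # wait if the server is busy
        total_wait += start - arrival
        server_free_at = start + service()
    return total_wait / n_customers
```

For example, `simulate_queue(10000, exponential(1.0, seed=1), exponential(0.8, seed=2))` estimates the mean wait when customers arrive on average every 1.0 time units and service takes 0.8 on average.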
For Fun
|
| Magic Conversion from Image to Word Doc (2004/08) |
| This is a fun tool that converts any image
to a Microsoft Word document. Each pixel in the image appears
as a specified character in the Word document. After adjusting the
document slightly, for example by changing the font type and size, you will see the
original image in the Word document. It seems marvellous! Download
the installation package (inclusive
of source code) |
| American Option Pricing Tool (2004/05) |
| It is a tool written in Visual Basic for American
option pricing. I calculate the option value by simulation,
using a simple least-squares approach as described in this
paper. Download the source code. |
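The sketch below illustrates pricing an American put by simulation with a least-squares regression for the continuation value. The linked paper is not identified on this page, so treating the method as the widely known least-squares Monte Carlo scheme is an assumption, as are the geometric Brownian motion price model, the quadratic regression basis, and every name and default value. It is written in Python and is not a port of the Visual Basic tool.

```python
import math
import random


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def american_put_lsm(s0, strike, rate, sigma, maturity, steps=50, paths=2000, seed=0):
    """Estimate an American put value by least-squares Monte Carlo."""
    rng = random.Random(seed)
    dt = maturity / steps
    disc = math.exp(-rate * dt)
    drift = (rate - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    # Simulate price paths; prices[i][t] is the price at time (t + 1) * dt.
    prices = []
    for _ in range(paths):
        s, path = s0, []
        for _ in range(steps):
            s *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            path.append(s)
        prices.append(path)
    cash = [max(strike - p[-1], 0.0) for p in prices]  # payoff at maturity
    for t in range(steps - 2, -1, -1):
        cash = [c * disc for c in cash]  # discount future cash flows one step
        itm = [i for i in range(paths) if prices[i][t] < strike]
        if len(itm) < 3:
            continue  # too few in-the-money paths to fit the regression
        xs = [prices[i][t] / strike for i in itm]
        # Normal equations for regressing cash flows on 1, x, x^2.
        a = [[sum(x ** (j + k) for x in xs) for k in range(3)] for j in range(3)]
        b = [sum(cash[i] * x ** j for i, x in zip(itm, xs)) for j in range(3)]
        beta = solve3(a, b)
        for i, x in zip(itm, xs):
            continuation = beta[0] + beta[1] * x + beta[2] * x * x
            exercise = strike - prices[i][t]
            if exercise > continuation:
                cash[i] = exercise  # exercise now; replaces later cash flows
    return disc * sum(cash) / paths
```

A call such as `american_put_lsm(36.0, 40.0, 0.06, 0.2, 1.0)` returns a Monte Carlo estimate, so the result varies with `seed`, `paths`, and `steps`.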
| Netlinez (2003/07) |
| It is a mini-game written in Visual Basic and Visual
C++. The original idea is not mine; it comes from a Russian guy. The
original version supports only stand-alone mode. I wrote the code
from scratch and extended it into a network game. Now two players can
compete or cooperate with each other over the Internet. The new version
seems more fun than the original one. Download installation
package and source code. |