CS 381.3/780: Data Warehouse & Data Mining

PRE-REQUISITE:
Departmental permission for students registering for CSCI 780; background equivalent to MATH 241 and CSCI 086 for students registering for CSCI 381.3.

Note: Graduate students may be able to use this course to satisfy the semi-core of Scientific and Statistical Computing.

Course structure

In this course we will focus on the following five topics related to data mining and data warehousing:

  1. Concept of a data warehouse, including its model design and schema design.
  2. Basics of using Oracle technology for ELT (Extraction, Load, and Transformation), and implementing applications with dynamic PL/SQL for the presentation layer.
  3. Concept of patterns for encapsulating information.
  4. (For graduate and advanced undergraduate students) Information theory and statistics as a foundation for data mining.
  5. (For graduate and advanced undergraduate students) Advanced information-statistical techniques for (population) change detection, pattern identification, model discovery, and pattern-based probabilistic inference.

This course is unique in several respects. First, it covers topics that attempt to bridge statistics and information theory through the concept of patterns. This provides a framework for applying a statistical approach to EDA (Exploratory Data Analysis) in data mining, with information theory used to interpret the meaning of what EDA discovers. Second, the instructor will share his experience of multi-disciplinary collaboration in a number of related fields, such as statistics and probability, information theory, databases, and computational intelligence. Third, the instructor will emphasize implementation, to demonstrate the practicality of the approaches discussed in this course.

Various tools have been implemented and will be made available for this course. Commercial tools that may be used include the Insightful I-Miner data mining tool, S-PLUS, and Mathcad. Tools developed in our previous research that may also be used include an Oracle-based integrated environment for data warehousing and data mining, an ActiveX and/or Java Data Constructor utility, patent-pending ActiveX and Java software for model discovery and probabilistic inference, ActiveX Bayesian network software, an S-PLUS script for discovering significant event association patterns, and a Mathcad application for change point detection.

Textbook and web resources

  1. Information-Statistical Data Mining: Warehouse Integration with Examples of Oracle Basics, Bon K. Sy and Arjun K. Gupta, Kluwer Academic Publishers, anticipated release date: last quarter of 2003 or first quarter of 2004.
  2. http://bonnet19.cs.qc.cuny.edu:7778/pls/forum/ (Data Mining/Data Warehouse E-community)
  3. http://bonnet19.cs.qc.cuny.edu:7778/pls/rschdata/ and http://bonnet19.cs.qc.cuny.edu:7778/pls/rschdata/portal.login_datamining