Contents

XLMiner™ Overview

Introduction

XLMiner™ is a comprehensive data mining add-in for Excel. Data mining is a discovery-driven data analysis technology used for identifying patterns and relationships in data sets. With overwhelming amounts of data now available from transaction systems and external data sources, organizations are presented with increasing opportunities to understand their data and gain insights into it. Data mining is still a nascent field, and is a convergence of fields like statistics, machine learning, and artificial intelligence.

Often, there may be more than one approach to a problem. XLMiner™ is a toolbelt to help you get quickly started on data mining, offering a variety of methods to analyze your data. It has extensive coverage of statistical and machine learning techniques for classification, prediction, affinity analysis and data exploration and reduction.

Note:  A typical dataset will consist of rows and columns. The columns represent variables and the rows represent cases (also called records, individuals, objects, observations or patterns).  So the value in the second row, fourth column represents the value of the fourth variable for the second observation.

Important : If you were using an earlier version of XLMiner™ and uninstalled it, it is possible that you will get an error message on invoking Excel. This is a harmless message resulting from the earlier version's Add-In entry. In such a case, use the Instructions to clean up the add-in entry of the earlier version.

Components of XLMiner™:

Partition Data Standard partition, Partition with Oversampling
Data Utilities Sample from worksheet, Sample from database (Except in Education edition), Missing data handling, Bin continuous data, Transform categorical data
Time Series Partitioning, ARIMA, ACF(Autocorrelations), PACF (Partial Autocorrelations), Smoothing

Classification

Discriminant analysis, logistic regression, classification tree, naïve Bayes, neural networks (multilayer feedforward) and k-nearest neighbors

Prediction

Multiple linear regression, regression tree, neural networks (multilayer feedforward) and k-nearest neighbors

Affinity

Association rules

Data exploration and reduction

Principal component analysis, Hierarchical clustering, and k-means clustering

Charts Box plot, Histogram, Matrix plot

Comparison of different editions of XLMiner™

Standard

 

Education

 

Demo

 

# records supported

60000 60000 600
#columns supported 200 200 200
Partition Data Y Y Y
Data Utilities Y All utilities, except Sample from database. Y
Time Series N N N
Classification Y All components, except Scoring to database. Y
Prediction Y All components, except Scoring to database. Y
Affinity Y Y Y
Data exploration and reduction Y Y Y
Charts Y Y Y

System Requirements

Hardware requirements:

      Minimum
:

            Pentium 133 MHz processor.

            1 GB hard drive with a minimum of 50 MB free disk space.

            32 MB RAM 

Recommended:  

            Pentium 200 MHz processor (or higher).

            1 GB hard drive with minimum 60 MB free disk space.

            64 MB RAM (the more the better).

Software requirements:

       Operating system: Microsoft Windows NT 4.0 / 2000/ XP.

       Microsoft Excel 2000/XP (Please note XLMiner™ will not work on Microsoft Excel 97).

 

Installing XLMiner™

To install XLMiner™, insert the XLMiner™ CD and click on "setup.exe" file from Windows Explorer. XLMiner™ installation dialog boxes will guide you through the installation procedure. Default path of installation is <Drive>:\Program Files\XLMiner. It is assumed in this documentation that this is the path to which XLMiner™ was installed.

You can change this default path to specify a different location where you want XLMiner™ to be installed. After installing XLMiner™ , the XLMiner™ program group appears under Start --> Programs--> XLMiner.

 

Now, you can either invoke XLMiner™ directly or select the option to Register XLMiner™ as an Add-in.

 

Precautions:

  1. While installing XLMiner™ on Microsoft Windows NT, 2000 or XP, you may require administrative privileges. Once XLMiner™ is installed, you no longer need to have administrative privileges. Any user can use XLMiner™ .

  2. The folder in which XLMiner™ is installed and all its subfolders should have READ/ WRITE privileges for the software, irrespective of which user is logged in. 

  3. Do not use worksheet names with leading or trailing spaces. This leads to ambiguity as there might be more than one sheet names with one or more leading or trailing spaces. Using such names will lead to an error. However, sheet names with spaces embedded within (not leading or trailing) are allowed.

Notes:  

  • XLMiner™ appears  as an item in the menu after the first time registration as an Add-in. So you can use XLMiner™ thereafter by just invoking Excel. See the figure below. (Remember, the edition indicator along with the name XLMiner™ can be different depending upon the edition you have).

  • Alternatively, in Explorer, you can locate and double-click the file XLMiner.xla in the location where XLMiner™ has been installed, or,

  • XLMiner can also be invoked by the usual method - Start => Programs =>  Register XLMiner as an Add-in.

Running XLMiner™ Procedures
Open Excel and locate the XLMiner menu in the top toolbar and select it.. 

By clicking on the appropriate menu item, you can run any of XLMiner's procedures.

Entering the licence key

To enter the Licence key after it is sent to you through sales@xlminer.com,  invoke "About XLMiner" from the XLMiner™ menu.

Select "Enter Licence key". You see the following dialog.

Enter the user name and the key supplied to you (the best way to enter the key is copy it and paste so that there is no scope for an error). Select continue. XLMiner™ is all set for you for the set duration.

  

Data types supported in XLMiner™

Method

Output variable

Input variables

Continuous

Categorical

Continuous

Categorical

Ordinal

Nominal

Ordinal

Nominal

Multiple Linear Regression

Y

N

N

Y

Y

N

k-Nearest Neighbor Prediction

Y

N

N

Y

Y

N

Regression Tree

Y

N

N

Y

Y

N

Discriminant Analysis

N

Y

Y

Y

Y

N

Logistic Regression

N

Y (Dichotomous)

Y

Y

Y

N

Classification Tree

N

Y

Y

Y

Y

N

Naïve Bayes

N

Y

Y

N

Y

Y

Neural Networks Classification

N

Y

Y

Y

Y

N

Neural Networks Prediction Y N N Y Y N

k-Nearest Neighbor Classification

N

Y

Y

Y

Y

N

Association Rules

NA

NA

NA

N

Y (Binary Only)

Y

Principal Component Analysis

NA

NA

NA

Y

Y

N

k-Means Clustering

NA

NA

NA

Y

Y

N

Hierarchical Clustering

NA

NA

NA

Y

Y

N

Legend:    Y - Supported;    N - Not- Supported;      Ordinal : Ordinal includes binary (0/1) variables.

Note:

  • Ordinal means numerical and categorical. For example (0, 1, 2) is ordinal but (a, b, c) is not.
  • Dichotomous strictly means (0, 1).

Sample datasets with XLMiner™

XLMiner™ comes with a collection of sample datasets in the "Datasets" folder, which is in the base directory where XLMiner™ is installed. These datasets are Excel files. A description of some datasets is provided in the table below. XLMiner™ provides extensive online help with examples for each of the procedures. The sample datasets have been used in these examples.

Despite the small size of some of these datasets, they are still useful illustrations of data mining techniques.  In many cases, huge datasets can be effectively dealt with by sampling a portion of the data to build a model, then applying the conclusions from the model to the entire dataset.

Dataset Name

Filename 

Description

Associations Dataset  Associations.xls This dataset is a subset of the Charles Book Club Dataset mentioned below. It is used to study the Association rules. 
Boston Housing Dataset Boston_Housing.xls This dataset contains information collected by the US Census Bureau concerning housing in the area of Boston Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). The dataset has 506 cases.

The data were originally published by Harrison, D. and Rubinfeld, D.L. "Hedonic prices and the demand for clean air", J. Environ. Economics & Management, vol.5, 81-102, 1978. 

There are 14 variables:

CRIM     per capita crime rate by town 
ZN         proportion of residential land zoned for lots over 25,000 sq.ft. 
INDUS    proportion of non-retail business acres per town. 
CHAS     Charles River dummy variable (1 if tract bounds river; 0 otherwise) 
NOX       nitric oxides concentration (parts per 10 million) 
RM        average number of rooms per dwelling 
AGE       proportion of owner-occupied units built prior to 1940 
DIS       weighted distances to five Boston employment centers 
RAD      index of accessibility to radial highways 
TAX      full-value property-tax rate per $10,000 
PTRATIO pupil-teacher ratio by town 
B          1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town 
LSTAT   % lower status of the population 
MEDV    Median value of owner-occupied homes in $1000's
Charles book club  Charles_BookClub.xls The Charles Book Club (“CBC”) was established in December of 1986, on the premise that a book club could differentiate itself through a deep understanding of its customer base and by delivering uniquely tailored offerings. CBC focused on selling specialty books by direct marketing through a variety of channels, including media advertising (TV, magazines, newspapers) and mailing. CBC is strictly a distributor and does not publish any of the books that it sells. In line with its commitment to understanding its customer base, CBC built and maintained a detailed database of its club members. Upon enrollment, readers were required to fill out an insert and mail it to CBC. Through this process, CBC has created an active database of 500,000 readers. CBC acquired most of these customers through advertising in specialty magazines.

This Dataset is part of a case prepared by Ms. Vinni Bhandari, a data mining consultant and Dr. Nitin Patel, a visiting professor of Operations Research at the MIT Sloan School of Management. The case has been derived from a Case Study in Database Marketing titled 'BBB - The Bookbinders Club' prepared by Nissan Levin and Jacob Zahavi, Tel Aviv University for the Direct Marketing Educational Foundation, Inc. (March 1995).

The problem : CBC sent mailings to its club members each month containing its latest offering. 
On the surface, CBC looked like they were very successful, mailing volume was increasing, book selection was diversifying and growing, their customer database was increasing; however, their bottom line profits were falling. The decreasing profits led CBC to revisit their original plan of using database marketing to improve its mailing yields and to stay profitable.

Digits data

Digits.xls This dataset was used as an example to illustrate Classification trees in the book CART by Leo Breiman et al. The variables x1,...,x7 indicate the state of their corresponding LED's in a seven segment display. The variable y takes the desired value of the integer to be displayed. The data was generated such that the probability an LED is in the correct state (0 or 1) is 0.9. 

For full reference see Leo Breiman, Jerome Friedman, Charles J Stone, R.A. Olshen. Classification and Regression Trees. Chapman & Hall / CRC. 1984.

x1,...,x7: State of corresponding LED in a seven segment display.

y: Desired integer to be displayed.

Fisher's Iris data

Iris.xls The Iris dataset was introduced by R. A. Fisher as an example for discriminant analysis. The data report four characteristics (sepal width, sepal length, petal width and petal length) of three species of Iris flower.

The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Axonomic Problems," Annals of Eugenics 7, 179-188.

  • Species_No: Flower species as a code
  • Species_Name: Species name
  • Petal_Width: Petal Width
  • Petal_Length: Petal Length
  • Sepal_Width: Sepal Width
  • Sepal_Length: Sepal Length
Flying fitness of pilots Flying_Fitness.xls The data are on test results on flying fitness tests for 40 pilots (var 1). There are six other categorical variables (named var2,through var6) indicative of the performance of the pilots on various physical and psychological tests.

Output Variable: Var 1 
Input variables: Var 2 to Var 6 
No. of classes: 2 
Utilities Utilities.xls This dataset gives corporate data on 22 US public utilities. The goal is to form groups of similar utilities. The objects to be clustered are the utilities. There are 8 measurements on each utility described below. An example where clustering would be useful is a study to predict the cost impact of deregulation. To do the requisite analysis economists would need to build a detailed cost model for each firm. It would save a considerable amount of time and effort if we could cluster similar types of utility firms, build detailed cost models for just one ”typical” firm in each cluster, then scale up from these models to estimate results for all the utility firms. The objects to be clustered are the utility firms and there are 8 measurements on each utility.

This dataset can be found in Dean W. Wichern & Richard Arnold Johnson, Applied Multivariate Statistical Analysis, Prentice Hall, 5th edition (2002)

   X1:       Fixed-charge covering ratio (income/debt)
   X2:       Rate of return on capital
   X3:       Cost per KW capacity in place
   X4:       Annual Load Factor
   X5:       Peak KWH demand growth from 1974 to 1975
   X6:       Sales (KWH use per year)
   X7:       Percent Nuclear
   X8:       Total fuel costs (cents per KWH)

Wine dataset Wine.xls The wine dataset contains properties of wine taken from three different wineries in the same region. There are 13 variables describing various properties of wine and 3 classes. This dataset can be used for classification with Type as an output variable.  It can also be used (omitting the Type variable) to perform clustering. 

This dataset can be found in the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLSummary.html ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/)

DMEF Dataset Catalog_multi.xls A data set provided by Direct Marketing Educational Foundation, a response to direct mail offer. Several variables are used as information parameters about the customer and a binary output variable shows if the customer is a buyer.