Features

One key to KNIME's success is its inherently modular workflow approach, which documents and stores the analysis process in the order in which it was conceived and implemented, while ensuring that intermediate results are always available.

Core KNIME features include:

  • Scalability through sophisticated data handling (intelligent automatic caching of data in the background while maximizing throughput performance)
  • High degree of extensibility, made simple by a well-defined API for plugin extensions
  • Intuitive user interface
  • Import/export of workflows (for exchanging with other KNIME users)
  • Parallel execution on multi-core systems
  • Command line version for "headless" batch executions

Available KNIME modules cover a vast range of functionality, such as:

  • I/O: retrieves data from files or databases
  • Data Manipulation: pre-processes your input data with filtering, group-by, pivoting, binning, normalization, aggregation, joining, sampling, partitioning, etc.
  • Views: visualizes data and results through several interactive views, allowing for interactive data exploration
  • Hiliting: ensures hilited data points in one view are also immediately hilited in all other views
  • Mining: uses state-of-the-art data mining algorithms such as clustering, rule induction, decision trees, association rules, naïve Bayes, neural networks, support vector machines, etc. to better understand your data
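
To make two of the data manipulation steps named above more concrete, the following is a minimal Python sketch of min-max normalization and equal-width binning. It is an illustration only and not KNIME's implementation; the function names `min_max_normalize` and `equal_width_bins` and the sample column are assumptions made for this example.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Assign each value the 0-based index of its equal-width interval."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum lies on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


column = [2.0, 4.0, 6.0, 10.0]
print(min_max_normalize(column))    # [0.0, 0.25, 0.5, 1.0]
print(equal_width_bins(column, 4))  # [0, 1, 2, 3]
```

In a workflow tool these would be two separate steps, each producing an intermediate table that stays available for inspection.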

Supported Operating Systems

  • Windows - 32bit (regularly tested on XP and Vista)
  • Windows - 64bit (regularly tested on Vista and verified to work under Windows 7)
  • Linux - 32bit (regularly tested on RHEL4/5, OpenSUSE 10.2/10.3/11.0, amongst others)
  • Linux - 64bit (regularly tested on RHEL4/5, OpenSUSE 10.2/10.3/11.0, amongst others)
  • Mac OS X - 64bit Intel-based architecture with Java 1.6

Highlighted List of Nodes (Modules)

  • IO
    • Read
    • Write
      • CSV Writer - Saves a datatable into an ASCII file.
      • ARFF Writer - Writes data into a file in ARFF format.
      • Table Writer - Writes a data table to a file using an internal format.
      • PMML Writer - Reads a model from a PMML port and writes it into a PMML v4.0 compliant file.
      • Model Writer - Writes KNIME model port objects to a file.
      • XLS Writer - Saves a datatable into a spreadsheet.
    • Other
    • Cache - Caches all input data (rows) onto disk for fast access.
  • Database
    • Database Reader - Establishes and opens a database access connection to read a table from.
    • Database Connector - Creates a database connection to the specified database.
    • Database Looping - This node runs SQL queries in the connected database restricted by the possible values given by the input table.
    • Database Row Filter - Filters rows from a database table.
    • Database Query - Modifies the input SQL query from an incoming database connection.
    • Database Column Filter - Excludes columns from the input database table.
    • Database Connection Reader - Reads the entire data from the input database connection.
    • Database Connection Writer - Writes the input database table into a new database table.
    • Database Delete - Deletes the selected rows in the database based on the selected columns from the input table.
    • Database Update - Updates the selected rows in the database with the data values from the input tables.
    • Database Writer - Establishes and opens a database access connection to which the entire input table is written.
  • Data Manipulation
    • Column
      • Binning
        • Auto-Binner - Groups numeric data into intervals, called bins.
        • Auto-Binner (Apply) - Groups numeric data into intervals, called bins.
        • Numeric Binner - Groups the values of numeric columns into categories of string type.
        • Binner (Dictionary) - Categorizes values in a column according to a dictionary table with min/max values.
        • CAIM Binner - This node implements the CAIM discretization algorithm according to Kurgan and Cios (2004). The discretization is performed with respect to a selected class column.
        • CAIM Applier - Takes a binning (discretization) model and a data table as input and bins (discretizes) the columns of the input data according to the model.
      • Convert & Replace
      • Filter
        • Column Filter - The Column Filter allows columns to be excluded from the input table.
        • Low Variance Filter - Filters out numeric columns, which have a low variance.
        • Reference Column Filter - The Reference Column Filter allows columns to be filtered from the first table using the second table as reference.
      • Split & Combine
        • Cell Splitter - Splits the string representation of cells in one column of the table into separate columns or into one column containing a collection of cells, based on a specified delimiter.
        • Cell Splitter By Position - Splits cells in one column of the table at fixed positions into separate columns.
        • Column Aggregator - Groups the selected columns per row and aggregates their cells using the selected aggregation method.
        • Column Combiner - Combines the content of a set of columns and appends the concatenated string as separate column to the input table.
        • Column Merger - Merges two columns into one by choosing the cell that is non-missing.
        • Create Collection Column - Combines multiple columns into a new collection column.
        • Split Collection Column - Splits a collection column into its sub components, adding one new column for each.
        • Joiner - Joins two tables
        • Regex Split - Splits an input string (column) into multiple groups according to a regular expression.
        • Splitter - Splits the columns of the input table into two output tables.
        • Column to Grid - Breaks a selected column (or set of columns) into new columns, such that they align in a grid.
      • Transform
        • Case Converter - This node converts alphanumeric characters to lowercase or UPPERCASE.
        • Column Comparator - Compares the cell values of two columns row-wise using different comparison methods. A new column is appended with the result of the comparison.
        • Column Resorter - Reorders the columns based on user-defined settings.
        • Denormalizer - Denormalizes the attributes of a table according to a model.
        • Missing Value - Filters or replaces missing values in a table.
        • Normalizer - Normalizes the attributes of a table.
        • Normalizer (Apply) - Normalizes the attributes of a table according to a model.
        • One2Many - Transforms the values of one column into appended columns.
        • Many2One - Aggregates several columns into one single column.
        • SMOTE - Adds artificial data to improve the learning quality using the SMOTE algorithm
        • Set Operator - Performs a set operation on two selected table columns.
        • String Manipulation - Manipulates strings like search and replace, capitalize or remove leading and trailing white spaces.
        • Subset Matcher - The node matches all subsets of the first input table with all sets of the second input table.
      • HiLite Collector - Applies annotations to sets of hilited rows within a view.
    • Row
      • Filter
        • HiLite Filter - Partitions input rows based on their current hilite status.
        • Nominal Value Row Filter - Filters rows on nominal attribute value
        • Numeric Row Splitter - Splits the input data according to a given numeric range. The first output port contains the data that matches the criteria; the second contains the data that does not.
        • Reference Row Filter - The Reference Row Filter allows rows to be filtered from the first table using the second table as reference.
        • Row Filter - Allows filtering of data rows by certain criteria, such as row ID, attribute value, and row number range.
        • Row Splitter - Allows splitting of the input table by certain criteria, such as row ID, attribute value, and row number range.
      • Transform
        • Bitvector Generator - Generates bitvectors either from a table containing numerical values, or from a string column containing the bit positions to set, hexadecimal or binary strings.
        • Concatenate - Concatenates two tables row-wise.
        • Concatenate (Optional in) - Concatenates tables row-wise, inputs are optional.
        • GroupBy - Groups the table by the selected column(s) and aggregates the remaining columns using the selected aggregation method.
        • Ungroup - Creates, for each list of collection values, a list of rows with the values of the collection in one column and all other columns taken from the original row.
        • Partitioning - Splits table into two partitions.
        • Pivoting - Pivots and groups the input table by the selected columns for pivoting and grouping; enhanced by column aggregations.
        • Row Sampling - Extracts a sample (a bunch of rows) from the input data.
        • Equal Size Sampling - Removes rows from the input data set such that the values in a categorical column are equally distributed.
        • Shuffle - Shuffles the rows of the input tables.
        • Sorter - Sorts the rows according to user-defined criteria.
        • Unpivoting - Rotates the selected columns of the input table into rows while duplicating the remaining input columns by appending them to each corresponding output row.
      • Other
        • Add Empty Rows - Adds a certain number of empty rows with missing values or a given constant to the input table.
        • Extract Column Header - Creates new table with a single row containing the column names.
        • Insert Column Header - Updates column names of a table according to the mapping in a second dictionary table.
        • RowID - Replaces the RowID and/or creates a column with the values of the current RowID.
        • Rule Engine - Applies user-defined business rules to the input table
      • Key-Collection HiLite Translator - Translates hilite events from a row containing a collection cell with row keys to the original rows.
    • Matrix
      • Transpose - Transposes a table by swapping rows and columns.
    • PMML
      • Denormalizer - Denormalizes the attributes of a table reversing the information in the PMML model.
      • Normalizer - Normalizes the attributes of a table.
      • Normalizer (Apply) - Normalizes the attributes of a table according to a PMML model.
      • Number To String - Converts numbers in a column to strings.
      • Numeric Binner - Groups the values of numeric columns into categories of string type.
      • One2Many - Transforms the values of one column into appended columns.
      • String To Number - Converts strings in a column to numbers.
  • Data Views
    • Property
    • JFreeChart
    • Utility
    • Box Plot - A box plot displays robust statistical parameters for numerical attributes and identifies extreme outliers.
    • Conditional Box Plot - A box plot displays robust statistical parameters for numerical attributes and identifies extreme outliers. The conditional box plot partitions the data of one column into classes and creates a box plot for each of them.
    • Histogram - Displays data in a histogram view. Hiliting is not supported.
    • Histogram (interactive) - Displays data in an interactive histogram view with hiliting support.
    • Interactive Table - Displays data in a table view.
    • Lift Chart - Creates a lift chart
    • Line Plot - Plots the numeric columns as lines.
    • Parallel Coordinates - Plots the data in Parallel Coordinates.
    • Pie chart - Displays data in a pie chart. Hiliting is not supported.
    • Pie chart (interactive) - Displays data in an interactive pie chart with hiliting support.
    • Rule Viewer - This node visualizes a set of rules that are represented as a table containing numeric support, confidence, lift values and nominal values for the consequence and antecedence.
    • Scatter Matrix - Plots a scatter matrix where each column is compared to all others.
    • Scatter Plot - Creates a scatterplot of two selected attributes.
    • Spark Line Appender - Appends a column holding spark line plots based on the selected columns.
    • Radar Plot Appender - Creates radar plots for each row, summarizing selected doubles in this row
  • Statistics
  • Mining
    • Bayes
      • Naive Bayes Learner - Creates a naive Bayes model from the given classified data.
      • Naive Bayes Predictor - Uses the naive Bayes model from the naive Bayes learner to predict the class membership of each row in the input data.
    • Clustering
    • Rule Induction
      • Fuzzy Rules
    • Neural Network
    • Decision Tree
    • Misc Classifiers
      • K Nearest Neighbor - Classifies a set of test data based on the k Nearest Neighbor algorithm using the training data.
    • Ensemble Learning
      • Utility Nodes
      • Bagging - Bagging
      • Boosting Learner - Boosting Learner
      • Boosting Predictor - Boosting Predictor
      • Delegating - Delegating
    • Item Sets / Association Rules
      • Association Rule Learner - Searches for frequent itemsets with a certain minimum support in a set of transactions and optionally generates association rules with a predefined confidence value from them.
      • Association Rule Learner (Borgelt) - Provides different algorithms to search for frequent items in a list of item sets.
      • Bitvector Generator - Generates bitvectors either from a table containing numerical values, or from a string column containing the bit positions to set, hexadecimal or binary strings.
      • Item Set Finder (Borgelt) - Provides different algorithms to search for frequent items in a list of item sets.
      • Subset Matcher - The node matches all subsets of the first input table with all sets of the second input table.
    • MDS
      • MDS - Multidimensional scaling node that maps data from a high-dimensional space onto a lower-dimensional space by applying Sammon's mapping.
      • MDS Projection - Multidimensional scaling node that maps data from a high-dimensional space onto a lower-dimensional space by applying a modified Sammon's mapping with respect to a given set of fixed points.
    • PCA
      • PCA - Principal component analysis
      • PCA Compute - Principal component analysis computation
      • PCA Apply - Apply principal components projection
      • PCA Inversion - Inverts the PCA transformation
    • SVM
      • LIBSVM
        • LIBSVMLearner - LIBSVM is integrated software for support vector classification.
        • LIBSVMPredictor - Uses a trained LIBSVM model to predict the values for new data.
      • SVM Learner - Trains a support vector machine.
      • SVM Predictor - Uses an SVM model generated by the SVM Learner node to predict the output for given parameters.
    • Scoring
    • Meta
  • Chemistry
  • ChemAxon / Infocom
    • Marvin
      • MarvinSketch - MarvinSketch is a chemical structures editor tool.
      • MarvinView - MarvinView is a chemical structures visualization tool.
      • MarvinSpace - MarvinSpace is a 3D molecule visualization tool.
      • MolConverter - MolConverter converts between various data types.
  • Distance Matrix
    • Distance Matrix Reader - Reads triangular or full distance matrix.
    • Distance Matrix Writer - Writes column containing distance matrix to file.
    • Distance Matrix Calculate - Calculates distance matrix on input table and appends result as (typed) column.
    • k-Medoids - Performs k-Medoids algorithm.
    • Hierarchical Clustering (DistMatrix) - Performs Hierarchical Clustering on distance matrix input.
    • Hierarchical Cluster View - Shows the results of hierarchical clustering.
    • Hierarchical Cluster Assigner - Assigns clusters to rows based on a hierarchical clustering.
    • MDS (DistMatrix) - The multidimensional scaling node maps data from a high-dimensional space onto a lower-dimensional space by applying Sammon's mapping. The node uses the same algorithm as the regular MDS node, but instead of computing the distances of the original, high-dimensional data on demand, it uses the distances from the distance matrix column.
    • MDS Projection (DistMatrix) - Multidimensional scaling node that maps data from a high-dimensional space onto a lower-dimensional space by applying a modified Sammon's mapping with respect to a given set of fixed points. The distances in the original, high-dimensional space must be provided by distance vector columns.
    • Similarity Search - Similarity/Distance search in two data sets.
  • Meta
    • Feature Elimination - Backward Feature Elimination
    • Iterate List of Files - Iteratively executes the contained flow on a list of files. The list of files needs to be defined by the input table, whereby each row represents one individual file location.
    • Loop x-times - Executes the contained workflow multiple times. The aggregation method and termination criterion must be set using the loop start and end nodes contained in the workflow.
    • Variables Loop (Data)
    • Variables Loop (Database)
    • X-Validation - Provides a skeleton of nodes necessary for cross validation
  • Flow Control
  • Misc
  • KNIME Labs
  • Time Series
  • Quick Form
    • Boolean Input - Outputs an integer flow variable with a given value (boolean).
    • Column Filter QuickForm - Takes a data table and returns an empty data table with only the selected columns.
    • Column Selection QuickForm - Takes a data table and returns a variable with the selected column name.
    • Date (String) Input - Outputs a date in a string flow variable with a given value.
    • Double Input - Outputs a double-precision floating point variable with a given value.
    • Dummy Input - Placeholder quickform input node that allows the user to force a break in the wizard execution.
    • File Download - Provides a KNIME quick form with a downloadable file.
    • Image Output - Shows an image as (remote) quickform result.
    • File Upload - Quick Form node that allows uploading a file and exposing that uploaded file using a flow variable.
    • Integer Input - Outputs an integer flow variable with a given value.
    • List Box Input - Outputs a data table with one column holding a list of strings.
    • Molecule String Input - Outputs a molecule string in the specified format.
    • Multi Selection QuickForm - Defines a list of options and allows multiple values to be selected, which are returned in a table and as a comma-separated variable.
    • String Input - Outputs a string flow variable with a given value.
    • String Radio Buttons - Outputs a string flow variable with a given value.
    • TextArea Output - Displays dynamic text in the web portal.
    • Value Filter QuickForm - Takes a data table and returns a table with one column containing the selected domain values.
    • Value Selection QuickForm - Takes a data table and a selected column and returns a variable with the selected value from this column.
    • Variable Output - Provides the value of a selected variable to a remote quick form.
  • R
    • Local
      • R Learner - Executes R commands in a local R installation and builds an R model.
      • R Predictor - Imports an R model and uses it to predict the given data.
      • R R-View - Enables the usage of R views using the local R installation.
      • R Snippet - Allows execution of R commands in a local R installation.
      • R To PMML - Converts a given R object into a corresponding PMML object.
      • R to Table - Converts an R object into a KNIME data table.
      • R+Table to R - Merges an optional data branch into an R workspace.
      • Table R-View - Enables the usage of R views using the local R installation.
      • Table to R - Converts a KNIME data table into an R object.
    • Remote
      • R Snippet (Remote) - Allows execution of R commands on an R server. The result of these R commands is returned in the output table of this node. The final result tables' columns are named R1, R2, and so on.
      • R View (Remote) - Enables the usage of R views generated on an R server.
    • IO
  • Reporting
    • Table Writer
      • Table to HTML - Generates HTML reports out of input data by using the Birt reporting engine.
      • Table to PDF - Generates PDF reports out of input data by using the Birt reporting engine.
    • Data to Report - Provides the incoming data to the KNIME Report Designer.
    • Image to Report - Provides the incoming image to the KNIME Report Designer.
  • Testing
  • Weka
    • Classification Algorithms
      • bayes
        • AODE - AODE achieves highly accurate classification by averaging over all of a small space of alternative naive-Bayes-like models that have weaker (and hence less detrimental) independence assumptions than naive Bayes. The resulting algorithm is computationally efficient while delivering highly accurate classification on many learning tasks. For more information, see G. Webb, J. Boughton, Z. Wang (2005). Not So Naive Bayes: Aggregating One-Dependence Estimators. Machine Learning. 58(1):5-24. Further papers are available at http://www.csse.monash.edu.au/~webb/. Can use an m-estimate for smoothing base probability estimates in place of the Laplace correction (via option -M). Default frequency limit set to 1.
        • AODEsr - AODEsr augments AODE with Subsumption Resolution. AODEsr detects specializations between two attribute values at classification time and deletes the generalization attribute value. For more information, see: Fei Zheng, Geoffrey I. Webb: Efficient Lazy Elimination for Averaged-One Dependence Estimators. In: Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), 1113-1120, 2006.
        • BayesNet - Bayes Network learning using various search algorithms and quality measures. Base class for a Bayes Network classifier. Provides datastructures (network structure, conditional probability distributions, etc.) and facilities common to Bayes Network learning algorithms like K2 and B. For more information see: http://www.cs.waikato.ac.nz/~remco/weka.bn.pdf
        • BayesianLogisticRegression - Implements Bayesian Logistic Regression for both Gaussian and Laplace Priors. For more information, see Alexander Genkin, David D. Lewis, David Madigan (2004). Large-scale bayesian logistic regression for text categorization. URL http://www.stat.rutgers.edu/~madigan/PAPERS/shortFat-v3a.pdf.
        • ComplementNaiveBayes - Class for building and using a Complement class Naive Bayes classifier. For more information see, Jason D. Rennie, Lawrence Shih, Jaime Teevan, David R. Karger: Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: ICML, 616-623, 2003. P.S.: TF, IDF and length normalization transforms, as described in the paper, can be performed through weka.filters.unsupervised.StringToWordVector.
        • DMNBtext - Class for building and using a Discriminative Multinomial Naive Bayes classifier. For more information, see Jiang Su, Harry Zhang, Charles X. Ling, Stan Matwin: Discriminative Parameter Learning for Bayesian Networks. In: ICML 2008, 2008. The core equation for this classifier: P[Ci|D] = (P[D|Ci] x P[Ci]) / P[D] (Bayes rule) where Ci is class i and D is a document.
        • HNB - Constructs a Hidden Naive Bayes classification model with high classification accuracy and AUC. For more information refer to: H. Zhang, L. Jiang, J. Su: Hidden Naive Bayes. In: Twentieth National Conference on Artificial Intelligence, 919-924, 2005.
        • NaiveBayes - Class for a Naive Bayes classifier using estimator classes. Numeric estimator precision values are chosen based on analysis of the training data. For this reason, the classifier is not an UpdateableClassifier (which in typical usage are initialized with zero training instances) -- if you need the UpdateableClassifier functionality, use the NaiveBayesUpdateable classifier. The NaiveBayesUpdateable classifier will use a default precision of 0.1 for numeric attributes when buildClassifier is called with zero training instances. For more information on Naive Bayes classifiers, see George H. John, Pat Langley: Estimating Continuous Distributions in Bayesian Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, 338-345, 1995.
        • NaiveBayesMultinomial - Class for building and using a multinomial Naive Bayes classifier. For more information see, Andrew Mccallum, Kamal Nigam: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI-98 Workshop on 'Learning for Text Categorization', 1998. The core equation for this classifier: P[Ci|D] = (P[D|Ci] x P[Ci]) / P[D] (Bayes rule) where Ci is class i and D is a document.
        • NaiveBayesMultinomialUpdateable - Class for building and using a multinomial Naive Bayes classifier. For more information see, Andrew Mccallum, Kamal Nigam: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI-98 Workshop on 'Learning for Text Categorization', 1998. The core equation for this classifier: P[Ci|D] = (P[D|Ci] x P[Ci]) / P[D] (Bayes rule) where Ci is class i and D is a document. Incremental version of the algorithm.
        • NaiveBayesSimple - Class for building and using a simple Naive Bayes classifier. Numeric attributes are modelled by a normal distribution. For more information, see Richard Duda, Peter Hart (1973). Pattern Classification and Scene Analysis. Wiley, New York.
        • NaiveBayesUpdateable - Class for a Naive Bayes classifier using estimator classes. This is the updateable version of NaiveBayes. This classifier will use a default precision of 0.1 for numeric attributes when buildClassifier is called with zero training instances. For more information on Naive Bayes classifiers, see George H. John, Pat Langley: Estimating Continuous Distributions in Bayesian Classifiers. In: Eleventh Conference on Uncertainty in Artificial Intelligence, San Mateo, 338-345, 1995.
        • WAODE - WAODE constructs the model called Weightily Averaged One-Dependence Estimators. For more information, see L. Jiang, H. Zhang: Weightily Averaged One-Dependence Estimators. In: Proceedings of the 9th Biennial Pacific Rim International Conference on Artificial Intelligence, PRICAI 2006, 970-974, 2006.
      • functions
        • GaussianProcesses - Implements Gaussian Processes for regression without hyperparameter-tuning. For more information see David J.C. Mackay (1998). Introduction to Gaussian Processes. Dept. of Physics, Cambridge University, UK.
        • IsotonicRegression - Learns an isotonic regression model. Picks the attribute that results in the lowest squared error. Missing values are not allowed. Can only deal with numeric attributes. Considers the monotonically increasing case as well as the monotonically decreasing case.
        • LeastMedSq - Implements a least median squared linear regression utilising the existing weka LinearRegression class to form predictions. Least squared regression functions are generated from random subsamples of the data. The least squared regression with the lowest median squared error is chosen as the final model. The basis of the algorithm is Peter J. Rousseeuw, Annick M. Leroy (1987). Robust regression and outlier detection.
        • LibLINEAR - A wrapper class for the liblinear tools (the liblinear classes, typically the jar file, need to be in the classpath to use this classifier). Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin (2008). LIBLINEAR - A Library for Large Linear Classification. URL http://www.csie.ntu.edu.tw/~cjlin/liblinear/.
        • LibSVM - A wrapper class for the libsvm tools (the libsvm classes, typically the jar file, need to be in the classpath to use this classifier). LibSVM runs faster than SMO since it uses LibSVM to build the SVM classifier. LibSVM allows users to experiment with One-class SVM, Regressing SVM, and nu-SVM supported by the LibSVM tool. LibSVM reports many useful statistics about the LibSVM classifier (e.g., confusion matrix, precision, recall, ROC score, etc.). Yasser EL-Manzalawy (2005). WLSVM. URL http://www.cs.iastate.edu/~yasser/wlsvm/. Chih-Chung Chang, Chih-Jen Lin (2001). LIBSVM - A Library for Support Vector Machines. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
        • LinearRegression - Class for using linear regression for prediction. Uses the Akaike criterion for model selection, and is able to deal with weighted instances.
        • Logistic - Class for building and using a multinomial logistic regression model with a ridge estimator. There are some modifications, however, compared to the paper of le Cessie and van Houwelingen (1992): If there are k classes for n instances with m attributes, the parameter matrix B to be calculated will be an m*(k-1) matrix. The probability for class j with the exception of the last class is Pj(Xi) = exp(Xi*Bj)/((sum[j=1..(k-1)]exp(Xi*Bj))+1). The last class has probability 1-(sum[j=1..(k-1)]Pj(Xi)) = 1/((sum[j=1..(k-1)]exp(Xi*Bj))+1). The (negative) multinomial log-likelihood is thus: L = -sum[i=1..n]{ sum[j=1..(k-1)](Yij * ln(Pj(Xi))) +(1 - (sum[j=1..(k-1)]Yij)) * ln(1 - sum[j=1..(k-1)]Pj(Xi)) } + ridge * (B^2). In order to find the matrix B for which L is minimised, a Quasi-Newton Method is used to search for the optimized values of the m*(k-1) variables. Note that before we use the optimization procedure, we 'squeeze' the matrix B into an m*(k-1) vector. For details of the optimization procedure, please check weka.core.Optimization class. Although original Logistic Regression does not deal with instance weights, we modify the algorithm a little bit to handle the instance weights. For more information see: le Cessie, S., van Houwelingen, J.C. (1992). Ridge Estimators in Logistic Regression. Applied Statistics. 41(1):191-201. Note: Missing values are replaced using a ReplaceMissingValuesFilter, and nominal attributes are transformed into numeric attributes using a NominalToBinaryFilter.
        • MultilayerPerceptron - A Classifier that uses backpropagation to classify instances. This network can be built by hand, created by an algorithm or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid (except for when the class is numeric in which case the output nodes become unthresholded linear units).
        • PLSClassifier - A wrapper classifier for the PLSFilter, utilizing the PLSFilter's ability to perform predictions.
        • PaceRegression - Class for building pace regression linear models and using them for prediction. Under regularity conditions, pace regression is provably optimal when the number of coefficients tends to infinity. It consists of a group of estimators that are either overall optimal or optimal under certain conditions. The current work of the pace regression theory, and therefore also this implementation, do not handle missing values, non-binary nominal attributes, or the case that n - k is small, where n is the number of instances and k is the number of coefficients (the threshold used in this implementation is 20). For more information see: Wang, Y (2000). A new approach to fitting linear models in high dimensional spaces. Hamilton, New Zealand. Wang, Y., Witten, I. H.: Modeling for optimal probability prediction. In: Proceedings of the Nineteenth International Conference in Machine Learning, Sydney, Australia, 650-657, 2002.
        • RBFNetwork - Class that implements a normalized Gaussian radial basis function network. It uses the k-means clustering algorithm to provide the basis functions and learns either a logistic regression (discrete class problems) or linear regression (numeric class problems) on top of that. Symmetric multivariate Gaussians are fit to the data from each cluster. If the class is nominal it uses the given number of clusters per class. It standardizes all numeric attributes to zero mean and unit variance.
        • SMO - Implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default. (In that case the coefficients in the output are based on the normalized data, not the original data --- this is important for interpreting the classifier.) Multi-class problems are solved using pairwise classification (1-vs-1 and if logistic models are built pairwise coupling according to Hastie and Tibshirani, 1998). To obtain proper probability estimates, use the option that fits logistic regression models to the outputs of the support vector machine. In the multi-class case the predicted probabilities are coupled using Hastie and Tibshirani's pairwise coupling method. Note: for improved speed normalization should be turned off when operating on SparseInstances. For more information on the SMO algorithm, see J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1998. S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy (2001). Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation. 13(3):637-649. Trevor Hastie, Robert Tibshirani: Classification by Pairwise Coupling. In: Advances in Neural Information Processing Systems, 1998.
        • SMOreg - SMOreg implements the support vector machine for regression. The parameters can be learned using various algorithms. The algorithm is selected by setting the RegOptimizer. The most popular algorithm (RegSMOImproved) is due to Shevade, Keerthi et al and this is the default RegOptimizer. For more information see: S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, K.R.K. Murthy: Improvements to the SMO Algorithm for SVM Regression. In: IEEE Transactions on Neural Networks, 1999. A.J. Smola, B. Schoelkopf (1998). A tutorial on support vector regression.
        • SPegasos - Implements the stochastic variant of the Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) method of Shalev-Shwartz et al. (2007). This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes, so the coefficients in the output are based on the normalized data. For more information, see S. Shalev-Shwartz, Y. Singer, N. Srebro: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: 24th International Conference on Machine Learning, 807-814, 2007.
        • SimpleLinearRegression - Learns a simple linear regression model. Picks the attribute that results in the lowest squared error. Missing values are not allowed. Can only deal with numeric attributes.
        • SimpleLogistic - Classifier for building linear logistic regression models. LogitBoost with simple regression functions as base learners is used for fitting the logistic models. The optimal number of LogitBoost iterations to perform is cross-validated, which leads to automatic attribute selection. For more information see: Niels Landwehr, Mark Hall, Eibe Frank (2005). Logistic Model Trees. Marc Sumner, Eibe Frank, Mark Hall: Speeding up Logistic Model Tree Induction. In: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 675-683, 2005.
        • VotedPerceptron - Implementation of the voted perceptron algorithm by Freund and Schapire. Globally replaces all missing values, and transforms nominal attributes into binary ones. For more information, see: Y. Freund, R. E. Schapire: Large margin classification using the perceptron algorithm. In: 11th Annual Conference on Computational Learning Theory, New York, NY, 209-217, 1998.
        • Winnow - Implements Winnow and Balanced Winnow algorithms by Littlestone. Does classification for problems with nominal attributes (which it converts into binary attributes). For more information, see N. Littlestone (1988). Learning quickly when irrelevant attributes abound: A new linear threshold algorithm. Machine Learning. 2:285-318. N. Littlestone (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. University of California, Santa Cruz.
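The class-probability and log-likelihood formulas quoted in the Logistic entry can be sketched in a few lines of Python. This is only an illustration of the formulas as written, not Weka's implementation: the function names are hypothetical, B is stored as a list of k-1 coefficient vectors, the class label is an index whose last value is the reference class, and the penalty term "ridge * (B^2)" is assumed to mean the sum of squared coefficients.

```python
import math

def class_probabilities(x, B):
    """Return the k class probabilities for one instance x.

    x is a list of m attribute values; B is a list of k-1 coefficient
    vectors (the columns of the m*(k-1) matrix). The last class is the
    reference class with probability 1 / (sum_j exp(x*Bj) + 1).
    """
    scores = [math.exp(sum(xi * bi for xi, bi in zip(x, b_j))) for b_j in B]
    denom = sum(scores) + 1.0
    probs = [s / denom for s in scores]
    probs.append(1.0 / denom)  # probability of the last class
    return probs

def neg_log_likelihood(X, y, B, ridge):
    """Negative multinomial log-likelihood plus a ridge penalty.

    y[i] is a class index in 0..k-1. With one-hot Yij, the inner sum of
    the quoted formula reduces to ln of the probability of the true class.
    """
    total = 0.0
    for x, label in zip(X, y):
        total -= math.log(class_probabilities(x, B)[label])
    # Assumption: B^2 is read as the sum of squared entries of B.
    penalty = ridge * sum(b * b for b_j in B for b in b_j)
    return total + penalty
```

With all coefficients set to zero every class receives the same probability, which is a quick sanity check on the formula. The sketch does not guard against overflow in exp for large scores.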
      • lazy
        • IB1 - Nearest-neighbour classifier. Uses normalized Euclidean distance to find the training instance closest to the given test instance, and predicts the same class as this training instance. If multiple instances have the same (smallest) distance to the test instance, the first one found is used. For more information, see D. Aha, D. Kibler (1991). Instance-based learning algorithms. Machine Learning. 6:37-66.
        • IBk - K-nearest neighbours classifier. Can select appropriate value of K based on cross-validation. Can also do distance weighting. For more information, see D. Aha, D. Kibler (1991). Instance-based learning algorithms. Machine Learning. 6:37-66.
        • KStar - K* is an instance-based classifier; that is, the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function. For more information on K*, see John G. Cleary, Leonard E. Trigg: K*: An Instance-based Learner Using an Entropic Distance Measure. In: 12th International Conference on Machine Learning, 108-114, 1995.
        • LBR - Lazy Bayesian Rules Classifier. The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. Lazy Bayesian Rules selectively relaxes the independence assumption, achieving lower error rates over a range of learning tasks. LBR defers processing to classification time, making it a highly efficient and accurate classification algorithm when small numbers of objects are to be classified. For more information, see: Zijian Zheng, G. Webb (2000). Lazy Learning of Bayesian Rules. Machine Learning. 4(1):53-84.
        • LWL - Locally weighted learning. Uses an instance-based algorithm to assign instance weights which are then used by a specified WeightedInstancesHandler. Can do classification (e.g. using naive Bayes) or regression (e.g. using linear regression). For more info, see Eibe Frank, Mark Hall, Bernhard Pfahringer: Locally Weighted Naive Bayes. In: 19th Conference in Uncertainty in Artificial Intelligence, 249-256, 2003. C. Atkeson, A. Moore, S. Schaal (1996). Locally weighted learning. AI Review.
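The behaviour described for IB1 (normalized Euclidean distance, first instance wins on ties) can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the Weka code: the function name is hypothetical, all attributes are taken to be numeric, and "normalized" is assumed to mean min-max scaling to [0, 1] using the training data.

```python
import math

def ib1_predict(train_X, train_y, query):
    """Predict the class of the single nearest training instance."""
    m = len(query)
    # Per-attribute range observed in the training data (assumed scaling).
    lo = [min(row[a] for row in train_X) for a in range(m)]
    hi = [max(row[a] for row in train_X) for a in range(m)]

    def scale(v, a):
        span = hi[a] - lo[a]
        return (v - lo[a]) / span if span > 0 else 0.0

    best_label, best_dist = None, math.inf
    for row, label in zip(train_X, train_y):
        d = math.sqrt(sum((scale(row[a], a) - scale(query[a], a)) ** 2
                          for a in range(m)))
        # Strict comparison: on equal distance the first instance found is kept.
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

Scaling matters here because an attribute with a large numeric range would otherwise dominate the distance.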
      • meta
        • nestedDichotomies
          • ClassBalancedND - A meta classifier for handling multi-class datasets with 2-class classifiers by building a random class-balanced tree structure. For more info, check Lin Dong, Eibe Frank, Stefan Kramer: Ensembles of Balanced Nested Dichotomies for Multi-class Problems. In: PKDD, 84-95, 2005. Eibe Frank, Stefan Kramer: Ensembles of nested dichotomies for multi-class problems. In: Twenty-first International Conference on Machine Learning, 2004.
          • DataNearBalancedND - A meta classifier for handling multi-class datasets with 2-class classifiers by building a random data-balanced tree structure. For more info, check Lin Dong, Eibe Frank, Stefan Kramer: Ensembles of Balanced Nested Dichotomies for Multi-class Problems. In: PKDD, 84-95, 2005. Eibe Frank, Stefan Kramer: Ensembles of nested dichotomies for multi-class problems. In: Twenty-first International Conference on Machine Learning, 2004.
          • ND - A meta classifier for handling multi-class datasets with 2-class classifiers by building a random tree structure. For more info, check Lin Dong, Eibe Frank, Stefan Kramer: Ensembles of Balanced Nested Dichotomies for Multi-class Problems. In: PKDD, 84-95, 2005. Eibe Frank, Stefan Kramer: Ensembles of nested dichotomies for multi-class problems. In: Twenty-first International Conference on Machine Learning, 2004.
        • AdaBoostM1 - Class for boosting a nominal class classifier using the Adaboost M1 method. Only nominal class problems can be tackled. Often dramatically improves performance, but sometimes overfits. For more information, see Yoav Freund, Robert E. Schapire: Experiments with a new boosting algorithm. In: Thirteenth International Conference on Machine Learning, San Francisco, 148-156, 1996.
        • AdditiveRegression - Meta classifier that enhances the performance of a regression base classifier. Each iteration fits a model to the residuals left by the classifier on the previous iteration. Prediction is accomplished by adding the predictions of each classifier. Reducing the shrinkage (learning rate) parameter helps prevent overfitting and has a smoothing effect but increases the learning time. For more information see: J.H. Friedman (1999). Stochastic Gradient Boosting.
        • AttributeSelectedClassifier - Dimensionality of training and test data is reduced by attribute selection before being passed on to a classifier.
        • Bagging - Class for bagging a classifier to reduce variance. Can do classification and regression depending on the base learner. For more information, see Leo Breiman (1996). Bagging predictors. Machine Learning. 24(2):123-140.
        • CVParameterSelection - Class for performing parameter selection by cross-validation for any classifier. For more information, see: R. Kohavi (1995). Wrappers for Performance Enhancement and Oblivious Decision Graphs. Department of Computer Science, Stanford University.
        • ClassificationViaClustering - A simple meta-classifier that uses a clusterer for classification. For cluster algorithms that use a fixed number of clusters, like SimpleKMeans, the user has to make sure that the number of clusters to generate is the same as the number of class labels in the dataset in order to obtain a useful model. Note: at prediction time, a missing value is returned if no cluster is found for the instance. The code is based on the 'clusters to classes' functionality of the weka.clusterers.ClusterEvaluation class by Mark Hall.
        • ClassificationViaRegression - Class for doing classification using regression methods. Class is binarized and one regression model is built for each class value. For more information, see, for example E. Frank, Y. Wang, S. Inglis, G. Holmes, I.H. Witten (1998). Using model trees for classification. Machine Learning. 32(1):63-76.
        • CostSensitiveClassifier - A metaclassifier that makes its base classifier cost-sensitive. Two methods can be used to introduce cost-sensitivity: reweighting training instances according to the total cost assigned to each class; or predicting the class with minimum expected misclassification cost (rather than the most likely class). Performance can often be improved by using a Bagged classifier to improve the probability estimates of the base classifier.
        • Dagging - This meta classifier creates a number of disjoint, stratified folds out of the data and feeds each chunk of data to a copy of the supplied base classifier. Predictions are made via averaging, since all the generated base classifiers are put into the Vote meta classifier. Useful for base classifiers that are quadratic or worse in time behavior, regarding number of instances in the training data. For more information, see: Ting, K. M., Witten, I. H.: Stacking Bagged and Dagged Models. In: Fourteenth international Conference on Machine Learning, San Francisco, CA, 367-375, 1997.
        • Decorate - DECORATE is a meta-learner for building diverse ensembles of classifiers by using specially constructed artificial training examples. Comprehensive experiments have demonstrated that this technique is consistently more accurate than the base classifier, Bagging and Random Forests. Decorate also obtains higher accuracy than Boosting on small training sets, and achieves comparable performance on larger training sets. For more details see: P. Melville, R. J. Mooney: Constructing Diverse Classifier Ensembles Using Artificial Training Examples. In: Eighteenth International Joint Conference on Artificial Intelligence, 505-510, 2003. P. Melville, R. J. Mooney (2004). Creating Diversity in Ensembles Using Artificial Data. Information Fusion: Special Issue on Diversity in Multiclassifier Systems.
        • END - A meta classifier for handling multi-class datasets with 2-class classifiers by building an ensemble of nested dichotomies. For more info, check Lin Dong, Eibe Frank, Stefan Kramer: Ensembles of Balanced Nested Dichotomies for Multi-class Problems. In: PKDD, 84-95, 2005. Eibe Frank, Stefan Kramer: Ensembles of nested dichotomies for multi-class problems. In: Twenty-first International Conference on Machine Learning, 2004.
        • FilteredClassifier - Class for running an arbitrary classifier on data that has been passed through an arbitrary filter. Like the classifier, the structure of the filter is based exclusively on the training data and test instances will be processed by the filter without changing their structure.
        • Grading - Implements Grading. The base classifiers are "graded". For more information, see A.K. Seewald, J. Fuernkranz: An Evaluation of Grading Classifiers. In: Advances in Intelligent Data Analysis: 4th International Conference, Berlin/Heidelberg/New York/Tokyo, 115-124, 2001.
        • GridSearch - Performs a grid search of parameter pairs for a classifier (Y-axis, default is LinearRegression with the "Ridge" parameter) and the PLSFilter (X-axis, "# of Components") and chooses the best pair found for the actual predicting. The initial grid is worked on with 2-fold CV to determine the values of the parameter pairs for the selected type of evaluation (e.g., accuracy). The best point in the grid is then taken and a 10-fold CV is performed with the adjacent parameter pairs. If a better pair is found, then this will act as the new center and another 10-fold CV will be performed (a kind of hill-climbing). This process is repeated until no better pair is found or the best pair is on the border of the grid. In case the best pair is on the border, one can let GridSearch automatically extend the grid and continue the search. Check out the properties 'gridIsExtendable' (option '-extend-grid') and 'maxGridExtensions' (option '-max-grid-extensions '). GridSearch can handle doubles, integers (values are just cast to int) and booleans (0 is false, otherwise true). float, char and long are supported as well. The best filter/classifier setup can be accessed after the buildClassifier call via the getBestFilter/getBestClassifier methods. Note on the implementation: after the data has been passed through the filter, a default NumericCleaner filter is applied to the data in order to avoid numbers that are getting too small and might produce NaNs in other schemes.
        • LogitBoost - Class for performing additive logistic regression. This class performs classification using a regression scheme as the base learner, and can handle multi-class problems. For more information, see J. Friedman, T. Hastie, R. Tibshirani (1998). Additive Logistic Regression: a Statistical View of Boosting. Stanford University. Can do efficient internal cross-validation to determine appropriate number of iterations.
        • MetaCost - This metaclassifier makes its base classifier cost-sensitive using the method specified in Pedro Domingos: MetaCost: A general method for making classifiers cost-sensitive. In: Fifth International Conference on Knowledge Discovery and Data Mining, 155-164, 1999. This classifier should produce similar results to one created by passing the base learner to Bagging, which is in turn passed to a CostSensitiveClassifier operating on minimum expected cost. The difference is that MetaCost produces a single cost-sensitive classifier of the base learner, giving the benefits of fast classification and interpretable output (if the base learner itself is interpretable). This implementation uses all bagging iterations when reclassifying training data (the MetaCost paper reports a marginal improvement when only those iterations containing each training instance are used in reclassifying that instance).
        • MultiBoostAB - Class for boosting a classifier using the MultiBoosting method. MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, Multi-boosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution. For more information, see Geoffrey I. Webb (2000). MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning. Vol.40(No.2).
        • MultiClassClassifier - A metaclassifier for handling multi-class datasets with 2-class classifiers. This classifier is also capable of applying error correcting output codes for increased accuracy.
        • MultiScheme - Class for selecting a classifier from among several using cross validation on the training data or the performance on the training data. Performance is measured based on percent correct (classification) or mean-squared error (regression).
        • OrdinalClassClassifier - Meta classifier that allows standard classification algorithms to be applied to ordinal class problems. For more information see: Eibe Frank, Mark Hall: A Simple Approach to Ordinal Classification. In: 12th European Conference on Machine Learning, 145-156, 2001.
        • RacedIncrementalLogitBoost - Classifier for incremental learning of large datasets by way of racing logit-boosted committees. For more information see: Eibe Frank, Geoffrey Holmes, Richard Kirkby, Mark Hall: Racing committees for large datasets. In: Proceedings of the 5th International Conference on Discovery Science, 153-164, 2002.
        • RandomCommittee - Class for building an ensemble of randomizable base classifiers. Each base classifier is built using a different random number seed (but based on the same data). The final prediction is a straight average of the predictions generated by the individual base classifiers.
        • RandomSubSpace - This method constructs a decision tree based classifier that maintains highest accuracy on training data and improves on generalization accuracy as it grows in complexity. The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces. For more information, see Tin Kam Ho (1998). The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(8):832-844. URL http://citeseer.ist.psu.edu/ho98random.html.
        • RegressionByDiscretization - A regression scheme that employs any classifier on a copy of the data that has the class attribute (equal-width) discretized. The predicted value is the expected value of the mean class value for each discretized interval (based on the predicted probabilities for each interval).
        • RotationForest - Class for constructing a Rotation Forest. Can do classification and regression depending on the base learner. For more information, see Juan J. Rodriguez, Ludmila I. Kuncheva, Carlos J. Alonso (2006). Rotation Forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence. 28(10):1619-1630. URL http://doi.ieeecomputersociety.org/10.1109/TPAMI.2006.211.
        • Stacking - Combines several classifiers using the stacking method. Can do classification or regression. For more information, see David H. Wolpert (1992). Stacked generalization. Neural Networks. 5:241-259.
        • StackingC - Implements StackingC (more efficient version of stacking). For more information, see A.K. Seewald: How to Make Stacking Better and Faster While Also Taking Care of an Unknown Weakness. In: Nineteenth International Conference on Machine Learning, 554-561, 2002. Note: requires meta classifier to be a numeric prediction scheme.
        • ThresholdSelector - A metaclassifier that selects a mid-point threshold on the probability output by a Classifier. The midpoint threshold is set so that a given performance measure is optimized. Currently this is the F-measure. Performance is measured either on the training data, a hold-out set or using cross-validation. In addition, the probabilities returned by the base learner can have their range expanded so that the output probabilities will reside between 0 and 1 (this is useful if the scheme normally produces probabilities in a very narrow range).
        • Vote - Class for combining classifiers. Different combinations of probability estimates for classification are available. For more information see: Ludmila I. Kuncheva (2004). Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons, Inc.. J. Kittler, M. Hatef, Robert P.W. Duin, J. Matas (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(3):226-239.
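The residual-fitting loop described for AdditiveRegression can be sketched as follows. This is a minimal Python illustration, not the Weka class: the function names are hypothetical, the base learner is assumed to be a one-attribute least-squares line, and each model's prediction is assumed to be scaled by the shrinkage factor both when updating residuals and when predicting.

```python
def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
             if sxx > 0 else 0.0)
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def additive_regression(xs, ys, iterations=10, shrinkage=0.5):
    """Fit each new model to the residuals left by the previous ones."""
    models = []
    residuals = list(ys)
    for _ in range(iterations):
        model = fit_line(xs, residuals)
        models.append(model)
        # Only a shrunken share of each fit is removed from the residuals.
        residuals = [r - shrinkage * model(x) for x, r in zip(xs, residuals)]
    # Prediction adds up the (shrunken) predictions of all models.
    return lambda x: sum(shrinkage * m(x) for m in models)
```

With a shrinkage of 0.5 each iteration removes half of the remaining error on this kind of data, which illustrates why a smaller shrinkage value needs more iterations to reach the same fit.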
      • misc
        • HyperPipes - Class implementing a HyperPipe classifier. For each category a HyperPipe is constructed that contains all points of that category (essentially records the attribute bounds observed for each category). Test instances are classified according to the category that "most contains the instance". Does not handle numeric class, or missing values in test cases. Extremely simple algorithm, but has the advantage of being extremely fast, and works quite well when you have "smegloads" of attributes.
        • SerializedClassifier - A wrapper around a serialized classifier model. This classifier loads a serialized model and uses it to make predictions. Warning: since the serialized model doesn't get changed, cross-validation cannot be used with this classifier.
        • VFI - Classification by voting feature intervals. Intervals are constructed around each class for each attribute (basically discretization). Class counts are recorded for each interval on each attribute. Classification is by voting. For more info see: G. Demiroz, A. Guvenir: Classification by voting feature intervals. In: 9th European Conference on Machine Learning, 85-92, 1997. A simple attribute weighting scheme has been added. Higher weight is assigned to more confident intervals, where confidence is a function of entropy: weight (att_i) = (entropy of class distrib att_i / max uncertainty)^-bias.
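The HyperPipes idea (record the attribute bounds seen for each category, then pick the category that "most contains" the test instance) is simple enough to sketch in Python. The function names are hypothetical, only numeric attributes are handled, and "most contains" is assumed here to mean the largest fraction of attribute values that fall inside a category's recorded bounds.

```python
def fit_hyperpipes(X, y):
    """Record [min, max] per attribute for every class label."""
    pipes = {}
    for row, label in zip(X, y):
        bounds = pipes.setdefault(label, [[v, v] for v in row])
        for b, v in zip(bounds, row):
            b[0] = min(b[0], v)
            b[1] = max(b[1], v)
    return pipes

def predict_hyperpipes(pipes, row):
    """Return the label whose pipe contains the most attribute values."""
    def containment(bounds):
        inside = sum(1 for (lo, hi), v in zip(bounds, row) if lo <= v <= hi)
        return inside / len(row)
    # On equal containment, max() keeps the label that was seen first.
    return max(pipes, key=lambda label: containment(pipes[label]))
```

Training is a single pass over the data and prediction only compares against one pair of bounds per attribute and class, which is consistent with the entry's remark that the method is extremely fast.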
      • trees
        • ADTree - Class for generating an alternating decision tree. The basic algorithm is based on: Freund, Y., Mason, L.: The alternating decision tree learning algorithm. In: Proceeding of the Sixteenth International Conference on Machine Learning, Bled, Slovenia, 124-133, 1999. This version currently only supports two-class problems. The number of boosting iterations needs to be manually tuned to suit the dataset and the desired complexity/accuracy tradeoff. Induction of the trees has been optimized, and heuristic search methods have been introduced to speed learning.
        • BFTree - Class for building a best-first decision tree classifier. This class uses binary split for both nominal and numeric attributes. For missing values, the method of 'fractional' instances is used. For more information, see: Haijian Shi (2007). Best-first decision tree learning. Hamilton, NZ. Jerome Friedman, Trevor Hastie, Robert Tibshirani (2000). Additive logistic regression : A statistical view of boosting. Annals of statistics. 28(2):337-407.
        • DecisionStump - Class for building and using a decision stump. Usually used in conjunction with a boosting algorithm. Does regression (based on mean-squared error) or classification (based on entropy). Missing is treated as a separate value.
        • FT - Classifier for building 'Functional trees', which are classification trees that could have logistic regression functions at the inner nodes and/or leaves. The algorithm can deal with binary and multi-class target variables, numeric and nominal attributes and missing values. For more information see: Joao Gama (2004). Functional Trees. Niels Landwehr, Mark Hall, Eibe Frank (2005). Logistic Model Trees.
        • Id3 - Class for constructing an unpruned decision tree based on the ID3 algorithm. Can only deal with nominal attributes. No missing values allowed. Empty leaves may result in unclassified instances. For more information see: R. Quinlan (1986). Induction of decision trees. Machine Learning. 1(1):81-106.
        • J48 - Class for generating a pruned or unpruned C4.5 decision tree. For more information, see Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
        • J48graft - Class for generating a grafted (pruned or unpruned) C4.5 decision tree. For more information, see Geoff Webb: Decision Tree Grafting From the All-Tests-But-One Partition. In: , San Francisco, CA, 1999.
        • LADTree - Class for generating a multi-class alternating decision tree using the LogitBoost strategy. For more info, see Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, Eibe Frank, Mark Hall: Multiclass alternating decision trees. In: ECML, 161-172, 2001.
        • LMT - Classifier for building 'logistic model trees', which are classification trees with logistic regression functions at the leaves. The algorithm can deal with binary and multi-class target variables, numeric and nominal attributes and missing values. For more information see: Niels Landwehr, Mark Hall, Eibe Frank (2005). Logistic Model Trees. Machine Learning. 59(1-2):161-205. Marc Sumner, Eibe Frank, Mark Hall: Speeding up Logistic Model Tree Induction. In: 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 675-683, 2005.
        • M5P - M5Base. Implements base routines for generating M5 Model trees and rules. The original algorithm M5 was invented by R. Quinlan, and Yong Wang made improvements. For more information see: Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992. Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
        • NBTree - Class for generating a decision tree with naive Bayes classifiers at the leaves. For more information, see Ron Kohavi: Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. In: Second International Conference on Knowledge Discovery and Data Mining, 202-207, 1996.
        • REPTree - Fast decision tree learner. Builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning (with backfitting). Only sorts values for numeric attributes once. Missing values are dealt with by splitting the corresponding instances into pieces (i.e. as in C4.5).
        • RandomForest - Class for constructing a forest of random trees. For more information see: Leo Breiman (2001). Random Forests. Machine Learning. 45(1):5-32.
        • RandomTree - Class for constructing a tree that considers K randomly chosen attributes at each node. Performs no pruning. Also has an option to allow estimation of class probabilities based on a hold-out set (backfitting).
        • SimpleCart - Class implementing minimal cost-complexity pruning. Note: missing values are handled with the "fractional instances" method rather than the surrogate split method. For more information, see: Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California.
        • UserClassifier - Interactively classify through visual means. You are presented with a scatter graph of the data against two user selectable attributes, as well as a view of the decision tree. You can create binary splits by creating polygons around data plotted on the scatter graph, as well as by allowing another classifier to take over at points in the decision tree should you see fit. For more information see: Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, Ian H. Witten (2001). Interactive machine learning: letting users build classifiers. Int. J. Hum.-Comput. Stud.. 55(3):281-292.
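The entropy-based classification mode mentioned for DecisionStump can be illustrated with a small Python sketch. It is not the Weka implementation: the function names are hypothetical, only numeric attributes with a single threshold split are considered, and the "missing as a separate value" branch described in the entry is left out.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def fit_stump(X, y):
    """Pick the attribute/threshold with the lowest weighted entropy."""
    best = None
    for a in range(len(X[0])):
        values = sorted(set(row[a] for row in X))
        # Candidate thresholds lie halfway between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[a] <= t]
            right = [lab for row, lab in zip(X, y) if row[a] > t]
            score = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, a, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        raise ValueError("no attribute has two distinct values to split on")
    return best[1:]  # (attribute index, threshold, left label, right label)

def predict_stump(stump, row):
    a, t, left_label, right_label = stump
    return left_label if row[a] <= t else right_label
```

A stump like this is a weak learner on its own, which is why the entry notes that it is usually combined with a boosting algorithm.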
      • rules
        • ConjunctiveRule - This class implements a single conjunctive rule learner that can predict for numeric and nominal class labels. A rule consists of antecedents "AND"ed together and the consequent (class value) for the classification/regression. In this case, the consequent is the distribution of the available classes (or mean for a numeric value) in the dataset. If the test instance is not covered by this rule, then it's predicted using the default class distributions/value of the data not covered by the rule in the training data. This learner selects an antecedent by computing the Information Gain of each antecedent and prunes the generated rule using Reduced Error Pruning (REP) or simple pre-pruning based on the number of antecedents. For classification, the Information of one antecedent is the weighted average of the entropies of both the data covered and not covered by the rule. For regression, the Information is the weighted average of the mean-squared errors of both the data covered and not covered by the rule. In pruning, the weighted average of the accuracy rates on the pruning data is used for classification while the weighted average of the mean-squared errors on the pruning data is used for regression.
        • DTNB - Class for building and using a decision table/naive bayes hybrid classifier. At each point in the search, the algorithm evaluates the merit of dividing the attributes into two disjoint subsets: one for the decision table, the other for naive Bayes. A forward selection search is used, where at each step, selected attributes are modeled by naive Bayes and the remainder by the decision table, and all attributes are modelled by the decision table initially. At each step, the algorithm also considers dropping an attribute entirely from the model. For more information, see: Mark Hall, Eibe Frank: Combining Naive Bayes and Decision Tables. In: Proceedings of the 21st Florida Artificial Intelligence Society Conference (FLAIRS), 318-319, 2008.
        • DecisionTable - Class for building and using a simple decision table majority classifier. For more information see: Ron Kohavi: The Power of Decision Tables. In: 8th European Conference on Machine Learning, 174-189, 1995.
        • JRip - This class implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), which was proposed by William W. Cohen as an optimized version of IREP. The algorithm is briefly described as follows: Initialize RS = {}, and for each class from the less prevalent one to the more frequent one, DO: 1. Building stage: Repeat 1.1 and 1.2 until the description length (DL) of the ruleset and examples is 64 bits greater than the smallest DL met so far, or there are no positive examples, or the error rate >= 50%. 1.1. Grow phase: Grow one rule by greedily adding antecedents (or conditions) to the rule until the rule is perfect (i.e. 100% accurate). The procedure tries every possible value of each attribute and selects the condition with highest information gain: p(log(p/t)-log(P/T)). 1.2. Prune phase: Incrementally prune each rule and allow the pruning of any final sequences of the antecedents. The pruning metric is (p-n)/(p+n) -- but it's actually 2p/(p+n) -1, so in this implementation we simply use p/(p+n) (actually (p+1)/(p+n+2), thus if p+n is 0, it's 0.5). 2. Optimization stage: After generating the initial ruleset {Ri}, generate and prune two variants of each rule Ri from randomized data using procedure 1.1 and 1.2. But one variant is generated from an empty rule while the other is generated by greedily adding antecedents to the original rule. Moreover, the pruning metric used here is (TP+TN)/(P+N). Then the smallest possible DL for each variant and the original rule is computed. The variant with the minimal DL is selected as the final representative of Ri in the ruleset. After all the rules in {Ri} have been examined and if there are still residual positives, more rules are generated based on the residual positives using Building Stage again. 3. Delete the rules from the ruleset that would increase the DL of the whole ruleset if they were in it, and add the resultant ruleset to RS. ENDDO. Note that there seem to be 2 bugs in the original ripper program that would affect the ruleset size and accuracy slightly. This implementation avoids these bugs and thus is a little bit different from Cohen's original implementation. Even after fixing the bugs, since the order of classes with the same frequency is not defined in ripper, there still seems to be some trivial difference between this implementation and the original ripper, especially for audiology data in UCI repository, where there are lots of classes of few instances. For details, see: William W. Cohen: Fast Effective Rule Induction. In: Twelfth International Conference on Machine Learning, 115-123, 1995. PS. We have compared this implementation with the original ripper implementation in aspects of accuracy, ruleset size and running time on both artificial data "ab+bcd+defg" and UCI datasets. In all these aspects it seems to be quite comparable to the original ripper implementation. However, we didn't consider memory consumption optimization in this implementation.
        • M5Rules - Generates a decision list for regression problems using separate-and-conquer. In each iteration it builds a model tree using M5 and makes the "best" leaf into a rule. For more information see: Geoffrey Holmes, Mark Hall, Eibe Frank: Generating Rule Sets from Model Trees. In: Twelfth Australian Joint Conference on Artificial Intelligence, 1-12, 1999. Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992. Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
        • NNge - Nearest-neighbor-like algorithm using non-nested generalized exemplars (which are hyperrectangles that can be viewed as if-then rules). For more information, see Brent Martin (1995). Instance-Based learning: Nearest Neighbor With Generalization. Hamilton, New Zealand. Sylvain Roy (2002). Nearest Neighbor With Generalization. Christchurch, New Zealand.
        • OneR - Class for building and using a 1R classifier; in other words, uses the minimum-error attribute for prediction, discretizing numeric attributes. For more information, see: R.C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91.
        • PART - Class for generating a PART decision list. Uses separate-and-conquer. Builds a partial C4.5 decision tree in each iteration and makes the "best" leaf into a rule. For more information, see: Eibe Frank, Ian H. Witten: Generating Accurate Rule Sets Without Global Optimization. In: Fifteenth International Conference on Machine Learning, 144-151, 1998.
        • Prism - Class for building and using a PRISM rule set for classification. Can only deal with nominal attributes. Can't deal with missing values. Doesn't do any pruning. For more information, see J. Cendrowska (1987). PRISM: An algorithm for inducing modular rules. International Journal of Man-Machine Studies. 27(4):349-370.
        • Ridor - An implementation of a RIpple-DOwn Rule learner. It generates a default rule first and then the exceptions for the default rule with the least (weighted) error rate. Then it generates the "best" exceptions for each exception and iterates until pure. Thus it performs a tree-like expansion of exceptions. The exceptions are a set of rules that predict classes other than the default. IREP is used to generate the exceptions. For more information about Ripple-Down Rules, see:
        • ZeroR - Class for building and using a 0-R classifier. Predicts the mean (for a numeric class) or the mode (for a nominal class).
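
The two formulas quoted in the JRip description above, the grow-phase information gain p(log(p/t)-log(P/T)) and the smoothed pruning value (p+1)/(p+n+2), can be sketched in Python as follows. This is only an illustration under stated assumptions, not Weka's code: the function names are hypothetical, the logarithm is taken to base 2, P and T are read as the positive and total counts covered before the candidate condition is added (p and t after it), and a gain of 0 is returned when p is 0.

```python
import math


def grow_gain(p, t, P, T):
    """Information gain p * (log(p/t) - log(P/T)) for one candidate condition.

    p, t: positives / total examples covered after adding the condition (assumed reading).
    P, T: positives / total examples covered before adding it (assumed reading).
    """
    if p == 0:
        # Assumption: a condition that covers no positives contributes no gain.
        return 0.0
    return p * (math.log2(p / t) - math.log2(P / T))


def prune_value(p, n):
    """Smoothed pruning value (p + 1) / (p + n + 2); equals 0.5 when p + n is 0."""
    return (p + 1) / (p + n + 2)


# A rule covering 5 positives out of 10 examples is narrowed to 4 positives out of 4.
print(grow_gain(4, 4, 5, 10))   # 4 * (log2(1) - log2(0.5)) = 4.0
print(prune_value(0, 0))        # 0.5
```

The +1 and +2 terms keep the pruning value defined for a rule that covers no examples at all, which is the case the description calls out explicitly.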
    • Cluster Algorithms
      • CLOPE - Yiling Yang, Xudong Guan, Jinyuan You: CLOPE: a fast and effective clustering algorithm for transactional data. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 682-687, 2002.
      • Cobweb - Class implementing the Cobweb and Classit clustering algorithms. Note: the application of node operators (merging, splitting etc.) in terms of ordering and priority differs (and is somewhat ambiguous) between the original Cobweb and Classit papers. This algorithm always compares the best host, adding a new leaf, merging the two best hosts, and splitting the best host when considering where to place a new instance. For more information see: D. Fisher (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning. 2(2):139-172. J. H. Gennari, P. Langley, D. Fisher (1990). Models of incremental concept formation. Artificial Intelligence. 40:11-61.
      • DBScan - Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Second International Conference on Knowledge Discovery and Data Mining, 226-231, 1996.
      • EM - Simple EM (expectation maximisation) class. EM assigns a probability distribution to each instance which indicates the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross validation, or you may specify a priori how many clusters to generate. The cross validation performed to determine the number of clusters is done in the following steps: 1. The number of clusters is set to 1. 2. The training set is split randomly into 10 folds. 3. EM is performed 10 times using the 10 folds the usual CV way. 4. The log-likelihood is averaged over all 10 results. 5. If the log-likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2. The number of folds is fixed to 10, as long as the number of instances in the training set is not smaller than 10. If it is smaller, the number of folds is set equal to the number of instances.
      • FarthestFirst - Cluster data using the FarthestFirst algorithm. For more information see: Hochbaum, Shmoys (1985). A best possible heuristic for the k-center problem. Mathematics of Operations Research. 10(2):180-184. Sanjoy Dasgupta: Performance Guarantees for Hierarchical Clustering. In: 15th Annual Conference on Computational Learning Theory, 351-363, 2002. Notes: - works as a fast simple approximate clusterer - modelled after SimpleKMeans, might be a useful initializer for it
      • FilteredClusterer - Class for running an arbitrary clusterer on data that has been passed through an arbitrary filter. Like the clusterer, the structure of the filter is based exclusively on the training data and test instances will be processed by the filter without changing their structure.
      • MakeDensityBasedClusterer - Class for wrapping a Clusterer to make it return a distribution and density. Fits normal distributions and discrete distributions within each cluster produced by the wrapped clusterer. Supports the NumberOfClustersRequestable interface only if the wrapped Clusterer does.
      • OPTICS - Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Joerg Sander: OPTICS: Ordering Points To Identify the Clustering Structure. In: ACM SIGMOD International Conference on Management of Data, 49-60, 1999.
      • SimpleKMeans - Cluster data using the k means algorithm. Can use either the Euclidean distance (default) or the Manhattan distance. If the Manhattan distance is used, then centroids are computed as the component-wise median rather than mean.
      • XMeans - Cluster data using the X-means algorithm. X-Means is K-Means extended by an Improve-Structure part. In this part of the algorithm, an attempt is made to split each center within its region. The decision between the children of each center and the center itself is made by comparing the BIC values of the two structures. For more information see: Dan Pelleg, Andrew W. Moore: X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In: Seventeenth International Conference on Machine Learning, 727-734, 2000.
      • sIB - Cluster data using the sequential information bottleneck algorithm. Note: only the hard clustering scheme is supported. sIB assigns each instance to the cluster that has the minimum cost/distance to the instance. The trade-off beta is set to infinity, so 1/beta is zero. For more information, see: Noam Slonim, Nir Friedman, Naftali Tishby: Unsupervised document classification using sequential information maximization. In: Proceedings of the 25th International ACM SIGIR Conference on Research and Development in Information Retrieval, 129-136, 2002.
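
The SimpleKMeans entry above notes that centroids are computed as the component-wise median rather than the mean when the Manhattan distance is used. A minimal Python sketch of that centroid update follows. It is not taken from Weka: the function name `centroid` and its `distance` parameter are hypothetical and only illustrate the rule.

```python
from statistics import mean, median


def centroid(points, distance="euclidean"):
    """Return the centroid of one cluster.

    Uses the component-wise mean for the Euclidean distance and the
    component-wise median for the Manhattan distance.
    """
    aggregate = median if distance == "manhattan" else mean
    # zip(*points) turns a list of rows into a sequence of columns.
    return [aggregate(column) for column in zip(*points)]


cluster = [[0.0, 0.0], [1.0, 10.0], [8.0, 2.0]]
print(centroid(cluster))               # [3.0, 4.0]
print(centroid(cluster, "manhattan"))  # [1.0, 2.0]
```

The median minimizes the sum of absolute differences along each attribute, which is why it pairs with the Manhattan distance in the same way the mean pairs with the squared Euclidean distance.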
    • Association Rules
      • Apriori - Class implementing an Apriori-type algorithm. Iteratively reduces the minimum support until it finds the required number of rules with the given minimum confidence. The algorithm has an option to mine class association rules. It is adapted as explained in the second reference. For more information see: R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. In: 20th International Conference on Very Large Data Bases, 478-499, 1994. Bing Liu, Wynne Hsu, Yiming Ma: Integrating Classification and Association Rule Mining. In: Fourth International Conference on Knowledge Discovery and Data Mining, 80-86, 1998.
      • FPGrowth - Class implementing the FP-growth algorithm for finding large item sets without candidate generation. Iteratively reduces the minimum support until it finds the required number of rules with the given minimum metric. For more information see: J. Han, J. Pei, Y. Yin: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, 1-12, 2000.
      • FilteredAssociator - Class for running an arbitrary associator on data that has been passed through an arbitrary filter. Like the associator, the structure of the filter is based exclusively on the training data and test instances will be processed by the filter without changing their structure.
      • GeneralizedSequentialPatterns - Class implementing a GSP algorithm for discovering sequential patterns in a sequential data set. The attribute identifying the distinct data sequences contained in the set can be determined by the respective option. Furthermore, the set of output results can be restricted by specifying one or more attributes that have to be contained in each element/itemset of a sequence. For further information see: Ramakrishnan Srikant, Rakesh Agrawal (1996). Mining Sequential Patterns: Generalizations and Performance Improvements.
      • PredictiveApriori - Class implementing the predictive apriori algorithm to mine association rules. It searches with an increasing support threshold for the best 'n' rules concerning a support-based corrected confidence value. For more information see: Tobias Scheffer: Finding Association Rules That Trade Support Optimally against Confidence. In: 5th European Conference on Principles of Data Mining and Knowledge Discovery, 424-435, 2001. The implementation follows the paper except for adding a rule to the output of the 'n' best rules. A rule is added if the expected predictive accuracy of this rule is among the 'n' best and it is not subsumed by a rule with at least the same expected predictive accuracy (out of an unpublished manuscript from T. Scheffer).
      • Tertius - Finds rules according to confirmation measure (Tertius-type algorithm). For more information see: P. A. Flach, N. Lachiche (1999). Confirmation-Guided Discovery of first-order rules with Tertius. Machine Learning. 42:61-95.
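
The Apriori and FPGrowth entries above both describe an outer loop that iteratively reduces the minimum support until the required number of rules with the given minimum confidence is found. The Python sketch below illustrates only that outer loop. It is not Weka's implementation: rules are enumerated by brute force rather than by Apriori candidate generation or an FP-tree, and the function names and the parameters `upper`, `lower` and `delta` (with their default values) are assumptions chosen for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rules_at(transactions, min_support, min_confidence):
    """Brute-force enumeration of rules lhs -> rhs; suitable for toy data only."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            whole = frozenset(combo)
            s = support(whole, transactions)
            if s == 0 or s < min_support:
                continue
            # Try every non-empty proper subset as the rule's left-hand side.
            for k in range(1, size):
                for lhs in map(frozenset, combinations(combo, k)):
                    confidence = s / support(lhs, transactions)
                    if confidence >= min_confidence:
                        rules.append((lhs, whole - lhs, s, confidence))
    return rules


def mine(transactions, num_rules, min_confidence,
         upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support step by step until enough rules are found."""
    min_support = upper
    rules = []
    while min_support >= lower:
        rules = rules_at(transactions, min_support, min_confidence)
        if len(rules) >= num_rules:
            break
        # Round to avoid accumulating floating-point error in the threshold.
        min_support = round(min_support - delta, 10)
    return min_support, rules


baskets = [frozenset({"bread", "milk"}), frozenset({"bread", "milk"}),
           frozenset({"bread", "milk", "eggs"}), frozenset({"eggs"})]
threshold, found = mine(baskets, num_rules=2, min_confidence=0.9)
print(threshold, len(found))  # 0.75 2
```

On this toy data the loop stops at a support of 0.75, the first threshold at which the two rules bread -> milk and milk -> bread (both with confidence 1.0) qualify.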
    • Predictors
      • Weka Cluster Assigner - The Weka Cluster Assigner takes a cluster model generated in a Weka node and assigns the data at the inport to the corresponding clusters.
      • Weka Predictor - The Weka Predictor takes a model generated in a Weka node and classifies the test data at the inport.
    • IO
  • XML