Yun edited this page Jun 18, 2015 · 45 revisions

Description of input

FAST requires (1) data files and (2) a configuration file.

## Data files

The data files contain the student observations used for training and testing. Columns are delimited by tabs (\t) or commas (,), and any whitespace immediately before or after a delimiter is ignored (the columns are split with the regular expression "\s*[,\t]+\s*"). Data file names follow the pattern prefix + X + suffix (e.g. train0.txt), where X is a number starting at 0 that identifies the train-test pair (fold).
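As a rough illustration of the splitting rule, the Python sketch below applies the same regular expression to a single line. FAST itself is a Java program; `split_columns` is a hypothetical helper that is not part of FAST, and stripping the ends of the line is an assumption made for this sketch.

```python
import re

# Same pattern as quoted above: one or more commas/tabs, together with
# any whitespace around them, act as a single column separator.
DELIMITER = re.compile(r"\s*[,\t]+\s*")


def split_columns(line):
    """Split one data-file line into column values (hypothetical helper)."""
    # strip() is an assumption of this sketch: it drops the newline and
    # any spaces at the very start or end of the line.
    return DELIMITER.split(line.strip())


columns = split_columns("s1 ,\tkc1,  correct\n")
```

In this sketch, consecutive delimiters with nothing between them (e.g. "a,,b") collapse into a single separator, so an empty cell would shift the remaining columns. Whether FAST behaves the same way is not confirmed here; writing NULL explicitly avoids the question.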

IMPORTANT:

  • The input requires one line per observation. Lines only need to be sorted by time within each student, so the order of students or Knowledge Components (KCs) doesn't matter as long as each student's observations are in chronological order.
  • The test file shouldn't contain new KCs (HMMs): the code won't make predictions for a KC (HMM) that it wasn't trained on.
  • The test file should have the same feature columns (indicated by features_XXX or *features_XXX) as the train file.
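The last two rules can be checked before running FAST. The following is a minimal sketch, assuming each file has already been parsed into a list of dicts keyed by column name and that the KC column is named "KC"; `check_train_test` is a hypothetical helper, not part of FAST.

```python
def check_train_test(train_rows, test_rows):
    """Return a list of problems found; an empty list means the checks passed.

    Each row is a dict mapping column name -> value (assumed format).
    """
    problems = []

    # Rule: the test file must not contain KCs that are absent from train.
    train_kcs = {row["KC"] for row in train_rows}
    unseen = sorted({row["KC"] for row in test_rows} - train_kcs)
    if unseen:
        problems.append(f"test KCs not in train: {unseen}")

    # Rule: train and test must have the same feature columns.
    def feature_columns(rows):
        return {col for row in rows for col in row if "feature" in col}

    if feature_columns(train_rows) != feature_columns(test_rows):
        problems.append("feature columns differ between train and test")

    return problems


# Made-up rows, for illustration only.
train = [{"student": "s1", "KC": "add", "outcome": "correct", "feature_hint": "1"}]
test = [{"student": "s2", "KC": "sub", "outcome": "incorrect", "feature_hint": "0"}]
problems = check_train_test(train, test)
```

Here `problems` reports the unseen KC "sub". The time ordering within a student can't be checked this way unless the data carries its own timestamp or step information.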

MANDATORY COLUMNS:

  • student COLUMN: Number or String.
    This column identifies sequences. All observations with the same "student" id will be placed in the same sequence.

  • KC COLUMN: Number or String.
    This column identifies an HMM model. Currently, each observation maps to exactly one KC, and the model learns parameters for each KC (HMM) individually. If you have multiple KCs per item/record, you can specify a more coarse-grained KC name here and put the individual KCs in as features (in feature COLUMNS). See files with prefix "FAST+subskill".

  • outcome COLUMN: String (correct | incorrect)
    We only support binary HMMs.
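To make the roles of the three mandatory columns concrete, here is a small Python sketch that groups observations into one outcome sequence per KC and student. It reflects one reading of the description above, not FAST's actual implementation; `group_sequences` and the column names used as dict keys are assumptions.

```python
from collections import defaultdict


def group_sequences(rows):
    """Group observations into one sequence per (KC, student) pair.

    rows: iterable of dicts with "student", "KC" and "outcome" keys,
    already sorted by time within each student.
    """
    sequences = defaultdict(lambda: defaultdict(list))
    for row in rows:
        outcome = row["outcome"]
        # Only binary outcomes are supported.
        if outcome not in ("correct", "incorrect"):
            raise ValueError(f"unsupported outcome: {outcome!r}")
        # Appending in file order preserves the time order within a student.
        sequences[row["KC"]][row["student"]].append(outcome)
    return sequences


# Made-up observations, for illustration only.
rows = [
    {"student": "s1", "KC": "add", "outcome": "incorrect"},
    {"student": "s2", "KC": "add", "outcome": "correct"},
    {"student": "s1", "KC": "add", "outcome": "correct"},
]
sequences = group_sequences(rows)
```

The rows of the two students are interleaved, yet each student's sequence keeps its own chronological order, which is all the input format requires.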

OPTIONAL COLUMNS:

  • feature COLUMNS: Number
    Feature column names must contain the string "feature".

    • Features must be numeric. This is not a limitation, because string or categorical variables can be coded as binary features. For example, if a single feature may take the values red, green and blue, you could encode it as two binary features (red={0,1}, green={0,1}), with blue represented by both being 0, or as three binary features (red={0,1}, green={0,1}, blue={0,1}).
    • By default, all input features are used to parameterize every probability you chose to parameterize. To restrict a feature to the initial, transition or emission probabilities, add the prefix "init_", "tran_" or "emit_" to "feature" in the column name and set forceUsingAllInputFeatures=false.
    • Features that are marked with a star (*) have coefficients shared by both latent states (mastery and not mastery). See files with prefix "FAST+subskill". Features that do not have a star have a different coefficient for each latent state.
    • By default, FAST adds a bias feature to both hidden states, so don't include a bias (intercept) feature, i.e. a feature whose value is always 1, in the input. If you want to change the configuration of bias features, specify bias in the configuration file.
    • If a feature is always 0 (never appears) for the current KC but may have value 1 for other KCs, set it to NULL (or NAN, or leave it empty) in the current KC's records. The code then doesn't compute the gradient for that feature in the current HMM. See files with prefix "FAST".
    • FAST currently applies L2 regularization, but standardizing or normalizing features to map them to smaller values can still make coefficients more directly interpretable and speed up training. Depending on the distribution of the feature values, standardization or normalization can also be unsuitable and reduce performance, so please experiment.
  • problem COLUMN: Number or String
    You can put the problem/item/question name/id here.

  • step COLUMN: Number or String
    Currently, the value in this column doesn't matter. However, putting information that identifies the order of the records here may help you check the results.

  • fold COLUMN: Number (1|-1)
    By default, all values are 1. If your data split uses the first records of a student-skill sequence for training and the remaining records for testing, then in the test0.txt file, set the fold COLUMN value to -1 for the records used for training. See files with prefix "FAST+item_split_seq".
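The binary coding of a categorical variable described under feature COLUMNS can be sketched as follows. This is an illustrative Python helper, not part of FAST; the column names feature_red, feature_green and feature_blue are hypothetical and were chosen only because they contain the string "feature".

```python
COLORS = ["red", "green", "blue"]  # hypothetical categorical values


def encode_color(color, baseline="blue"):
    """Encode one categorical value as binary feature columns.

    The baseline category gets no column of its own: it is represented
    by all other columns being 0. Pass baseline=None to get one column
    per category instead.
    """
    if color not in COLORS:
        raise ValueError(f"unknown color: {color!r}")
    return {
        f"feature_{name}": int(color == name)
        for name in COLORS
        if name != baseline
    }


two_columns = encode_color("blue")                   # baseline coding
three_columns = encode_color("blue", baseline=None)  # one column per value
```

With the baseline coding, "blue" becomes feature_red=0 and feature_green=0. Since FAST adds a bias feature by default, the baseline coding is probably the more natural choice, but that is a modeling judgment rather than a requirement stated above.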

## Configuration

  • Example configuration files with some default values are provided in input/XXX.conf. You can add other options to the file as needed.

  • Here are the basic options:

    • modelName: FAST|KT. The value must contain either "FAST" or "KT" (in capital letters).
    • inDir: the directory containing the input files (train and test).
    • outDir: the directory for the output prediction files and log files.
    • trainInFilePrefix: the prefix of the training set file(s).
    • testInFilePrefix: the prefix of the testing set file(s).
    • inFileSuffix: the file suffix of the training and testing set file(s).
    • nbFiles: the number of train-test pairs, which lets the code find the files automatically. File names are built from the prefix, an index starting at 0, and the suffix. For example, with trainInFilePrefix="train", testInFilePrefix="test" and inFileSuffix=".csv": (1) if nbFiles=1, the train file is "train0.csv"; (2) if nbFiles>1, the index increases for each file, e.g. "train1.csv" for the 2nd file, so nbFiles=10 reads train0.csv to train9.csv. The same rule applies to test files.
    • nbRandomRestart: the number of random restarts (random parameter initializations) for each HMM (KC).
    • parameterizing: true|false. If you use features (FAST), set parameterizing=true.
    • parameterizingInit: true|false. Whether to parameterize the initial probabilities.
    • parameterizingTran: true|false. Whether to parameterize the transition probabilities.
    • parameterizingEmit: true|false. Whether to parameterize the emission probabilities.
    • forceUsingAllInputFeatures: true|false. If forceUsingAllInputFeatures=true, all input features are used without differentiating which probability (initial/transition/emission) they parameterize. Otherwise, the code uses the "init_", "tran_" or "emit_" prefix to recognize the feature columns for the initial, transition or emission probabilities (e.g., "tran_feature_XXX").
  • For details on the configuration options, run the following command and check the variables with the prefix "Options.":
    java -jar fast-2.0.0-final.jar -help
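The way nbFiles, the prefixes and the suffix combine into file names can be sketched in Python as follows. This is only an illustration of the naming rule described for nbFiles; `expected_pairs` is a hypothetical helper, not FAST code.

```python
def expected_pairs(train_prefix, test_prefix, suffix, nb_files):
    """List the (train, test) file names implied by the naming rule.

    The index starts at 0 and increases by one for each train-test pair.
    """
    return [
        (f"{train_prefix}{i}{suffix}", f"{test_prefix}{i}{suffix}")
        for i in range(nb_files)
    ]


pairs = expected_pairs("train", "test", ".csv", 10)
first, last = pairs[0], pairs[-1]
```

For nbFiles=10 this yields ten pairs, from train0.csv/test0.csv to train9.csv/test9.csv, all of which would be expected in inDir.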
