Yun edited this page Jun 18, 2015 · 45 revisions

Description of input

FAST requires (1) data files and (2) a configuration file.

## Data files

The data files contain the student observations used for training and testing. Columns are delimited by tab (\t) or comma (,), and any spaces before or after a delimiter are skipped (i.e., the code splits columns with the regular expression "\s*[,\t]+\s*"). Data file names follow the pattern prefix-X-suffix (e.g. train0.txt), where X is a number from 0 to nbFiles-1.
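The splitting rule and the naming pattern above can be sketched in a few lines of Python. This is only an illustration of the quoted regular expression, not FAST's own parsing code; the helper names `split_columns` and `data_file_name` are hypothetical.

```python
import re

# Delimiter pattern quoted in the text: a run of commas/tabs,
# together with any whitespace immediately around it.
DELIMITER = re.compile(r"\s*[,\t]+\s*")

def split_columns(line):
    """Split one data-file line into column values (hypothetical helper)."""
    return DELIMITER.split(line.strip())

def data_file_name(prefix, index, suffix):
    """Build a name following the prefix-X-suffix pattern, e.g. train0.txt."""
    return f"{prefix}{index}{suffix}"

print(split_columns("s1 ,\tkc_add , correct"))
print(data_file_name("train", 0, ".txt"))
```

Note that, in this sketch, a run of consecutive delimiters (such as two commas in a row) matches the pattern once, so no empty value is produced between them.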

IMPORTANT:

  • The input requires one line per observation. Lines only need to be sorted by time within each student, so the order of students or Knowledge Components (KCs) doesn't matter.
  • The test file shouldn't contain new KCs (HMMs): the code won't make predictions for a KC (HMM) that it wasn't trained on.
  • The test file should have the same feature columns (indicated by features_XXX or *features_XXX) as the train file.
  • Don't use tab (\t), comma (,) or space ( ) inside names and values in the train and test files. The code uses them as delimiters by default.
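Two of the points above (no new KCs in the test file, identical feature columns) can be checked before running FAST. The following is a minimal sketch, not part of FAST: it assumes each file starts with a header row and that the KC column is literally named "KC"; `read_table` and `check_train_test` are hypothetical helper names.

```python
import re

# Same delimiter pattern the text quotes for splitting columns.
DELIMITER = re.compile(r"\s*[,\t]+\s*")

def read_table(lines):
    """Parse a header line plus data lines into (header, list of row dicts)."""
    rows = [DELIMITER.split(line.strip()) for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in body]

def check_train_test(train_lines, test_lines, kc_col="KC"):
    """Return a list of problems found; an empty list means both checks passed."""
    train_header, train_rows = read_table(train_lines)
    test_header, test_rows = read_table(test_lines)
    problems = []

    # Feature columns are recognized by the substring "feature" in their name.
    train_feats = sorted(c for c in train_header if "feature" in c)
    test_feats = sorted(c for c in test_header if "feature" in c)
    if train_feats != test_feats:
        problems.append(f"feature columns differ: {train_feats} vs {test_feats}")

    # KCs that appear only in the test file would not get predictions.
    unseen = sorted({r[kc_col] for r in test_rows} - {r[kc_col] for r in train_rows})
    if unseen:
        problems.append(f"KCs only in test file: {unseen}")
    return problems

# Assumed header names, used for illustration only.
train = ["student,KC,outcome,features_hint",
         "s1,add,correct,0",
         "s1,add,incorrect,1"]
test = ["student,KC,outcome,features_hint",
        "s2,add,correct,0",
        "s2,sub,correct,1"]
print(check_train_test(train, test))
```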

MANDATORY COLUMNS:

  • student COLUMN: Number or String.
    This column identifies sequences. All observations with the same "student" id will be placed in the same sequence.

  • KC COLUMN: Number or String.
    This column identifies an HMM. Currently each observation can map to only one KC, and the model learns parameters for each KC (HMM) individually. However, if you have multiple KCs per item/record, you can put them as features (in the feature COLUMNS) and specify a more coarse-grained KC name in the KC column. See files with prefix "FAST+subskill".

  • outcome COLUMN: String (correct | incorrect)
    We only support binary HMMs.

OPTIONAL COLUMNS:

  • feature COLUMNS: Number
    Feature column names should contain the string "feature".

    • Features must be numeric. This is not a limitation, because string or categorical variables can be coded as binary features. For example, if a single feature may take the values red, green and blue, you could encode it as two binary features (red={0,1}, green={0,1}), with blue implied when both are 0, or as three binary features (adding blue={0,1}).
    • By default, all input features are used to parameterize the probabilities you specified. To use different features for the initial, transition or emission probabilities, add an "init_", "tran_" or "emit_" prefix to "feature" and specify forceUsingAllInputFeatures=false.
    • Features that are marked with a star (*) have coefficients shared by both latent states (mastery and not mastery). See files with prefix "FAST+subskill". Features that do not have a star have a different coefficient for each latent state.
    • By default, FAST adds a bias feature to both hidden states. Don't put a bias (intercept) feature, i.e. a feature whose value is always 1, in the input. If you want to change the configuration of bias features, specify bias in the configuration file.
    • If some features are always 0 (never appear) in the current KC but may have value 1 in other KCs, set them to NULL (or NAN, or leave them empty) in the current KC's records. This matters for computing the gradient: if they are NULL, the code doesn't compute the gradient for those features for the current HMM. See files with prefix "FAST".
    • Although FAST currently has L2 regularization, standardizing or normalizing features to map them to smaller values can sometimes make coefficients more directly interpretable and speed up training. However, depending on the distribution of the feature values, standardization or normalization may be unsuitable and can reduce performance. Please experiment.
  • problem COLUMN: Number or String
    You can put the problem/item/question name/id here.

  • step COLUMN: Number or String
    It currently doesn't matter what you put here. However, using this column to record the order of the records may help you check the results.

  • fold COLUMN: Number (1|-1)
    By default, all values are 1. If your data split uses the beginning records of each student-skill sequence for training and the remaining records for testing, then in the test0.txt file, mark the records used for training with fold COLUMN value -1. See files with prefix "FAST+item_split_seq".
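As an illustration of the feature guidance above, the sketch below encodes a categorical value as binary features and standardizes a numeric feature with z-scores. It is a hedged example, not part of FAST: the helper names `one_hot` and `standardize` are hypothetical, the "features_" column prefix follows the features_XXX convention mentioned earlier, and z-scoring is just one possible normalization.

```python
from statistics import mean, pstdev

def one_hot(value, categories):
    """Encode a categorical value as binary indicators, one per listed category."""
    return {f"features_{c}": 1 if value == c else 0 for c in categories}

def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

colors = ["red", "blue", "green"]
# Two indicators (red, green); blue is the implicit baseline (both are 0).
rows = [one_hot(c, ["red", "green"]) for c in colors]
print(rows)
print(standardize([10.0, 20.0, 30.0]))
```

Whether standardization helps depends on the data, so it is worth comparing results with and without it.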

## Configuration

  • We provide some example configuration files in input/XXX.conf with default values. You can add other options to a file as needed.

  • Here are the basic options:

    • modelName: FAST|KT. modelName should contain either "FAST" or "KT" in the string (capital letters).
    • inDir: the directory containing the input files (train and test).
    • outDir: the directory for the output prediction files and log files.
    • trainInFilePrefix: the prefix of training set file(s).
    • testInFilePrefix: the prefix of testing set file(s).
    • inFileSuffix: the file suffix of training and testing set file(s).
    • nbFiles: the number of train-test pairs, so that the code can read the train and test files automatically (e.g. if nbFiles=10, trainInFilePrefix="train", testInFilePrefix="test", inFileSuffix=".csv", then the code automatically reads train0.csv~train9.csv).
    • trainInFilePrefix combined with nbFiles and inFileSuffix determines the train file names. If trainInFilePrefix="train" and inFileSuffix=".csv", then (1) if nbFiles=1, the train file is "train0.csv"; (2) if nbFiles>1, the train file names are generated automatically with an increasing id appended to the prefix, e.g. "train1.csv" for the 2nd file. The same rule applies to test files.
    • nbRandomRestart: the number of random restarts (for initializing parameters) for each HMM (KC).
    • parameterizing: true|false. If you use features (FAST), set parameterizing=true.
    • parameterizingInit: true|false. If true, the initial probabilities are parameterized.
    • parameterizingTran: true|false. If true, the transition probabilities are parameterized.
    • parameterizingEmit: true|false. If true, the emission probabilities are parameterized.
    • forceUsingAllInputFeatures: true|false. If forceUsingAllInputFeatures=true, all input features are used without differentiating which probability is parameterized (initial/transition/emission). Otherwise the code uses the "init_", "tran_" or "emit_" prefix to recognize the feature columns for the initial, transition or emission probabilities (e.g., "tran_feature_XXX").
  • Here are some more advanced options:

    • EMMaxIters: the maximum number of iterations of the outer EM. A smaller value can make training stop earlier, but may decrease accuracy.
    • LBFGSMaxIters: the maximum number of iterations of the inner LBFGS. A smaller value can make training stop earlier, but may decrease accuracy.
    • EMTolerance: the tolerance used to decide convergence of the outer EM. A bigger value can make training stop earlier, but may decrease accuracy.
    • LBFGSTolerance: the tolerance used to decide convergence of the inner LBFGS. A bigger value can make training stop earlier, but may decrease accuracy.
    • initialK0: double. The initial value of the KT init parameter, Prob(known). By default the code initializes it randomly (initialK0=-1).
    • initialT: double. The initial value of the KT learn parameter, Prob(known|unknown). By default the code initializes it randomly (initialT=-1).
    • initialG: double. The initial value of the KT guess parameter, Prob(correct|unknown). By default the code initializes it randomly (initialG=-1).
    • initialS: double. The initial value of the KT slip parameter, Prob(incorrect|known). By default the code initializes it randomly (initialS=-1).
    • bias: true|false. By default bias=true, which adds a bias (intercept) to the feature space (by default, different hidden states use different biases). bias can only be false for KT.
    • initialFeatureWeightsBounds: double. Decides the initial lower and upper bound for each feature coefficient. By default 10.0.
    • LBFGSRegWeight: double. For LBFGS, the regularization term is sum_i[ c * (w_i - b)^2 ], where c is the regularization weight (LBFGSRegWeight) and b is the regularization bias (LBFGSRegBias).
    • LBFGSRegBias: double. The regularization bias b in the regularization term sum_i[ c * (w_i - b)^2 ], where c is the regularization weight (LBFGSRegWeight).
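The regularization term sum_i[ c * (w_i - b)^2 ] can be written out directly. The sketch below only evaluates the formula as stated and its derivative with respect to each weight; how FAST combines it with the likelihood inside LBFGS is not covered here, and the function names are hypothetical.

```python
def l2_penalty(weights, reg_weight, reg_bias=0.0):
    """Evaluate sum_i c * (w_i - b)^2 with c=reg_weight and b=reg_bias."""
    return sum(reg_weight * (w - reg_bias) ** 2 for w in weights)

def l2_penalty_gradient(weights, reg_weight, reg_bias=0.0):
    """Derivative of the penalty with respect to each w_i: 2 * c * (w_i - b)."""
    return [2.0 * reg_weight * (w - reg_bias) for w in weights]

weights = [1.0, -2.0]
print(l2_penalty(weights, reg_weight=0.5))
print(l2_penalty_gradient(weights, reg_weight=0.5))
```

A larger regularization weight c pulls the coefficients more strongly toward the bias b, and with b=0 the term reduces to an ordinary L2 penalty.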

  • You can also see the above explanation of the configuration options by typing the following command and checking the variables with prefix "Options.":
    java -jar fast-2.0.0-final.jar -help
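The way nbFiles, trainInFilePrefix, testInFilePrefix and inFileSuffix combine into file names can be sketched as follows. This is an illustration of the naming rule described in the basic options, not FAST's own code; `train_test_pairs` is a hypothetical helper.

```python
def train_test_pairs(nb_files, train_prefix, test_prefix, suffix):
    """List the (train, test) file names expected for ids 0 .. nb_files-1."""
    return [
        (f"{train_prefix}{i}{suffix}", f"{test_prefix}{i}{suffix}")
        for i in range(nb_files)
    ]

for train_name, test_name in train_test_pairs(3, "train", "test", ".csv"):
    print(train_name, test_name)
```

Listing the expected names this way makes it easy to confirm that every pair exists in inDir before starting a run.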
