Yun edited this page Jul 20, 2022 · 45 revisions

Description of input

FAST requires (1) data files and (2) a configuration file. Details are as follows.

Data files

The data files contain the student observations used for training and testing. Data files are delimited by tab (\t) or comma (,); any spaces preceding or following the tab or comma are skipped (i.e., the code uses the regular expression "\s*[,\t]+\s*" to split the columns). Data files have names with the pattern prefix-X-suffix (e.g. train0.txt), where X is a number from 0 to nbFiles-1.
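
The column-splitting rule above can be sketched in a few lines (a Python illustration of the same regular expression; FAST itself is Java):

```python
import re

# The splitter described above: tab or comma, with any surrounding spaces skipped.
SPLITTER = re.compile(r"\s*[,\t]+\s*")

line = "student1 ,\tskill_A , correct"
print(SPLITTER.split(line.strip()))  # → ['student1', 'skill_A', 'correct']
```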

IMPORTANT:

  • The input requires one line per observation. The only ordering requirement is that lines are sorted by time within each student; the order of students or Knowledge Components (KCs) doesn't matter as long as each student's records are in time order.
  • The test file shouldn't contain new KCs (HMMs), i.e. the code won't predict for a KC (HMM) that it didn't train on.
  • The test file should have the same feature columns (indicated by features_XXX or *features_XXX) as the train file.
  • Don't use tab (\t), comma (,), or space ( ) when specifying names and values in the input train and test files: the code treats them as delimiters by default.
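
A minimal sketch of checking the first two rules before running FAST (a Python illustration; the column name "KC" and headers containing "features" are assumptions about your header row, not names fixed by FAST):

```python
import re

SPLIT = re.compile(r"\s*[,\t]+\s*")  # FAST's delimiter rule

def read_table(text):
    """Parse a tab/comma-delimited table into (header, rows-as-dicts)."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    header = SPLIT.split(lines[0].strip())
    rows = [dict(zip(header, SPLIT.split(l.strip()))) for l in lines[1:]]
    return header, rows

def check_train_test(train_text, test_text):
    train_header, train_rows = read_table(train_text)
    test_header, test_rows = read_table(test_text)
    # Same feature columns in train and test.
    features = lambda h: {c for c in h if "features" in c}
    assert features(train_header) == features(test_header), "feature columns differ"
    # The test file must not introduce KCs the model wasn't trained on.
    kcs = lambda rows: {r["KC"] for r in rows}
    assert kcs(test_rows) <= kcs(train_rows), "test file has unseen KCs"
```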

MANDATORY COLUMNS:

  • student COLUMN: Number or String.
    This column identifies sequences. All observations with the same "student" id will be placed in the same sequence.

  • KCs COLUMN: Number or String.
    This column identifies an HMM model. Currently each observation maps to exactly one KC. The model learns parameters for each KC (HMM) individually. However, if you have multiple KCs per item/record, you can put the additional KCs as features (in feature COLUMNS) and specify a more coarse-grained KC name here in the KC column. See files with prefix "FAST+subskill".

  • outcome COLUMN: String (correct | incorrect). Only binary HMMs are supported.
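
Putting the mandatory columns together, a hypothetical minimal tab-delimited train file could look like the sketch below (the header row and the exact column names student/KC/outcome are illustrative; check the example files shipped with FAST for the headers it expects):

```
student	KC	outcome
s1	addition	incorrect
s1	addition	correct
s2	addition	correct
```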

OPTIONAL COLUMNS:

  • feature COLUMNS: Number
    Feature column names should contain the string "feature".

    • Features must be numeric. This is not a limitation, because string or categorical variables can be coded as binary features. For example, if you have a single feature that may take the values red, green, and blue, you could encode it as two binary features (red={0,1}, green={0,1}), or as three binary features (red={0,1}, green={0,1}, blue={0,1}).
    • By default, all input features are used to parameterize the probabilities you specified (i.e., if you set parameterizingInit=true and parameterizingTran=true and have a feature column "features_read_example", that feature parameterizes both pK0 and pT). Otherwise, add an "init_", "tran_", or "emit_" prefix to "feature" and set forceUsingAllInputFeatures=false to differentiate features used for initial, transition, or emission probabilities (i.e., if you want one feature used only for transition and another only for initial, create the feature columns "tran_features_read_example" and "init_features_check_a_video", and set forceUsingAllInputFeatures=false).
    • Features that are marked with a star (*) have coefficients shared by both latent states (mastery and not mastery). See files with prefix "FAST+subskill". Features that do not have a star have a different coefficient for each latent state.
    • By default, FAST adds a bias feature to both hidden states. Don't put a bias (intercept) feature in the input (i.e., a feature whose value is always 1). If you want to change the configuration of bias features, specify bias in the configuration file.
    • If some features are always 0 (never appear) for the current KC but may have value 1 in other KCs, set them to NULL in the current KC's records. (This is for the sake of computing the gradient: if they are NULL, the code doesn't compute gradients for those features for the current HMM. See files with prefix "FAST".)
    • Standardizing or normalizing features to map them to smaller values may be beneficial (FAST currently uses L2 regularization): it 1) makes coefficients more interpretable and comparable across different units, 2) speeds up training, and 3) yields a more stable model, since "the ridge solutions are not equivariant under scaling of the inputs" (as Hastie, Tibshirani, and Friedman point out, page 82 of the pdf or page 63 of the book). Yet sometimes standardization or normalization is not suitable (e.g., due to the feature value distribution) and will hurt performance. Please do some experimentation.
  • problem COLUMN: Number or String
    You can put the problem/item/question name/id here; the column name should be "problem".

  • step COLUMN: Number or String
    So far it doesn't matter what you put here. However, it may help you check the results if you use this column to record the order of the records.

  • fold COLUMN: Number (1|-1)
    By default, all values are 1. If you have the kind of data split where the first records of a student-skill sequence are used for training and the remaining records for testing, then in the test file (test0.txt) set the fold COLUMN to -1 for the records used for training. You can always keep this column as 1 in the training file. See files with prefix "FAST+item_split_seq".
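
The feature-encoding and scaling advice above (binary coding of categorical variables, standardization) can be sketched in plain Python, independent of FAST itself:

```python
def one_hot(values, categories):
    """Encode a categorical column as binary feature columns, one per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(xs):
    """Map a numeric feature to zero mean and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5 or 1.0  # guard against a constant feature
    return [(x - mean) / sd for x in xs]
```

For the red/green/blue example above, calling one_hot with two of the three categories gives the two-feature (dummy) coding, while passing all three gives the three-feature coding.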

Configuration file

  • We provide example configuration files in input/XXX.conf with some default values. You can add other options to the file according to your needs.

  • Here are the basic options:

    • modelName: FAST|KT. modelName should contain either "FAST" or "KT" in the string (capital letters).
    • inDir: directory for reading input files (train and test).
    • outDir: directory for writing output prediction files and log files.
    • trainInFilePrefix: the prefix of training set file(s).
    • testInFilePrefix: the prefix of testing set file(s).
    • inFileSuffix: the file suffix of training and testing set file(s).
    • nbFiles: decides how many train-test pairs there are so that the code can read the train-test files automatically (e.g. if nbFiles=10, trainInFilePrefix="train", testInFilePrefix="test", inFileSuffix=".csv", then the code automatically reads train0.csv~train9.csv and test0.csv~test9.csv).
    • trainInFilePrefix combined with nbFiles and inFileSuffix is used for retrieving the train file(s). If trainInFilePrefix="train" and inFileSuffix=".csv", then (1) if nbFiles=1, the train file will be "train0.csv"; (2) if nbFiles>1, the train files are automatically numbered with an increasing id as suffix, e.g. "train1.csv" for the 2nd file. The same rule applies to test files.
    • nbRandomRestart: specifies how many random restarts (for initializing parameters) to use for each HMM (KC).
    • parameterizing: true|false. If you use features (FAST), set parameterizing=true.
    • parameterizingInit: true|false. parameterizingInit will parameterize initial probabilities.
    • parameterizingTran: true|false. parameterizingTran will parameterize transition probabilities.
    • parameterizingEmit: true|false. parameterizingEmit will parameterize emission probabilities.
    • forceUsingAllInputFeatures: true|false. If forceUsingAllInputFeatures=true, all input features are used without differentiating which probability is parameterized (initial/transition/emission); otherwise the code uses the "init_", "tran_", or "emit_" prefix to recognize the feature columns for initial, transition, or emission probabilities (e.g., "tran_feature_XXX").
  • Here are some more advanced options:

    • generateStudentDummy: generateStudentDummy is for automatically generating binary student dummies (indicators) based on training dataset.
    • generateItemDummy: generateItemDummy is for automatically generating binary item dummies (indicators) based on training dataset. By default, it treats the "problem" column as the item column (one problem is one item).
    • EMMaxIters: the maximum number of iterations of the outer EM. A smaller value makes training stop earlier but could decrease accuracy.
    • LBFGSMaxIters: the maximum number of iterations of the inner LBFGS. A smaller value makes training stop earlier but could decrease accuracy.
    • EMTolerance: decides the convergence of the outer EM. A bigger value makes training stop earlier but could decrease accuracy.
    • LBFGSTolerance: decides the convergence of the inner LBFGS. A bigger value makes training stop earlier but could decrease accuracy.
    • initialK0: double. initialK0 is for specifying KT init(Prob(known)). By default the code initializes randomly (initialK0=-1).
    • initialT: double. initialT is for specifying KT learn(Prob(known|unknown)). By default the code initializes randomly (initialT=-1).
    • initialG: double. initialG is for specifying KT guess(Prob(correct|unknown)). By default the code initializes randomly (initialG=-1).
    • initialS: double. initialS is for specifying KT slip(Prob(incorrect|known)). By default the code initializes randomly (initialS=-1).
    • bias: true|false. By default bias=true, meaning a bias (intercept) is added to the feature space (by default, different hidden states use different biases). bias can only be false for KT.
    • initialFeatureWeightsBounds: double. Decides the initial lower and upper bound for each feature coefficient. By default 10.0.
    • LBFGSRegWeight: double. For LBFGS, the regularization term is sum_i[ c*(w_i - b)^2 ], where c is the regularization weight (LBFGSRegWeight) and b is the regularization bias (LBFGSRegBias).
    • LBFGSRegBias: double. See LBFGSRegWeight above.
  • You can also see the above explanation of the configuration options by running the following command and checking the variables with prefix "Options.":
    java -jar fast-2.1.0-final.jar -help
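
Putting the basic options together, a hypothetical configuration file might look like the sketch below (option names are taken from the list above; the key=value layout and all values are only illustrative, so see the shipped input/XXX.conf examples for the authoritative format):

```
modelName=FAST
inDir=input
outDir=output
trainInFilePrefix=train
testInFilePrefix=test
inFileSuffix=.csv
nbFiles=1
nbRandomRestart=3
parameterizing=true
parameterizingInit=true
parameterizingTran=true
parameterizingEmit=false
forceUsingAllInputFeatures=true
```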