Yun edited this page Jun 18, 2015 · 45 revisions

Description of input

FAST requires (1) data files and (2) a configuration file.

##Data files

The data files contain the student observations used for training and testing. Columns are delimited by tabs (\t) or commas (,), and any whitespace immediately before or after a delimiter is ignored (i.e., columns are split with the regular expression "\s*[,\t]+\s*"). Data file names follow the pattern prefix + X + suffix (e.g. train0.txt), where X is the fold id, a number from 0 to numFolds - 1.
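As a rough illustration of the splitting rule quoted above, the sketch below applies the same regular expression in Python. The function name `split_columns` and the sample line are made up for this example, and stripping the whole line first is an assumption of the sketch, not something FAST is documented to do.

```python
import re

# The delimiter pattern quoted in the text: optional whitespace,
# one or more commas/tabs, optional whitespace.
COLUMN_DELIMITER = re.compile(r"\s*[,\t]+\s*")


def split_columns(line):
    """Split one data-file line into its column values (illustrative helper)."""
    # Stripping the line is an assumption made here to drop the trailing newline.
    return COLUMN_DELIMITER.split(line.strip())


print(split_columns("student1 , kc_add\t correct"))
```

Because the pattern uses `[,\t]+`, a run of consecutive delimiters is treated as a single separator, so an empty cell between two commas would not survive as its own column under this pattern.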

IMPORTANT:

  • The input has one line per observation. Lines only need to be sorted by time within each student, so the order of students or Knowledge Components (KCs) doesn't matter.
  • The test file shouldn't contain new KCs (HMMs): the code won't predict for a KC (HMM) that it wasn't trained on.
  • The test file should have the same feature columns (indicated by features_XXX or *features_XXX) as the training file.
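The last two constraints can be checked before running FAST. The following is a minimal sketch, assuming each file starts with a header row and that the KC column is literally named `KC`; both are assumptions of this example, and the helper names are hypothetical.

```python
def feature_columns(header):
    """Return the feature column names (features_XXX or *features_XXX)."""
    return {name for name in header if name.lstrip("*").startswith("features_")}


def check_test_against_train(train_header, train_rows, test_header, test_rows,
                             kc_column="KC"):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []

    # Feature columns present in one file but not in the other.
    differing = feature_columns(train_header) ^ feature_columns(test_header)
    if differing:
        problems.append("feature columns differ: " + ", ".join(sorted(differing)))

    # KCs that appear in the test rows but never in the training rows.
    train_kcs = {row[train_header.index(kc_column)] for row in train_rows}
    test_kcs = {row[test_header.index(kc_column)] for row in test_rows}
    unseen = test_kcs - train_kcs
    if unseen:
        problems.append("KCs only in test: " + ", ".join(sorted(unseen)))

    return problems


# Made-up sample data for illustration only.
header = ["student", "KC", "outcome", "features_hint"]
train = [["s1", "kc_add", "correct", "0"], ["s1", "kc_add", "incorrect", "1"]]
test = [["s2", "kc_add", "correct", "0"], ["s2", "kc_sub", "correct", "0"]]
print(check_test_against_train(header, train, header, test))
```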

MANDATORY COLUMNS:

  • student COLUMN: Integer or string.
    This column identifies sequences. All observations with the same "student" id will be placed in the same sequence.

  • KC COLUMN: Integer or string.
    This column identifies an HMM model. Currently each observation maps to exactly one KC, and the model learns parameters for each KC (HMM) individually. If you have multiple KCs per item/record, you can put them in as features (in feature COLUMNS) and specify a more coarse-grained KC name here in the KC column. See files with prefix "FAST+subskill".

  • outcome COLUMN: correct | incorrect
    We only support binary HMMs.

OPTIONAL COLUMNS:

  • feature COLUMNS: Feature columns should have prefix features_ or *features_.

    • Features must be numeric. This is not a limitation, because string or categorical variables can be coded as binary features. For example, if you have a single feature that may take the values red, green and blue, you could encode it as two binary features (red={0,1}, green={0,1}, with blue implied when both are 0), or as three binary features (red={0,1}, green={0,1}, blue={0,1}).
    • Features that are marked with a star (*) have coefficients shared by both latent states (mastery and not mastery). See files with prefix "FAST+subskill".
    • Features that do not have a star have a different coefficient for each latent state.
    • By default, FAST adds a bias feature to both hidden states. Don't put a bias (intercept) feature, i.e. a feature whose value is always 1, in the input. If you want to change the configuration of bias features, please specify bias in the configuration Java class file (Opts.java).
    • If some features are always 0 (never appear) in the current KC but may have value 1 in other KCs, then set them to NULL in the current KC's records. This matters for computing the gradient: if they are NULL, the code doesn't compute the gradient for those features for the current HMM (see files with prefix "FAST").
    • Although FAST currently has L2-regularization, standardizing or normalizing features to map them to smaller values can make the coefficients more directly interpretable and speed up training. However, depending on the distribution of feature values, standardization or normalization is sometimes unsuitable and will lower performance. Please do some experimentation.
  • problem COLUMN: You can put the problem/item/question name or id here. Both integers and strings are accepted.

  • step COLUMN: It doesn't matter what you put here so far. However, it may help you to check the results if you use this column to put information identifying the order of the records.

  • fold COLUMN: By default, all values are 1. If your data split uses the first records of a student-skill sequence for training and the remaining records for testing, then in the test0.txt file, give the records used for training the fold COLUMN value -1. See files with prefix "FAST+item_split_seq".
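To make the feature-column rules above more concrete, here is a small sketch that codes a categorical variable as binary feature columns and writes NULL for a feature that never appears in the current KC. The column names (`features_color_red`, etc.), the helper names, and the sample values are hypothetical; only the `features_` prefix and the NULL marker come from the description above.

```python
def one_hot(value, categories, prefix="features_color_"):
    """Code one categorical value as binary features, one column per listed category."""
    return {prefix + c: (1 if value == c else 0) for c in categories}


def format_row(columns, values, inactive_features=()):
    """Build one tab-delimited line; features inactive for this KC are written as NULL."""
    cells = []
    for name in columns:
        if name in inactive_features:
            cells.append("NULL")
        else:
            cells.append(str(values[name]))
    return "\t".join(cells)


# Two-feature coding: "blue" is implied when both red and green are 0.
columns = ["student", "KC", "outcome", "features_color_red", "features_color_green"]
values = {"student": "s1", "KC": "kc_add", "outcome": "correct"}
values.update(one_hot("green", ["red", "green"]))

print(format_row(columns, values))
# Same record, assuming features_color_red never appears for this KC.
print(format_row(columns, values, inactive_features={"features_color_red"}))
```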

#Configuration

  • We provide some example configuration files in input/XXX.conf with default values. You can add other options to the file as needed.

  • Here are the basic options:

    • modelName: FAST|KT. Chooses whether to run Knowledge Tracing or FAST. (Currently, when you run FAST, it only parameterizes the emission probabilities. We will soon release a version that also allows parameterizing the transition probabilities.)

    • allowForget: True|False. If allowForget=false, then p(forget)=0, i.e. p(unknown|known)=0.

    • differentRandomStartSeed: True|False. False by default. If set to true, FAST uses the current time in milliseconds as the random seed for initializing the parameters, which may lead to significantly different predictions (because EM can converge to different local optima).

    • inDir: input files' directory. By default, training file: train0.txt; testing file: test0.txt.

    • outDir: directory for the output prediction, evaluation, and log files.

    • allModelComparisonOutDir: directory for the output evaluation file that collects all previously run models, for comparing different models.

    • trainInFilePrefix: the prefix of training set file(s).

    • testInFilePrefix: the prefix of testing set file(s).

    • inFileSuffix: the file suffix of training and testing set file(s).

    • testSingleFile: If testSingleFile=true, then numFolds should be set to 1 and FAST runs on a single train and test pair. If testSingleFile=false, FAST automatically retrieves multiple train and test files according to the specified numFolds (e.g., if testSingleFile=false and numFolds=2, FAST looks for files built from trainInFilePrefix, testInFilePrefix and inFileSuffix with file ids 0 and 1, e.g., train0.txt, test0.txt, train1.txt, test1.txt).

    • numFolds: the number of train and test pairs, which FAST uses to automatically retrieve multiple train and test files.

    Files are named trainInFilePrefix (or testInFilePrefix) + id + inFileSuffix, where id is the current fold id (starting from 0). For example, if numFolds=5, trainInFilePrefix=train, testInFilePrefix=test, and inFileSuffix=.txt, then there should be 5 pairs of train and test files in inDir, named train0.txt, train1.txt ... train4.txt and test0.txt, test1.txt ... test4.txt.

  • For details of all configuration options, run "java -jar fast-1.0.2-final.jar -help" or "java fast/experimenter/Run -help" (under the target/classes directory), or see the src/hmmfeatures/Opts.java file.
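The file-naming rule for numFolds described above can be sketched as follows. This is only an illustration of the naming scheme (prefix + fold id + suffix, ids starting at 0), not FAST's own code; the function names and default arguments are assumptions chosen to mirror the trainInFilePrefix, testInFilePrefix, inFileSuffix and inDir options.

```python
from pathlib import Path


def expected_files(num_folds, train_prefix="train", test_prefix="test", suffix=".txt"):
    """Return the (train, test) file-name pair expected for each fold id 0..num_folds-1."""
    return [
        (f"{train_prefix}{fold}{suffix}", f"{test_prefix}{fold}{suffix}")
        for fold in range(num_folds)
    ]


def missing_files(in_dir, num_folds, **naming):
    """List expected file names that do not exist in in_dir (hypothetical pre-flight check)."""
    names = [name for pair in expected_files(num_folds, **naming) for name in pair]
    return [name for name in names if not (Path(in_dir) / name).exists()]


print(expected_files(2))
```

Running `missing_files` on the directory given as inDir before starting a multi-fold experiment is one way to catch a misnamed file early.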
