
intent_recognition module

The intent_recognition module contains the IntentRecognitionLearner class and can be used to recognize 20 intent categories of a person based on text. It is recommended to use the IntentRecognitionModule together with the SpeechTranscriptionModule to enable intent recognition based on transcribed speech. The module supports multimodal training on face (vision), speech (audio), and text data to facilitate improved unimodal inference on the text modality.

We provide data processing scripts and a pre-trained model for the MIntRec dataset. The class labels correspond to the following intent categories: 0 - Complain, 1 - Praise, 2 - Apologise, 3 - Thank, 4 - Criticize, 5 - Agree, 6 - Taunt, 7 - Flaunt, 8 - Joke, 9 - Oppose, 10 - Comfort, 11 - Care, 12 - Inform, 13 - Advise, 14 - Arrange, 15 - Introduce, 16 - Leave, 17 - Prevent, 18 - Greet, 19 - Ask for help.
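For convenience, the mapping above can also be kept in code; the helper below is purely illustrative and not part of the module:

    # Illustrative helper: maps MIntRec class ids to intent names (mirrors the list above)
    MINTREC_INTENTS = [
        'Complain', 'Praise', 'Apologise', 'Thank', 'Criticize',
        'Agree', 'Taunt', 'Flaunt', 'Joke', 'Oppose',
        'Comfort', 'Care', 'Inform', 'Advise', 'Arrange',
        'Introduce', 'Leave', 'Prevent', 'Greet', 'Ask for help'
    ]

    def intent_name(class_id):
        return MINTREC_INTENTS[class_id]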

Class IntentRecognitionLearner

The learner has the following public methods:

IntentRecognitionLearner constructor

IntentRecognitionLearner(self, text_backbone, mode, log_path, cache_path, results_path, output_path, device, benchmark)

Constructor parameters:

  • text_backbone: {"bert-base-uncased", "albert-base-v2", "prajjwal1/bert-small", "prajjwal1/bert-mini", "prajjwal1/bert-tiny"}, default="bert-base-uncased"
    Specifies the text backbone to be used. The name matches the corresponding Hugging Face Hub model, e.g., prajjwal1/bert-small.
  • mode: {'language', 'joint'}, default="joint"
    Specifies the modality of the model: 'language' corresponds to a text-only model, while 'joint' corresponds to a multimodal model with vision, audio, and language modalities trained jointly.
  • log_path: str, default="logs"
    Specifies the path where to store the logs.
  • cache_path: str, default="cache"
    Specifies the path for cache, mainly used for tokenizer files.
  • results_path: str, default="results"
    Specifies where to store the results (performance metrics).
  • output_path: str, default="outputs"
    Specifies where to store the outputs: trained models, predictions, etc.
  • device: str, default="cuda"
    Specifies the device to be used for training.
  • benchmark: {"MIntRec"}, default="MIntRec"
    Specifies the benchmark (dataset) to be used for training. The benchmark defines the class labels, feature dimensionalities, etc.
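A minimal construction sketch (all arguments correspond to the parameters listed above; the paths and device are illustrative):

    from opendr.perception.multimodal_human_centric import IntentRecognitionLearner

    # Multimodal (joint) learner with the default BERT backbone
    learner = IntentRecognitionLearner(text_backbone='bert-base-uncased',
                                       mode='joint',
                                       log_path='logs',
                                       cache_path='cache',
                                       results_path='results',
                                       output_path='outputs',
                                       device='cuda',
                                       benchmark='MIntRec')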

IntentRecognitionLearner.fit

IntentRecognitionLearner.fit(self, dataset, val_dataset, verbose, silent)

This method is used for training the algorithm on a training dataset and validating on a validation dataset.

Parameters:

  • dataset: object
    Object that holds the training dataset.
  • val_dataset : object, default=None
    Object that holds the validation dataset.
  • verbose : bool, default=False
    Enables verbosity.
  • silent : bool, default=False
    Enables training in silent mode, i.e., only critical output is produced.
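A minimal training sketch, assuming the learner was constructed as above and that train_dataset and val_dataset are MIntRecDataset objects as in the Examples section below:

    # Train on the training split and validate on the dev split
    learner.fit(train_dataset, val_dataset, verbose=True, silent=False)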

IntentRecognitionLearner.eval

IntentRecognitionLearner.eval(self, dataset, modality, verbose, silent, restore_best_model)

This method is used to evaluate a trained model on an evaluation dataset.

Parameters:

  • dataset : object
    Object that holds the evaluation dataset.
  • modality: str, {'audio', 'video', 'language', 'joint'}
    Specifies the modality to be used for inference. It should match the current training mode of the learner; for a learner trained in 'joint' (multimodal) mode, any modality can be used for inference, although we do not recommend using only video or only audio.
  • verbose: bool, default=False
    If True, provides detailed logs.
  • silent: bool, default=False
    If True, runs in silent mode, i.e., only critical output is produced.
  • restore_best_model : bool, default=False
    If True, the best model according to performance on the validation set will be loaded from self.output_path. If False, the current model state will be evaluated.
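A brief evaluation sketch, assuming a learner trained in 'joint' mode and a test_dataset object as in the Examples section below; the returned object holds the reported performance metrics:

    # Restore the best checkpoint (by validation performance) and evaluate on text only
    results = learner.eval(test_dataset, modality='language', restore_best_model=True)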

IntentRecognitionLearner.infer

IntentRecognitionLearner.infer(self, batch, modality)

This method is used to perform inference from a given language sequence (text). It returns a list of engine.target.Category objects, which contain class predictions and confidence scores for each sentence in the input sequence.

Parameters:

  • batch: dict
    Dictionary with input data with keys corresponding to modalities, e.g. {'text': 'Hello'}.
  • modality: str, default='language'
    Modality to be used for inference. Currently, inference from raw data is only supported for language modality (text).
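A short inference sketch; the attribute names on the returned engine.target.Category objects (data for the class id, confidence for the score) are assumptions that should be checked against the engine.target documentation:

    # Run text-only inference on a single sentence
    predictions = learner.infer({'text': 'Thank you so much for your help!'}, modality='language')

    for category in predictions:
        # 'data' is assumed to hold the predicted class id, 'confidence' the score
        print(category.data, category.confidence)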

IntentRecognitionLearner.save

IntentRecognitionLearner.save(self, path)

This method is used to save a trained model.

Parameters:

  • path: str
    Path to save the model.
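For example (the target path is illustrative):

    # Persist the current model state to disk
    learner.save('./saved_models/intent_model')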

IntentRecognitionLearner.load

IntentRecognitionLearner.load(self, path)

This method is used to load a previously saved model.

Parameters:

  • path: str
    Path of the model to be loaded.
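For example, restoring a model saved with the previous method:

    # Load a previously saved model into the learner
    learner.load('./saved_models/intent_model')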

IntentRecognitionLearner.download

IntentRecognitionLearner.download(self, path)

This method downloads the provided pre-trained model into 'path'.

Parameters:

  • path: str
    Specifies the folder where data will be downloaded.
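For example (the download folder is illustrative):

    # Fetch the provided pre-trained model into the given folder
    learner.download('./pretrained')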

IntentRecognitionLearner.trim

IntentRecognitionLearner.trim(self, modality)

This method is used to convert a model trained in a multimodal manner ('joint' mode) for unimodal inference. This will drop unnecessary layers corresponding to other modalities for computational efficiency.

Parameters:

  • modality: str, default='language'
    The modality to which to convert the model.
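A short sketch, assuming the learner was trained in 'joint' mode and test_dataset is defined as in the Examples section below:

    # Drop vision- and audio-specific layers, keeping only the language branch
    learner.trim('language')

    # Subsequent evaluation and inference use the text modality only
    out = learner.eval(test_dataset, 'language', restore_best_model=False)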

Examples

Additional configuration parameters/hyperparameters can be specified in intent_recognition_learner/algorithm/configs/mult_bert.py.

  • Training, evaluation and inference example

    from opendr.perception.multimodal_human_centric import IntentRecognitionLearner
    from opendr.perception.multimodal_human_centric.intent_recognition_learner.algorithm.data.mm_pre import MIntRecDataset
    
    if __name__ == '__main__':
      # Initialize the multimodal learner
      learner = IntentRecognitionLearner(text_backbone='bert-base-uncased', mode='joint', log_path='logs', cache_path='cache', results_path='results', output_path='outputs')
    
      # Initialize datasets
      train_dataset = MIntRecDataset(data_path='/path/to/data/', video_data_path='/path/to/video', audio_data_path='/path/to/audio', text_backbone='bert-base-uncased', split='train')
      val_dataset = MIntRecDataset(data_path='/path/to/data/', video_data_path='/path/to/video', audio_data_path='/path/to/audio', text_backbone='bert-base-uncased', split='dev')
      test_dataset = MIntRecDataset(data_path='/path/to/data/', video_data_path='/path/to/video', audio_data_path='/path/to/audio', text_backbone='bert-base-uncased', split='test')
    
      # Train the model
      learner.fit(train_dataset, val_dataset, silent=False, verbose=True)

      # Evaluate the best model (according to the validation set) on multimodal input
      out = learner.eval(test_dataset, 'joint', restore_best_model=True)

      # Evaluate the best model (according to the validation set) on text-only input
      out_l = learner.eval(test_dataset, 'language', restore_best_model=True)
    
      # Keep only the text-specific layers of the model and drop the rest
      learner.trim('language')
    
      # Evaluate the trimmed model. Should produce the same result as out_l.
      out_l_2 = learner.eval(test_dataset, 'language', restore_best_model=False)