Feature: Custom NER Inference Pipeline #34
Merged
Codecov Report
@@ Coverage Diff @@
## dev #34 +/- ##
======================================
Coverage ? 89.65%
======================================
Files ? 4
Lines ? 580
Branches ? 0
======================================
Hits ? 520
Misses ? 60
Partials ? 0
LGTM
lalital added a commit that referenced this pull request on Jul 13, 2021
* Patch/issue 28 tokenizers package conflict (#30)
* Change the versions of required packages to avoid the package conflict reported in issue #28. Reference (transformers required packages): https://github.com/huggingface/transformers/blob/v3.5.0/setup.py#L130
* Change the dev release version from 0.1.0dev2 to 0.1.1dev0
* Unspecify the version of pandas
* Bump up the version from 0.1.1dev0 to 0.1.1dev1
* GitHub workflow (#35)
* Add github workflow unittest
* Format YAML
* Create blank.yml (#36)
* Gh workflow (#37)
* Add github workflow unittest
* Format YAML
* Rename file
* GitHub workflow (#38)
* Add github workflow unittest
* Format YAML
* Rename file
* Rename file from testing.yml to unittest
* Remove duplicated library name `datasets`
* Change the version of sentencepiece from 0.1.94 to 0.1.91
* Change the version of tokenizers from 0.9.4 to 0.9.3 (transformers 3.5.0 depends on tokenizers==0.9.3)
* Delete testing.yml
* Delete blank.yml
* Update unittest.yml
* add workable qa notebook
* add squad_newmm metric
* add evaluation functions; minor fixes to normalize_answers
* refactor prepare_qa_xxxfeatures
* add notebook
* minor fix to notebook
* add qa training script
* run notebook for good output
* change model_max_length to optional
* model_max_length 512 to 416
* Rename filename and module name from `unittest` to `test`
* Update test.yml
* Add argument local_rank
* Update condition
* Add return statement
* Specify seed to torch and numpy
* Set torch.backends.cudnn to be deterministic
* Import module
* add combine_iapp_thaiqa.py
* Feature: Custom NER Inference Pipeline (#34)
* Ignore tmp directory
* Implement TokenClassificationPipeline (NER) and add test cases
* Edit incorrect assertion
* Edit incorrect assertion
* Remove overspecified condition
* Remove overlapped condition (strict=False)
* Initialize TokenClassificationPipeline instance in setUp() method
* Refer to self.base_pipeline initialized in setUp() method
* Add test case for `_merged_pred` private method
* Add multi-sentence inference and test cases. Currently, NER model inference is performed sequentially.
* Fix rule for strict group_entities, replace all -I to O
* Fix rule for strict group_entities, replace all -I to O
* Handle when 'O' is the beginning of sentence
* Remove unused code
* Edit assertion
* Remove debugging messages
* Replace special symbol `space_token` with space " "
* Add test case
* Fix error, specifying wrong operator (should be = , not +=)
* Edit assertion
* Support two types of IOB prefix: B-, B_, I-, and I_
* Refer to variable instead of hardcoded IOB prefix
* Add test case for another type of IOB prefix: I and B followed by an underscore (e.g. I_MEA, B_MEA)
* Add required library
* Specify version of seqeval
* Add test case for model trained on LST20
* Add additional class argument to handle custom tag_delimeter and BIOE tag scheme. Use seqeval to group entities.
* Fix incorrect reference to seqeval.scheme.Entity object
* Convert attribute in Entity object to a tuple
* Edit test sentence
* Add @unittest.skip
* Remove @unittest.skip
* Edit test case
* Set pipeline.strict to True
* Output non-entity tag, 'O'
* Remove debugging message
* Remove @unittest.skip('not implement')
* Add new test cases for BIOE tag (LST20)
* Add condition to handle non-strict entity grouping
* Add script for language model finetuning on XNLI dataset (Thai sentence pairs) (#42)
* Ignore tmp directory
* Add language model finetuning script on XNLI dataset (only Thai sentence pairs)
* add allow_no_answer flag
* edit qa training script to include allow_no_answer flag
* convert train_question_answering_lm_finetuning.ipynb to colab version
* feature: token classification pipeline, POS tagging (#46)
* refactor: rename class attributes
  - base_pipeline -> thainer_ner_pipeline
  - lst20_base_pipeline -> lst20_ner_pipeline
* Perform entity grouping only if `scheme` is specified
* Change condition from `and or self.scheme != None:` to `or self.scheme == None:`
* Add test case for POS tagging with finetuned `wangchanberta-base-att-spm-uncased` on LST20 corpus (POS)
* Add option for text file
* Use swifter
* Update debug message
* Set default value to False
* Add tqdm
* Change version
* Add adam beta args
* Fix error
* Change argument name to evaluation_strategy
* Add deepspeed argument
* Add deepspeed config
* Move prediction_loss_only arg to TrainingArguments
* Add run name
* Add train_micro_batch_size_per_gpu
* Set total_num_steps to 50k
* Set amp
* Remove amp
* Set default value to None
* Remove deepspeed
* Add deepspeed
* Change gradient_accumulation_steps, total_num_step, warmup_num_steps
* Change zero_optimization to stage 2
* Divide parameters by 8
* Revert
* Add new config file (ds) that compensates for the global step (divide total_num_steps by 4, 24000 / 4 = 6000)
* Rename config file
  - Adjust Adam epsilon to 1e-6
* Load pretrained model via from_pretrained
* Load model from checkpoint with from_pretrained
* Add resume_from_checkpoint
* Add zero-3 config
* Change train_micro_batch_size_per_gpu from 128 to 64
* Add zero optimization stage 2 configuration
* Update config zero 3
* Change bz to 32
* Change bz to 32
* Change bz to 32
* Rename file
* Change optimizer type to Adam
* Change warmup_num_steps to 2400 from 1250
* Change max LR
* Change Max LR
* Change max LR
* Add config for 1cycle LR
* Set cycle_max_mom to 0.999
* Set decay_step_size to 0
* Change cycle_first_step_size and cycle_second_step_size
* Set cycle_max_mom to 0.99
* Set cycle_max_mom to 0.9
* Add new DS config (max step = 50k, warmup = 5k)
* Change beta2 to 0.98
* Set max steps to 31250
* Rename file
* Change cycle_second_step_size
* Pass train_max_length and eval_max_length to MLMDataset instance
* Remove redundant argument passing
* Add zero-3 config
* Rename file
* Update config, change batch size to 64
* Change max LR
* Change peak LR
* Not offload param
* Add new config
  - train_batch_size = 8064
  - train_micro_batch_size_per_gpu = 48
  - gradient_accumulation_steps = 21
* Add config: bz=40, grad_acc=25, train_batch_size=8000
* Add new config
* Change bz to 44 and peak LR to 3e-4
* Change beta2 to 0.99
* Change beta2 to 0.999
* Change config
* Add new config
* Fix bz
* Rename file
* Add config for 8 GPUs
* Fix incorrect value of gradient_accumulation_steps
* Fix decay_step_size and decay_lr_rate
* Change LR
* Change
* Update config
* Add config for thwiki+news pretraining
* Set decay_lr_rate to 0
* Change bz
* Add ds_legal-bert-v3 config
  - Set cycle_min_lr to 3e-8 instead of 0.0
  - Set cycle_min_mom to 0.9 instead of 0.85
* Warm up for the first 5,000 steps, then linearly decay to 3e-8 over 45,000 steps
* Add an MLM pretraining script that supports pretraining any model architecture. This script replaces `train_mlm_camembert_thai.ddp.py` and `train_mlm_camemberta_thai.py`, which apply only to RoBERTa pretraining and require the ArgumentParser arguments to be added manually to match the new version of transformers
* Add arguments for DataTrainingArguments as follows.
  - train_max_length
  - eval_max_length
* Set default value of `do_lower_case` to False
* Call main function
* Test if trainer is process zero with is_world_process_zero
* Initialize CamembertTokenizer
* Initialize model from Config with `from_config` class method
* Fix typo, change from `binarized_path_val` to `binarized_path_eval`
* Add new config for deberta-base pretraining on thwiki+news
* Set beta2 to 0.999
* Change batch size
* Rename file
* Add config with bz=40
* Rename file
* Change bz to 32, effective bz to 4096
* Add configuration with effective bz = 4080, and per-device bz = 34
* Add new config
  - "train_batch_size": 4032,
  - "train_micro_batch_size_per_gpu": 28,
  - "gradient_accumulation_steps": 18,
* Add config for effective bz = 4032
  - "train_batch_size": 4032,
  - "train_micro_batch_size_per_gpu": 24,
  - "gradient_accumulation_steps": 21,
* Rename file
* set logging level to debug
* Add debugging message
* Print debugging message only on main process
* Add new config with effective bz of 4032 for 4x GPUs
* Set a number of zero_optimization parameters to True
  - cpu_offload
  - contiguous_gradients
  - overlap_comm
* Rename file
* Change bz
* Update bz
* Rename
* Change bz
* Add iapp_thaiqa dataset directory to .gitignore
* Implement DataCollatorForSpanLevelMask for span-level masking
* Add option to choose masking strategy, either subword-level or span-level
* Fix typing checking
* Fix incorrect data structure for metadata field
* Access value of enum variable
* Access to the value of enum variable
* Edit
* Fix typo
* Add argument to specify symbol representing space token
* Get the vocab_size from what the tokenizer actually loaded, which includes additional_special_tokens
* Pass vocab_size to AutoConfig.from_pretrained to override default vocab_size
* Fix data structure accessing error and remove debugging message
* Assign additional_special_tokens in CamembertTokenizer.from_pretrained
* Wrap to torch.LongTensor
* Set the DataCollatorForSpanLevelMask to not perform token masking on "pad_token"
* Pass pad_to_multiple_of to the super class (DataCollatorForLanguageModeling)
* fix logical error: Fix a logical error where the _mask_tokens function does not exclude special tokens from token masking. The statement at L77 is incorrect: `indices = [i for i in range(len(tokens)) if tokens[i] not in self.special_token_ids]`. The left operand `tokens[i]` is a word token (str) while the right operand is a list of token IDs (List[int])
* Change dataclass init function
* Limit the upper bound of num_to_predict by the total number of input tokens (excluding special_tokens)
* Fix incorrect num_to_predict
* Add DeBERTa base config
* Move DS config to another directory
* Remove unused config file
* Rename directory
* Remove config file
* Edit script to rename output config file as specified in -o argument
* Add DeBERTa v1 config files

Co-authored-by: cstorm125 <[email protected]>
Co-authored-by: Charin <[email protected]>
Co-authored-by: Charin <[email protected]>
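The "fix logical error" entry in the commit message above describes a comparison between a token string and a list of token IDs, which never matches and so never excludes special tokens. The following is a minimal sketch of that bug and one possible fix, not the repository's actual code: the function names `maskable_indices` and `cap_num_to_predict`, the token strings, and all ID values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def maskable_indices(token_ids: Sequence[int], special_token_ids: Set[int]) -> List[int]:
    """Return positions eligible for masking by comparing IDs with IDs."""
    return [i for i, token_id in enumerate(token_ids) if token_id not in special_token_ids]


def cap_num_to_predict(requested: int, token_ids: Sequence[int],
                       special_token_ids: Set[int]) -> int:
    """Never ask for more masked positions than there are non-special tokens."""
    return min(requested, len(maskable_indices(token_ids, special_token_ids)))


# Hypothetical vocabulary: 0 = <s>, 1 = <pad>, 2 = </s>.
special_token_ids = {0, 1, 2}
tokens = ["<s>", "tok_a", "tok_b", "tok_c", "</s>", "<pad>", "<pad>"]
token_ids = [0, 57, 912, 33, 2, 1, 1]

# Buggy form quoted in the commit message: a str is never "in" a set of ints,
# so every position, including special tokens, is kept.
buggy = [i for i in range(len(tokens)) if tokens[i] not in special_token_ids]

# Fixed form: compare token IDs against special-token IDs.
fixed = maskable_indices(token_ids, special_token_ids)

print(buggy)  # [0, 1, 2, 3, 4, 5, 6]
print(fixed)  # [1, 2, 3]
print(cap_num_to_predict(5, token_ids, special_token_ids))  # 3
```

The cap in `cap_num_to_predict` mirrors the later commit entry about limiting `num_to_predict` by the number of non-special input tokens; how the real collator computes the requested count is not shown in this page.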
Issue: #32
Proposed solution:
NER Pipeline Demo (via Colab): https://colab.research.google.com/drive/1-54NeM_wsjitaiSXfMBpcnqzbPMR0a9R#scrollTo=VzSGZbwWaiOI
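As a rough illustration of the entity grouping this PR describes (IOB prefixes with either `-` or `_` as the delimiter, non-entity `O` spans kept in the output, and non-strict handling of an `I` tag that has no preceding `B`), here is a minimal pure-Python sketch. It is not the pipeline's implementation, which per the commit log uses seqeval for grouping; the helper names `split_tag` and `group_entities` and the sample tokens are assumptions made for this example.

```python
from typing import List, Tuple


def split_tag(tag: str, delimiters: str = "-_") -> Tuple[str, str]:
    """Split 'B-PER' or 'B_PER' into ('B', 'PER'); anything else is treated as 'O'."""
    if len(tag) > 2 and tag[0] in "BI" and tag[1] in delimiters:
        return tag[0], tag[2:]
    return "O", "O"


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens with the same label into (text, label) pairs.

    Non-strict: an 'I' tag with no open entity of the same label starts a new one.
    """
    groups: List[List[str]] = []
    for token, tag in zip(tokens, tags):
        prefix, label = split_tag(tag)
        if groups and prefix != "B" and groups[-1][1] == label:
            groups[-1][0] += token  # continue the current span
        else:
            groups.append([token, label])  # open a new span
    return [(text, label) for text, label in groups]


# Hypothetical sample input mixing both delimiter styles.
tokens = ["John", " ", "Smith", " ", "ran", " ", "5", "km"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "B_MEA", "I_MEA"]

print(group_entities(tokens, tags))
# [('John Smith', 'PER'), (' ran ', 'O'), ('5km', 'MEA')]
```

A strict variant would instead relabel an `I` tag that does not follow a matching `B` as `O` before grouping, which is what the "Fix rule for strict group_entities" commits appear to describe.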
Added files: