
SpanMarker library with document-level context gives an error (RuntimeError: CUDA error: device-side assert triggered) #45

Open
rudyrdx opened this issue Nov 15, 2023 · 3 comments

Comments


rudyrdx commented Nov 15, 2023

Training fails with this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
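
One way to get a more precise traceback (a minimal sketch; the assumption is that the variable is set before any CUDA work happens, e.g. at the very top of the training script):

import os
# Make CUDA errors surface at the failing call instead of a later, unrelated one.
# Alternatively, running the same script on CPU reports the underlying indexing error directly.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"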

Fine-tuning code:

from datasets import load_dataset, Dataset
dataset = load_dataset("json", data_files=["output.jsonl"])
from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained(
    "bert-base-uncased",  # Example encoder
    labels=['O', 'Degree', 'Years_of_Experience', 'Email_Address',
        'College_Name', 'Location', 'Designation', 'Graduation_Year', 'Skills', 'Name',
        'Companies_worked_at'],
    max_prev_context=2,
    max_next_context=2,
)
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="models/RUDYRDX-NER-1",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
)
from span_marker import Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
)
trainer.train() # error happens when this runs

Dataset Sample:

{"document_id": 0, "sentence_id": 0, "tokens": ["Govardhana", "K", "Senior", "Software", "Engineer", "Bengaluru", "Karnataka", "Karnataka", "-", "Email", "Indeed", ":", "indeed.com/r/Govardhana-K/", "b2de315d95905b68", "Total", "experience", "5", "Years", "6", "Months", "Cloud", "Lending", "Solutions", "INC", "4", "Month", "Salesforce", "Developer", "Oracle", "5", "Years", "2", "Month", "Core", "Java", "Developer", "Languages", "Core", "Java", "Go", "Lang", "Oracle", "PL-SQL", "programming", "Sales", "Force", "Developer", "APEX", "."], "ner_tags": ["Name", "Designation", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"document_id": 0, "sentence_id": 1, "tokens": ["Designations", "&", "Promotions", "Willing", "relocate", ":", "Anywhere", "WORK", "EXPERIENCE", "Senior", "Software", "Engineer", "Cloud", "Lending", "Solutions", "-", "Bangalore", "Karnataka", "-", "January", "2018", "Present", "Present", "Senior", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2016", "December", "2017", "Staff", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "January", "2014", "October", "2016", "Associate", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2012", "December", "2013", "EDUCATION", "B.E", "Computer", "Science", "Engineering", "Adithya", "Institute", "Technology", "-", "Tamil", "Nadu", "September", "2008", "June", "2012", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "SKILLS", "APEX", "."], "ner_tags": ["Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation"]}
{"document_id": 0, "sentence_id": 2, "tokens": ["(", "Less", "1", "year", ")", "Data", "Structures", "(", "3", "years", ")", "FLEXCUBE", "(", "5", "years", ")", "Oracle", "(", "5", "years", ")", "Algorithms", "(", "3", "years", ")", "LINKS", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/", "ADDITIONAL", "INFORMATION", "Technical", "Proficiency", ":", "Languages", ":", "Core", "Java", "Go", "Lang", "Data", "Structures", "&", "Algorithms", "Oracle", "PL-SQL", "programming", "Sales", "Force", "APEX", "."], "ner_tags": ["Name", "Name", "Name", "Designation", "Designation", "Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O"]}
{"document_id": 0, "sentence_id": 3, "tokens": ["Tools", ":", "RADTool", "Jdeveloper", "NetBeans", "Eclipse", "SQL", "developer", "PL/SQL", "Developer", "WinSCP", "Putty", "Web", "Technologies", ":", "JavaScript", "XML", "HTML", "Webservice", "Operating", "Systems", ":", "Linux", "Windows", "Version", "control", "system", "SVN", "&", "Git-Hub", "Databases", ":", "Oracle", "Middleware", ":", "Web", "logic", "OC4J", "Product", "FLEXCUBE", ":", "Oracle", "FLEXCUBE", "Versions", "10.x", "11.x", "12.x", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/"], "ner_tags": ["Name", "Name", "Designation", "Designation", "Designation", "Location", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O"]}
@jackboyla

I believe the error is coming up because the ner_tags actually need to be ints. This error usually appears when PyTorch encounters an indexing mismatch.

I had some trouble with this myself and found that mapping the string ner_tags to an ID fixed the issue.

When you instantiate a SpanMarker model, the config already creates this map for you, using the list of labels you provide. You can see it by calling model.config.__getattribute__("encoder")["label2id"].
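
A minimal sketch of that mapping, assuming the dataset columns are named tokens and ner_tags as in the sample above and that every tag string appears in the labels list passed to from_pretrained:

# Convert string ner_tags to integer ids using the model's own label-to-id mapping.
label2id = model.config.__getattribute__("encoder")["label2id"]

def encode_tags(example):
    # A KeyError here means a tag in the data is missing from the labels list.
    example["ner_tags"] = [label2id[tag] for tag in example["ner_tags"]]
    return example

dataset = dataset.map(encode_tags)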

@tomaarsen I think this should be explicitly mentioned somewhere in the repo since the errors don't make it clear what's gone wrong when integers aren't provided.


rudyrdx commented Nov 17, 2023

@jackboyla Thanks for letting me know, I will try it.


rudyrdx commented Dec 1, 2023

Hi, I went through my training data again and noticed that the spans were wrong. When I split the data by word length and then tried to generate NER tags for the resulting sentences, the spans did not line up: the start and end NER tags were wrong for the sentences. I couldn't work out a fix, so I switched to spaCy and was able to get NER working there (not token classification but sentence/paragraph-level classification). Now I want to try this again with SpanMarker, so I will update this issue after checking whether the problem was the numeric IDs or something else.
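
In case it helps, a quick sanity check for the span problem (a sketch, assuming the output.jsonl layout shown above) is to verify that every sentence has exactly one tag per token before training:

import json

# Flag any sentence whose tag count does not match its token count.
with open("output.jsonl") as f:
    for line_no, line in enumerate(f, 1):
        row = json.loads(line)
        if len(row["tokens"]) != len(row["ner_tags"]):
            print(f"line {line_no}: {len(row['tokens'])} tokens vs {len(row['ner_tags'])} tags")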
