-
-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs(blog): data poisoning article #2566
base: main
Are you sure you want to change the base?
Conversation
TestGru AssignmentSummary
Tip You can |
Images automagically compressed by Calibre's image-actions ✨ Compression reduced images by 11.2%, saving 47.11 KB.
168 images did not require optimisation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on this article. You’ve done a wonderful job distilling your thoughts and presenting them clearly. I’ve left far too many comments. They’re mostly suggestions to consider at your discretion (feel free to ignore). One thing to think about is our target audience. Make sure terms and concepts are familiar to whoever you believe will read this. I really appreciate all your hard work, and I’m excited to see how this shapes up. Keep it up!
|
||
# Defending Against Data Poisoning Attacks on LLMs: A Comprehensive Guide | ||
|
||
Data poisoning remains a top concern on the [OWASP Top 10 for 2025](https://owasp.org/www-project-top-10-for-large-language-model-applications/). However, the scope of data poisoning has expanded since the 2023 version. Data poisoning is no longer strictly a risk during the training of Large Language Models (LLMs); it now encompasses all three stages of the LLM lifecycle: pre-training, fine-tuning, and retrieval from external sources. OWASP also highlights the risk of model poisoning from shared repositories or open-source platforms, where models may contain backdoors or embedded malware. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data poisoning remains a top concern on the [OWASP Top 10 for 2025](https://owasp.org/www-project-top-10-for-large-language-model-applications/). However, the scope of data poisoning has expanded since the 2023 version. Data poisoning is no longer strictly a risk during the training of Large Language Models (LLMs); it now encompasses all three stages of the LLM lifecycle: pre-training, fine-tuning, and retrieval from external sources. OWASP also highlights the risk of model poisoning from shared repositories or open-source platforms, where models may contain backdoors or embedded malware. | |
Data poisoning remains a top concern on the [OWASP Top 10 for 2025](https://owasp.org/www-project-top-10-for-large-language-model-applications/). However, the scope of data poisoning has expanded since the 2023 version. Data poisoning is no longer strictly a risk during the training of Large Language Models (LLMs); it now encompasses all stages of the LLM lifecycle, including: pre-training, fine-tuning, and retrieval from external sources. OWASP also highlights the risk of model poisoning from shared repositories or open-source platforms, where models may contain backdoors or embedded malware. |
|
||
When exploited, data poisoning can degrade model performance, produce biased or toxic content, exploit downstream systems, or tamper with the model’s generation capabilities. | ||
|
||
Understanding how these attacks work and implementing preventative measures is crucial for developers, security engineers, and technical leaders responsible for maintaining the security and reliability of these systems. This comprehensive guide delves into the nature of data poisoning attacks and offers strategies to safeguard against these threats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please take another pass at this section?
|
||
Data poisoning attacks are malicious attempts to corrupt the training data of an LLM, thereby influencing the model's behavior in undesirable ways. These attacks typically manifest in three primary forms: | ||
|
||
1. **Poisoning the Training Dataset**: Attackers insert malicious data into the training set during pre-training or fine-tuning, causing the model to learn incorrect associations or behaviors. This can lead to the model making erroneous predictions or becoming susceptible to specific triggers. They may also create backdoors, where they poison the training dataset to cause the model to behave normally under typical conditions but produce attacker-chosen outputs when presented with certain triggers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i know this is a different list from the one in your intro but it is similar enough that it's confusing to me (there's overlap). maybe not an actionable nit but wanted to share.
|
||
## Detection and Prevention Strategies | ||
|
||
To protect your LLM applications from [LLM vulnerabilities](https://www.promptfoo.dev/docs/red-team/llm-vulnerability-types/), including data poisoning attacks, it's essential to implement a comprehensive set of detection and prevention measures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rephrase
|
||
### Implement Data Validation and Tracking to Mitigate Risk of Data Poisoning | ||
|
||
- **Enforce Sandboxing**: Implement sandboxing to restrict model exposure to untrusted data sources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this make sense in an LLM context?
|
||
- **Enforce Sandboxing**: Implement sandboxing to restrict model exposure to untrusted data sources. | ||
- **Track Data Origins**: Use tools like OWASP CycloneDX or ML-BOM to track data origins and transformations. | ||
- **Use Data Versioning**: Use a version control system to track changes in datasets and detect manipulation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider rephrasing. I am not sure if this is useful detail in this article. It may be table stakes.
Regularly monitor the outputs of your LLM for signs of unusual or undesirable behavior. | ||
|
||
- **Implement Tracing**: LLM tracing provides a detailed snapshot of the decision-making and thought processes within LLMs as they generate responses. Tracing can help you monitor, debug, and understand the execution of an LLM application | ||
- **Use Golden Datasets**: Golden datasets in LLMs are high-quality, carefully curated collections of data used to evaluate and benchmark the performance of large language models. Use these datasets as a "ground truth" to evaluate the performance of your models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pre-deployment evals with promptfoo! consider stating this first.
|
||
- **Lock Down Access**: Restrict access to LLM repositories and implement robust monitoring to mitigate the risk of insider threats. | ||
- Access to training data should be restricted based on least privilege and need-to-know. Access should be recertified on a regular cadence (such as quarterly) to account for employee turnover or job changes. | ||
- All access should be logged and audited. Developer access should be limited to the minimum necessary to perform their job and access should be revoked when they leave the organization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the right level of detail for your audience?
|
||
- **Vet Your Sources**: Conduct thorough due diligence on model providers and training data sources. | ||
- Review model cards and documentation to understand the model's training processes and performance. You can learn more about this in our [foundation model security](https://www.promptfoo.dev/blog/foundation-model-security/) blog post. | ||
- Verify that models downloaded from Hugging Face [pass their malware scans](https://huggingface.co/docs/hub/en/security-malware) and [pickling scans](https://huggingface.co/docs/hub/en/security-pickle). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is good
### Red Team LLM Applications to Detect Data Poisoning | ||
|
||
- **Model Red Teaming**: Run an initial [red team](https://www.promptfoo.dev/docs/red-team/) assessment against any models pulled from shared or public repositories like Hugging Face. | ||
- **Assess Bias**: In Promptfoo's eval framework, use Promptfoo's [classifier assert type](https://www.promptfoo.dev/docs/configuration/expected-outputs/classifier/#bias-detection-example) to assess grounding, factuality, and bias in models pulled from Hugging Face. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not just related to hugging face this can be run on any output. model is hosted on hugging face
Addition of article on data poisoning.