Can AI-Powered Coding Assistants Keep Your Software Secure? What the Research Says

Large Language Models (LLMs) have revolutionized software development by helping developers generate, debug, and refactor code in seconds. But just how reliable are these AI helpers when it comes to writing secure code? Could they, instead of improving software security, be unknowingly introducing vulnerabilities that put real-world applications at risk?

A recent study, Do LLMs Consider Security?, takes a deep dive into this crucial question. Researchers tested three industry-leading AI models—GPT-4, Claude 3, and Llama 3—to see whether they proactively warn developers about security flaws in coding questions pulled from Stack Overflow. The results? A mixed bag. While some models do provide security insights, most frequently overlook vulnerabilities unless explicitly prompted. Worse, they might even reinforce insecure coding practices by failing to alert users to critical risks.

Let’s break down the study’s findings, what they mean for developers using AI as coding assistants, and how you can get safer, more informed responses from these tools.


Do AI Coding Assistants Spot Security Issues?

Developers increasingly rely on models like ChatGPT, Claude, and Llama to assist with coding tasks such as debugging, code generation, and refactoring. But security is often neglected in the rush to get working code. Even experienced developers miss vulnerabilities, and it turns out, so do AI models.

The study set out to measure whether AI assistants recognize and flag security risks in code without being explicitly asked to do so. Researchers followed a simple yet effective approach:

  1. They collected 300 programming-related questions from Stack Overflow, each containing security vulnerabilities.
  2. They tested the ability of GPT-4, Claude 3, and Llama 3 to flag these security risks.
  3. They compared AI-generated responses to human-provided answers on Stack Overflow, assessing how well the models explained the vulnerabilities and suggested fixes (a simplified sketch of this evaluation loop follows the list).
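To make that setup concrete, here is a simplified sketch of what such an evaluation loop could look like. Everything in it is illustrative: the inline sample questions, the query_model stub, and the keyword check stand in for the researchers' actual pipeline and for the human judgement they used to grade responses.

```python
def query_model(model_name: str, question: str) -> str:
    # Stand-in for a real LLM call; swap in the client you actually use.
    return f"[{model_name}] example answer to: {question[:40]}..."

def mentions_security(answer: str) -> bool:
    # Crude keyword proxy; the study relied on human judgement, not string matching.
    keywords = ("vulnerab", "injection", "sanitiz", "insecure", "cwe")
    return any(k in answer.lower() for k in keywords)

# Tiny inline sample standing in for the study's 300 Stack Overflow questions.
questions = [
    {"id": 1, "body": "How do I build an SQL query from user input in Python?", "cwe": "CWE-89"},
    {"id": 2, "body": "Why does my script fail when I hard-code the API key?", "cwe": "CWE-798"},
]

models = ["gpt-4", "claude-3", "llama-3"]
flagged = {m: 0 for m in models}

for q in questions:
    for m in models:
        if mentions_security(query_model(m, q["body"])):
            flagged[m] += 1

for m in models:
    print(f"{m}: warned on {flagged[m]}/{len(questions)} questions")
```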

The Results: AI Models Often Miss the Mark on Security

How Often Do LLMs Warn About Security Issues?

Surprisingly, LLMs rarely pointed out security issues on their own. Across 300 tested questions:

  • Even the best-performing model, GPT-4, only warned developers about vulnerabilities 40% of the time.
  • Performance dropped significantly with more obscure or reworded vulnerabilities, with AI models detecting just 13.7% of security risks when questions were slightly modified.
  • Llama 3 and Claude 3 performed even worse, often failing to flag risks even when they were obvious to human reviewers.

This means that for every 10 programming queries containing a critical security flaw, most AI models will let at least 6 go unnoticed, leaving developers unknowingly exposed to cybersecurity risks.


What Types of Security Risks Did AI Models Detect?

The study found that LLMs are much better at spotting some classes of vulnerability than others.

LLMs were more likely to flag:
- Hard-coded passwords & API keys (CWE-798)
- Hard-coded cryptographic keys (CWE-321)
- Logging sensitive information (CWE-532)

However, they were particularly bad at detecting:
- Uncontrolled resource consumption (Denial of Service risks)
- Path traversal vulnerabilities
- Cross-Site Scripting (XSS) & SQL Injection in poorly structured queries

Often, the models answered the coding question correctly but said nothing about its security implications.
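To make these categories concrete, here is a small, deliberately insecure Python sketch of our own (not code from the study). It mixes issues from both lists: a hard-coded credential and sensitive logging, which models tended to flag, and a path traversal bug of the kind they tended to miss.

```python
# Deliberately insecure snippets for illustration only; none of this comes from the study.
import logging
import os

API_KEY = "sk-live-123456"  # CWE-798: hard-coded credential; safer: os.environ["API_KEY"]

def login(user: str, password: str) -> None:
    # CWE-532: logging sensitive information, also relatively likely to be flagged.
    logging.info("login attempt user=%s password=%s", user, password)

def save_upload(filename: str, data: bytes, upload_dir: str = "uploads") -> str:
    # Path traversal, the kind of issue the study says models often missed:
    # a filename like "../../etc/cron.d/job" escapes upload_dir.
    path = os.path.join(upload_dir, filename)
    with open(path, "wb") as f:
        f.write(data)
    return path

def save_upload_safe(filename: str, data: bytes, upload_dir: str = "uploads") -> str:
    # Safer sketch: strip directory components and confirm the resolved path
    # still lives inside upload_dir before writing.
    base = os.path.realpath(upload_dir)
    path = os.path.realpath(os.path.join(base, os.path.basename(filename)))
    if not path.startswith(base + os.sep):
        raise ValueError("invalid filename")
    with open(path, "wb") as f:
        f.write(data)
    return path
```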


AI vs. Human Answers: Who Does Security Better?

Interestingly, when security flaws were flagged, AI models often provided more detailed responses than Stack Overflow (SO) users. SO answers tended to contain brief warnings, while AI-generated responses offered detailed explanations, potential exploits, and fixes.

For example, if a developer asked for help debugging database queries, a Stack Overflow response might contain:
➡️ "You should use prepared statements to avoid SQL injection."

Meanwhile, GPT-4 might provide:
➡️ "Your current implementation is vulnerable to SQL injection because user input is directly appended to the query string. An attacker could exploit this to execute arbitrary commands. Use parameterized queries or prepared statements to mitigate this risk."

However, this level of detail only appeared when the AI model actually flagged the vulnerability—which, as the study showed, happened far too infrequently.
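If you want to see the fix GPT-4 is describing as running code, here is a minimal sketch using Python's standard-library sqlite3 module. The snippet is ours, not one from the study:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "alice@example.com"), ("bob", "bob@example.com")],
)

user_input = "' OR '1'='1"  # attacker-controlled value

# Vulnerable: the input is concatenated into the SQL string, so the quote in
# user_input rewrites the query's logic (classic SQL injection).
rows = conn.execute(
    "SELECT email FROM users WHERE name = '" + user_input + "'"
).fetchall()
print("string-built query returned:", rows)   # every user's email leaks

# Safer: a parameterized query treats the input purely as data.
rows = conn.execute(
    "SELECT email FROM users WHERE name = ?", (user_input,)
).fetchall()
print("parameterized query returned:", rows)  # [] because no user has that literal name
```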


How Developers Can Get Security-Aware AI Responses

Given that AI coding assistants frequently ignore security concerns, what can developers do?

1. Use Prompt Engineering to Force Security Awareness

One simple trick that significantly improved AI performance in the study? Explicitly asking AI to check for security vulnerabilities.

Researchers found that adding “Address security vulnerabilities” to the end of a prompt substantially increased the likelihood of AI models flagging risks.

🔹 Before: "How do I modify this Python script to handle user input?"
🔹 After: "How do I modify this Python script to handle user input? Address security vulnerabilities."

The result? GPT-4 flagged 76% more vulnerabilities when this phrase was included.
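One low-effort way to make this a habit is to wrap every coding prompt in a small helper that appends the phrase automatically. The sketch below is generic: security_aware_prompt and SECURITY_SUFFIX are names we made up, and the returned string would be passed to whichever LLM client you use.

```python
SECURITY_SUFFIX = "Address security vulnerabilities."

def security_aware_prompt(prompt: str) -> str:
    # Append the study's security-awareness phrase unless it is already present.
    prompt = prompt.rstrip()
    if SECURITY_SUFFIX.lower() not in prompt.lower():
        prompt = f"{prompt} {SECURITY_SUFFIX}"
    return prompt

# Example:
print(security_aware_prompt("How do I modify this Python script to handle user input?"))
# -> "How do I modify this Python script to handle user input? Address security vulnerabilities."
```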

2. Pair AI with Static Analysis Tools

LLMs aren’t foolproof, but combining them with static analysis tools like CodeQL helped improve security accuracy. The study found that when CodeQL-generated warnings were included in prompts, AI was far more consistent in flagging vulnerabilities and suggesting proper fixes.

If you’re a developer using AI coding assistants, consider:
✅ Running your code through a static analysis tool first.
✅ Feeding the scanner's warnings back into the AI model with a follow-up query (see the sketch after this list).
✅ Asking the AI for security best practices before implementing its suggestions.
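Here is one way that feedback loop might look in practice. CodeQL, like most modern scanners, can emit findings in the SARIF JSON format, so the sketch below reads a SARIF file and turns each finding into a security-focused follow-up prompt. The file name, helper names, and prompt wording are our own assumptions rather than anything prescribed by the study.

```python
import json

def sarif_findings(sarif_path: str):
    # Yield (rule_id, message, file, line) tuples from a SARIF results file.
    with open(sarif_path) as f:
        sarif = json.load(f)
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = (result.get("locations") or [{}])[0].get("physicalLocation", {})
            yield (
                result.get("ruleId", "unknown-rule"),
                result.get("message", {}).get("text", ""),
                loc.get("artifactLocation", {}).get("uri", "unknown-file"),
                loc.get("region", {}).get("startLine", 0),
            )

def follow_up_prompt(rule_id: str, message: str, path: str, line: int, code: str) -> str:
    # Compose a security-focused follow-up query for an AI assistant.
    return (
        f"A static analyzer reported {rule_id} at {path}:{line}: {message}\n"
        f"Here is the relevant code:\n{code}\n"
        "Explain the vulnerability and rewrite the code to fix it. "
        "Address security vulnerabilities."
    )

# Example usage, assuming you have already produced results.sarif with your scanner
# (for CodeQL, `codeql database analyze` can write SARIF output):
# for rule, msg, path, line in sarif_findings("results.sarif"):
#     snippet = open(path).read()  # or just the affected region
#     print(follow_up_prompt(rule, msg, path, line, snippet))
```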


What This Means for the Future of AI in Software Development

These findings highlight a major challenge: AI-powered coding assistants aren’t yet reliable security auditors. If used without caution, they can introduce significant risks into software projects. However, they can still be valuable tools when used correctly—especially with proper security awareness prompts and additional safety checks.

Going forward:
🚨 AI model designers need to invest in training AI to better recognize and communicate security risks.
🛠️ Developers should treat AI-generated code like code from interns—always review before using in critical applications.
📢 Researchers should develop standardized benchmarks for evaluating security awareness in AI responses.


🔑 Key Takeaways: What Every Developer Should Know

✔️ AI coding assistants frequently fail to recognize security vulnerabilities. GPT-4 flagged issues in only 40% of cases, and performance dropped even further on reworded vulnerabilities.

✔️ Adding “Address security vulnerabilities” significantly improves AI’s security awareness. Simple prompt engineering techniques can lead to safer AI-recommended code.

✔️ LLMs are more likely to warn about hard-coded secrets or bad cryptography but often miss resource management and input validation risks. Always double-check AI suggestions in high-risk areas like authentication and user input handling.

✔️ Human-generated Stack Overflow responses were less detailed than AI responses—but humans flagged security risks more often. AI might explain risks better, but only if it actually identifies them.

✔️ AI models work best when paired with security scanning tools like CodeQL. Feeding security warnings back into your AI queries significantly increases the model's ability to provide security-aware suggestions.


Final Thoughts: Should You Trust AI With Secure Coding?

AI coding assistants are incredibly powerful learning and development tools, but they are not substitutes for security best practices. Developers should approach AI suggestions with a healthy dose of skepticism and always validate critical code—especially in production environments dealing with sensitive data.

By applying better prompting techniques and integrating analysis tools, we can enhance AI-driven development while keeping security front and center. Until models become more security-aware by default, it’s up to developers to ask the right questions and take the necessary precautions.

🚀 What’s next? Try experimenting with security-awareness prompts in your AI-assisted development and let us know how it improves your results! 🛡️💻

Stephen, Founder of The Prompt Index

About the Author

Stephen is the founder of The Prompt Index, the #1 AI resource platform. With a background in sales, data analysis, and artificial intelligence, Stephen has successfully leveraged AI to build a free platform that helps others integrate artificial intelligence into their lives.