Can AI Really Code Stats? Testing ChatGPT & Llama on SAS Programming

Introduction

Imagine having an AI assistant that could handle all your tedious statistical coding for you—no more debugging, no more syntax headaches, and no more late-night coding marathons. Sounds like a dream, right? Thanks to large language models (LLMs) like ChatGPT and Llama, this dream is inching closer to reality.

But before we get carried away, there's an important question we need to ask: Can these AI models actually write accurate and effective statistical code?

A recent study by researchers at Virginia Tech put AI to the test, evaluating how well ChatGPT (GPT-3.5 & GPT-4.0) and Llama (3.1 70B) perform in SAS programming, a popular language used in finance, healthcare, and research for statistical analysis. Their goal? To see if AI can reliably generate code that not only runs correctly but produces accurate results.

Spoiler alert: AI is getting there, but it’s far from perfect. Let’s dive into what they found.

How the Study Worked

To assess the abilities of these LLMs, the researchers:

- Created 207 statistical analysis tasks covering topics such as regression, hypothesis testing, and data visualization.
- Used ChatGPT and Llama to generate SAS code for each task.
- Had human experts rate the AI-generated code on three key criteria:
  - Code Quality & Readability: Is the code well structured, concise, and free from unnecessary complexity?
  - Executability: Does the code run without errors?
  - Output Accuracy: Does the code produce the correct statistical results?

The study's findings reveal that while AI-generated code often looks good on the surface, it struggles with deeper statistical reasoning, sometimes leading to incorrect or unusable results.

The Good: AI Can Write Clean, Readable Code

One of the biggest wins for LLMs was code quality. On average, AI-generated SAS scripts received a 94% score for readability and structure. These models followed proper formatting, avoided overly complex syntax, and made their code easy to understand.

💡 Think of it like a professional writer who has perfect grammar but doesn’t always understand the nuances of the topic they’re writing about.

For beginners in SAS programming, this means ChatGPT and Llama can be useful tools for generating well-structured code templates—but you still need to double-check the logic!

The Bad: Running the Code? Not so Fast...

Things got tricky when it came to executability, in other words, whether the SAS code actually worked when run. AI-generated code ran without errors only about 61% of the time.

The biggest mistakes included:
- Using incorrect dataset or variable names
- Syntax errors (e.g., missing operators, incorrect keywords)
- Misunderstanding SAS-specific commands

If you're relying on AI to generate code, be prepared to spend time debugging—this isn’t a magic "copy-and-paste" solution yet.
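One way to make that debugging less painful is to check the SAS log automatically instead of eyeballing it. Here is a minimal Python sketch, assuming you have saved the log as text. SAS logs flag problems with line prefixes such as "ERROR:", "ERROR 22-322:", and "WARNING:"; the function names and the sample log below are made up for illustration.

```python
import re

# SAS logs mark problems with line prefixes such as "ERROR:", "ERROR 22-322:",
# and "WARNING:". This pattern matches those prefixes at the start of a line.
LOG_PROBLEM = re.compile(r"^(ERROR|WARNING)(?:\s+\d+-\d+)?:\s*(.*)$")


def scan_sas_log(log_text):
    """Return (errors, warnings) as lists of message strings found in a SAS log."""
    errors, warnings = [], []
    for line in log_text.splitlines():
        match = LOG_PROBLEM.match(line.strip())
        if not match:
            continue  # NOTE lines and ordinary source lines are ignored
        level, message = match.group(1), match.group(2)
        (errors if level == "ERROR" else warnings).append(message)
    return errors, warnings


def ran_cleanly(log_text):
    """True when the log contains no ERROR lines (warnings are tolerated)."""
    errors, _ = scan_sas_log(log_text)
    return not errors


# Hypothetical log excerpt, written for this example only.
sample_log = """\
NOTE: The data set WORK.PATIENTS has 120 observations and 3 variables.
ERROR: Variable CHOLESTEROL not found.
WARNING: The data set WORK.RESULTS may be incomplete.
ERROR 22-322: Syntax error, expecting one of the following: ;, DATA.
"""

errors, warnings = scan_sas_log(sample_log)
print(len(errors), "errors;", len(warnings), "warnings")
for message in errors:
    print("  ERROR:", message)
```

A clean log only tells you the code ran. It says nothing about whether the statistics are right, which is the next problem.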

The Ugly: Faulty Results & Wrong Conclusions

Even when the code ran successfully, output correctness was a major issue. AI-generated SAS code produced correct statistical results only about 52% of the time, essentially a coin flip.

💡 Imagine asking an assistant to calculate your taxes, and they get the math right only half the time. Not great, right?

Some common issues included:
- Applying the wrong statistical methods
- Misinterpreting the dataset (e.g., treating categorical data as continuous)
- Producing redundant or unnecessary results

For professionals working in critical fields like healthcare, finance, and academia, this level of accuracy simply isn’t good enough—it could lead to serious errors in research and decision-making.
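The "categorical treated as continuous" mistake is one you can screen for before you even read the generated code. The sketch below is a rough heuristic, not a rule from the study: it flags numeric columns that take only a handful of distinct integer values, since those are often category codes. The column names, data, and the `max_levels` cutoff are all assumptions for illustration.

```python
def looks_categorical(values, max_levels=5):
    """Heuristic: a numeric column with only a few distinct integer values
    is probably a set of category codes, not a continuous measurement."""
    distinct = set(values)
    all_integers = all(float(v).is_integer() for v in distinct)
    return all_integers and len(distinct) <= max_levels


# Hypothetical columns, made up for this example.
columns = {
    "Age": [34, 51, 47, 62, 29, 58, 44, 39],
    "TreatmentGroup": [1, 2, 3, 1, 2, 3, 1, 2],  # codes for three study arms
    "CholesterolLevel": [182.5, 240.1, 199.0, 221.7, 175.3, 260.8, 210.4, 190.2],
}

# Columns worth a second look before they go into a model as plain numbers.
suspects = [name for name, values in columns.items() if looks_categorical(values)]
print(suspects)
```

If a flagged column shows up in the AI-generated code as an ordinary numeric predictor, check whether it should instead be declared as categorical, for example in a CLASS statement in a procedure that supports one, such as PROC GLM.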

Which AI Did Best? ChatGPT vs. Llama

While none of the models were perfect, GPT-4.0 slightly outperformed the others in terms of both code executability and correctness. However, the differences weren’t statistically significant.

- GPT-4.0: Best at handling complex SAS commands & reducing errors.
- GPT-3.5: Produced cleaner, more concise code but struggled more with execution.
- Llama 3.1 70B: Offered multiple code solutions (which can be useful) but sometimes generated redundant or incorrect answers.

What This Means for You

Should You Use AI for Statistical Programming?

Yes—but with caution. AI can be a great tool for generating SAS templates and debugging common issues, but it's not yet reliable for full automation. If you're a beginner, AI can help you get started, but you’ll still need human expertise to verify AI-generated code—especially if you're working with complex data analysis.
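What might that verification look like in practice? One option is to recompute a key number independently and compare it with what the SAS output reports. The Python sketch below does this for a simple linear regression using the textbook least-squares formulas. The data and the "reported" values are invented for illustration, and the tolerance is an arbitrary choice.

```python
import math


def ols_fit(x, y):
    """Least-squares intercept and slope for the model y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def matches_reference(reported, reference, rel_tol=1e-4):
    """True when every reported estimate agrees with the reference within rel_tol."""
    return all(
        math.isclose(rep, ref, rel_tol=rel_tol)
        for rep, ref in zip(reported, reference)
    )


# Hypothetical data, constructed so that cholesterol = 135 + 1.5 * age exactly.
age = [30, 40, 50, 60, 70]
cholesterol = [180, 195, 210, 225, 240]

reference = ols_fit(age, cholesterol)

# Hypothetical (intercept, slope) copied by hand from the SAS output.
reported_by_sas = (135.0, 1.5)

print("reference:", reference)
print("agrees with SAS output:", matches_reference(reported_by_sas, reference))
```

This only covers one simple model, but the habit generalizes: pick a result you can compute another way, and treat a mismatch as a reason to reread the generated code.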

How To Get the Best AI-Generated Code

Want to boost the quality of AI-generated SAS code? Try these prompting best practices:

✅ Be clear and specific. Instead of "Write SAS code for regression," try "Generate SAS code that uses PROC REG to fit a linear regression with cholesterol level as the response and patient age as the predictor."

✅ Provide dataset details. AI does better when it knows data structures. Example: "The dataset has columns: PatientID, Age (numeric), CholesterolLevel (numeric)."

✅ Check AI’s work. Always run the code, debug errors, and confirm that outputs align with expected results.

✅ Ask AI to explain its logic. If something looks off, ask, "Why did you include PROC MEANS in this analysis?"
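If you write prompts like these often, it can help to assemble them from the same pieces every time so you never forget the dataset details. The following is a small Python sketch of that idea. The function name, the dataset name `patients`, and the exact wording of the prompt are assumptions for illustration, not something taken from the study.

```python
def build_sas_prompt(task, procedure, dataset, columns):
    """Assemble a specific prompt from the task, SAS procedure, and data layout.

    columns maps each column name to a short type description.
    """
    column_lines = [f"- {name} ({kind})" for name, kind in columns.items()]
    return "\n".join([
        f"Generate SAS code to {task} using {procedure}.",
        f"The dataset is named {dataset} and has these columns:",
        *column_lines,
        "Use only these dataset and column names, and explain each step in comments.",
    ])


prompt = build_sas_prompt(
    task="fit a linear regression of CholesterolLevel on Age",
    procedure="PROC REG",
    dataset="patients",  # hypothetical dataset name
    columns={
        "PatientID": "identifier",
        "Age": "numeric",
        "CholesterolLevel": "numeric",
    },
)
print(prompt)
```

Spelling out the dataset and column names targets one of the most common execution failures mentioned earlier: code that refers to datasets or variables that do not exist.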

Key Takeaways

✔ AI CAN help with coding but isn’t perfect. The generated SAS code is well structured, but correctness is still a major issue.

✔ Code execution is a gamble. AI-generated code ran without errors only about 61% of the time; common culprits were wrong dataset or variable names and misunderstood SAS syntax.

✔ Your output might be wrong! AI-generated code produced correct statistical results only about half the time, so you need human verification before trusting its output in real-world applications.

✔ GPT-4.0 is slightly better than the rest. However, none of the models were significantly stronger across the board.

✔ AI coding is best used for assistance, not automation. Think of it as a co-pilot—great for templates but still requiring a human expert to verify the work.

👀 The Future of AI in Statistical Programming

While today’s LLMs are still imperfect, AI’s role in statistical coding is evolving rapidly. Future improvements in domain-specific training and better fact-checking mechanisms could one day make AI a dependable coding assistant, but we're not quite there yet.

Until then, use AI wisely and double-check its outputs—your data (and career) will thank you.

🚀 What are your experiences using AI for coding? Let us know in the comments!