Can AI Really Code Stats? Testing ChatGPT & Llama on SAS Programming
Introduction
Imagine having an AI assistant that could handle all your tedious statistical coding for you: no more debugging, no more syntax headaches, and no more late-night coding marathons. Sounds like a dream, right? Thanks to large language models (LLMs) like ChatGPT and Llama, this dream is inching closer to reality.
But before we get carried away, there's an important question we need to ask: Can these AI models actually write accurate and effective statistical code?
A recent study by researchers at Virginia Tech put AI to the test, evaluating how well ChatGPT (GPT-3.5 & GPT-4.0) and Llama (3.1 70B) perform in SAS programming, a popular language used in finance, healthcare, and research for statistical analysis. Their goal? To see if AI can reliably generate code that not only runs correctly but produces accurate results.
Spoiler alert: AI is getting there, but it's far from perfect. Let's dive into what they found.
How the Study Worked
To assess the abilities of these LLMs, the researchers:
- Created 207 statistical analysis tasks covering various topics like regression, hypothesis testing, and data visualization.
- Used ChatGPT and Llama to generate SAS code for each task.
- Had human experts rate the AI-generated code based on three key criteria:
  - Code Quality & Readability: Is the code well-structured, concise, and free from unnecessary complexity?
  - Executability: Does the code run without errors?
  - Output Accuracy: Does the code produce the correct statistical results?
The study's findings reveal that while AI-generated code often looks good on the surface, it struggles with deeper statistical reasoning, sometimes leading to incorrect or unusable results.
The Good: AI Can Write Clean, Readable Code
One of the biggest wins for LLMs was code quality. On average, AI-generated SAS scripts received a 94% score for readability and structure. These models followed proper formatting, avoided overly complex syntax, and made their code easy to understand.
💡 Think of it like a professional writer who has perfect grammar but doesn't always understand the nuances of the topic they're writing about.
For beginners in SAS programming, this means ChatGPT and Llama can be useful tools for generating well-structured code templates, but you still need to double-check the logic!
The Bad: Running the Code? Not so Fast...
Things got tricky when it came to executability, that is, whether the SAS code actually worked when run. AI-generated code ran without errors only about 61% of the time.
The biggest mistakes included:
- Using incorrect dataset or variable names
- Syntax errors (e.g., missing operators, incorrect keywords)
- Misunderstanding SAS-specific commands
If you're relying on AI to generate code, be prepared to spend time debugging. This isn't a magic "copy-and-paste" solution yet.
The Ugly: Faulty Results & Wrong Conclusions
Even if the code ran successfully, output correctness was a major issue. AI-generated SAS code produced correct statistical results only about 52% of the time, essentially a coin flip.
💡 Imagine asking an assistant to calculate your taxes, and they get the math right only half the time. Not great, right?
Some common issues included:
- Applying the wrong statistical methods
- Misinterpreting the dataset (e.g., treating categorical data as continuous)
- Producing redundant or unnecessary results
For professionals working in critical fields like healthcare, finance, and academia, this level of accuracy simply isn't good enough: it could lead to serious errors in research and decision-making.
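One of these failure modes, a coded category treated as a continuous measurement, can be screened for before any model is run. Here is a minimal Python sketch, not something from the study: the column names, the values, and the `max_levels` cutoff are all made up for illustration, and the heuristic is deliberately crude.

```python
def likely_categorical(values, max_levels=5):
    """Heuristic: a numeric column with very few distinct values may be a coded category."""
    distinct = {v for v in values if v is not None}
    if not distinct:
        return False
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    return all_numeric and len(distinct) <= max_levels


# Hypothetical columns, invented for illustration only.
columns = {
    "Age": [34, 51, 47, 62, 29, 58, 41, 66],
    "TreatmentGroup": [1, 2, 1, 3, 2, 3, 1, 2],  # numeric codes for three groups
}

# Columns a human should inspect before they go into a model as continuous predictors.
flagged = [name for name, values in columns.items() if likely_categorical(values)]
print(flagged)
```

A flagged column is only a cue for a human to look closer. In SAS, a genuinely categorical predictor would usually be declared in a CLASS statement in procedures that support one, such as PROC GLM.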
Which AI Did Best? ChatGPT vs. Llama
While none of the models were perfect, GPT-4.0 slightly outperformed the others in terms of both code executability and correctness. However, the differences weren't statistically significant.
- GPT-4.0: Best at handling complex SAS commands & reducing errors.
- GPT-3.5: Produced cleaner, more concise code but struggled more with execution.
- Llama 3.1 70B: Offered multiple code solutions (which can be useful) but sometimes generated redundant or incorrect answers.
What This Means for You
Should You Use AI for Statistical Programming?
Yes, but with caution. AI can be a great tool for generating SAS templates and debugging common issues, but it's not yet reliable for full automation. If you're a beginner, AI can help you get started, but you'll still need human expertise to verify AI-generated code, especially if you're working with complex data analysis.
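What might that verification look like in practice? One option is to recompute a key statistic independently and compare it with what the AI-generated SAS program printed. The Python sketch below does that for a one-predictor regression. The data, the reported slope, and the helper names `simple_ols` and `agrees` are hypothetical, not taken from the study.

```python
import math


def simple_ols(x, y):
    """Slope and intercept of a one-predictor least-squares fit."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def agrees(reported, recomputed, rel_tol=1e-3):
    """True when a reported statistic matches an independent recomputation."""
    return math.isclose(reported, recomputed, rel_tol=rel_tol)


# Toy data, invented for illustration: cholesterol is exactly 2 * age + 100.
age = [30, 40, 50, 60]
cholesterol = [160, 180, 200, 220]

slope, intercept = simple_ols(age, cholesterol)

reported_slope = 2.0  # hypothetical value copied from the SAS output
print(agrees(reported_slope, slope))
```

A mismatch doesn't say which side is wrong, only that someone needs to look before the number goes into a report.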
How To Get the Best AI-Generated Code
Want to boost the quality of AI-generated SAS code? Try these prompting best practices:
- Be clear and specific. Instead of "Write SAS code for regression," try "Generate SAS code to run a multiple linear regression analyzing patient age and cholesterol levels using PROC REG."
- Provide dataset details. AI does better when it knows data structures. Example: "The dataset has columns: PatientID, Age (numeric), CholesterolLevel (numeric)."
- Check AI's work. Always run the code, debug errors, and confirm that outputs align with expected results.
- Ask AI to explain its logic. If something looks off, ask, "Why did you include PROC MEANS in this analysis?"
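The first two tips can be folded into a small reusable template. The Python sketch below is one way to do it under stated assumptions: `build_sas_prompt`, the dataset name `patients`, and the "identifier" label for PatientID are invented for illustration, and the function only builds the prompt text rather than calling any model.

```python
def build_sas_prompt(task, dataset_name, columns, procedure=None):
    """Assemble a specific prompt: the task, the dataset, its columns, and an optional procedure."""
    column_list = ", ".join(f"{name} ({kind})" for name, kind in columns.items())
    parts = [
        f"Generate SAS code to {task}.",
        f"The dataset is named {dataset_name} and has columns: {column_list}.",
    ]
    if procedure:
        parts.append(f"Use {procedure}.")
    # Asking for the reasoning up front mirrors the "explain its logic" tip.
    parts.append("Explain why each procedure is included.")
    return " ".join(parts)


prompt = build_sas_prompt(
    task="run a linear regression of cholesterol level on patient age",
    dataset_name="patients",  # hypothetical dataset name
    columns={
        "PatientID": "identifier",  # assumed label; the example above gives no type
        "Age": "numeric",
        "CholesterolLevel": "numeric",
    },
    procedure="PROC REG",
)
print(prompt)
```

Keeping the column list in one place means the same details go into every prompt, which should cut down on the wrong-variable-name errors described earlier.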
Key Takeaways
- AI can help with coding but isn't perfect. The generated SAS code is well-structured (about 94% on readability and structure), but correctness is still a major issue.
- Code execution is a gamble. Only about 61% of the generated programs ran without errors; typical causes were wrong dataset or variable names and SAS syntax mistakes.
- Your output might be wrong! Roughly half the time, AI-generated code produced incorrect statistical results, meaning you need human verification before trusting its output in real-world applications.
- GPT-4.0 is slightly better than the rest. However, the differences between models weren't statistically significant.
- AI coding is best used for assistance, not automation. Think of it as a co-pilot: great for templates but still requiring a human expert to verify the work.
The Future of AI in Statistical Programming
While today's LLMs are still imperfect, AI's role in statistical coding is evolving rapidly. Future improvements in domain-specific training and better fact-checking mechanisms could one day make AI a dependable coding assistant, but we're not quite there yet.
Until then, use AI wisely and double-check its outputs. Your data (and career) will thank you.
What are your experiences using AI for coding? Let us know in the comments!