How I saved $12K on my taxes using Claude Code 4.8 and dynamic workflows
My CPA told me I owed $12K in fees for my 2025 taxes. A tax agent I built in a day using Claude Code 4.8, with ultracode + dynamic workflows, said otherwise.
My CPA told me I couldn’t file my taxes until August 2026 due to delayed K-1s and that I would owe $12K in fees as a result. This felt off, so I used ChatGPT to gut-check her statement. After several attempts, I had a directional sense that there was a way to pay my anticipated underpayment without filing, but I was still getting conflicting information and had open questions:
Could I only do this as estimated payments before the deadline, or also after the deadline?
Some answers used old interest rates. Was it using current tax law?
It was citing help pages and sources like YouTube instead of tax law directly. How grounded were the answers?
If I brought this to my CPA for discussion, she would wave it off as AI hallucination. What I really wanted was direct citations to the California and federal tax codes.
I was curious whether there were already products or people working on this. This led me to a (very) recent paper from OpenAI, "Building self-improving tax agents with Codex."
Of course (!) the frontier labs will build best-in-class agents using their partnerships with other firms to get large corpuses of proprietary tax filings.
However, I didn’t need, or want to build, a full-blown tax filing agent. I just wanted an agent grounded in tax law for individuals, to help individual filers and CPAs do research.
I stumbled upon a few tax evaluation benchmarks while going down this rabbit hole: TaxEval v2, tax-calc-bench (arXiv, github), and others.
Tax-calc-bench was the only open source benchmark I could find and run myself. It focuses on benchmarking models, not agents.
What if I built a CPA copilot using an agent harness?
I could use the OpenAI Agents SDK, provide it with federal and California tax law as a local corpus so it could search it using fused (hybrid) RAG, and finally, run it in Vercel’s Sandbox so it could grep the corpus and use a code interpreter to do calculations.
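To make the "fused" part of hybrid RAG concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking and a vector ranking into a single list. It is an illustration under assumptions, not the CPA-Copilot code: the function name, the document IDs, and the constant k=60 (a commonly used default) are all placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical corpus IDs for illustration only.
keyword_hits = ["irc_6603", "irc_6654", "irc_6651"]      # e.g. from grep/BM25
vector_hits = ["irc_6603", "ca_rtc_19136", "irc_6654"]   # e.g. from embeddings

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Sections that both retrievers agree on land first, which is the behavior you want when a question needs an exact statute rather than a loosely related help page.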
So that’s what I built: CPA-Copilot.
One of the first challenges was that the agent was really slow and kept tripping on calculations. I looked closer at tax-calc-bench and realized they were achieving their scores using the xhigh effort setting. Some test cases ran for 30+ minutes without finishing. Even simple questions took dozens of seconds.
At this point, I decided to give Claude Code (Opus 4.8) and its newly released ultracode + dynamic workflow mode a whirl. I asked it to help optimize the harness, with some nudging:
Could we categorize user questions by difficulty?
Could we use a routing layer to route different questions to different models and effort settings?
Finally, to make sure it didn’t overfit the eval’s golden answers, I asked it to use different instances of Claude Code to run blind evals.
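The routing idea from the questions above can be sketched roughly like this. Everything here is an assumption for illustration: the tier names, the model and effort labels, and especially the keyword heuristic, which stands in for whatever classifier a real harness would use (for example, a small model call).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # placeholder model label, not a real model name
    effort: str  # placeholder reasoning-effort label


# Hypothetical tiers: cheaper and faster for easy questions.
ROUTES = {
    "simple": Route(model="small-model", effort="low"),
    "moderate": Route(model="mid-model", effort="medium"),
    "complex": Route(model="large-model", effort="high"),
}

# Toy keyword hints; a real classifier would be more robust.
CALC_HINTS = ("calculate", "how much", "compute", "penalty", "interest")
COMPLEX_HINTS = ("k-1", "amended", "multi-state", "carryover")


def classify(question: str) -> str:
    """Bucket a question into a difficulty tier."""
    q = question.lower()
    if any(hint in q for hint in COMPLEX_HINTS):
        return "complex"
    if any(hint in q for hint in CALC_HINTS):
        return "moderate"
    return "simple"


def route(question: str) -> Route:
    """Pick the model and effort setting for a question."""
    return ROUTES[classify(question)]


print(route("What is the filing deadline for individuals?"))
print(route("My K-1s are delayed; how much should I prepay?"))
```

The point of the design is that only the questions that need heavy reasoning pay for it; a lookup-style question never touches the slowest setting.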
As a final result, CPA-Copilot tops every column of the TY24 leaderboard and runs in significantly less time, thanks to difficulty-based routing and a deterministic calculator exposed as a tool. This is something dynamic workflows unlocked: instead of spending dozens of seconds on reasoning and re-creating ad hoc scripts, it converged on a generalizable calculator that could be reused. Simple questions now run in seconds, and more complex ones in dozens of seconds rather than minutes.
Correct returns (strict): 64.71%
Correct returns (lenient): 78.43%
Correct (by line): 91.92%
Correct (by line, lenient): 94.84%
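For readers unfamiliar with the four metrics, here is a rough sketch of how strict and lenient scoring could be computed for a single return. I am assuming that "lenient" means each line may be off by a small dollar tolerance; the 5.0 default, the function name, and the line labels are placeholders, so check the benchmark itself for the exact rule.

```python
def score_return(expected, actual, tolerance=5.0):
    """Compare one computed return against its golden answer, line by line.

    `tolerance` is an assumed dollar threshold for lenient matching.
    """
    total = len(expected)
    # Strict: the line must match the golden value exactly.
    strict_hits = sum(
        1 for line, value in expected.items() if actual.get(line) == value
    )
    # Lenient: the line must be present and within the tolerance.
    lenient_hits = sum(
        1
        for line, value in expected.items()
        if line in actual and abs(actual[line] - value) <= tolerance
    )
    return {
        "strict_return": strict_hits == total,    # every line exact
        "lenient_return": lenient_hits == total,  # every line within tolerance
        "strict_by_line": strict_hits / total,
        "lenient_by_line": lenient_hits / total,
    }


# Hypothetical line labels and amounts for illustration only.
expected = {"line_11": 85000.0, "line_15": 70400.0, "line_24": 10500.0, "line_34": 0.0}
actual = {"line_11": 85000.0, "line_15": 70400.0, "line_24": 10503.0, "line_34": 250.0}

result = score_return(expected, actual)
print(result)
```

This also shows why the by-line numbers are so much higher than the per-return numbers: a single wrong line fails the whole return even when every other line is correct.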
My research using CPA-Copilot led me to Section 6603 (among others) of the federal tax code. I brought this up with my CPA, and she agreed that we could avoid the fees, despite not having filed yet, by making a prepayment even after the deadline. This was news to her, since other clients had told her that the WebPay and Direct Pay forms wouldn’t accept payments for prior tax years. A win-win.


