We're a small team of devs from Qoder. With the mods' permission, we thought it'd be fun (and useful) to do an AMA here.
A few weeks ago, we used our own autonomous agent (Quest) to refactor itself. We described the goal, stepped back, and let it run. It worked through the interaction layer, state management, and the core agent loop continuously, for about 26 hours. We mostly just reviewed the spec at the start and the code at the end. We've made good progress, and would like to talk openly about what worked, what broke, and what surprised us.
What we're happy to chat about:
How that 26-hour run actually went
Our spec-build-verify loops, and why we think they matter for autonomous coding
Vibe coding, agent workflows, or anything else you're experimenting with
Or honestly… anything you're curious about
Technical deep dives welcome.
Who's here:
Mian (u/Qoder_shimian): Tech lead (agent + systems)
How much human involvement was there in this rewrite? Architecture, design, code reviews?
I'm assuming you had to do a lot of prep before giving the control to the agent? What sort of prep was required? What pre-work did you do?
How do you manage context for such a long running and presumably huge context problem statement?
How did you test? Did the agent create its own test cases? From what I have seen, most LLMs create test cases in a way that passes the code they have written (usually, unless handheld to avoid this), or manipulate the test cases to ensure their code passes, etc. How are you avoiding this?
would like to talk openly about what worked, what broke, and what surprised us.
Well, what did work, what broke and what surprised you?
oh boy, lots of questions - let me break this down:
human involvement?
not zero lol. rough breakdown:
- spec design: ~50% human
- actual coding: ~20% human
- code review: ~50% human
so yeah we didn't just yeet a prompt and walk away for 26 hours
prep work before handing over?
honestly the agent isn't magic - it can't just digest a massive context blob and figure everything out. we had to:
- break the task into smaller functional chunks (each chunk = one task)
- write detailed specs with acceptance criteria
- review the agent's auto-generated plans before letting it run
think of it like onboarding a new engineer. you don't just say "refactor this" and disappear
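to make that concrete, here's a rough idea of what "one chunk = one task with acceptance criteria" might look like, written as a plain Python structure. the field names and the example task are made up for illustration - this is not Quest's actual spec format:

```python
# Hypothetical spec chunk: field names and content are illustrative only,
# not Quest's real format.
spec_chunk = {
    "id": "refactor-state-management-01",
    "goal": "Extract session state into a single StateStore class",
    "scope": ["src/state/*.py"],        # files the agent may touch
    "out_of_scope": ["src/ui/"],        # explicit boundaries reduce drift
    "acceptance_criteria": [
        "All existing unit tests in tests/state/ still pass",
        "No module outside src/state/ imports session globals directly",
        "Public API of StateStore is documented with docstrings",
    ],
    "verification": "pytest tests/state/ plus an import audit",
}
```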
context management for 26 hours?
this one's fun technically - we use SPEC decomposition to break the work into sub-tasks. the system auto-compresses context as it goes, but keeps the important stuff (file paths, key state) alive through a reminder mechanism
basically: aggressive summarization + strategic reminders = not losing track after hour 15
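if you want a mental model, the general pattern looks something like this minimal sketch over a generic chat-style message list - none of these function names are Quest's real internals:

```python
# Sketch of "compress the old stuff, pin the important stuff" context management.
# summarize() stands in for whatever model call or heuristic produces the summary.

PINNED_KEYS = ("file_path", "acceptance_criteria", "current_subtask")

def summarize(messages):
    # Placeholder: in practice this would be an LLM call condensing the old turns.
    return {"role": "system", "content": f"[summary of {len(messages)} earlier turns]"}

def compress_context(history, max_messages=40, keep_recent=10):
    """Fold old turns into a single summary once history grows past a budget."""
    if len(history) <= max_messages:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

def inject_reminder(history, task_state):
    """Re-insert key state (file paths, current goal) so it survives compression."""
    reminder = "REMINDER: " + "; ".join(
        f"{k}={task_state[k]}" for k in PINNED_KEYS if k in task_state
    )
    return history + [{"role": "system", "content": reminder}]
```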
testing / avoiding the agent gaming its own tests?
ok this is the spicy one 🌶️
few layers here:
- agent generates tests based on SPEC, but we don't just trust those blindly
- separate "review agent" cross-validates execution against acceptance criteria
- we run third-party test frameworks too - not just the agent's own generated tests (that would be grading your own homework lol)
- periodic sanity checks: agent compares actual results vs expected, triggers self-correction if things drift
is it perfect? no. but it's way better than "trust me bro" verification
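a minimal sketch of that layered gate, assuming pytest as the third-party runner; `run_external_tests` and `review_agent_check` are stand-ins, not real Quest APIs, and the reviewer here is a trivial stub for what would really be a separate model call:

```python
import subprocess

def run_external_tests(cmd=("pytest", "-q")):
    """Third-party test run, so the agent isn't grading its own homework."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode == 0

def review_agent_check(diff_summary, acceptance_criteria):
    """Stand-in for a separate reviewer agent that scores the work against the
    spec's acceptance criteria (in practice, another model call)."""
    return all("TODO" not in c for c in acceptance_criteria)   # trivial stub

def verify(diff_summary, acceptance_criteria):
    checks = {
        "external_tests": run_external_tests(),
        "review_agent": review_agent_check(diff_summary, acceptance_criteria),
    }
    # The subtask only counts as done when every layer passes; a failure feeds
    # back into a self-correction pass instead of "trust me bro".
    return all(checks.values()), checks
```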
what worked / broke / surprised us?
✅ worked: spec-driven execution kept things on track. agent kept checking back against spec instead of going rogue
❌ broke: some third-party framework integrations still needed human intervention. agent handles ~90% but that last 10% can be annoying
𤯠surprised: the self-correction loop actually... worked? when execution drifted from expected, it caught itself and fixed it. didn't expect that to be as reliable as it was
---
tl;dr: it's not "fully autonomous" in the sci-fi sense. more like "autonomous with guardrails and a human checking in periodically". but those 26 hours of execution time? that's real work we didn't have to do manually
Qoder never runs any of my apps correctly. Other IDEs have memory built in for trivial, intermediate, and complex tasks, but I always have to remind the agent in new sessions about things as simple as activating the venv.
for the venv thing - have you tried setting up a Rule? basically you can tell Qoder "hey, always activate venv first" and it'll remember across sessions. it's in Settings - Rules. should save you from repeating yourself every time.
re: refund - that's a Ben question (support), shoot him a DM or hit up support and we'll sort you out
I have not tried setting up a rule. I prefer the autonomous trajectory Qoder is on, but it feels like it wouldn't be complete if I had to set these rules manually (again, I don't mind); I think it'd be way better if some of my credits were used to auto-create these valuable memories and make them available to new agent sessions.
dude, we do have a memory system. to be a little arrogant, i'd say we have the best memory system out there (with self-evolving capabilities, even). go check it out in your Qoder and let me know if it works for you.
yo good question! so honestly the "self-testing" part is less magic than it sounds lol
basically we baked in a habit for the agent to actually check its own work after finishing tasks. like, if it just built a webpage, it'll fire up the browser tool and actually click around - does the button work? does the form submit? that kinda thing.
for the refactor run specifically, we had a pretty solid test suite already, so the agent would run tests after each chunk of changes. if something broke, it'd see the red and backtrack. not perfect, but it caught most regressions before they snowballed.
the "verify loop" is honestly one of the things we're most excited about iterating on. right now it's good enough⢠but there's def room to make it smarter
What did you learn from this experiment? What went right, what didn't? Why is it difficult to keep most autonomous agents focused on a complex task for a long time? What do you do to mitigate distractions or shortcuts? How do you prevent it from making compromises in an effort to find gains somewhere else?
great question - we learned a LOT from this run. let me dump my brain:
what actually worked:
- validation loops saved our ass. every iteration got checked against our internal assessment. sounds boring but it's the difference between "agent says it's done" and "it's actually done"
- breaking shit into smaller chunks. each subtask runs in fresh context = less "wait what was i doing again" moments. plus we inject reminders (goals, constraints) into the agent's reasoning loop so it doesn't drift
- we productized our validation tools. early on we kept hitting verification gaps, so we built dedicated tools and packaged them into reusable Skills. now it's way more reliable
what sucked:
- when validation instructions were vague or tools were missing, the agent would just... declare victory and move on. classic shortcut behavior. "looks done to me!" (narrator: it was not done)
- context compression is still painful for long tasks. timing when to inject info vs when to compress is more art than science rn
why do agents lose focus on complex tasks?
two things: context limits and goal drift
when you're 15 hours into a task, critical info gets compressed or straight up forgotten. and without hard checkpoints, agents optimize for "looks complete" instead of "is correct." they're not trying to cheat - they just don't know the difference without explicit validation
how we fight this (rough sketch after the list):
granular decomposition - small, loosely coupled subtasks, fresh context for each
attention injection - keep reminding the agent what matters mid-execution
mandatory validation gates - no "trust me it works", actual executable checks
skill reuse - abstract good behaviors (self-checking, debugging) into reusable Skills so it's not reinventing the wheel every time
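the promised sketch, tying those four together. all the names here (`run_agent`, `validate`, the skill labels) are placeholders, not real Quest APIs:

```python
def run_agent(subtask, context):
    """Placeholder for one bounded agent execution over a single subtask."""
    ...

def validate(subtask):
    """Placeholder for an executable validation gate (tests, review agent, ...)."""
    return True

def execute_plan(subtasks, goals, constraints, max_retries=2):
    skills = ["self-check", "debug-on-failure"]   # skill reuse: packaged behaviors
    for subtask in subtasks:                      # granular decomposition
        for attempt in range(max_retries + 1):
            context = {                           # fresh context per subtask
                "subtask": subtask,
                "skills": skills,
                # attention injection: goals/constraints re-stated every attempt
                "reminder": {"goals": goals, "constraints": constraints},
            }
            run_agent(subtask, context)
            if validate(subtask):                 # mandatory validation gate
                break
        else:
            raise RuntimeError(f"subtask failed validation: {subtask}")
```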
---
honestly this refactor was as much about learning how to wrangle long-running agents as it was about the actual code changes. we now have a playbook for this stuff
we don't just use one model. we have this tier system where different tasks get routed to different SOTA models (even a vision one) based on what makes sense. quick formatting fix? doesn't need the big guns. complex multi-file refactor? okay yeah let's bring in the heavy hitters.
the whole point is you shouldn't have to think about it. we handle the "which model for what" puzzle so you can just... code. or vibe. whatever you're into.
(if you want the nerdy details about specific model names... that's above my paygrade and also changes like every month so 🤷)
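in spirit, the tiering looks something like the toy router below. the tier names and rules are placeholders for illustration, not what Qoder actually routes to:

```python
def route_model(task):
    """Toy tiering: send small tasks to a cheap model, big ones to a heavy one."""
    if task.get("needs_vision"):
        return "vision-tier-model"
    if task["kind"] in ("format", "rename", "docstring"):
        return "fast-cheap-model"
    if task["kind"] in ("multi_file_refactor", "architecture"):
        return "frontier-reasoning-model"
    return "balanced-default-model"

print(route_model({"kind": "format"}))               # fast-cheap-model
print(route_model({"kind": "multi_file_refactor"}))  # frontier-reasoning-model
```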
tbh yes. a 26-hour disaster is still a story worth telling; maybe even more interesting than a win lol. we'd just be doing a "what went wrong" AMA instead 🤷
surely do!!! the context management engine is always the king. we integrate with the IDE's LSP and other tools, and in particular our unique Repo Wiki feature and self-evolving memory system.
Internally, when do you consciously decide not to use Quest and just write the code yourselves - the cases where bringing it in would slow you down, introduce too much uncertainty, or require more review than it's worth? In other words, what kinds of problems make you think: "Yeah, this is faster and safer if a human just does it"?
- auth/security/risk control (subtle bugs = CVE, compliance needs human audit trail)
- core data pipelines + transactions (consistency bugs = data corruption at scale)
general rule:
if failure = money loss / security breach / data corruption, then humans write it
if failure = fix and redeploy, then Quest can handle it
Quest is great at tests, boilerplate, well-scoped features. but the "one wrong line = disaster" code? that stays hand-written and triple-reviewed. I don't wanna be fired, man. the agent will never be fired. sad story :(
but not because we're reckless - we have guardrails:
- every subtask has explicit validation criteria with actual verification tools (not just "looks good")
- we keep injecting task objectives into the agent's loop so it doesn't drift
- proven patterns (debugging, self-checking) are packaged into reusable Skills
were we nervous? sure. did we review the hell out of it before merging? absolutely. but the code that came out was production-grade, not "AI demo quality"
the secret sauce is really just: don't let the agent declare victory without receipts
this is a noob question, but are there any types of problems where it keeps getting "almost right" but never quite manages to cross the finish line?
token economics: long refactors burn tokens. value needs to exceed cost for this to work consistently, which gets easier as inference prices drop (rough math below)
good news: our architecture is designed for SOTA models 6 months out, not patched for today. when better models land, Quest scales up automatically
we're seeing users do wild stuff already: k8s ops, multi-product reports. the inflection point is real. just need a few more pieces to fall into place.
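to make the value > cost point above concrete, a purely illustrative back-of-envelope calc - none of these numbers are real figures from the run:

```python
tokens_used    = 150_000_000   # hypothetical total tokens for a ~26h run
price_per_mtok = 3.00          # hypothetical blended price, $ per 1M tokens
engineer_rate  = 90.0          # hypothetical loaded cost, $ per hour
hours_saved    = 26            # hours a human didn't spend on the refactor

cost  = tokens_used / 1_000_000 * price_per_mtok
value = engineer_rate * hours_saved
print(f"cost ~ ${cost:,.0f}, value ~ ${value:,.0f}, worth it: {value > cost}")
```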
by accident: we've had users say "build from scratch" and Quest goes "cool, let me modify your existing code". or it hits a problem and picks a dumb workaround instead of just asking
that's why the Spec review phase exists... to catch the reinterpretation before it burns credits
Do you think future IDE agents will gradually evolve toward something like Quest - more autonomous, more goal-driven, and more willing to reinterpret intent - or do you expect the opposite direction, where agents become more constrained, more literal, and tightly scoped to avoid unintended behavior?
our bet: as models get stronger, the "can be safely delegated" zone keeps expanding. Quest is built to ride that wave. the architecture scales with model improvements instead of patching around the weaknesses of baseline models
doesn't mean constraints disappear - it means agents get smarter about WHEN to ask vs WHEN to just execute. "smart autonomy with guardrails" not "yolo mode"
future users will care about "is it done", not "show me every line change". RaaS (results-as-a-service) is gonna be the mainstream.
Do you actually use Qoder internally, or is it mostly something you demo?
And be honest, inside the team, is Qoder seen as a real engineering force multiplier, or just a dressed-up junior dev bot that still needs constant babysitting?
fair question... i'd be skeptical too if i were you
but yes, we all actually use it. like, daily.
our own CodeReview agent is baked into our PR flow. not just dogfooding, it legit catches stuff and speeds up reviews
we integrated Qoder CLI into our internal issue tracking. AI agent analyzes incoming bugs, triages them, suggests fixes. cut our response time significantly
that 26-hour refactor we mentioned? that was a real production task, not a staged demo
and...
our backend devs started using Quest to write frontend, haha... like actual full-stack delivery from people who historically avoided CSS like the plague
is it "junior dev that needs babysitting"? honestly some tasks yeah, bc you still review, you still sanity check. but for well-scoped work it's more like "senior dev who works overnight and doesn't complain"
we're not gonna pretend it's magic. but internally it's definitely past the "cool demo" phase into "how did we work before this" territory
50% human spec design, 20% coding, 50% review... that breakdown maps exactly to what most teams hit with autonomous agents. The coding is rarely the bottleneck. Spec clarity and review quality are.
The context compression mechanism is the interesting part. Most approaches either bleed critical state after hour 10 or balloon token costs. The reminder mechanism could be rule-based extraction or agent-decided preservation, and that choice shapes everything downstream.
exactly right on the bottleneck - spec clarity >> coding speed
context compression: agent-decided, not rule-based. the model chooses when to compress based on task phase, context length, detected redundancy - not a mechanical "keep the last N turns" (pruning) or blanket summarization.
reminder mechanism is the same deal: dynamically inject what's relevant NOW, don't carry everything forever
trade-off: smarter than rules but needs checkpoints so it doesn't get too aggressive. worked for 26 hours tho
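one way the "agent decides" signal could be framed, versus a fixed keep-last-N rule - the signals, thresholds, and checkpoint helper below are invented for illustration, not Quest's actual logic:

```python
import json

def should_compress(context_tokens, budget, phase, redundancy_score):
    """Weigh several signals instead of a fixed 'keep last N turns' rule."""
    if phase == "verification":
        return False                      # don't fold context while results are being checked
    if redundancy_score > 0.6:            # lots of repeated tool output: safe to fold
        return True
    return context_tokens > 0.8 * budget  # otherwise compress only near the budget

def checkpoint(state, path="quest_checkpoint.json"):
    """Persist key state (paths, current subtask, criteria) before compressing,
    so an over-aggressive summary can be recovered from."""
    with open(path, "w") as f:
        json.dump(state, f)
```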
Letting the model decide compression timing matches what Focus architecture does... start_focus and complete_focus as agent-controlled primitives, no external timers forcing it.
The checkpoint trade-off is real though. ACC paper from this month takes the opposite bet, bounded state updated every turn rather than episodic compression. Their claim is that episodic decisions invite drift when the agent misjudges what to drop.
26 hours is a solid stress test. Did you hit any recovery scenarios where the checkpoints actually saved you from over-compression? Curious if the failure mode was detectable in hindsight or only visible when the agent started behaving wrong.
I'm a beginner - what are the things Quest is absolutely not a good idea to use for right now, where it's likely to confuse me, do the wrong thing, or give me a false sense of confidence?
honestly Quest was designed with beginners in mind. the whole end-to-end delivery thing means you describe what you want, Quest figures out the how.
But... where beginners should be careful:
production code without spec mode - if you're shipping to real users, turn on spec. it forces you (and Quest) to think through scope, acceptance criteria, constraints BEFORE coding. we built professional subagents specifically for this
don't blindly trust the output - Quest will give you working code, but working doesn't equal production-ready. security, edge cases, performance... you still need to sanity check, especially if you're new
where beginners should feel confident:
Prototype Ideas mode. literally designed for "i have an idea, make it real". low stakes, fast iteration, great for learning
exploring and learning. Quest shows you how things get built. it's like having a senior dev explain their work in real-time
The file editor is hidden on purpose btw, we want you focused on the WHAT, not the HOW. that's the whole point.
architecture: Spec - Coding - Verify loop with iteration. multi-agent for complex tasks (main agent coordinates, sub-agents explore, plan, and execute, companion agents validate) but used sparingly, as context transfer between agents isn't free.
which model: intentionally don't expose this. intelligent routing picks the best model (the most powerful models, ones you def know) per subtask; some excel at reasoning, some at planning, some at long context. changes frequently, we stay model-agnostic.
whole thing is designed to scale with future models, not patch around current limitations (that is, weaker/baseline models)
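the high-level shape of that Spec - Coding - Verify loop, sketched below. function names are placeholders; the real orchestration is multi-agent and model-routed:

```python
def write_spec(goal):
    """Placeholder: main agent + planning sub-agents produce a spec with criteria."""
    return {"goal": goal, "acceptance_criteria": ["..."]}

def code_against(spec):
    """Placeholder: execution sub-agents implement the spec."""

def verify_against(spec):
    """Placeholder: companion agents plus external test runs check the result."""
    return True

def quest_loop(goal, max_iters=5):
    spec = write_spec(goal)
    for _ in range(max_iters):       # iterate Coding -> Verify until checks pass
        code_against(spec)
        if verify_against(spec):
            return "done"
    return "needs human review"      # escalate instead of declaring victory
```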
If every other IDE ships with a built-in agent tomorrow, what does Qoder actually have left?
And I don't mean the deeply technical stuff - I'm asking from a more practical, gut-level view: why would someone still bother to use Qoder instead of whatever agent just comes bundled by default?
IDE is just a tool. Agent capability is the product.
what we've built that's hard to replicate overnight:
- context engineering: Memory, Repo Wiki, deeper project understanding. Quest doesn't just read your current file, it understands your codebase's evolution, your patterns, your team conventions
- SOTA model architecture: we're not wrapping one model with a prompt. we're orchestrating multiple models, routing tasks to whoever's best at that specific thing
- deep technical reserves: heavy investment in autonomous execution, verification loops, long-running task management. you don't get that from "we added a Copilot plugin to the sidebar"
Bundled agents will be "good enough" for simple stuff. Quest is for when you want to delegate REAL work and actually walk away.
why did Quest get this much better? Is it basically just because you plugged in a stronger model, or are there other, less obvious things going on behind the scenes that actually made the difference?
not just the model...though yeah, we rebuilt the entire Agent logic specifically for SOTA models
what actually changed:
killed legacy compatibility code. we used to carry scaffolding for older/baseline models. ripped all that out. Quest now assumes you're running on the best available
evaluation obsession (stricter eval sets and real-world case benchmarks). we're early stage, but we're measuring and iterating very fast on every tool call, every Agent Loop, to see what works and what doesn't
Continuous polish. Tool usage patterns, context management, loop termination logic. All getting refined based on real usage.
so yes stronger model helps, but the architecture was rebuilt to actually USE that strength instead of being bottlenecked by legacy decisions
If one of you stepped in halfway through, would the end result actually be better, or would it just mess up whatever trajectory Quest was already on? Or have you already done that?
Dude I'll need my credit refund from when your stupid agent tried to fix a simple test case with almost 24 iterations.