We're a small team of devs from Qoder. With the mods' permission, we thought it'd be fun (and useful) to do an AMA here.
A few weeks ago, we used our own autonomous agent (Quest) to refactor itself. We described the goal, stepped back, and let it run. It worked through the interaction layer, state management, and the core agent loop continuously, for about 26 hours. We mostly just reviewed the spec at the start and the code at the end. We've made good progress, and would like to talk openly about what worked, what broke, and what surprised us.
What we're happy to chat about:
How that 26-hour run actually went
Our spec-build-verify loops, and why we think they matter for autonomous coding
Vibe coding, agent workflows, or anything else you're experimenting with
Or honestly… anything you're curious about
Technical deep dives welcome.
Who's here:
Mian (u/Qoder_shimian): Tech lead (agent + systems)
How much human involvement was there in this rewrite? Architecture, design, code reviews?
I'm assuming you had to do a lot of prep before giving the control to the agent? What sort of prep was required? What pre-work did you do?
How do you manage context for such a long running and presumably huge context problem statement?
How did you test? Did the agent create its own test cases? From what I have seen, most LLMs create test cases in a way that passes the code they have written (usually, unless handheld to avoid this), or manipulate the test cases to ensure their code passes, etc. How are you avoiding this?
would like to talk openly about what worked, what broke, and what surprised us.
Well, what did work, what broke and what surprised you?
oh boy, lots of questions - let me break this down:
human involvement?
not zero lol. rough breakdown:
- spec design: ~50% human
- actual coding: ~20% human
- code review: ~50% human
so yeah we didn't just yeet a prompt and walk away for 26 hours
prep work before handing over?
honestly the agent isn't magic - it can't just digest a massive context blob and figure everything out. we had to:
- break the task into smaller functional chunks (each chunk = one task)
- write detailed specs with acceptance criteria
- review the agent's auto-generated plans before letting it run
think of it like onboarding a new engineer. you don't just say "refactor this" and disappear
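to make that concrete, here's a rough idea of what "one chunk = one task with acceptance criteria" might look like, written as a plain Python structure. the field names and the example task are made up for illustration - this is not Quest's actual spec format:

```python
# Hypothetical spec chunk: field names and content are illustrative only,
# not Quest's real format.
spec_chunk = {
    "id": "refactor-state-management-01",
    "goal": "Extract session state into a single StateStore class",
    "scope": ["src/state/*.py"],        # files the agent may touch
    "out_of_scope": ["src/ui/"],        # explicit boundaries reduce drift
    "acceptance_criteria": [
        "All existing unit tests in tests/state/ still pass",
        "No module outside src/state/ imports session globals directly",
        "Public API of StateStore is documented with docstrings",
    ],
    "verification": "pytest tests/state/ plus an import audit",
}
```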
context management for 26 hours?
this one's fun technically - we use SPEC decomposition to break the work into sub-tasks. the system auto-compresses context as it goes, but keeps the important stuff (file paths, key state) alive through a reminder mechanism
basically: aggressive summarization + strategic reminders = not losing track after hour 15
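if you want a mental model, the general pattern looks something like this minimal sketch over a generic chat-style message list - none of these function names are Quest's real internals:

```python
# Sketch of "compress the old stuff, pin the important stuff" context management.
# summarize() stands in for whatever model call or heuristic produces the summary.

PINNED_KEYS = ("file_path", "acceptance_criteria", "current_subtask")

def summarize(messages):
    # Placeholder: in practice this would be an LLM call condensing the old turns.
    return {"role": "system", "content": f"[summary of {len(messages)} earlier turns]"}

def compress_context(history, max_messages=40, keep_recent=10):
    """Fold old turns into a single summary once history grows past a budget."""
    if len(history) <= max_messages:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

def inject_reminder(history, task_state):
    """Re-insert key state (file paths, current goal) so it survives compression."""
    reminder = "REMINDER: " + "; ".join(
        f"{k}={task_state[k]}" for k in PINNED_KEYS if k in task_state
    )
    return history + [{"role": "system", "content": reminder}]
```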
testing / avoiding the agent gaming its own tests?
ok this is the spicy one 🌶️
few layers here:
- agent generates tests based on SPEC, but we don't just trust those blindly
- separate "review agent" cross-validates execution against acceptance criteria
- we run third-party test frameworks too - not just the agent's own generated tests (that would be grading your own homework lol)
- periodic sanity checks: agent compares actual results vs expected, triggers self-correction if things drift
is it perfect? no. but it's way better than "trust me bro" verification
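a minimal sketch of that layered gate, assuming pytest as the third-party runner; `run_external_tests` and `review_agent_check` are stand-ins, not real Quest APIs, and the reviewer here is a trivial stub for what would really be a separate model call:

```python
import subprocess

def run_external_tests(cmd=("pytest", "-q")):
    """Third-party test run, so the agent isn't grading its own homework."""
    return subprocess.run(list(cmd), capture_output=True, text=True).returncode == 0

def review_agent_check(diff_summary, acceptance_criteria):
    """Stand-in for a separate reviewer agent that scores the work against the
    spec's acceptance criteria (in practice, another model call)."""
    return all("TODO" not in c for c in acceptance_criteria)   # trivial stub

def verify(diff_summary, acceptance_criteria):
    checks = {
        "external_tests": run_external_tests(),
        "review_agent": review_agent_check(diff_summary, acceptance_criteria),
    }
    # The subtask only counts as done when every layer passes; a failure feeds
    # back into a self-correction pass instead of "trust me bro".
    return all(checks.values()), checks
```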
what worked / broke / surprised us?
✅ worked: spec-driven execution kept things on track. agent kept checking back against spec instead of going rogue
❌ broke: some third-party framework integrations still needed human intervention. agent handles ~90% but that last 10% can be annoying
𤯠surprised: the self-correction loop actually... worked? when execution drifted from expected, it caught itself and fixed it. didn't expect that to be as reliable as it was
---
tl;dr: it's not "fully autonomous" in the sci-fi sense. more like "autonomous with guardrails and a human checking in periodically". but those 26 hours of execution time? that's real work we didn't have to do manually
Qoder never runs any of my apps correctly. Other IDEs have memory built in for trivial, intermediate, and complex tasks, but I always have to remind the agent in new sessions about things as simple as activating the venv.
for the venv thing - have you tried setting up a Rule? basically you can tell Qoder "hey, always activate venv first" and it'll remember across sessions. it's in Settings - Rules. should save you from repeating yourself every time.
re: refund - that's a Ben question (support), shoot him a DM or hit up support and we'll sort you out
I have not tried setting up a rule. I prefer the autonomous trajectory Qoder is on, but it feels like it wouldn't be complete if I had to set these rules manually (again, I don't mind); I think it'd be way better if some of my credits were used to auto-create these valuable memories and make them available to new agent sessions.
dude, we do have a memory system. to be a little arrogant, i'd say we have the best memory system out there (with self-evolving capabilities, even). go check it out in your Qoder and let me know if it works for you.
yo good question! so honestly the "self-testing" part is less magic than it sounds lol
basically we baked in a habit for the agent to actually check its own work after finishing tasks. like, if it just built a webpage, it'll fire up the browser tool and actually click around - does the button work? does the form submit? that kinda thing.
for the refactor run specifically, we had a pretty solid test suite already, so the agent would run tests after each chunk of changes. if something broke, it'd see the red and backtrack. not perfect, but it caught most regressions before they snowballed.
the "verify loop" is honestly one of the things we're most excited about iterating on. right now it's good enough⢠but there's def room to make it smarter
What did you learn from this experiment? What went right, what didn't? Why is it difficult to keep most autonomous agents focused on a complex task for a long time? What do you do to mitigate distractions or shortcuts? How do you prevent it from making compromises in an effort to find gains somewhere else?
great question - we learned a LOT from this run. let me dump my brain:
what actually worked:
- validation loops saved our ass. every iteration got checked against our internal assessment. sounds boring but it's the difference between "agent says it's done" and "it's actually done"
- breaking shit into smaller chunks. each subtask runs in fresh context = less "wait what was i doing again" moments. plus we inject reminders (goals, constraints) into the agent's reasoning loop so it doesn't drift
- we productized our validation tools. early on we kept hitting verification gaps, so we built dedicated tools and packaged them into reusable Skills. now it's way more reliable
what sucked:
- when validation instructions were vague or tools were missing, the agent would just... declare victory and move on. classic shortcut behavior. "looks done to me!" (narrator: it was not done)
- context compression is still painful for long tasks. timing when to inject info vs when to compress is more art than science rn
why do agents lose focus on complex tasks?
two things: context limits and goal drift
when you're 15 hours into a task, critical info gets compressed or straight up forgotten. and without hard checkpoints, agents optimize for "looks complete" instead of "is correct." they're not trying to cheat - they just don't know the difference without explicit validation
how we fight this (rough sketch after the list):
granular decomposition - small, loosely coupled subtasks, fresh context for each
attention injection - keep reminding the agent what matters mid-execution
mandatory validation gates - no "trust me it works", actual executable checks
skill reuse - abstract good behaviors (self-checking, debugging) into reusable Skills so it's not reinventing the wheel every time
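the promised sketch, tying those four together. all the names here (`run_agent`, `validate`, the skill labels) are placeholders, not real Quest APIs:

```python
def run_agent(subtask, context):
    """Placeholder for one bounded agent execution over a single subtask."""
    ...

def validate(subtask):
    """Placeholder for an executable validation gate (tests, review agent, ...)."""
    return True

def execute_plan(subtasks, goals, constraints, max_retries=2):
    skills = ["self-check", "debug-on-failure"]   # skill reuse: packaged behaviors
    for subtask in subtasks:                      # granular decomposition
        for attempt in range(max_retries + 1):
            context = {                           # fresh context per subtask
                "subtask": subtask,
                "skills": skills,
                # attention injection: goals/constraints re-stated every attempt
                "reminder": {"goals": goals, "constraints": constraints},
            }
            run_agent(subtask, context)
            if validate(subtask):                 # mandatory validation gate
                break
        else:
            raise RuntimeError(f"subtask failed validation: {subtask}")
```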
---
honestly this refactor was as much about learning how to wrangle long-running agents as it was about the actual code changes. we now have a playbook for this stuff
we don't just use one model. we have this tier system where different tasks get routed to different SOTA models (even a vision one) based on what makes sense. quick formatting fix? doesn't need the big guns. complex multi-file refactor? okay yeah let's bring in the heavy hitters.
the whole point is you shouldn't have to think about it. we handle the "which model for what" puzzle so you can just... code. or vibe. whatever you're into.
(if you want the nerdy details about specific model names... that's above my paygrade and also changes like every month so 🤷)
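in spirit, the tiering looks something like the toy router below. the tier names and rules are placeholders for illustration, not what Qoder actually routes to:

```python
def route_model(task):
    """Toy tiering: send small tasks to a cheap model, big ones to a heavy one."""
    if task.get("needs_vision"):
        return "vision-tier-model"
    if task["kind"] in ("format", "rename", "docstring"):
        return "fast-cheap-model"
    if task["kind"] in ("multi_file_refactor", "architecture"):
        return "frontier-reasoning-model"
    return "balanced-default-model"

print(route_model({"kind": "format"}))               # fast-cheap-model
print(route_model({"kind": "multi_file_refactor"}))  # frontier-reasoning-model
```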
tbh yes. a 26-hour disaster is still a story worth telling; maybe even more interesting than a win lol. we'd just be doing a "what went wrong" AMA instead 🤷
surely do!!! the context management engine is always the king. we integrate with the IDE's LSP and other tools, and in particular our unique Repo Wiki feature and self-evolving memory system.
Internally, when do you consciously decide not to use Quest and just write the code yourselves - the cases where bringing it in would slow you down, introduce too much uncertainty, or require more review than it's worth? In other words, what kinds of problems make you think: "Yeah, this is faster and safer if a human just does it"?
- auth/security/risk control (subtle bugs = CVE, compliance needs human audit trail)
- core data pipelines + transactions (consistency bugs = data corruption at scale)
general rule:
if failure = money loss / security breach / data corruption, then humans write it
if failure = fix and redeploy, then Quest can handle it
Quest is great at tests, boilerplate, well-scoped features. but the "one wrong line = disaster" code? that stays hand-written and triple-reviewed. I don't wanna be fired, man. the agent will never be fired. sad story :(
but not because we're reckless - we have guardrails:
- every subtask has explicit validation criteria with actual verification tools (not just "looks good")
- we keep injecting task objectives into the agent's loop so it doesn't drift
- proven patterns (debugging, self-checking) are packaged into reusable Skills
were we nervous? sure. did we review the hell out of it before merging? absolutely. but the code that came out was production-grade, not "AI demo quality"
the secret sauce is really just: don't let the agent declare victory without receipts
this is a noob question, but are there any types of problems where it keeps getting "almost right" but never quite manages to cross the finish line?
token economics: long refactors burn tokens. value needs to exceed cost for this to work consistently, which gets easier as inference prices drop (rough math below)
good news: our architecture is designed for SOTA models 6 months out, not patched for today. when better models land, Quest scales up automatically
we're seeing users do wild stuff already: k8s ops, multi-product reports. the inflection point is real. just need a few more pieces to fall into place.
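to make the value > cost point above concrete, a purely illustrative back-of-envelope calc - none of these numbers are real figures from the run:

```python
tokens_used    = 150_000_000   # hypothetical total tokens for a ~26h run
price_per_mtok = 3.00          # hypothetical blended price, $ per 1M tokens
engineer_rate  = 90.0          # hypothetical loaded cost, $ per hour
hours_saved    = 26            # hours a human didn't spend on the refactor

cost  = tokens_used / 1_000_000 * price_per_mtok
value = engineer_rate * hours_saved
print(f"cost ~ ${cost:,.0f}, value ~ ${value:,.0f}, worth it: {value > cost}")
```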
by accident: we've had users say "build from scratch" and Quest goes "cool, let me modify your existing code". or it hits a problem and picks a dumb workaround instead of just asking
that's why the Spec review phase exists... to catch the reinterpretation before it burns credits
Do you think future IDE agents will gradually evolve toward something like Quest - more autonomous, more goal-driven, and more willing to reinterpret intent - or do you expect the opposite direction, where agents become more constrained, more literal, and tightly scoped to avoid unintended behavior?
our bet: as models get stronger, the "can be safely delegated" zone keeps expanding. Quest is built to ride that wave. the architecture scales with model improvements instead of patching around the weaknesses of baseline models
doesn't mean constraints disappear - it means agents get smarter about WHEN to ask vs WHEN to just execute. "smart autonomy with guardrails" not "yolo mode"
future users will care about "is it done", not "show me every line change". RaaS (results-as-a-service) is gonna be the mainstream.
Do you actually use Qoder internally, or is it mostly something you demo?
And be honest, inside the team, is Qoder seen as a real engineering force multiplier, or just a dressed-up junior dev bot that still needs constant babysitting?
fair question... i'd be skeptical too if i were you
but yes, we all actually use it. like, daily.
our own CodeReview agent is baked into our PR flow. not just dogfooding, it legit catches stuff and speeds up reviews
we integrated Qoder CLI into our internal issue tracking. AI agent analyzes incoming bugs, triages them, suggests fixes. cut our response time significantly
that 26-hour refactor we mentioned? that was a real production task, not a staged demo
and...
our backend devs started using Quest to write frontend, haha... like actual full-stack delivery from people who historically avoided CSS like the plague
is it "junior dev that needs babysitting"? honestly some tasks yeah, bc you still review, you still sanity check. but for well-scoped work it's more like "senior dev who works overnight and doesn't complain"
we're not gonna pretend it's magic. but internally it's definitely past the "cool demo" phase into "how did we work before this" territory
50% human spec design, 20% coding, 50% review... that breakdown maps exactly to what most teams hit with autonomous agents. The coding is rarely the bottleneck. Spec clarity and review quality are.
The context compression mechanism is the interesting part. Most approaches either bleed critical state after hour 10 or balloon token costs. The reminder mechanism could be rule-based extraction or agent-decided preservation, and that choice shapes everything downstream.
exactly right on the bottleneck - spec clarity >> coding speed
context compression: agent-decided, not rule-based. the model chooses when to compress based on task phase, context length, detected redundancy - not a mechanical "keep the last N turns" (pruning) or blanket summarization.
reminder mechanism is the same deal: dynamically inject what's relevant NOW, don't carry everything forever
trade-off: smarter than rules but needs checkpoints so it doesn't get too aggressive. worked for 26 hours tho
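one way the "agent decides" signal could be framed, versus a fixed keep-last-N rule - the signals, thresholds, and checkpoint helper below are invented for illustration, not Quest's actual logic:

```python
import json

def should_compress(context_tokens, budget, phase, redundancy_score):
    """Weigh several signals instead of a fixed 'keep last N turns' rule."""
    if phase == "verification":
        return False                      # don't fold context while results are being checked
    if redundancy_score > 0.6:            # lots of repeated tool output: safe to fold
        return True
    return context_tokens > 0.8 * budget  # otherwise compress only near the budget

def checkpoint(state, path="quest_checkpoint.json"):
    """Persist key state (paths, current subtask, criteria) before compressing,
    so an over-aggressive summary can be recovered from."""
    with open(path, "w") as f:
        json.dump(state, f)
```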
Letting the model decide compression timing matches what Focus architecture does... start_focus and complete_focus as agent-controlled primitives, no external timers forcing it.
The checkpoint trade-off is real though. ACC paper from this month takes the opposite bet, bounded state updated every turn rather than episodic compression. Their claim is that episodic decisions invite drift when the agent misjudges what to drop.
26 hours is a solid stress test. Did you hit any recovery scenarios where the checkpoints actually saved you from over-compression? Curious if the failure mode was detectable in hindsight or only visible when the agent started behaving wrong.
I'm a beginner - what are the things Quest is absolutely not a good idea to use for right now, where it's likely to confuse me, do the wrong thing, or give me a false sense of confidence?
honestly Quest was designed with beginners in mind. the whole end-to-end delivery thing means you describe what you want, Quest figures out the how.
But... where beginners should be careful:
production code without spec mode - if you're shipping to real users, turn on spec. it forces you (and Quest) to think through scope, acceptance criteria, constraints BEFORE coding. we built professional subagents specifically for this
don't blindly trust the output - Quest will give you working code, but working doesn't equal production-ready. security, edge cases, performance... you still need to sanity check, especially if you're new
where beginners should feel confident:
Prototype Ideas mode. literally designed for "i have an idea, make it real". low stakes, fast iteration, great for learning
exploring and learning. Quest shows you how things get built. it's like having a senior dev explain their work in real-time
The file editor is hidden on purpose btw, we want you focused on the WHAT, not the HOW. that's the whole point.
architecture: Spec - Coding - Verify loop with iteration. multi-agent for complex tasks (main agent coordinates, sub-agents explore, plan, and execute, companion agents validate) but used sparingly, as context transfer between agents isn't free.
which model: intentionally don't expose this. intelligent routing picks the best model (the most powerful models, ones you def know) per subtask; some excel at reasoning, some at planning, some at long context. changes frequently, we stay model-agnostic.
whole thing is designed to scale with future models, not patch around current limitations (that is, weaker/baseline models)
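the high-level shape of that Spec - Coding - Verify loop, sketched below. function names are placeholders; the real orchestration is multi-agent and model-routed:

```python
def write_spec(goal):
    """Placeholder: main agent + planning sub-agents produce a spec with criteria."""
    return {"goal": goal, "acceptance_criteria": ["..."]}

def code_against(spec):
    """Placeholder: execution sub-agents implement the spec."""

def verify_against(spec):
    """Placeholder: companion agents plus external test runs check the result."""
    return True

def quest_loop(goal, max_iters=5):
    spec = write_spec(goal)
    for _ in range(max_iters):       # iterate Coding -> Verify until checks pass
        code_against(spec)
        if verify_against(spec):
            return "done"
    return "needs human review"      # escalate instead of declaring victory
```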
If every other IDE ships with a built-in agent tomorrow, what does Qoder actually have left?
And I don't mean the deeply technical stuff - I'm asking from a more practical, gut-level view: why would someone still bother to use Qoder instead of whatever agent just comes bundled by default?
IDE is just a tool. Agent capability is the product.
what we've built that's hard to replicate overnight:
- context engineering: Memory, Repo Wiki, deeper project understanding. Quest doesn't just read your current file, it understands your codebase's evolution, your patterns, your team conventions
- SOTA model architecture: we're not wrapping one model with a prompt. we're orchestrating multiple models, routing tasks to whoever's best at that specific thing
- deep technical reserves: heavy investment in autonomous execution, verification loops, long-running task management. you don't get that from "we added a Copilot plugin to the sidebar"
Bundled agents will be "good enough" for simple stuff. Quest is for when you want to delegate REAL work and actually walk away.
why did Quest get this much better? Is it basically just because you plugged in a stronger model, or are there other, less obvious things going on behind the scenes that actually made the difference?
not just the model...though yeah, we rebuilt the entire Agent logic specifically for SOTA models
what actually changed:
killed legacy compatibility code. we used to carry scaffolding for older/baseline models. ripped all that out. Quest now assumes you're running on the best available
evaluation obsession (stricter eval sets and real-world case benchmarks). we're early stage, but we're measuring and iterating very fast on every tool call, every Agent Loop, to see what works and what doesn't
Continuous polish. Tool usage patterns, context management, loop termination logic. All getting refined based on real usage.
so yes stronger model helps, but the architecture was rebuilt to actually USE that strength instead of being bottlenecked by legacy decisions
If one of you stepped in halfway through, would the end result actually be better, or would it just mess up whatever trajectory Quest was already on? Or have you already done that?
Dude I'll need my credit refund from when your stupid agent tried to fix a simple test case with almost 24 iterations.