
AI models like GPT, Claude, and Gemini feel slow because their reasoning happens at runtime. The central issue isn't just model size; it's that these models solve problems on demand instead of up front. Anthropic, for example, offers Opus, Sonnet, and Haiku, but those variants primarily trade off speed, accuracy, and cost; they don't change how the work gets done.
The typical agent asks an LLM to think through the schema, plan its steps, and then output SQL (or code, or a paragraph). That reasoning shows up as extra tokens and extra processing time: we are relying on inference to do what training could have already settled. This doesn't mean the thinking isn't necessary or valuable; it means that when speed is critical, the place to do it is in the weights and the architecture, not at runtime.
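To make that concrete, here is a minimal sketch of the runtime-reasoning pattern. The client interface, prompt wording, and helper names are illustrative assumptions, not any particular product's API.

```python
# A sketch of the runtime-reasoning baseline described above.
# `llm_client` is a hypothetical wrapper around any general-purpose LLM API.

def baseline_sql_agent(llm_client, question: str, full_schema: str) -> str:
    """Ask a general-purpose LLM to reason about the schema before answering.

    Every call pays for the reasoning tokens it emits along the way.
    """
    prompt = (
        "You are a data analyst.\n"
        f"Schema:\n{full_schema}\n\n"
        f"Question: {question}\n\n"
        "Think step by step about which tables and columns are relevant, "
        "explain your plan, and then output the final SQL query."
    )
    # One round trip; the response contains the plan *and* the SQL,
    # so latency grows with the length of the reasoning.
    return llm_client.complete(prompt)
```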
The main idea: shift reasoning into model training, not runtime. Instead of having the model think through the problem each time, train it so that decision-making is embedded in the weights. At inference time, the input maps directly to the output, such as a SQL query, without intermediate steps. This reduces latency and cost by eliminating step-by-step reasoning at inference time.
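Here is roughly what that looks like at inference time, assuming a small fine-tuned model served with Hugging Face Transformers; the model name is a placeholder for whatever specialist you train.

```python
# The "reasoning baked into the weights" contrast: the fine-tuned model maps
# question + schema straight to SQL, with no intermediate plan.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/sql-specialist-1b"  # hypothetical trained specialist
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def direct_sql(question: str, filtered_schema: str) -> str:
    """Input maps directly to SQL: generation stays short because the model
    only emits the query, not a plan."""
    prompt = f"-- schema:\n{filtered_schema}\n-- question: {question}\n-- sql:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Strip the prompt tokens and return only the generated query.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```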
A second core idea is specialization over size. Bigger general models can reason well (and, unsurprisingly, cost more to run), but they do so by spending tokens on internal steps. A smaller, code-focused model trained on the right examples can generate correct SQL without any internal planning. Smaller models fit into faster memory, cost less to run, and respond in tens of milliseconds rather than seconds.
To achieve that, you need large volumes of training data where correct SQL queries are paired with real inputs. During training, you use a stronger model as a “teacher” to generate high-confidence SQL for filtered and idealized schemas. Then you train the smaller “student” model on the full, complex schemas the business actually uses, forcing it to internalize how to ignore irrelevant tables and columns. The result is a model that inherently knows how to filter noise and generate correct queries, without runtime reasoning.
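A rough sketch of that pipeline, with the teacher client, schema helpers, and execution check stubbed out as assumptions:

```python
# Teacher/student data generation: the teacher sees an idealized, filtered
# schema; the student is trained on the full, noisy schema so filtering
# irrelevant tables becomes part of what it learns.

def build_training_examples(teacher_llm, questions, full_schema, relevant_tables):
    examples = []
    for question in questions:
        # 1. Hand the teacher only the tables known to matter for this question.
        clean_schema = filter_schema(full_schema, relevant_tables[question])
        sql = teacher_llm.complete(
            f"Schema:\n{clean_schema}\nQuestion: {question}\nReturn only the SQL."
        )
        # 2. Keep only queries that actually execute and return sane results.
        if executes_correctly(sql):
            # 3. Pair the answer with the *full* schema for the student.
            examples.append({
                "input": f"-- schema:\n{full_schema}\n-- question: {question}\n-- sql:\n",
                "output": sql,
            })
    return examples

def filter_schema(full_schema: str, tables: list[str]) -> str:
    """Placeholder: keep only DDL lines that mention the given tables."""
    return "\n".join(line for line in full_schema.splitlines()
                     if any(t in line for t in tables))

def executes_correctly(sql: str) -> bool:
    """Placeholder for an execution check against a test database."""
    return True
```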
Improving quality here isn't done with clever prompts or bigger models at runtime. It's done with data scale and careful alignment. After supervised learning, you can refine with a reward signal: did the generated query run and return the right result? That aligns the model with correct results and discourages hallucinated tables and columns.
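A minimal version of that reward check, assuming a SQLite copy of the target database and a gold query to compare against; a real pipeline would also handle timeouts, result ordering, and partial credit.

```python
import sqlite3

def execution_reward(db_path: str, generated_sql: str, gold_sql: str) -> float:
    """Reward 1.0 if the generated query runs and returns the same rows as the
    gold query, 0.0 otherwise. Used to refine the model after supervised training."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            predicted = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return 0.0  # query didn't even run: no reward
        gold = conn.execute(gold_sql).fetchall()
        # Compare as multisets so row order doesn't matter.
        return 1.0 if sorted(map(repr, predicted)) == sorted(map(repr, gold)) else 0.0
    finally:
        conn.close()
```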
Architecture matters as much as model training. A fast retriever should identify the relevant tables and columns in milliseconds before the model sees anything, so the model runs on a filtered context. That pre-filtering removes noise instead of expecting the model to reason through it. The separation of concerns is a familiar pattern: do the cheap work first, then let the trained model do the one thing it exists to do, which is generate the answer.
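For example, a lightweight embedding index over table descriptions can do the filtering; the encoder below is an off-the-shelf choice used for illustration, not a requirement.

```python
# A sketch of the fast retrieval step: rank tables by how close their
# descriptions are to the question, then pass only the top-k to the SQL model.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly encoder

def retrieve_relevant_tables(question: str, table_descriptions: dict[str, str], k: int = 5):
    """Return the k tables whose descriptions best match the question."""
    names = list(table_descriptions)
    table_emb = encoder.encode([table_descriptions[n] for n in names], convert_to_tensor=True)
    question_emb = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(question_emb, table_emb)[0]
    ranked = sorted(zip(names, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

In practice you would pre-compute and cache the table embeddings so only the question needs to be encoded per request, keeping this step well under the millisecond budget.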
Once you have a small, specialized model, deployment options open up. You can quantize aggressively, run on inference engines optimized for single-request latency, and colocate the model near the database. At that point, responses get faster and more precise.
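As one example, 4-bit quantization at load time (here via the bitsandbytes integration in Transformers, assuming a GPU host colocated with the database) shrinks the memory footprint dramatically; the model name is again a placeholder for your trained specialist.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights: far smaller memory footprint
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained("your-org/sql-specialist-1b")  # hypothetical
model = AutoModelForCausalLM.from_pretrained(
    "your-org/sql-specialist-1b",
    quantization_config=bnb_config,
    device_map="auto",
)
```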
The speed problem is solved not at runtime, but through a combination of training choices and architecture. Internalize logic within the model, specialize and shrink models, separate schema work from answer generation, and optimize runtime to achieve the best combination of speed, cost, and accuracy.