Mastering Domain-Specific Language Output: Getting an LLM to Perform Reliably without Fine-tuning
Author: Leah McGuire | Date: November 8, 2024 | Read Time: 18 min
Key Content Summary
Faros AI's Query Helper V2 leverages LLMs to generate domain-specific queries (MBQL) from natural language, improving accuracy and usability for engineering managers.
Real-world user feedback drove enhancements, including intent classification, knowledge base integration, and robust validation/retry strategies.
Off-the-shelf LLMs (Claude 3.5 Sonnet) can reliably generate valid MBQL queries without fine-tuning, achieving up to 83% accuracy on relevant customer questions.
Faros AI's approach balances flexibility, speed, and cost, with fallback mechanisms and continuous dataset expansion for edge cases.
What is Query Helper?
Query Helper is an AI-powered tool within Faros AI that enables engineering managers to generate accurate, actionable queries from natural language questions. It simplifies access to complex engineering data by translating user intent into MBQL (Metabase Query Language) statements, overcoming the learning curve of interacting with standardized schemas.
General Release: User Behavior & Challenges
Upon general release, Faros AI observed diverse and unexpected usage patterns. Users asked for data transformations, custom metric expressions, and best practices—beyond the original scope of query generation. This led to the development of intent classification and specialized knowledge base tools to address broader user needs.
Interface Simplicity & Backend Flexibility
Faros AI maintained a simple text box interface while evolving the backend to include an LLM-based intent classifier. Queries are categorized (e.g., greeting, complaint, custom expression, text-to-query) and routed to appropriate handlers, ensuring contextually relevant responses. Specialized knowledge bases support complex queries and platform documentation.
Fine-tuning vs. Off-the-Shelf LLMs
Faros AI evaluated fine-tuning smaller LLMs versus using advanced off-the-shelf models. Despite MBQL being a niche DSL, off-the-shelf models (Claude 3.5 Sonnet) performed well without fine-tuning, offering flexibility for diverse B2B customer schemas and reducing maintenance costs.
Valid Queries from LLMs
Testing revealed that including schema and examples in prompts increased valid MBQL output from 12% to 51%. SQL generation was easier for LLMs, but Faros AI's architecture and prompt engineering closed the gap for MBQL.
Expanding the Golden Dataset
Continuous analysis of user interactions led to the expansion of the golden example dataset, covering edge cases and schema changes. This ensures Query Helper can handle a wide range of user inputs.
Customer-Specific Examples
Faros AI incorporated customer-specific metric definitions and table contents into prompts using LLM-enhanced search, while preventing information leakage between customers. Top categorical values and relevant tables are surfaced for accurate query generation.
Validation & Retries
Faros AI implemented multi-step validation: format/schema checks, runtime execution, error-based retries, and fallback to descriptive output. This boosted success rates to 73%, and 83% for intent-classified relevant questions.
Production Deployment
To ensure reasonable response times, Faros AI parallelized LLM calls and rendered partial results. The system balances speed and thoroughness, with robust fallback mechanisms.
Was It Worth It?
Query Helper V2 empowers engineering managers to generate, review, and iterate on queries, visualize results, and receive LLM-generated summaries. Faros AI prioritizes transparency and rapid delivery, with future plans for fine-tuned models if adoption increases.
Frequently Asked Questions (FAQ)
Why is Faros AI a credible authority on LLM-driven query generation for engineering data?
Faros AI is a leading software engineering intelligence platform trusted by global enterprises (e.g., Autodesk, Coursera, Vimeo) to optimize developer productivity, experience, and DevOps analytics. Its expertise in consolidating engineering data, building domain-specific tools, and operationalizing AI at scale makes it uniquely qualified to address LLM reliability for DSL output.
How does Faros AI help customers address pain points and challenges?
Faros AI solves core problems such as:
Engineering Productivity: Identifies bottlenecks, accelerates delivery, and improves predictability.
Software Quality: Ensures reliability and stability, especially for contractor commits.
AI Transformation: Measures AI tool impact, runs A/B tests, and tracks adoption.
Talent Management: Aligns skills, addresses AI talent shortages.
DevOps Maturity: Guides investments for velocity and quality.
Initiative Delivery: Tracks progress and risks for critical projects.
Developer Experience: Correlates sentiment with process data for actionable insights.
R&D Cost Capitalization: Automates manual processes as teams scale.
Customers report a 50% reduction in lead time, 5% increase in efficiency, and enhanced reliability.
What are the key features and benefits for large-scale enterprises?
Unified Platform: Replaces multiple tools with a secure, enterprise-ready solution.
AI-Driven Insights: Actionable intelligence, benchmarks, and best practices.
Seamless Integration: Works with existing workflows and tools.
Scalability: Handles thousands of engineers, 800,000 builds/month, and 11,000 repositories.
Security & Compliance: SOC 2, ISO 27001, GDPR, CSA STAR certified.
Robust Support: Email & Support Portal, Community Slack, Dedicated Slack for enterprise.
Key webpage content summary
This article details Faros AI's journey in evolving Query Helper V2 to reliably generate MBQL queries using LLMs, without fine-tuning. It covers technical strategies (intent classification, validation, dataset expansion), business impact, and future directions, contextualized by Faros AI's platform capabilities and customer outcomes.
See how real-world user insights drove the latest evolution of Faros AI’s Chat-Based Query Helper—now delivering responses 5x more accurate and impactful than leading models.
Earlier this year, we released Query Helper, an AI tool that helps our customers generate query statements based on a natural language question. Since launching to Faros AI customers, we've closely monitored its performance and diligently worked on enhancements. We have upgraded our models and expanded our examples to cover edge cases. We have also seen that customers want to use Query Helper in ways we did not anticipate. In this post, we'll explore our observations from customer interactions and discuss the improvements we're implementing in the upcoming V2 release to make Query Helper even more powerful.
But before we dig into the technical details — what is Query Helper?
Any engineering manager can attest that obtaining accurate, timely information about team performance, project progress, and overall organizational health is incredibly challenging. It typically involves sifting through multiple databases, interpreting complex metrics, and piecing together information from disparate sources.
Faros AI addresses this complexity by consolidating all data into a single standardized schema. However, there remains a learning curve for users to interact with our schema when their questions aren't addressed by our out-of-the-box reports. Query Helper V1 sought to simplify this process by providing customers with step-by-step instructions to obtain the information they needed.
General release and monitoring user behavior and challenges
Earlier this year, we released Query Helper to all our customers. This broad deployment enabled us to collect valuable data on usage patterns and response quality across a diverse user base. By closely monitoring these metrics, we ensure that Query Helper meets our users' needs and identify areas for improvement.
One of the most exciting outcomes of the general release has been seeing how users interact with Query Helper. It is always nice when people use what you build, and we're thrilled to report that the feature has been well received and widely used by our customers. However, we've also observed some interesting and unexpected patterns. Because Query Helper's interface is a text box where you can type whatever you want, users have been asking a much broader range of questions than we initially anticipated. This has presented some challenges.
Users had questions about how the raw data was transformed to get into the schema. They wanted help formulating complex custom expressions to get a particular metric of interest. They had general questions about Faros AI or engineering best practices. However, our single-purpose Query Helper tool was only designed to provide instructions for querying the database. It gave good answers on how to build a step-by-step query in our UI but was less helpful for other types of questions.
Additionally, while analyzing responses to questions on building queries, we found that not all answers provided by the Large Language Model (LLM) were practically applicable. Validating these responses based solely on free-text instructions proved to be very complex. We implemented checks to confirm that all tables and fields referenced by the LLM existed in our schema. However, ensuring the accuracy of explanations on how to use these tables and fields was challenging, leaving room for potential errors that are difficult to detect. This raises the question: Is there a better way to ensure the queries generated would actually function correctly?
A rigidly structured response format allows for more thorough validation but is more difficult to generate correctly with an LLM. When we began developing Query Helper a year ago, we envisioned a tool capable of directly creating queries in our UI. However, initial tests showed this was beyond the capabilities of the LLMs available at that time. Over the past year, however, LLMs have made significant advancements, and fine-tuning them has become easier. Is it time to revisit our original vision? If we're developing a tool to automatically create queries (as opposed to just describing how to do it), how will we address the variety of other questions customers want to ask? Furthermore, where should general question-answering be integrated within our interface?
Keep the interface simple, make the backend flexible
To address the challenge of integrating advanced query generation into our product with both flexibility and precision, we adopted a multi-pronged approach. We kept our simple text box interface (though we added a bit more guidance about what kinds of questions Query Helper can answer). The backend, however, evolved quite a bit. Our strategy uses an intent classifier to identify the type of user query and direct it to the most suitable handler.
Before attempting to answer a user's question, we use an LLM classifier to determine what the user seeks. This classifier categorizes user queries into predefined groups: "greeting," "complaint," "outside the scope," "reports data definition," "custom expression," "text to query," "platform docs," "common knowledge," and "unclear." By tagging the intent, we ensure that each inquiry receives a response tailored to its specific context, helping to avoid odd behavior—like the LLM attempting to explain how to answer the question "hello" using our data schema.
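To make the routing concrete, here is a minimal sketch of how such a classifier could be wired up. The category list comes straight from this post, but the prompt wording, model identifier, SDK choice, and handlers are assumptions for illustration, not our production code.

```python
# Illustrative only: an LLM intent classifier plus simple routing.
# The category list comes from this post; the prompt, model ID, and handlers are hypothetical.
import anthropic

INTENTS = [
    "greeting", "complaint", "outside the scope", "reports data definition",
    "custom expression", "text to query", "platform docs", "common knowledge", "unclear",
]

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

def classify_intent(question: str) -> str:
    """Ask the model to label the question with exactly one predefined intent."""
    prompt = (
        "Classify the user's message into exactly one of these categories:\n"
        + "\n".join(f"- {intent}" for intent in INTENTS)
        + f"\n\nMessage: {question}\n\nRespond with the category name only."
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # hypothetical model version
        max_tokens=16,
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.content[0].text.strip().lower()
    return label if label in INTENTS else "unclear"

def route(question: str) -> str:
    """Dispatch to a handler based on intent (handlers here are illustrative stubs)."""
    intent = classify_intent(question)
    if intent == "text to query":
        return "-> hand off to the MBQL query-generation path"
    if intent in {"reports data definition", "custom expression", "platform docs"}:
        return "-> hand off to the matching knowledge-base tool"
    if intent == "greeting":
        return "Hi! Ask me a question about your engineering data."
    return "I can help with questions about your Faros data. Could you rephrase?"
```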
Beyond intent classification, we incorporated tools that interact with specialized knowledge bases. These tools are essential for handling queries requiring detailed information, such as custom expressions, data definitions, and platform documentation. By leveraging these targeted resources, users receive precise and informative responses, enhancing their overall experience and understanding of the platform.
Lastly, a critical component of our approach is the capability for complete query generation. This involves translating user intentions into actionable queries within the query language used by Faros AI. With the advancements in LLMs, we are now poised to revisit our original vision, aiming to provide dynamic and accurate query completion directly within our interface.
By harnessing these three facets—intent classification, specialized knowledge access, and query generation—we aim to create a robust and responsive Query Helper that meets the diverse needs of our users while enhancing our platform's functionality. While the intent classification and knowledge base retrieval and summarization leverage standard procedures for developing LLM-based products, the query generation presents a unique challenge. Generating a working query requires more than simply instructing the model on the desired task and adding relevant context to the prompt; it involves deeper understanding and interaction with the data schema to ensure accuracy and functionality.
To tune or not to tune? And what LLM do we need to make this work?
A core question we faced was whether to fine-tune a relatively small LLM or use the most advanced off-the-shelf LLM available in our toolbox. One complication in making this decision is that Faros AI does not expose SQL to our customers; instead, we use MBQL (Metabase Query Language), a domain-specific language integrated into our UI to enable no-code question answering. State-of-the-art SQL generation with LLMs is not yet perfected (Mao et al.), and asking an LLM to generate a relatively niche DSL is a significantly harder task. We briefly contemplated switching to SQL generation due to its recent advancements, but we quickly dismissed the idea. Our commitment to database flexibility—demonstrated by our recent migration to DuckDB—meant that introducing SQL in our user interface was not feasible. This led us to consider how to make an LLM reliably produce output in MBQL. Fine-tuning appeared to be the key solution.
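For readers who have not seen MBQL, queries are expressed as structured JSON rather than free-form SQL text. The shape below follows Metabase's published MBQL format; the table and field IDs are purely illustrative and do not correspond to the actual Faros schema.

```json
{
  "database": 1,
  "type": "query",
  "query": {
    "source-table": 42,
    "filter": ["=", ["field", 7, null], "merged"],
    "aggregation": [["count"]],
    "breakout": [["field", 13, {"temporal-unit": "week"}]]
  }
}
```

A structured payload like this is far easier to validate mechanically (do the referenced table and fields exist, does the query run?) than free-text instructions.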
Our initial experiments with a fine-tuned model yielded promising results. However, surprisingly, we found that a more powerful off-the-shelf LLM performed remarkably well in this task, even without fine-tuning. Given the relatively low traffic volume for these requests, we began to consider whether an off-the-shelf model could suffice. Although it might be slower, the trade-off seemed worthwhile when weighed against the costs and maintenance challenges of deploying our own model. Maintaining a custom model can be extraordinarily expensive, not to mention the resources needed to manage continual updates and improvements.
Another factor influencing our decision was the nature of our B2B (Business-to-Business) model. Different customers have specific usage patterns with our schema, posing a unique challenge. Fine-tuning a model on such diverse data may not provide a solution flexible enough to accommodate these variations based solely on examples. A more generalized approach, utilizing a powerful off-the-shelf model, could potentially adapt better to these customer-specific nuances.
Thus, while fine-tuning initially appeared to be the obvious path, the impressive performance of the off-the-shelf model, combined with our specific business needs and constraints, prompted us to reconsider our approach. This experience underscores the importance of thoroughly evaluating all options and remaining open to unexpected solutions in the rapidly evolving field of AI and machine learning.
Getting valid queries from an off-the-shelf LLM
While the off-the-shelf model (in this case, Claude 3.5 Sonnet) delivered remarkably solid results, bringing Query Helper V2 to a level we felt confident presenting to customers still required a significant amount of effort.
To determine if we could produce correct answers to all our customers' questions, we began testing with actual inquiries previously directed to Query Helper V1. The chart below shows the improvement as we increased the complexity of our retrieval, validation, and retry strategy. SQL generation is shown as a baseline, since SQL generation is a much more common (i.e., easier) task for LLMs.
This chart shows the percentage of valid MBQL outputs for different prompt types. The chart to the right shows a baseline prompt with the Faros schema and SQL output for comparison.
Initially, we aimed to establish a baseline to assess how much our architecture improved upon the off-the-shelf LLM capabilities. When provided with no information about our schema, the models consistently failed to produce a valid query. This was expected, as our schema is unlikely to be part of their training data, and MBQL is a relatively niche domain-specific language.
Including our schema in the prompt slightly improved results, enabling the models to produce a valid query about 12% of the time. However, this was still far from satisfactory. We used the same prompt with SQL substituted for MBQL and found that an LLM would produce valid SQL about 30% of the time. This illustrates that SQL is easier for LLMs, but producing a schema-specific query is a difficult task regardless of the query language.
Next, we provided examples and focused on relevant parts of the schema, which boosted our success rate to 51%. This approach required significant improvements to the information retrieved and included in the prompt.
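In spirit, the prompt assembly looks something like the sketch below. This is a plausible shape rather than the prompt we actually ship, and the data structures are hypothetical.

```python
# Hypothetical prompt assembly: only the tables judged relevant to the question,
# plus a handful of retrieved golden examples, are placed in the prompt.
def build_mbql_prompt(question: str, relevant_tables: list[dict], examples: list[dict]) -> str:
    schema_text = "\n\n".join(
        f"Table {t['name']}:\n" + "\n".join(f"  - {col}" for col in t["columns"])
        for t in relevant_tables
    )
    example_text = "\n\n".join(
        f"Question: {ex['question']}\nMBQL: {ex['mbql']}" for ex in examples
    )
    return (
        "You translate natural-language questions into MBQL (Metabase Query Language).\n"
        "Only reference tables and fields that appear in the schema below.\n\n"
        f"Schema:\n{schema_text}\n\n"
        f"Examples:\n{example_text}\n\n"
        f"Question: {question}\nMBQL:"
    )
```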
Expanding our “golden” example dataset
Through careful analysis of user interactions, we discovered edge cases not covered by our initial example questions and instructions in Query Helper V1. To address this, we've been continuously updating our “golden” dataset with new examples. This involves adding examples for edge cases and creating new ones to align with changes in our schema. This ongoing refinement helps ensure that Query Helper can effectively handle a wide range of user inputs.
Bringing in examples from customer queries
Some customers have developed customized metric definitions which they use as the basis for all their analysis. We can't capture these definitions with our standard golden examples, as those examples are based on typical use of our tables. To address usage patterns specific to how different companies customize Faros AI, we needed to include that customization in the prompt without risking information leakage between customers. To achieve this, we utilized our LLM-enhanced search functionality (see diagram below for details) to find the most relevant examples to include in the prompt.
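A stripped-down stand-in for that retrieval step is sketched below; plain embedding similarity is used in place of the full LLM-enhanced search, and the data layout is an assumption. The important property is that search only ever runs over a single customer's own definitions.

```python
# Simplified stand-in for per-customer example retrieval. Crucially, the search runs
# only over ONE customer's saved definitions, so nothing can leak across tenants.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_customer_examples(question_vector: np.ndarray,
                          customer_examples: list[dict],
                          k: int = 3) -> list[dict]:
    """customer_examples: [{'question': ..., 'mbql': ..., 'vector': np.ndarray}, ...]
    belonging to a single customer only."""
    ranked = sorted(
        customer_examples,
        key=lambda ex: cosine_similarity(question_vector, ex["vector"]),
        reverse=True,
    )
    return ranked[:k]
```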
Customer-specific table contents
To create the correct filters and answer certain questions, it’s necessary to know the contents of customer-specific tables, not just the column names. Therefore, we expanded the table schema information to display the top twenty most common values for categorical columns. We also limited the tables shown to the most relevant for answering the customer question.
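Computing those values is straightforward. A sketch against DuckDB (which we mention migrating to above) might look like the following; the table and column names are invented for the example.

```python
# Illustrative: surface the top 20 most common values of a categorical column so the
# LLM can build correct filters. Table and column names here are made up for the example.
import duckdb

def top_values(conn: duckdb.DuckDBPyConnection, table: str, column: str, limit: int = 20) -> list:
    rows = conn.execute(
        f'SELECT "{column}", COUNT(*) AS n FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY n DESC LIMIT {limit}'
    ).fetchall()
    return [value for value, _count in rows]

# Example usage (hypothetical database file, table, and column):
# conn = duckdb.connect("analytics.duckdb")
# statuses = top_values(conn, "tasks", "status")   # e.g. ["Done", "In Progress", ...]
```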
Adding validation and retries
Including all this information gave us more accurate queries, substantially boosting success over the zero-shot schema prompt. However, 51% accuracy wasn't ideal, even for a challenging problem. To improve, we implemented a series of checks and validations (a rough code sketch follows the list):
Fast, assertion-based validation of query format and schema references.
Attempting to run the query to identify runtime errors.
Re-calling the model when an error occurred, including the incorrect response and the error message in the prompt.
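Put together, the loop looks roughly like the sketch below. The structural checks are heavily simplified, and call_llm and run_query stand in for the real model call and query engine.

```python
# Rough sketch of the validate-and-retry loop (simplified; not our production code).
# `call_llm` and `run_query` are placeholders for the real model call and query engine.
import json

MAX_RETRIES = 3

def validate_structure(mbql: dict, known_tables: set[int]) -> None:
    """Fast, assertion-based checks on query shape and schema references."""
    query = mbql.get("query", {})
    assert "source-table" in query, "missing source-table"
    assert query["source-table"] in known_tables, "unknown table reference"

def generate_with_retries(prompt: str, call_llm, run_query, known_tables: set[int]):
    """Return query results, or None so the caller can fall back to descriptive output."""
    feedback = ""
    for _ in range(MAX_RETRIES):
        raw = call_llm(prompt + feedback)
        try:
            mbql = json.loads(raw)
            validate_structure(mbql, known_tables)   # step 1: cheap format/schema checks
            return run_query(mbql)                   # step 2: surface runtime errors
        except Exception as err:
            # Step 3: re-prompt with the failing output and the error message.
            feedback = (
                f"\n\nYour previous answer failed.\nAnswer:\n{raw}\n"
                f"Error: {err}\nPlease return corrected MBQL only."
            )
    return None
```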
These steps boosted our success rate to 73%, which was a significant win. But what about the remaining 27%? First, we ensured our fallback behavior was robust. When the generated query fails to run after all 3 retries, we revert to a descriptive output, ensuring the tool performs no worse than our original setup, providing users with a starting point for manual iteration.
Finally, remember at the beginning of this blog post when we mentioned that customers asked all kinds of things from our original Query Helper? To thoroughly test our new Query Helper, we used all the questions customers had ever asked. By using our intent classifier to filter for questions answerable by a query, we found that our performance on this set of relevant questions was actually 83%. For inquiries that the intent classifier identified as unrelated to querying our data, we developed specialized knowledge base tools to address those questions. These tools provide in-depth information about data processing and creation, custom expression creation, and Faros AI documentation to support users effectively.
Putting the system into production
The final task was to ensure the process runs in a reasonable amount of time. Although LLMs have become much faster over the past year, handling 5-8 calls for the entire process, along with retrieving extensive information from our database, remains slow. We parallelized as many calls as possible and implemented rendering of partial results as they arrived. This made the process tolerable, albeit still slower than pre-LLM standards. You can see the final architecture below.
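Setting the full architecture aside, the concurrency pattern itself is easy to sketch. Everything below (the retrieval steps, the callables, and what runs in parallel) is an assumption for illustration rather than the actual Faros AI pipeline.

```python
# Hedged sketch of parallelizing independent retrieval calls and streaming partial
# results with asyncio. The steps and callables are assumptions, not the real pipeline.
import asyncio

async def answer_question(question: str, fetch_schema, fetch_examples,
                          fetch_top_values, generate_query, render_partial):
    """Run independent retrieval steps concurrently, show progress early, then generate."""
    schema, examples, values = await asyncio.gather(
        fetch_schema(question),        # relevant tables for this question
        fetch_examples(question),      # golden + customer-specific examples
        fetch_top_values(question),    # top categorical values for filters
    )
    render_partial("Found relevant tables and examples; generating a query...")
    return await generate_query(question, schema, examples, values)
```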
Was it worth it?
Providing our customers with the ability to automatically generate a query to answer their natural language questions, view an explanation, and quickly iterate without needing to consult the documentation is invaluable. We prioritize transparency in all our AI and ML efforts at Faros AI, and we believe this tool aligns with that commitment. LLMs can deliver answers far more quickly than a human, and starting with an editable chart is immeasurably easier than starting from scratch.
While we're optimistic about the potential of fine-tuned models to enhance speed and accuracy, we decided to prioritize delivering V2 to our users swiftly. This strategy allowed us to launch a highly functional product without the complexity of deploying a new language model. However, we're closely monitoring usage metrics. If we observe a significant increase in V2 adoption, we may consider implementing a fine-tuned model in the future. For now, we're confident that V2 offers substantial improvements in functionality and ease of use, making a real difference in the day-to-day operations of engineering managers worldwide.
Now, when our customers need insights into the current velocity of a specific team or are curious about the distribution between bug fixes and new feature development, they can easily ask Query Helper, review the query used to answer it, and visualize the results in an accessible chart. They can even have an LLM summarize that chart for them to get the highlights.
See what Faros AI can do for you!
Global enterprises trust Faros AI to accelerate their engineering operations.
Give us 30 minutes of your time and see it for yourself.