The Spark of an Idea: Creating More Trustworthy AI
It all began in my AI Trustworthiness course. As the semester drew to a close, we were assigned a final project: create something that would make AI systems more reliable. While there seem to be many ways to make AI more trustworthy, I had an idea rooted in one of my areas of focus: reasoning LLMs.
At its core, I believed that a truly trustworthy AI should know when it doesn't know something, and then systematically seek out that information rather than hallucinate answers. Much like how we trust experts who admit knowledge gaps and then consult references, an AI that can search, reason properly with found information, and arrive at correct answers would be fundamentally more reliable.
I had been interested in large language models' reasoning capabilities, particularly how they sometimes produce more accurate results when encouraged to "think step by step." I had previously studied advanced techniques such as Group Relative Policy Optimization (GRPO), Mixture of Experts (MoE), and Multi-head Latent Attention (MLA), especially after DeepSeek released their R1 model.
I thought about whether I could apply these techniques to train a smaller model to reason with greater reliability. The idea of training smaller models to search appropriately, reason carefully with found information, and avoid hallucination seemed promising as a project direction for building more trustworthy AI.
Understanding GRPO: Revolutionizing AI Training
When I first came across GRPO, I was immediately intrigued by its novel approach to training language models. Let me explain what makes it so special.
Imagine you're training a student to solve complex math problems. Traditional methods (like PPO - Proximal Policy Optimization) are like having two teachers: one who demonstrates how to solve problems (the policy) and another who predicts how well the student will do on each problem (the critic or value function). This second teacher requires just as much expertise as the first, making the whole setup expensive and resource-intensive.
GRPO takes a radically different approach. Instead of that second teacher, it has the student solve each problem multiple times using slightly different approaches. Then it compares these solutions against each other, rather than against some absolute standard. The student learns which approaches work better relative to their own other attempts.
To put it technically, GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm that eliminates the need for a critic network by generating multiple responses to the same prompt and using their average performance as a baseline. It's like learning from your own trial and error rather than relying on external feedback.
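To make the mechanics concrete, here is a minimal Python sketch of the group-relative advantage computation at the heart of GRPO (the reward values and group size are made up for illustration):

import numpy as np

def group_relative_advantages(rewards):
    # rewards: one scalar score per sampled response to the same prompt
    r = np.array(rewards, dtype=float)
    # The group mean stands in for the critic's value estimate as the baseline;
    # dividing by the standard deviation keeps update sizes comparable across prompts.
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four responses to one prompt, scored by some task-specific reward function
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.0])
# Responses scoring above the group average get positive advantages (reinforced);
# those below get negative advantages (discouraged).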
This approach brings several critical advantages:
- Resource Efficiency: By eliminating the critic network, GRPO cuts memory usage nearly in half, making it possible to train large models with limited computational resources.
- Simplified Training: Without the need to maintain and update two large networks simultaneously, the training process becomes more manageable.
- Natural Compatibility: For language models, it's often easier to generate multiple responses and compare them than to predict an absolute value for a given response.
GRPO in Action: Real-World Impact
The most impressive showcase of GRPO's potential came from DeepSeek AI's groundbreaking models:
DeepSeekMath, a 7B parameter model trained with GRPO, achieved remarkable mathematical reasoning abilities, performing at levels comparable to models many times its size. What caught my attention was how efficiently it learned to solve complex math problems without requiring massive computational resources.
Even more fascinating was DeepSeek-R1-Zero, a model trained using pure GRPO with no supervised fine-tuning. This model developed remarkable reasoning capabilities seemingly on its own, including self-verification and step-by-step thinking. It learned to question its own answers and think through problems methodically, much like a careful human problem-solver might.
This was exactly the kind of capability I wanted to instill in smaller models for my project: the ability to reason carefully, recognize knowledge gaps, and search for information when needed.
What made GRPO particularly suitable for my project was its focus on comparing different solutions to the same problem. This naturally encourages models to develop varied thinking strategies and self-correction mechanisms - precisely what I needed for a model that could reliably determine when to search for information versus when to trust its own knowledge.
Why GRPO Matters for the Future of AI
The significance of GRPO extends far beyond my little project. It represents a key advancement in making sophisticated AI training more accessible.
Before GRPO, training high-quality reasoning models required massive computational resources, putting it out of reach for many researchers and smaller organizations. By reducing resource requirements by up to 40%, GRPO helps democratize advanced AI development.
More importantly, GRPO models demonstrate a different kind of intelligence - one that's more reflective and self-correcting. Rather than simply predicting the next word based on patterns, these models learn to evaluate different approaches, compare solutions, and adjust their reasoning strategies.
This reflective capability is exactly what we need for more trustworthy AI systems - ones that know when they don't know something and can search for reliable information rather than hallucinating answers. The core of trustworthiness is often not having all knowledge built-in, but rather knowing when and how to seek information, just as we trust humans who acknowledge limits and consult references.
From Vision to Design: The Evolution of My Thinking
The discovery of GRPO crystallized my project goals. I wanted to create a system that would teach smaller LLMs to:
- Recognize when they don't know something
- Explicitly search for the missing information
- Reason with what they find in a thoughtful, step-by-step manner
- Avoid hallucinations by maintaining awareness of knowledge boundaries
This goal directly addressed a fundamental aspect of AI trustworthiness. When we use AI systems, we need them to provide accurate information and be transparent about what they know and don't know. An AI that pretends to know things it doesn't is fundamentally untrustworthy, while one that can honestly assess its knowledge gaps and then actively fill them through search and reasoning behaves more like a trusted advisor or researcher.
This vision shaped everything that followed. I envisioned a model that would demonstrate greater epistemic humility - knowing what it doesn't know - while still being helpful through its ability to search and reason.
"I'll need to create a simulated search environment," I realized. "The model needs to experience searching and finding information in a controlled setting."
This led to my first key design decision: creating a document pool with varying reliability levels. Just like the real internet, I would include authoritative sources (like academic papers), semi-reliable sources (like educational blogs), and unreliable sources (like forum discussions). This would teach the model to evaluate source credibility - a crucial skill for trustworthy AI.
Tackling the Overfitting Challenge
But there was another problem that kept me up thinking. If I used real-world information, the model might simply memorize the training data rather than learning the reasoning process. It might appear to be searching and reasoning when in reality it was just regurgitating memorized patterns.
"How do I ensure the model is genuinely learning to reason rather than memorize?" I wondered. This question pushed me to develop several innovative solutions simultaneously:
Solution 1: Randomized Document Pools
I realized that using the same document pool for every training instance would be dangerous. The model might learn shortcuts like "whenever I see 'quantum mechanics', retrieve document #5" instead of learning how to formulate good search queries based on analyzing what it knows and doesn't know.
The solution seemed obvious: generate unique document pools for each training instance. This would force the model to focus on the reasoning process rather than memorizing specific document-response pairs.
Solution 2: Placeholder Systems
Even with different document pools, there was still a risk that the model might learn to recognize patterns in names, dates, or terminology. This led me to design a placeholder system:
"What if I replaced all the specific names, terms, and values with placeholders that could be swapped out at runtime?" I thought.
Each document would use generic placeholders like <scientist_1>, <formula_name_formula_1>, or <var_formula_1_x> instead of real names. These placeholders would be replaced with randomly generated values when the documents were presented to the model.
This way, even if the formula or concept remained the same across training runs, the specific names and values would differ, forcing the model to focus on the structural relationships rather than memorizing specific terms.
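A minimal sketch of the kind of runtime substitution I had in mind (the name list, regex, and helper function are illustrative, not the actual implementation):

import random
import re

SCIENTIST_NAMES = ["Dr. Jane Wong", "Prof. Luis Ortega", "Dr. Amara Okafor"]

def instantiate(template, mapping=None):
    # Replace every <scientist_N> placeholder with a randomly chosen concrete name,
    # reusing the same value whenever the same placeholder appears again.
    mapping = dict(mapping or {})
    def replace(match):
        key = match.group(0)
        if key not in mapping:
            mapping[key] = random.choice(SCIENTIST_NAMES)
        return mapping[key]
    return re.sub(r"<scientist_\d+>", replace, template), mapping

text, used = instantiate("In 1921, <scientist_1> refined <scientist_1>'s earlier result.")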
Solution 3: Content Generation Strategy
I decided to structure my content into three reliability tiers:
- Authoritative sources: Like academic papers or official documentation, with accurate and complete information
- Semi-reliable sources: Educational blogs or articles with mostly correct but occasionally imprecise information
- Unreliable sources: Forum discussions with a mix of correct, incorrect, and debated information
This stratification would teach the model to evaluate source credibility and perform cross-validation - essential skills for trustworthy AI. I carefully controlled how much information was present in each tier to create useful learning scenarios.
The training curriculum began to take shape in my mind. I would start simple: teaching the model to perform a single search when it encounters a knowledge gap. Later, I would progress to more complex scenarios involving multiple searches and cross-validation between sources.
I sketched out the plan on paper:
- First, show the model complete examples with thinking, searching, and reasoning all laid out
- Then, provide partial thinking steps and let the model complete the process
- Finally, give only the problem and let the model handle the entire task independently
"This is how humans learn complex skills," I thought. "First through demonstration, then guided practice, and finally independent application."
Building the Document Pool: A Complex Ecosystem
As I developed my document pool architecture, I encountered a fundamental tension: I needed documents that were both:
- Interrelated: Documents had to reference each other and contain complementary information to teach cross-document reasoning and chained searches
- Independent: Documents had to be modular enough that I could easily swap them out or modify them without breaking the entire system
Finding this balance proved more challenging than I initially anticipated. If documents were too tightly coupled, changes to one document could invalidate others. If they were too independent, they wouldn't teach the model to pursue chains of references.
I attempted to solve this through a hierarchical document structure:
- Primary sources: Containing core information about formulas, entities, or concepts
- Secondary sources: Containing applications, examples, or extensions of concepts in primary sources
- Tertiary sources: Containing discussions, comparisons, or commentaries on primary and secondary sources
This structure would allow chains of references while maintaining some modularity. A secondary source could reference multiple primary sources, creating a web of information that would require sophisticated search and reasoning to navigate.

Technical Implementation Path
With these conceptual designs in mind, I began the technical implementation, proceeding through several stages:
Stage 1: Template Creation
I started by designing document templates for each reliability tier and domain. These templates contained placeholders for:
- Formula names, domains, and expressions
- Variable names and descriptions
- Author/scientist/user names
- Dates and locations
- Numerical values and units
Each template maintained structural consistency while allowing for content variability. The authoritative templates used academic formatting, the semi-reliable templates used blog-style formatting, and the unreliable templates used forum-style formatting with multiple speakers.

Stage 2: Document Generation System
Next, I built a system to populate these templates programmatically. This system would:
- Select appropriate templates based on the formula/concept being taught
- Generate consistent placeholder replacements
- Ensure necessary cross-references between documents
- Introduce appropriate errors in semi-reliable and unreliable sources
- Add metadata about reliability levels and domains
I implemented this using Python, with each document stored as a structured object containing both raw markdown with placeholders and metadata that could be used by the search system.
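As a rough illustration, the document objects looked something like this (the field names here are my own, illustrative ones rather than the exact schema):

from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    reliability: str                 # "authoritative", "semi_reliable", or "unreliable"
    domain: str                      # e.g. "quantum mechanics"
    raw_markdown: str                # template text still containing <placeholder> tokens
    keywords: list = field(default_factory=list)      # used by the search system
    references: list = field(default_factory=list)    # ids of documents this one cites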
Here's an example of a generated document:

Stage 3: Search Interface Integration
The search interface was where things got truly complex. It needed to:
- Process natural language queries from the LLM
- Parse JSON parameters like keywords, top_k, and source_types
- Retrieve relevant documents based on keyword matching and reliability filters
- Return document content with consistent placeholder replacements
- Format results in a way that preserved the learning objective
I designed the interface to force the model to think explicitly about what it knew and didn't know before searching, using structured XML tags to reinforce the reasoning process:
<think>
I need to find the definition of Smith's Equation in quantum mechanics. I know it relates to energy levels, but I don't know the specific formula or its applications. I should search for authoritative sources first.
</think>
<search>
{
"keywords": ["Smith Equation", "quantum mechanics", "definition", "formula"],
"top_k": 2,
"source_types": ["authoritative"]
}
</search>
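On the system side, a small parser pulls the JSON out of the <search> block before querying the document pool. A minimal sketch (not the actual implementation; it also tolerates a missing closing tag):

import json
import re

def extract_search_request(response_text):
    # Capture everything between <search> and </search>, falling back to the end of
    # the text in case the closing tag is never emitted.
    match = re.search(r"<search>(.*?)(?:</search>|\Z)", response_text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return None

sample = '<think>...</think>\n<search>\n{"keywords": ["Smith Equation"], "top_k": 2}\n</search>'
request = extract_search_request(sample)   # -> {'keywords': ['Smith Equation'], 'top_k': 2}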
The Progressive Training Curriculum
With the document pool and search interface in place, I designed a progressive curriculum that would gradually teach the model to reason and search effectively:
Phase 1: Single Search + Reasoning
In this first phase, I would teach the model to:
- Recognize when it didn't know something
- Formulate an appropriate search query
- Evaluate the returned information
- Apply it to solve the problem
I would implement this phase in three steps of increasing difficulty:
- Complete demonstration: Show the model fully worked examples with all thinking and search steps provided
- Partial demonstration: Provide the beginning of the thinking and search process, but let the model complete it
- Independent practice: Give only the problem and have the model perform the entire process
Phase 2: Multi-Search + Cross-Validation
In the second phase, I would teach more sophisticated reasoning patterns:
- Cross-validating information across multiple sources
- Identifying conflicting information and resolving inconsistencies
- Following chains of references across documents
- Formulating follow-up searches based on initial findings
Again, I would use the three-step approach of complete demonstration, partial demonstration, and independent practice.
When Reality Strikes: The Technical Challenges
The implementation began smoothly enough. I built the document generation system, the placeholder replacement functionality, and the initial UI components. But as the project grew more complex, problems began to emerge.
What started as a clean, modular design gradually transformed into a tangled web of dependencies. Making a small change in the document generation format would unexpectedly break the search interface. Fixing a bug in the UI would somehow affect how documents were parsed.
These challenges manifested in several critical areas:
Challenge 1: Document Interdependencies
My hierarchical document structure created unexpected dependencies. When generating a secondary source, I needed to ensure all referenced primary sources existed and contained the appropriate information. This created a complex dependency graph that was difficult to maintain and debug.
I attempted to solve this by creating a document generation pipeline that would:
- First generate all primary sources
- Then generate secondary sources based on the content of the primaries
- Finally generate tertiary sources based on the content of both primary and secondary sources
But this created its own problems - changes to the primary source generation could invalidate already generated secondary and tertiary sources.
Challenge 2: Placeholder Consistency
Maintaining consistent placeholder replacements across documents proved challenging. If <scientist_1> was replaced with "Dr. Jane Wong" in one document, it needed to be replaced consistently in all documents that might be retrieved together.
I attempted to solve this with a global placeholder registry that would:
- Track all placeholders used across documents
- Maintain consistent mappings from placeholders to values
- Generate new values when needed
- Handle special cases like formula expressions
But as the system grew, edge cases multiplied. What happens when a formula is partially quoted in a forum post? How are variable names handled when discussing multiple formulas in the same document?
Challenge 3: The XML Parsing Problem
One particularly frustrating issue emerged when I discovered that the API didn't return the closing </search> tag that my parser was expecting. This seemingly minor detail rippled through the entire system, requiring substantial rewrites of core components.
- My XML parser couldn't reliably extract search queries
- This broke the feedback loop between the model and the document pool
- Without reliable parsing, I couldn't track or evaluate the model's search behavior
What seemed like a minor implementation detail revealed a deeper issue - the tight coupling between components created cascading failures when any single part didn't behave exactly as expected.
As the codebase grew, so did the technical debt. Documentation couldn't keep pace with changes, and the intricate connections between modules became increasingly difficult to maintain.
"I've created a monster," I realized one night, staring at pages of error messages. The ambitious vision I had started with was drowning in implementation details and unexpected edge cases. Though UI have been built, and it almost can work, but it's so hard to make even one step further.
Some beautiful UI:
The dataset reader

The dataset generator

The LLM solving UI

The Gift of Failure: Lessons Learned
Despite not achieving my original goal, the project taught me invaluable lessons about system design and AI development:
1. Document Structure Before Content
In retrospect, I should have finalized the document structure, relationships, and placeholder system before adding any natural language content. The structure proved far more important than I initially realized, and retrofitting structure onto existing content created unnecessary complications.
2. Clean Interfaces Between Components
The tight coupling between document generation, placeholder replacement, search functionality, and the UI was a fundamental design flaw. Each component should have had clear, well-documented interfaces with minimal assumptions about how other components would use its outputs.
Building complex document pools requires establishing clear, maintainable structures from the beginning. I should have finalized the document relationships before adding the natural language elements.
3. Extreme Modularity Is Essential
For systems of significant complexity, modularization isn't just helpful—it's essential. Each component needs to function independently, with clean interfaces between them. This is especially crucial for UI components, which should be completely isolated from core processing logic.
4. API Edge Cases Matter
Small details like how token delimiters are handled can have outsized impacts on system architecture. I learned the hard way that these seemingly minor technical details can make or break a project.
5. Incremental Testing Is Critical
I should have built and thoroughly tested each component in isolation before attempting to integrate them. By trying to build the entire system at once, I created a situation where failures were difficult to isolate and diagnose.
6. Documentation as Design
Maintaining comprehensive documentation should have been a design priority, not an afterthought. As the system grew more complex, the lack of clear documentation made it increasingly difficult to understand how changes in one area would impact others.
Shifting Focus: Search First, Training Later
After these challenges and valuable lessons, I decided to take a step back and refocus my efforts. While my initial goal of training smaller models with multi-step reasoning remained compelling, I realized I needed to first validate whether existing LLMs could effectively utilize search functionality in the way I envisioned.
I considered what would be the most productive next step: "If I can't train a model yet, I can at least test whether LLMs can effectively use search functionality on their own." This would still advance my understanding of AI trustworthiness - perhaps not by creating a new model, but by testing the capabilities of existing ones.
The fundamental goal hadn't changed: creating AI systems that search, reason properly, and arrive at correct answers is essential for trustworthiness. Much like how we trust academics and researchers who methodically consult sources and follow evidence, an AI that performs similar steps would inspire more confidence in its outputs.
The core challenge remained two-fold: document analysis (determining when to search) and constructing appropriate document pools with interconnected information. The fundamental questions were: How does an AI know when it doesn't know something? And how can we structure information to test this capability?
My initial failures had taught me valuable lessons about overcomplicating systems. I had tried to simultaneously solve multiple complex problems: randomized document generation, placeholder systems, multi-tiered reliability sources, and complex interdependencies between documents. Each additional component had increased the complexity significantly.
I needed to isolate variables and test one capability at a time. It was time to simplify and focus.
Simplifying the Approach: The Formula Tree Method
I started by reconsidering the fundamental requirements of my document pool. I reviewed my notes from the failed project and thought about what was truly essential.
The answer became clear:
- Documents needed clear keyword tags to enable targeted retrieval - without this, search would be meaningless
- Information had to be distributed across multiple documents in a logical way - to test chained reasoning
- Solving problems required chaining multiple searches together - to evaluate persistence and focus
As I considered these requirements, I realized something important: "The original placeholder system was designed to prevent overfitting during training. But since I'm just testing search capabilities now, I can temporarily set aside the randomization of variable names and formulas."
This was an important insight. For testing search and reasoning capabilities, I didn't need to worry about preventing memorization through randomization - I just needed clear, structured information distributed across documents in a way that required multiple searches to solve problems.
This realization helped me focus on creating a document pool with genuinely interconnected information that would require multi-step reasoning. But I still needed to determine what structure would work best for this purpose.
I considered various approaches for structuring the information. Stories with characters appearing in multiple documents, geographic information with connected locations, and historical events with causal relationships all seemed promising, but none provided the clean, unambiguous structure I needed.
Mathematics and formulas eventually emerged as the ideal framework. This approach offered the clear structure I was looking for.
I envisioned a system where computing a single value would require traversing a tree of formulas and variables. To solve for variable y1, you would need formula f1, which contains variables A, B, and C. Then to find those variables, you might need more formulas, creating a natural chain of dependencies.
I sketched out the basic tree structure:
(Level 1) y1 → (Level 2) Formula f1 → (Level 3) Variables A, B, C → (Level 4) Values or sub-formulas...
This approach mapped naturally to separate documents. Each node in this tree could be placed in a different document, creating an effective test case for multi-step reasoning and search. The LLM would need to:
- Recognize it needs the value of y1
- Search for the formula needed to calculate y1
- Identify the variables in that formula
- Search for each variable's value or formula
- Continue this process recursively until all values are found
- Calculate the final result
The structure was clean, the dependencies logical, and the reasoning required was unambiguous. It would be clear whether the model successfully navigated the chain of reasoning or not.
As I developed this concept further, I could feel the excitement return. This approach had several compelling advantages:
- It created clear, unambiguous structure
- It established natural dependencies between information pieces
- It could be scaled in complexity based on tree depth and breadth
- It would be easy to verify the correctness of reasoning
Building the Formula Tree System: Designing with Parameters
I wanted to make this system both flexible and controllable. Rather than creating just one fixed formula tree, I planned to generate many different trees with varying complexity to systematically test LLM capabilities.
I considered what makes a reasoning task difficult: Is it the number of steps? The number of variables? The distribution of information? These questions led me to design four key parameters that would shape each problem:
- Expansion Depth: This would determine how deep the formula tree goes. Each additional level would require another round of searching and reasoning, potentially testing the LLM's ability to maintain context. If an LLM can handle depth=2 but fails at depth=4, that would reveal something important about its reasoning limitations.
- Breadth: This parameter would control how many variables each formula contains. More variables would increase complexity and potentially create more opportunities for errors. This would determine how many branches would emerge from each node.
- Number of Documents: This would determine how to split the information across documents, which directly impacts how many searches would be needed. Each formula or variable could be in its own document, or related information could be grouped together. The distribution strategy could significantly affect reasoning difficulty.
- Expansion Probability: This parameter would control whether variables lead to another formula or are simple values. It would determine the likelihood of a variable requiring another formula versus being a terminal node with a direct value, creating varying patterns of reasoning depth within the same tree.
Different combinations of these parameters would create problems of varying difficulty. A simple problem might use depth=1 and breadth=2, requiring just one formula with two variables. A complex problem could use depth=4 and breadth=3, creating a large tree of nested calculations that would test an LLM's ability to maintain context across multiple searches.
This approach would allow for systematic evaluation of how LLMs handle increasing complexity in multi-step reasoning and identify exactly where their capabilities break down.
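To make these parameters concrete, here is a minimal, hypothetical sketch of a generator driven by depth, breadth, and expansion probability (the node layout and field names are mine, not the actual implementation):

import random

def build_formula_tree(depth, breadth, expansion_prob, counter=None, is_root=True):
    # Each node is either a terminal constant or a formula over `breadth` sub-variables.
    counter = counter if counter is not None else {"n": 0}
    if is_root:
        name = "<y1>"
    else:
        counter["n"] += 1
        name = f"<v_{counter['n']}>"
    expand = is_root or (depth > 0 and random.random() < expansion_prob)
    if not expand:
        return {"item": name, "type": "constant", "value": random.randint(1, 9)}
    children = [build_formula_tree(depth - 1, breadth, expansion_prob, counter, is_root=False)
                for _ in range(breadth)]
    return {"item": name, "type": "formula",
            "formula": " + ".join(f"2*{child['item']}" for child in children),
            "variables": [child["item"] for child in children],
            "children": children}

tree = build_formula_tree(depth=3, breadth=2, expansion_prob=0.7)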
Document Structure and Tagging: Creating Searchable Information
The next challenge was designing how information would be structured within documents. I needed a format that was both human-readable and easily parseable by machines.
My first attempt used plain text with minimal structure, but I found the LLM couldn't reliably extract the relationships between variables and formulas - it was too ambiguous.
My second attempt used highly structured XML with nested relationships, but this became unwieldy, especially when representing complex formulas - it was too rigid.
After several iterations, I settled on a JSON-like structure where each document contained one or more pieces of information, tagged with their respective keywords:
[
{
"item": "<y1>",
"type": "formula_ref",
"formula_name": "<f_1>"
},
{
"item": "<f_1>",
"type": "formula_def",
"formula": "2*<v_1> + 3*<v_2>",
"variables": ["<v_1>", "<v_2>"],
"output": "<y1>"
}
]
As I tested this structure, I realized the importance of metadata. The LLM needed to know what kind of information was in each document to formulate effective queries.
I added another layer to my design - ensuring documents included appropriate tags and metadata describing which variables or formulas were defined in each document. These tags would allow for efficient search based on the specific information needed at each step.
For example, if a document contains the definition of formula f_1, it should be clearly tagged as such. That way, when the LLM is looking for a formula definition, it can search specifically for documents with that tag.
After testing different document formats, I arrived at one that balanced clarity with information density, allowing LLMs to navigate through the information effectively.
Visualization and Verification: Seeing the Structure
As the system grew more complex, I needed a way to see what I was building. During one debugging session, a subtle error in document distribution had created an impossible reasoning path, which made it clear that I had to visualize the structure to verify it was working as intended.
I developed visualization tools that would:
- Display the formula tree structure as a graph with color-coded nodes
- Show how information was distributed across documents by mapping node colors to document IDs
- Illustrate the search paths an LLM would need to follow
When I generated a complex formula tree with depth=4 and breadth=3 and visualized it, I saw that the tree was more complex than expected, with many nodes spread across multiple documents. Following the reasoning path manually was challenging even for me.
The visualizations confirmed that solving these problems would require genuine multi-step reasoning and effective search - precisely the capabilities I wanted to test.
I generated different trees with various parameter settings to see how the structure changed. Some combinations of parameters created balanced trees, while others created structures that would test different aspects of reasoning.
The Search Process: Following the Breadcrumbs
With the system built, I could now articulate the exact process an LLM would need to follow to solve these problems:
- Analyze the question (e.g., "What is the value of <y1>?")
- Formulate an appropriate search query (e.g., search for "<y1>")
- Process the returned document, identifying that <y1> uses formula <f_1>
- Search for formula <f_1> to find its definition
- Identify all variables needed (<v_1>, <v_2>)
- Search for each variable's value or formula
- Continue recursively until all values are found
- Calculate the final result using the complete formula tree
This process mirrored the kind of structured reasoning I initially wanted to train smaller models to perform, while allowing me to test the capability with existing LLMs.
As I ran through test examples manually, I could see how this would challenge an LLM's ability to track information across multiple searches and maintain the overall goal. "This is exactly the kind of reasoning I want to evaluate," I realized.
Applying Lessons from Previous Failures
My earlier painful experiences hadn't been for nothing. As I built this new system, I consciously applied the lessons I had learned:
When designing the document generation system, I resisted the urge to add features before the core structure was solid. "Document structure first," I reminded myself, recalling the tangled dependencies from my previous attempt.
I created clean interfaces between components. The tree generation, document splitting, visualization, and search modules all had well-defined inputs and outputs with minimal assumptions about each other.
Rather than building everything at once, I took an incremental approach: first creating and testing the formula tree generation, then adding document splitting, then visualization, and finally the search interface. Each component was validated before moving to the next.
I embraced modularity at every step. Each component could be tested and modified independently, which made debugging much easier when issues inevitably arose.
And perhaps most importantly, I designed the entire system to be parameter-driven from the start. Key aspects like tree depth, breadth, and document distribution were controlled by a few central parameters, making it easy to adjust complexity without rewriting code.
Scaling Up: The Batch Dataset Generator
With the core formula system designed, I began thinking about how to test it systematically. Testing with individual, manually created datasets wouldn't give me the insights I needed about LLM reasoning capabilities. I needed a way to generate and test multiple datasets with varying complexity levels.
This need for systematic testing led me to develop a batch dataset generator. Throughout my research, I've found that understanding the precise boundaries of LLM capabilities requires controlled variation of key parameters. By generating many datasets with different complexity profiles, I could identify exactly where LLMs begin to struggle with multi-step reasoning.
When designing the generator, I focused on creating a parameter space that would let me explore different dimensions of complexity:
- Depth: How many levels of nested formulas should the model navigate?
- Breadth: How many variables should each formula contain?
- Document count: How should information be distributed across documents?
- Expansion probability: What's the likelihood a variable leads to another formula versus a simple value?
The batch generator would create multiple datasets by randomly sampling from these parameter ranges, producing a test suite that systematically explored the reasoning space. This approach would help identify which specific aspects of complexity (depth, breadth, document distribution) most affected LLM performance.
I included functionality to automatically assign dataset IDs and maintain a summary of the parameters used for each dataset. This metadata would be crucial for analyzing patterns in performance across different complexity dimensions.
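A minimal sketch of that random sampling step (the parameter ranges and field names are illustrative):

import random
import uuid

def sample_batch_configs(n_datasets=20):
    # Randomly sample the parameter space; each dataset gets an ID, and the list of
    # configs doubles as the summary used later to correlate results with parameters.
    configs = []
    for _ in range(n_datasets):
        configs.append({
            "dataset_id": uuid.uuid4().hex[:8],
            "depth": random.randint(2, 4),
            "breadth": random.randint(1, 3),
            "doc_count": random.randint(2, 4),
            "expansion_prob": round(random.uniform(0.3, 0.9), 2),
        })
    return configs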
Visualizing Formula Trees
As I began generating datasets with varying parameters, I needed to better understand what each dataset actually contained. The JSON representations were precise but didn't provide an intuitive grasp of complexity.
I had planned a visualization component from early in the project design. With multiple datasets being generated, this became essential for quickly understanding what I was testing. The visualization converted formula trees into graph representations showing the hierarchical structure, document distribution, and expected search paths.
These visualizations provided immediate insight into the tests I was running. I could quickly see when a particularly complex branch appeared or when information was distributed in a way that would require an especially challenging reasoning path. This helped me refine the parameter ranges to create more meaningful tests.
Testing with JSON Search Format
After generating datasets with varying complexity, I was ready to start testing actual LLM performance. I began with simpler cases to establish a baseline.
Initial tests showed promising results. When working with relatively simple formula trees (depths of 2-3, breadth of 1-2 variables per formula, information distributed across 2-3 documents), most capable LLMs could successfully navigate the search and reasoning process.
These positive results on simpler tests were encouraging. They confirmed that the approach was valid - LLMs could indeed perform the kind of multi-step reasoning and search I had designed for. This established a baseline for exploring more complex scenarios.
Improving the LLM API: Lessons from Experience
As I expanded testing to include more models and datasets, I encountered an issue that had nothing to do with the formula system itself, but significantly impacted the reliability of my results. The way I had implemented interactions with LLM APIs was inconsistent and difficult to maintain.
In my initial implementation, I had mixed different approaches to API calls. Sometimes I used provider-specific code directly, other times my own wrapper interfaces. This inconsistency made debugging difficult and undermined my ability to make fair comparisons between models.
The problem stemmed from the incremental way the project had evolved. As I added features and tested different models, I had created a patchwork of API interactions rather than a consistent framework. It was time to refactor.
I designed the LLMClient class to provide a unified interface for all LLM interactions. The key design goals were:
- Consistent parameter handling across different providers
- Centralized configuration management
- Proper error handling and retry logic
- Support for multiple providers without changing the main code
This refactoring wasn't just about code cleanliness - it was essential for ensuring the validity of my research. Without a consistent interface, differences in API handling could be misinterpreted as differences in model capabilities.
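A stripped-down sketch of the shape this class took (the provider dispatch is elided and the method names are simplified; this is not the full implementation):

import time

class LLMClient:
    def __init__(self, provider, model, temperature=0.0, max_retries=3):
        self.provider = provider
        self.model = model
        self.temperature = temperature
        self.max_retries = max_retries

    def chat(self, messages):
        # Retry with exponential backoff so transient API failures don't abort a whole run.
        for attempt in range(self.max_retries):
            try:
                return self._call_provider(messages)
            except Exception:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt)

    def _call_provider(self, messages):
        # Dispatch to the relevant SDK (OpenAI, Anthropic, Grok, OpenRouter, ...) and
        # normalize the response to plain text; the provider-specific code is omitted here.
        raise NotImplementedError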
Handling Model-Specific Differences
As I integrated more models into my testing framework, I discovered that each had unique quirks that needed to be handled. Some models didn't support system prompts, while others required special formatting or had different token limits.
Rather than creating special cases throughout my code, I built these handling rules into the LLMClient class. This approach allowed the main code to remain clean while still accommodating model differences.
The addition of OpenRouter as a provider further complicated matters by introducing access to a wide range of models with varying capabilities. This required additional adaptation logic to handle different response formats and input constraints.
These adaptations were necessary for comprehensive testing. Understanding how different architectures handle multi-step reasoning would provide valuable insights into what makes some models more capable than others at complex reasoning tasks.
A model that doesn't seem to have read the system prompt:
o4-mini only supports temperature set to 1:
Sometimes Gemini 2 Pro does not generate </output> as expected:
From JSON to Natural Language: Making Information More Accessible
As the formula search system took shape, I began thinking about the artificiality of the testing environment. The JSON format I had been using was ideal for programmatic access, but it didn't reflect how information typically exists in the real world. If my goal was to test how well LLMs could search and reason with information in a trustworthy manner, I needed to move closer to real-world conditions.
Real information rarely comes neatly packaged in structured JSON. Instead, it's usually presented in natural language text - documentation, articles, textbooks, and other written materials. By continuing with a purely JSON-based approach, I would be testing LLMs in an artificial environment that didn't match how they would need to operate in practice.
What I needed was a transformation layer that would convert my structured formula data into something resembling natural language, while still maintaining the precise relationships between variables and formulas. The challenge was finding the right balance - I didn't want to make the information fuzzy or ambiguous (since trustworthy reasoning requires precision), but I wanted to present it in a more natural format.
I thought about how mathematical relationships are typically described in documentation or textbooks. Usually, they're presented with clear statements like "The value of X is calculated using formula Y, which depends on variables A, B, and C." This kind of natural language presentation maintains precision while being more accessible and realistic.
With this insight, I designed a document processor that would convert my structured JSON entries into simple natural language statements. For example, a formula definition like:
{
"item": "<f_1>",
"type": "formula_def",
"formula": "2*<v_1> * 3*<v_2> * 2*<v_3>",
"variables": ["<v_1>", "<v_2>", "<v_3>"],
"output": "<y1>"
}
Would be transformed into:
<f_1>'s formula is: <y1> = 2*<v_1> * 3*<v_2> * 2*<v_3> (using variables <v_1> and <v_2> and <v_3>)
The document processor applied different transformation rules based on the type of information. Formula references, formula definitions, constants, and tags each had their own natural language template. This created a collection of simple, declarative statements that retained all the mathematical relationships while being more similar to how information would be presented in educational or reference materials.
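In code, the processor amounted to one small template per entry type, roughly along these lines (the formula_ref and constant templates are my paraphrase of the idea, not the exact wording):

def to_natural_language(entry):
    # Convert one structured entry into a declarative sentence, keyed on its type.
    if entry["type"] == "formula_def":
        variables = " and ".join(entry["variables"])
        return (f"{entry['item']}'s formula is: {entry['output']} = {entry['formula']} "
                f"(using variables {variables})")
    if entry["type"] == "formula_ref":
        return f"The value of {entry['item']} is calculated using formula {entry['formula_name']}."
    if entry["type"] == "constant":
        return f"The value of {entry['item']} is {entry['value']}."
    return str(entry)   # fall back to the raw entry for any unhandled type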
Building a Semantic Search System
With natural language representations of my formula data in place, I needed an effective way for LLMs to search through this content. Since the information was no longer in a structured format with well-defined fields, I needed a more sophisticated search approach.
This led me to develop a hybrid search system that combined keyword matching with semantic similarity. The keyword approach would allow for precise matching of variable and formula names, while semantic similarity would help capture related concepts even when exact keywords weren't present.
For the semantic similarity component, I chose a sentence embedding model that could convert text into vectors, allowing for efficient similarity comparisons. I implemented this using a FAISS index to allow for fast retrieval, even with potentially large document collections.
The search engine design also needed to handle the fact that a single document might contain multiple pieces of information. I wanted the search to return entire documents rather than just individual statements, so the LLM would have context for the information it found.
When implementing the KeywordNLPSearchEngine class, I paid particular attention to the indexing process. For each document, I extracted the keywords (variable names, formula names, etc.) and also generated embeddings for the natural language content. This dual representation allowed for both exact matching and similarity-based search, giving the LLM multiple pathways to find relevant information.
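A simplified sketch of that dual indexing, assuming the sentence-transformers and faiss libraries and an arbitrary keyword/semantic weighting (the real KeywordNLPSearchEngine differs in its details):

import faiss
from sentence_transformers import SentenceTransformer

class HybridSearchSketch:
    def __init__(self, documents, keywords_per_doc):
        # documents: list of natural-language document strings
        # keywords_per_doc: list of keyword lists (variable/formula names) per document
        self.documents = documents
        self.keywords_per_doc = keywords_per_doc
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = self.model.encode(documents).astype("float32")
        faiss.normalize_L2(embeddings)                      # cosine similarity via inner product
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(embeddings)

    def search(self, query, top_k=2):
        q = self.model.encode([query]).astype("float32")
        faiss.normalize_L2(q)
        sims, ids = self.index.search(q, len(self.documents))
        scores = {}
        for sim, i in zip(sims[0], ids[0]):
            # Blend semantic similarity with exact keyword hits (weight chosen arbitrarily).
            keyword_hits = sum(kw.lower() in query.lower() for kw in self.keywords_per_doc[i])
            scores[int(i)] = float(sim) + 0.5 * keyword_hits
        best = sorted(scores, key=scores.get, reverse=True)[:top_k]
        return [self.documents[i] for i in best]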
Recursive Search: Enabling Multi-Step Reasoning
With the search functionality in place, I turned my attention to the core challenge: enabling LLMs to perform multi-step reasoning through recursive search. The idea was to allow the LLM to generate partial responses, search for information when needed, and continue its reasoning with the new information.
This required a careful balance between giving the LLM enough structure to effectively use the search functionality while still allowing it freedom to reason in its own way. I wanted to guide the process without over-constraining it.
The solution I developed was to use clearly defined markers in the LLM's output to indicate when it wanted to search. When the LLM included a search request in its response, the system would extract the search parameters, perform the search, and provide the results back to the LLM to continue its reasoning.
The recursive search client would manage this entire process, handling message formatting, search requests, result integration, and continuation until the LLM produced a complete response or reached a maximum number of iterations.
One of the trickiest aspects was maintaining coherent reasoning across multiple search iterations. I needed to ensure the LLM didn't lose track of the overall problem or its previous reasoning steps. This led me to implement a conversation preservation mechanism that kept the full context available throughout the process.
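Putting the pieces together, the driver loop looked roughly like this sketch (llm.chat is the assumed unified client method from earlier, extract_search_request is the parser sketched above, and the <result> wrapper is illustrative):

def recursive_solve(llm, search_engine, question, max_iterations=8):
    # Let the model reason, pause whenever it emits a <search> request, inject the
    # results, and continue with the full conversation preserved as context.
    messages = [{"role": "user", "content": question}]
    for _ in range(max_iterations):
        reply = llm.chat(messages)
        messages.append({"role": "assistant", "content": reply})
        request = extract_search_request(reply)
        if request is None:
            return reply                       # no search requested: treat as the final answer
        results = search_engine.search(" ".join(request["keywords"]),
                                       top_k=request.get("top_k", 2))
        messages.append({"role": "user",
                         "content": "<result>\n" + "\n---\n".join(results) + "\n</result>"})
    return None                                # gave up after max_iterations searches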
The end-to-end system I developed reflected the reasoning process I would expect from a trustworthy expert:
- Acknowledge when information is missing
- Explicitly search for that information
- Evaluate and integrate the search results
- Continue reasoning or search again if necessary
- Arrive at a final conclusion based on all gathered information
This approach addressed a fundamental aspect of trustworthiness: being transparent about knowledge limitations and actively seeking information to fill knowledge gaps. Just as we trust experts who consult references rather than making unsupported claims, this system encourages LLMs to exhibit similar epistemic humility.
The Need for Controlled Experimentation
As the formula-based reasoning system took shape, I began thinking more carefully about evaluation. My early tests had provided promising results, but they lacked the systematic rigor needed for drawing reliable conclusions about LLM capabilities.
Testing with randomly generated datasets was useful for development, but it introduced too many variables for rigorous analysis. If one model performed better than another on a particular dataset, was it because of the model's superior reasoning capabilities, or just because that dataset happened to be easier? Without controlling for dataset characteristics, it would be impossible to say.
This challenge is common in AI evaluation - the need to isolate variables and conduct controlled experiments. For my trustworthiness research, I needed to understand precisely how different factors affected LLM performance in search and reasoning tasks.
Designing a Matrix Testing Approach
To address these evaluation challenges, I developed a matrix-based testing methodology. Rather than testing on random datasets, I would generate datasets with specific controlled characteristics, creating a structured matrix of test cases that would allow for more rigorous comparison.
The key parameters I wanted to control were:
- Document count: How many documents the information was split across
- Tree depth: How many levels of nested formulas were required
- Formula breadth: How many variables each formula contained
By holding some parameters constant while varying others, I could isolate the effect of each parameter on LLM performance. For example, I could fix the breadth at 3 variables per formula, then systematically vary the depth (from 2 to 4) and document count (from 2 to 4) to see how these factors affected reasoning accuracy.
This approach would give me a much clearer picture of where different models struggled with multi-step reasoning. It would also help identify which aspect of complexity (depth vs. breadth vs. document distribution) posed the greatest challenge for current LLMs.
Implementing the Matrix Dataset Generator
With this testing design in mind, I implemented the MatrixDatasetGenerator to create datasets according to the matrix specifications. Unlike the earlier batch generator that used random parameter values, this generator would create datasets with precise parameter combinations.
For each cell in the parameter matrix (e.g., depth=2 + docs=3, depth=3 + docs=2, etc.), the generator would create multiple datasets with those exact parameters. This redundancy was important for statistical reliability - testing on multiple datasets with the same characteristics would help account for random variation in dataset structure.
The implementation organized datasets hierarchically, with directories for each parameter combination and subdirectories for the individual datasets. This organization made it easy to identify and analyze datasets with specific characteristics.
I also maintained a dataset summary file that tracked all generated datasets and their parameters. This metadata would be crucial for analysis, allowing me to correlate performance with specific parameter values.
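In outline, the generator walks the parameter grid like this (the directory layout, cell sizes, and split_into_documents helper are illustrative, and build_formula_tree is the earlier sketch):

import itertools
import json
import os

def generate_matrix(depths=(2, 3, 4), doc_counts=(2, 3, 4), breadth=3,
                    datasets_per_cell=8, out_dir="matrix_datasets"):
    # For every (depth, document count) cell, generate several datasets with identical
    # parameters and record them in a summary file for later analysis.
    summary = []
    for depth, docs in itertools.product(depths, doc_counts):
        cell_dir = os.path.join(out_dir, f"depth{depth}_docs{docs}")
        os.makedirs(cell_dir, exist_ok=True)
        for i in range(datasets_per_cell):
            tree = build_formula_tree(depth=depth, breadth=breadth, expansion_prob=1.0)
            dataset = split_into_documents(tree, n_docs=docs)   # assumed helper
            path = os.path.join(cell_dir, f"dataset_{i}.json")
            with open(path, "w") as f:
                json.dump(dataset, f, indent=2)
            summary.append({"path": path, "depth": depth, "docs": docs, "breadth": breadth})
    with open(os.path.join(out_dir, "summary.json"), "w") as f:
        json.dump(summary, f, indent=2)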
The structure of the matrix datasets:

Building an End-to-End Evaluation Pipeline
With the matrix datasets in place, I needed a way to run tests systematically across all datasets and models. This required an end-to-end evaluation pipeline that would:
- Load the matrix datasets
- Run tests using different LLM providers and models
- Track results and performance metrics
- Analyze patterns across different parameters
The BatchEvaluator class I developed served this purpose, tying together all the components I had built previously - the LLM clients, search functionality, and recursive reasoning - into a comprehensive evaluation system.
The evaluation process would present each model with the same question about calculating the value of <y1> across all datasets in the matrix. By keeping the prompt consistent and varying only the datasets and models, I could make direct comparisons of reasoning performance.
A key challenge in the evaluation was determining whether responses were correct. LLMs don't always present their final answers in a consistent format, making automated evaluation difficult. To address this, I implemented a robust answer extraction and verification system that could:
- Extract the final calculated value from the LLM's response
- Compare it against the expected result
- Verify correctness through a separate verification step
This verification step was particularly important to ensure reliability in the results. Rather than using simple string matching (which could miss correct answers presented in different formats), I used a separate LLM call specifically dedicated to answer verification.
Automating Parallel Evaluations
While the command-line approach improved flexibility, I recognized that manually launching each evaluation process would still be inefficient. I needed a way to orchestrate multiple evaluation runs without manual intervention.
This led me to develop a parallel batch generator that would create structured batch scripts to launch multiple evaluation processes simultaneously. The key design challenge was balancing test coverage with system resources. I wanted to test many model configurations in parallel, but running too many evaluations at once could overload my system or throttle API access.
My solution was a batch script generator that allowed for careful configuration of the parallel test suite. For each run, I could specify:
- Which providers to include (Claude, OpenAI, Grok, OpenRouter)
- Which specific models to test from each provider
- Temperature settings to test performance at different randomness levels
- Resource constraints like timeouts and maximum datasets
The generator would create a Windows batch file that opened separate command windows for each configuration, with each window running an independent evaluation process. This approach had several advantages:
- Visual tracking of progress across multiple evaluations
- Isolation between runs (an error in one wouldn't affect others)
- Easy termination of individual evaluations if needed
- Flexible resource allocation
The batch generator also included an interactive configuration component that allowed me to modify the test matrix before execution. I could add new models, adjust settings, or create custom testing combinations based on previous results.
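In spirit, the generated script looked something like the output of this sketch (the evaluate.py entry point and its flags are hypothetical stand-ins for my actual CLI):

def write_parallel_batch(configs, path="run_evals.bat"):
    # Each configuration gets its own console window via `start`, so runs stay isolated
    # and can be monitored or killed independently.
    lines = ["@echo off"]
    for cfg in configs:
        lines.append(
            f'start "eval {cfg["provider"]}/{cfg["model"]}" cmd /k '
            f'python evaluate.py --provider {cfg["provider"]} --model {cfg["model"]} '
            f'--temperature {cfg["temperature"]} --max-datasets {cfg["max_datasets"]}'
        )
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")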
This flexible orchestration was particularly valuable as I expanded testing to include more diverse models. Different models often had different strengths, and I wanted to understand these patterns systematically. For instance, I could test whether models with larger parameter counts consistently performed better on deeper formula trees, or if architecture differences were more important than scale.
Running multiple batches with one click:

Comprehensive Analysis Through Result Merging
With multiple evaluations running in parallel across different models and configurations, the result files would be scattered across many directories. To make sense of this distributed data, I developed a comprehensive result merging and analysis system.
The merger would find all relevant evaluation results matching certain criteria (such as a specific timestamp or test batch), then combine them into a unified dataset. This merged data would then be analyzed to reveal patterns that wouldn't be visible in isolated test runs.
I designed the analysis pipeline to extract insights across multiple dimensions:
- Provider and Model Performance: Which providers and specific models achieved the highest accuracy rates? Were there patterns in the types of problems each model handled best?
- Parameter Sensitivity: How did performance change as document count or tree depth increased? Did certain models degrade more gracefully than others with increasing complexity?
- Time and Efficiency Analysis: How did models compare in terms of processing time and number of searches performed? Was there a trade-off between accuracy and efficiency?
The system would generate both numerical summaries and visualizations to make these patterns clear. For example, charts could show how accuracy declined with increasing tree depth for each model, making it easy to identify which models maintained performance better as complexity increased.
I also included detailed statistical breakdowns in the analysis:
- Overall statistics across all evaluations
- Group comparisons by parameter values
- Detailed metrics for each model and configuration
- Correlation analysis between different factors
This comprehensive analysis would be crucial for drawing reliable conclusions about model capabilities and limitations. Rather than making claims based on anecdotal examples or limited test cases, I could back up observations with systematic data across many controlled scenarios.
The merged results would also serve as a valuable dataset for future research, potentially revealing unexpected patterns or relationships that could guide the development of more trustworthy AI reasoning systems.
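As a sketch of the kind of aggregation involved (the column names are assumptions about the merged file, not its actual schema), using pandas:

import pandas as pd

def summarize(merged_csv="merged_results.csv"):
    # Aggregate accuracy and efficiency along the dimensions discussed above.
    df = pd.read_csv(merged_csv)
    by_model = df.groupby(["provider", "model"])["correct"].mean().sort_values(ascending=False)
    by_depth = df.groupby("depth")["correct"].mean()
    by_docs = df.groupby("doc_count")["correct"].mean()
    by_time = df.groupby(["provider", "model"])[["elapsed_seconds", "num_searches"]].mean()
    return by_model, by_depth, by_docs, by_time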
Refining the Testing Methodology
As I developed the evaluation infrastructure, I continued to refine the testing methodology itself. One key challenge was ensuring that verifications were reliable - I needed to be confident that the system was correctly identifying when models had successfully solved problems.
Initially, I used simple string matching to check answers, but this proved problematic. Models often presented correct answers in different formats or with additional explanation, causing valid responses to be marked incorrect. To address this, I implemented a more sophisticated verification approach that used a separate LLM call to analyze and verify responses.
This verification model would:
- Parse the original model's output to extract the final answer
- Compare this answer to the expected result
- Determine if they were mathematically equivalent
- Provide an explanation for its decision
This approach significantly improved the reliability of the evaluation results, ensuring that models weren't penalized for presentation differences when their mathematical reasoning was correct.
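A sketch of what that verification call looked like (the prompt wording and the llm.chat wrapper are illustrative):

VERIFY_PROMPT = """You are checking a math answer.
Expected value: {expected}
Model response:
{response}
Reply CORRECT if the response's final answer is mathematically equivalent to the
expected value, otherwise reply INCORRECT, followed by a one-sentence explanation."""

def verify_answer(llm, response_text, expected_value):
    # A second, independent LLM call judges mathematical equivalence so that models
    # are not penalized for formatting differences in otherwise correct answers.
    verdict = llm.chat([{"role": "user", "content": VERIFY_PROMPT.format(
        expected=expected_value, response=response_text)}])
    return verdict.strip().upper().startswith("CORRECT")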
I also refined the prompt engineering for the evaluation tasks. The goal was to create instructions that were clear enough to guide the model without being overly prescriptive about the reasoning process. I wanted to test the models' inherent reasoning capabilities, not just their ability to follow step-by-step instructions.
Each refinement to the methodology improved the quality and reliability of the evaluation results, building toward a comprehensive understanding of model capabilities. The data generated through these evaluations would provide insights not just into which models performed best, but into the fundamental nature of how LLMs approach multi-step reasoning tasks.
The Matrix Approach in Action
With the complete evaluation system in place, I began generating systematic results across the matrix of test cases. Each cell in the matrix (e.g., depth=3, documents=4) contained multiple datasets with identical parameters, allowing for statistical analysis of performance under those specific conditions.
The test suite included models from multiple providers:
- Claude models (varying sizes and capabilities)
- OpenAI models (GPT-4 family)
- Grok models
- Various models accessible through OpenRouter
For each model, I would run the same test suite across the entire parameter matrix, generating comprehensive performance profiles. These profiles would reveal not just overall accuracy, but specific strengths and weaknesses across different complexity dimensions.
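Conceptually, the run loop over the matrix can be sketched like this, where load_datasets and run_evaluation are hypothetical stand-ins for the real harness functions:

```python
from itertools import product

DEPTHS = [2, 3, 4]
DOC_COUNTS = [2, 3, 4]
MODELS = [
    ("claude", "claude-3-7-sonnet"),
    ("openai", "gpt-4.1"),
    ("grok", "grok-3-beta"),
    ("openrouter", "meta-llama/llama-4-maverick"),
    # ... remaining models
]

results = []
for (provider, model), depth, num_docs in product(MODELS, DEPTHS, DOC_COUNTS):
    # Each matrix cell holds several datasets with identical parameters.
    for dataset in load_datasets(depth=depth, num_documents=num_docs):
        outcome = run_evaluation(provider, model, dataset)
        results.append({
            "provider": provider,
            "model": model,
            "depth": depth,
            "num_documents": num_docs,
            **outcome,  # e.g. correct, num_searches, elapsed_seconds
        })
```

Running the same loop for every model is what makes the per-cell statistics comparable across providers.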
Understanding the Results
The evaluation yielded a wealth of data - 576 test cases across different models and complexity levels.
The top tier of models achieved remarkable accuracy rates above 90%:
- OpenAI/GPT-4.1: 98.61%
- Grok/grok-3-beta: 97.22%
- Claude/claude-3-7-sonnet: 93.06%
A middle tier of models performed adequately but noticeably worse:
- OpenRouter/meta-llama/llama-4-maverick: 76.39%
- OpenRouter/google/gemma-3-27b-it: 75.00%
And a third tier struggled significantly with accuracy rates around 50%:
- Grok/grok-3-mini-beta: 52.78%
- OpenRouter/google/gemini-2.5-pro-preview: 50.00%
- OpenRouter/deepseek/deepseek-chat-v3-0324: 47.22%
This clear separation was fascinating to me. It suggested that multi-step reasoning with search isn't a capability that improves gradually with model quality - instead, it seems to require crossing certain thresholds of capability. The gap between the top and middle tiers (roughly 15-20 percentage points) was too large to be explained by minor implementation differences.
What made this pattern even more interesting was that it cut across provider boundaries. The top tier included models from three different providers (OpenAI, xAI, and Anthropic), while significantly weaker performance came from smaller variants within those same model families. This suggested that the capability threshold might be related to model scale or architecture decisions rather than training methodology.
The stark difference between Grok-3-beta (97.22%) and Grok-3-mini-beta (52.78%) particularly caught my attention. These models presumably shared similar training approaches but differed in size, yet the performance gap was enormous. This reinforced my hypothesis that reasoning capabilities might be especially sensitive to model scaling decisions.
Document Count vs. Depth: Different Complexity Dimensions
When I began designing the matrix approach, I had assumed that both document count and tree depth would significantly affect performance. The results partly confirmed this intuition, but with an interesting twist.
Looking at accuracy rates by document count:
- 2 documents: 77.08%
- 3 documents: 75.52%
- 4 documents: 68.75%
There was indeed a decline as document count increased, with a particularly notable drop between 3 and 4 documents. This aligned with my expectations - managing information across more documents should be more challenging.
But when I looked at performance by reasoning depth, I found something surprising:
- Depth 2: 74.48%
- Depth 3: 75.00%
- Depth 4: 71.88%
The performance remained remarkably stable across different depths, with only a modest decline at depth 4. This was unexpected - I had anticipated that longer chains of reasoning would be significantly more difficult.
This pattern suggested something profound about LLM reasoning capabilities. The models appeared to struggle more with the distribution of information across sources than with following longer chains of reasoning. Once a model could perform multi-step reasoning at all, adding one more step didn't substantially hurt performance. But adding one more document to search through created a larger challenge.
I started thinking about why this might be. Perhaps the attention mechanisms in transformer models are better suited to maintaining a single chain of thought than to integrating information from multiple sources. Or maybe the search aspect (knowing when and how to search) is inherently more difficult than the reasoning aspect (following a calculation chain once all information is gathered).
This finding had important implications for how we should approach trustworthy AI. If search across documents is the more challenging aspect, then research efforts might be better directed toward improving information retrieval and source integration rather than extending reasoning chains.
The Role of Search in Trustworthy Reasoning
The evaluation data also provided insights into search behavior across models. With an average of 3.02 searches performed per evaluation, it was clear that successful navigation of the formula trees often required multiple search operations to gather all necessary information.
The top-performing models distinguished themselves by:
- Reliably recognizing when search was needed
- Formulating effective search queries
- Successfully interpreting search results
- Keeping track of the information gathering process across multiple searches
This combination of capabilities is precisely what makes for trustworthy reasoning. A model that knows when it doesn't know something, can seek out that information effectively, and can integrate it into its reasoning process is inherently more reliable than one that guesses or hallucinates.
The failures were equally informative. Some models would refuse to use search at all, attempting to guess values they couldn't possibly know. Others would search once but fail to recognize that they needed additional information. Still others struggled with the structured format of search queries or results.
These failure patterns highlight the multi-faceted nature of trustworthy reasoning. It's not just about following a calculation correctly - it's about knowing what information you need, how to get it, and how to use it. The models that excelled in my evaluations were those that demonstrated this complete package of capabilities.
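To make that concrete, here is a minimal sketch of the kind of agent loop these evaluations exercise. The SEARCH:/FINAL: protocol and the keyword lookup are deliberately simplified illustrations, not the project's actual implementation:

```python
def keyword_search(documents: dict[str, str], query: str) -> str:
    """Toy retrieval: return every document that mentions any query term."""
    hits = [text for text in documents.values()
            if any(term.lower() in text.lower() for term in query.split())]
    return "\n---\n".join(hits) if hits else "No results found."

def solve_with_search(ask_model, documents, question, max_searches=5):
    """Minimal agent loop: the model either asks 'SEARCH: <query>' or
    answers 'FINAL: <answer>'; search results are appended to the context."""
    transcript = question
    for _ in range(max_searches):
        reply = ask_model(transcript).strip()  # ask_model is a hypothetical LLM call
        if reply.startswith("SEARCH:"):
            query = reply.split("SEARCH:", 1)[1].strip()
            transcript += f"\n{reply}\nSEARCH RESULT:\n{keyword_search(documents, query)}\n"
        elif reply.startswith("FINAL:"):
            return reply.split("FINAL:", 1)[1].strip()
        else:
            transcript += f"\n{reply}\n"  # model kept reasoning without deciding
    return None  # ran out of searches without a final answer
```

Each of the failure modes above maps onto a branch of this loop: never emitting a search, stopping after a single search, or producing replies that fit neither pattern.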
Efficiency and Practical Considerations
Beyond accuracy, I was also interested in the efficiency of different models. The evaluation data showed an average processing time of 20.21 seconds across all tests, with variations between models and complexity levels.
This relatively quick response time was encouraging. It suggested that even complex reasoning tasks involving multiple searches could be completed within timeframes acceptable for many practical applications. The combination of high accuracy with reasonable efficiency for top-tier models indicated that trustworthy reasoning through search is becoming increasingly viable with current technology.
I'm particularly interested in exploring the relationship between search count, processing time, and accuracy in future analyses. Are more efficient models (those requiring fewer searches) also more accurate? Or is there a trade-off where thoroughness (more searches) leads to better results but slower responses?
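A first pass at that question could reuse the hypothetical results table from the earlier analysis sketch, aggregating per model and checking how the three quantities move together:

```python
# Reuses the hypothetical `df` from the analysis sketch above.
per_model = df.groupby("model").agg(
    accuracy=("correct", "mean"),
    avg_searches=("num_searches", "mean"),
    avg_seconds=("elapsed_seconds", "mean"),
)
print(per_model.sort_values("accuracy", ascending=False))
print(per_model[["accuracy", "avg_searches", "avg_seconds"]].corr())
```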
Implications for AI Development and Trustworthiness
The results from this systematic evaluation have significant implications for how we approach AI development and assessment of trustworthiness:
First, the clear performance tiers suggest that multi-step reasoning with search may emerge as a capability threshold rather than a continuous spectrum. This could be an important benchmark for evaluating model capabilities, similar to how we think about emergent abilities in other areas of AI.
Second, the differential impact of document count versus reasoning depth suggests that we should focus more attention on improving cross-document reasoning and information integration. Models that can effectively synthesize information from multiple sources are likely to be more trustworthy in real-world scenarios where information is rarely contained in a single document.
Third, the strong performance of top-tier models across providers indicates that advanced reasoning capabilities are becoming a standard feature of leading models. This is encouraging for the development of AI systems that can acknowledge knowledge limitations and seek information when needed - a core aspect of trustworthiness.
Finally, the results highlight the importance of systematic evaluation across controlled complexity levels. Without this matrix approach, the nuanced patterns in model performance might have been obscured by random variations in test case difficulty.
Next Steps in Understanding and Improving Model Reasoning
If I'm able to keep working on this project, with enough time and energy for a deeper dive, I would carry out the promising plan I described earlier. Giving an LLM a search function to call already makes the system more reliable, but the biggest gains should come from the training process itself: with a proper dataset, the model could learn to search, find information, and reason as genuine abilities rather than behaviors bolted on from the outside.