Bagel Labs is launching Tiny Tool Use, an intentionally tiny but production-grade open-source library designed to simplify training open-source LLMs for tool use.
Tool-aware LLMs turn text into real-world actions, unlocking autonomous decision-making for robotics and general infrastructure use.
Tiny Tool Use distills the latest advances in tool-use RL, SFT, and evaluation into easy-to-use templates, letting teams train and evaluate tool-calling models without extra scaffolding.
It is fully open source: https://github.com/bagel-org/bagel-RL
Tiny Tool Use ships with:
Interchangeable training algorithms – swap SFT, Direct Preference Optimization (DPO), synthetic teacher signals, and more with a single config change.
Configuration-only workflows – every experiment, tool schema, and hyper-parameter lives in a JSON file, so rerunning training with a different configuration is as simple as editing that file (see the sketch after this list).
First‑class evaluation support – TensorBoard dashboards for training visualization and integration with Berkeley Function Calling Leaderboard scripts.
Dataset flexibility – plug in real data, generate synthetic traces, or compose both without touching core code.
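To make the configuration-only workflow concrete, here is a minimal sketch of what such a JSON config could look like. The field names below are illustrative assumptions, not the library's exact schema; the shipped examples under configs/ (e.g., configs/sft_toolbench_config.json) are the authoritative reference.
{
  "_comment": "hypothetical keys for illustration only",
  "model": "Qwen/Qwen3-0.6B",
  "training": {
    "method": "sft",
    "use_lora": true,
    "learning_rate": 2e-5,
    "num_epochs": 3
  },
  "dataset": {
    "name": "toolbench",
    "use_synthetic_traces": false
  },
  "tools": [
    {
      "name": "get_weather",
      "description": "Look up the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  ]
}
Because everything – the training method, the dataset source, and the tool schema – lives in one file, switching from SFT to DPO or swapping in synthetic traces is a config edit rather than a code change.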
Training Example Using Qwen3 Models
We now provide an example of using the library to train Qwen3 models, applying Low-Rank Adaptation (LoRA) to customize them on the ToolBench dataset. The library ships with an example configuration, configs/sft_toolbench_config.json, which downloads the data, extracts it, and uses the processed data for training.
To run the training code with the Qwen3-0.6B model, use the following command:
python train.py --config configs/sft_toolbench_config.json --output-dir lora_sft_qwen3/
The script will start by downloading and unzipping the ToolBench dataset, which will take several minutes given the size of ToolBench.
The above command starts the training procedure using LoRA adapters. The configuration file can be edited to run full fine-tuning instead of LoRA, and the adapter settings themselves can also be adjusted, as sketched below.
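As a rough illustration, the adapter-related portion of the configuration might look like the snippet below. Again, these key names are assumptions for the sake of example; consult configs/sft_toolbench_config.json for the actual schema.
{
  "_comment": "illustrative keys only; the shipped config defines the real schema",
  "use_lora": true,
  "lora": {
    "r": 16,
    "alpha": 32,
    "dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]
  }
}
Disabling the adapter option (or removing the adapter block) would correspond to full fine-tuning of all model weights, while changing the rank, scaling, or target modules adjusts the capacity of the LoRA adapters.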
Evaluation and Benchmarking
Beyond training, the Tiny Tool Use library offers a robust framework for evaluating the general tool-use capabilities of an adapted model, including the ability to compare evaluation results directly with established benchmarks such as the Berkeley Function Calling Leaderboard (BFCL).
The training statistics can be visualized by running TensorBoard with the following command:
tensorboard --logdir lora_sft_qwen3/
The evaluation data displayed in the TensorBoard dashboard shows the performance of the adapted Qwen3-0.6B model on ToolBench data as training progresses, demonstrating that the Tiny Tool Use library offers clear and interpretable training and evaluation metrics along with improved model capability for function calling.
When the training is complete, the adapters can be merged and saved using the following command:
python save_merge_model.py \
--base_model Qwen/Qwen3-0.6B \
--adapter_path lora_sft_qwen3/ \
--output_dir merged_model/ \
--trust_remote_code
A model adapted with a subset of ToolBench data can be obtained from the following link: Qwen3-0.6B-ToolBench
BFCL Leaderboard
The BFCL evaluation of the model can be performed using the following commands, which will generate model responses on different test cases:
export BFCL_PROJECT_ROOT=/path/to/your/desired/project/directory
bfcl generate --model Qwen/Qwen3-0.6B --local-model-path merged_model/ \
--test-category simple,parallel,multiple,multi_turn
Finally, to obtain scores for the generated model responses, run the following command, which will save the scores as a CSV file:
bfcl evaluate --model Qwen/Qwen3-0.6B \
--test-category simple,parallel,multiple,multi_turn
The Bagel Labs team will continue to improve the library to support broader tool use, with an emphasis on distributed learning algorithms. We welcome contributions, feature requests, and issues on our fully open-source repository: https://github.com/bagel-org/bagel-RL