Training reasoning models for multi-step portfolio optimization
TIFIN
The challenge
Off-the-shelf models struggled with the multi-step reasoning required for portfolio optimization and knowledge-graph intent extraction, and TIFIN needed a repeatable way to keep improving model quality without ballooning compute costs.
Our approach
We built end-to-end AI training and inference pipelines across the Qwen, LLaMA, and GPT model families, with advanced caching and vLLM-based serving for real-time financial insights. We applied Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) to train reasoning models for multi-step portfolio optimization. GRPO, which uses group-normalized advantage estimation, reached 93% accuracy on knowledge-graph intent extraction. We also led supervised fine-tuning (SFT) on frontier-model-generated synthetic data for conversational multi-agent systems.
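To illustrate what group-normalized advantage estimation means in GRPO, here is a minimal sketch, not the pipeline used in this engagement. It assumes several completions are sampled for one prompt and each is scored by some reward function; each completion's advantage is its reward standardized against the group's mean and standard deviation. The function name, the `eps` constant, and the reward values are hypothetical, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    `rewards` holds one scalar score per completion sampled for the
    same prompt. `eps` (an assumed small constant) avoids division by
    zero when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of one prompt, e.g. 1.0 when
# the extracted intent matches the reference and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_normalized_advantages(rewards)

# Completions scoring above the group mean get positive advantages,
# those below get negative ones, and the group sums to roughly zero.
print(advantages)
```

Because the baseline comes from the group itself, this step needs no separate learned value or reward model as long as the reward can be computed directly, for example by checking an extracted intent against a reference. When every completion in a group receives the same reward, all advantages are zero and that group contributes no gradient.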
Results
- GRPO achieved 93% knowledge-graph intent extraction accuracy
- GRPO provided 4x more gradient signal than DPO, with no reward model overhead
- Established scalable training cadences with continuous model improvement cycles