Principia
Our new work, "Reasoning over mathematical objects: on-policy reward modeling and test time aggregation", is out! In this work we 1) built and released training data for deriving mathematical objects; 2) show that on-policy RL with a strong verifier boosts performance; and 3) show that on-policy training on parallel generation + verification boosts performance further.
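To give a rough feel for what "parallel generation + verification" can look like at test time, here is a minimal sketch. It is not the method from the paper: it assumes one simple aggregation rule (sum a verifier's score over identical candidate answers and pick the highest total), and `aggregate`, `toy_verifier`, and the scores are hypothetical names and values made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Sequence


def aggregate(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Pick the answer with the highest total verifier score.

    Assumption: identical strings count as the same answer, and the
    verifier returns a score where higher means "more likely correct".
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals: dict[str, float] = defaultdict(float)
    for answer in candidates:
        # Each parallel sample contributes its verifier score to its answer.
        totals[answer] += verifier(answer)
    return max(totals, key=totals.get)


# Hypothetical stand-in for a learned verifier: a fixed lookup table.
TOY_SCORES = {"2x": 0.1, "x^2 + 1": 0.9}


def toy_verifier(answer: str) -> float:
    return TOY_SCORES.get(answer, 0.0)


# Three parallel samples; the majority answer gets a low verifier score.
samples = ["2x", "2x", "x^2 + 1"]
best = aggregate(samples, toy_verifier)
print(best)
```

In this toy run, plain majority voting would pick "2x", while the verifier-weighted totals (0.2 versus 0.9) favor "x^2 + 1". That gap is the kind of thing a strong verifier is meant to catch.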