In the chart above, the blue curve shows that as more data points are routed to the smaller model, latency and compute requirements fall, but accuracy steadily drops from 78.5% to below 72%. This drop in performance is alleviated by calibrating the 3B model on Granite Guardian 5B's predictions, as shown by the orange curve: instead of falling below 72%, the 3B model now reaches 75.5%.
For the best balance of accuracy and compute cost, we recommend combining the calibrated 3B MoE model with the 5B model. For example, the routing threshold can be chosen so that the calibrated 3B MoE model handles 70% of the traffic and the 5B model handles the remaining 30%. This keeps overall performance stable at roughly 78%–78.5%, with a significant computational advantage over serving the full traffic with the 5B model alone.
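The threshold-based routing described above can be sketched as follows. This is a hypothetical illustration, not the actual Granite Guardian API: the scoring functions, confidence values, and the 0.8 threshold are all assumptions chosen for clarity.

```python
# Hypothetical sketch of confidence-threshold cascading between a small and a
# large guardrail model. Function names and values are illustrative only.

def route_request(text, small_model, large_model, threshold=0.8):
    """Return the small model's verdict when it is confident enough,
    otherwise defer to the larger model. The second return value
    records which model handled the request."""
    label, confidence = small_model(text)
    if confidence >= threshold:
        return label, "3b-moe"          # low-latency path
    return large_model(text)[0], "5b"   # fall back to the larger model

# Toy stand-ins for the calibrated 3B MoE and 5B models: the small model
# pretends to be confident only on short inputs.
def small_model(text):
    return ("safe", 0.95 if len(text) < 50 else 0.60)

def large_model(text):
    return ("safe", 0.99)

print(route_request("hello", small_model, large_model))    # handled by the 3B MoE
print(route_request("x" * 100, small_model, large_model))  # deferred to the 5B
```

In practice, the threshold would be tuned on held-out traffic so that the desired fraction (e.g. 70%) stays on the small model while overall accuracy remains in the target range.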
We believe the new Granite Guardian models are the most capable open-source models of their kind available today. You can download and try them now on the IBM Granite page on Hugging Face.