This is great. I'm also building an LLM evaluation framework with all these benchmarks integrated in one place so anyone can go benchmark these new models on their local setup in under 10 lines of code. Hope someone finds this useful: https://github.com/confident-ai/deepeval
The package acts as a provider for 10+ different evaluation metrics, which can run locally on your machine using models from Hugging Face, or in the cloud if you want more functionality.
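For a feel of the API, here's roughly what a minimal evaluation looks like. This is a sketch based on the current docs, so class names like LLMTestCase and AnswerRelevancyMetric (and the threshold parameter) may differ slightly across versions:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# One test case: the prompt you sent and the output your model produced.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 3-5 business days.",
)

# Score how relevant the answer is to the input; fail the test below 0.7.
assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```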
If you want to evaluate a fine-tuned model, we have integrations with LM Harness and Stanford HELM coming out. If you want to evaluate a RAG application, we have 7+ metrics available for that.
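As an example of the RAG side, a faithfulness check against retrieved context looks something like this (again a sketch; FaithfulnessMetric and the retrieval_context field are the names used in the current docs, and the example data is made up):

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

# A RAG test case also carries the chunks your retriever returned,
# so metrics can check the answer against them.
test_case = LLMTestCase(
    input="When was the company founded?",
    actual_output="The company was founded in 2015.",
    retrieval_context=["Acme Corp was founded in 2015 in Berlin."],
)

# Faithfulness: does the answer stick to what the retrieved context says?
evaluate([test_case], [FaithfulnessMetric(threshold=0.8)])
```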
You can also create your own custom metrics using our interface!
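A custom metric is basically a subclass with a measure() method. Here's a sketch under the same caveat about version differences; BaseMetric is the name in the current docs, and the conciseness scoring below is just a hypothetical stand-in for real logic:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class ConcisenessMetric(BaseMetric):
    """Hypothetical metric that rewards shorter answers."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Toy scoring: 1.0 for very short answers, trending to 0 for long ones.
        self.score = max(0.0, 1.0 - len(test_case.actual_output) / 500)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async variant; newer versions expect this, so just reuse measure().
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Conciseness"
```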
There's also a native Ragas implementation, but one that supports all models.