Hacker News | 3d27's comments

Check out this instead: https://github.com/confident-ai/deepeval

It also has a native Ragas implementation, but supports all models.


This is great. I'm also building an LLM evaluation framework that integrates all these benchmarks in one place, so anyone can benchmark new models on their local setup in under 10 lines of code. Hope someone finds it useful: https://github.com/confident-ai/deepeval
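To make the "benchmark in under 10 lines" idea concrete, here is a minimal, self-contained sketch of what such an API could look like. The names (`run_benchmark`, `model_fn`, the toy dataset) are illustrative assumptions, not deepeval's actual interface:

```python
# Hypothetical sketch of a few-lines benchmarking loop.
# run_benchmark scores any model callable against (prompt, expected) pairs.

def run_benchmark(model_fn, dataset):
    """Return the fraction of prompts the model answers exactly right."""
    correct = sum(1 for prompt, expected in dataset if model_fn(prompt) == expected)
    return correct / len(dataset)

# A toy "model" that just echoes the prompt's last word, plus a tiny dataset.
toy_model = lambda prompt: prompt.split()[-1]
dataset = [
    ("The capital of France is Paris", "Paris"),
    ("2 plus 2 equals 4", "4"),
]
print(run_benchmark(toy_model, dataset))  # -> 1.0
```

A real framework would swap `toy_model` for an actual LLM call and `dataset` for a loaded benchmark such as MMLU, but the control flow is the same.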


(Found this interesting post on Medium; this is not my original work.)


There's a lot more in the evaluation space, including this one: https://github.com/confident-ai/deepeval



Contributions welcome! https://github.com/tensorchord/ai-infra-landscape

Open-source contributions can make it better. :-)


I'm just imagining Jensen Huang laughing in his sleep right now...


How did you calculate accuracy and bias?


The package I built acts as a provider for 10+ different evaluation metrics that run locally on your machine using Hugging Face models, or in the cloud if you want more functionality.

If you want to evaluate a fine-tuned model, we have integrations with LM Evaluation Harness and Stanford HELM coming out. If you want to evaluate a RAG application, we have 7+ metrics available for that.

You can also create your own custom metrics using our interface!
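As a rough illustration of what a custom-metric interface tends to look like in frameworks like this, here is a minimal sketch. The class and method names (`BaseMetric`, `measure`, `is_successful`) are assumptions for illustration, not deepeval's actual base class:

```python
# Hypothetical custom-metric interface: subclass a base, implement measure(),
# and the framework decides pass/fail against a threshold.

class BaseMetric:
    threshold: float = 0.5

    def measure(self, output: str, expected: str) -> float:
        raise NotImplementedError

    def is_successful(self, score: float) -> bool:
        return score >= self.threshold


class ExactMatch(BaseMetric):
    """Toy custom metric: 1.0 on an exact (whitespace-trimmed) match."""
    threshold = 1.0

    def measure(self, output: str, expected: str) -> float:
        return 1.0 if output.strip() == expected.strip() else 0.0


metric = ExactMatch()
score = metric.measure("Paris", "Paris")
print(score, metric.is_successful(score))  # -> 1.0 True
```

The same shape extends to LLM-judged metrics: `measure` would call a model instead of comparing strings, while the threshold logic stays in the base class.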

