Towards Robust And Scalable Evaluation For Large Language Models