Towards reliability and interactive debugging for large language models