Evaluate AI agent behavior on accuracy, efficiency, clarity, safety, and helpfulness, providing scores, grades, and improvement suggestions.