Achieving 95%+ accuracy in automatic conversation evaluation for complex questions requires additional tuning and following structured guidelines. Below are best practices to ensure accurate and reliable scoring.
There are three main actions to tune accuracy:
Excluding conversations that should not be evaluated (not applicable/not needed).
Improving Scorecard Items and Descriptions to give clearer instructions to the AI.
Human in the loop - manually checking a small random sample of automated evaluations (approve or adjust).
Configuration Process
1. Selecting Conversations for Evaluation (Routing Conversation ➡️ Scorecard)
The AI does not know whether a given scoring point is applicable to a conversation - it will answer every question it is asked. If certain scoring points, or the entire scorecard, should not be applied to some conversations, tune this using Filters (a simplified sketch of such routing logic appears at the end of this section).
🚫 Common Mistakes:
Sending all conversations for auto-evaluation
Sending different types of conversations to the same universal scorecard and expecting the AI to guess which items are applicable
Only send conversations that meet these criteria:
✅ Complete Conversation – The conversation must have a logical beginning and end.
✅ Relevant Topic – The conversation's topic must align with the applicable scorecard. Explicitly add the applicable Topics and exclude Topics that should not be evaluated.
✅ Correct Conversation Direction – Ensure the incoming/outgoing direction aligns with the scorecard. If the requirements for agent actions differ between call directions, create a separate scorecard for each direction.
For example, on an outgoing call an agent must state the purpose of the call, give their agent code, and perform a few other steps that would not make sense on an incoming call.
✅ Appropriate Team Selection – Use the correct team assignment for scorecards when multiple teams work on the same conversation topic.
✅ Appropriate Queue Selection – If the company uses queues in telephony that correspond to specific processes, consider filtering by queues when configuring for automatic evaluation.
🚨 Important Exclusions:
For non-standard conversations (incomplete, interrupted, etc.) or topics not covered by existing scorecards, you need to:
Decide whether evaluation is feasible
Develop appropriate scorecards
🔴 Conversations where the topic or content prevents complete evaluation (e.g., the client requests a transfer) should not be evaluated.
🔴 Incomplete conversations (e.g., client hangs up) should not be evaluated.
🔴 Conversations tagged as "Switching to another line" should not be evaluated by scorecards meant for other topics.
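The routing rules above can be thought of as a pre-filter that runs before any AI evaluation takes place. Below is a minimal illustrative sketch in Python; the field names (completeness flag, topic, direction, team, queue, tags) and filter structure are assumptions for illustration, not the actual Ender Turing filter configuration. It only shows the decision logic: a conversation is routed to at most one applicable scorecard, or skipped entirely.

```python
# Illustrative routing sketch (hypothetical field names, not the Ender Turing API):
# a conversation is routed to a scorecard only when every filter criterion matches;
# otherwise it is excluded from automatic evaluation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Conversation:
    is_complete: bool        # logical beginning and end
    topic: str               # e.g. "card_issuing"
    direction: str           # "incoming" or "outgoing"
    team: str
    queue: str
    tags: set = field(default_factory=set)

@dataclass
class ScorecardFilter:
    name: str
    topics: set              # topics this scorecard applies to
    excluded_topics: set     # topics that must not be evaluated by it
    direction: str           # separate scorecards per call direction
    teams: set
    queues: set

def route(conv: Conversation, filters: list[ScorecardFilter]) -> Optional[str]:
    """Return the name of the applicable scorecard, or None to skip evaluation."""
    if not conv.is_complete or "switching_to_another_line" in conv.tags:
        return None                                   # incomplete or transferred calls
    for f in filters:
        if (conv.topic in f.topics
                and conv.topic not in f.excluded_topics
                and conv.direction == f.direction
                and conv.team in f.teams
                and conv.queue in f.queues):
            return f.name
    return None                                       # no scorecard matches: do not evaluate

# Example: only complete outgoing sales calls on the sales queue are evaluated.
outgoing_sales = ScorecardFilter(
    name="Outgoing sales",
    topics={"card_issuing"}, excluded_topics={"complaint"},
    direction="outgoing", teams={"sales"}, queues={"sales_queue"},
)
print(route(Conversation(True, "card_issuing", "outgoing", "sales", "sales_queue"), [outgoing_sales]))
```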
2. Naming of the Scorecard Items & Descriptions
🚫 Common Mistakes:
The question can't be answered Yes/No
Inversion of Positive and Negative Assessment
Adding 'If applicable' to the question
Asking questions about the correctness of the provided information
Asking questions about action in 3rd party systems
Asking ambiguous and subjective questions to evaluate (when several people can make different decisions about the answer)
Not tuning questions to a corporate tone
✅ Scorecard Items and Descriptions wording should be clear and unambiguous:
✅ Yes = Positive Assessment, No = Negative Assessment - word the scorecard items and their descriptions so that a "Yes" answer is a positive assessment of the agent's actions and a "No" answer is a negative assessment of the agent's actions
✅ The item and description must allow a definitive Yes/No answer
✅ Provide clear examples of expected behavior
🚫 Common Mistakes & Fixes (examples; a simple wording-check sketch follows this list):
Inversion of Positive and Negative Assessment:
❌ Not correct: "The agent forces the client to sign the contract."
✅ Correct: "The agent did not force the client to sign the contract" - a "Yes" answer (did not force) positively evaluates the agent's actions, and a "No" answer (on the contrary, i.e., did force) negatively evaluates the agent's actions.
The question can't be answered Yes/No:
❌ Not correct: "Conversation sentiment"
✅ Correct: "The sentiment of the conversation was positive" - a "Yes" answer (sentiment was positive) positively evaluates the agent's actions, and a "No" answer (on the contrary, i.e., sentiment was not positive) negatively evaluates the agent's actions.
Adding 'If applicable' to the question:
❌ Not correct: "Agent stated purpose of the conversation if applicable"
✅ Correct: Remove 'if applicable' and use the routing rules from the 'Selecting Conversations for Evaluation' section above.
Asking subjective questions to evaluate (when several people can make different decisions about the answer):
❌ Not correct: "The agent sounds professional"
✅ Correct: "The agent remains composed and polite when handling objections or difficult situations".
Asking questions about the correctness of the provided information:
❌ Not correct: "Agent correctly answered the question"
✅ Correct: "The agent provided information on where to find help article, which answered the customer's question" OR "The agent stated that a credit card will be issued in three days from now".
Asking questions about action in 3rd party systems:
❌ Not correct: "Agent input correct and full data into CRM"
✅ Correct: Integrate the CRM with Ender Turing to compare the data.
Not tuning questions to a corporate tone:
❌ Not correct: "Agent should greet" -> "Hey, what's up?" will be positively assessed
✅ Correct: "Agent should greet the customer politely and formally (like: Hello, Good Afternoon, or similar)".
Calibration Process
Before you use automatic evaluations as the primary source of feedback for agents, perform a calibration.
Calibration usually consists of the three actions mentioned above:
Excluding conversations that should not be evaluated (not applicable/not needed).
Improving Scorecard Items and Descriptions to give clearer instructions to the AI.
Human in the loop - manually checking a small random sample of automated evaluations (approve or adjust).
Follow the next steps to calibrate and tune automated evaluation accuracy:
1️⃣ Select/Find at least 20 automatically evaluated conversations that meet the following criteria:
Same scorecard - all conversations were evaluated by the same scorecard, and the scorecard wording did not change during the period when these calls were made.
Complete dialogues (conversation with logical beginning and end)
Correct topic/scorecard alignment
Correct team/scorecard alignment
Correct queue/scorecard alignment
Any other criteria that exclude routing mistakes (so you measure only factual AutoQA mistakes on applicable scorecard points)
2️⃣ Review Ender Turing's automated evaluation scores. Manually approve each evaluation, or adjust it if the auto-evaluation contains errors.
3️⃣ Go to the Scorecard configuration section and check the accuracy scores (the sketch after these steps shows how such per-item accuracy can be computed).
4️⃣ For the Scorecard points with <95% accuracy:
Exclude conversations from being evaluated (not applicable/not needed) using Conditions in Ender Turing. Follow the 'Selecting Conversations for Evaluation' section as best practice.
Improve Scorecard Items and Descriptions to give clearer instructions to the AI. Follow the 'Naming of the Scorecard Items & Descriptions' section as best practice.
5️⃣ Repeat the calibration a few days after applying changes.
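Ender Turing displays these accuracy scores in the Scorecard configuration section; conceptually, the accuracy of a scorecard point is the share of calibration conversations where the automated answer matches the human answer. The sketch below illustrates that calculation only - the data structures are assumptions for illustration, not the product's API.

```python
# Illustrative per-item accuracy calculation (data structures are hypothetical).
# Each evaluation is a dict mapping a scorecard item to a "yes"/"no" answer.
from collections import defaultdict

def per_item_accuracy(auto_evals: list[dict], human_evals: list[dict]) -> dict[str, float]:
    """Share of calibration conversations where the automated answer matches the human one."""
    matches, totals = defaultdict(int), defaultdict(int)
    for auto, human in zip(auto_evals, human_evals):
        for item, human_answer in human.items():
            totals[item] += 1
            if auto.get(item) == human_answer:
                matches[item] += 1
    return {item: matches[item] / totals[item] for item in totals}

# Tiny example (use at least 20 conversations in a real calibration):
auto_evals = [{"Agent greeted politely": "yes"}, {"Agent greeted politely": "yes"}]
human_evals = [{"Agent greeted politely": "yes"}, {"Agent greeted politely": "no"}]
accuracy = per_item_accuracy(auto_evals, human_evals)
needs_tuning = [item for item, acc in accuracy.items() if acc < 0.95]   # re-route or reword these
print(accuracy, needs_tuning)
```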
Reporting Feedback to Ender Turing Support
If all configurations are correct but errors persist, gather feedback:
1️⃣ Collect conversations where you did NOT manually correct scores.
2️⃣ Use the feedback template and provide the following (see the sorting sketch after these steps):
5 examples where the system incorrectly favored the agent
5 examples where the system incorrectly penalized the agent
10 examples of correct evaluations
3️⃣ Send the file to your Customer Success Manager or email [email protected].
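The sketch below shows one way to sort calibration results into the three requested groups before filling in the template. The actual feedback template is provided by Ender Turing; the field names and CSV layout here are assumptions for illustration only.

```python
# Illustrative sorting of auto vs. human answers into the three feedback groups
# (field names and file layout are hypothetical; use the official template for submission).
import csv

def categorize(auto: str, human: str) -> str:
    """Classify one auto-evaluated scorecard answer against the human answer."""
    if auto == human:
        return "correct evaluation"
    return "incorrectly favored the agent" if auto == "yes" else "incorrectly penalized the agent"

# Example rows: conversation id, scorecard item, automated answer, human answer.
evaluations = [
    {"conversation_id": "c-101", "item": "Agent greeted the customer politely", "auto": "yes", "human": "no"},
    {"conversation_id": "c-102", "item": "Agent stated the purpose of the call", "auto": "no", "human": "yes"},
    {"conversation_id": "c-103", "item": "Agent greeted the customer politely", "auto": "yes", "human": "yes"},
]

with open("autoqa_feedback.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["conversation_id", "item", "auto", "human", "category"])
    writer.writeheader()
    for e in evaluations:
        writer.writerow({**e, "category": categorize(e["auto"], e["human"])})
```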
By following these steps, you can systematically improve AutoQA accuracy and ensure reliable automated evaluations. 🚀