White Paper: Analysis of Net Promoter Score Verbatims
NPS measurements alone lack explanatory power. Customers’ free-text responses (verbatims) contain the real insights into what they think, but making sense of them is challenging.
In this white paper we tackle these questions:
- How consistent and reliable is manual coding?
- Is automating the coding of NPS verbatims an option?
- How does automated coding compare to manual?
- Can algorithms minimize manual work and turn NPS feedback into actionable information?
- Can they do this reliably?
What is Verbatim Coding?
Net Promoter Score values alone won’t tell you why your product or service received the ratings it did, nor how to improve satisfaction and boost customer loyalty.
Companies close this gap by adding open-ended questions to NPS surveys. Each free-text response, a verbatim, needs to be tagged with the themes it mentions, regardless of the wording used. This process is called coding. Themes, also called codes or categories, capture aspects of the business or service even when customers express them in different words. The set of codes for an open-ended question is called a code frame.
Let’s look at three sentences from NPS responses:
- I was impressed by how friendly the person on the other end of the line was.
- The lady who helped me was friendly.
- Friendliness of staff.
All of these talk about the same thing and need to be coded with the theme staff friendliness.
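A code frame can be thought of as a mapping from themes to the varied wordings that express them. The toy Python sketch below shows the three sentences above all landing on the same theme; the keyword lists and the keyword-matching approach are our own simplified illustration, far cruder than any production coder:

```python
# Toy keyword-based coder: a minimal illustration of applying a code frame.
# The code frame and keyword lists are illustrative assumptions, not Thematic's.
CODE_FRAME = {
    "staff friendliness": ["friendly", "friendliness"],
    "price": ["expensive", "cheap", "price"],
}

def code_verbatim(verbatim):
    """Return the set of themes whose keywords appear in the verbatim."""
    text = verbatim.lower()
    return {theme for theme, keywords in CODE_FRAME.items()
            if any(word in text for word in keywords)}

responses = [
    "I was impressed by how friendly the person on the other end of the line was.",
    "The lady who helped me was friendly.",
    "Friendliness of staff.",
]
for r in responses:
    print(code_verbatim(r))  # each prints {'staff friendliness'}
```

Real coders must handle negation, synonyms, and misspellings, which is why plain keyword rules break down at scale.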
Companies that receive hundreds or thousands of verbatims per week cannot rely on manual coding. It’s just too hard for a person to keep track of old and new themes emerging in this data. Feedback volume, velocity, and variety create demand for useful, repeatable, and scalable coding of free-text responses. Software is the answer, but how well can algorithms perform this important coding task?
Testing the reliability of a text analytics solution
Without accurate and consistent coding, companies can misinterpret customer views. Not all manual coders are equally accurate and consistent, so the standard practice is to test a group of people at the same time.
Once the group’s overall score is known, we can show how consistent an algorithm is with each person and with the group overall. Knowing how closely algorithms can match people’s consistency will provide confidence in the reliability of an automated solution. For more details, check out our blog post with a deep dive on how to measure the accuracy of coding survey responses.
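One plausible way to score a coder’s consistency with another is per-verbatim agreement: for each response, compare the two sets of themes assigned, then average across all responses. The sketch below uses Jaccard overlap for the per-response comparison; this particular metric is our own assumption, and the study’s exact formula may differ:

```python
def pairwise_consistency(coder_a, coder_b):
    """Average per-verbatim Jaccard overlap between two coders' theme sets.

    coder_a, coder_b: lists of sets of themes, one set per verbatim.
    The Jaccard-based definition here is an assumption for illustration;
    the metric used in the study may differ.
    """
    assert len(coder_a) == len(coder_b)
    scores = []
    for themes_a, themes_b in zip(coder_a, coder_b):
        union = themes_a | themes_b
        # Two empty sets (nothing tagged by either coder) count as agreement.
        scores.append(len(themes_a & themes_b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

a = [{"price"}, {"staff friendliness", "wait time"}, set()]
b = [{"price"}, {"staff friendliness"}, set()]
print(round(pairwise_consistency(a, b), 2))  # → 0.83
```

Averaging a coder’s score against every other coder gives the per-person figures reported below, and averaging those gives the group score.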
Two NPS customer feedback datasets
We have tested the accuracy of both people and Thematic on two NPS surveys collected from two different types of businesses.
NPS Survey 1: Software company
This software company regularly sends out NPS surveys to users who contacted the support team.
- 500 responses to the question “Why did you give us this score?” were randomly selected.
- The Head of Support reviewed and modified a code frame suggested by Thematic.
- Five people who work at this company, and Thematic, coded each response according to a pre-defined code frame with one or more themes.
NPS Survey 2: University
The business school at this university runs a one-off relational NPS survey of its students.
- 250 students responded to two open-ended questions: “Why did you give us this score?” and “What can we improve?”
- Four students independently read the comments, and each came up with 5 key themes.
- Thematic identified a code frame and then applied one or more themes to each verbatim. We compared only the top 5 themes.
People’s consistency when coding NPS verbatims
NPS Survey 1: Software company
Five people, all closely familiar with the company’s business, tagged 500 verbatims. The total number of possible codes was around 15, including Other and Nothing.
A person’s consistency with another person in the group ranged from 69% to 87%. A person’s average consistency with the other four ranged from 72% to 81%. The average across the five people was 77.5%.
All five people agreed on the top 3 most important themes, but the order of importance of rarer themes differed. Four people decided that compatibility with certain software platforms was among the top 10 reasons behind NPS, whereas one person did not pick that up. Due to confidentiality agreements, we cannot share more specific examples.
NPS Survey 2: University
Four students who studied at this university read all 250 responses for each of the two open-ended questions and came up with a list of the 5 most common themes.
For Question 1, “Why did you give us this score?”, people came up with these five codes:
Students’ consistency with others ranged from 40% (2 of the 5 top themes matched) to 100% (all 5 matched). On Question 1, people’s consistency with the other three ranged from 40% to 67%, averaging 56%. On Question 2, it ranged from 40% to 60%, averaging 53%.
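These top-5 figures correspond to a simple set-overlap calculation: the fraction of the 5 themes two people share. A minimal sketch, assuming theme names are normalized so that exact string matches suffice:

```python
def top5_consistency(themes_a, themes_b):
    """Fraction of two coders' top-5 theme lists that match exactly.

    Assumes theme names are already normalized (e.g., "exam timing" vs.
    "timing of exams" would need to be reconciled before comparison).
    """
    assert len(themes_a) == len(themes_b) == 5
    return len(set(themes_a) & set(themes_b)) / 5

a = ["teaching quality", "facilities", "career support", "fees", "exams"]
b = ["teaching quality", "facilities", "community", "location", "fees"]
print(top5_consistency(a, b))  # 3 of 5 themes match → 0.6
```

The example theme names are hypothetical; with only 5 slots, each matched or missed theme moves the score by a full 20 points, which explains the coarse 40%/60%/100% values above.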
In an earlier post, we described the results of this NPS accuracy study in detail. The key takeaways:
- The consistency of some people can be remarkably high when it comes to identifying the top 5 key NPS drivers. In both surveys, there were two pairs of people who selected exactly the same top 5 categories.
- People’s consistency varies significantly, showing just how much subjectivity and bias there is in people’s choices.
- The bias is not a personality trait; it is situational. In NPS Survey 2, Person 1, who had the lowest consistency with peers on Question 1, had the highest on Question 2.
Algorithm’s consistency with human coding of NPS verbatims
In NPS Survey 1, the software company, we averaged the consistency across all 500 comments, each tagged by both people and the algorithm with one or more themes. People scored 77.5%, whereas Thematic’s consistency was slightly lower, at 72%. The following graph compares the algorithm side-by-side with people. One of the people (Person 1), who had been working at the company for nearly ten years, had the same consistency as Thematic.
In NPS Survey 2, the consistency was computed once, from the final sets of top 5 themes for each of the two questions (Q1 and Q2). Thematic’s input was raw verbatims, but we allowed a manual configuration step: a human reviewer spent 30 minutes re-organizing the suggested themes. The reviewer used Thematic’s drag-and-drop Themes Explorer, where themes can be merged, moved, or deleted to remove redundancy or occasional noise. This screenshot of the Themes Explorer highlights two sample modifications.
People’s consistency averaged 56% for Q1 and 53% for Q2, while the configuration step raised Thematic’s consistency from 35% to 65% for Q1 and from 50% to 60% for Q2. The following graphs compare the algorithm side-by-side with people.
- Making sense of people’s responses is a subjective task, hence we must measure the reliability of algorithms within the framework of how people perform the same task.
- Out of the box, Thematic did not beat people’s average consistency as a group, although it came very close to some individuals’ consistency.
- Configured Thematic was able to beat people’s performance in two cases. The configuration took 30 minutes per question.
Can an algorithm determine key themes (a code frame) reliably?
Most commercial solutions for making sense of verbatims, including NPS, do not determine key themes. Instead, they require a pre-defined code frame, provided by people. They then either use simple rules (expensive → Price) or train an algorithm on pre-coded past data using supervised text categorization. The critical issue with such approaches is that you may miss unknown or emerging themes. In situations where a product evolves with each release, or in markets where multiple players innovate and compete, emerging themes matter. Therefore, we designed Thematic to determine key themes from raw verbatims using unsupervised learning.
Can algorithms detect emerging themes?
What are emerging themes? These are themes that are not part of a pre-existing code frame. Typically they are rare, at least initially, and difficult to spot. People reading a large volume of responses will struggle to notice an emerging theme. Once they do notice it, they need to re-code all past responses with the new theme.
Algorithms can keep track of many more items than people. They can also re-apply themes to any past data. Therefore, algorithms can deal with emerging themes better. For example, Table 5 shows that where people noticed “Publish exam dates earlier,” Thematic also found a related rare theme “Spread exam dates evenly.” Thematic uses a bottom-up approach, focusing on recurrent themes in people’s comments. If only two people mention a new theme, Thematic can find it.
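The bottom-up idea, flag any phrase that recurs even a couple of times and is not already covered by the code frame, can be sketched as follows. The naive bigram counting here is a stand-in for real phrase extraction (Thematic’s actual method is not described in this paper), and the example comments are hypothetical:

```python
from collections import Counter

def emerging_candidates(verbatims, known_themes, min_mentions=2):
    """Return candidate emerging themes: word bigrams mentioned in at least
    min_mentions verbatims and not already in the code frame.

    Naive bigram counting stands in for real phrase extraction; it is an
    illustration of the bottom-up recurrence idea, not Thematic's algorithm.
    """
    counts = Counter()
    for v in verbatims:
        words = v.lower().replace(".", "").split()
        # A set per verbatim, so each phrase counts once per respondent.
        counts.update({" ".join(pair) for pair in zip(words, words[1:])})
    return {phrase for phrase, n in counts.items()
            if n >= min_mentions and phrase not in known_themes}

comments = [
    "Please spread exam dates evenly.",
    "It would help to spread exam dates across the term.",
    "Great teachers overall.",
]
print(sorted(emerging_candidates(comments, known_themes={"teaching quality"})))
# → ['exam dates', 'spread exam']
```

With `min_mentions=2`, a theme mentioned by only two respondents already surfaces, matching the claim above that two mentions are enough to find a new theme.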
Of course, past manual adjustments matter. For example, if a person deletes a theme while reviewing a previously created code frame, that theme should not be re-discovered in a new batch of responses. Thematic takes care of this by keeping track of all past modifications. New themes are either merged into existing ones or listed as base themes or sub-themes, marked with a star for faster review, as shown above.
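Tracking past edits so they persist across batches can be modeled as a small review log that is replayed over each new discovery run. The class and method names below are our own illustration, not Thematic’s API:

```python
class CodeFrameReview:
    """Records manual edits (deletions, merges) and replays them on
    freshly discovered themes. An illustrative sketch of the idea only;
    these names are not Thematic's actual API."""

    def __init__(self):
        self.deleted = set()
        self.merged = {}   # discovered theme -> theme it was merged into

    def delete(self, theme):
        self.deleted.add(theme)

    def merge(self, theme, into):
        self.merged[theme] = into

    def apply(self, discovered_themes):
        """Filter and rename newly discovered themes per past decisions."""
        result = set()
        for theme in discovered_themes:
            if theme in self.deleted:
                continue  # the reviewer removed it before: keep it out
            result.add(self.merged.get(theme, theme))
        return result

review = CodeFrameReview()
review.delete("noise")
review.merge("helpful staff", into="staff friendliness")
print(sorted(review.apply({"noise", "helpful staff", "pricing"})))
# → ['pricing', 'staff friendliness']
```

Because the log is applied to every new batch, a deletion or merge made once does not have to be repeated when the next wave of responses arrives.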
What about other algorithms?
Thematic is not the only solution that uses algorithms for survey coding. In fact, several solutions have been tested on Survey 2 data. Maurice FitzGerald has already compared them in his excellent post, which also outlines criteria for evaluating text analytics solutions.