What is topic modeling, and how can it help analyze customer data?

Last updated

26 July 2023

Throughout the early 2000s, "big data" often led to even bigger headaches. Now, most organizations want to know how they can meaningfully navigate and apply their sea of data.

Topic modeling meets this growing demand for fast and contextualized data summaries across various formats. It's a form of "unsupervised" machine learning (ML) data processing, so it doesn't require labeled training data or predefined categories.

Let's dive into everything you need to know about the subject, including how to measure topic modeling accuracy. We'll also explain how topic modeling is making waves in computer science and language modeling.

What is topic modeling?

Topic modeling is an artificial intelligence (AI) advancement that companies can use to enhance the customer experience and improve business operations. It allows them to harness the power of big data rather than be overwhelmed by it.

Topic modeling is a form of unsupervised machine learning (ML) that uses natural language processing (NLP). It uncovers hidden themes or topics within a collection of text documents, called a corpus.

Compared to a manual review, topic modeling is a far less labor-intensive way to understand what large volumes of unstructured data are about.

Instead of relying on a human reader, a topic modeler magically (sort of) determines what themes run through a corpus. It attempts to infer the most likely topics underlying the data without human involvement.

How does topic modeling work?

Consider how a document or website's "search" feature requires you to know what you're searching for. Topic modeling doesn't need a point of reference like this.

Instead, it works by:

Determining the most common word clusters throughout documents (without prompting)
Comparing word clusters between multiple sets of data
Contextualizing word clusters to determine semantically connected themes
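
As a rough illustration of the first step, the sketch below counts which word pairs tend to appear together in the same document. It is a simplification, not how any particular topic modeling product works: the stop-word list, the sample documents, and the function names are all assumptions made up for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; real projects use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "my", "on", "in", "for", "i", "was"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def co_occurring_pairs(documents):
    """Count how many documents each unordered word pair appears in together."""
    pair_counts = Counter()
    for doc in documents:
        unique_words = sorted(set(tokenize(doc)))
        pair_counts.update(combinations(unique_words, 2))
    return pair_counts

# Made-up sample data for illustration only.
documents = [
    "The interest on my loan is high",
    "I was charged interest on the loan",
    "Trust and interest in a friendship",
]

pairs = co_occurring_pairs(documents)
top_pair, top_count = pairs.most_common(1)[0]
print(top_pair, top_count)
```

Here the word "interest" appears in all three documents, but only its pairing with "loan" repeats, which hints at a finance theme in two of them.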

Context is critical in topic modeling because a topic modeler goes beyond just ranking the frequency of words and phrases. The end goal is to rank how often certain topics come up.

For example, the topic modeler might determine the following four terms appear most often in a data sample:

Interest
Earning credit
Accounting for
Trust

The topic may seem to be about banking or finance, but what if terms like money, debt, and budget are absent? Suddenly, the original terms could refer to dating, friendship, or the psychology of relationships in general.

You can't always determine a topic from word frequency. Topic modeling involves a certain amount of guesswork, even if a language-modeling algorithm does that guessing.

As in the example, unclear results could mean one of two things:

A need for more data (see the FAQ section at the bottom)
A different method is more appropriate, such as topic classification

For these reasons, topic modeling isn't perfect and relies on estimates. This is where different forms of topic modeling come in, which sort and categorize data differently.

Types of topic modeling

Topic modeling is based on natural language processing (NLP), a branch of computer science concerned with how computers process human language.

This raises the question: what is topic modeling in NLP?

NLP draws on various algorithmic tools to model different aspects of language. Topic modeling fits into NLP as a form of abstraction, meaning it aims to unveil the latent topics behind a collection of text.

Topic modeling programs use several methods, each with its own strengths.

Latent Semantic Analysis (LSA)

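LSA attempts to model language as we commonly use it.

It builds a table of how often each word appears in each document, then compresses that table so words and documents used in similar contexts end up close together. Its results have been compared against human performance on word-sorting tests, and it gauges topic coherence by analyzing which words are and aren't used.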

In the earlier example, LSA would likely place significant weight on the telling absence of words most closely related to finance.

Even though we sometimes use the most common words in one context, the lack of other expected words calls that context into question.

Determining the most likely topic requires balancing the actual language with estimates about what other language should also be there. If it is, the context is strong; if it isn't, the context is weak, so the algorithm will rank that topic as less likely.

The best time to use LSA is when analyzing conversational and readable data, such as:

Testimonials
Survey answers
Long-form articles or books
Articles and blogs for a general audience
Audio or video transcriptions
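
To make the idea behind LSA more concrete, here is a minimal sketch using only the Python standard library. LSA is normally computed with a truncated singular value decomposition from a numerical library; this example swaps in power iteration with deflation so it stays self-contained. The sample documents, function names, and the choice of two latent dimensions are assumptions for illustration.

```python
import math

def gram_matrix(docs):
    """Document-by-document matrix A^T A, where A holds word counts per document."""
    vocab = sorted({w for doc in docs for w in doc})
    counts = [[doc.count(w) for w in vocab] for doc in docs]
    n = len(docs)
    return [[float(sum(a * b for a, b in zip(counts[i], counts[j])))
             for j in range(n)] for i in range(n)]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: multiply and normalize repeatedly to find the dominant eigenpair."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # fixed, asymmetric starting vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vec = [x / norm for x in product]
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vec, product))
    return eigenvalue, vec

def lsa_document_vectors(docs, k=2):
    """Place each document in a k-dimensional latent space (rank-k SVD of A)."""
    matrix = gram_matrix(docs)
    n = len(docs)
    vectors = [[] for _ in docs]
    for _ in range(k):
        eigenvalue, vec = top_eigenpair(matrix)
        sigma = math.sqrt(max(eigenvalue, 0.0))  # singular value of A
        for d in range(n):
            vectors[d].append(sigma * vec[d])
        # Deflation: remove the component just found so the next one can surface.
        for i in range(n):
            for j in range(n):
                matrix[i][j] -= eigenvalue * vec[i] * vec[j]
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up, pre-tokenized documents: two about finance, two about relationships.
docs = [
    ["loan", "interest", "bank"],
    ["bank", "loan", "debt"],
    ["friendship", "trust", "interest"],
    ["trust", "friendship", "love"],
]

vectors = lsa_document_vectors(docs, k=2)
same_topic = cosine(vectors[0], vectors[1])
cross_topic = cosine(vectors[0], vectors[2])
print(round(same_topic, 2), round(cross_topic, 2))
```

The first and third documents share the ambiguous word "interest," yet the first document still lands much closer to the other finance document, because the surrounding words differ.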

Latent Dirichlet Allocation (LDA)

Much like LSA, LDA compares the frequency of words, word clusters, and their connecting themes.

However, LDA takes a more probability-driven approach, emphasizing hard, statistics-driven data over natural language.

LDA still compares data with syntax, phrasing consistency, and other matters important to all NLP studies, but LSA represents these qualities better.

By contrast, LDA places statistical probability of word clusters at the core of its topic modeling algorithm. It also presents topic modeling reports in a more information-dense chart.

LDA is a better method for analyzing customer data related to:

Dense, data-driven analytics
Fields with precise language (e.g., science, law, and any kind of technical writing)
Any type of quantified data where text merely supports or presents hard, measurable data
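
For readers curious about the probability-driven side of LDA, the sketch below fits a tiny model with collapsed Gibbs sampling, one common way to estimate LDA (many libraries use variational inference instead). Everything here is an assumption for illustration: the prior values `alpha` and `beta`, the iteration count, the sample documents, and the function name `lda_gibbs`.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Estimate per-document topic proportions with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                          # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][n]
                # Remove the current assignment before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word, vocab

# Made-up, pre-tokenized documents for illustration only.
docs = [
    ["loan", "interest", "bank", "debt"],
    ["bank", "loan", "budget", "debt"],
    ["friendship", "trust", "love", "interest"],
    ["trust", "friendship", "love", "dating"],
]

theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each printed row is one document's estimated mix of the two topics. Because sampling is random, results on a corpus this small can vary with the seed, which is one reason the output still needs a human review.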

Using Python programming language for topic modeling

One benefit of Python is its readable syntax, which resembles plain English more closely than most programming languages do. This makes Python a popular choice for topic modeling.

It also offers numerous text-mining libraries built specifically for NLP.

While a guide on using Python for topic modeling is beyond the scope of this article, we wanted to mention its utility for anyone going deep into the technical aspects.

Topic classification

Like topic modeling, topic classification mines data for common phrases, but it works in the opposite way.

Unlike topic modeling, topic classification is a form of "supervised" machine learning, so the user must supply inputs, such as tags and keyphrases, for it to function.

The user begins by manually entering certain keyphrases, each tied to a tag, into the topic classifier.

A topic classifier program uses these keyphrases to:

Search the data (or sets of data)
Identify all instances of the keyphrases
Tag passages related to the keyphrase wherever they're found
Compare tagged passages with each other, using many of the same language modeling algorithms as topic modeling

It's much more complex than a simple "search" function. Topic classification uses rule-based systems, which differentiate topics semantically.

The topic classifier can conveniently categorize portions of the text under separate tags, even if the text doesn't contain the keyphrase but only the topic implied by it.
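
A bare-bones version of such a rule-based classifier might look like the sketch below. The tags, the keyphrases, and the `classify` function are hypothetical, chosen only to show how a passage can receive a tag even when the tag's own name never appears in it.

```python
import re

# Hypothetical rules: each tag maps to keyphrases and related terms a user entered.
RULES = {
    "billing": ["invoice", "refund", "charged", "payment"],
    "login": ["password", "sign in", "locked out"],
}

def classify(text, rules=RULES):
    """Return the sorted list of tags whose keyphrases occur in the text as whole words."""
    normalized = " ".join(re.findall(r"[a-z']+", text.lower()))
    padded = f" {normalized} "  # padding lets us match whole words and phrases only
    return sorted(
        tag
        for tag, phrases in rules.items()
        if any(f" {phrase} " in padded for phrase in phrases)
    )

print(classify("I was charged twice and need a refund."))
print(classify("I'm locked out after resetting my password"))
print(classify("Great product!"))
```

The first message is tagged "billing" although it never uses that word. A production classifier would go further than exact phrase matching, for example by handling plurals, misspellings, and synonyms.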

Gradually, machine learning is replacing rule-based systems, performing the same functions with less and less required input.

There are also hybrid systems with topic classifiers that use machine learning when possible.

The user can use the topic classification system to double-check the work of the machine learning system.

However it's accomplished, what's important is simplifying the analysis of customer data.

It allows companies to mine previously unwieldy amounts of raw data for greater context and meaning.

Of course, this applies equally to topic modeling. So how do you know when topic modeling or classification is the right option?

Topic modeling vs. topic classification

While topic classification requires more work than topic modeling, topic classification provides more accurate results.

Topic modeling basically estimates the most relevant keyphrases for you—but how can you be sure they really are the most relevant keyphrases?

After all, won't a topic modeler find the words "the," "and," or "is" more often than almost anything else? Of course, it goes beyond such simplicity by contextualizing word clusters according to themes.
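
One simple, widely used way to stop filler words from dominating is TF-IDF weighting, which scores a word lower the more documents it appears in. This is a swapped-in illustration, not necessarily what a given topic modeler does internally, and the sample documents and names are made up.

```python
import math
import re
from collections import Counter

def tf_idf(documents):
    """Score each word per document: term frequency times log(N / document frequency)."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each word appears in at least once.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Made-up support messages for illustration only.
documents = [
    "the printer is jammed and the tray of the printer is empty",
    "the invoice is wrong and the refund is late",
    "the app is slow and the login is broken",
]

first_doc_words = re.findall(r"[a-z]+", documents[0].lower())
raw_top = Counter(first_doc_words).most_common(1)[0][0]

scores = tf_idf(documents)
weighted_top = max(scores[0], key=scores[0].get)

print(raw_top, weighted_top)
```

By raw count, the most frequent word in the first message is "the." After weighting, words that occur in every message score zero, and "printer" rises to the top.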

But it still raises the question: How certain can you be in the topic modeler's conclusions?

With topic modeling, you still need to review the word clusters.

For example, the modeler might report "computers" as the most relevant topic when a manual review would show that "computer science," more specifically, was the core subject.

Even if a particular phrase occurs often, sheer frequency doesn't prove it's the main subject.

Broadly, deciding between topic modeling and topic classification comes down to three main trade-offs.

In each pair below, the first item describes topic modeling and the second describes topic classification:

Generic vs. specific terms
Speed vs. accuracy
Automation vs. manual effort

With this in mind, topic modeling and topic classification each have their place. You just need to be sure which tool is right for the job. The following rules of thumb should help:

If you know what you're looking for, use topic classification
If you need a quick estimate, use topic modeling
If you have a short list of possible key phrases, use topic classification to narrow it down
If you have large volumes of data and only know what a small portion is about, try topic modeling—but look for word clusters overlapping with known tags

Of course, it's always possible to use topic modeling, then use topic classification to test and review those results or narrow them down using manual searches.

Use cases and applications

Topic modeling is useful when you have more data than you can read or even skim through, but you still need to know what it's generally about. This is quite common.

Consider how many different types of data your organization might use on a given day:

Survey results
Product descriptions
Articles, white papers, and reports
Legal documents
Internal reports
Meeting minutes
Text-based communication, including email, SMS, web chat, call transcriptions, and message boards and forums

It can seem like pulling it all together is an endless, futile test of your ability to compare apples and oranges.

As a "format-agnostic" text-comparison method, topic modeling can see through differences and automatically classify text into a clearer, searchable form.

Topic modeling can dramatically simplify the analysis of:

Customer service
CRM data
Customer feedback (e.g., product reviews, company ratings, direct messages)
Survey results
Product testing
Sales call transcriptions

Examples of topic modeling and topic classification

Consider the following scenarios, plus how either topic modeling or topic classification can help solve the customer data issue:

Topic modeling in customer service

Imagine acquiring a new company, then discovering their customer support ticket system is in serious disarray and severely backlogged.

Your parent company uses a totally different system, and it's unfeasible to redo or merge the disorganized system into yours.

A topic modeler can automatically parse the disorganized support tickets, tagging each one with its most likely topics. Assigning support tickets to customer service reps with relevant skills becomes as easy as compiling all support tickets with a given tag.

Topic classification with customer feedback

A new product launch is imminent, and you're reviewing the final round of customer feedback before you make any last changes.

If you had more time, topic modeling would greatly help determine what your customers considered most important.

Instead, you know you can only address a short list of possible concerns, so you use topic classification to tag feedback according to those matters you can actually affect. You'll quickly see which product feature under your control has attracted the most customer interest.

Topic modeling for sales call transcriptions

As subjective as sales call evaluations can be, they don't have to be a complete mystery.

Simply run your sales call transcriptions through a topic modeling algorithm and let your customers tell you—in more ways than one—exactly what issues are at the top of their list.

You'll have a compelling list of topics most likely driving customer buying decisions, which you can further hone by testing each topic with future sales attempts.