Overview | The Potential of Transcription Technology for Sentiment Analysis
Transcription technology makes it possible to interpret sentiment through automatic analysis of spoken language: once speech is converted to text, the positivity or negativity of an opinion can be detected automatically from the textual data.
For example, by transcribing customer service calls, call center managers can analyze the emotional tone of interactions and identify areas for improvement. If many calls carry negative emotions, managers can investigate the root cause and take action to improve customer satisfaction.
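As a minimal sketch of this idea, positive/negative classification of transcribed utterances might look like the following. This assumes the transcripts already exist as text and uses the default sentiment pipeline from the open-source Hugging Face transformers library, not any model from this project:

```python
# Minimal sketch: classifying sentiment of already-transcribed utterances.
# Uses the transformers default sentiment model, not this project's model.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

transcripts = [
    "Thank you so much, that completely solved my problem.",
    "I've been on hold for an hour and nobody can help me.",
]

for utterance in transcripts:
    result = classifier(utterance)[0]  # e.g. {'label': 'NEGATIVE', 'score': 0.99}
    print(f"{result['label']:>8} ({result['score']:.2f})  {utterance}")
```

Aggregating such per-utterance labels over many calls is what would let a manager spot clusters of negative interactions.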
The Department of AI Software Convergence at Dongguk University is leading the way in fostering talent with AI convergence capabilities to lead the future society. Recently, industry-academia collaborators and researchers within the department embarked on a project to build a psychological counseling chatbot system that analyzes and responds to the speaker's emotions contained in video and text data.
Problem | Utilizing transcription data to develop a comprehensive emotion analysis model
The client needed to build training data for a psychological counseling multimodal sentiment analysis model. Using this model, they planned to build a psychological counseling chatbot system that analyzes the texts users submit and responds accordingly.
In psychological counseling, speech transcription combined with multimodal analysis can be used to analyze the tone and pitch of a voice in order to detect emotions and evaluate the effectiveness of a counseling session. For example, the model could analyze changes in the tone and pitch of a client's voice and, using data that tracks changes in emotional state, evaluate the effectiveness of the counseling techniques applied.
Speech transcription and multimodal analysis could also be applied to emotions and behaviors in other contexts, such as job interviews or public speaking: a machine learning model would analyze voice and facial expressions to detect emotions such as happiness, sadness, anger, and anxiety, and provide an assessment of an individual's emotional state.
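To make the voice side of this concrete, here is an illustrative sketch of extracting a pitch contour from a recording. It assumes a mono WAV file named "session.wav" (hypothetical), and librosa's pyin pitch tracker stands in for whatever feature extractor a production model would use:

```python
# Illustrative sketch: tracking pitch changes in a counseling recording.
# "session.wav" is a hypothetical input file; librosa's pyin estimator
# stands in for the project's actual (unspecified) feature extractor.
import numpy as np
import librosa

y, sr = librosa.load("session.wav", sr=None, mono=True)

# Fundamental-frequency (pitch) contour; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

voiced = f0[~np.isnan(f0)]
print(f"mean pitch: {voiced.mean():.1f} Hz, std: {voiced.std():.1f} Hz")
# Rising pitch variance across successive windows could serve as one crude
# proxy for emotional arousal; a real model would combine far richer features.
```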
Solution | Building a psychological counseling dataset with voice transcription/multimodal technology
Because the model is multimodal, information must be understood through multiple channels so that more accurate emotion analysis can be performed for each sentence. For this project, we agreed that the client would provide the video materials related to the text data before data processing began. By comparing not only the text of each sentence but also the accompanying video, we were able to build a dataset that captures both the context of the conversation and the emotions of the speaker.
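One simple way to organize such paired data is a manifest linking each dialogue line to its source video clip, so annotators can review both before tagging. The layout below is a hypothetical assumption, not the project's actual file format:

```python
# Hypothetical manifest pairing each dialogue line with its source video.
# Column names and file layout are assumptions for illustration only.
import csv

with open("manifest.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # e.g. row = {"dialogue_id": "D0001-07", "text": "...",
        #             "video_path": "clips/D0001.mp4", "start_sec": "42.0"}
        print(row["dialogue_id"], row["video_path"], row["start_sec"])
```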
Here's how Datahunt worked
The semantic segmentation taxonomy that Datahunt worked with is as follows; a sketch of what one labeled record might look like follows the list.
Semantic segmentation via transcription
- speech to text
- text_intent
  - Represents the speaker's intent; consists of 26 classes in total.
- topic
  - The topic of the conversation; consists of 15 classes in total.
- subtopic
  - No predefined classes; the worker makes a subjective judgment. The entry should be as relevant to the category as possible without contradicting the scenario content.
- CML emotions
  - Comprises the 6 emotions defined by the CML Lab. When labeling, use only the six defined values, not similar words.
- Emotion 1
  - More specific emotions than the CML emotions, consisting of 21 pairs plus a single "neutral" attribute. The neutral label should be avoided whenever possible.
- Emotion intensity
  - Tags the intensity of Emotion 1. For neutral, enter "2".
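A rough sketch of what one labeled record under this taxonomy might look like is shown below. The field names and example class values are illustrative assumptions, not the project's actual schema:

```python
# Rough sketch of one labeled record under the taxonomy above.
# Field names and example values are illustrative, not the real schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabeledUtterance:
    speech_to_text: str              # transcribed sentence
    text_intent: str                 # one of 26 intent classes
    topic: str                       # one of 15 topic classes
    subtopic: str                    # free text, consistent with the scenario
    cml_emotion: str                 # one of the 6 CML Lab emotions
    emotion_1: str                   # one of 21 emotion pairs, or "neutral"
    emotion_intensity: int           # intensity of emotion_1; 2 when neutral
    emotion_2: Optional[str] = None  # second emotion for mixed cases

record = LabeledUtterance(
    speech_to_text="I keep worrying that I'll fail the exam again.",
    text_intent="self-disclosure",      # hypothetical class name
    topic="academic stress",            # hypothetical class name
    subtopic="fear of repeated failure",
    cml_emotion="fear",                 # hypothetical class name
    emotion_1="anxious",                # hypothetical class name
    emotion_intensity=3,
)
```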

Workers first watch the video corresponding to the textual dialog and then compare the video with the dialog script, checking the available classes for each item to label the nine items that best fit the dialog.
Topic/subtopic is an area for free human input; any value can be entered as long as it does not contradict the scenario or the broad category of the conversation. A single conversation may also reflect multiple emotions, making it hard to categorize under one emotion alone. In such cases, we tagged the additional emotion in the "Emotion 2" column, allowing flexibility in handling complex situations.
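Rules like these lend themselves to automated checks during quality assurance. Here is a minimal validation sketch reusing the hypothetical LabeledUtterance record above; the CML emotion set shown is a placeholder, since the six actual values are not listed in this write-up:

```python
# Minimal QA sketch for the labeling rules described above.
# CML_EMOTIONS is a placeholder set; the six real values aren't given here.
CML_EMOTIONS = {"joy", "sadness", "anger", "fear", "surprise", "disgust"}

def validate(record: LabeledUtterance) -> list[str]:
    errors = []
    if record.cml_emotion not in CML_EMOTIONS:
        errors.append(f"cml_emotion must be one of the 6 CML values, got {record.cml_emotion!r}")
    if record.emotion_1 == "neutral" and record.emotion_intensity != 2:
        errors.append("emotion_intensity must be 2 when emotion_1 is neutral")
    if record.emotion_2 is not None and record.emotion_2 == record.emotion_1:
        errors.append("emotion_2 should differ from emotion_1")
    return errors

print(validate(record))  # [] for a record that passes every check
```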
Result | Multimodal transcription data with 99.995% accuracy
In this project, Datahunt completed multimodal transcription of approximately 79,346 sentences, of which only 370 were corrected after feedback, resulting in an accuracy rate of 99.995%.
It is easy to assume that a text processing project consists of mechanically performing simple, repetitive tasks. This project, however, required inferring topics and subtopics from each set of dialogues, and five separate tagging entries were required for emotions. Therefore, instead of making judgments from the dialogue text alone, workers watched the related video material alongside it so that more accurate tag values could be entered.

It was one of the more difficult and complex transcription projects, but the client welcomed the result, as it achieved an unusually high accuracy rate of over 99%. Typical text sentiment analysis projects tag only 2-3 sentiment values and use them as training data. Datahunt, however, tagged 8 attribute values to improve the quality of the training data, resulting in much more sophisticated data.
For a healthier Korea
According to a survey report on psychological and emotional support for university students, a single counselor in a university counseling center serves up to 1,505 students. Even at the least burdened universities, each counselor must counsel 212 students. Despite the importance of ongoing counseling, these poor conditions make it difficult for students to receive more than 10-15 sessions unless they are in an extreme risk group.
By developing a multimodal emotion analysis model, our client aimed to reduce the human resources required for psychological counseling. We hope that artificial intelligence technology can help solve and improve social issues like this in the future.