Dataset of transcribed Indonesian stand-up comedy videos with audience laughter annotations from Kompas TV’s YouTube channel

Supriyono, Supriyono ORCID: https://orcid.org/0000-0002-4733-9189, Wibawa, Aji Prasetya, Suyono, Suyono, Kurniawan, Fachrul ORCID: https://orcid.org/0000-0002-3709-8764, Pranolo, Andri, Saubari, Nahdi and Wang, Kunfeng (2025) Dataset of transcribed Indonesian stand-up comedy videos with audience laughter annotations from Kompas TV’s YouTube channel. Data in Brief, 63 (112079). ISSN 2352-3409

Abstract

This dataset presents a large-scale compilation of Indonesian stand-up comedy video transcripts collected from Kompas TV’s official YouTube channel. A total of 3934 videos were processed, capturing over 2.8 million words, 6124 sentences, and 17,394 annotated audience laughter events. Each entry includes the video title, URL, the number of laughter instances, the original transcript, and a cleaned version suitable for downstream natural language processing (NLP) tasks. Data collection employed Python-based web scraping, followed by pre-processing routines such as timestamp and tag removal, whitespace normalization, and character cleaning. The dataset supports research in humor detection, speech emotion recognition, and cultural studies of performative discourse in Indonesian. It is particularly valuable for low-resource language NLP development and for training models on informal spoken content. Researchers may use the dataset for sentiment analysis, summarization, laughter prediction, and sociolinguistic exploration. This openly accessible resource is hosted on Mendeley Data and adheres to ethical standards, with no personal identifiers and full compliance with platform redistribution policies. The dataset fills a notable gap in Indonesian language corpora, particularly in the entertainment and humor domain, providing a foundation for both academic and applied research in computational linguistics and human-centered AI.
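
The pre-processing steps named in the abstract (timestamp and tag removal, whitespace normalization, character cleaning) might look roughly like the Python sketch below. The timestamp format, the bracketed tag format, the laughter markers `[Tertawa]` and `[Laughter]`, and the set of retained characters are all assumptions made for illustration; this page does not describe the authors' actual routines or annotation format.

```python
import re

# All patterns below are illustrative assumptions, not the dataset's documented format.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")    # e.g. 0:05 or 01:02:03
TAG = re.compile(r"\[[^\]]*\]")                            # any bracketed tag, e.g. [Musik]
LAUGHTER = re.compile(r"\[(?:tertawa|laughter)\]", re.IGNORECASE)
DISALLOWED = re.compile(r"[^\w\s.,!?'-]")                  # keep letters, digits, basic punctuation


def count_laughter(raw: str) -> int:
    """Count assumed laughter markers in the original transcript."""
    return len(LAUGHTER.findall(raw))


def clean_transcript(raw: str) -> str:
    """Remove timestamps and tags, drop stray characters, normalize whitespace."""
    text = TIMESTAMP.sub(" ", raw)
    text = TAG.sub(" ", text)
    text = DISALLOWED.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    # Hypothetical transcript fragment, written for this example only.
    raw = "0:01 Selamat malam semuanya! [Tertawa] 0:05 Saya   baru pulang kampung... [Musik] [tertawa]"
    print(count_laughter(raw))
    print(clean_transcript(raw))
```

Laughter is counted on the original transcript before cleaning, because the tag-removal step would otherwise discard the markers that the count depends on.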

Item Type: Journal Article
Keywords: Indonesian stand-up comedy; Laughter annotation dataset; YouTube transcription corpus; Humor detection NLP; Spoken language resources
Subjects: 08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing > 080107 Natural Language Processing
08 INFORMATION AND COMPUTING SCIENCES > 0803 Computer Software > 080308 Programming Languages
08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing
Divisions: Faculty of Technology > Department of Informatics Engineering
Depositing User: Supriyono Supriyono
Date Deposited: 03 Dec 2025 09:22
