Supriyono, Supriyono
ORCID: https://orcid.org/0000-0002-4733-9189, Wibawa, Aji Prasetya, Suyono, Suyono, Kurniawan, Fachrul
ORCID: https://orcid.org/0000-0002-3709-8764, Pranolo, Andri, Saubari, Nahdi and Wang, Kunfeng
(2025)
Dataset of transcribed Indonesian stand-up comedy videos with audience laughter annotations from Kompas TV’s YouTube channel.
Data in Brief, 63 (112079).
ISSN 2352-3409
Abstract
This dataset presents a large-scale compilation of Indonesian stand-up comedy video transcripts collected from Kompas TV’s official YouTube channel. A total of 3934 videos were processed, capturing over 2.8 million words, 6124 sentences, and 17,394 annotated audience laughter events. Each entry includes the video title, URL, the number of laughter instances, the original transcript, and a cleaned version suitable for downstream natural language processing (NLP) tasks. Data collection employed Python-based web scraping, followed by pre-processing routines such as timestamp and tag removal, whitespace normalization, and character cleaning. The dataset supports research in humor detection, speech emotion recognition, and cultural studies of performative discourse in Indonesian. It is particularly valuable for low-resource language NLP development and training models on informal spoken content. Researchers may utilize the dataset for sentiment analysis, summarization, laughter prediction, and sociolinguistic exploration. This openly accessible resource is hosted on Mendeley Data and adheres to ethical standards, with no personal identifiers and full compliance with platform redistribution policies. The dataset fills a notable gap in Indonesian language corpora, particularly in the entertainment and humor domain, providing a foundation for both academic and applied research in computational linguistics and human-centered AI.
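The pre-processing steps named in the abstract (timestamp and tag removal, whitespace normalization, character cleaning) and the per-video laughter count might look roughly like the Python sketch below. It is illustrative only and is not the authors' pipeline: the timestamp pattern, the bracketed tag format, the laughter labels `[Tertawa]` and `[Laughter]`, the set of retained characters, and the function names `count_laughter` and `clean_transcript` are all assumptions, since the abstract does not specify them.

```python
import re

# All patterns below are assumptions; the abstract does not give the formats.
# Timestamps such as 00:01, 1:23:45 or 00:01.500.
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?\b")
# Any bracketed caption tag, e.g. [Musik] or [Tertawa].
TAG = re.compile(r"\[[^\]]*\]")
# Hypothetical laughter labels, matched case-insensitively.
LAUGHTER = re.compile(r"\[\s*(?:tertawa|laughter)\s*\]", re.IGNORECASE)
# Anything other than word characters, whitespace and basic punctuation.
NOT_ALLOWED = re.compile(r"[^\w\s.,!?'-]")


def count_laughter(raw: str) -> int:
    """Count laughter tags in the original (uncleaned) transcript."""
    return len(LAUGHTER.findall(raw))


def clean_transcript(raw: str) -> str:
    """Remove timestamps and tags, drop stray characters, collapse whitespace."""
    text = TIMESTAMP.sub(" ", raw)
    text = TAG.sub(" ", text)
    text = NOT_ALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Laughter is counted on the raw transcript because the cleaning step removes every bracketed tag, including the laughter labels. Note that the assumed timestamp pattern would also strip clock times spoken in the transcript, so a real pipeline would likely need a stricter rule tied to the actual caption format.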
| Item Type: | Journal Article |
|---|---|
| Keywords: | Indonesian stand-up comedy; Laughter annotation dataset; YouTube transcription corpus; Humor detection NLP; Spoken language resources |
| Subjects: | 08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing > 080107 Natural Language Processing; 08 INFORMATION AND COMPUTING SCIENCES > 0803 Computer Software > 080308 Programming Languages; 08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing |
| Divisions: | Faculty of Technology > Department of Informatics Engineering |
| Depositing User: | Supriyono Supriyono |
| Date Deposited: | 03 Dec 2025 09:22 |