Supriyono, Supriyono (ORCID: https://orcid.org/0000-0002-4733-9189), Wibawa, Aji Prasetya, Suyono, Suyono, Kurniawan, Fachrul (ORCID: https://orcid.org/0000-0002-3709-8764), Pranolo, Andri, Saubari, Nahdi and Wang, Kunfeng (2025). Dataset of transcribed Indonesian stand-up comedy videos with audience laughter annotations from Kompas TV’s YouTube channel. Data in Brief, 63. ISSN 2352-3409
Text: 24816.pdf - Published Version. Available under License Creative Commons Attribution.
Abstract
This dataset presents a large-scale compilation of Indonesian stand-up comedy video transcripts collected from Kompas TV’s official YouTube channel. A total of 3,934 videos were processed, capturing over 2.8 million words, 6,124 sentences, and 17,394 annotated audience laughter events. Each entry includes the video title, the URL, the number of laughter instances, the original transcript, and a cleaned version suitable for downstream natural language processing (NLP) tasks. Data collection employed Python-based web scraping, followed by pre-processing routines such as timestamp and tag removal, whitespace normalization, and character cleaning. The dataset supports research in humor detection, speech emotion recognition, and cultural studies of performative discourse in Indonesian. It is particularly valuable for low-resource language NLP development and for training models on informal spoken content. Researchers may use the dataset for sentiment analysis, summarization, laughter prediction, and sociolinguistic exploration. This openly accessible resource is hosted on Mendeley Data and adheres to ethical standards, with no personal identifiers and full compliance with platform redistribution policies. The dataset fills a notable gap in Indonesian language corpora, particularly in the entertainment and humor domain, providing a foundation for both academic and applied research in computational linguistics and human-centered AI.
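The pre-processing steps named in the abstract (timestamp and tag removal, whitespace normalization, character cleaning) could be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the timestamp format (`mm:ss` or `hh:mm:ss`), the bracketed tag style, the laughter markers `[Tertawa]` / `[Laughter]`, the set of retained punctuation, and the function names `count_laughter` and `clean_transcript` are all assumptions made for the example.

```python
import re
import unicodedata

# All patterns below are assumptions; the real transcript format may differ.
TIMESTAMP_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")   # e.g. 00:05 or 01:02:03
TAG_RE = re.compile(r"\[[^\]]*\]|<[^>]*>")                    # e.g. [Musik] or <c>
LAUGHTER_RE = re.compile(r"\[\s*(?:tertawa|laughter)\s*\]", re.IGNORECASE)
DISALLOWED_RE = re.compile(r"[^\w\s.,!?'\-]")                 # keep letters, digits, basic punctuation


def count_laughter(raw):
    """Count laughter markers in the raw transcript (before tags are stripped)."""
    return len(LAUGHTER_RE.findall(raw))


def clean_transcript(raw):
    """Return a cleaned transcript: no timestamps, no tags, normalized whitespace."""
    text = unicodedata.normalize("NFKC", raw)   # unify Unicode character variants
    text = TIMESTAMP_RE.sub(" ", text)          # timestamp removal
    text = TAG_RE.sub(" ", text)                # tag removal
    text = DISALLOWED_RE.sub(" ", text)         # character cleaning
    return re.sub(r"\s+", " ", text).strip()    # whitespace normalization


if __name__ == "__main__":
    sample = "00:01 halo semua [Tertawa]\n00:05  apa   kabar? [Musik] [tertawa]"
    print(count_laughter(sample))
    print(clean_transcript(sample))
```

Counting laughter markers on the raw text before stripping tags keeps the annotation count independent of the cleaning step, which mirrors the dataset's separation of the original transcript from the cleaned version.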
| Item Type: | Journal Article |
|---|---|
| Keywords: | Indonesian stand-up comedy; Laughter annotation dataset; YouTube transcription corpus; Humor detection NLP; Spoken language resources |
| Subjects: | 08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing > 080107 Natural Language Processing; 08 INFORMATION AND COMPUTING SCIENCES > 0803 Computer Software > 080308 Programming Languages; 08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing |
| Divisions: | Faculty of Technology > Department of Informatics Engineering |
| Depositing User: | Supriyono Supriyono |
| Date Deposited: | 03 Dec 2025 09:22 |