Dataset of transcribed Indonesian stand-up comedy videos with audience laughter annotations from Kompas TV’s YouTube channel

Supriyono, Supriyono ORCID: https://orcid.org/0000-0002-4733-9189, Wibawa, Aji Prasetya, Suyono, Suyono, Kurniawan, Fachrul ORCID: https://orcid.org/0000-0002-3709-8764, Pranolo, Andri, Saubari, Nahdi and Wang, Kunfeng (2025) Dataset of transcribed Indonesian stand-up comedy videos with audience laughter annotations from Kompas TV’s YouTube channel. Data in Brief, 63. ISSN 2352-3409

Text
24816.pdf - Published Version
Available under License Creative Commons Attribution.

Full text available at: https://www.sciencedirect.com/science/article/pii/...

Abstract

This dataset presents a large-scale compilation of Indonesian stand-up comedy video transcripts collected from Kompas TV’s official YouTube channel. A total of 3,934 videos were processed, capturing over 2.8 million words, 6,124 sentences, and 17,394 annotated audience laughter events. Each entry includes the video title, URL, the number of laughter instances, the original transcript, and a cleaned version suitable for downstream natural language processing (NLP) tasks. Data collection employed Python-based web scraping, followed by pre-processing routines such as timestamp and tag removal, whitespace normalization, and character cleaning. The dataset supports research in humor detection, speech emotion recognition, and cultural studies of performative discourse in Indonesian. It is particularly valuable for low-resource language NLP development and for training models on informal spoken content. Researchers may use the dataset for sentiment analysis, summarization, laughter prediction, and sociolinguistic exploration. This openly accessible resource is hosted on Mendeley Data and adheres to ethical standards, with no personal identifiers and full compliance with platform redistribution policies. The dataset fills a notable gap in Indonesian language corpora, particularly in the entertainment and humor domain, providing a foundation for both academic and applied research in computational linguistics and human-centered AI.
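
The pre-processing routines named in the abstract (timestamp and tag removal, whitespace normalization, character cleaning) could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the authors' actual pipeline: the laughter tag spellings ("[Tertawa]", "[Laughter]"), the timestamp pattern, the set of retained characters, and the function names count_laughter and clean_transcript are all hypothetical and are not taken from the dataset.

```python
import re

# All patterns below are assumptions made for illustration; the published
# dataset may use different tag spellings and timestamp formats.
LAUGHTER_TAG = re.compile(r"\[(?:tertawa|laughter)\]", re.IGNORECASE)
TIMESTAMP = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")   # e.g. 0:05 or 01:02:03
BRACKET_TAG = re.compile(r"\[[^\]]*\]")                    # e.g. [Musik], [Tertawa]
DISALLOWED = re.compile(r"[^\w\s.,!?'-]")                  # keep letters, digits, basic punctuation


def count_laughter(raw: str) -> int:
    """Count laughter tags in the raw transcript (must run before cleaning)."""
    return len(LAUGHTER_TAG.findall(raw))


def clean_transcript(raw: str) -> str:
    """Remove timestamps and bracketed tags, drop stray characters, collapse whitespace."""
    text = TIMESTAMP.sub(" ", raw)
    text = BRACKET_TAG.sub(" ", text)
    text = DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Made-up sample text, not an excerpt from the dataset.
    sample = "0:01 Selamat malam semuanya [Tertawa]\n0:05 Saya   baru pulang kampung [Musik] [tertawa]"
    print(count_laughter(sample))
    print(clean_transcript(sample))
```

Counting laughter on the raw text and cleaning afterwards keeps the two fields independent, which matches the abstract's description of each entry carrying both a laughter count and a cleaned transcript.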

Item Type:	Journal Article
Keywords:	Indonesian stand-up comedy; Laughter annotation dataset; YouTube transcription corpus; Humor detection NLP; Spoken language resources
Subjects:	08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing > 080107 Natural Language Processing; 08 INFORMATION AND COMPUTING SCIENCES > 0803 Computer Software > 080308 Programming Languages; 08 INFORMATION AND COMPUTING SCIENCES > 0801 Artificial Intelligence and Image Processing
Divisions:	Faculty of Technology > Department of Informatics Engineering
Depositing User:	Supriyono Supriyono
Date Deposited:	03 Dec 2025 09:22
