A Data Set of YouTube Audio Transcriptions

Jeremy Singer-Vine’s newsletter “Data is Plural” recently featured YouTube Commons, a data set of openly licensed audio transcripts from over 2 million videos, built by a French startup. From the newsletter: “The dataset indicates each video’s YouTube ID, title, channel, and date, as well as each transcript’s original language, translated language, word count, and character count. Translations are available primarily in Dutch, English, French, German, Italian, Russian, and Spanish.”

Such a data set could be a valuable resource for researchers who work with YouTube video data and want to analyze both the audio and video channels in substantive research. It could also serve methodological research that compares and combines modalities for interaction analysis.

Editorial Team