Abstract: Brentel, I. (2018, 26.10.). A case study of processing large scale data – A method to accomplish reproducibility. BigSurv18 – Big Data Meets Survey Science. Research and Expertise Center in Survey Methodology (RECSM), Universitat Pompeu Fabra, Barcelona, Spain. [Tandem 1]

Relevance & Research Question:
Our project addresses a gap in communication studies by creating a harmonized longitudinal dataset on media use in Germany (from 1954 onwards), drawing on the Media-Analysis-Data, which is based on representative surveys with 30,000 respondents each year. The relevance of the project lies in making large-scale media use data accessible for academic research with high standards of data documentation. The research question, therefore, is: how can the Media-Analysis-Data, as ‘big data’, be made accessible for academic research while remaining transparent enough to ensure reproducibility?

Methods & Data:
This paper presents the theoretical and practical uses of the digital harmonization software CharmStats over the course of this project. The goal of the data processing was to create a scientific use file that sets excellent documentation standards with the help of CharmStats, and to continue the harmonization up to 2009. Using CharmStats, we review the challenges encountered and the solutions developed in large-scale data processing as a case study in mass variable harmonization. With more than 1.5 million cases per dataset (there are two harmonized datasets in total) and almost 30,000 variables, covering over 60 years for press media, almost 40 years for radio and now eight years for online media, the Media-Analysis-Data can be regarded as the largest dataset on media use in Germany made available to academics. The methodological approach of this project can therefore serve as a use case for documenting and harmonizing big data for academic research in a traceable way.

The aim of the project is to make the complex and labour-intensive data processing procedure for large-scale data fully transparent and traceable. CharmStats makes it possible to meet the project's goals because it produces syntax for proprietary statistical software for data processing, plus a report for documentation. In the presentation we will describe the steps taken to meet the project's goals and answer the research question:

  1. Finding a structure to work with,
  2. Setting standards for data documentation making data processing traceable with CharmStats,
  3. Producing a harmonized dataset, and
  4. Making the dataset reproducible and, moreover, making it an accessible and sustainable source for academic research through the Library of Online Harmonization (scheduled for release in 2019).
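
To illustrate the general idea behind steps 2 and 3, the following is a minimal sketch in Python of a documented variable harmonization: each recode is stored as data, so the same definition can both transform the records and generate a human-readable report. It does not show CharmStats itself or its output format. The class Recode, the functions harmonize and report, and all variable names and codings (v12, zt_nutz, reads_daily_paper) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recode:
    """One documented mapping from a wave-specific variable to a harmonized one.

    Hypothetical structure for illustration; not the CharmStats data model.
    """
    wave: int          # survey year the source variable belongs to
    source_var: str    # variable name in the original wave
    target_var: str    # variable name in the harmonized dataset
    mapping: dict      # source code -> harmonized code


def harmonize(record, recode):
    """Return the harmonized value for one respondent record.

    Returns None when the source value is missing or has no mapping,
    so unmapped codes stay visible instead of being silently guessed.
    """
    return recode.mapping.get(record.get(recode.source_var))


def report(recodes):
    """Build one documentation line per mapped code, in a stable order."""
    lines = []
    for r in recodes:
        for src, dst in sorted(r.mapping.items()):
            lines.append(f"{r.wave}: {r.source_var}={src} -> {r.target_var}={dst}")
    return lines


# Hypothetical codings for two waves that measured the same concept differently.
recodes = [
    Recode(1954, "v12", "reads_daily_paper", {1: 1, 2: 0}),
    Recode(2009, "zt_nutz", "reads_daily_paper", {1: 1, 2: 1, 3: 0}),
]

if __name__ == "__main__":
    print(harmonize({"zt_nutz": 2}, recodes[1]))
    for line in report(recodes):
        print(line)
```

Keeping the mappings as data rather than as ad hoc recode statements means the processing and its documentation cannot drift apart, which is the kind of traceability the project aims for.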