Data processing

What is data processing?

Any study that involves data (usually quantitative) requires some form of data processing. Popular libraries for this are pandas for Python and dplyr for R. While data processing may seem straightforward (e.g. changing a column name), it can get complicated quickly: think of duplicates, missing values, wrong data types, and large datasets. A clean, reproducible data processing pipeline is an essential part of quality research. The Turing Way Community even has its own job title for this, the Data Wrangler.
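
To make the problems named above concrete, here is a minimal pandas sketch. The dataset, its column names (resp_id, "Age ", income) and the clean function are made up for illustration and do not come from a real project. It renames a column, drops a duplicate row, converts text to numbers, and removes rows with a missing age. Whether to drop or impute missing values depends on the study, so treat that step as one possible choice.

```python
import pandas as pd

# Hypothetical survey data with typical problems: a trailing space in a
# column name, a duplicated row, a missing value, and numbers stored as text.
raw = pd.DataFrame(
    {
        "resp_id": [1, 2, 2, 3, 4],
        "Age ": ["34", "51", "51", None, "27"],
        "income": ["42000", "38500", "38500", "51000", "not reported"],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    out = df.rename(columns={"Age ": "age"})  # fix the column name
    out = out.drop_duplicates()               # remove exact duplicate rows
    # Convert text to numbers; entries that cannot be parsed become NaN.
    out = out.assign(
        age=pd.to_numeric(out["age"], errors="coerce"),
        income=pd.to_numeric(out["income"], errors="coerce"),
    )
    # One possible choice: drop respondents whose age is missing.
    out = out.dropna(subset=["age"])
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
print(cleaned.dtypes)
```

Keeping every step inside one function that takes the raw data and returns a new frame means the whole cleaning process can be rerun from scratch, which is what makes the pipeline reproducible.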

Data processing on the cloud

Probably every social scientist has said this at some point: “I left my laptop running overnight”. When that happens once, fine, but when it happens regularly, it’s time to think about cloud solutions.

What does it look like?

Example project