A Beginner’s Guide to Building a Dataset for Quantitative Analysis of Social, Legal and Environmental Phenomena

L. Ciezkowsky[1]; L. Lizarazo Rodriguez[1]; J. Miller[2,3]; M. Walter[4]; N. Nagabhatla[2]

1. Curiae Virides Research Project hosted at the Brussels School of Governance (Vrije Universiteit Brussel)

2. United Nations University – Institute on Comparative Regional Integration Studies (UNU-CRIS)

3. Universiteit Gent

4. EJAtlas, Institut Barcelona Estudis Internacionals (IBEI, Barcelona)

1. Introduction

On 16 June 2023, the ERC Curiae Virides team held an expert workshop on building databases at the Brussels School of Governance (Vrije Universiteit Brussel, VUB). We invited scholars whose current or recent research involved the creation of a brand-new dataset for quantitative analysis. Our purpose was to exchange experiences and, more specifically, to discuss the challenges they encountered at different steps of the process and the solutions they applied to overcome them.

It was not because we are novices in the field that we needed this workshop; all members of our team are well-versed in data collection and analysis. But some of our team members had been working on the same dataset for more than two years. When you realize that there is a better way to stratify a variable, or a better variable altogether, and you need to redo your work for the nth time, you cannot help feeling frustrated and wondering whether you really are going in the right direction. Of course, we could have just asked ChatGPT to “Describe the challenges of building a dataset and the solutions to overcome them,” and we would have come out with a list of very truthful and valuable advice (we know; we checked). But practical advice was not enough; our team needed more than that. It was only when the first workshop speaker began to describe her issues, and we saw virtually all attendees in the room smiling and nodding in agreement, that we understood what it was: we needed to see that others had been through similar situations. We needed to hear that every researcher in the room had, at some point, “hit the wall” in extreme frustration at having to redo their work yet again. And we needed to see that, for every scientific or methodological issue we had, someone had gone through a similar issue and had found a solution that is not readily available on the internet. For this reason, we decided to summarize the main takeaways from the workshop in this blogpost, so that other researchers who are struggling with building their first (or second, or third…) dataset may benefit as well.

The workshop speakers whose insights we draw on for this blogpost are:

Liliana Lizarazo-Rodriguez (Vrije Universiteit Brussel—Brussels School of Governance), on creating a dataset for ERC Project Curiae Virides on the fuzzy path of transnational ecological conflicts towards litigation;
Justine Miller (United Nations University – Institute on Comparative Regional Integration Studies and Ghent University), on creating a dataset on Regional Integration Agreements;
Daniela del Bene (ICTA-UAB, Barcelona) and Mariana Walter (IBEI, Barcelona) on creating a dataset on socio-ecological conflicts (EJAtlas);
Nidhi Nagabhatla (United Nations University—Institute on Comparative Regional Integration Studies) and Simon Happersberger (Vrije Universiteit Brussel—Brussels School of Governance and United Nations University—Institute on Comparative Regional Integration Studies) on the relevance of context in creating a dataset on the Environmental Impact Assessments of Trade Agreements.

This blogpost is organized as follows: Section 2 discusses the importance of finding and using only reliable sources; Section 3 deals with the challenge of designing the correct variables and variable categories; Section 4 covers the need for multiple iterations; Section 5 explains the importance of keeping a log book; and Section 6 contains our concluding remarks.

New York Times bestselling author Dan Heath, one of the world’s most influential management thinkers, wrote: “Data are just summaries of thousands of stories.” I could not agree more. When I, Milla Cieszkowsky, joined the ERC Curiae Virides team, I had no idea how many stories I was going to learn. The project involved creating a non-exhaustive (because exhausting all sources would be impossible) but “as-comprehensive-as-possible” dataset on transnational ecological conflicts (TEC) that occurred all over the world in the past 22 years and on their transformation into litigation.

The task turned out to be herculean. My colleagues had already been working on this dataset for almost two years and, still, a team of five researchers worked on it for another entire year. Altogether, we screened reports on over 10,000 conflicts and retained 2,500+ cases that matched our selection criteria. Although the volume of information was monumental, that was definitely not the most challenging aspect of the job. The real challenge lay in creating variables and variable values that would preserve the stories behind each conflict and, most importantly, that would allow these stories to be retold after they were converted into data points on a spreadsheet. I had always worked with extreme care on all datasets, but this time my sense of duty was heightened, like a driver who normally drives carefully because they do not want to die in an accident, but who exercises even greater care when they know there are children in the backseat. I felt I owed it to all those involved to tell their story right.

But telling a story through data is never easy. It is a bit like writing a non-fiction novel. First, you need to find reliable, unbiased sources: you need to know the facts exactly as they happened, as you do not want to tell a one-sided story. Then you need to sort out your details: facts are important, but adding information on the characters, their background and personalities will help your readers understand why certain things happened the way they did. Finally, your story needs to fulfill a purpose: to entertain, to inform, to foster personal growth, etc. In the context of dataset-building, the steps are quite similar: you start with reliable sources, then you mark the values corresponding to the facts/events you are mapping. Information on the conditions under which these facts happened (such as time, frequency, and other relevant factors) will allow for the identification of trends, which add “character” to your work. All this bearing in mind that, in addition to supporting your own research, your dataset will likely also serve the purpose of supporting future research of other scholars. Below, my colleagues and I share a few insights derived from our personal experiences which we believe could be of help to fellow researchers.

2. Finding/Choosing appropriate and reliable sources

When working with data, preparation is as important as the work itself. In Justine’s words, you need to get intimate with your sources and with your own research before you begin your data collection process. Understand your research questions, and define exactly what you need to know to answer them and where you can find that information. Quite often, the pool of allowed or available sources is limited. This was the case for UNU-CRIS’s dataset on the Environmental Impact Assessment (EIA) of Trade Agreements, which could rely only on the documents and reports on EIAs reported to the WTO. In other cases, such as at Curiae Virides, myriad information sources are available; therefore, rules of inclusion concerning the type of information and the qualification of sources need to be well-defined, preferably before the data collection starts.

News sources, for example, are a double-edged sword. The Curiae Virides team had long discussions on whether or not to use online news articles as sources. Because their research involves mapping ecological conflicts at all stages, being able to rely on smaller news sources would have been an advantage, since many grassroots conflicts do not make it into reliable national or international news outlets. The downside was that many of the smaller sources around the world are independent and/or unverified, and the team did not have the resources to fact-check every single source to filter out biased or fake news. In the end, Curiae Virides decided to rely only on cases that were already featured in other databases or in repositories from courts and international organizations. This method allowed them to capitalize on the work of other research groups that could count on larger teams and longer timeframes to conduct their fact-checking, such as the EJAtlas or the Climate Change Litigation Databases of the Sabin Center.

Justine Miller and her co-author at UNU-CRIS chose the “database on databases” approach. It made sense in their case because all the information they needed on trade agreements and regional organizations had likely already been collected by other organizations. Besides, their core goal was to gather all the disparate data and reconcile the differences into a single document, rather than to start a new dataset from scratch.
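To make the reconciliation idea more concrete, here is a minimal Python sketch of how records from two existing databases could be merged while keeping track of disagreements. It is only an illustration: the field names (`agreement`, `year_signed`, `members`), the sample records and the `reconcile` helper are all hypothetical and are not taken from the UNU-CRIS dataset.

```python
from collections import defaultdict

# Hypothetical records describing the same agreements in two existing databases.
db_a = [
    {"agreement": "Agreement A", "year_signed": 1994, "members": 3},
    {"agreement": "Agreement B", "year_signed": 2001, "members": 5},
]
db_b = [
    {"agreement": "Agreement A", "year_signed": 1994, "members": 3},
    {"agreement": "Agreement B", "year_signed": 2002, "members": 5},
    {"agreement": "Agreement C", "year_signed": 2010},
]

def reconcile(*databases, key="agreement"):
    """Merge records by key; fields on which the sources disagree are left empty and listed."""
    seen = defaultdict(lambda: defaultdict(set))
    for database in databases:
        for record in database:
            for field, value in record.items():
                if field != key:
                    seen[record[key]][field].add(value)

    merged, conflicts = {}, []
    for name, fields in seen.items():
        merged[name] = {}
        for field, values in fields.items():
            if len(values) == 1:
                merged[name][field] = next(iter(values))
            else:
                # Do not guess: leave the cell empty and send it to manual review.
                merged[name][field] = None
                conflicts.append((name, field, sorted(values, key=str)))
    return merged, conflicts

merged, conflicts = reconcile(db_a, db_b)
for name, field, values in conflicts:
    print(f"Check manually: {name} / {field} has conflicting values {values}")
```

The design choice worth noting is that the sketch never picks a winner between conflicting sources; it only surfaces the disagreement so that a researcher can resolve it and document the decision.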

In a more original approach, the EJAtlas team moved away from traditional models of research sourcing and developed EJAtlas, a database of environmental conflicts worldwide, relying mostly on a process of co-production of knowledge. Their conceptual framework, Environmental Justice, draws on the concept of activist knowledge that guided the FP7 project Environmental Justice Organisations, Liabilities and Trade (EJOLT). In ten years, EJAtlas documented over 3,900 cases of socio-ecological conflicts in a collaborative process between engaged researchers (scholars, students and on-the-ground journalists) and grassroots activists. Following the co-production methodology, a first draft of each case is prepared either by the researchers or by the local groups, and then sent to the other group for revision and comments. Each case also includes information collected from diverse data sources, such as official reports, NGO reports, EIAs, court decisions, academic papers, as well as ‘grey literature’ and media sources, all of which are referenced. The EJAtlas team of editors, mostly based at the Institut de Ciència i Tecnologia Ambientals of the Universitat Autònoma de Barcelona (ICTA-UAB), then checks the quality of the case and its sources and follows up on feedback and comments to ensure quality control for clarity, completeness and reliability of sources.

The advantages of this method are many, such as first-hand information, community engagement, and a substantial increase in the scope and amount of data collected. The disadvantages include the time required to review the cases and their sources and to ensure the good quality of the registry. Another potential risk of co-produced datasets is that researchers (even those on the same team) can differ in how they view, understand and translate complex social phenomena into data points. This risk can be hedged by having well-defined variables and variable values (as we will see in the next section) and by developing strong, detailed notation rules to be adopted by all data collectors and moderators.

Regardless of the type of sources you decide to use for your research, we recommend having clearly defined inclusion criteria, including at least one form of reliability check (e.g., only news published by Reuters, only documents published on official government websites, etc.). Make sure that your reliability criteria are applicable in all the countries you research. There may be cases, for example, where official information is not necessarily reliable. For this reason, checking factual evidence and contrasting versions may also be a crucial step in guaranteeing data reliability. Finally, we suggest writing down all source criteria before you begin the data collection work, to avoid the temptation of adding more sources as you go and never finishing the job.
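One way to make sure the criteria really are fixed in advance is to write them down in a form that can be applied mechanically. The Python sketch below shows the idea under stated assumptions: the allowed source types, the year window, the `Source` fields and the `inclusion_problems` helper are all hypothetical examples, not the criteria used by any of the teams mentioned here.

```python
from dataclasses import dataclass

# Hypothetical inclusion criteria, written down before data collection starts.
ALLOWED_SOURCE_TYPES = {"court_repository", "international_organization", "existing_database"}
FIRST_YEAR, LAST_YEAR = 2001, 2023  # illustrative time window only

@dataclass
class Source:
    title: str
    source_type: str
    year: int
    reliability_checked: bool  # outcome of whatever reliability check you defined

def inclusion_problems(source):
    """Return the reasons a source fails the criteria (an empty list means: include)."""
    problems = []
    if source.source_type not in ALLOWED_SOURCE_TYPES:
        problems.append(f"source type '{source.source_type}' is not on the allowed list")
    if not FIRST_YEAR <= source.year <= LAST_YEAR:
        problems.append(f"year {source.year} is outside {FIRST_YEAR}-{LAST_YEAR}")
    if not source.reliability_checked:
        problems.append("no reliability check on record")
    return problems

candidates = [
    Source("Court ruling on river pollution", "court_repository", 2015, True),
    Source("Anonymous blog post", "blog", 1998, False),
]
for candidate in candidates:
    problems = inclusion_problems(candidate)
    verdict = "include" if not problems else "exclude: " + "; ".join(problems)
    print(f"{candidate.title} -> {verdict}")
```

Returning the list of reasons, rather than a simple yes or no, makes it easier to document why a source was left out.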

3. Designing appropriate variables and variable-categories

Variables

Define a tentative set of variables, but know that the definitive set will only become clear after several attempts. At first, create a list of all the variables that are necessary to answer your research question. Bear in mind that there is information that is needed and information that would be “good to have.” Resist the temptation to create variables that hold unneeded information just because “the information is there” in the sources you are consulting and would make your work “richer” anyway. In other words, start with only the variables you need, and add more at the end if you have time left. It may sound counterintuitive not to collect all the information possible while looking at a source and to choose, instead, to return to this source on a different occasion (which would imply re-reading part of the content and re-locating the bit of information to add). However, collecting and processing data is a lengthy process of many iterations, and it is very easy to underestimate the amount of time it will consume. If towards the end you find yourself in a time crunch to finish your dataset, you will likely want to focus the remainder of your time on collecting only the data that are crucial for your research and disregard those that were just “good to have.” This means that all the time you spent collecting not-really-needed information will have been wasted, and you will wish you had spent it on what really mattered. For the Curiae Virides dataset, the team concluded that even having to return a second time to hundreds of sources to collect information for additional variables would still be less costly than throwing away days, or even weeks, of work because the time left to complete the dataset does not allow you to maintain the same level of detail.

Just as when writing a chapter of a novel, the set of variables you define before you start is just a first draft. It will need revisions and adjustments that will become clear as you start working with the data. If, in order to fill all the empty cells in your spreadsheet, you find yourself having to search for new sources several times during your data notation work, re-evaluate your variables and consider combining some or even eliminating unnecessary ones. If you find yourself debating which column certain information belongs in, this may be an indication that some of your variables are redundant and may need restructuring. And of course, the inverse applies if you cannot find an appropriate category for your data.

Remember that every time you add a new variable or restructure your variable values after you have started the data collection work, you need to go back to the sources you already consulted and see if there is any information that should have been noted under this variable. This is tedious and time-consuming, but it is what improves the quality of your work.

Variable categories

Once you define your variables, it is time to decide whether you need to categorize the quantitative ones. Variables such as income, age, temperature, etc. may work better in quantitative analyses when grouped into distinct levels. At first sight, this may sound like an easy task, but it may require some experimenting until you find the right “fit.” Just like a good outfit, which should be neither too tight nor too loose, your variable categories should be neither so broad that they bundle widely different cases under the same category, nor so specific that you get only a handful of observations in each category.
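As a rough illustration of what “too broad” and “too specific” can look like in practice, the Python sketch below groups the same ages under two alternative schemes. The cut points, labels and the `categorize` helper are hypothetical; the right ones depend entirely on your research question.

```python
from bisect import bisect_right

# Hypothetical cut points (in years) and labels for two alternative schemes.
BROAD = ([40], ["under 40", "40 and over"])
DETAILED = ([18, 30, 45, 65], ["under 18", "18-29", "30-44", "45-64", "65 and over"])

def categorize(value, scheme):
    """Return the label of the interval that contains value."""
    cut_points, labels = scheme
    # bisect_right counts the cut points that are <= value, which is the index
    # of the matching label; a value equal to a cut point goes to the upper group.
    return labels[bisect_right(cut_points, value)]

ages = [16, 18, 29, 44, 45, 70]
for age in ages:
    print(age, "|", categorize(age, BROAD), "|", categorize(age, DETAILED))
```

Keeping the raw value in the dataset and deriving the category from it, as this sketch does, means the grouping can be changed later without going back to the sources.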

Think, for example, of a variable with weather temperatures. If you want to see how small changes in the average temperature affect a crop yield, you may want to keep the actual temperature values and not group them. Alternatively, if you want to see how a classroom temperature affects the students’ learning, you may divide the range of possible temperatures into “cold, medium and hot;” or “very cold, cold, medium, hot, very hot.” How to decide whether to divide them into three or five categories? We have two suggestions:

First, it is very rare to get your variables right on the first attempt. At times, the best number of categories may only become clear as you get increasingly familiar with your data and all its possible values. Therefore, begin your work and expect to go back for adjustments until you reach the “ideal” variable categorization. It may also be a good idea to run some tabulations at the beginning of the data collection process to see if there are too many empty or barely used categories, or if the observations are too similar across the categories (remember that a “variable” needs to “vary,” or else it is a constant!).
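The early tabulation suggested above can be sketched in a few lines of Python. This is only an illustration: the sample values, the 5% and 95% thresholds and the helper names `tabulate` and `is_nearly_constant` are assumptions chosen for the example, not established cut-offs.

```python
from collections import Counter

def tabulate(values, expected_categories, min_share=0.05):
    """Count observations per category and flag empty or barely used ones."""
    counts = Counter(values)
    total = len(values)
    report = {}
    for category in expected_categories:
        n = counts.get(category, 0)
        share = n / total if total else 0.0
        if n == 0:
            flag = "empty"
        elif share < min_share:
            flag = "barely used"
        else:
            flag = "ok"
        report[category] = (n, round(share, 2), flag)
    return report

def is_nearly_constant(values, max_share=0.95):
    """True if a single category holds almost all observations."""
    if not values:
        return False
    top_count = Counter(values).most_common(1)[0][1]
    return top_count / len(values) >= max_share

# Hypothetical first batch of 50 coded observations.
observed = ["hot"] * 12 + ["medium"] * 30 + ["cold"] * 7 + ["very hot"] * 1
categories = ["very cold", "cold", "medium", "hot", "very hot"]

for category, (n, share, flag) in tabulate(observed, categories).items():
    print(f"{category:10s} n={n:3d} share={share:.2f} {flag}")
print("Nearly constant:", is_nearly_constant(observed))
```

In this made-up batch, “very cold” is never used and “very hot” appears only once, which is the kind of signal that might prompt a second look at the categorization.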

Second, it is easier to decrease the number of categories than it is to increase them. Suppose you start with the five temperature categories above and then decide that three would be better for the purposes of your research. All you would have to do is a quick substitution—for example, instructing your program to replace all “very hot” with just “hot” (and similarly with “cold” and “very cold”) and in just a few seconds you arrive at the desired three categories. Now suppose you had three temperature levels and wanted to expand to five. You would have to go back to your sources, re-check every temperature individually, see if they should remain as “hot” or be relabeled as “very hot,” and change the notations accordingly—which would consume much more of your time. Therefore, when in doubt, always start with more categories and combine them later if needed.
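The asymmetry described above can be seen in a short Python sketch. The mapping and the `collapse` helper are written for this example only; the point is that merging categories is a simple lookup, whereas splitting them cannot be derived from the coded values alone.

```python
# Mapping from the five temperature levels to the three broader ones.
FIVE_TO_THREE = {
    "very cold": "cold",
    "cold": "cold",
    "medium": "medium",
    "hot": "hot",
    "very hot": "hot",
}

def collapse(values, mapping):
    """Replace each detailed category with its broader one."""
    unknown = sorted(set(values) - set(mapping))
    if unknown:
        # Fail loudly rather than silently dropping unexpected notations.
        raise ValueError(f"values without a mapping: {unknown}")
    return [mapping[value] for value in values]

five_level = ["very cold", "cold", "medium", "hot", "very hot", "hot"]
three_level = collapse(five_level, FIVE_TO_THREE)
print(three_level)

# The reverse direction is ambiguous: several detailed labels share one broad label,
# so expanding from three to five levels requires going back to the sources.
origins = {}
for detailed, broad in FIVE_TO_THREE.items():
    origins.setdefault(broad, []).append(detailed)
ambiguous = {broad: detailed for broad, detailed in origins.items() if len(detailed) > 1}
print(ambiguous)
```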

4. Multiple Iterations

When creating a new dataset from scratch, no matter how careful you are, it is very unlikely that you will get your variables and variable categories right on the first (or second, or third) try. Our most important recommendation is not only to not be afraid of redoing all or parts of your work, but to actually expect to redo it several times. Many factors can make it necessary to go back to your sources and re-check the work done: a redefinition of the research question that changes the variables needed; the discovery of a new source with data on a new, potentially explanatory variable; or even small adjustments for “fit,” as described above. Every time you combine or expand the scope of variables (or categories thereof), you must go back to your sources and see if any previously notated observation needs reclassification. For example, when the Curiae Virides team was classifying the possible types of environmental hazards, they started with a variable subcategory called “Oil-Drilling” to represent hazards related to oil extraction. Later, they realized that the hazards caused by conventional oil extraction (through drilling) are quite different from the hazards caused by fracking, so they felt the need to add a new category (“Fracking”) specific to this type of extraction. This meant that they had to go back to the sources of all the cases that had “Oil-Drilling” as the type of hazard, see which ones were related to fracking and change the notation in the dataset.
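A small amount of bookkeeping can make this kind of rework easier to track. The Python sketch below flags every case coded under the category being split, so that reviewers know exactly which sources to revisit. It is a hypothetical illustration: the record layout, the “Mining” category and the helper functions are assumptions and do not describe the Curiae Virides team’s actual tooling.

```python
# Hypothetical coded cases; only the "Oil-Drilling" and "Fracking" labels come from the text.
cases = [
    {"case_id": 1, "hazard": "Oil-Drilling", "needs_review": False},
    {"case_id": 2, "hazard": "Mining", "needs_review": False},
    {"case_id": 3, "hazard": "Oil-Drilling", "needs_review": False},
]

def flag_for_review(rows, variable, affected_category, reason):
    """Mark every row coded with the affected category and return their ids."""
    flagged = []
    for row in rows:
        if row[variable] == affected_category:
            row["needs_review"] = True
            row["review_reason"] = reason
            flagged.append(row["case_id"])
    return flagged

def resolve(rows, case_id, variable, value):
    """Record the value decided after re-reading the source and clear the flag."""
    for row in rows:
        if row["case_id"] == case_id:
            row[variable] = value
            row["needs_review"] = False
            return row
    raise KeyError(f"no case with id {case_id}")

flagged = flag_for_review(cases, "hazard", "Oil-Drilling", "category split: check for fracking")
print("To re-check against the sources:", flagged)

# After re-reading the source for case 3, a researcher decides it concerns fracking.
resolve(cases, 3, "hazard", "Fracking")
remaining = [row["case_id"] for row in cases if row["needs_review"]]
print("Still to review:", remaining)
```

The same `resolve` call can be used to confirm an unchanged value, which keeps a visible trail of which cases have actually been re-checked.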

This type of revision (adjusting variable categories, re-checking all previously noted observations and replacing values accordingly) is very common and is likely to happen multiple times throughout your dataset creation process. Depending on the complexity of your sources and the number of observations logged before the change, the rework may take days or even weeks. Some of the common thoughts we had when this happened to us were: that we had failed when originally choosing the variable categories, that this was a costly mistake, and that we were wasting a terrible amount of time without making any progress. However, we learned during the workshop that this type of revision is an integral part of building a dataset. It may feel like time wasted with no progress, but it is in fact through these revisions that the quality of your dataset improves. Therefore, when estimating the time needed to create your dataset, no matter how long you think revisions will take, allocate at least double that time. And every time a new revision is needed, program your brain to feel happy, because it is through revisions that your dataset becomes more accurate and refined. Just don’t forget to record every change and its rationale, as we explain below.

5. Record the reasoning behind every variable, category, and changes thereof

Fitting complex or subjective phenomena and events into narrowly defined or one-sided variables and categories can become very subjective. Mariana considers this a complex and unavoidable challenge, especially when multiple researchers are working on the same dataset. As mentioned above, different people may see and interpret events differently. A clear example of this happened with the Curiae Virides dataset. When mapping the types of environmental harms that engendered conflicts, at one point the team had, among others, two distinct categories: “Toxic waste discharge” and “freshwater pollution.” It just so happened that, in some cases, factories would discharge toxic waste into a nearby river, thus polluting its waters. During a meeting, they realized that some team members were classifying such events as “toxic waste discharge” while others classified them as “freshwater pollution.” They then had to create a classification rule specific to those types of cases so that all researchers would classify such events under the same category. This new rule was filed in a log book which contained all the other information (such as rules and definitions) about every one of their variables and categories. The Curiae Virides team tries to keep this log book as up to date as possible so that everyone working on the dataset has access to the most recent information at all times. The EJAtlas team also developed a system to ensure a harmonized approach. They convened a small group of editors who consult each other when questions arise, and they systematize their main decisions and lessons to be shared with new collaborators.

It may seem almost trivial to argue that the more people you have working on your database, the more likely it is that minor errors and discrepancies will occur throughout the process. This fact alone justifies carefully defining your variables and keeping track of any changes in their design and scope from the start. However, even if you are working alone, keeping clear and detailed records of your variables’ definitions and scope, and of all the important classification decisions you made during your notation process, can be helpful in the event that you must take an unexpectedly long break or decide to update the dataset after a few years.
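A log book does not need special software; a structured file that everyone can read is often enough. The Python sketch below shows one possible layout, in which each variable carries its definition, its categories, its classification rules and a dated change log. Everything in it is an assumption made for illustration: the variable name, the wording of the definition, the example rule (the text above does not say which category the team chose) and the date.

```python
import json
from datetime import date

# Hypothetical log book: one entry per variable.
codebook = {
    "environmental_harm": {
        "definition": "Type of environmental harm that engendered the conflict.",
        "categories": ["Toxic waste discharge", "Freshwater pollution"],
        "rules": [],
        "changelog": [],
    }
}

def record_decision(book, variable, rule, rationale, decided_on=None):
    """Add a classification rule and log when and why it was adopted."""
    entry = {
        "date": (decided_on or date.today()).isoformat(),
        "rule": rule,
        "rationale": rationale,
    }
    book[variable]["rules"].append(rule)
    book[variable]["changelog"].append(entry)
    return entry

# Illustrative rule only; the actual choice made by a team may differ.
record_decision(
    codebook,
    "environmental_harm",
    rule="Toxic waste discharged into a river is coded as 'Toxic waste discharge'.",
    rationale="Team members were coding the same kind of event under two categories.",
    decided_on=date(2024, 1, 15),  # example date
)

# Saving the log book as JSON keeps it readable for every collaborator.
print(json.dumps(codebook, indent=2))
```

Keeping the rationale next to each rule is what makes the log book useful after a long break: it records not only what was decided but why.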

6. Concluding remarks

In conclusion, creating your own database can be a great tool to study complex phenomena in fields that do not immediately or traditionally lend themselves to quantitative analysis (such as Law and Social Sciences). Although it is time- and resource-consuming and, at times, a rather tedious task that requires extreme attention to detail, the use of quantitative empirical methods (almost always involving the creation of a dataset) in social and legal research is growing in popularity. One of the reasons for this growth is that quantitative analyses unveil relationships, trends and characterizations that could not be found using only qualitative methods.

The task of creating a dataset from scratch may seem daunting at first, but if you choose reliable sources, test your variables and variable categories until you find the right “fit,” and maintain good records of all your decisions, you will likely create a dataset that will yield novel research findings. Remember also that, for a good quantitative analysis, accuracy is paramount. Therefore, be attentive, be thorough and be precise. Keep in mind that it is your duty, as a scientist, to ensure that your work allows the thousands of stories behind your data to be retold as truthfully and vividly as possible.
