L. Ciezkowsky [1]; L. Lizarazo Rodriguez [1]; J. Miller [2,3]; M. Walter [4]; N. Nagabhatla [2]
1. Curiae Virides Research Project hosted at the Brussels School of Governance (Vrije Universiteit Brussel)
2. United Nations University Institute on Comparative Regional Integration Studies (UNU-CRIS)
3. Universiteit Gent
4. EJAtlas, Institut Barcelona Estudis Internacionals (IBEI, Barcelona)
1. Introduction
On 16 June 2023, the ERC Curiae Virides team held an expert workshop on building databases at the Brussels School of Governance of the Vrije Universiteit Brussel (VUB). We invited scholars whose current or recent research involved creating a brand-new dataset for quantitative analysis. Our purpose was to exchange experiences, specifically the challenges they encountered at different steps of the process and the solutions they applied to overcome them. The presentations were given by:
- Liliana Lizarazo-Rodriguez (Vrije Universiteit Brussel, Brussels School of Governance), on creating a dataset for ERC Project Curiae Virides on the fuzzy path of transnational ecological conflicts towards litigation;
- Justine Miller (United Nations University, Institute on Comparative Regional Integration Studies, and Ghent University), on creating a dataset on Regional Integration Agreements;
- Daniela del Bene (ICTA-UAB, Barcelona) and Mariana Walter (IBEI, Barcelona), on creating a dataset on socio-ecological conflicts (EJAtlas);
- Nidhi Nagabhatla (United Nations University, Institute on Comparative Regional Integration Studies) and Simon Happersberger (Vrije Universiteit Brussel, Brussels School of Governance, and United Nations University, Institute on Comparative Regional Integration Studies), on the relevance of context in creating a dataset on the Environmental Impact Assessments of Trade Agreements.
This blog post is organized as follows: Section 2 discusses the importance of finding and using only reliable sources; Section 3 deals with the challenge of designing appropriate variables and variable categories; Section 4 covers the need for multiple iterations; Section 5 explains the importance of keeping a log book; and Section 6 contains our concluding remarks.

The task turned out to be herculean. My colleagues had already been working on this dataset for almost two years and, still, we worked on it as a team of five researchers for another entire year. Altogether, we screened reports on over 10,000 conflicts and retained 2,500+ cases that matched our selection criteria. Although the volume of information was monumental, it was definitely not the most challenging aspect of the job. The real challenge lay in creating variables and variable values that would preserve the stories behind each conflict and, most importantly, that would allow these stories to be retold after they were converted into data points on a spreadsheet. I had always handled datasets with extreme care, but this time my sense of duty was heightened, like a driver who normally drives carefully because they do not want to die in an accident, but who takes even greater care when they know there are children in the backseat. I felt I owed it to all those involved to tell their story right.
But telling a story through data is never easy. It is a bit like writing a non-fiction novel. First, you need to find reliable, unbiased sources: you need to know the facts exactly as they happened, as you do not want to tell a one-sided story. Then you need to sort out your details: facts are important, but adding information on the characters, their backgrounds and personalities will help your readers understand why certain things happened the way they did. Finally, your story needs to fulfill a purpose: to entertain, to inform, to foster personal growth, etc. In the context of dataset-building, the steps are quite similar: you start with reliable sources, then you record the values corresponding to the facts or events you are mapping. Information on the conditions under which these facts happened (such as time, frequency, and other relevant factors) will allow for the identification of trends, which add “character” to your work. Throughout, bear in mind that, in addition to supporting your own research, your dataset will likely also support the future research of other scholars. Below, my colleagues and I share a few insights derived from our personal experiences which we believe could be of help to fellow researchers.
2. Finding/Choosing appropriate and reliable sources
Justine Miller and her co-author at UNU-CRIS chose the “database on databases” approach. It made sense in their case because all the information they needed on trade agreements and regional organizations had likely already been collected by other organizations. Besides, their core goal was to gather the disparate data and reconcile the differences in a single document, rather than to start a new dataset from scratch.
In a more original approach, the EJAtlas team moved away from traditional models of research sourcing and developed EJAtlas, a database of environmental conflicts worldwide, relying mostly on a process of co-production of knowledge. Their conceptual framework, Environmental Justice, draws on the concept of activist knowledge that guided the FP7 project Environmental Justice Organisations, Liabilities and Trade (EJOLT). In ten years, EJAtlas documented over 3,900 cases of socio-ecological conflicts in a collaborative process between engaged researchers (scholars, students and on-the-ground journalists) and grassroots activists. Following the co-production methodology, a first draft of each case is prepared either by the researchers or by the local groups, and then sent to the other group for revision and comments. Each case also includes information collected from diverse data sources, such as official reports, NGO reports, EIAs, court decisions, academic papers, as well as ‘grey literature’ and media sources, all of which are referenced. The EJAtlas team of editors, mostly based at the Institut de Ciència i Tecnologia Ambientals of the Universitat Autònoma de Barcelona (ICTA-UAB), then checks the quality of the case and its sources and follows up on feedback and comments to ensure clarity, completeness and reliability of sources.
The advantages of this method are many: first-hand information, community engagement, and a substantial increase in the scope and amount of data collected. The disadvantages include the time required to review the cases and their sources and to ensure the quality of each record. Another potential risk of co-produced datasets is that researchers (even those on the same team) can differ in how they view, understand and translate complex social phenomena into data points.
This risk can be hedged by having well-defined variables and variable values (as we will see in the next section) and by developing strong, detailed notation rules to be adopted by all data-collectors and moderators.
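One way to make shared notation rules enforceable is to keep them in a machine-readable codebook and check every coded row against it. The following is only a minimal sketch under stated assumptions: the variable names (`hazard_type`, `litigation`), the value "Mining" and the `validate_row` helper are hypothetical illustrations, not the codebook of any of the projects described here.

```python
# Hypothetical codebook: variable name -> set of allowed values.
# Names and values are illustrative assumptions, not a real project codebook.
CODEBOOK = {
    "hazard_type": {"Oil-Drilling", "Fracking", "Mining"},
    "litigation": {"yes", "no", "unknown"},
}


def validate_row(row, codebook=CODEBOOK):
    """Return a list of human-readable problems found in one coded row."""
    problems = []
    for variable, allowed in codebook.items():
        if variable not in row:
            problems.append(f"missing variable: {variable}")
        elif row[variable] not in allowed:
            problems.append(f"{variable}: unexpected value {row[variable]!r}")
    return problems


# Two rows coded by different (hypothetical) data collectors.
rows = [
    {"case_id": 1, "hazard_type": "Fracking", "litigation": "yes"},
    {"case_id": 2, "hazard_type": "fracking", "litigation": "maybe"},
]

# Map each case to the problems found, so moderators know what to review.
report = {row["case_id"]: validate_row(row) for row in rows}
```

A check like this only catches values that fall outside the agreed list; it cannot tell whether two collectors interpreted the same conflict differently, which still requires human moderation.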
Regardless of the type of sources you decide to use for your research, we recommend having clearly defined inclusion criteria, including at least one form of reliability check (e.g., only news published by Reuters, only documents published on official government websites, etc.). Be sure that your reliability criteria are applicable in all researched countries. There may be cases, for example, where official information is not necessarily reliable. For this reason, checking factual evidence and comparing competing versions may also be a crucial step in guaranteeing data reliability. Finally, we suggest writing down all source criteria before you begin the data collection work, to avoid the temptation of adding more sources as you go and never finishing the job.

3. Designing appropriate variables and variable-categories
Variables
Variable categories
Second, it is easier to decrease the number of categories than it is to increase them. Suppose you start with the five temperature categories above and then decide that three would be better for the purposes of your research. All you would have to do is a quick substitution—for example, instructing your program to replace all “very hot” with just “hot” (and similarly with “cold” and “very cold”) and in just a few seconds you arrive at the desired three categories. Now suppose you had three temperature levels and wanted to expand to five. You would have to go back to your sources, re-check every temperature individually, see if they should remain as “hot” or be relabeled as “very hot,” and change the notations accordingly—which would consume much more of your time. Therefore, when in doubt, always start with more categories and combine them later if needed.
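The "quick substitution" described above could look like the sketch below. It assumes the labels are stored as plain strings; the middle label "mild" is an assumption made for illustration, since only "very hot", "hot", "cold" and "very cold" are named in the text, and the `collapse` helper is hypothetical.

```python
# Mapping from the finer labels to the coarser ones.
# Labels not listed here are kept unchanged.
COLLAPSE = {"very hot": "hot", "very cold": "cold"}


def collapse(labels, mapping=COLLAPSE):
    """Merge fine-grained categories into coarser ones without touching the sources."""
    return [mapping.get(label, label) for label in labels]


# "mild" is an assumed middle category used only for this example.
five_level = ["very hot", "hot", "mild", "cold", "very cold", "hot"]
three_level = collapse(five_level)
```

The reverse operation has no such shortcut: once "very hot" and "hot" are both stored as "hot", nothing in the data says which was which, so expanding again means returning to the sources.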

4. Multiple Iterations
When creating a new dataset from scratch, no matter how careful you are, it is very unlikely that you will get your variables and variable categories right on the first (or second, or third) try. Our most important recommendation is not only to not be afraid of redoing all or parts of your work but to actually expect to redo it several times. Many factors can make it necessary to go back to your sources and re-check the work done: a redefinition of the research question that changes the variables needed; the discovery of a new source with data on a new, potentially explanatory variable; or even small adjustments for “fit,” as described above. Every time you combine or expand the scope of variables (or categories thereof), you must go back to your sources and see if any previously coded observation needs reclassification. For example, in the case of the Curiae Virides dataset, when the team was classifying the possible types of environmental hazards, they started with a variable subcategory to represent hazards related to oil extraction that they called “Oil-Drilling.” Later, they realized that the hazards caused by conventional oil extraction (through drilling) are quite different from those caused by fracking, so they added a new category (“Fracking”) specific to this type of extraction. This meant that they had to go back to the sources of all the cases that had “Oil-Drilling” as the type of hazard, see which ones were related to fracking and change the notation in the dataset.
This type of revision (adjust variable categories, re-check all previously coded observations and replace values accordingly) is very common and is likely to happen multiple times throughout your dataset creation process. Depending on the complexity of your sources and the number of observations logged before the change, the rework may take days or even weeks. Some of the common thoughts we had when this happened to us were that we had failed when originally choosing the variable categories, that this was a costly mistake, and that we were wasting a terrible amount of time without making any progress. However, we learned during the workshop that this type of revision is an integral part of building a dataset. It may feel like time wasted with no progress, but it is in fact through these revisions that the quality of your dataset improves. Therefore, when estimating the time to create your dataset, no matter how long you think the revisions will take, allocate at least double that time. And every time a new revision is needed, program your brain to feel happy, because it is through these revisions that your dataset becomes more accurate and refined. Just don’t forget to record every change and its rationale, as we explain below.
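The Oil-Drilling/Fracking revision described in this section can be organized as two steps: list the cases that need a manual re-check, then apply the decisions reached after re-reading the sources. The sketch below assumes each case is a dictionary with a `case_id`; the helper names, the sample cases and the "Mining" value are hypothetical, and the decision for each case is always made by a person, not by the code.

```python
def cases_to_recheck(dataset, variable, old_value):
    """List the case IDs whose sources must be re-read after a category is split."""
    return [row["case_id"] for row in dataset if row[variable] == old_value]


def apply_review(dataset, variable, decisions):
    """Return a revised copy of the dataset.

    `decisions` maps case_id -> new value chosen by a human reviewer;
    cases not listed keep their current value.
    """
    revised = []
    for row in dataset:
        new_row = dict(row)  # copy, so the original coding is preserved
        if row["case_id"] in decisions:
            new_row[variable] = decisions[row["case_id"]]
        revised.append(new_row)
    return revised


# Hypothetical cases coded before the "Fracking" category existed.
dataset = [
    {"case_id": "A", "hazard_type": "Oil-Drilling"},
    {"case_id": "B", "hazard_type": "Mining"},
    {"case_id": "C", "hazard_type": "Oil-Drilling"},
]

queue = cases_to_recheck(dataset, "hazard_type", "Oil-Drilling")

# After re-reading the sources, suppose the reviewer finds case C involves fracking.
decisions = {"C": "Fracking"}
revised = apply_review(dataset, "hazard_type", decisions)
```

Keeping the original list untouched and writing the result to a new one makes it possible to compare the dataset before and after the revision.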

5. Record the reasoning behind every variable, category, and changes thereof
It may seem almost trivial to argue that the more people you have working on your database, the more likely it is that minor errors and discrepancies will occur throughout the process. This fact alone justifies carefully defining your variables' design and scope, and keeping track of any changes to them, from the start. However, even if you are working alone, keeping clear and detailed records of your variables’ definitions and scope, and of all the important classification decisions you made during your notation process, can be helpful if you must take an unexpectedly long break or if you decide to update the dataset after a few years.
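A log book does not need special software; a structured file that travels with the dataset is enough. The sketch below shows one possible layout, assuming a CSV log with five columns. The `ChangeRecord` fields, the date, the author name and the entry text are hypothetical placeholders, not a record of any real decision.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class ChangeRecord:
    """One entry in the log book: what changed, when, who did it, and why."""
    changed_on: str
    variable: str
    change: str
    rationale: str
    author: str


def write_log(records, handle):
    """Write the log entries as CSV to any file-like object."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(ChangeRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Placeholder entry; the date and author are invented for illustration.
log = [
    ChangeRecord(
        changed_on="2023-01-01",
        variable="hazard_type",
        change="Added category 'Fracking'; re-checked all 'Oil-Drilling' cases",
        rationale="Fracking hazards differ from those of conventional drilling",
        author="researcher-1",
    )
]

# An in-memory buffer keeps the example self-contained;
# in practice this would be a file stored next to the dataset.
buffer = io.StringIO()
write_log(log, buffer)
log_text = buffer.getvalue()
```

Recording the rationale alongside the change is what lets a colleague, or you after a long break, understand why a category exists and not just that it does.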

6. Concluding remarks
In conclusion, creating your own database can be a great way to study complex phenomena in fields that do not immediately or traditionally lend themselves to quantitative analysis (such as Law and Social Sciences). Although it is time- and resource-consuming and, at times, a rather tedious task that requires extreme attention to detail, the use of quantitative empirical methods (almost always involving the creation of a dataset) in social and legal research is growing in popularity. One of the reasons for this growth is that quantitative analyses unveil relationships, trends and characterizations that could not be found using only qualitative methods.
