1. Introduction

Long-term efforts to understand social-ecological systems (SES) involving the management of common pool resources (CPR1) have led to the generation of a large body of data composed primarily of case studies (Wade 1984; Berkes 1989; Ostrom 1990; McKean 1992; Baland and Platteau 1999; Bardhan and Mookherjee 2006; Cox 2014; Epstein et al. 2014). If we are to understand CPR governance, we must be able to make comparisons across case studies, yet developing reliable methods for making such often complex and messy data comparable remains a challenge. Meta-analysis, in the field of environmental social science, is a mixed-methods approach involving data extraction from case studies, through the coding of texts, for use in statistical or other comparative data analysis techniques (Hruschka et al. 2004; Rudel 2008; Cox 2015). As the essential activity of meta-analysis, coding involves the classification and quantification of texts or other media segments, preserved in a form that can be subjected to formal analysis (Hruschka et al. 2004). In this paper, we contribute to understanding the challenges of coding case studies in environmental social science by critically exploring the experience of a team of researchers at the Center for Behavior, Institutions and the Environment (CBIE) at Arizona State University (ASU) while coding the 69 cases that form the data for Baggio et al. (2016) and Barnett et al. (2016).

In the next section (Section 2), we will briefly discuss the overall opportunities and challenges inherent in the coding of case studies for large-N meta-analyses and why this is a particularly important methodology in the field of environmental social science. We will discuss three primary challenges which we find can hamper meta-analysis efforts: 1) methodological transparency; 2) coding reliability; and 3) replicability of findings. In Section 3, we discuss our coding methodology in some detail and compare it to recommendations in the methods literature, including: preliminary decisions, codebook development, coding protocols, and intercoder reliability testing. We explore ways of increasing methodological rigor in these areas by adopting certain techniques and strategies from other disciplines in the social sciences and compare the approaches used by the CBIE team to approaches, or “best practices”, recommended by a number of leading authorities within the methods literature. In Section 4, we utilize our findings from this comparison to develop a recommended coding protocol which we think could be widely applicable and easily adaptable to others using a comparative or meta-analysis methodology for research on SESs and the commons. We conclude the paper by sharing some ideas for future research in Section 5. We hope that by sharing these key methodological challenges and opportunities, we will stimulate a broader platform for communication and collaboration among scholars which will lead to better, more transparent research designs, opportunities in meta-analysis and data synthesis, and discoveries that will enhance our understanding of SESs.

2. The challenge

Meta-analysis, comparative analysis, and synthesis rely on the use of a rich resource of case studies which have been collected by numerous researchers over a long period of time. Secondary analysis of data of this kind, gathered for other purposes using diverse measures and variables, is inherently subjective and it is therefore important to take measures to increase coding reliability and replicability. This can present challenges in research design and implementation. Secondary analysis of existing case studies, however, has the advantage of being a relatively low-cost approach, compared to primary data collection, and can enable larger scale comparative analyses (Kelder 2005; Savage 2005). Meta-analysis offers the opportunity to refine findings within a wider community, discover what the dominant discourses are and generate new knowledge through the validation of previous findings. In addition, the use of synthesized datasets allows for the use of existing data in new ways and analyses across multiple time periods, scales and sectors, thereby potentially improving researchers’ ability to understand complex system dynamics and adaptation (Ostrom 1990, 2012; Kelder 2005; Poteete et al. 2010; Cox 2014). Araral (2014) and Agrawal (2014) characterize this type of work in the study of the commons as the “emerging third generation” of research within the legacy of Elinor Ostrom, and see these efforts to generalize and extend her arguments across scales and with increased complexity as being of “fundamental importance” (Agrawal 2014, 87).

Relying on secondary data, however, is often difficult (Poteete et al. 2010) as existing data are often limited in their scope and scale, and are separated into independent databases using unique coding schema2 and storage structures which are not always made publicly available. These limitations and divisions hamper synthesis efforts and comparability. For example, there are a number of data repositories (Table 1 in Supplementary Material) based upon the work of Elinor Ostrom and her collaborators on CPR theory.1 These libraries of data represent a rich and mostly unexploited resource for increasing our understanding of CPRs via meta-analysis and comparison with contemporary data (Corti et al. 2005). These databases, however, each possess their own idiosyncrasies, sometimes leading to diverse interpretations of theory, coding schemes, organization, variables, and definitions. Researchers often do not disclose sufficient methodological information to replicate, verify or compare findings, including access to the codebooks, information on case or variable selection, theoretical assumptions, or intercoder reliability testing approaches. Problems associated with ambiguous or missing information based on unreported assumptions hamper the replicability of study findings and undermine the reliability and validity of such research.

Research is always a work in progress and case studies and comparative analysis done in isolation may be disputed or later found to be wrong. In addition, there may be issues of confirmatory bias or non-representative sampling involved in the selection of cases for secondary analyses, even when they contain sufficient levels of information. Thus, intercoder reliability testing and reporting is critically important, as is the disclosure of coding variables and codebooks. In order to advance the intra- and inter-institutional analysis of data, more rigorous practices should be established, such as common standards and protocols and the explicit reporting of assumptions. Even without consensus on standards or protocols, however, selection criteria should be made transparent by research teams in order to facilitate the emergence of common practices and increased methodological rigor in environmental social science in general.

Access to the resource of SES and commons data that currently exists can, itself, be viewed as a public good which is currently underprovided due to lack of transparency and coordination. Institutions which govern the proper and productive use of these resources could effectively reduce issues which private property dataset approaches now generate. The differences in databases and lack of transparency by researchers limit synthesis efforts and the ability to conduct broader, large-N case comparisons. Agrawal (2014) asserts that furthering this research will require methodological innovation, better theoretical sophistication and improved data. Furthermore, he states that new methods involving more qualitative analysis and experimentation are the current drivers pushing the field forward. However, the successful use of these new methods will depend upon substantial amounts of new data, better integration of data, a sophisticated hierarchical organization of datasets, and increased analytical rigor (Agrawal 2014). In order to increase coding replicability, reliability, and transparency, some scholars assert that explicit identification and alignment of the coding rules, organization and work-process knowledge (or coding schema) used in coding may be important in mitigating problems of missing data and interpretations of concepts (MacQueen et al. 1998; Stemler 2001; Medjedović and Witzel 2005). Because meta-analysis of this type is a relatively new methodological approach in social science research (Corti et al. 2005), some authors argue that there has not yet been enough published research looking at the issues it may raise (Corti and Thompson 2004). In this paper, we critically explore our experience in response to these challenges. We hope to offer some guidance and identify valid issues of consideration in the coding of secondary data for meta-analysis, thereby contributing to the dialogue in this area.

3. Coding methodology

In order to increase the replicability and the transparency of our coding process we have created a detailed Coding Manual (see Supplementary Material) and a Recommended Coding Protocol (see Section 4). A coding protocol is the common set of systematic procedures that a research team agrees to follow during the coding process (Rourke and Anderson 2004) and a coding manual typically contains the coding questions, answer codes, and information to aid in clarification and coder alignment which embody the research questions being explored in a study (MacQueen et al. 1998). Our coding manual was developed incrementally throughout the coding process and our recommended coding protocol outlines the way that we would conduct the project in retrospect, resulting from the analyses and comparison to the methods literature as detailed in the following sections. Figure 1 (below) illustrates how our process compares to practices recommended in the methods literature (MacQueen et al. 1998; Mayring 2000; Hruschka et al. 2004). We then discuss the comparison between the recommended “best practices” model synthesized from the methods literature (left side of Figure 1) and the process used by the CBIE team (right side of Figure 1), focusing on the challenges raised during the coding process and how the recommendations from the methods literature could potentially address them.

Figure 1 

Coding process comparison illustrating the process utilized by our team compared to the “best practices” model described above and discussed in further detail in the following sections. 1Preliminary decisions; 2Table 1, this document; 3Ostrom et al. 1989.

3.1. Formulate research agenda

The formulation of the research agenda for the original meta-analysis project at CBIE (Baggio et al. 2016) was related to three objectives. The primary objective of that study was to examine case studies to determine whether particular configurations of Ostrom’s (1990) design principles (DPs) were indicative of successful CPR governance. The second objective was to replicate and then expand upon a previous study conducted by Cox et al. (2010), which provided some empirical support for the claim that there is a higher chance for each of Ostrom’s (1990) individual DPs to be present in successful cases of CPR management across a range of contexts. The third objective was to link the expanded DPs (Table 1) found in Cox et al. (2010) with variables found within the existing database for the Common Pool Resources (CPR) Project (Ostrom et al. 1989). Since the DPs and the variables used in the CPR database are both founded on CPR theory, we thought it would be possible to link them, thereby facilitating the synthesis of two separate datasets that use similar concepts but different coding schema. Larger datasets of comparable cases improve meta-analyses and researchers’ ability to use mixed qualitative and quantitative methods, as well as improve analyses across multiple sectors, scales, and time periods. In doing so, our ability to understand complex system dynamics and adaptation in these system types is potentially enhanced (Poteete et al. 2010).

Design principle Description
1A The presence of the design principle 1A means that individuals or households who have rights to withdraw resource units from the common-pool resource must be clearly defined. Is this design principle present?
1B The presence of the design principle 1B means that the boundaries of the CPR must be well defined. Is this design principle present?
2A The presence of design principle 2A means that appropriation rules restricting time, place, technology, and/or quantity of resource units are related to local conditions. Is this design principle present?
2B The presence of design principle 2B means that the benefits obtained by users from a CPR, as determined by appropriation rules, are proportional to the amount of inputs required in the form of labor, material, or money, as determined by provision rules. Is this design principle present?
3 The presence of design principle 3 means that most individuals affected by the operational rules can participate in modifying the operational rules. Is this design principle present?
4A The presence of design principle 4A means that monitors are present and actively audit CPR conditions and appropriator behavior. Is this design principle present?
4B The presence of design principle 4B means that monitors are accountable to or are the appropriators. Is this design principle present?
5 The presence of design principle 5 means that appropriators who violate operational rules are likely to be assessed graduated sanctions (depending on the seriousness and context of the offense) by other appropriators, officials accountable to these appropriators, or both. Is this design principle present?
6 The presence of design principle 6 means that appropriators and their officials have rapid access to low-cost local arenas to resolve conflicts among appropriators or between appropriators and officials. Is this design principle present?
7 The presence of design principle 7 means that the rights of appropriators to devise their own institutions are not challenged by external governmental authorities. Is this design principle present?
8 The presence of design principle 8 means that appropriation, provision, monitoring, enforcement, conflict resolution, and governance activities are organized in multiple layers of nested enterprises. Is this design principle present?

Table 1

Expanded design principle questions (adapted from Cox et al. 2010) as the basis of the coding variables and questions.

3.2. Identify dataset

Decisions about case selection and subsequent text segmentation are extremely important steps in the identification of the dataset to be used for meta-analysis (Hinds et al. 1997; Stemler 2001; Weed 2005). Cases should typically be screened and analyzed for fit based on both their applicability to the research questions and data completeness (Hinds et al. 1997; Stemler 2001; Weed 2005). Longer texts, like the case studies used in this study, should be segmented into smaller units of text (e.g. a sentence or a paragraph) to increase intercoder agreement and reliability (Krippendorff 2013) and decrease coding discrepancies (Hruschka et al. 2004). A coding protocol generally includes guidelines as to how a text should be segmented for data analysis and coding (Hruschka et al. 2004; Bernard and Ryan 2010; Bernard 2011). Inclusion and exclusion criteria formally clarify the reasoning behind the selection of cases and segmentation of texts (Hruschka et al. 2004). Ostrom et al. (1989) found exclusion criteria to be extremely important and applied careful screening criteria to the cases included in the original CPR database.

Because the primary and secondary objectives of the CBIE team’s research agenda were to replicate and expand upon the findings of a previous study, the selection of cases was predetermined by the dataset used in the study by Cox et al. (2010). Consequently, this limited our ability to select cases for fitness and data completeness. We did, however, limit our selection of cases to a sub-set of the Cox et al. dataset by sector (irrigation, fishery, and forestry), based on our third objective of synthesis with the existing CPR dataset (Ostrom et al. 1989). This resulted in the coding of 69 out of the 77 cases presented in Cox et al. (2010). During the coding process, our team experienced some difficulties with the fitness of the dataset due to missing data. For example, there were some cases which we found had sufficient social outcome data but not enough biological data, or vice versa, making the overall determination of success or failure in these cases difficult. Without explicit information on the inclusion/exclusion criteria used by the Cox team, it was more difficult for us to replicate and validate findings of success or failure across cases. We also found that some cases had ample data on one or two specific DPs but lacked information on the presence or absence of others. The Cox study may have been less sensitive to missing data on DPs because they were analyzing individual DPs against success, rather than looking for combinations of DPs as in the CBIE approach (Baggio et al. 2016). While analyzing combinations of DPs may present increased issues with missing data, Baggio et al. show the potential advantages of this approach.

Cox et al. (2010) segmented text by dividing longer documents into individual cases representing a single geographical location and temporal period. The text segmentation for the CBIE study was pre-determined by the divisions made in the Cox study, and inter-related with case selection and the issues previously described. We found that the segmentation of texts contributed to the issues of missing data and fitness because a case might consist of a single paragraph within a larger document, or of a number of sentences or excerpts related to a specific location scattered throughout the document and treated as one segment. Since the criteria for the segmentation of texts into cases from larger regional studies were not explicitly reported in the Cox et al. publication, the CBIE team initially debated whether to include or exclude cases based on our own screening criteria, but ultimately decided to use the same cases that were also evaluated by the Cox team.

3.3. Form coding team

The use of two or more coders is important for assessing the replicability and reliability of coded data (MacQueen et al. 1998). The number of coders sufficient to establish reliability is not agreed upon in the literature, but in general, the more coder inference required and/or the rarer that codes appear in texts, the greater the number of coders that should be utilized (Bernard and Ryan 2010). We divided all 69 cases among the entire coding team, ensuring that there were generally three coders per case. This resulted in eighteen distinct coding team combinations. Since our coding project involved case studies that reported on SES conditions from a variety of perspectives requiring a certain amount of coder inference, utilizing three coders, rather than just two, was an appropriate and beneficial design feature.

3.4. Define coding schema (categories and organization)

Definition of the coding schema for a comparative or meta-analysis project involves the theoretical interpretation of categories and organization of the relational database (MacQueen et al. 1998; Mayring 2000; Hruschka et al. 2004; Weed 2005; Guest and MacQueen 2008). The theoretical interpretation of categories refers to a deductive approach to specifying themes, codes, or variables which will be searched for and coded within the texts and which are based on a defined body of theory (Weed 2005). The organization of the relational database simply refers to the way that the data will be organized in the database.

The primary coding categories used within our study were derived from the expanded design principles defined by the Cox et al. (2010) study (Table 1). Araral (2014) argues that there are two specification problems in the Cox et al. (2010) study that may also apply to our study. Araral’s (2014) first concern is the re-specification of Ostrom’s (1990) DP for clear boundary rules (DP1) into two distinct DPs for user boundaries (DP1A) and resource boundaries (DP1B) (Cox et al. 2010). Araral (2014) asserts that Ostrom (1990) intentionally did not separate the original design principle in this manner because, within the “context of collective action in the commons” (p. 18), boundaries refer to enforceable property rights, not spatial boundaries. He also points out, as has previously been noted in the literature, that spatially based definitions of community are problematic because the “overlapping, fuzzy and temporal nature of rights” can lead to difficulties in defining community across scales (Brewer 2012; Araral 2014; Barnett 2014). Others, however, have suggested that this is a faulty argument and that the distinction made by Cox et al. (2010) is a helpful tool in defining clear agent boundaries (Pitt et al. 2012). Ostrom (1990) stated that “Without defining the boundaries of the CPR and closing it to ‘outsiders’, local appropriators face the risk that any benefits they produce by their efforts will be reaped by others who have not contributed to those efforts” (p. 91). The definition of the CPR boundary can be seen as the definition of the spatial boundary (DP1B), while the exclusion of “outsiders” can be seen as the definition of the user boundary (DP1A).

Araral (2014) also points to the definition of a “successful CPR” as the second specification error of concern. Our team found that the definitions of success and failure are complex, and we ended up using a different approach than that reported by Cox et al. (2010). Cox et al. (2010) defined “success” in cases that “reported successful long-term environmental management” (Cox et al. 2010, 40), while we define success according to a number of dimensions defined by social and ecological outcome variables (Table 2) drawn from the CPR project coding schema (Ostrom et al. 1989), including: 1) resource sustainability (variables 1a–6b); 2) process of collective choice arrangements (variables 7a–9); 3) equity among users (variables 10–13); and 4) overall assessment of Success or Failure for the case (variable 14). Overall success (used in Baggio et al. 2016 and Barnett et al. 2016) was then coded as “success” when the resource was utilized sustainably and there was an absence of conflict among resource users. We also used CPR variables to augment each DP variable, making each DP a theoretical category. Fifty-seven variables, in total, were specified and divided into 15 categories: one for each of the four dimensions of outcome “success” and the 11 expanded design principle categories (Table 2).
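To make the decision rule concrete, the following is a minimal sketch in R with hypothetical column names; it expresses the rule stated above purely for illustration, since in practice overall success was itself a coded variable and coders drew on the full set of outcome variables and the case text.

```r
# Illustrative only: hypothetical outcome columns; "-1 MIC" (not enough
# information) codes are ignored here for brevity.
outcomes <- data.frame(case_id              = c("case_01", "case_02", "case_03"),
                       resource_sustainable = c(1, 1, 0),
                       user_conflict        = c(0, 1, 0))

# Overall success: resource used sustainably AND an absence of user conflict
outcomes$overall_success <- ifelse(outcomes$resource_sustainable == 1 &
                                     outcomes$user_conflict == 0,
                                   "success", "failure")
outcomes
```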

Table 2

Coding variables/questions and categories.

The specification of success may be a fundamental issue in our field (Araral 2014). Ostrom (1990) defined “success” within CPR governance as those “institutions that enable individuals to achieve productive outcomes in situations where temptations to free-ride and shirk are ever present” (p. 15). “Institutions” are the rules, norms, and shared strategies that people use to organize all forms of repetitive and structured interactions at all scales (Ostrom 2005). When Ostrom talks about “success,” she is referring to successful collective action. Cox et al. (2010) used this definition, stating that cases were coded as unsuccessful if there was a “clear failure in collective action and management” (p. 40). Both the Cox et al. definition and the outcome variables, which we used to construct our definition of success, capture this part of Ostrom’s (1990) definition. The major difference in Cox et al., however, comes from including the idea of “long-term environmental management” (Cox et al. 2010, 40), which is not included within the outcome variables used in our study. While the idea of long-enduring CPR institutions is well founded within the literature (Ostrom 1990, 2005; Anderies et al. 2004; Cox et al. 2010; Poteete et al. 2010), we found this to be a difficult concept to assess within the meta-analysis of secondary data. Most cases in the dataset only captured a limited snapshot in time and did not include adequate longitudinal data to indicate the longevity of success within the case. In addition, Cox et al. divided some texts into separate cases for a single location but different time periods, which further limited any temporal analysis of success.

Agrawal (2014) has argued that commons scholars have not clearly differentiated between different measures, dimensions, and outcomes but have instead relied upon relatively vague terms like “sustainability”, “success”, and “long-term viability”. This raises fundamental questions within our field about what constitutes appropriate longevity for an assessment of success in a case and/or across comparative cases. Ambiguities involved in the specification of variables and problems with the definition of success and longevity assessments in cases made it difficult to reproduce the results of the Cox et al. study and hindered our synthesis and meta-analysis efforts. Specification problems like these are often key drivers of the missing data problems that can plague both analysis efforts and intercoder agreement, and they require further dialogue within the field (Araral 2014).

3.5. Develop codebook and code sample set

According to the methods literature, sample coding should typically be performed on a random sub-set of the dataset and coding questions should be iteratively refined until intercoder reliability testing results are deemed satisfactory (Mayring 2000; Hruschka et al. 2004). Sample coding is the testing of the coding schema on a small random sample of the data to facilitate iterative refinement prior to the coding of the full dataset. The variables described in Table 2 (above) were initially documented in a set of preliminary coding questions and were pre-tested on a sample of three cases representing each sector (fisheries, forestry, irrigation) randomly selected from the existing CPR database. This allowed us to compare current coding3 results with those of the original coding conducted by Ostrom’s team (1989) and determine consistency in the interpretation of the CPR variables. Although the three sample cases from the CPR database were not a part of the dataset for the meta-analysis project, this allowed us to more accurately assess alignment with the CPR variables, thereby providing a measure of intercoder agreement. Coding results from the pre-test sample coding were subjected to formal intercoder reliability testing by one of the primary investigators of the project before coding of the entire dataset commenced. Any questions related to further interpretation of variables were discussed and clarified by the entire research team during periodic meetings as an informal means of increasing intercoder alignment. Issues clarified in project meetings were then incorporated into a preliminary coding guide which included the questions for each of the original 57 coding variables supplemented with explanations and answers derived from coder questions and team discussions.

3.6. Perform intercoder reliability testing and iteratively refine

The best practices model (Figure 1) recommends formal intercoder reliability testing on a subset of the dataset, as well as iterative intercoder agreement testing throughout and after the formal coding process (MacQueen et al. 1998; Mayring 2000; Hruschka et al. 2004; Guest and MacQueen 2008). We have found that this step is often missing from reports on studies of CPRs using meta-analyses (Netting 1976; Wade 1984; Berkes 1989; Ostrom 1990; McKean 1992; Baland and Platteau 1999; Bardhan and Mookherjee 2006; Cox 2014; Epstein et al. 2014; Fleischman et al. 2014; Villamayor-Tomas et al. 2014). Hruschka et al. (2004) explain that a reluctance to assess coder agreement is common in some branches of social science because: (1) researchers may generally believe that the quantification of qualitative data is unnecessary because qualitative research is a “distinct paradigm” that cannot or should not be subject to a quantitative evaluation; and (2) there is a general skepticism about the ability to actually measure subjective data and reproduce coding results. We believe the latter argument to be the most viable reason for the apparent lack or under-reporting of intercoder reliability testing in our field, but we have found that such testing can be helpful when iteratively included throughout the coding process.

Our team only tested intercoder agreement on the initial sample set of CPR cases and did not test for intercoder reliability again until the analysis and interpretation phase of the project. Our informal coding guide development process was aimed at establishing an informal feedback loop of intercoder alignment, refinement of theoretical interpretations and iterative adjustments to the coding questions based on ambiguities and questions that arose during the coding process. Assessment of coding conducted in other studies (Ostrom et al. 1993; Wollenberg et al. 2007; Cox 2014) suggests that this is a more common practice in our research community than the more formal methods. Hruschka et al. (2004) recognize this consensus-based approach toward “interpretive convergence” (p. 321) as a potentially useful method for increasing intercoder reliability, but state that more analysis may be needed to determine the validity of this approach.

3.7. Code dataset

Coding is the essential activity of the content analysis methodology and requires the identification of themes or categories that appear in text or other media segments (Hruschka et al. 2004). Coding can be done in a number of ways, ranging from highlighting pieces of text by hand to the use of sophisticated Qualitative Data Analysis (QDA) software packages. While QDA software is sometimes expensive and requires training, some studies have found that QDA software aids in increasing rigor and intercoder reliability during the coding process (Denzin and Lincoln 2000; Rambaree 2007) by allowing coders to identify and tag specific text segments and associate them with a particular category or memo. Texts coded by individual coders can later be combined and analyzed, thus allowing for easier identification of coding discrepancies (Bernard and Ryan 2010). In contrast, hand-coding and/or use of spreadsheet software is inexpensive and requires little to no additional coder training.

Individual coders on the CBIE team coded text segments which they felt exhibited explicit evidence supporting their answer to each of the 57 coding questions and documented the answer to the question and the supporting text segment(s) in spreadsheet format (Table 3). QDA software was not used in the CBIE project due to time and cost constraints. Each team of three coders then met to compare answers and decide upon a single group code, reducing the subjectivity of codes and generating more reliable coding (Hruschka et al. 2004). Where there was consensus on the answer to a coding question among the individual coders on any variable, the same answer was given as the group code for that variable. Selected text segments were then utilized as “evidence” of an appropriate code when resolving discrepancies between team members to arrive at an agreed-upon group code. Any coding disagreements were resolved through group discussion among the coding team members and during project meetings where study PIs addressed unresolved issues. Final coding results for all cases were later combined into a single master spreadsheet.
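To illustrate this workflow, the following is a minimal sketch in R (the language we used for the reliability analyses) with hypothetical case identifiers, variable names, and codes; it combines the entries of a three-coder team and flags the variables on which the individual codes differ, so that the group discussion can focus on resolving those discrepancies.

```r
# Hypothetical individual coding results in long format
# (columns: case_id, variable, code, coder)
coder_A <- data.frame(case_id = "case_01", variable = c("1a", "14"), code = c(1, 1), coder = "A")
coder_E <- data.frame(case_id = "case_01", variable = c("1a", "14"), code = c(1, 0), coder = "E")
coder_N <- data.frame(case_id = "case_01", variable = c("1a", "14"), code = c(1, 1), coder = "N")
combined <- rbind(coder_A, coder_E, coder_N)

# Wide format: one row per case/variable, one code column per coder
wide <- reshape(combined, idvar = c("case_id", "variable"),
                timevar = "coder", direction = "wide")

# Flag rows where the three individual codes are not identical
code_cols <- grep("^code\\.", names(wide))
wide$disagreement <- apply(wide[code_cols], 1, function(x) length(unique(x)) > 1)

subset(wide, disagreement)  # candidates for group discussion and a group code
```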

Table 3

Example of coding results by case study (column SECDESC) and coding group (AEN).

The coding results displayed are the codes for individual coders “A”, “E”, “N”, as well as the agreed-upon “Group” code. The blue color of column “1b.ENDQUAL” indicates a disagreement among coders that was resolved by group agreement, resulting in the group code of −1 MIC, meaning that the group decided that there was not enough information in the text to make a decision.

3.8. Analyze and interpret results – post hoc intercoder reliability testing

Analysis of coding team dynamics and formal post hoc intercoder reliability testing4 (see Supplementary Material) were conducted along with other analyses for the meta-analysis study (Baggio et al. 2016; Barnett et al. 2016). Results showed that potential inconsistencies in intercoder agreement and coding team dynamics may have developed from the informal consensus-based process used by the CBIE team. The informal methodology may have resulted in distinct advantages for coders who were able to more forcefully argue their positions or better document all instances of text that led them to code a variable in a certain way, highlighting the need for explicit rules of coding and for increased attention to both intercoder agreement and reliability (MacQueen et al. 1998; Stemler 2001).

Post hoc intercoder reliability ratings were calculated to examine the overall intercoder agreement by team, but also to determine which coding variables were more difficult to identify within the texts (see Baggio et al. 2016). We found that the challenges discussed above contributed to low intercoder reliability ratings, but that these challenges are not insurmountable. They should be considered part of a normal coding process and are typical of many similar projects within our field of study. Coder agreement is generally expected to be low initially, particularly when coding “focuse[s] on identifying and describing both implicit and explicit ideas” (Namey et al. 2008, 138), such as inferring the presence or absence of DPs in case studies. The fact that many case studies in our dataset were lengthy texts may have further contributed to marginal intercoder agreement. These challenges can be mitigated through more formal methods, like the “best practices” model presented here (Figure 1). For example, to address discrepancies in coder interpretation, the literature recommends coding several iterations of subsets of the data, followed by formal reliability testing (percent agreement and a kappa statistic that takes chance into account) and iterative codebook revisions until acceptable intercoder reliability ratings have been reached (Hruschka et al. 2004; MacQueen et al. 2008; Bernard 2011). Once acceptable intercoder agreement has been reached, coding of the entire dataset proceeds, supplemented by continued random-sample intercoder reliability testing to prevent “coder drift” or “code favoritism” (Carey and Gelaude 2008, 251).

3.8.1. Data preparation

Post hoc intercoder reliability testing required considerable data preparation in order to unify coding data, minimize bias due to incompatible comparisons, and transfer complex coding values into a format that could be analyzed by intercoder reliability statistical software. Details of these processes are outlined in the Supplementary Material.
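As a simple illustration of this kind of preparation (the actual scripts and recoding rules are documented in the Supplementary Material), the sketch below maps hypothetical labelled answer codes onto a single nominal scale and treats the “not enough information” code (−1 MIC) as missing data.

```r
# Hypothetical recoding of spreadsheet answer labels to a nominal scale
recode_answer <- function(x) {
  x <- trimws(as.character(x))
  x[is.na(x)] <- ""
  out <- rep(NA_integer_, length(x))   # "-1 MIC", blanks, etc. remain missing
  out[grepl("^1", x)] <- 1L            # e.g. "1 yes", "1 present"
  out[grepl("^0", x)] <- 0L            # e.g. "0 no", "0 absent"
  out
}

recode_answer(c("1 yes", "0 no", "-1 MIC", NA))
#> [1]  1  0 NA NA
```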

3.8.2. Intercoder reliability testing

For coding projects involving >2 coders and coding values that are nominal and multiple, Feng (2014) recommends Krippendorff’s alpha, Fleiss’ kappa, and/or percent agreement. Krippendorff’s alpha is a reliability coefficient that is a “generalization of several known reliability indices” (Krippendorff 2013, 1). Its advantage lies in its ability to calculate intercoder agreement among an indefinite number of coders and any number of scale values. It can handle missing and incomplete data, as well as large and small sample sizes, and is considered a robust measure of intercoder reliability (Bernard and Ryan 2010; Krippendorff 2013). Fleiss’ kappa is a variant of the popular Cohen’s kappa statistic which allows for more than two coders (Bernard and Ryan 2010). Similar to Krippendorff’s alpha, Fleiss’ kappa measures coders’ agreement with respect to chance (Bernard 2011). Finally, although simple percent agreement tends to overestimate intercoder reliability because it does not account for chance agreement (Hruschka et al. 2004; Feng 2014), it is appropriate to utilize this technique in conjunction with other measures if the variables analyzed are nominal (Feng 2014). Simple percent agreement provides a good yardstick to determine whether the intercoder reliability ratings obtained through Krippendorff and Fleiss may be skewed due to particularly high agreement or missing variables.
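For example, all three statistics can be computed with the irr package in R, which we used for the calculations reported below; the ratings matrix in this sketch is hypothetical, with rows representing coded cases, columns the three coders of a team, and nominal codes (e.g. 1 = present, 0 = absent).

```r
library(irr)

ratings <- matrix(c(1, 1, 1,
                    0, 0, 1,
                    1, 1, 1,
                    0, 1, 0,
                    1, 1, 1),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("A", "E", "N")))

kripp.alpha(t(ratings), method = "nominal")  # expects coders in rows
kappam.fleiss(ratings)                       # expects subjects (cases) in rows
agree(ratings)                               # simple percent agreement
```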

Utilizing the irr-package in R (Gamer et al. 2012), intercoder agreement for all three statistics was calculated for 11 variable groups in each of the 13 coding teams (see Table 4 for excerpt and the Supplementary Material for complete intercoder reliability ratings and R code). Before evaluating whether coding agreement reached high (>0.80) or acceptable (0.70–0.79) reliability levels, simply adding the Krippendorff and Fleiss values by variable group and coding team provides a first insight into those variable groups/teams with high/low scores. For the Krippendorff values, Figures 2 and 3 reveal DP1 (clearly defined boundaries) and coding team “AEN” as those with the highest intercoder agreement. In contrast, DP8 (nested governance) and team “ACH” had the lowest intercoder agreement. Fleiss’ statistics mirrored those findings (see Supplementary Material). This suggests that determining the evidence of resource and user boundaries within a case study requires less inference from coders than determining whether the reported institutional structure represents a “nested enterprise.” For codebook and coding protocol development purposes, such initial high/low values could be important bellwethers of particularly well or poorly functioning coding questions/teams, identifying weaknesses that may require further investigation in order to strengthen intercoder agreement before commencing with coding the entire dataset.

Coding team Variable group Krippendorff values Fleiss values Percent agreement
ACH Env 0.603 0.602 80.60
ACH Soc 0.693 0.692 68.80
ACH Success 1.000 1.000 100.00
ACH DP1 0.261 0.256 33.30
ACH DP2 0.327 0.322 37.50
ACH DP3 0.387 0.384 64.30
ACH DP4 0.591 0.590 59.30
ACH DP5 −0.138 −0.149 50.00
ACH DP6 −0.241 −0.258 16.70
ACH DP7 0.389 0.385 50.00
ACH DP8 −0.274 −0.286 33.30
CHN Env 0.636 0.634 66.70
CHN Soc 0.507 0.503 45.80
CHN Success −0.063 −0.125 66.70

Table 4

Excerpt of intercoder reliability testing results (all statistics).

Column “coding team” identifies the coding team. Column “variable group” identifies the coding variable categories/groups, i.e. “env” = variables 1a–6b; “soc” = variables 7a–13; “success” = variable 14; DP1 = variables 15–18; DP2 = variables 19–22; DP3 = variables 23–26; DP4 = variables 27–32; DP5 = variables 33–34; DP6 = variables 35–36; DP7 = variables 37–38; and DP8 = variables 39–41. Values for Krippendorff’s alpha and Fleiss’ kappa range between 0 and 1, with 1 demonstrating perfect agreement between coders and 0 indicating agreement that is consistent with chance, i.e. the absence of reliability. Negative alpha values signify coder agreement that is below chance (Krippendorff 2008).

Figure 2 

Sum of Krippendorff values by variable group for all coded cases. Results indicate that generally Design Principle 1 (DP1) had the highest overall intercoder agreement and Design Principle 8 (DP8) the lowest.

Figure 3 

Sum of Krippendorff values by coding team/all cases coded. Results reflect highest coder agreement for team AEN and lowest coder agreement for team ACH.

Despite the aforementioned problems, many of the intercoder agreement ratings were >0.65 for both Krippendorff and Fleiss statistics. This places our data reliability/replicability factor only slightly below the 0.70 score that is generally deemed acceptable in the literature. Given the subjective nature of some of the variables, the large number of missing values, and the iterative nature of our coding process, such ratings are defensible for the completed project and may easily be improved in the future through the use of a more detailed codebook and coding protocol. More importantly, by disclosing our intercoder reliability ratings, procedures, preliminary codebook and coding protocol, we have taken additional steps to enhance the ability of others to analyze and replicate our findings as well.

3.8.3. Coder drift

One important reason we found to assess intercoder reliability is known as “coder drift”. Coder drift is a process over time in which coders may become less reliable in their coding due to the adoption of coding biases and the less rigorous application of coding criteria (Bartholomew et al. 2000). To avoid coder drift, Carey and Gelaude (2008) recommend spot checking of coder agreement throughout the coding process. After coding was completed and intercoder reliability ratings performed, discussion among coders revealed that there may have been some coder drift which could have produced inconsistencies in the way that coders applied information within the text to answer the question of overall success (variable 14). In our study, spot checks of coder agreement throughout the coding process may have mitigated some of the ambiguity with regard to coders’ assessment of “success”. Subsequent random sampling of the answers given to question 14, as well as purposive sampling of an additional ten cases, revealed notes indicating that several coders may have considered more than the outcome variables in their answer to this question. However, in all but two cases, coders were in agreement with their assessment of the study’s overall success or failure, regardless of the potential for coder drift. In the two instances of coder drift where there was no initial coder agreement, the coders were able to resolve the disagreement through discussion. As outlined throughout this paper, a codebook containing detailed coding descriptions that is iteratively updated to include coder questions and coding ambiguities, as well as continuous spot-checking of intercoder agreement, might have resolved these instances of coding bias.
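As an illustration of such a spot check, the sketch below (with simulated codes and hypothetical case identifiers) draws a random sample of already-coded cases for a single variable and recomputes percent agreement; a marked drop relative to the agreement achieved during sample coding would flag possible coder drift.

```r
library(irr)

# Simulated stand-in for the coding spreadsheet of one variable (e.g. variable 14)
set.seed(1)
codes <- data.frame(case_id = rep(sprintf("case_%02d", 1:10), each = 3),
                    coder   = rep(c("A", "E", "N"), times = 10),
                    code    = sample(c(0, 1), 30, replace = TRUE))

# Spot check on a random sample of already-coded cases
spot_cases <- sample(unique(codes$case_id), size = 5)
spot <- subset(codes, case_id %in% spot_cases)

# One row per case, one code column per coder, then percent agreement
wide <- reshape(spot, idvar = "case_id", timevar = "coder", direction = "wide")
agree(wide[, grep("^code\\.", names(wide))])
```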

4. Recommended coding protocol

Through analysis of our coding process and review of the literature, we have found that increased transparency, reliability, and replicability are of primary importance in increasing our ability to perform meta-analysis and the synthesis of case studies. While qualitative research often generates complex information that is difficult to process and can lead to judgments based on subjective, or “intuitive heuristics” (Hruschka et al. 2004), the level of agreement can and should be quantified. It is precisely the subjective nature of the evaluations which makes them more susceptible to individual interpretation and the intentional or unintentional introduction of biases, random errors, and other distortions (Hruschka et al. 2004; Krippendorff 2013). The establishment of more rigorous coding protocols including intercoder reliability testing represents an effort to “reduce [such] error and bias” (Hruschka et al. 2004) by ensuring that the data meaning remains consistent across a variety of coders and research teams. In fact, it can be argued that coding is an essential element of classical content analysis because it converts qualitative data into datasets that are supportive of robust analyses and can be replicated by other scholars (Krippendorff 2013). Replicability creates greater reliability which empirically grounds confidence in the data and, thus, the study findings (Krippendorff 2013). For these reasons, we include here our Recommended Coding Protocol (Figure 4). This is based, in retrospect, on the examination of the CBIE meta-analysis project, but we will briefly discuss the considerations which may be affected by project and team type. More detailed information on all steps outlined here can be found in the Detailed Recommended Coding Protocol included within the Coding Manual in the Supplementary Material.

Figure 4 

Recommended coding protocol. 1Boxes shaded in gray represent preliminary considerations, while unshaded boxes are a part of the main coding process.

4.1. Preliminary considerations

We found a number of preliminary considerations (gray boxes in Figure 4) which should precede the coding process.

4.1.1. Identify dataset

We highly recommend that teams develop a screening process during the identification of the dataset to ensure that cases included in the study have sufficient information to answer the research question. Inclusion/exclusion and text segmentation criteria should be clearly defined and reported. This step is likely to decrease missing and ambiguous data for analysis.

4.1.2. Select qualitative data analysis (QDA) software or other technique for coding

Teams should consider the use of QDA software prior to the commencement of the coding process. Although QDA software will add cost and training considerations to the project, it may facilitate data processing, decrease discrepancies, and potentially reduce the time needed to conduct intercoder reliability testing.

4.1.3. Form coding team

We found that utilizing two or more coders increases data reliability because coding agreement between different people, who have been given the same instructions and have independently coded the same text segments, demonstrates a reduction of subjective biases (Guest and MacQueen 2008). Coding team dynamics may be a concern; however, such dynamics can be mitigated through the use of more rigorous coding protocols and coder training. Although each additional coder increases the need for iterative intercoder reliability testing and training to achieve intercoder alignment, two coders per text should be a necessary condition for any meta-analysis.

4.2. Coding process

4.2.1. Define coding schema

We recommend that coding schema definition include explicit consideration and documentation of the organization and work processes to be used during the coding process, the development of detailed coding variable descriptions, and the iterative and consensus-based definition of theoretical categories by the entire coding team.

4.2.2. Sample coding and intercoder reliability testing

We recommend that the principal investigator and all coding team members independently test code a randomly selected subset of the actual dataset, followed by formal intercoder reliability testing of the results until acceptable levels of intercoder reliability ratings have been reached.

4.2.3. Codebook development, iterative refinement and training

We recommend that a consensus-based process of codebook development, based on the previous definition of the coding schema, sample coding, and intercoder reliability testing be included in the coding process. This can be considered part of coder training. Discussions on the development of codes and theoretical categories among the coding team will likely result in increased understanding of key issues and variables to be coded. Training should also include coder instruction in the use of any selected QDA software.

4.2.4. Coding with intercoder reliability spot checks

Once acceptable intercoder reliability ratings have been achieved through sample coding and iterative codebook refinement, the entire dataset can be coded. At least one spot-check should be performed during this process to assess coder drift.

4.2.5. Analyses and interpretation of results with final intercoder reliability testing

The coding process should be assessed, along with final intercoder reliability testing, after coding is complete. The results of these analyses should be reported in the final project outcomes.

4.2.6. Reporting of results

Results should include the analyses of the data produced by the coding process, such as that reported in Baggio et al. (2016) and Barnett et al. (2016), but should also include the explicit disclosure of assumptions made during the preliminary steps of the project, as well as an analysis of the coding process itself and final intercoder reliability testing.

5. Conclusions

Social-ecological systems (SESs) vary across spatial and temporal scales and studying them is critical to understanding governance challenges involving common-pool resources (CPRs). Scholars like Agrawal (2014) and Araral (2014) see current trajectories within SES research as fundamental, yet still in their infancy. Araral (2014), in particular, argues that Ostrom’s theories may only be applicable to the special case of locally governed, small-scale commons and may not be easily generalized. The body of evidence collected within Ostrom’s legacy has not been able to effectively assess natural resource issues at larger scales. We question whether a sufficiently sizable body of data has been gathered and analyzed, including information on larger-scale systems, multi-scalar governance structures, temporal dimensions, and other important factors with which to compare the existing studies, or whether there are sufficiently developed methods by which to conduct such comparisons.

It was one of Ostrom’s (2005) deep convictions that SESs are composed of a set of universal building blocks which could be tapped to create adaptive and long-enduring governance systems. Work towards creating a methodology that will foster cooperation and cross-comparison of data could allow us to expand our understanding of these systems. By sharing our coding experience and protocols, we hope to stimulate the development of transparency norms within the commons research community which others may build upon as we move further toward the identification of these universal building blocks. It is important to continue pushing social-ecological science towards greater rigor and a greater understanding of the complex interactions that lead to successful outcomes. Towards this goal, we assert that methodology must be tested and refined for more precise measurement of the dependent and independent variables involved in SESs. Furthermore, the commons research community should work to ensure that studies are replicable and that different research teams are able to achieve similar answers. In conclusion, while there may be many challenges and opportunities associated with the coding and synthesis of case studies, increased collaboration and consensus in a few key areas within the research community may lead to new horizons and possibilities in understanding SESs and the commons.

6. Supplementary files