Blog Author: Bailey McNichol
Citations:
Preston, F.W. 1962. The canonical distribution of commonness and rarity: part 1.
Ecology 43(2): 185-215.
Van Proosdij, A.S.J., Sosef, M.S.M., Wieringa, J.J., and Raes, N. 2016. Minimum
required number of specimen records to develop accurate species distribution
models.
Author Background:
Frank Preston was an English-American scholar with expertise in engineering, ecology, and conservation. He studied in London, worked briefly in the U.S. on behalf of his employer in lens/camera manufacturing, returned to London for his Ph.D., and later came back to the U.S. and established a glass research laboratory in Butler, Pennsylvania. It was in the valleys of western Pennsylvania that he developed an interest in conservation, mapping geologic features, studying migration patterns of extinct mammals, and eventually beginning his long endeavor of studying birds. His experience in both ecology and engineering allowed him to examine the theory and mathematical relationships behind ecological rarity and commonness.
The lead author on the companion paper, André S.J. van Proosdij, is currently the Project Manager in Ecology at DAG Netherlands, a consulting group that works on projects related to sustainability, environmental monitoring, and product design. He received his Ph.D. from Wageningen University (Wageningen, Netherlands), and his dissertation focused on the factors that determine plant species diversity in Central Africa. He has expertise in botany, modeling (specifically with species distribution models), and biodiversity research.
Preston paper:
Summary of paper/main questions:
Building upon his 1948 paper, Preston introduced the theory behind a new model to conceptualize distributions and abundances of species. He observed that we tend to describe abundances in relative rather than absolute terms, and demonstrated a method of grouping species into geometric classes, or “octaves”, based on the binary logarithm (log base 2), so that the species in each octave are twice as abundant as those in the previous octave. He showed that, after log-transformation, species abundances tend to follow a Gaussian (normal) distribution. In this paper, he expanded upon his previous findings and aimed to demonstrate that, in addition to the species abundance distribution following a lognormal distribution, the distributions of the number of individuals per species also follow a lognormal curve. Preston introduced his theoretical framework and equations (some purely his own, some drawing from relationships demonstrated by other scientists) that allow us to test this theory with empirical data. He then used observational surveys of various populations of organisms to test the relationship between the theory and actual ecological distributions of species and individuals, and finally discussed the limitations of these models.
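To make the octave idea concrete, here is a minimal Python sketch of the binning. It assumes octave k holds species with between 2^k and 2^(k+1) - 1 individuals; that boundary rule is my simplification, and Preston's own handling of species that fall exactly on an octave boundary may differ. The function names and the sample abundances are made up for illustration.

```python
from collections import Counter


def octave_index(abundance):
    """Return floor(log2(abundance)) for a positive integer abundance.

    Assumption: octave k covers abundances 2**k .. 2**(k + 1) - 1.
    """
    if abundance < 1:
        raise ValueError("abundance must be a positive integer")
    # For positive integers, bit_length() - 1 equals floor(log2(n)).
    return abundance.bit_length() - 1


def species_per_octave(abundances):
    """Count how many species fall in each octave (the 'species curve')."""
    counts = Counter(octave_index(n) for n in abundances)
    return dict(sorted(counts.items()))


# Hypothetical sample: individuals counted for nine species.
sample = [1, 2, 3, 4, 5, 8, 16, 17, 100]
histogram = species_per_octave(sample)
print(histogram)  # {0: 1, 1: 2, 2: 2, 3: 1, 4: 2, 6: 1}
```

Plotting the octave index on the x-axis against the species count on the y-axis gives the kind of histogram that Preston fit with a normal curve.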
Summary of Methods and Results:
Preston argued that individuals across numerous species tend to have a lognormal distribution between the species because of the relative rarity and commonness of these species in a given environment. He introduced a “Species Curve” (1948) that has the logarithmic number of individuals per species as abscissa (i.e., along the x-axis), grouped into octaves that describe how many species in a given sample are roughly equally abundant, and the number of species falling in each octave as ordinate (i.e., on the y-axis). When graphed, the distribution of species amongst the octaves results in a lognormally distributed species curve with the least common species occurring on the left and the most common species occurring on the right. He termed the octave with the most species the “modal octave” (i.e., most species in a given area are neither very rare nor very common), which occurs at the peak of the curve.
He then introduced a second curve, with the number of individuals per species on the y-axis, which he referred to as the “Individuals Curve” and which also follows a lognormal distribution and has the same dispersion constant as the species curve. The individuals curve reaches its maximum (just before descending) in correspondence with the position of the most common species on the species curve. He clarified that the theoretical species curve extends infinitely to the left and right, but that the true number of octaves generally lies around ±9 and varies depending on the number of species included (in other words, the octaves shown describe the most probable range of the distribution based on what is known about species abundance). He also introduced the concept of the “veil line”, a vertical line on the left side of the species curve denoting that the abundances of any species occurring below (to the left of) the line have not been included due to inadequate sampling.
He called the close lognormal correspondence between the species curve and the peak of the individuals curve “canonical” in an ecological context to accentuate the importance of these paired distributions. Preston then defined an additional equation that accounts for the number of individuals/pairs within a given area, the Species-Area Curve, which is particularly useful in situations where the population density is uniform over the considered area. The species-area curve may approximate a linear relation in the log-log plot (the Arrhenius plot) because, as the number of individuals and species sampled increases, the total proportion of species actually occurring in a given area that have been sampled begins to approximate the given “universe” (all species in an area).
After introducing these three theoretical curves, Preston tested the relationships with numerous datasets of observed abundances of species and individuals, including (but not limited to) birds from England/Wales, Finland, the Caribbean (“West Indies”), and Madagascar, land plants of the Galapagos, and flowering plants of the world. In examining curves for the species/individuals distributions of these data, he made the distinction between “isolates” (communities of organisms that have reached some equilibrium but are geographically separated from larger populations, e.g., on an island) and “samples” (that are a subset of a larger geographic community). He validated the results for each dataset by making a comparison to the z-value – a constant exponent that represents the average index over the range of the theoretical regression line (z=0.262). For all of the case studies he assessed, z-values ranged from 0.220-0.325, which surprised even Preston in remaining fairly close to the theoretical index.
Preston concludes that the predominant factor controlling the slope of the curves is the phenomenon of the “expanding universe”: as you increase the sample size (of both individuals and species that are included), your sample more accurately approximates the environmental abundances.
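As a rough illustration of how a z-value can be read off a log-log (Arrhenius) plot, the sketch below fits the power-law form S = c * A^z by ordinary least squares on log-transformed values. The symbols S, A, and c, the function name, and the data are assumptions for illustration: the data are synthetic, generated with z = 0.262, and are not taken from Preston's tables, and this is not necessarily how Preston fitted his own lines.

```python
import math


def fit_species_area(areas, species_counts):
    """Least-squares fit of log10(S) = log10(c) + z * log10(A).

    Returns (z, c), where z is the slope of the log-log line.
    """
    xs = [math.log10(a) for a in areas]
    ys = [math.log10(s) for s in species_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    z = sxy / sxx
    c = 10 ** (mean_y - z * mean_x)
    return z, c


# Synthetic example: five hypothetical islands that follow the
# power law exactly, with c = 10 and z = 0.262.
areas = [1, 10, 100, 1000, 10000]
species = [10 * a ** 0.262 for a in areas]

z, c = fit_species_area(areas, species)
print(round(z, 3), round(c, 3))
```

With real survey data the points scatter around the line, so the fitted slope lands somewhere in a range like the 0.220-0.325 reported above rather than exactly on the theoretical value.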
Proosdij et al. paper:
I chose this paper because I thought it provided an interesting expansion on the basic ideas of the occurrence of species demonstrated by Preston’s Species-Area curve. Specifically, Proosdij et al. examined the effects of sample size and species prevalence (i.e., the proportion of the focal area occupied by that species) on the performance of species distribution models (SDMs). SDMs are widely employed in ecological niche modeling to predict the occurrence of species, given records of species presences and environmental factors that likely correspond to a given species’ distribution, but it is not well-documented how large a sample size is required to optimize model performance. The authors compared model performance in a simulated study area (100x100 raster cells along a west-east and north-south gradient with six prevalence classes) and an African study area (179,994 raster cells, which excluded large aquatic environments and featured elevational/environmental variables). In both the simulated and real study areas, a species was considered present in a given raster cell when the cell’s two environmental variables (2 axes taken from a principal component analysis that relate to the species’ most preferred conditions) fell within the central region of the bivariate normal density containing probability = 68% (broadly, this indicates where conditions would be most favorable for the species). They ran models with numerous sample sizes for each prevalence class, and ultimately found that model performance increased as sample size increased (given constant prevalence) but decreased as prevalence increased (given a constant sample size). Though SDMs are fairly common, the Proosdij et al. study is novel in that it: A) demonstrated a new, rigorous method for determining minimum sample size and B) highlighted the importance of identifying minimum sample sizes before attempting to predict species occurrences.
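The presence rule for a virtual species can be sketched in a few lines of Python. This is only my reading of the setup, not the authors' code. It assumes two uncorrelated environmental axes scaled from 0 to 1 across the grid, and a niche centre and niche widths that I picked arbitrarily. It relies on the fact that, for a bivariate normal, the squared standardized distance from the centre follows a chi-square distribution with 2 degrees of freedom, so the central region holding probability p is where that distance is at most -2 * ln(1 - p).

```python
import math


def presence_grid(n_rows, n_cols, centre, sd, prob=0.68):
    """Mark cells inside the central `prob` region of a bivariate normal niche.

    Assumptions (hypothetical, for illustration): env1 runs west to east
    and env2 north to south, both from 0 to 1, and the axes are uncorrelated.
    """
    threshold = -2.0 * math.log(1.0 - prob)  # chi-square(2 df) quantile
    grid = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            env1 = c / (n_cols - 1)
            env2 = r / (n_rows - 1)
            d2 = ((env1 - centre[0]) / sd[0]) ** 2 + ((env2 - centre[1]) / sd[1]) ** 2
            row.append(d2 <= threshold)
        grid.append(row)
    return grid


def prevalence(grid):
    """Proportion of raster cells in which the species is present."""
    cells = [cell for row in grid for cell in row]
    return sum(cells) / len(cells)


# A narrow-niched and a wide-niched hypothetical species on a 100x100 raster.
narrow = presence_grid(100, 100, centre=(0.5, 0.5), sd=(0.1, 0.1))
wide = presence_grid(100, 100, centre=(0.5, 0.5), sd=(0.2, 0.2))
print(prevalence(narrow), prevalence(wide))
```

Widening the niche raises the prevalence, which is one simple way to generate species in different prevalence classes on the same raster.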
My Thoughts:
Overall, I thought that Preston made a logical, clearly defined argument that was well-substantiated by both his theory and the examples he used, especially in comparison to some of the authors in previous papers. I thought he was very forthcoming in stating some of the inherent weaknesses and possible limitations of his approach, and he generally (although not always) defined the meaning of the variables he introduced in each equation. I also appreciated seeing the applications of the abundance distributions in a variety of taxa and geographical locations. However, I still felt that some of his assumptions had a bias towards temperate ecosystems; while he did include samples and observations from a wide array of “universes”, he did not provide a discussion on how the relative abundances shift when considering tropical ecosystems (or other study areas that have higher numbers of species and much lower numbers of individuals per species). I also thought that in a few instances, he seemed to over-manipulate some of the distributions that didn’t fit the theoretical relationship as closely, which felt more like he was “fishing” for a good fit.
I think that the Proosdij et al. paper was overall very rigorous in its methods, and in controlling for any extraneous “noise” or variability by testing both the virtual study area and African study area across six distinct prevalence classes and over 24 sample sizes. However, I would have liked to see a clearer interpretation of some of their statistical values, particularly those that were very specific to this discipline (e.g., Schoener’s D and the Hellinger distance). I also found their statement in the conclusion that their method could be applied to “any taxonomic clade or other group, study area and past, current or future climate scenario” to be overly broad. How would we go about applying these results (in terms of minimum constraints for narrow-ranged vs. widespread species on sample size) in the context of a different ecosystem, group of species, or study area?
Preston’s work with mathematical modeling is an interesting comparison to the Nicholson and Bailey model. While there were some criticisms that Nicholson and Bailey did not include empirical data, the same criticisms can’t be applied to Preston’s work. He included examples from a variety of “universes” (to use his language) to demonstrate how the mathematical model fit real-world scenarios, but I agree with Bailey that there seemed to be some “fishing” in the models. Preston even seemed to acknowledge that “fishing” in the data could improve the match to the model: “If we include the two lowest points, a very questionable procedure, and give them equal weight with the other points, the slope of the curve increases to about 0.30. Giving them even a little weight brings the exponent close to the theoretical value” (p. 203).
I am not familiar with species distribution models, so I can’t provide an accurate evaluation of Proosdij’s work, but I do want to call out Proosdij for using subjective language in a scientific paper. Several times in the results section, Proosdij used terms such as “small,” “large,” “strong,” “better,” and “low.” For example: “SDMs for species with a small prevalence perform better than SDMs for species with a large prevalence when using the same number of records to train the model (Fig 2-4)” (p. 546). I find this language especially problematic because the terms “small” and “large” are not indicated in Figures 2, 3 or 4. (How “small” is a small prevalence? How “large” is a large prevalence?) Maybe I’m just being too picky, but I think it is best practice for scientific papers to use objective language to prevent misinterpretation of subjective words.
Preston’s paper built mathematical models to study species abundance. He linked the species number and individual numbers together based on the relationships between species curve and individual curve (although I think it is arbitrary to define rarity of species based on the “octaves”). To me this paper is well-organized and interesting. Because the author used published data around the world to test his theory, it is much more convincing than the papers with pure mathematical models we discussed before. It was amazing that the “z” value fit well in many cases. However, the model didn’t always work well for plant communities, as the “universe” didn’t quite fit what the author defined. In fact, although the author tested the model using data from other studies, his explanation for those that didn’t fit well was not adequate.
The Proosdij et al. paper designed a new way to examine the minimum sample size, which could help remedy the shortcomings of Preston’s method. The method that the authors used is beyond my knowledge, and it is difficult for me to understand how the models work. They built the virtual species based on the assumption that the virtual species’ distribution was shaped by temperature and precipitation only. Then, for the simulated species, the authors used PCA to construct two orthogonal factors from the multiple environmental variables. Because of my limited background in this area, the different assumptions confuse me. Also, it seems that the only empirical data the authors used were the environmental factors. It would be great if they could test their theory the way Preston did.
Both papers present mathematical models in order to (very generally) understand how species are organized in an area. The Preston paper accomplished what many of us wanted to see last week: applying a proposed mathematical model to real-life data. In fact, Preston applied his model to a great assortment of data sets produced for other studies, seemingly verifying his model. Personally, I liked this approach (though at times it may have seemed like a little bit of overkill). From what I gathered from this paper, Preston confirmed that many distributions in nature follow a log-normal distribution (with rare and common species in an environment). I was at first annoyed because Preston really focused on terrestrial systems (primarily birds). Later on in the paper, he did include one system of marine organisms (Conus snails). His focus on terrestrial systems and isolates got me thinking; I was interested to see how Preston would account for marine organisms with more ubiquitous distributions. For example, many fish, whales, and sharks migrate and occupy an enormous range. I did not really see how his model could apply to organisms with enormous ranges, especially when focusing on isolates (islands) as good areas of study. I am also still unclear what an "octave" is, though this may be because my background is not in ecology. As a nitpicky note, I was not keen on the fact that Preston drew some of his plots "by eye" (page 196). This seemed to negate some of the precision in his paper by saying "good enough".
I found the Proosdij paper slightly hard to understand, but again that is a personal note. I found it more of an aid to other scientists than a grand contribution. This is not to say that the work was not important, just that it filled in gaps more than it went in a new direction.
I found Preston's paper to be interesting and to have a much more mathematical approach than earlier papers. Other than the "data fishing" that has been mentioned in previous comments, as my one additional qualm with his work I would have liked to have heard more of his basis behind how he separated the models into octaves. Although separating into groups like he did can help with the interpretation of a model, I would have liked to have gotten more of an explanation as to why he chose the intervals that he did and whether there is a significant mathematical or ecological reason for spacing the intervals as such.
While Proosdij was dense, as others have mentioned, it gave an interesting perspective on how we as scientists can improve sampling and model building, and it highlighted some of the shortcomings of traditionally used models.
I forgot to sign off, but this comment was by Elizabeth Chambers
Last week, we discussed whether theoretical work should be "stand-alone" and agreed that models are much more convincing when paired with empirical support. Thus, I can appreciate the work Preston put in to pair his models with data. However, as others have said, in some cases it seemed that data didn't quite support his predictions and conclusions.
I was also surprised that he mentioned that "in none of the biological examples we have yet encountered do we have as many as a thousand species to work with". Although some of his empirical data has relatively few species (e.g. land vertebrates on Lake Michigan islands), some other examples have many more (e.g. 540 bird species on New Guinea). It is easy to think of situations where you would expect at least 1000 species, if not more. Insects in the Amazon come to mind, although as Proosdij et al. point out, the tropics are particularly "data-sparse". I found the Proosdij paper to be a good fit for the Preston paper, as they demonstrated how species distribution models can be made more useful by ensuring that a minimum number of records are present.
Building on our discussion from last week, I really liked that Preston included case studies. Not only does this help ground the reader, but it also puts his models in context and tests them. After all, simulations and models are great, but they are only truly informative when compared to the real world. Along this line, my favorite aspect of the Proosdij paper was that it was written for the express purpose of creating a tool that ecologists can use to improve their science. This is an excellent example of a paper that merges theoretical and empirical approaches.
One line Preston wrote particularly caught my interest: "The universe being sampled is simply what the sample says it is" (page 206). This highlights just how important sampling is in a very elegant way. If I only look at the macroscopic, I would assume that that's all that exists. While this is an extreme example, it shows the danger of biasing science in any way. Of course, we are naturally biased creatures, but being aware that such underlying assumptions exist is the first step to combating them.
-Miranda
The Preston paper is a fascinating exploration of underlying mathematical patterns in nature. Like others, I found the pairing of theory and data compelling, and am also curious about the reasoning behind some of his decisions (e.g., splitting species into base-2 octaves). I was particularly fascinated by the emergence of lognormal patterns in the data. It reminded me very much of Sugihara's (1980) random distribution model, which I have just realized is actually an effort to understand Preston's model from a biological perspective while incorporating stochasticity.
I also enjoyed the van Proosdij et al. paper very much. It's a very logical effort to tackle a systemic problem in macroecology: how to pair sampling and statistical designs. One extension I would have liked to see is some examination of how heterogeneity in landscapes contributes to sampling needs.
Whenever I see results like Preston’s, which demonstrate how large-scale ecological patterns can be captured with statistical relationships that, as far as I understand, don’t really depend on any particular ecological mechanisms, it makes me wonder how we can possibly develop a biological understanding of how these patterns arise. I’m not very familiar with macroecology, but it would be interesting to see what people have tried or proposed to address this issue of identifying the processes that generate the patterns we see.
Preston’s statement that we can never really sample the complete ensemble made me wonder what microbial species abundance curves look like. Do they more closely resemble the complete ensemble curve?
I’ve never done any species distribution modeling, but I thought the van Proosdij paper was interesting. I never thought about how an SDM for rare species might have an artificially high AUC simply because the model is good at detecting absences. One thing that did confuse me was the difference between the “real AUC” and the “MaxEnt AUC.” I was under the impression that as long as you can get the false positive rate and the true positive rate, you can calculate an AUC, so I don’t get what the difference is or why there is an issue with its calculation.
- David
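One possible reading of the two AUC values, offered as an assumption rather than as what the paper necessarily does: the “real AUC” compares presences against known absences (available for a virtual species), while an AUC computed against background cells compares presences against a mix of absences and unrecorded presences. The Python sketch below uses the rank-based definition of AUC (the probability that a randomly chosen presence scores higher than a randomly chosen comparison cell, with ties counting half) and made-up scores from a hypothetical perfect model.

```python
def auc(positive_scores, comparison_scores):
    """Rank-based AUC: P(positive > comparison) + 0.5 * P(tie)."""
    wins = 0.0
    for p in positive_scores:
        for q in comparison_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positive_scores) * len(comparison_scores))


# Hypothetical perfect model on 100 cells with prevalence 0.2:
# every occupied cell scores 1.0 and every empty cell scores 0.0.
presence_scores = [1.0] * 20
absence_scores = [0.0] * 80

# Background cells are drawn from the whole area, so they include
# the occupied cells as well as the empty ones.
background_scores = presence_scores + absence_scores

auc_true_absences = auc(presence_scores, absence_scores)
auc_background = auc(presence_scores, background_scores)
print(auc_true_absences, auc_background)
```

In this toy case the same perfect model gets 1.0 against true absences but only 0.9 against background, because a fifth of the background cells are themselves presences. Under this reading, the ceiling on a background-based AUC depends on prevalence, which would explain why the two numbers are not interchangeable.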
Generally speaking, it seemed to me that Proosdij used a similar method to approximate the number of species, i.e., integrating the area under the curve. Proosdij just took this one step further and provided a method for finding the minimum sample necessary for making that estimate, which very much struck me as a catch-22: you need to estimate the prevalence of the species in order to find the minimum sample size, but you need the minimum sample size to be sure you can estimate the prevalence of the species. Either way, the general conclusion I drew from both papers is that we always have detected and likely always will detect animals imperfectly. The best solution so far to this problem seems to be to find good ways to acknowledge and account for imperfect detection.
One thing I found anecdotally interesting was the use of the term "octaves" in Preston (1962) to represent the intervals between the species increases, mostly because I have only ever seen octaves used in a music theory context, not in ecology (the definitions were equivalent). Of course, Preston denounced the use of that term because a log2 scale fit the species increases better, but nonetheless I thought it was interesting to note that music theory influenced ecological theory for a small point in time.