Sequence-based species delimitation in the Balkan Bythinella Moquin-Tandon, 1856 (Gastropoda: Rissooidea) with general mixed Yule coalescent model

Three sets of sequences of cytochrome oxidase subunit I (COI) from a spring snail Bythinella representing all the Balkans (63 sequences), Greece (78 sequences), and Romania (136 sequences), were used to infer maximum likelihood ultrametric trees. The trees were used to run General Mixed Yule Coalescent (GMYC) analyses assuming single threshold and multiple threshold models. For the single threshold model the threshold value was identical (0.00248 substitution per site) for each data set; for the multiple one the threshold value varied widely for the Balkan tree. Despite the same threshold value, the distinctness of the same lineages varied among trees, mostly due to differences in the models of substitution inferred for each set of sequences, but also due to different proportions of singletons in the data sets. The inferred numbers of ML entities, theoretically equalling the numbers of species, compared with all the biological evidence available so far, were overestimated in Romanian and Greek trees, but realistic in the tree for all the Balkans.


INTRODUCTION
Bythinella, minute freshwater snails, common inhabitants of springs, are found from southern Poland to the Mediterranean, and from Spain to Asia Minor. This widely distributed genus has been a common subject of study, but where the delimitation of Bythinella species is concerned, it is enigmatic. RADOMAN (1976RADOMAN ( , 1983 considered that shell characters were useful at the species level, while BOETERS (1973) and JUNGBLUTH & BOETERS (1977) considered soft-part anatomy to be crucial for species delimitation. However, considering wide morphological variability of shell and soft parts in this genus, none of the characters of shell or soft-part anatomy appears to be useful for clear delimitation and recognition of species (GIUSTI & PEZZOLI 1977, 1980, FALNIOWSKI 1987, 1992, BICHAIN et al. 2007a. Recently, molecular studies on Bythinella (BICHAIN et al. 2007a, HAASE et al. 2007, BENKE et al. 2009, FALNIOWSKI et al. 2009a, b, c, 2012, FALNIOWSKI & SZAROWSKA 2011 shed some light on the problem of species delimitation in this genus, but it is still far from resolved.
There are only a few cases known where morphological differences are paralleled by molecular differences in sympatric taxa of Bythinella. In these cases the species distinctness of the taxa is doubtless (FALNIOW-SKI et al. 2009b, FALNIOWSKI & SZAROWSKA 2011 Otherwise taxonomic decisions at the species level are controversial. Recently PONS et al. (2006) introduced a procedure for delimiting species from sequence data, without the prior definition of taxon and/or population, usually defined geographically. The procedure is based on analyses of branch lengths on a DNA tree for tests of species boundaries, assuming the difference in branching rates at the level of species and populations, determined between species by speciation and extinction rates (PONS et al. 2006), and within a species reflecting coalescence processes. The technique combines the equations that describe processes of lin-eage birth at the species level (assuming the simplest model of YULE 1924) with coalescence models within species, and applies a likelihood ratio test to assess the fit of branch lengths. This enables the location of shift in dynamics of branching associated with species boundary. Thus the technique estimates species boundaries directly from branching rates in mixed population phylogenetic trees without the need for any prior definition of populations or species. The procedure is known as General Mixed Yule Coalescent (GMYC) procedure ("general", because it has relaxed some biologically rather unrealistic assumption of the MYC model).

MATERIAL AND METHODS
All the sequences of cytochrome oxidase subunit I (COI) considered in the present study are listed in our previous publications on Bythinella from Romania (FALNIOWSKI et al. 2009b), Bulgaria (FALNIOWSKI et al. 2009a) and Greece (FALNIOWSKI & SZAROWSKA 2011), and on the radiation of the genus in the Balkans (FALNIOWSKI et al. 2012). See the above papers for the description of the localities, collection and molecular techniques, and GenBank Accession numbers of the used sequence.
The first tree, inferred for the Balkans (FALNIOW-SKI et al. 2012) consisted of 63 COI sequences (Table  1). Two species represented the outgroup; 34 sequences belonged to 10 putative species; all six species from Romania were represented by one sequence each; the ten putative species from Greece were represented by one or two haplotypes each. The second tree, inferred for Greece (FALNIOWSKI & SZAROWSKA 2011), consisted of 78 haplotypes ( Table 1); two of the haplotypes represented the outgroup, 5 represented Bythinella species from the outside of Greece, and 71 sequences represented all the haplotypes found in Greece (from 302 sequences analysed). The third tree, inferred for the Romanian Bythinella (FALNIOW-SKI et al. 2009b), consisted of all 136 sequences (representing 10 outgroup species, five species of Bythinella from the outside of Romania, and 23 haplotypes of the Romanian Bythinella); thus the tree included Romanian haplotypes represented more than once in the data set. Such a design of data sets was intended to compare the behaviour of the technique with (slightly) different data. All the trees are presented with an outgroup species, to show the relative length of the branches, but in each case an ultrametric tree (i. e. with molecular clock enforced) was computed without the outgroup taxa.
For each maximum likelihood analysis, we used the best fit model of sequence evolution found by ModelTest v3.06 (POSADA & CRANDALL 1998, POSADA 2003. Following the recommendations of SOBER (2002) and POSADA & BUCKLEY (2004), we chose the best model for each dataset using the Akaike Information Criterion (AKAIKE 1974). We performed ML analyses in PAUP*4.0b10 (SWOFFORD 2002) and used an heuristic search strategy with stepwise addition of taxa, 10 random-sequence addition replicates, and treebisection-reconnection (TBR) branch swapping (SWOFFORD et al. 1996). For each of the datasets (Romania, Greece, and whole Balkans) phylogeny was inferred with and without an enforced molecular clock, to perform the Likelihood Ratio Test (LRT) (NEI & KUMAR 2000, POSADA 2003 with PAUP, to test the molecular clock hypothesis for COI. To calibrate the trees -thus to give some impression of the time frame of the inferred process of speciation -data from WILKE (2003) were applied. The ultrametric trees of Bythinella (with molecular clock enforced) were used for GMYC analysis (PONS et al. 2006).
This method uses a maximum likelihood approach to optimize the shift in the branching patterns of the gene tree from interspecific branches (assuming Yule's model: YULE 1924) to intraspecific branches (assuming neutral coalescent). It optimizes the ML value of a threshold, such that the nodes before the threshold are identified as speciation events, and the nodes beyond the threshold value are interpreted as reflecting coalescent processes. The method was applied to the ultrametric trees inferred with PAUP*4.0b10 (SWOFFORD 2002), and GMYC procedure was run, using the package R v.2.8.0 (R DE-VELOPMENT CORE TEAM 2008), with the extension APE v.2.6.2 (PARADIS et al. 2004, PARADIS 2006, applying the code provided by Dr. TIM BARRACLOUGH. The numbers of lineages and likelihoods were plotted against time (reflected by the number of substitutions per site), and a threshold between speciation and coalescence was fitted. A unique species/population split for a particular time (single threshold) may not reflect the real diversification for all the lineages, especially in large trees, thus we also performed analysis allowing multiple, independent thresholds over time across the tree (option multiple). Chi-square tests of significance of inferred clus-ters (null model assuming a uniform branching pattern) were performed. Support limits for the number of clusters were estimated from the likelihood surface for the time of transition between phases. For each of the three datasets, and each of the two GMYC models (single vs. multiple) the analysis was run three times.

RESULTS
In total, 63 COI sequences for the whole Balkans, 78 for Greece and 136 for Romania (Table 1) were analysed. The models inferred for each of the three trees with Akaike Information Criterion and ModelTest are presented in Table 1. Molecular clock hypothesis was rejected for the trees representing the Balkans and Romania, but not rejected for Greece (Table 1), in which the calibration point was presented ( Fig. 2).
PAUP* found 71 ultrametric trees for the Balkans, 490 for Greece, and 1,002 for Romania (Table 2). Strict consensus trees computed with PAUP* showed that for each data set the trees did not differ in deeper branches, before the threshold values were inferred. Nevertheless, in each set of trees the analysis was run on ten randomly chosen trees, with the same results for a given set of trees. Thus, the results of the GMYC analysis are presented on one tree for the Balkans (Fig. 1), one for Greece (Fig. 2), and one for Romania (Fig. 3).
The results of the GMYC analysis (Table 2) rejected the null hypothesis of a uniform branching pattern (p = 0.0000) for both single and multiple thresholds. The ML values of the GMYC model assuming a single and multiple threshold were similar ( Table 2). The likelihood values for the Balkans, even for the single threshold model (Fig. 4), suggested at least two threshold values. The inferred threshold time for the model assuming single threshold was the same (0.00248 substitutions per site) for all three trees ( Table 2). It has to be noted that for the Balkans, in one run another value (0.00496) was found.
For the GMYC model assuming multiple thresholds (Table 2, Fig. 4) the same threshold was found in each of the trees. However, yet another value (0.00095) was found for Greece, and two other values (0.00187 and 0.02303) were inferred for the Balkans.
The threshold values were plotted in the trees (Figs  1-3). The number of ML clusters for each tree was higher for the multiple threshold model (Table 2): 8, 9 and 35, respectively, vs. 6, 8 and 23 (Table 2). The highest numbers were found in the Romanian tree. For recognition of species more informative is the number of ML entities. This is the number of inferred lineages at presumably species level, including all the clusters as well as all the singletons. The number of ML entities (68) was the same for the single and multiple threshold model for the Greek tree (Table 2), slightly higher (55 vs. 53) for the multiple threshold model for the Romanian tree (

DISCUSSION
All the trees, following the assumptions of the technique, were fully dichotomous and ultrametric (PONS et al. 2006). However, as the clock hypothesis was rejected for the Romanian and whole Balkan trees, they were ultrametric formally, but this ultrametricity did not reflect real evolutionary processes. This obviously weakens the biological meaning of the results.
It is obvious that different substitution models may affect results of the GMYC technique (e.g. LIM et al. 2012). Table 1 shows that the models differ markedly between the trees. Unexpectedly, the threshold values inferred with the single threshold model were the same for all three phylogenies considered.
In general, a limitation of the method of PONS et al. (2006) is that it fits a single transition time between the pure-birth and coalescent phases. This may make the outcome sensitive to over-sampling of individual genotypes. In the present study the Romanian data set consisted of several haplotypes represented more than once in the data set. In fact, this comprised all sequences obtained in FALNIOWSKI et al. (2009b), and the tree might represent oversampling. However, the inferred threshold value was the same as in the case of the Greek dataset including all the haplotypes, each of them represented once. On the other hand, the threshold value is different for the Balkan tree which included mostly species represented by one sequence each, together with some species represented by some more sequences.
Sampling only a small number of populations may lead to artificial clustering within species when using the GMYC procedure (LOHSE 2009, PAPADOPOULOU et al. 2009c), but in our data as few as 12 Romanian populations sampled (FALNIOWSKI et al. 2009b) resulted in 23 (35) ML clusters; 29 Greek populations (FALNIOWSKI & SZAROWSKA 2011) in 8 (9) ML clusters; 10 populations from the Balkans (FALNIOWSKI et al. 2012) represented by more than one sequence, resulted in 6 (8) ML clusters. Thus, we cannot confirm LOHSE's opinion (LOHSE 2009). It is also clear that the ML threshold number need not be informative: it was highest in the Romanian tree, but the number of the ML entities was higher in the Greek tree. Surely, this reflects numerous sequences that appeared more than once in the Romanian dataset. However, where several sequences differ in, say, one substitution, the number of ML clusters will be higher as well. ZALDIVAR et al. (2010) observed that the multiple threshold GMYC model, even where it resulted in a lower number of barcode species, frequently considered the existence of two or more species within various clusters of sequences that evidently belonged to a single lineage. In our study the single and multiple threshold models resulted in approximate numbers of ML entities in the Greek Bythinella. In the case of the Romanian Bythinella the number inferred with the multiple threshold model approximated the number inferred with the single threshold model, but the confidence interval of the former was wider (Table 2). On the other hand, in the case of the Balkan Bythinella the number of ML entities inferred with the multiple threshold model was much lower than the one found with the single threshold model (27 vs. 38), and it was unexpectedly low.
GMYC models can accommodate singletons, but are thought to yield skewed results when too many are included (LIM et al. 2012). Fig. 1, the tree representing the Balkans, shows numerous singletons (22 in a tree consisting of 63 sequences). Indeed, the multiple threshold model resulted in some threshold values which were biologically unrealistic (Fig. 4). On the other hand, the threshold value inferred with the single threshold model was the same as for the other trees.
Despite all the ambiguities concerning the specieslevel taxonomy in Bythinella (e.g. GIUSTI & PEZZOLI 1977, 1980, FALNIOWSKI 1987, 1992, BICHAIN et al. 2007a, b, HAASE et al. 2007, there are data on morphology, biology, distribution etc. of the taxa of Bythinella considered in the present study (FALNIOWSKI et al. 2009a, b, c, FALNIOWSKI & SZAROWSKA 2011, FAL-NIOWSKI et al. 2012), and some species-level classification of those taxa has been proposed. Thus the results of the GMYC analysis could be compared with those data.
Obviously, a straightforward comparison of the trees is not possible. In the Balkan tree the Romanian Bythinella is represented by six singletons only, but the number of the ML entities (presumably species) in the Romanian tree is 38 (53 minus 15 singletons representing outgroup). Thus, in Romania, either the GMYC technique gave unrealistic results, as only six species were distinguished (FALNIOWSKI et al. 2009 b, c), or there really are many more species. Similarly, in the Greek tree 61 (68 minus 7 singletons representing outgroup) ML entities are represented by 12 sequences only (representing 10 presumed species). Thus, again, two interpretations are theoretically possible, but in the Romanian and Greek Bythinella the GMYC technique seems to overestimate the number of species.
In the Balkan tree the single threshold model resulted in 38 ML entities (27 in the multiple threshold one). Two singletons are outgroup, there are also three Bythinella species from outside of the Balkans [B. austriaca (Frauenfeld, 1857), B. compressa (Frauenfeld, 1857), B. pannonica (Frauenfeld, 1865)], and outside of the Balkans are B. austriaca ehrmanni Pax, 1938, and B. micherdzinskii Falniowski, 1980. From Slovenia are B. robiciana (Clessin, 1890) and B. schmidti (Küster, 1852). Six species are from Romania (B. radomani Falniowski, Szarowska et Sirbu, 2009, B. dacica Grossu, 1946, B. viseuiana Falniowski, Szarowska et Sirbu, 2009, B. grossui Falniowski, Szarowska et Sirbu, 2009, B. molcsanyi H. Wagner, 1941, and B. calimanica Falniowski, Szarowska et Sirbu, 2009 Glöer et Georgiev, 2011, and B. rhodopensis Glöer et Georgiev, 2009), B. dispersa Radoman, 1976, B. taraensis Glöer et Pešiae, 2010, B. pesterica Glöer, 2008, B. luteola Radoman, 1976, and B. nonveilleri Glöer, 2008. Thus the total number of species distinguished prior to this study is 31 (29 Bythinella), assuming that B. micherdzinskii and B. austriaca ehrmanni (subspecies level so far) are distinct species. The three sequences of B. pesterica most probably represent two species. B. dispersa, as well as B. charpentierii, geographically rather widely distributed, may also represent two species. The two sequences representing G8 from Greece represent rather two species (again, one from the southern, and the other from the northern part of the Peloponnese peninsula); it seems that yet another species occurs in the Kythira island. Thus, the total number of species distinguished prior to the GMYC analysis equals 36, which is very close to the number 38 inferred with the single threshold model of GMYC for the Balkan data set. Anyway, this total number clusters within the range given by the two GMYC models.