Abstract
Communication needs to be complex enough to be functional while minimizing learning and production costs. Recent work suggests that the vocalizations and gestures of some songbirds, cetaceans, and great apes may conform to linguistic laws that reflect this trade-off between efficiency and complexity. In studies of non-human communication, though, clustering signals into types cannot be done a priori, and decisions about the appropriate grain of analysis may affect statistical signals in the data. The aim of this study was to assess the evidence for language-like efficiency and structure in house finch (Haemorhous mexicanus) song across three levels of granularity in syllable clustering. The results show strong evidence for Zipf’s rank-frequency law, Zipf’s law of abbreviation, and Menzerath’s law. Additional analyses show that house finch songs have small-world structure, thought to reflect systematic structure in syntax, and the mutual information decay of sequences is consistent with a combination of Markovian and hierarchical processes. These statistical patterns are robust across three levels of granularity in syllable clustering, pointing to a limited form of scale invariance. In sum, it appears that house finch song has been shaped by pressure for efficiency, possibly to offset the costs of female preferences for complexity.
© National Park Service
Communication systems tend to be optimized for efficiency, or the benefit that they bestow relative to the costs of learning and producing them:
\[\begin{equation} \textrm{efficiency} \propto \frac{\sum{\pi}}{C_{L} + \sum{C_{P}}} \tag{1} \end{equation}\]
where \(\sum{\pi}\) is the lifetime benefit of the communication system (which includes perception by receivers), \(C_{L}\) is the cost of learning it, and \(\sum{C_{P}}\) is the lifetime cost of producing it (1). This framing is consistent with understandings of efficiency in linguistics (2): the famous “principle of least effort” can be thought of as a minimization of these costs (3).
On the other hand, simpler sounds that are easier to learn and produce vary across fewer dimensions and are less distinctive from one another (4). Signals need to be distinguishable to be functional (5), and complexity expands the possibilities within a signal space in a way that enhances functionality (e.g. to communicate concepts in language, or attract mates in birdsong) (6). But there is, of course, a limit. Simulations show that cultural evolution plateaus when complex behaviors are too costly (7). Evidence from linguistics, animal behavior, and cultural evolution suggests that communication systems generally evolve to balance this complexity-efficiency trade-off (1,2,8).
In the birdsong literature, complexity is usually described at one of three levels: syllables (individual sounds within songs), songs (sequences of syllables), or repertoires (full set of syllables or songs that a bird produces). Syllable and repertoire complexity are approximated with measures of production cost (e.g. frequency bandwidth, number of transitions in pitch (9)) and learning cost (e.g. diversity of unique syllables or songs (10)), respectively. Song complexity, on the other hand, is often characterized by measures of hierarchical or combinatorial structure (11,12). Structured signals are more compressible and learnable (2,13,14), which allows more information to “pass through the bottleneck” of cultural transmission (1,15). Syllable complexity increases both production and learning costs, whereas song complexity decreases learning costs. Both notions of complexity, however, should boost the benefits of communication by expanding the signal space. Syllable complexity allows for greater diversity in syllable types, and song complexity allows those syllable types to be combined into a wider array of song types.
The effect of this complexity-efficiency trade-off has been confirmed by experiments showing that artificial languages gain efficiency (16) and structure (17) as they are culturally transmitted and used by participants. Studies of continuous whistled communication systems that resemble birdsong have also detected increases in combinatorial structure (i.e. signals comprised of constituent parts that are recombined in different ways) (14). Birdsong experiments yield similar results. Zebra finches raised in isolation sing atypical songs, but over several generations they converge towards wild-type song by shortening the longest syllables and gaining spectral structure (18). Even foraging behaviors that are culturally transmitted become more efficient over time (19).
Pressure for compression and efficiency lead to regularities in organization that are so universal in human language that they are referred to as linguistic laws. The three most commonly-studied of these are Zipf’s rank-frequency law, Zipf’s law of abbreviation, and Menzerath’s law (20).
Zipf’s rank-frequency law predicts that the frequency of an item will be proportional to the inverse of its rank (i.e. first most common, second most common, etc.), a relationship that holds for most, if not all, of the world’s languages (21). In this study, I will focus on Mandelbrot’s more flexible parameterization of Zipf’s rank-frequency law (see Analysis) (22,23), which is its most common form in contemporary linguistics (21). Non-human animal communication systems, including bird and cetacean vocalizations (24–28) and lizard courtship displays (29), exhibit more redundancy than languages, leading to a convex rank-frequency relationship that is better captured by the Zipf-Mandelbrot distribution. Zipf and Mandelbrot both interpreted this rank-frequency law as resulting from a minimization of production and perception costs (3,22), and there are models showing that it can be derived from communicative efficiency (30–32). However, some of these models assume that signals map to objects or concepts which is not the case in birdsong, and other causes are still debated (21). Even though there is still uncertainty about its causes, the presence of Zipf’s rank-frequency law in non-human communication systems has been interpreted as evidence for both communicative efficiency and information content (33,34).
Zipf’s law of abbreviation predicts that common items will tend to be shorter than rare items because their production cost is lower (3). This negative correlation between frequency and duration is widespread in both written (35) and spoken language (36), and has also been observed in writing systems (37) and non-human communication like chimpanzee gestures (38) and bird and primate vocalizations (39–41). The explanation for Zipf’s law of abbreviation is simple: when common items have a lower production cost than rare items then the overall production cost of signals goes down (42).
Menzerath’s law predicts that longer sequences will be comprised of shorter items to balance production costs (43). This negative correlation between sequence length and item length is found at various levels of analysis in language (e.g. clauses in sentences, morphemes in words) (44–46). In non-human communication, Menzerath’s law appears to be present in chimpanzee gesture (38) and in the vocalizations of some primates and birds (40,41,47,48), including house finches (49). The explanation for Menzerath’s law is an extension of Zipf’s law of abbreviation: when production costs are increased in one domain (e.g. song sequence length) they should be decreased in another (e.g. syllable duration).
Beyond linguistic laws, there are two proxy measure of linguistic structure that have recently been investigated in non-human communication: small-world structure (50) and mutual information decay (12).
Small-world networks are highly clustered and have short average path lengths, so that it only takes a few steps to jump between any pair of nodes (think “six degrees of separation”) (50,51). These sorts of networks are quite common in biological and social systems (50,51), including language. For example, networks of neighboring words in sentences and co-occurring words in thesauruses exhibit small-world structure (52,53). Small-worldness is thought to reflect general systematic structure and recurrence, which in turn improve the compressibility and learnability of information (13,24,54). It is also hypothesized to reflect the emergence of syntactic structure over time (55), as both nightingales and children in more advanced stages of vocal development have greater small-world structure in their transition networks (56,57). Humpback whales (24) and several songbird species (26,56,58,59) all exhibit small-worldness in their song syntax to a similar degree.
Mutual information is a measure of dependency, or the amount of information that the presence of one thing has about another. In the context of language, past words provide information that can help predict future words above chance levels (i.e. “do you need anything from the” \(\to\) “store”). Intuitively, a word contains more information about the very next word than the one that comes after it (60), but correlations can be detected even at very long distances of hundreds of words (61). The rate at which the mutual information of words decreases with increasing distance can provide clues about underlying syntactic structure. Simple models of grammar that assume Markov processes (i.e. next word depends only on last few words) lead to exponential decay in mutual information with distance, whereas hierarchical processes (i.e. word order comes from nested syntactic rules that allow for long-range dependencies (62), common view going back to (63)) lead to power-law decay (64,65). In German, Italian, Japanese, and English, mutual information decay is exponential at short distances and fits a power-law at long distances, suggesting that sequences are generated by a combination of Markovian and hierarchical processes (12,66). This pattern has also been documented in four bird species, suggesting that sequential organization of song is more complex than previously thought (12).
In languages, the boundaries between signals are apparent to the humans who use them. In non-human communication systems, categorizing signals into types is its own challenge. Automated methods that classify signals into types based on their acoustic features (67–69) reduce subjectivity and enable people to work with much more data, but they also require tuning. For example, hierarchical clustering requires the user to choose an appropriate threshold below which the “branches” of the “tree” are combined into types (see Figure 1). This threshold effectively controls the granularity of clustering: higher values over-lump signals into fewer categories, while lower values over-split signals into more categories. The granularity of an analysis may influence the kinds of patterns that can detected. Philosophers of cultural evolution call this the “grain problem”—some statistical patterns may be more apparent at certain levels of analysis (70).
The aim of this study is to assess the evidence for language-like efficiency and structure in house finch (Haemorhous mexicanus) song across three levels of granularity in syllable clustering. By doing so, I hope to (1) identify which features of birdsong may be most subject to the complexity-efficiency trade-off, and (2) determine how clustering decisions affect the manifestation of linguistic laws in non-human communication systems. The data for this study come from a large corpus of house finch songs collected between 1975 and 2019 (9,71,72). House finch song is an excellent model for these questions for several reasons. First, house finch song is socially learned (73) and culturally evolves (71), and thus should be subject to information compression. Second, male house finches are more likely to learn complex syllables (9)—a content bias that may be an adaptation to female preferences for complexity. In house finches, males that sing longer songs at a faster rate are more attractive (74) and have higher reproductive performance (75), and courtship songs are longer and contain more syllable types (76). Because these measures of complexity relate to production and learning costs, they may increase pressure for efficiency in other domains such as duration. Finally, house finch song is known to be subject to efficiency constraints. When house finches tutored by canaries reproduce the trills of their foster parents they are slower and much shorter (73), and house finches increase the frequency of their vocalizations to minimize competition with the lower frequency sounds (77).
Unless otherwise stated, all models were fit in STAN using the brms package in R (78), with 20,000 iterations across four MCMC chains. Prior specifications, model diagnostics, and full model output for all analyses can be found in the Supplementary Information.
The recordings used in this study (2,724 songs from 331 individuals) were collected in 1975 (71), 2012 (72), and 2019 (9) in the New York metropolitan area, and analyzed by (9) using Luscinia (https://rflachlan.github.io/Luscinia/) (full recording and analysis details, and an example of an analyzed song, are in (9) and the Supplementary Information). In this population, males have a repertoire of ~35-40 syllable types, which they combine into 1-7 stereotyped song types (mean of 3.1) that are between 6-31 syllables long (mean of 12.1) (72). The main analysis was conducted using recordings from all three years, but the patterns are qualitatively the same when each year is analyzed separately (see Supplementary Information).
#load data from Youngblood & Lahti (2022)
load("data_models/youngblood_lahti_2022.RData")
Clustering was conducted using the log-transformed mean frequency traces (mean frequency in each 1 ms bin) for each syllable in every song (79). I used the mean frequency traces for clustering because they are time series, which means they capture more subtle variation than other features from Luscinia (e.g. overall bandwidth) and can be compared using dynamic time warping. First, the normalized distances between all of the syllables were calculated via dynamic time warping with a window size of 10 (10% of the average signal length) using the dtwclust package in R (80). A window size of 10% of the signal length is commonly used in speech processing research and seems to be a practical upper limit for many applications (81). Infinite distances (0.19% of cases) caused by comparisons of syllables with extreme signal length differences were assigned the maximum observed distance value. Next, hierarchical clustering and dynamic tree cut were used to cluster the syllables into types (82). Hierarchical clustering was conducted with the UPGMA method implemented in the fastcluster package in R (83), and dynamic tree cut was conducted with the dynamicTreeCut package in R (84).
For dynamic tree cut, I ran the hybrid algorithm with a minimum cluster size of 1 to maximize the representation of rare syllable types, and used the deep split parameter (\(DS\)) to determine the granularity of clustering (details of \(DS\) are in the Supplementary Information). I restricted this analysis to \(DS = \{2, 3, 4\}\) because values of \(DS = \{0, 1\}\) lead to extremely unrealistic underestimates of the number of syllables types. \(DS = 2\) leads to over-lumping of syllables into types (\(n = 114\); low granularity), \(DS = 3\) leads to a typical syllable classification (\(n = 596\)) (9,72,82), and \(DS = 4\) lead to over-splitting (\(n = 1646\); high granularity). The dendrogram and syllable classifications can be seen in Figure 1.
Automated clustering methods carry a risk of duration-bias—variation may be more detectable in longer syllables (85). A replication of Zipf’s law of abbreviation with manually-classified syllables (73), as well as an exploratory assessment of duration bias in the automated clustering algorithm, can be found in Supplementary Information.
#load library
library(dtwclust)
#log transform each syllable individually
data$meanfreq <- sapply(1:nrow(data), function(x){log(data$meanfreq[[x]])})
#z-score normalization across entire dataset
m <- mean(unlist(data$meanfreq))
sd <- sd(unlist(data$meanfreq))
data$meanfreq <- sapply(1:nrow(data), function(x){(data$meanfreq[[x]]-m)/sd})
#set window size
window <- round(mean(lengths(data$meanfreq))*0.10) #10
#calculate distances
distances <- proxy::dist(data$meanfreq, method = "dtw_basic", window.size = window, normalize = TRUE)
#strip everything from distances
attr(distances, "dimnames") <- NULL
attr(distances, "method") <- NULL
attr(distances, "call") <- NULL
class(distances) <- "matrix"
#replace infinite distances with the maximum value
distances[which(distances == Inf)] <- max(distances[-which(distances == Inf)])
#convert to format compatible with hclust
dtw_dist <- stats::as.dist(distances)
#run clustering
clustering <- fastcluster::hclust(dtw_dist, method = "average")
#save clustering
save(clustering, file = "data_models/clustering.RData")
#run hybrid tree cut
hybrid_cut <- parallel::mclapply(0:4, function(x){dynamicTreeCut::cutreeDynamic(dendro = clustering, distM = distances, method = "hybrid", minClusterSize = 1, deepSplit = x)}, mc.cores = 5)
#save hybrid tree cut
save(hybrid_cut, file = "data_models/hybrid_cut.RData")
Figure 1: The results of hierarchical clustering. The colored bars below the dendrogram correspond to the categories assigned to each syllable when deep split is 2 (over-lumping), 3 (baseline), and 4 (over-splitting).
Once syllable types were identified by hierarchical clustering, I followed (72) and (9) in calculating the following syllable-level acoustic features for further analysis: average frequency (Hz), minimum frequency (Hz), maximum frequency (Hz), bandwidth (Hz), duration (ms), concavity (changes in the sign of the slope of the mean frequency trace/ms) and excursion (cumulative absolute change in Hz/ms). Concavity and excursion are both indicators of syllable complexity (72,86). Concavity was calculated after smoothing the mean frequency trace using a polynomial spline with a smoothing parameter of 5 (9).
#load data from clustering
load("data_models/clustering.RData")
load("data_models/hybrid_cut.RData")
#calculate concavity
inds <- 1:nrow(data)
inds <- inds[-which(lengths(data$meanfreq) < 6)] #remove things that are too short to analyze but have a concavity of 0
concavity <- rep(0, nrow(data))
for(i in inds){
temp <- pspline::sm.spline(data$meanfreq[[i]], spar = 5)
temp <- diff(c(temp$ysmth)) #extract first derivative (slopes)
concav <- 0
for(j in 2:length(temp)){
if((temp[[j]] < 0 & temp[[j-1]] > 0) | (temp[[j]] > 0 & temp[[j-1]] < 0)){
concavity[i] <- concavity[i] + 1
}
}
concavity[i] <- concavity[i]/length(temp)
#concavity[i] <- concavity[i]
}
#calculate excursion
excursion <- rep(0, nrow(data))
for(i in inds){
temp <- data$meanfreq[[i]]
excursion[i] <- sum(sapply(2:length(temp), function(x){abs(temp[x]-temp[x-1])}))/length(temp)
}
#add to data table
data$concavity <- concavity
data$excursion <- excursion
#calculate other measures from mean frequency trace
data$average <- sapply(1:nrow(data), function(x){mean(data$meanfreq[[x]])})
data$min <- sapply(1:nrow(data), function(x){min(data$meanfreq[[x]])})
data$max <- sapply(1:nrow(data), function(x){max(data$meanfreq[[x]])})
data$bandwidth <- sapply(1:nrow(data), function(x){max(data$meanfreq[[x]])-min(data$meanfreq[[x]])})
data$duration <- sapply(1:nrow(data), function(x){length(data$meanfreq[[x]])})
#add syllable types to data file
data$cluster_0 <- as.numeric(hybrid_cut[[1]])
data$cluster_1 <- as.numeric(hybrid_cut[[2]])
data$cluster_2 <- as.numeric(hybrid_cut[[3]])
data$cluster_3 <- as.numeric(hybrid_cut[[4]])
data$cluster_4 <- as.numeric(hybrid_cut[[5]])
#add year
data$year <- NA
data$year[which(substring(data$individual, 1, 4) == "1975")] <- 1
data$year[which(substring(data$individual, 1, 4) == "2012")] <- 2
data$year[which(substring(data$individual, 1, 4) == "2019")] <- 3
#add counts
syl_freqs_0 <- as.data.frame(table(data$cluster_0))
syl_freqs_1 <- as.data.frame(table(data$cluster_1))
syl_freqs_2 <- as.data.frame(table(data$cluster_2))
syl_freqs_3 <- as.data.frame(table(data$cluster_3))
syl_freqs_4 <- as.data.frame(table(data$cluster_4))
data$count_0 <- sapply(1:nrow(data), function(x){syl_freqs_0$Freq[which(syl_freqs_0$Var1 == data$cluster_0[x])]})
data$count_1 <- sapply(1:nrow(data), function(x){syl_freqs_1$Freq[which(syl_freqs_1$Var1 == data$cluster_1[x])]})
data$count_2 <- sapply(1:nrow(data), function(x){syl_freqs_2$Freq[which(syl_freqs_2$Var1 == data$cluster_2[x])]})
data$count_3 <- sapply(1:nrow(data), function(x){syl_freqs_3$Freq[which(syl_freqs_3$Var1 == data$cluster_3[x])]})
data$count_4 <- sapply(1:nrow(data), function(x){syl_freqs_4$Freq[which(syl_freqs_4$Var1 == data$cluster_4[x])]})
#add song lengths
song_freq_table <- as.data.frame(table(data$song))
data$song_length <- sapply(1:nrow(data), function(x){song_freq_table$Freq[which(song_freq_table$Var1 == data$song[x])]})
#create data frame of syllable types
syl_types <- lapply(0:4, function(x){
col_id <- which(names(data) == paste0("cluster_", x))
temp <- data.frame(type = sort(unique(as.data.frame(data)[, col_id])), count = NA, concavity = NA, excursion = NA, average = NA, min = NA, max = NA, bandwidth = NA, duration = NA)
#get average measures for each syllable type
for(i in 1:nrow(temp)){
inds <- which(as.data.frame(data)[, col_id] == temp$type[i])
temp$count[i] <- length(inds)
temp$concavity[i] <- median(data$concavity[inds])
temp$excursion[i] <- median(data$excursion[inds])
temp$average[i] <- median(data$average[inds])
temp$min[i] <- median(data$min[inds])
temp$max[i] <- median(data$max[inds])
temp$bandwidth[i] <- median(data$bandwidth[inds])
temp$duration[i] <- median(data$duration[inds])
}
return(temp)
})
#save processed version of data
processed_data <- list(data = data, syl_types = syl_types)
save(processed_data, file = "data_models/processed_data.RData")
Mandelbrot’s generalization of Zipf’s rank-frequency law takes the following form (22,23):
\[\begin{equation} f(r) = \frac{c}{(r + \beta)^\alpha} \tag{2} \end{equation}\]
where \(f(r)\) is the normalized frequency at each rank \(r\), \(c\) is a normalization parameter, and \(\alpha\) and \(\beta\) are parameters that control slope and convexity (respectively). According to (87), the bounds of (2) are \(\alpha > 1\), \(\beta > -1\), and \(c > 1\). When \(\beta = 0\), this function simplifies to the original form of Zipf’s rank-frequency law: \(f(r) \propto 1/r^\alpha\).
I fit (2) to the rank-frequency distributions of the syllable classifications from each level of deep split in two batches. First, I fit all three parameters to approximate Mandelbrot’s versions of the rank-frequency law. Then, I set \(\beta = 0\) to approximate Zipf’s original formulation.
#load library
library(brms)
#load processed data and rename
load("data_models/processed_data.RData")
data <- processed_data$data
syl_types <- processed_data$syl_types
rm(processed_data)
#get frequencies for three levels of deep split
freqs_a <- as.numeric(sort(table(data$cluster_2), decreasing = TRUE))/nrow(data)
freqs_b <- as.numeric(sort(table(data$cluster_3), decreasing = TRUE))/nrow(data)
freqs_c <- as.numeric(sort(table(data$cluster_4), decreasing = TRUE))/nrow(data)
data_a <- data.frame(freq = freqs_a, rank = 1:length(freqs_a))
data_b <- data.frame(freq = freqs_b, rank = 1:length(freqs_b))
data_c <- data.frame(freq = freqs_c, rank = 1:length(freqs_c))
#set priors
zipf_priors <- prior(normal(0, 10), lb = 1, nlpar = "a") +
prior(normal(0, 10), lb = 0, nlpar = "c")
mand_priors <- prior(normal(0, 10), lb = 1, nlpar = "a") +
prior(normal(0, 10), lb = -1, nlpar = "b") +
prior(normal(0, 10), lb = 0, nlpar = "c")
#fit the zipf models
zipf_fit_a <- brm(bf(freq ~ c/(rank^a), a + c ~ 1, nl = TRUE),
data = data_a, prior = zipf_priors,
iter = 5000, cores = 4)
zipf_fit_b <- brm(bf(freq ~ c/(rank^a), a + c ~ 1, nl = TRUE),
data = data_b, prior = zipf_priors,
iter = 5000, cores = 4)
zipf_fit_c <- brm(bf(freq ~ c/(rank^a), a + c ~ 1, nl = TRUE),
data = data_c, prior = zipf_priors,
iter = 5000, cores = 4)
#fit the mandelbrot models
mand_fit_a <- brm(bf(freq ~ c/((rank + b)^a), a + b + c ~ 1, nl = TRUE),
data = data_a, prior = mand_priors,
iter = 5000, cores = 4)
mand_fit_b <- brm(bf(freq ~ c/((rank + b)^a), a + b + c ~ 1, nl = TRUE),
data = data_b, prior = mand_priors,
iter = 5000, cores = 4)
mand_fit_c <- brm(bf(freq ~ c/((rank + b)^a), a + b + c ~ 1, nl = TRUE),
data = data_c, prior = mand_priors,
iter = 5000, cores = 4)
#store and save all relevant output
zipf_rf_models <- list(data = list(ds_2 = data_a,
ds_3 = data_b,
ds_4 = data_c),
zipf = list(ds_2 = summary(zipf_fit_a)$fixed,
ds_3 = summary(zipf_fit_b)$fixed,
ds_4 = summary(zipf_fit_c)$fixed),
mand = list(ds_2 = summary(mand_fit_a)$fixed,
ds_3 = summary(mand_fit_b)$fixed,
ds_4 = summary(mand_fit_c)$fixed),
prior = prior_summary(mand_fit_a),
waic = data.frame(zipf = c(waic(zipf_fit_a)$waic, waic(zipf_fit_b)$waic, waic(zipf_fit_c)$waic),
mand = c(waic(mand_fit_a)$waic, waic(mand_fit_b)$waic, waic(mand_fit_c)$waic)),
r2 = c(bayes_R2(zipf_fit_a)[1], bayes_R2(zipf_fit_b)[1], bayes_R2(zipf_fit_c)[1],
bayes_R2(mand_fit_a)[1], bayes_R2(mand_fit_b)[1], bayes_R2(mand_fit_c)[1]))
save(zipf_rf_models, file = "data_models/zipf_rf_models.RData")
Figure 2: The relationship between rank (x-axis) and count (y-axis) at the intermediate granularity level (DS = 3, 596 syllable types). The blue and orange lines denote the expected distributions according to Zipf’s rank-frequency law (blue; a = 1.00, c = 0.08) and Mandelbrot’s extension of it (orange; a = 1.20, b = 5.69, c = 0.52).
As there is no established hypothesis test for Zipf’s rank-frequency law, I will simply report the goodness-of-fit (the norm in linguistics) and refer to it as “consistent” with the law when it is within the range reported for human languages (\(R^2\) > 0.8) (21,88).
The observed frequency distribution is consistent with the Zipf-Mandelbrot distribution at all three levels of granularity in syllable clustering (\(R^2 = \{0.995, 0.995, 0.995\}\) at \(DS = \{2, 3, 4\}\)), but is inconsistent with the the original form of Zipf’s rank frequency law at the two of the three levels of granularity (\(R^2 = \{0.945, 0.777, 0.523\}\)). Figure 2 shows the rank-frequency distribution at the intermediate granularity level (\(DS = 3\); 596 syllable types), while all three can be seen in the Supplementary Information. The distribution has a poorer fit to the rarer syllable types—a common pattern in human language (21) that is further investigated in the Supplementary Information. The Zipf-Mandelbrot distribution also outcompetes Zipf’s original form and has \(R^2 > 0.98\) when models are fit separately to the data from each year (1975, 2012, and 2019; see Supplementary Information).
Zipf’s law of abbreviation predicts that common items will be shorter in duration than rarer ones. Rather than focusing duration alone, I explored whether the frequency of syllables is negatively correlated with four different measures of production cost: duration (ms), bandwidth (Hz), concavity (changes in the sign of the slope of the mean frequency trace/ms) and excursion (cumulative absolute change in Hz/ms).
For each level of deep split and each measure of production cost, I constructed a log-normal model with the measure in question as the outcome variable, count as the predictor variable, and syllable type as a varying intercept. The alternative formulation, a Poisson model with count as the outcome variable, does not allow for the correct random effects structure (e.g. counts are identical across observations of the same syllable type).
#load packages
library(brms)
#load processed data and rename
load("data_models/processed_data.RData")
data <- processed_data$data
#set priors, a has lower bound on intercept (because outcome is only positive) while b does not
zipf_priors_a <- prior(normal(0, 4), lb = 0, class = "Intercept") +
prior(normal(0, 0.5), class = "b") +
prior(normal(0, 0.5), lb = 0, class = "sd") +
prior(normal(0, 0.5), lb = 0, class = "sigma")
zipf_priors_b <- prior(normal(0, 4), class = "Intercept") +
prior(normal(0, 0.5), class = "b") +
prior(normal(0, 0.5), lb = 0, class = "sd") +
prior(normal(0, 0.5), lb = 0, class = "sigma")
#run duration model across all three deep split values
duration_model_2 <- brm(log(duration) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
duration_model_3 <- brm(log(duration) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
duration_model_4 <- brm(log(duration) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
#run bandwidth model across all three deep split values
bandwidth_model_2 <- brm(log(bandwidth) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
bandwidth_model_3 <- brm(log(bandwidth) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
bandwidth_model_4 <- brm(log(bandwidth) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
#run concavity model across all three deep split values, with 0.0001 added to avoid log-transformation issues
concavity_model_2 <- brm(log(concavity + 0.0001) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
concavity_model_3 <- brm(log(concavity + 0.0001) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
concavity_model_4 <- brm(log(concavity + 0.0001) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
#run excursion model across all three deep split values, with 0.0001 added to avoid log-transformation issues
excursion_model_2 <- brm(log(excursion + 0.0001) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
excursion_model_3 <- brm(log(excursion + 0.0001) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
excursion_model_4 <- brm(log(excursion + 0.0001) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
#combine into object that contains model estimates, diagnostics, and prior specifications, and then save
zipf_models <- list(duration = list(summary(duration_model_2)$fixed, summary(duration_model_3)$fixed, summary(duration_model_4)$fixed, prior_summary(duration_model_2)),
bandwidth = list(summary(bandwidth_model_2)$fixed, summary(bandwidth_model_3)$fixed, summary(bandwidth_model_4)$fixed, prior_summary(bandwidth_model_2)),
concavity = list(summary(concavity_model_2)$fixed, summary(concavity_model_3)$fixed, summary(concavity_model_4)$fixed, prior_summary(concavity_model_2)),
excursion = list(summary(excursion_model_2)$fixed, summary(excursion_model_3)$fixed, summary(excursion_model_4)$fixed, prior_summary(excursion_model_2)))
save(zipf_models, file = "data_models/zipf_models.RData")
Model | DS | Est. | Err. | 2.5% | 97.5% |
|
---|---|---|---|---|---|---|
duration ~ count | 2 | -0.36 | 0.14 | -0.63 | -0.10 | * |
3 | -0.46 | 0.06 | -0.58 | -0.34 | * | |
4 | -0.42 | 0.03 | -0.48 | -0.35 | * | |
bandwidth ~ count | 2 | -0.55 | 0.11 | -0.75 | -0.34 | * |
3 | -0.68 | 0.06 | -0.79 | -0.57 | * | |
4 | -0.64 | 0.03 | -0.69 | -0.58 | * | |
concavity ~ count | 2 | -0.02 | 0.10 | -0.21 | 0.17 | |
3 | -0.06 | 0.04 | -0.15 | 0.02 | ||
4 | -0.07 | 0.03 | -0.12 | -0.02 | * | |
excursion ~ count | 2 | -0.26 | 0.07 | -0.41 | -0.11 | * |
3 | -0.33 | 0.04 | -0.41 | -0.25 | * | |
4 | -0.32 | 0.02 | -0.36 | -0.27 | * |
Duration, bandwidth, and excursion had strong negative effects on count at all three levels of granularity in syllable clustering (Table 1). Concavity had no effect on count at the first two levels of granularity, but had a very weak negative effect at the third level of granularity. These patterns are apparent in the plots of production costs and counts in the Supplementary Information. The results are qualitatively identical when the year of recording (1975, 2012, or 2019) is included as a varying intercept (see Supplementary Information). Several other robustness checks, including replications of this analysis using the rank-based method from the R package ZLAvian (85,89) and using manually-classified syllables from an experimental study in house finches (73), can be found in the Supplementary Information.
Menzerath’s law predicts that longer sequences will be comprised of smaller items. Importantly, Menzerath’s law is sometimes detected in random sequences from null models (90–92). There are two sorts of null models that make sense in this context: (1) random sequences with the same number of syllables as real songs, and (2) pseudorandom sequences that match the cumulative duration in time but can vary in the number of syllables. (49) interpret the latter as approximating simple motor constraints—Menzerath’s law resulting from efficiency in production alone—where stronger effects in the real data would indicate additional mechanisms (e.g. communicative efficiency). In this study, I compare the real data against each of these to assess both the presence of Menzerath’s law and whether it is beyond what would be expected from production constraints.
The production constraint model of (49) works as follows. For each iteration of the model, a pseudorandom sequence was produced for each real song in the dataset. Syllables were randomly sampled (with replacement) from the population until the duration of the random sequence exceeded the duration of the real song. If the difference between the duration of the random sequence and the real song was <50% of the duration of the final syllable, then the final syllable was kept in the sequence. Otherwise, it was removed. Each iteration of the model produces a set of random sequences with approximately the same distribution of durations as the real data.
To estimate the strength of Menzerath’s law, I constructed a log-normal model with syllable duration as the outcome variable, song length in number of syllables as the predictor variable, and the song as a varying intercept (93). This model was used to estimate the strength of the effect of song length on syllable duration in both the real data and 10 simulated datasets from both null models. The brm_multiple function from the brms package in R was used to fit a single model to the the 10 simulated datasets from each null model and produce a combined posterior distribution (78).
This analysis differs from (49) in two ways: (1) I use the actual duration of syllables rather than a single median value for each song, and (2) I compare the full posterior distributions rather than point estimates of effects. These decisions should yield more conservative conclusions. Note that syllable type is not incorporated into the modeling, so this is the only analysis that is not conducted across multiple levels of deep split.
#load packages
library(brms)
#load processed data
load("data_models/processed_data.RData")
data <- processed_data$data
syl_types <- processed_data$syl_types
rm(processed_data)
#get unique songs and durations
unique_songs <- unique(data$song)
individuals <- as.character(sapply(unique_songs, function(q){data$individual[which(data$song == q)][1]}))
song_durs <- as.numeric(sapply(unique_songs, function(x){sum(data$duration[which(data$song == x)])}))
#create 10 simulated datasets from the simple null model
simple_null_sims <- lapply(1:10, function(x){
temp <- data
temp$duration <- sample(temp$duration)
return(temp)
})
#create production constraint model based on james et al. (2021)
prod_null <- function(){
#get shuffled versions of data
sampled_inds <- sample(nrow(data))
sampled_durs <- data$duration[sampled_inds]
#double everything so the algorithm can loop back around if it needs to
sampled_inds <- c(sampled_inds, sampled_inds)
sampled_durs <- c(sampled_durs, sampled_durs)
#set starting points and sampling window for cumulative duration calculation
ind <- 1
m <- 1
window <- 200
#create empty sequences list to fill
sequences <- list()
#iterate through unique songs and create sequences
while(m <= length(unique_songs)){
#get cumulative durations from ind to the end of the sampling window
cum_durs <- cumsum(sampled_durs[ind:(ind + window)])
#set the targets as the value at which the cumulative durations surpass the desired duration
rel_position <- min(which(cum_durs > song_durs[m])) #relative position in the cumulative duration window
abs_position <- ind + rel_position - 1 #absolute position in the sampled vectors
#if syllable duration surpasses song duration at first position, do exception handling
if(rel_position == 1){
#add the single-unit sequence and iterate ind
sequences[[m]] <- data.frame(song = unique_songs[m], individual = individuals[m], duration = sampled_durs[ind], song_length = rel_position)
ind <- ind + 1
} else{
#if the difference between the cumulative duration at the target and the desired duration is less that half of the length of target
if(cum_durs[rel_position] - song_durs[m] < sampled_durs[abs_position]/2){
#add the sequence and iterate ind
sequences[[m]] <- data.frame(song = unique_songs[m], individual = individuals[m], duration = sampled_durs[ind:abs_position], song_length = rel_position)
ind <- abs_position + 1
} else{
#otherwise, add abbreviated sequence and iterate ind
sequences[[m]] <- data.frame(song = unique_songs[m], individual = individuals[m], duration = sampled_durs[ind:(abs_position - 1)], song_length = rel_position - 1)
ind <- abs_position
}
}
#iterate m
m <- m + 1
}
#return combined sequences
return(do.call(rbind, sequences))
}
#create 10 simulated datasets from the production constraint model
prod_null_sims <- lapply(1:10, function(x){prod_null()})
#set priors
menz_priors <- prior(normal(0, 3), lb = 0, class = "Intercept") +
prior(normal(0, 0.1), class = "b") +
prior(normal(0, 0.5), lb = 0, class = "sd") +
prior(normal(0, 0.5), lb = 0, class = "sigma")
#run the actual model on the data
model <- brm(log(duration) ~ scale(song_length) + (1|song), data = data, prior = menz_priors, iter = 5000, cores = 4)
#fit a single model across the 10 simple null datasets
simple_null_models <- brm_multiple(log(duration) ~ scale(song_length) + (1|song), data = simple_null_sims, prior = menz_priors, iter = 5000, cores = 7)
#fit a single model across the 10 production constraint datasets
prod_null_models <- brm_multiple(log(duration) ~ scale(song_length) + (1|song), data = prod_null_sims, prior = menz_priors, iter = 5000, cores = 7)
#combine into object that contains posterior samples, model estimates and diagnostics, and prior specifications, and then save
menz_models <- list(actual = list(estimates = summary(model)$fixed, samples = posterior_samples(model, pars = c("scalesong_length"))$b_scalesong_length),
simple_null = list(estimates = summary(simple_null_models)$fixed, samples = posterior_samples(simple_null_models, pars = c("scalesong_length"))$b_scalesong_length, rhats = simple_null_models$rhats),
prod_null = list(estimates = summary(prod_null_models)$fixed, samples = posterior_samples(prod_null_models, pars = c("scalesong_length"))$b_scalesong_length, rhats = prod_null_models$rhats),
prior = prior_summary(model))
save(menz_models, file = "data_models/menz_models.RData")
Figure 3: The left panel shows the relationship between song length in number of syllables (x-axis) and syllable duration (y-axis), with a best fit line from the fitted lognormal model. The right panel shows the posterior distribution of the effect of song length on duration for the real data (orange) compared to 10 simulated datasets from the simple (blue) and production (green) null models. The combined posteriors for the simulated datasets are based on 2,500 posterior samples from each of the 10 models.
The results of the log-normal model indicate that song length has a negative effect on syllable duration (Mean Estimate: -0.052; 95% CI: [-0.066, -0.038]) (left panel of Figure 3). The posterior distribution for this effect from the model fit to the actual data (orange) is more negative than the posterior distributions from 10 simulated datasets from the production constraint null model of (49) (green) and a simple random null model (blue), although there is notable overlap between the lower tail of the production constraint model and the upper tail of the actual data (3.5% of the combined distribution area; right panel of Figure 3). The negative effect of song length on syllable duration persists when the year of recording (1975, 2012, or 2019) is included as a varying intercept (see Supplementary Information).
The small-worldness index (\(SWI\)) is a ratio based on the clustering coefficient (\(C\)) and average path length (\(L\)) (51):
\[\begin{equation} SWI = \frac{C/C_{rand}}{L/L_{rand}} \tag{3} \end{equation}\]
where \(C_{rand}\) and \(L_{rand}\) are calculated from random networks of the same size as the real network. Values of \(SWI > 1\) are consistent with small-world structure (51), which is thought to reflect general systematic structure and thus compressibility (13,24).
In this study, I followed (24) in calculating \(SWI\) from the unweighted directed network of all syllable transitions in the population (lower panel of Figure 4A) to allow comparison with previous studies of non-human song. Importantly, if the frequency distribution of syllable types is very skewed, then random sequences could exhibit small-world structure simply because of clustering around common types. To avoid this confound, \(SWI\) was calculated 1,000 times from the real data and 1,000 times from random sequences with the same distribution of types.
#load library
library(igraph)
#load processed data
load("data_models/processed_data.RData")
data <- processed_data$data
syl_types <- processed_data$syl_types
rm(processed_data)
#get number of unique songs
unique_songs <- unique(data$song)
#set number of iterations
iters <- 1000
#compute swi at each value of deep split
small_world <- parallel::mclapply(0:4, function(ds){
#get column ID corresponding to deep split value
if(ds == 0){syls <- data$cluster_0}
if(ds == 1){syls <- data$cluster_1}
if(ds == 2){syls <- data$cluster_2}
if(ds == 3){syls <- data$cluster_3}
if(ds == 4){syls <- data$cluster_4}
#extract all of the structured directed sequences
directed_sequences <- do.call(rbind, lapply(1:length(unique_songs), function(i){
sequence <- syls[which(data$song == unique_songs[i])]
return(matrix(c(sequence[1:(length(sequence)-1)], sequence[2:length(sequence)]), ncol = 2))
}))
#convert to factor
directed_sequences <- matrix(as.numeric(factor(directed_sequences)), ncol = 2)
#extract "iters" sets of shuffled sequences for null models
shuffled_sequences <- lapply(1:iters, function(x){
shuffled_syls <- sample(syls)
do.call(rbind, lapply(1:length(unique_songs), function(i){
sequence <- shuffled_syls[which(data$song == unique_songs[i])]
return(matrix(c(sequence[1:(length(sequence)-1)], sequence[2:length(sequence)]), ncol = 2))
}))
})
#construct structured graph, ignoring edge weights
graph <- simplify(graph_from_edgelist(unique(directed_sequences), directed = TRUE))
#get small world indices for it
structured_vals <- sapply(1:iters, function(x){
temp <- sample_gnm(n = length(V(graph)), m = length(E(graph)), directed = TRUE)
(transitivity(graph)/transitivity(temp))/(mean_distance(graph)/mean_distance(temp))
})
#get small world indices for null models
null_vals <- sapply(1:iters, function(x){
target_sequence <- matrix(as.numeric(factor(shuffled_sequences[[x]])), ncol = 2)
null <- simplify(graph_from_edgelist(unique(target_sequence), directed = TRUE))
temp <- sample_gnm(n = length(V(null)), m = length(E(null)), directed = TRUE)
(transitivity(null)/transitivity(temp))/(mean_distance(null)/mean_distance(temp))
})
#return objects
return(list(structured = structured_vals, null = null_vals))
}, mc.cores = 5)
#save indices
save(small_world, file = "data_models/small_world.RData")
Figure 4: The left panel shows transition network between syllables types for a single male (#22 from 1975) above the global transition network for the entire dataset. The right panel shows the estimated small-worldness index calculated 1,000 times from the real data (orange) compared to pseudorandom sequences with the same frequency distribution of syllable types (blue). The dashed vertical line at 1 corresponds to the standard threshold for a network having small-world structure (51).
Figure 4B shows the distribution of \(SWI\) calculated from random sequences with the same distribution of types (orange) and the actual sequences (blue). At all three levels of granularity in syllable clustering, the observed small-worldness of the real songs is above both the standard threshold (\(SWI > 1\)) and the pseudorandom sequences. \(SWI > 1\) when computed separately from the data from each year (1975, 2012, and 2019; see Supplementary Information)
The rate at which the mutual information between items decays with distance reflects whether underlying syntactic structure is Markovian, hierarchical, or a composite of the two (12). Mutual information (\(MI\)) was calculated from pairs of sequential syllables separated by a certain distance, using the method of (12). \(MI\) was computed on concatenated sequences of songs from each individual. For example, if there are two individuals with two concatenated sequences of songs, \(a \to b \to c \to d\) and \(e \to f \to g\), and the target distance is 2, then the pairs used for \(MI\) calculated would be \(((a, c), (b, d), (e, g))\). \(MI\) would then be calculated using:
\[\begin{equation} \hat{I}(X, Y) = \hat{S}(X) + \hat{S}(Y) - \hat{S}(X, Y) \tag{4} \end{equation}\]
where \(\hat{I}\) denotes information and \(\hat{S}\) denotes entropy. \(X\) is the distribution of first syllables \((a, b, e)\), \(Y\) is the distribution of second syllables \((c, d, g)\), and \(XY\) is the joint distribution of pairs \(((a, c), (b, d), (e, g))\). \(\hat{S}\) was calculated using the method of (94) and (95) used by (65):
\[\begin{equation} \hat{S} = ln(N) - \frac{1}{N} \sum_{i=1}^{K} N_{i} \psi(N_{i}) \tag{5} \end{equation}\]
where \(N\) is the total number of tokens in the distribution, \(N_i\) is the number of tokens within each type \(i\) out of the total number of types \(K\), and \(\psi\) is the digamma function. The actual \(MI\) is then measured using:
\[\begin{equation} MI = \hat{I} - \hat{I}_{sh} \tag{6} \end{equation}\]
where \(\hat{I}_{sh}\) is the estimated lower bound of \(MI\) calculated from shuffled sequences, created by randomly permuting the concatenated sequences of individuals’ songs.
(12) and (66) simulated data from the hierarchical model of (65), the Markov model of (96), and their composite model in which Markov chains are nested within a larger hierarchical structure.
To determine whether the observed \(MI\) decay was consistent with a Markov process, hierarchical process, or both, I fit the following three decay models to the data:
\[\begin{equation} \textrm{exponential decay: } y = ae^{-xb} \tag{7} \end{equation}\] \[\begin{equation} \textrm{power-law decay: } y = cx^{d} \tag{8} \end{equation}\] \[\begin{equation} \textrm{composite decay: } y = ae^{-xb} + cx^{d} \tag{9} \end{equation}\]
where \(y\) is the estimated \(MI\) and \(x\) is the distance between syllables (12,66). (12) included an intercept \(f\) in all three models, but I removed it because it led to overfitting issues (e.g. exponential model with an extra “knee” after the initial decay), did not significantly improve model fit (\(\Delta WAIC < 2\)), and was not included in (65). I focused my analysis on distances of up to 100 syllables to enable easy comparison with (12). In the Supplementary Information, I followed (12) in comparing the fit of each model at increasing distances from 100 to 1,200 (the longest individual sequence is 1,219), and found that the composite model outperforms both the exponential and power-law models at all distances.
#load packages
library(brms)
#load data
load("data_models/processed_data.RData")
data <- processed_data$data
data$year <- strtrim(data$individual, 4)
#create function for the grassberger method of entropy estimation used by lin and tegmark (log is ln by default, which is what grassberger used originally)
s_calc <- function(vals){return(log(sum(vals)) - ((1/sum(vals))*sum(vals*digamma(vals))))}
#get unique songs and song lengths
unique_songs <- unique(data$song)
song_lengths <- as.numeric(sapply(unique_songs, function(x){length(which(data$song == x))}))
#get unique individuals and the lengths of their sequences
unique_inds <- unique(data$individual)
inds_lengths <- as.numeric(sapply(unique_inds, function(x){length(which(data$individual == x))}))
#set random seed
set.seed(12345)
#calculate mutual information across three levels of deep split
mi_data <- parallel::mclapply(1:3, function(i){
#use appropriate syllables
if(i == 1){data$syls <- data$cluster_2}
if(i == 2){data$syls <- data$cluster_3}
if(i == 3){data$syls <- data$cluster_4}
#compute actual mutual information curves
actual <- sapply(1:(max(inds_lengths) - 1), function(m){
#set m as the distance between syllables in the sequence
dist <- m
#store pairs within bouts
pairs <- do.call(rbind, lapply(unique_inds[which(inds_lengths > m)], function(r){
seq <- data$syls[which(data$individual == r)]
t(sapply(1:(length(seq)-dist), function(x){c(seq[x], seq[x+dist])}))
}))
#get frequencies of syllables in each column
n_x <- as.numeric(table(pairs[, 1]))
n_y <- as.numeric(table(pairs[, 2]))
#get frequencies of pairs in the data
pairs <- pairs[order(pairs[, 1], pairs[, 2]), ]
index <- !duplicated(pairs)
pair_freqs <- as.numeric(tapply(index, cumsum(index), length))
#calculate s for each
s_x <- s_calc(n_x)
s_y <- s_calc(n_y)
s_xy <- s_calc(pair_freqs)
#compute mutual information and return
return(s_x + s_y - s_xy)
})
#compute shuffled mutual information curves
shuffled <- sapply(1:(max(inds_lengths) - 1), function(m){
#set m as the distance between syllables in the sequence
dist <- m
#store pairs within bouts
pairs <- do.call(rbind, lapply(unique_inds[which(inds_lengths > m)], function(r){
seq <- sample(data$syls[which(data$individual == r)])
t(sapply(1:(length(seq)-dist), function(x){c(seq[x], seq[x+dist])}))
}))
#get frequencies of syllables in each column
n_x <- as.numeric(table(pairs[, 1]))
n_y <- as.numeric(table(pairs[, 2]))
#get frequencies of pairs in the data
pairs <- pairs[order(pairs[, 1], pairs[, 2]), ]
index <- !duplicated(pairs)
pair_freqs <- as.numeric(tapply(index, cumsum(index), length))
#calculate s for each
s_x <- s_calc(n_x)
s_y <- s_calc(n_y)
s_xy <- s_calc(pair_freqs)
#compute mutual information and return
return(s_x + s_y - s_xy)
})
#return mutual information decay, with shuffled baseline substracted from actual information
return(data.frame(dist = 1:length(actual), mi = c(actual - shuffled), mi_raw = actual, mi_base = shuffled))
}, mc.cores = 3)
#save output
save(mi_data, file = "data_models/mi_data.RData")
#load packages
library(brms)
#load data
load("data_models/mi_data.RData")
#set priors
exponential_priors <- prior(normal(0, 1), lb = 0, nlpar = "a") +
prior(normal(0, 1), lb = 0, nlpar = "b")
power_law_priors <- prior(normal(0, 1), lb = 0, nlpar = "c") +
prior(normal(0, 1), lb = 0, nlpar = "d")
composite_priors <- prior(normal(0, 1), lb = 0, nlpar = "a") +
prior(normal(0, 1), lb = 0, nlpar = "b") +
prior(normal(0, 1), lb = 0, nlpar = "c") +
prior(normal(0, 1), lb = 0, nlpar = "d")
#find the point at which the distances are no longer best fit by a composite model for each level of deep split
increments <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200)
waic_test <- lapply(1:3, function(h){
parallel::mclapply(increments, function(x){
exponential_test <- brm(bf(mi ~ a*exp((-dist)*b), a + b ~ 1, nl = TRUE),
data = mi_data[[h]][1:x, ], prior = exponential_priors, iter = 5000)
power_law_test <- brm(bf(mi ~ c*(dist^(-d)), c + d ~ 1, nl = TRUE),
data = mi_data[[h]][1:x, ], prior = power_law_priors, iter = 5000)
composite_test <- brm(bf(mi ~ a*exp((-dist)*b) + c*(dist^(-d)), a + b + c + d ~ 1, nl = TRUE),
data = mi_data[[h]][1:x, ], prior = composite_priors, iter = 5000)
waic_temp <- WAIC(exponential_test, power_law_test, composite_test)
return(c(waic_temp$loos$exponential_test$waic,
waic_temp$loos$power_law_test$waic,
waic_temp$loos$composite_test$waic))
}, mc.cores = 7)
})
#format and save
waic_test <- lapply(1:3, function(x){
temp <- data.frame(do.call(rbind, waic_test[[x]]))
colnames(temp) <- c("exponential", "power_law", "composite")
rownames(temp) <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200)
return(temp)
})
save(waic_test, file = "data_models/waic_test.RData")
#run models at all three levels of deep split
mi_models <- lapply(1:3, function(g){
#fit the exponential model
exponential_fit <- brm(bf(mi ~ a*exp((-dist)*b), a + b ~ 1, nl = TRUE),
data = mi_data[[g]][1:100, ], prior = exponential_priors,
iter = 5000, cores = 4)
#fit the power law model
power_law_fit <- brm(bf(mi ~ c*(dist^(-d)), c + d ~ 1, nl = TRUE),
data = mi_data[[g]][1:100, ], prior = power_law_priors,
iter = 5000, cores = 4)
#fit the composite model
composite_fit <- brm(bf(mi ~ a*exp((-dist)*b) + c*(dist^(-d)), a + b + c + d ~ 1, nl = TRUE),
data = mi_data[[g]][1:100, ], prior = composite_priors,
iter = 5000, cores = 4)
#calculate r2
r2 <- rbind(bayes_R2(exponential_fit), bayes_R2(power_law_fit), bayes_R2(composite_fit))
rownames(r2) <- c("exponential", "power_law", "composite")
#compute aic
waic <- WAIC(exponential_fit, power_law_fit, composite_fit)
waic <- c(waic$loos$exponential_fit$estimates[3, 1], waic$loos$power_law_fit$estimates[3, 1], waic$loos$composite_fit$estimates[3, 1])
#return data and model fits and aic
return(list(waic = waic,
r2 = r2,
priors = prior_summary(composite_fit),
exponential_fit = summary(exponential_fit)$fixed,
power_law_fit = summary(power_law_fit)$fixed,
composite_fit = summary(composite_fit)$fixed))
})
#save data and models
save(mi_models, file = "data_models/mi_models.RData")
#extract simulated values from sainburg et al.
model_fig <- xml2::read_xml("https://raw.githubusercontent.com/timsainb/ParallelsBirdsongLanguagePaper/master/figures/modelfig.svg")
xml2::xml_ns_strip(model_fig)
#convert raw values to actual values
raw_to_log <- function(y, range = c(281.188594, 6.838594), axis = c(10e-4, 10e1)){return(exp(scales::rescale(x = -c(range, y), from = -range, to = log(axis)))[-c(1:2)])}
#get heirarchical values
y_heirarchy <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_1']/g/use"), "y"))
y_heirarchy <- raw_to_log(y_heirarchy)
#get markov values
y_markov_a <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_2']/g/use"), "y"))[1:29]
y_markov_b <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_3']/g/use"), "y"))[1:12]
y_markov_a <- raw_to_log(y_markov_a)
y_markov_b <- raw_to_log(y_markov_b)
#get combined values
y_combined <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_4']/g/use"), "y"))
y_combined <- raw_to_log(y_combined)
#combine all simulated values into one data frame
mi_simulated <- data.frame(x = c(1:100, 1:12, 1:100),
y = c(y_heirarchy, y_markov_b, y_combined),
group = c(rep("heirarchy", 100), rep("markov", 12), rep("combined", 100)))
#save simulated values
save(mi_simulated, file = "data_models/mi_simulated.RData")
Figure 5: The top panel shows simulated mutual information decay curves from the (65) hierarchical model (left), the (96) Markov model (center), and the (12) composite model (right) (data and inspiration for diagrams from 12). The bottom panel shows the computed mutual information decay curves for the observed data at different levels of deep split (left, center, right). The solid line corresponds to the full composite model, while the dashed and dotted lines correspond to the exponential and power-law terms, respectively.
At all three levels of granularity in syllable clustering, the composite model has a lower \(WAIC\) and higher \(R^2\) than both the exponential and power-law models (see Supplementary Information), suggesting that mutual information decay in house finch song is more consistent with a combination of Markovian and hierarchical processes. The composite model also outcompetes both the exponential and power-law models when fit to mutual information decay curves computed separately for each year of recording (1975, 2012, and 2019; see Supplementary Information).
Interestingly, the transitions in the composite curves from Figure 5 roughly correspond to the average length of a house finch song (~12 syllables) (72). I reran the analysis with individual song sequences rather than song bouts and found that the exponential model outcompeted the composite model at \(DS = \{3, 4\}\) (see Supplementary Information). This suggests that individual song sequences may have Markovian structure within hierarchically-organized song bouts.
All three linguistic laws considered here are present in house finch song. Three out of the four measures of production cost, most importantly duration, are consistently and strongly negatively correlated with frequency, providing robust evidence for Zipf’s law of abbreviation. Menzerath’s law also found solid support, with a steeper negative relationship between song length and syllable duration than predicted by a model of production constraints (49). Together, these results show clear evidence for efficiency—syllables that are difficult to produce are less common and more likely to appear in shorter songs. Mandelbrot’s form of Zipf’s rank-frequency law provided a good fit to the data and has a more convex shape than the original form, which is consistent with studies of other non-human communication systems that have more redundancy than human language (24–29). I will limit my interpretation of this result, given ongoing debates about the cause of the rank-frequency law, but it is notable that the goodness-of-fit of the Zipf-Mandelbrot distribution to house finch song (\(R^2 > 0.99\)) is within the range of human language (21,88).
The two structural properties of language considered here are also found in house finch song. First, the syllable network has a small-world structure, characterized by high levels of clustering and low average path lengths, which is thought to reflect systematic structure and efficient recall in human language (52,53,97). The small-worldness index of house finch song is within the range seen in humpback whales (24) and other songbirds at the first two levels of granularity in syllable clustering (26,56–59) (\(SWI \sim 1.69-4.7\)), but is much higher when syllables are over-split into types (\(SWI \sim 10-11\)). Second, the decay in mutual information between syllables with increasing distance has a similar shape to that found in several human languages and songbird species (12), which is associated with Markov chains nested within a larger hierarchical structure. Historically, birdsong sequences were assumed to follow simple Markov chains (96,98,99). This result lends support to an emerging consensus that song bouts may be much more complex, containing long-range dependencies and hierarchical structures that resemble human language (11,12,100,101). In combination, the small-worldness and mutual information decay of house finch song sequences suggest that they exhibit the kind of systematic structure that is thought to maximize expressivity while reducing learning costs (13,24,54), making it easier for more information to “pass through the bottleneck” of social learning (1).
Notably, these patterns are consistent across three levels of granularity in syllable clustering. I have heard others studying the cultural evolution of birdsong refer to this as “fractal equivalency”—different resolutions of clustering should show similar forms of organization (102). A practical conclusion of this finding is that information theoretic measures may not be reliable indicators of clustering quality in non-human communication systems (103).
Additional discussion can be found in the Supplementary Information.
This work was inspired in large part by conversations with Olivier Morin, Alexey Koshevoy, and the attendees of the Communicative Efficiency Workshop hosted by the Max Planck Institute for Geoanthropology in April 2023.
The data and code for this study are available on Zenodo (https://doi.org/10.5281/zenodo.10689807).