Language-like efficiency and structure in house finch song

Mason Youngblood

Minds and Traditions Research Group, Max Planck Institute for Geoanthropology
Institute for Advanced Computational Science, Stony Brook University
masonyoungblood@gmail.com

Abstract

Communication needs to be complex enough to be functional while minimizing learning and production costs. Recent work suggests that the vocalizations and gestures of some songbirds, cetaceans, and great apes may conform to linguistic laws that reflect this trade-off between efficiency and complexity. In studies of non-human communication, though, clustering signals into types cannot be done a priori, and decisions about the appropriate grain of analysis may affect statistical signals in the data. The aim of this study was to assess the evidence for language-like efficiency and structure in house finch (Haemorhous mexicanus) song across three levels of granularity in syllable clustering. The results show strong evidence for Zipf’s rank-frequency law, Zipf’s law of abbreviation, and Menzerath’s law. Additional analyses show that house finch songs have small-world structure, thought to reflect systematic structure in syntax, and the mutual information decay of sequences is consistent with a combination of Markovian and hierarchical processes. These statistical patterns are robust across three levels of granularity in syllable clustering, pointing to a limited form of scale invariance. In sum, it appears that house finch song has been shaped by pressure for efficiency, possibly to offset the costs of female preferences for complexity.

1 Introduction

1.1 Efficiency & Complexity

Communication systems tend to be optimized for efficiency, or the benefit that they bestow relative to the costs of learning and producing them:

\[\begin{equation} \textrm{efficiency} \propto \frac{\sum{\pi}}{C_{L} + \sum{C_{P}}} \tag{1} \end{equation}\]

where \(\sum{\pi}\) is the lifetime benefit of the communication system (which includes perception by receivers), \(C_{L}\) is the cost of learning it, and \(\sum{C_{P}}\) is the lifetime cost of producing it (1). This framing is consistent with understandings of efficiency in linguistics (2): the famous “principle of least effort” can be thought of as a minimization of these costs (3).

On the other hand, simpler sounds that are easier to learn and produce vary across fewer dimensions and are less distinctive from one another (4). Signals need to be distinguishable to be functional (5), and complexity expands the possibilities within a signal space in a way that enhances functionality (e.g. to communicate concepts in language, or attract mates in birdsong) (6). But there is, of course, a limit. Simulations show that cultural evolution plateaus when complex behaviors are too costly (7). Evidence from linguistics, animal behavior, and cultural evolution suggests that communication systems generally evolve to balance this complexity-efficiency trade-off (1,2,8).

In the birdsong literature, complexity is usually described at one of three levels: syllables (individual sounds within songs), songs (sequences of syllables), or repertoires (full set of syllables or songs that a bird produces). Syllable and repertoire complexity are approximated with measures of production cost (e.g. frequency bandwidth, number of transitions in pitch (9)) and learning cost (e.g. diversity of unique syllables or songs (10)), respectively. Song complexity, on the other hand, is often characterized by measures of hierarchical or combinatorial structure (11,12). Structured signals are more compressible and learnable (2,13,14), which allows more information to “pass through the bottleneck” of cultural transmission (1,15). Syllable complexity increases both production and learning costs, whereas song complexity decreases learning costs. Both notions of complexity, however, should boost the benefits of communication by expanding the signal space. Syllable complexity allows for greater diversity in syllable types, and song complexity allows those syllable types to be combined into a wider array of song types.

The effect of this complexity-efficiency trade-off has been confirmed by experiments showing that artificial languages gain efficiency (16) and structure (17) as they are culturally transmitted and used by participants. Studies of continuous whistled communication systems that resemble birdsong have also detected increases in combinatorial structure (i.e. signals comprised of constituent parts that are recombined in different ways) (14). Birdsong experiments yield similar results. Zebra finches raised in isolation sing atypical songs, but over several generations they converge towards wild-type song by shortening the longest syllables and gaining spectral structure (18). Even foraging behaviors that are culturally transmitted become more efficient over time (19).

1.2 Linguistic Laws & Structure

Pressure for compression and efficiency lead to regularities in organization that are so universal in human language that they are referred to as linguistic laws. The three most commonly-studied of these are Zipf’s rank-frequency law, Zipf’s law of abbreviation, and Menzerath’s law (20).

Zipf’s rank-frequency law predicts that the frequency of an item will be proportional to the inverse of its rank (i.e. first most common, second most common, etc.), a relationship that holds for most, if not all, of the world’s languages (21). In this study, I will focus on Mandelbrot’s more flexible parameterization of Zipf’s rank-frequency law (see Analysis) (22,23), which is its most common form in contemporary linguistics (21). Non-human animal communication systems, including bird and cetacean vocalizations (24–28) and lizard courtship displays (29), exhibit more redundancy than languages, leading to a convex rank-frequency relationship that is better captured by the Zipf-Mandelbrot distribution. Zipf and Mandelbrot both interpreted this rank-frequency law as resulting from a minimization of production and perception costs (3,22), and there are models showing that it can be derived from communicative efficiency (30–32). However, some of these models assume that signals map to objects or concepts which is not the case in birdsong, and other causes are still debated (21). Even though there is still uncertainty about its causes, the presence of Zipf’s rank-frequency law in non-human communication systems has been interpreted as evidence for both communicative efficiency and information content (33,34).

Zipf’s law of abbreviation predicts that common items will tend to be shorter than rare items because their production cost is lower (3). This negative correlation between frequency and duration is widespread in both written (35) and spoken language (36), and has also been observed in writing systems (37) and non-human communication like chimpanzee gestures (38) and bird and primate vocalizations (39–41). The explanation for Zipf’s law of abbreviation is simple: when common items have a lower production cost than rare items then the overall production cost of signals goes down (42).

Menzerath’s law predicts that longer sequences will be comprised of shorter items to balance production costs (43). This negative correlation between sequence length and item length is found at various levels of analysis in language (e.g. clauses in sentences, morphemes in words) (44–46). In non-human communication, Menzerath’s law appears to be present in chimpanzee gesture (38) and in the vocalizations of some primates and birds (40,41,47,48), including house finches (49). The explanation for Menzerath’s law is an extension of Zipf’s law of abbreviation: when production costs are increased in one domain (e.g. song sequence length) they should be decreased in another (e.g. syllable duration).

Beyond linguistic laws, there are two proxy measure of linguistic structure that have recently been investigated in non-human communication: small-world structure (50) and mutual information decay (12).

Small-world networks are highly clustered and have short average path lengths, so that it only takes a few steps to jump between any pair of nodes (think “six degrees of separation”) (50,51). These sorts of networks are quite common in biological and social systems (50,51), including language. For example, networks of neighboring words in sentences and co-occurring words in thesauruses exhibit small-world structure (52,53). Small-worldness is thought to reflect general systematic structure and recurrence, which in turn improve the compressibility and learnability of information (13,24,54). It is also hypothesized to reflect the emergence of syntactic structure over time (55), as both nightingales and children in more advanced stages of vocal development have greater small-world structure in their transition networks (56,57). Humpback whales (24) and several songbird species (26,56,58,59) all exhibit small-worldness in their song syntax to a similar degree.

Mutual information is a measure of dependency, or the amount of information that the presence of one thing has about another. In the context of language, past words provide information that can help predict future words above chance levels (i.e. “do you need anything from the” \(\to\) “store”). Intuitively, a word contains more information about the very next word than the one that comes after it (60), but correlations can be detected even at very long distances of hundreds of words (61). The rate at which the mutual information of words decreases with increasing distance can provide clues about underlying syntactic structure. Simple models of grammar that assume Markov processes (i.e. next word depends only on last few words) lead to exponential decay in mutual information with distance, whereas hierarchical processes (i.e. word order comes from nested syntactic rules that allow for long-range dependencies (62), common view going back to (63)) lead to power-law decay (64,65). In German, Italian, Japanese, and English, mutual information decay is exponential at short distances and fits a power-law at long distances, suggesting that sequences are generated by a combination of Markovian and hierarchical processes (12,66). This pattern has also been documented in four bird species, suggesting that sequential organization of song is more complex than previously thought (12).

1.3 Granularity

In languages, the boundaries between signals are apparent to the humans who use them. In non-human communication systems, categorizing signals into types is its own challenge. Automated methods that classify signals into types based on their acoustic features (67–69) reduce subjectivity and enable people to work with much more data, but they also require tuning. For example, hierarchical clustering requires the user to choose an appropriate threshold below which the “branches” of the “tree” are combined into types (see Figure 1). This threshold effectively controls the granularity of clustering: higher values over-lump signals into fewer categories, while lower values over-split signals into more categories. The granularity of an analysis may influence the kinds of patterns that can detected. Philosophers of cultural evolution call this the “grain problem”—some statistical patterns may be more apparent at certain levels of analysis (70).

1.4 Aim & Model

The aim of this study is to assess the evidence for language-like efficiency and structure in house finch (Haemorhous mexicanus) song across three levels of granularity in syllable clustering. By doing so, I hope to (1) identify which features of birdsong may be most subject to the complexity-efficiency trade-off, and (2) determine how clustering decisions affect the manifestation of linguistic laws in non-human communication systems. The data for this study come from a large corpus of house finch songs collected between 1975 and 2019 (9,71,72). House finch song is an excellent model for these questions for several reasons. First, house finch song is socially learned (73) and culturally evolves (71), and thus should be subject to information compression. Second, male house finches are more likely to learn complex syllables (9)—a content bias that may be an adaptation to female preferences for complexity. In house finches, males that sing longer songs at a faster rate are more attractive (74) and have higher reproductive performance (75), and courtship songs are longer and contain more syllable types (76). Because these measures of complexity relate to production and learning costs, they may increase pressure for efficiency in other domains such as duration. Finally, house finch song is known to be subject to efficiency constraints. When house finches tutored by canaries reproduce the trills of their foster parents they are slower and much shorter (73), and house finches increase the frequency of their vocalizations to minimize competition with the lower frequency sounds (77).

2 Analysis

Unless otherwise stated, all models were fit in STAN using the brms package in R (78), with 20,000 iterations across four MCMC chains. Prior specifications, model diagnostics, and full model output for all analyses can be found in the Supplementary Information.

2.1 Data

The recordings used in this study (2,724 songs from 331 individuals) were collected in 1975 (71), 2012 (72), and 2019 (9) in the New York metropolitan area, and analyzed by (9) using Luscinia (https://rflachlan.github.io/Luscinia/) (full recording and analysis details, and an example of an analyzed song, are in (9) and the Supplementary Information). In this population, males have a repertoire of ~35-40 syllable types, which they combine into 1-7 stereotyped song types (mean of 3.1) that are between 6-31 syllables long (mean of 12.1) (72). The main analysis was conducted using recordings from all three years, but the patterns are qualitatively the same when each year is analyzed separately (see Supplementary Information).

#load data from Youngblood & Lahti (2022)
load("data_models/youngblood_lahti_2022.RData")

2.2 Clustering

Clustering was conducted using the log-transformed mean frequency traces (mean frequency in each 1 ms bin) for each syllable in every song (79). I used the mean frequency traces for clustering because they are time series, which means they capture more subtle variation than other features from Luscinia (e.g. overall bandwidth) and can be compared using dynamic time warping. First, the normalized distances between all of the syllables were calculated via dynamic time warping with a window size of 10 (10% of the average signal length) using the dtwclust package in R (80). A window size of 10% of the signal length is commonly used in speech processing research and seems to be a practical upper limit for many applications (81). Infinite distances (0.19% of cases) caused by comparisons of syllables with extreme signal length differences were assigned the maximum observed distance value. Next, hierarchical clustering and dynamic tree cut were used to cluster the syllables into types (82). Hierarchical clustering was conducted with the UPGMA method implemented in the fastcluster package in R (83), and dynamic tree cut was conducted with the dynamicTreeCut package in R (84).

For dynamic tree cut, I ran the hybrid algorithm with a minimum cluster size of 1 to maximize the representation of rare syllable types, and used the deep split parameter (\(DS\)) to determine the granularity of clustering (details of \(DS\) are in the Supplementary Information). I restricted this analysis to \(DS = \{2, 3, 4\}\) because values of \(DS = \{0, 1\}\) lead to extremely unrealistic underestimates of the number of syllables types. \(DS = 2\) leads to over-lumping of syllables into types (\(n = 114\); low granularity), \(DS = 3\) leads to a typical syllable classification (\(n = 596\)) (9,72,82), and \(DS = 4\) lead to over-splitting (\(n = 1646\); high granularity). The dendrogram and syllable classifications can be seen in Figure 1.

Automated clustering methods carry a risk of duration-bias—variation may be more detectable in longer syllables (85). A replication of Zipf’s law of abbreviation with manually-classified syllables (73), as well as an exploratory assessment of duration bias in the automated clustering algorithm, can be found in Supplementary Information.

#load library
library(dtwclust)

#log transform each syllable individually
data$meanfreq <- sapply(1:nrow(data), function(x){log(data$meanfreq[[x]])})

#z-score normalization across entire dataset
m <- mean(unlist(data$meanfreq))
sd <- sd(unlist(data$meanfreq))
data$meanfreq <- sapply(1:nrow(data), function(x){(data$meanfreq[[x]]-m)/sd})

#set window size
window <- round(mean(lengths(data$meanfreq))*0.10) #10

#calculate distances
distances <- proxy::dist(data$meanfreq, method = "dtw_basic", window.size = window, normalize = TRUE)

#strip everything from distances
attr(distances, "dimnames") <- NULL
attr(distances, "method") <- NULL
attr(distances, "call") <- NULL
class(distances) <- "matrix"

#replace infinite distances with the maximum value
distances[which(distances == Inf)] <- max(distances[-which(distances == Inf)])

#convert to format compatible with hclust
dtw_dist <- stats::as.dist(distances)

#run clustering
clustering <- fastcluster::hclust(dtw_dist, method = "average")

#save clustering
save(clustering, file = "data_models/clustering.RData")

#run hybrid tree cut
hybrid_cut <- parallel::mclapply(0:4, function(x){dynamicTreeCut::cutreeDynamic(dendro = clustering, distM = distances, method = "hybrid", minClusterSize = 1, deepSplit = x)}, mc.cores = 5)

#save hybrid tree cut
save(hybrid_cut, file = "data_models/hybrid_cut.RData")

The results of hierarchical clustering. The colored bars below the dendrogram correspond to the categories assigned to each syllable when deep split is 2 (over-lumping), 3 (baseline), and 4 (over-splitting).

Figure 1: The results of hierarchical clustering. The colored bars below the dendrogram correspond to the categories assigned to each syllable when deep split is 2 (over-lumping), 3 (baseline), and 4 (over-splitting).

Once syllable types were identified by hierarchical clustering, I followed (72) and (9) in calculating the following syllable-level acoustic features for further analysis: average frequency (Hz), minimum frequency (Hz), maximum frequency (Hz), bandwidth (Hz), duration (ms), concavity (changes in the sign of the slope of the mean frequency trace/ms) and excursion (cumulative absolute change in Hz/ms). Concavity and excursion are both indicators of syllable complexity (72,86). Concavity was calculated after smoothing the mean frequency trace using a polynomial spline with a smoothing parameter of 5 (9).

#load data from clustering
load("data_models/clustering.RData")
load("data_models/hybrid_cut.RData")

#calculate concavity
inds <- 1:nrow(data)
inds <- inds[-which(lengths(data$meanfreq) < 6)] #remove things that are too short to analyze but have a concavity of 0
concavity <- rep(0, nrow(data))
for(i in inds){
  temp <- pspline::sm.spline(data$meanfreq[[i]], spar = 5)
  temp <- diff(c(temp$ysmth)) #extract first derivative (slopes)
  concav <- 0
  for(j in 2:length(temp)){
    if((temp[[j]] < 0 & temp[[j-1]] > 0) | (temp[[j]] > 0 & temp[[j-1]] < 0)){
      concavity[i] <- concavity[i] + 1
    }
  }
  concavity[i] <- concavity[i]/length(temp)
  #concavity[i] <- concavity[i]
}

#calculate excursion
excursion <- rep(0, nrow(data))
for(i in inds){
  temp <- data$meanfreq[[i]]
  excursion[i] <- sum(sapply(2:length(temp), function(x){abs(temp[x]-temp[x-1])}))/length(temp)
}

#add to data table
data$concavity <- concavity
data$excursion <- excursion

#calculate other measures from mean frequency trace
data$average <- sapply(1:nrow(data), function(x){mean(data$meanfreq[[x]])})
data$min <- sapply(1:nrow(data), function(x){min(data$meanfreq[[x]])})
data$max <- sapply(1:nrow(data), function(x){max(data$meanfreq[[x]])})
data$bandwidth <- sapply(1:nrow(data), function(x){max(data$meanfreq[[x]])-min(data$meanfreq[[x]])}) 
data$duration <- sapply(1:nrow(data), function(x){length(data$meanfreq[[x]])}) 

#add syllable types to data file
data$cluster_0 <- as.numeric(hybrid_cut[[1]])
data$cluster_1 <- as.numeric(hybrid_cut[[2]])
data$cluster_2 <- as.numeric(hybrid_cut[[3]])
data$cluster_3 <- as.numeric(hybrid_cut[[4]])
data$cluster_4 <- as.numeric(hybrid_cut[[5]])

#add year
data$year <- NA
data$year[which(substring(data$individual, 1, 4) == "1975")] <- 1
data$year[which(substring(data$individual, 1, 4) == "2012")] <- 2
data$year[which(substring(data$individual, 1, 4) == "2019")] <- 3

#add counts
syl_freqs_0 <- as.data.frame(table(data$cluster_0))
syl_freqs_1 <- as.data.frame(table(data$cluster_1))
syl_freqs_2 <- as.data.frame(table(data$cluster_2))
syl_freqs_3 <- as.data.frame(table(data$cluster_3))
syl_freqs_4 <- as.data.frame(table(data$cluster_4))
data$count_0 <- sapply(1:nrow(data), function(x){syl_freqs_0$Freq[which(syl_freqs_0$Var1 == data$cluster_0[x])]})
data$count_1 <- sapply(1:nrow(data), function(x){syl_freqs_1$Freq[which(syl_freqs_1$Var1 == data$cluster_1[x])]})
data$count_2 <- sapply(1:nrow(data), function(x){syl_freqs_2$Freq[which(syl_freqs_2$Var1 == data$cluster_2[x])]})
data$count_3 <- sapply(1:nrow(data), function(x){syl_freqs_3$Freq[which(syl_freqs_3$Var1 == data$cluster_3[x])]})
data$count_4 <- sapply(1:nrow(data), function(x){syl_freqs_4$Freq[which(syl_freqs_4$Var1 == data$cluster_4[x])]})

#add song lengths
song_freq_table <- as.data.frame(table(data$song))
data$song_length <- sapply(1:nrow(data), function(x){song_freq_table$Freq[which(song_freq_table$Var1 == data$song[x])]})

#create data frame of syllable types
syl_types <- lapply(0:4, function(x){
  col_id <- which(names(data) == paste0("cluster_", x))
  temp <- data.frame(type = sort(unique(as.data.frame(data)[, col_id])), count = NA, concavity = NA, excursion = NA, average = NA, min = NA, max = NA, bandwidth = NA, duration = NA)
  
  #get average measures for each syllable type
  for(i in 1:nrow(temp)){
    inds <- which(as.data.frame(data)[, col_id] == temp$type[i])
    temp$count[i] <- length(inds)
    temp$concavity[i] <- median(data$concavity[inds])
    temp$excursion[i] <- median(data$excursion[inds])
    temp$average[i] <- median(data$average[inds])
    temp$min[i] <- median(data$min[inds])
    temp$max[i] <- median(data$max[inds])
    temp$bandwidth[i] <- median(data$bandwidth[inds])
    temp$duration[i] <- median(data$duration[inds])
  }
  
  return(temp)
})

#save processed version of data
processed_data <- list(data = data, syl_types = syl_types)
save(processed_data, file = "data_models/processed_data.RData")

2.3 Zipf’s Rank-Frequency Law

Mandelbrot’s generalization of Zipf’s rank-frequency law takes the following form (22,23):

\[\begin{equation} f(r) = \frac{c}{(r + \beta)^\alpha} \tag{2} \end{equation}\]

where \(f(r)\) is the normalized frequency at each rank \(r\), \(c\) is a normalization parameter, and \(\alpha\) and \(\beta\) are parameters that control slope and convexity (respectively). According to (87), the bounds of (2) are \(\alpha > 1\), \(\beta > -1\), and \(c > 1\). When \(\beta = 0\), this function simplifies to the original form of Zipf’s rank-frequency law: \(f(r) \propto 1/r^\alpha\).

I fit (2) to the rank-frequency distributions of the syllable classifications from each level of deep split in two batches. First, I fit all three parameters to approximate Mandelbrot’s versions of the rank-frequency law. Then, I set \(\beta = 0\) to approximate Zipf’s original formulation.

#load library
library(brms)

#load processed data and rename
load("data_models/processed_data.RData")
data <- processed_data$data
syl_types <- processed_data$syl_types
rm(processed_data)

#get frequencies for three levels of deep split
freqs_a <- as.numeric(sort(table(data$cluster_2), decreasing = TRUE))/nrow(data)
freqs_b <- as.numeric(sort(table(data$cluster_3), decreasing = TRUE))/nrow(data)
freqs_c <- as.numeric(sort(table(data$cluster_4), decreasing = TRUE))/nrow(data)
data_a <- data.frame(freq = freqs_a, rank = 1:length(freqs_a))
data_b <- data.frame(freq = freqs_b, rank = 1:length(freqs_b))
data_c <- data.frame(freq = freqs_c, rank = 1:length(freqs_c))

#set priors
zipf_priors <- prior(normal(0, 10), lb = 1, nlpar = "a") + 
  prior(normal(0, 10), lb = 0, nlpar = "c")
mand_priors <- prior(normal(0, 10), lb = 1, nlpar = "a") + 
  prior(normal(0, 10), lb = -1, nlpar = "b") + 
  prior(normal(0, 10), lb = 0, nlpar = "c")

#fit the zipf models
zipf_fit_a <- brm(bf(freq ~ c/(rank^a), a + c ~ 1, nl = TRUE),
                  data = data_a, prior = zipf_priors,
                  iter = 5000, cores = 4)
zipf_fit_b <- brm(bf(freq ~ c/(rank^a), a + c ~ 1, nl = TRUE),
                  data = data_b, prior = zipf_priors,
                  iter = 5000, cores = 4)
zipf_fit_c <- brm(bf(freq ~ c/(rank^a), a + c ~ 1, nl = TRUE),
                  data = data_c, prior = zipf_priors,
                  iter = 5000, cores = 4)

#fit the mandelbrot models
mand_fit_a <- brm(bf(freq ~ c/((rank + b)^a), a + b + c ~ 1, nl = TRUE),
                  data = data_a, prior = mand_priors,
                  iter = 5000, cores = 4)
mand_fit_b <- brm(bf(freq ~ c/((rank + b)^a), a + b + c ~ 1, nl = TRUE),
                  data = data_b, prior = mand_priors,
                  iter = 5000, cores = 4)
mand_fit_c <- brm(bf(freq ~ c/((rank + b)^a), a + b + c ~ 1, nl = TRUE),
                  data = data_c, prior = mand_priors,
                  iter = 5000, cores = 4)

#store and save all relevant output
zipf_rf_models <- list(data = list(ds_2 = data_a,
                                   ds_3 = data_b,
                                   ds_4 = data_c),
                       zipf = list(ds_2 = summary(zipf_fit_a)$fixed,
                                   ds_3 = summary(zipf_fit_b)$fixed, 
                                   ds_4 = summary(zipf_fit_c)$fixed), 
                       mand = list(ds_2 = summary(mand_fit_a)$fixed, 
                                   ds_3 = summary(mand_fit_b)$fixed, 
                                   ds_4 = summary(mand_fit_c)$fixed), 
                       prior = prior_summary(mand_fit_a),
                       waic = data.frame(zipf = c(waic(zipf_fit_a)$waic, waic(zipf_fit_b)$waic, waic(zipf_fit_c)$waic),
                                         mand = c(waic(mand_fit_a)$waic, waic(mand_fit_b)$waic, waic(mand_fit_c)$waic)),
                       r2 = c(bayes_R2(zipf_fit_a)[1], bayes_R2(zipf_fit_b)[1], bayes_R2(zipf_fit_c)[1], 
                              bayes_R2(mand_fit_a)[1], bayes_R2(mand_fit_b)[1], bayes_R2(mand_fit_c)[1]))
save(zipf_rf_models, file = "data_models/zipf_rf_models.RData")

The relationship between rank (x-axis) and count (y-axis) at the intermediate granularity level (DS = 3, 596 syllable types). The blue and orange lines denote the expected distributions according to Zipf's rank-frequency law (blue; a = 1.00, c = 0.08) and Mandelbrot's extension of it (orange; a = 1.20, b = 5.69, c = 0.52).

Figure 2: The relationship between rank (x-axis) and count (y-axis) at the intermediate granularity level (DS = 3, 596 syllable types). The blue and orange lines denote the expected distributions according to Zipf’s rank-frequency law (blue; a = 1.00, c = 0.08) and Mandelbrot’s extension of it (orange; a = 1.20, b = 5.69, c = 0.52).

As there is no established hypothesis test for Zipf’s rank-frequency law, I will simply report the goodness-of-fit (the norm in linguistics) and refer to it as “consistent” with the law when it is within the range reported for human languages (\(R^2\) > 0.8) (21,88).

The observed frequency distribution is consistent with the Zipf-Mandelbrot distribution at all three levels of granularity in syllable clustering (\(R^2 = \{0.995, 0.995, 0.995\}\) at \(DS = \{2, 3, 4\}\)), but is inconsistent with the the original form of Zipf’s rank frequency law at the two of the three levels of granularity (\(R^2 = \{0.945, 0.777, 0.523\}\)). Figure 2 shows the rank-frequency distribution at the intermediate granularity level (\(DS = 3\); 596 syllable types), while all three can be seen in the Supplementary Information. The distribution has a poorer fit to the rarer syllable types—a common pattern in human language (21) that is further investigated in the Supplementary Information. The Zipf-Mandelbrot distribution also outcompetes Zipf’s original form and has \(R^2 > 0.98\) when models are fit separately to the data from each year (1975, 2012, and 2019; see Supplementary Information).

2.4 Zipf’s Law of Abbreviation

Zipf’s law of abbreviation predicts that common items will be shorter in duration than rarer ones. Rather than focusing duration alone, I explored whether the frequency of syllables is negatively correlated with four different measures of production cost: duration (ms), bandwidth (Hz), concavity (changes in the sign of the slope of the mean frequency trace/ms) and excursion (cumulative absolute change in Hz/ms).

For each level of deep split and each measure of production cost, I constructed a log-normal model with the measure in question as the outcome variable, count as the predictor variable, and syllable type as a varying intercept. The alternative formulation, a Poisson model with count as the outcome variable, does not allow for the correct random effects structure (e.g. counts are identical across observations of the same syllable type).

#load packages
library(brms)

#load processed data and rename
load("data_models/processed_data.RData")
data <- processed_data$data

#set priors, a has lower bound on intercept (because outcome is only positive) while b does not
zipf_priors_a <- prior(normal(0, 4), lb = 0, class = "Intercept") + 
  prior(normal(0, 0.5), class = "b") + 
  prior(normal(0, 0.5), lb = 0, class = "sd") + 
  prior(normal(0, 0.5), lb = 0, class = "sigma")
zipf_priors_b <- prior(normal(0, 4), class = "Intercept") + 
  prior(normal(0, 0.5), class = "b") + 
  prior(normal(0, 0.5), lb = 0, class = "sd") + 
  prior(normal(0, 0.5), lb = 0, class = "sigma")

#run duration model across all three deep split values
duration_model_2 <- brm(log(duration) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
duration_model_3 <- brm(log(duration) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
duration_model_4 <- brm(log(duration) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)

#run bandwidth model across all three deep split values
bandwidth_model_2 <- brm(log(bandwidth) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
bandwidth_model_3 <- brm(log(bandwidth) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)
bandwidth_model_4 <- brm(log(bandwidth) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_a, iter = 5000, cores = 4)

#run concavity model across all three deep split values, with 0.0001 added to avoid log-transformation issues
concavity_model_2 <- brm(log(concavity + 0.0001) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
concavity_model_3 <- brm(log(concavity + 0.0001) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
concavity_model_4 <- brm(log(concavity + 0.0001) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)

#run excursion model across all three deep split values, with 0.0001 added to avoid log-transformation issues
excursion_model_2 <- brm(log(excursion + 0.0001) ~ scale(count_2) + (1|cluster_2), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
excursion_model_3 <- brm(log(excursion + 0.0001) ~ scale(count_3) + (1|cluster_3), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)
excursion_model_4 <- brm(log(excursion + 0.0001) ~ scale(count_4) + (1|cluster_4), data = data, prior = zipf_priors_b, iter = 5000, cores = 4)

#combine into object that contains model estimates, diagnostics, and prior specifications, and then save
zipf_models <- list(duration = list(summary(duration_model_2)$fixed, summary(duration_model_3)$fixed, summary(duration_model_4)$fixed, prior_summary(duration_model_2)),
                    bandwidth = list(summary(bandwidth_model_2)$fixed, summary(bandwidth_model_3)$fixed, summary(bandwidth_model_4)$fixed, prior_summary(bandwidth_model_2)),
                    concavity = list(summary(concavity_model_2)$fixed, summary(concavity_model_3)$fixed, summary(concavity_model_4)$fixed, prior_summary(concavity_model_2)),
                    excursion = list(summary(excursion_model_2)$fixed, summary(excursion_model_3)$fixed, summary(excursion_model_4)$fixed, prior_summary(excursion_model_2)))
save(zipf_models, file = "data_models/zipf_models.RData")

Table 1: The estimated effect of count on each measure of production cost, using the syllable classifications from each level of deep split. 95% credible intervals that do not overlap with 0 are marked with an asterisk.
Model	DS	Est.	Err.	2.5%	97.5%
duration ~ count	2	-0.36	0.14	-0.63	-0.10	*
	3	-0.46	0.06	-0.58	-0.34	*
	4	-0.42	0.03	-0.48	-0.35	*
bandwidth ~ count	2	-0.55	0.11	-0.75	-0.34	*
	3	-0.68	0.06	-0.79	-0.57	*
	4	-0.64	0.03	-0.69	-0.58	*
concavity ~ count	2	-0.02	0.10	-0.21	0.17
	3	-0.06	0.04	-0.15	0.02
	4	-0.07	0.03	-0.12	-0.02	*
excursion ~ count	2	-0.26	0.07	-0.41	-0.11	*
	3	-0.33	0.04	-0.41	-0.25	*
	4	-0.32	0.02	-0.36	-0.27	*

Duration, bandwidth, and excursion had strong negative effects on count at all three levels of granularity in syllable clustering (Table 1). Concavity had no effect on count at the first two levels of granularity, but had a very weak negative effect at the third level of granularity. These patterns are apparent in the plots of production costs and counts in the Supplementary Information. The results are qualitatively identical when the year of recording (1975, 2012, or 2019) is included as a varying intercept (see Supplementary Information). Several other robustness checks, including replications of this analysis using the rank-based method from the R package ZLAvian (85,89) and using manually-classified syllables from an experimental study in house finches (73), can be found in the Supplementary Information.

2.5 Menzerath’s Law

Menzerath’s law predicts that longer sequences will be comprised of smaller items. Importantly, Menzerath’s law is sometimes detected in random sequences from null models (90–92). There are two sorts of null models that make sense in this context: (1) random sequences with the same number of syllables as real songs, and (2) pseudorandom sequences that match the cumulative duration in time but can vary in the number of syllables. (49) interpret the latter as approximating simple motor constraints—Menzerath’s law resulting from efficiency in production alone—where stronger effects in the real data would indicate additional mechanisms (e.g. communicative efficiency). In this study, I compare the real data against each of these to assess both the presence of Menzerath’s law and whether it is beyond what would be expected from production constraints.

The production constraint model of (49) works as follows. For each iteration of the model, a pseudorandom sequence was produced for each real song in the dataset. Syllables were randomly sampled (with replacement) from the population until the duration of the random sequence exceeded the duration of the real song. If the difference between the duration of the random sequence and the real song was <50% of the duration of the final syllable, then the final syllable was kept in the sequence. Otherwise, it was removed. Each iteration of the model produces a set of random sequences with approximately the same distribution of durations as the real data.

To estimate the strength of Menzerath’s law, I constructed a log-normal model with syllable duration as the outcome variable, song length in number of syllables as the predictor variable, and the song as a varying intercept (93). This model was used to estimate the strength of the effect of song length on syllable duration in both the real data and 10 simulated datasets from both null models. The brm_multiple function from the brms package in R was used to fit a single model to the the 10 simulated datasets from each null model and produce a combined posterior distribution (78).

This analysis differs from (49) in two ways: (1) I use the actual duration of syllables rather than a single median value for each song, and (2) I compare the full posterior distributions rather than point estimates of effects. These decisions should yield more conservative conclusions. Note that syllable type is not incorporated into the modeling, so this is the only analysis that is not conducted across multiple levels of deep split.

#load packages
library(brms)

#load processed data
load("data_models/processed_data.RData")
data <- processed_data$data
syl_types <- processed_data$syl_types
rm(processed_data)

#get unique songs and durations
unique_songs <- unique(data$song)
individuals <- as.character(sapply(unique_songs, function(q){data$individual[which(data$song == q)][1]}))
song_durs <- as.numeric(sapply(unique_songs, function(x){sum(data$duration[which(data$song == x)])}))

#create 10 simulated datasets from the simple null model
simple_null_sims <- lapply(1:10, function(x){
  temp <- data
  temp$duration <- sample(temp$duration)
  return(temp)
})

#create production constraint model based on james et al. (2021)
prod_null <- function(){
  #get shuffled versions of data
  sampled_inds <- sample(nrow(data))
  sampled_durs <- data$duration[sampled_inds]

  #double everything so the algorithm can loop back around if it needs to
  sampled_inds <- c(sampled_inds, sampled_inds)
  sampled_durs <- c(sampled_durs, sampled_durs)

  #set starting points and sampling window for cumulative duration calculation
  ind <- 1
  m <- 1
  window <- 200
  
  #create empty sequences list to fill
  sequences <- list()
  
  #iterate through unique songs and create sequences
  while(m <= length(unique_songs)){
    #get cumulative durations from ind to the end of the sampling window
    cum_durs <- cumsum(sampled_durs[ind:(ind + window)])
    
    #set the targets as the value at which the cumulative durations surpass the desired duration
    rel_position <- min(which(cum_durs > song_durs[m])) #relative position in the cumulative duration window
    abs_position <- ind + rel_position - 1 #absolute position in the sampled vectors
    
    #if syllable duration surpasses song duration at first position, do exception handling
    if(rel_position == 1){
      #add the single-unit sequence and iterate ind
      sequences[[m]] <- data.frame(song = unique_songs[m], individual = individuals[m], duration = sampled_durs[ind], song_length = rel_position)
      ind <- ind + 1
    } else{
      #if the difference between the cumulative duration at the target and the desired duration is less that half of the length of target
      if(cum_durs[rel_position] - song_durs[m] < sampled_durs[abs_position]/2){
        #add the sequence and iterate ind
        sequences[[m]] <- data.frame(song = unique_songs[m], individual = individuals[m], duration = sampled_durs[ind:abs_position], song_length = rel_position)
        ind <- abs_position + 1
      } else{
        #otherwise, add abbreviated sequence and iterate ind
        sequences[[m]] <- data.frame(song = unique_songs[m], individual = individuals[m], duration = sampled_durs[ind:(abs_position - 1)], song_length = rel_position - 1)
        ind <- abs_position
      }
    }
    
    #iterate m
    m <- m + 1
  }
  
  #return combined sequences
  return(do.call(rbind, sequences))
}

#create 10 simulated datasets from the production constraint model
prod_null_sims <- lapply(1:10, function(x){prod_null()})

#set priors
menz_priors <- prior(normal(0, 3), lb = 0, class = "Intercept") + 
  prior(normal(0, 0.1), class = "b") + 
  prior(normal(0, 0.5), lb = 0, class = "sd") + 
  prior(normal(0, 0.5), lb = 0, class = "sigma")

#run the actual model on the data
model <- brm(log(duration) ~ scale(song_length) + (1|song), data = data, prior = menz_priors, iter = 5000, cores = 4)

#fit a single model across the 10 simple null datasets
simple_null_models <- brm_multiple(log(duration) ~ scale(song_length) + (1|song), data = simple_null_sims, prior = menz_priors, iter = 5000, cores = 7)

#fit a single model across the 10 production constraint datasets
prod_null_models <- brm_multiple(log(duration) ~ scale(song_length) + (1|song), data = prod_null_sims, prior = menz_priors, iter = 5000, cores = 7)

#combine into object that contains posterior samples, model estimates and diagnostics, and prior specifications, and then save
menz_models <- list(actual = list(estimates = summary(model)$fixed, samples = posterior_samples(model, pars = c("scalesong_length"))$b_scalesong_length), 
                    simple_null = list(estimates = summary(simple_null_models)$fixed, samples = posterior_samples(simple_null_models, pars = c("scalesong_length"))$b_scalesong_length, rhats = simple_null_models$rhats), 
                    prod_null = list(estimates = summary(prod_null_models)$fixed, samples = posterior_samples(prod_null_models, pars = c("scalesong_length"))$b_scalesong_length, rhats = prod_null_models$rhats), 
                    prior = prior_summary(model))
save(menz_models, file = "data_models/menz_models.RData")

The left panel shows the relationship between song length in number of syllables (x-axis) and syllable duration (y-axis), with a best fit line from the fitted lognormal model. The right panel shows the posterior distribution of the effect of song length on duration for the real data (orange) compared to 10 simulated datasets from the simple (blue) and production (green) null models. The combined posteriors for the simulated datasets are based on 2,500 posterior samples from each of the 10 models.

Figure 3: The left panel shows the relationship between song length in number of syllables (x-axis) and syllable duration (y-axis), with a best fit line from the fitted lognormal model. The right panel shows the posterior distribution of the effect of song length on duration for the real data (orange) compared to 10 simulated datasets from the simple (blue) and production (green) null models. The combined posteriors for the simulated datasets are based on 2,500 posterior samples from each of the 10 models.

The results of the log-normal model indicate that song length has a negative effect on syllable duration (Mean Estimate: -0.052; 95% CI: [-0.066, -0.038]) (left panel of Figure 3). The posterior distribution for this effect from the model fit to the actual data (orange) is more negative than the posterior distributions from 10 simulated datasets from the production constraint null model of (49) (green) and a simple random null model (blue), although there is notable overlap between the lower tail of the production constraint model and the upper tail of the actual data (3.5% of the combined distribution area; right panel of Figure 3). The negative effect of song length on syllable duration persists when the year of recording (1975, 2012, or 2019) is included as a varying intercept (see Supplementary Information).

2.6 Small-Worldness Index

The small-worldness index (\(SWI\)) is a ratio based on the clustering coefficient (\(C\)) and average path length (\(L\)) (51):

\[\begin{equation} SWI = \frac{C/C_{rand}}{L/L_{rand}} \tag{3} \end{equation}\]

where \(C_{rand}\) and \(L_{rand}\) are calculated from random networks of the same size as the real network. Values of \(SWI > 1\) are consistent with small-world structure (51), which is thought to reflect general systematic structure and thus compressibility (13,24).

In this study, I followed (24) in calculating \(SWI\) from the unweighted directed network of all syllable transitions in the population (lower panel of Figure 4A) to allow comparison with previous studies of non-human song. Importantly, if the frequency distribution of syllable types is very skewed, then random sequences could exhibit small-world structure simply because of clustering around common types. To avoid this confound, \(SWI\) was calculated 1,000 times from the real data and 1,000 times from random sequences with the same distribution of types.

#load library
library(igraph)

#load processed data
load("data_models/processed_data.RData")
data <- processed_data$data
syl_types <- processed_data$syl_types
rm(processed_data)

#get number of unique songs
unique_songs <- unique(data$song)

#set number of iterations
iters <- 1000

#compute swi at each value of deep split
small_world <- parallel::mclapply(0:4, function(ds){
  #get column ID corresponding to deep split value
  if(ds == 0){syls <- data$cluster_0}
  if(ds == 1){syls <- data$cluster_1}
  if(ds == 2){syls <- data$cluster_2}
  if(ds == 3){syls <- data$cluster_3}
  if(ds == 4){syls <- data$cluster_4}

  #extract all of the structured directed sequences
  directed_sequences <- do.call(rbind, lapply(1:length(unique_songs), function(i){
    sequence <- syls[which(data$song == unique_songs[i])]
    return(matrix(c(sequence[1:(length(sequence)-1)], sequence[2:length(sequence)]), ncol = 2))
  }))

  #convert to factor
  directed_sequences <- matrix(as.numeric(factor(directed_sequences)), ncol = 2)

  #extract "iters" sets of shuffled sequences for null models
  shuffled_sequences <- lapply(1:iters, function(x){
    shuffled_syls <- sample(syls)
    do.call(rbind, lapply(1:length(unique_songs), function(i){
      sequence <- shuffled_syls[which(data$song == unique_songs[i])]
      return(matrix(c(sequence[1:(length(sequence)-1)], sequence[2:length(sequence)]), ncol = 2))
    }))
  })

  #construct structured graph, ignoring edge weights
  graph <- simplify(graph_from_edgelist(unique(directed_sequences), directed = TRUE))

  #get small world indices for it
  structured_vals <- sapply(1:iters, function(x){
    temp <- sample_gnm(n = length(V(graph)), m = length(E(graph)), directed = TRUE)
    (transitivity(graph)/transitivity(temp))/(mean_distance(graph)/mean_distance(temp))
  })

  #get small world indices for null models
  null_vals <- sapply(1:iters, function(x){
    target_sequence <- matrix(as.numeric(factor(shuffled_sequences[[x]])), ncol = 2)
    null <- simplify(graph_from_edgelist(unique(target_sequence), directed = TRUE))
    temp <- sample_gnm(n = length(V(null)), m = length(E(null)), directed = TRUE)
    (transitivity(null)/transitivity(temp))/(mean_distance(null)/mean_distance(temp))
  })

  #return objects
  return(list(structured = structured_vals, null = null_vals))
}, mc.cores = 5)

#save indices
save(small_world, file = "data_models/small_world.RData")

The left panel shows transition network between syllables types for a single male (#22 from 1975) above the global transition network for the entire dataset. The right panel shows the estimated small-worldness index calculated 1,000 times from the real data (orange) compared to pseudorandom sequences with the same frequency distribution of syllable types (blue). The dashed vertical line at 1 corresponds to the standard threshold for a network having small-world structure [@humphriesNetworkSmallworldnessQuantitative2008].

Figure 4: The left panel shows transition network between syllables types for a single male (#22 from 1975) above the global transition network for the entire dataset. The right panel shows the estimated small-worldness index calculated 1,000 times from the real data (orange) compared to pseudorandom sequences with the same frequency distribution of syllable types (blue). The dashed vertical line at 1 corresponds to the standard threshold for a network having small-world structure (51).

Figure 4B shows the distribution of \(SWI\) calculated from random sequences with the same distribution of types (orange) and the actual sequences (blue). At all three levels of granularity in syllable clustering, the observed small-worldness of the real songs is above both the standard threshold (\(SWI > 1\)) and the pseudorandom sequences. \(SWI > 1\) when computed separately from the data from each year (1975, 2012, and 2019; see Supplementary Information)

2.7 Mutual Information

The rate at which the mutual information between items decays with distance reflects whether underlying syntactic structure is Markovian, hierarchical, or a composite of the two (12). Mutual information (\(MI\)) was calculated from pairs of sequential syllables separated by a certain distance, using the method of (12). \(MI\) was computed on concatenated sequences of songs from each individual. For example, if there are two individuals with two concatenated sequences of songs, \(a \to b \to c \to d\) and \(e \to f \to g\), and the target distance is 2, then the pairs used for \(MI\) calculated would be \(((a, c), (b, d), (e, g))\). \(MI\) would then be calculated using:

\[\begin{equation} \hat{I}(X, Y) = \hat{S}(X) + \hat{S}(Y) - \hat{S}(X, Y) \tag{4} \end{equation}\]

where \(\hat{I}\) denotes information and \(\hat{S}\) denotes entropy. \(X\) is the distribution of first syllables \((a, b, e)\), \(Y\) is the distribution of second syllables \((c, d, g)\), and \(XY\) is the joint distribution of pairs \(((a, c), (b, d), (e, g))\). \(\hat{S}\) was calculated using the method of (94) and (95) used by (65):

\[\begin{equation} \hat{S} = ln(N) - \frac{1}{N} \sum_{i=1}^{K} N_{i} \psi(N_{i}) \tag{5} \end{equation}\]

where \(N\) is the total number of tokens in the distribution, \(N_i\) is the number of tokens within each type \(i\) out of the total number of types \(K\), and \(\psi\) is the digamma function. The actual \(MI\) is then measured using:

\[\begin{equation} MI = \hat{I} - \hat{I}_{sh} \tag{6} \end{equation}\]

where \(\hat{I}_{sh}\) is the estimated lower bound of \(MI\) calculated from shuffled sequences, created by randomly permuting the concatenated sequences of individuals’ songs.

(12) and (66) simulated data from the hierarchical model of (65), the Markov model of (96), and their composite model in which Markov chains are nested within a larger hierarchical structure.

To determine whether the observed \(MI\) decay was consistent with a Markov process, hierarchical process, or both, I fit the following three decay models to the data:

\[\begin{equation} \textrm{exponential decay: } y = ae^{-xb} \tag{7} \end{equation}\] \[\begin{equation} \textrm{power-law decay: } y = cx^{d} \tag{8} \end{equation}\] \[\begin{equation} \textrm{composite decay: } y = ae^{-xb} + cx^{d} \tag{9} \end{equation}\]

where \(y\) is the estimated \(MI\) and \(x\) is the distance between syllables (12,66). (12) included an intercept \(f\) in all three models, but I removed it because it led to overfitting issues (e.g. exponential model with an extra “knee” after the initial decay), did not significantly improve model fit (\(\Delta WAIC < 2\)), and was not included in (65). I focused my analysis on distances of up to 100 syllables to enable easy comparison with (12). In the Supplementary Information, I followed (12) in comparing the fit of each model at increasing distances from 100 to 1,200 (the longest individual sequence is 1,219), and found that the composite model outperforms both the exponential and power-law models at all distances.

#load packages
library(brms)

#load data
load("data_models/processed_data.RData")
data <- processed_data$data
data$year <- strtrim(data$individual, 4)

#create function for the grassberger method of entropy estimation used by lin and tegmark (log is ln by default, which is what grassberger used originally)
s_calc <- function(vals){return(log(sum(vals)) - ((1/sum(vals))*sum(vals*digamma(vals))))}

#get unique songs and song lengths
unique_songs <- unique(data$song)
song_lengths <- as.numeric(sapply(unique_songs, function(x){length(which(data$song == x))}))

#get unique individuals and the lengths of their sequences
unique_inds <- unique(data$individual)
inds_lengths <- as.numeric(sapply(unique_inds, function(x){length(which(data$individual == x))}))

#set random seed
set.seed(12345)

#calculate mutual information across three levels of deep split
mi_data <- parallel::mclapply(1:3, function(i){
  #use appropriate syllables
  if(i == 1){data$syls <- data$cluster_2}
  if(i == 2){data$syls <- data$cluster_3}
  if(i == 3){data$syls <- data$cluster_4}

  #compute actual mutual information curves
  actual <- sapply(1:(max(inds_lengths) - 1), function(m){
    #set m as the distance between syllables in the sequence
    dist <- m

    #store pairs within bouts
    pairs <- do.call(rbind, lapply(unique_inds[which(inds_lengths > m)], function(r){
      seq <- data$syls[which(data$individual == r)]
      t(sapply(1:(length(seq)-dist), function(x){c(seq[x], seq[x+dist])}))
    }))
    
    #get frequencies of syllables in each column
    n_x <- as.numeric(table(pairs[, 1]))
    n_y <- as.numeric(table(pairs[, 2]))

    #get frequencies of pairs in the data
    pairs <- pairs[order(pairs[, 1], pairs[, 2]), ]
    index <- !duplicated(pairs)
    pair_freqs <- as.numeric(tapply(index, cumsum(index), length))

    #calculate s for each
    s_x <- s_calc(n_x)
    s_y <- s_calc(n_y)
    s_xy <- s_calc(pair_freqs)

    #compute mutual information and return
    return(s_x + s_y - s_xy)
  })

  #compute shuffled mutual information curves
  shuffled <- sapply(1:(max(inds_lengths) - 1), function(m){
    #set m as the distance between syllables in the sequence
    dist <- m

    #store pairs within bouts
    pairs <- do.call(rbind, lapply(unique_inds[which(inds_lengths > m)], function(r){
      seq <- sample(data$syls[which(data$individual == r)])
      t(sapply(1:(length(seq)-dist), function(x){c(seq[x], seq[x+dist])}))
    }))

    #get frequencies of syllables in each column
    n_x <- as.numeric(table(pairs[, 1]))
    n_y <- as.numeric(table(pairs[, 2]))

    #get frequencies of pairs in the data
    pairs <- pairs[order(pairs[, 1], pairs[, 2]), ]
    index <- !duplicated(pairs)
    pair_freqs <- as.numeric(tapply(index, cumsum(index), length))

    #calculate s for each
    s_x <- s_calc(n_x)
    s_y <- s_calc(n_y)
    s_xy <- s_calc(pair_freqs)

    #compute mutual information and return
    return(s_x + s_y - s_xy)
  })

  #return mutual information decay, with shuffled baseline substracted from actual information
  return(data.frame(dist = 1:length(actual), mi = c(actual - shuffled), mi_raw = actual, mi_base = shuffled))
}, mc.cores = 3)

#save output
save(mi_data, file = "data_models/mi_data.RData")

#load packages
library(brms)

#load data
load("data_models/mi_data.RData")

#set priors
exponential_priors <- prior(normal(0, 1), lb = 0, nlpar = "a") + 
  prior(normal(0, 1), lb = 0, nlpar = "b")
power_law_priors <- prior(normal(0, 1), lb = 0, nlpar = "c") + 
  prior(normal(0, 1), lb = 0, nlpar = "d")
composite_priors <- prior(normal(0, 1), lb = 0, nlpar = "a") + 
  prior(normal(0, 1), lb = 0, nlpar = "b") + 
  prior(normal(0, 1), lb = 0, nlpar = "c") + 
  prior(normal(0, 1), lb = 0, nlpar = "d")

#find the point at which the distances are no longer best fit by a composite model for each level of deep split
increments <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200)
waic_test <- lapply(1:3, function(h){
  parallel::mclapply(increments, function(x){
    exponential_test <- brm(bf(mi ~ a*exp((-dist)*b), a + b ~ 1, nl = TRUE), 
                            data = mi_data[[h]][1:x, ], prior = exponential_priors, iter = 5000)
    power_law_test <- brm(bf(mi ~ c*(dist^(-d)), c + d ~ 1, nl = TRUE), 
                          data = mi_data[[h]][1:x, ], prior = power_law_priors, iter = 5000)
    composite_test <- brm(bf(mi ~ a*exp((-dist)*b) + c*(dist^(-d)), a + b + c + d ~ 1, nl = TRUE), 
                          data = mi_data[[h]][1:x, ], prior = composite_priors, iter = 5000)
    waic_temp <- WAIC(exponential_test, power_law_test, composite_test)
    return(c(waic_temp$loos$exponential_test$waic, 
             waic_temp$loos$power_law_test$waic,
             waic_temp$loos$composite_test$waic))
  }, mc.cores = 7)
})

#format and save
waic_test <- lapply(1:3, function(x){
  temp <- data.frame(do.call(rbind, waic_test[[x]]))
  colnames(temp) <- c("exponential", "power_law", "composite")
  rownames(temp) <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200)
  return(temp)
})
save(waic_test, file = "data_models/waic_test.RData")

#run models at all three levels of deep split
mi_models <- lapply(1:3, function(g){
  #fit the exponential model
  exponential_fit <- brm(bf(mi ~ a*exp((-dist)*b), a + b ~ 1, nl = TRUE),
                         data = mi_data[[g]][1:100, ], prior = exponential_priors,
                         iter = 5000, cores = 4)
  
  #fit the power law model
  power_law_fit <- brm(bf(mi ~ c*(dist^(-d)), c + d ~ 1, nl = TRUE),
                       data = mi_data[[g]][1:100, ], prior = power_law_priors, 
                       iter = 5000, cores = 4)
  
  #fit the composite model
  composite_fit <- brm(bf(mi ~ a*exp((-dist)*b) + c*(dist^(-d)), a + b + c + d ~ 1, nl = TRUE), 
                       data = mi_data[[g]][1:100, ], prior = composite_priors, 
                       iter = 5000, cores = 4)
  
  #calculate r2
  r2 <- rbind(bayes_R2(exponential_fit), bayes_R2(power_law_fit), bayes_R2(composite_fit))
  rownames(r2) <- c("exponential", "power_law", "composite")

  #compute aic
  waic <- WAIC(exponential_fit, power_law_fit, composite_fit)
  waic <- c(waic$loos$exponential_fit$estimates[3, 1], waic$loos$power_law_fit$estimates[3, 1], waic$loos$composite_fit$estimates[3, 1])

  #return data and model fits and aic
  return(list(waic = waic,
              r2 = r2,
              priors = prior_summary(composite_fit),
              exponential_fit = summary(exponential_fit)$fixed,
              power_law_fit = summary(power_law_fit)$fixed,
              composite_fit = summary(composite_fit)$fixed))
})

#save data and models
save(mi_models, file = "data_models/mi_models.RData")

#extract simulated values from sainburg et al.
model_fig <- xml2::read_xml("https://raw.githubusercontent.com/timsainb/ParallelsBirdsongLanguagePaper/master/figures/modelfig.svg")
xml2::xml_ns_strip(model_fig)

#convert raw values to actual values
raw_to_log <- function(y, range = c(281.188594, 6.838594), axis = c(10e-4, 10e1)){return(exp(scales::rescale(x = -c(range, y), from = -range, to = log(axis)))[-c(1:2)])}

#get heirarchical values
y_heirarchy <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_1']/g/use"), "y"))
y_heirarchy <- raw_to_log(y_heirarchy)

#get markov values
y_markov_a <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_2']/g/use"), "y"))[1:29]
y_markov_b <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_3']/g/use"), "y"))[1:12]
y_markov_a <- raw_to_log(y_markov_a)
y_markov_b <- raw_to_log(y_markov_b)

#get combined values
y_combined <- as.numeric(xml2::xml_attr(xml2::xml_find_all(model_fig, xpath = "//*[@id='PathCollection_4']/g/use"), "y"))
y_combined <- raw_to_log(y_combined)

#combine all simulated values into one data frame
mi_simulated <- data.frame(x = c(1:100, 1:12, 1:100), 
                           y = c(y_heirarchy, y_markov_b, y_combined),
                           group = c(rep("heirarchy", 100), rep("markov", 12), rep("combined", 100)))

#save simulated values
save(mi_simulated, file = "data_models/mi_simulated.RData")

The top panel shows simulated mutual information decay curves from the @linCriticalBehaviorPhysics2017 hierarchical model (left), the @katahiraComplexSequencingRules2011a Markov model (center), and the @sainburgParallelsSequentialOrganization2019 composite model (right) [data and inspiration for diagrams from @sainburgParallelsSequentialOrganization2019]. The bottom panel shows the computed mutual information decay curves for the observed data at different levels of deep split (left, center, right). The solid line corresponds to the full composite model, while the dashed and dotted lines correspond to the exponential and power-law terms, respectively.

Figure 5: The top panel shows simulated mutual information decay curves from the (65) hierarchical model (left), the (96) Markov model (center), and the (12) composite model (right) (data and inspiration for diagrams from 12). The bottom panel shows the computed mutual information decay curves for the observed data at different levels of deep split (left, center, right). The solid line corresponds to the full composite model, while the dashed and dotted lines correspond to the exponential and power-law terms, respectively.

At all three levels of granularity in syllable clustering, the composite model has a lower \(WAIC\) and higher \(R^2\) than both the exponential and power-law models (see Supplementary Information), suggesting that mutual information decay in house finch song is more consistent with a combination of Markovian and hierarchical processes. The composite model also outcompetes both the exponential and power-law models when fit to mutual information decay curves computed separately for each year of recording (1975, 2012, and 2019; see Supplementary Information).

Interestingly, the transitions in the composite curves from Figure 5 roughly correspond to the average length of a house finch song (~12 syllables) (72). I reran the analysis with individual song sequences rather than song bouts and found that the exponential model outcompeted the composite model at \(DS = \{3, 4\}\) (see Supplementary Information). This suggests that individual song sequences may have Markovian structure within hierarchically-organized song bouts.

3 Discussion

All three linguistic laws considered here are present in house finch song. Three out of the four measures of production cost, most importantly duration, are consistently and strongly negatively correlated with frequency, providing robust evidence for Zipf’s law of abbreviation. Menzerath’s law also found solid support, with a steeper negative relationship between song length and syllable duration than predicted by a model of production constraints (49). Together, these results show clear evidence for efficiency—syllables that are difficult to produce are less common and more likely to appear in shorter songs. Mandelbrot’s form of Zipf’s rank-frequency law provided a good fit to the data and has a more convex shape than the original form, which is consistent with studies of other non-human communication systems that have more redundancy than human language (24–29). I will limit my interpretation of this result, given ongoing debates about the cause of the rank-frequency law, but it is notable that the goodness-of-fit of the Zipf-Mandelbrot distribution to house finch song (\(R^2 > 0.99\)) is within the range of human language (21,88).

The two structural properties of language considered here are also found in house finch song. First, the syllable network has a small-world structure, characterized by high levels of clustering and low average path lengths, which is thought to reflect systematic structure and efficient recall in human language (52,53,97). The small-worldness index of house finch song is within the range seen in humpback whales (24) and other songbirds at the first two levels of granularity in syllable clustering (26,56–59) (\(SWI \sim 1.69-4.7\)), but is much higher when syllables are over-split into types (\(SWI \sim 10-11\)). Second, the decay in mutual information between syllables with increasing distance has a similar shape to that found in several human languages and songbird species (12), which is associated with Markov chains nested within a larger hierarchical structure. Historically, birdsong sequences were assumed to follow simple Markov chains (96,98,99). This result lends support to an emerging consensus that song bouts may be much more complex, containing long-range dependencies and hierarchical structures that resemble human language (11,12,100,101). In combination, the small-worldness and mutual information decay of house finch song sequences suggest that they exhibit the kind of systematic structure that is thought to maximize expressivity while reducing learning costs (13,24,54), making it easier for more information to “pass through the bottleneck” of social learning (1).

Notably, these patterns are consistent across three levels of granularity in syllable clustering. I have heard others studying the cultural evolution of birdsong refer to this as “fractal equivalency”—different resolutions of clustering should show similar forms of organization (102). A practical conclusion of this finding is that information theoretic measures may not be reliable indicators of clustering quality in non-human communication systems (103).

Additional discussion can be found in the Supplementary Information.

Acknowledgments

This work was inspired in large part by conversations with Olivier Morin, Alexey Koshevoy, and the attendees of the Communicative Efficiency Workshop hosted by the Max Planck Institute for Geoanthropology in April 2023.

Data & Code Availability

The data and code for this study are available on Zenodo (https://doi.org/10.5281/zenodo.10689807).

References

Gruber T, Chimento M, Aplin LM, Biro D. Efficiency fosters cumulative culture across species. Phil Trans R Soc B [Internet]. 2022 Jan 31 [cited 2023 Apr 12];377(1843):20200308. Available from: https://royalsocietypublishing.org/doi/10.1098/rstb.2020.0308

Gibson E, Futrell R, Piantadosi SP, Dautriche I, Mahowald K, Bergen L, et al. How efficiency shapes human language. Trends in Cognitive Sciences [Internet]. 2019 May [cited 2023 Apr 12];23(5):389–407. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1364661319300580

Zipf G. Human Behavior and the Principle of Least Effort: An Introducton to Human Ecology. Cambridge: Addison-Wesley; 1949.

Miton H, Morin O. Graphic complexity in writing systems. Cognition [Internet]. 2021;214:104771. Available from: https://doi.org/10.1016/j.cognition.2021.104771

De Boer B. Self-organization in vowel systems. Journal of Phonetics [Internet]. 2000 Oct [cited 2023 Aug 30];28(4):441–65. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0095447000901256

Fitch WT. The evolution of speech: a comparative review. Trends in Cognitive Sciences [Internet]. 2000 Jul [cited 2023 Aug 30];4(7):258–67. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1364661300014947

Mesoudi A. Variable Cultural Acquisition Costs Constrain Cumulative Cultural Evolution. PLOS ONE. 2011;6(3):e18239.

Youngblood M, Miton H, Morin O. Statistical signals of copying are robust to time- and space-averaging. Evolut Hum Sci [Internet]. 2023 [cited 2023 Aug 28];5:e10. Available from: https://www.cambridge.org/core/product/identifier/S2513843X23000051/type/journal_article

Youngblood M, Lahti D. Content bias in the cultural evolution of house finch song. Animal Behaviour. 2022;185:37–48.

10.

Garamszegi LZ, Balsby TJS, Bell BD, Borowiec M, Byers BE, Draganoiu T, et al. Estimating the complexity of bird song by using capture-recapture approaches from community ecology. Behav Ecol Sociobiol [Internet]. 2005 Feb [cited 2023 Aug 25];57(4):305–17. Available from: http://link.springer.com/10.1007/s00265-004-0866-6

11.

Kershenbaum A, Bowles AE, Freeberg TM, Jin DZ, Lameira AR, Bohn K. Animal vocal sequences: not the Markov chains we thought they were. Proc R Soc B [Internet]. 2014 Oct 7 [cited 2023 Apr 12];281(1792):20141370. Available from: https://royalsocietypublishing.org/doi/10.1098/rspb.2014.1370

12.

Sainburg T, Theilman B, Thielk M, Gentner TQ. Parallels in the sequential organization of birdsong and human speech. Nat Commun [Internet]. 2019 Aug 12 [cited 2023 Apr 17];10(1):3636. Available from: https://www.nature.com/articles/s41467-019-11605-y

13.

Raviv L, De Heer Kloots M, Meyer A. What makes a language easy to learn? A preregistered study on how systematic structure and community size affect language learnability. Cognition [Internet]. 2021 May [cited 2023 Aug 9];210:104620. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0010027721000391

14.

Verhoef T. The origins of duality of patterning in artificial whistled languages. Language and Cognition. 2012;4(4):357–80.

15.

Kirby S. Culture and biology in the origins of linguistic structure. Psychon Bull Rev [Internet]. 2017 Feb [cited 2023 Aug 28];24(1):118–37. Available from: http://link.springer.com/10.3758/s13423-016-1166-7

16.

Fedzechkina M, Jaeger TF, Newport EL. Language learners restructure their input to facilitate efficient communication. Proc Natl Acad Sci USA [Internet]. 2012 Oct 30 [cited 2023 Aug 28];109(44):17897–902. Available from: https://pnas.org/doi/full/10.1073/pnas.1215776109

17.

Kirby S, Cornish H, Smith K. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proc Natl Acad Sci USA [Internet]. 2008 Aug 5 [cited 2023 Aug 28];105(31):10681–6. Available from: https://pnas.org/doi/full/10.1073/pnas.0707835105

18.

Fehér O, Wang H, Saar S, Mitra PP, Tchernichovski O. De novo establishment of wild-type song culture in the zebra finch. Nature [Internet]. 2009;459(7246):564–8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19412161

19.

Chimento M, Alarcón-Nieto G, Aplin L. Population turnover facilitates cultural selection for efficiency. Current Biology. 2021;31:2477–83.

20.

Semple S, Ferrer-i-Cancho R, Gustison ML. Linguistic laws in biology. Trends in Ecology & Evolution [Internet]. 2022 Jan [cited 2023 Aug 3];37(1):53–66. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0169534721002305

21.

Piantadosi ST. Zipf’s word frequency law in natural language: A critical review and future directions. Psychon Bull Rev [Internet]. 2014 Oct [cited 2023 Aug 3];21(5):1112–30. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592/

22.

Mandelbrot B. An informational theory of the statistical structure of language. Communication Theory. 1953;486–502.

23.

Mandelbrot B. On the theory of word frequencies and on related Markovian models of discourse. 1962;190–219.

24.

Allen JA, Garland EC, Dunlop RA, Noad MJ. Network analysis reveals underlying syntactic features in a vocally learnt mammalian display, humpback whale song. Proc R Soc B [Internet]. 2019 Dec 18 [cited 2023 May 29];286(1917):20192014. Available from: https://royalsocietypublishing.org/doi/10.1098/rspb.2019.2014

25.

Briefer E, Osiejuk TS, Rybak F, Aubin T. Are bird song complexity and song sharing shaped by habitat structure? An information theory and statistical approach. Journal of Theoretical Biology [Internet]. 2010;262(1):151–64. Available from: http://dx.doi.org/10.1016/j.jtbi.2009.09.020

26.

Cody ML, Stabler E, Sánchez Castellanos HM, Taylor CE. Structure, syntax and “small-world” organization in the complex songs of California Thrashers (Toxostoma redivivum). Bioacoustics [Internet]. 2016 Jan 2 [cited 2023 Apr 12];25(1):41–54. Available from: http://www.tandfonline.com/doi/full/10.1080/09524622.2015.1089418

27.

Hailman JP, Ficken MS, Ficken RW. The “chick-a-dee” calls of Parus atricapillus: a recombinant system of animal communication compared with written English. Semiotica [Internet]. 1985 [cited 2023 Aug 3];56(3-4). Available from: https://www.degruyter.com/document/doi/10.1515/semi.1985.56.3-4.191/html

28.

McCowan B, Hanser SF, Doyle LR. Quantitative tools for comparing animal communication systems: information theory applied to bottlenose dolphin whistle repertoires. Animal Behaviour [Internet]. 1999 Feb [cited 2023 Aug 3];57(2):409–19. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0003347298910004

29.

Martins EP. Structural complexity in a lizard communication system: the Sceloporus graciosus “push-up” display. Copeia [Internet]. 1994 Dec 19 [cited 2023 Aug 3];1994(4):944. Available from: https://www.jstor.org/stable/1446717?origin=crossref

30.

Manin DYu. Mandelbrot’s model for Zipf’s law: can Mandelbrot’s model explain Zipf’s law for language? Journal of Quantitative Linguistics [Internet]. 2009 Aug [cited 2023 Sep 7];16(3):274–85. Available from: http://www.tandfonline.com/doi/abs/10.1080/09296170902850358

31.

Salge C, Ay N, Polani D, Prokopenko M. Zipf’s law: balancing signal usage cost and communication efficiency. Smalheiser NR, editor. PLoS ONE [Internet]. 2015 Oct 1 [cited 2023 Aug 28];10(10):e0139475. Available from: https://dx.plos.org/10.1371/journal.pone.0139475

32.

Ferrer-i-Cancho R. Compression and the origins of Zipf’s law for word frequencies. Complexity [Internet]. 2016 Nov [cited 2023 Aug 28];21(S2):409–11. Available from: https://onlinelibrary.wiley.com/doi/10.1002/cplx.21820

33.

Genty E, Byrne RW. Why do gorillas make sequences of gestures? Anim Cogn [Internet]. 2010 Mar [cited 2023 Sep 19];13(2):287–301. Available from: http://link.springer.com/10.1007/s10071-009-0266-4

34.

Kershenbaum A, Demartsev V, Gammon DE, Geffen E, Gustison ML, Ilany A, et al. Shannon entropy as a robust estimator of Zipf’s Law in animal vocal communication repertoires. Zamora‐Gutierrez V, editor. Methods Ecol Evol [Internet]. 2021 Mar [cited 2023 Sep 19];12(3):553–64. Available from: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13536

35.

Bentz C, Ferrer-i-Cancho R. Zipf’s law of abbreviation as a language universal. In: Proceedings of the Leiden Workshop on Capturing Phylogenetic Algorithms for Linguistics. 2016. p. 1–4.

36.

Petrini S, Casas-i-Muñoz A, Cluet-i-Martinell J, Wang M, Bentz C, Ferrer-i-Cancho R. The optimality of word lengths. Theoretical foundations and an empirical study [Internet]. arXiv; 2023 [cited 2023 Aug 9]. Available from: http://arxiv.org/abs/2208.10384

37.

Koshevoy A, Miton H, Morin O. Zipf’s law of abbreviation holds for individual characters across a broad range of writing systems. Cognition [Internet]. 2023 Sep [cited 2023 Aug 9];238:105527. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0010027723001610

38.

Heesen R, Hobaiter C, Ferrer-i-Cancho R, Semple S. Linguistic laws in chimpanzee gestural communication. Proceedings of the Royal Society B: Biological Sciences. 2019;286:20182900.

39.

Semple S, Hsu MJ, Agoramoorthy G. Efficiency of coding in macaque vocal communication. Biol Lett [Internet]. 2010 Aug 23 [cited 2023 Aug 9];6(4):469–71. Available from: https://royalsocietypublishing.org/doi/10.1098/rsbl.2009.1062

40.

Huang M, Ma H, Ma C, Garber PA, Fan P. Male gibbon loud morning calls conform to Zipf’s law of brevity and Menzerath’s law: insights into the origin of human language. Animal Behaviour [Internet]. 2020 Feb [cited 2023 Aug 9];160:145–55. Available from: https://linkinghub.elsevier.com/retrieve/pii/S000334721930377X

41.

Favaro L, Gamba M, Cresta E, Fumagalli E, Bandoli F, Pilenga C, et al. Do penguins’ vocal sequences conform to linguistic laws? Biol Lett [Internet]. 2020 Feb [cited 2023 Aug 9];16(2):20190589. Available from: https://royalsocietypublishing.org/doi/10.1098/rsbl.2019.0589

42.

Ferrer-i-Cancho R, Hernández-Fernández A, Lusseau D, Agoramoorthy G, Hsu MJ, Semple S. Compression as a universal principle of animal behavior. Cogn Sci [Internet]. 2013 Nov [cited 2023 Aug 28];37(8):1565–78. Available from: https://onlinelibrary.wiley.com/doi/10.1111/cogs.12061

43.

Menzerath P. Die Architektonik des Deutschen Wortschatzes. Bonn: Dümmler; 1954.

44.

Cramer I. The parameters of the Altmann-Menzerath law. Journal of Quantitative Linguistics [Internet]. 2005 Apr [cited 2023 Aug 9];12(1):41–52. Available from: http://www.tandfonline.com/doi/abs/10.1080/09296170500055301

45.

Stave M, Paschen L, Pellegrino F, Seifart F. Optimization of morpheme length: a cross-linguistic assessment of Zipf’s and Menzerath’s laws. Linguistics Vanguard [Internet]. 2021 Apr 21 [cited 2023 Aug 9];7(s3):20190076. Available from: https://www.degruyter.com/document/doi/10.1515/lingvan-2019-0076/html

46.

Eroglu S. Menzerath–Altmann law for distinct word distribution analysis in a large text. Physica A: Statistical Mechanics and its Applications [Internet]. 2013 Jun [cited 2023 Aug 9];392(12):2775–80. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0378437113001702

47.

Gustison ML, Semple S, Ferrer-i-Cancho R, Bergman TJ. Gelada vocal sequences follow Menzerath’s linguistic law. Proc Natl Acad Sci USA [Internet]. 2016 May 10 [cited 2023 May 29];113(19). Available from: https://pnas.org/doi/full/10.1073/pnas.1522072113

48.

Fedurek P, Zuberbühler K, Semple S. Trade-offs in the production of animal vocal sequences: insights from the structure of wild chimpanzee pant hoots. Front Zool [Internet]. 2017 Dec [cited 2023 Aug 9];14(1):50. Available from: http://frontiersinzoology.biomedcentral.com/articles/10.1186/s12983-017-0235-8

49.

James LS, Mori C, Wada K, Sakata JT. Phylogeny and mechanisms of shared hierarchical patterns in birdsong. Current Biology [Internet]. 2021 Jul [cited 2023 Apr 12];31(13):2796–2808.e9. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0960982221005285

50.

Watts DJ, Strogatz SH. Collective dynamics of “small-world” networks. Nature [Internet]. 1998;393:440–2. Available from: https://doi.org/10.1038/30918

51.

Humphries MD, Gurney K. Network “small-world-ness”: a quantitative method for determining canonical network equivalence. Sporns O, editor. PLoS ONE [Internet]. 2008 Apr 30 [cited 2023 May 29];3(4):e0002051. Available from: https://dx.plos.org/10.1371/journal.pone.0002051

52.

Cancho RF i, Solé RV. The small world of human language. Proc R Soc Lond B [Internet]. 2001 Nov 7 [cited 2023 Apr 12];268(1482):2261–5. Available from: https://royalsocietypublishing.org/doi/10.1098/rspb.2001.1800

53.

Steyvers M, Tenenbaum JB. The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science [Internet]. 2005 Jan 2 [cited 2023 Sep 19];29(1):41–78. Available from: http://doi.wiley.com/10.1207/s15516709cog2901_3

54.

Kirby S, Tamariz M, Cornish H, Smith K. Compression and communication in the cultural evolution of linguistic structure. Cognition. 2015 Aug;141:87–102.

55.

Solé RV, Corominas-Murtra B, Valverde S, Steels L. Language networks: their structure, function, and evolution. Complexity [Internet]. 2010 [cited 2023 Sep 19];NA–. Available from: https://onlinelibrary.wiley.com/doi/10.1002/cplx.20305

56.

Weiss M, Hultsch H, Adam I, Scharff C, Kipper S. The use of network analysis to study complex animal communication systems: a study on nightingale song. Proc R Soc B [Internet]. 2014 Jun 22 [cited 2023 Apr 12];281(1785):20140460. Available from: https://royalsocietypublishing.org/doi/10.1098/rspb.2014.0460

57.

Beckage N, Smith L, Hills T. Small worlds and semantic network growth in typical and late talkers. Perc M, editor. PLoS ONE [Internet]. 2011 May 11 [cited 2023 Aug 9];6(5):e19348. Available from: https://dx.plos.org/10.1371/journal.pone.0019348

58.

Sasahara K, Cody ML, Cohen D, Taylor CE. Structural design principles of complex bird songs: a network-based approach. Troyer TW, editor. PLoS ONE [Internet]. 2012 Sep 24 [cited 2023 Apr 12];7(9):e44436. Available from: https://dx.plos.org/10.1371/journal.pone.0044436

59.

Hedley RW. Composition and sequential organization of song repertoires in Cassin’s Vireo (Vireo cassinii). J Ornithol [Internet]. 2016 Jan [cited 2023 Apr 12];157(1):13–22. Available from: http://link.springer.com/10.1007/s10336-015-1238-x

60.

Pothos EM, Juola P. Characterizing linguistic structure with mutual information. British Journal of Psychology [Internet]. 2007 May [cited 2023 Aug 10];98(2):291–304. Available from: http://doi.wiley.com/10.1348/000712606X122760

61.

Alvarez-Lacalle E, Dorow B, Eckmann J-P, Moses E. Hierarchical structures induce long-range dynamical correlations in written texts. Proc Natl Acad Sci USA [Internet]. 2006 May 23 [cited 2023 Aug 11];103(21):7956–61. Available from: https://pnas.org/doi/full/10.1073/pnas.0510673103

62.

Frank SL, Bod R, Christiansen MH. How hierarchical is language use? Proc R Soc B [Internet]. 2012 Nov 22 [cited 2023 Aug 10];279(1747):4522–31. Available from: https://royalsocietypublishing.org/doi/10.1098/rspb.2012.1741

63.

Chomsky N. Syntactic Structures. Mouton & Co.; 1957.

64.

Li W. Mutual information functions versus correlation functions. J Stat Phys [Internet]. 1990 Sep [cited 2023 Aug 10];60(5-6):823–37. Available from: http://link.springer.com/10.1007/BF01025996

65.

Lin H, Tegmark M. Critical behavior in physics and probabilistic formal languages. Entropy [Internet]. 2017 Jun 23 [cited 2023 May 29];19(7):299. Available from: http://www.mdpi.com/1099-4300/19/7/299

66.

Sainburg T, Mai A, Gentner TQ. Long-range sequential dependencies precede complex syntactic production in language acquisition. Proceedings of the Royal Society B: Biological Sciences [Internet]. 2022;289:20212657. Available from: https://doi.org/10.1098/rspb.2021.2657

67.

Sainburg T, Thielk M, Gentner TQ. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. Theunissen FE, editor. PLoS Comput Biol [Internet]. 2020 Oct 15 [cited 2023 Sep 5];16(10):e1008228. Available from: https://dx.plos.org/10.1371/journal.pcbi.1008228

68.

Rivera M, Edwards JA, Hauber ME, Woolley SMN. Machine learning and statistical classification of birdsong link vocal acoustic features with phylogeny. Sci Rep [Internet]. 2023 May 1 [cited 2023 Sep 5];13(1):7076. Available from: https://www.nature.com/articles/s41598-023-33825-5

69.

Merino Recalde N. pykanto: A python library to accelerate research on wild bird song. Methods Ecol Evol [Internet]. 2023 Aug [cited 2023 Sep 5];14(8):1994–2002. Available from: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.14155

70.

Charbonneau M, Bourrat P. Fidelity and the grain problem in cultural evolution. Synthese [Internet]. 2021 Dec [cited 2023 Sep 5];199(3-4):5815–36. Available from: https://link.springer.com/10.1007/s11229-021-03047-1

71.

Mundinger PC. Song dialects and colonization in the house finch, Carpodacus mexicanus, on the east coast. The Condor [Internet]. 1975;77(4):407–22. Available from: https://doi.org/10.2307/1366088

72.

Ju C, Geller FC, Mundinger PC, Lahti DC. Four decades of cultural evolution in house finch songs. The Auk: Ornithological Advances. 2019;136:1–18.

73.

Mann DC, Lahti DC, Waddick L, Mundinger PC. House finches learn canary trills. Bioacoustics. 2020;1–17.

74.

Nolan PM, Hill GE. Female choice for song characteristics in the house finch. Animal Behaviour. 2004;67(3):403–10.

75.

Mennill DJ, Badyaev AV, Jonart LM, Hill GE. Male house finches with elaborate songs have higher reproductive performance. Ethology. 2006;112(2):174–80.

76.

Ciaburri I, Williams H. Context-dependent variation of house finch song syntax. Animal Behaviour. 2019;147:33–42.

77.

Bermúdez‐Cuamatzin E, Slabbekoorn H, Macías Garcia C. Spectral and temporal call flexibility of House Finches (Haemorhous mexicanus) from urban areas during experimental noise exposure. Ibis [Internet]. 2023 Apr [cited 2023 Apr 12];165(2):571–86. Available from: https://onlinelibrary.wiley.com/doi/10.1111/ibi.13161

78.

Bürkner PC. brms: an R package for Bayesian multilevel models using Stan. J Stat Soft [Internet]. 2017 [cited 2023 Aug 15];80(1). Available from: http://www.jstatsoft.org/v80/i01/

79.

Lachlan RF, Ratmann O, Nowicki S. Cultural conformity generates extremely stable traditions in bird song. Nature Communications [Internet]. 2018;9. Available from: http://dx.doi.org/10.1038/s41467-018-04728-1

80.

Sardá-Espinosa A. Time-series clustering in R using the dtwclust package. The R Journal [Internet]. 2019;11(1):22–43. Available from: https://doi.org/10.32614/RJ-2019-023

81.

Ratanamahatana CA, Keogh E. Three myths about dynamic time warping data mining. In: Proceedings of the 2005 SIAM International Conference on Data Mining. 2005. p. 506–10.

82.

Roginek EW. Spatial variation of house finch (Haemorhous mexicanus) song along the American Southwest coast. Queens College; 2018.

83.

Müllner D. fastcluster: Fast hierarchical, agglomerative clustering routines for R and python. Journal of Statistical Software [Internet]. 2013;53(9):1–18. Available from: https://www.jstatsoft.org/index.php/jss/article/view/v053i09

84.

Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics [Internet]. 2008 Mar 1 [cited 2023 Jun 2];24(5):719–20. Available from: https://academic.oup.com/bioinformatics/article/24/5/719/200751

85.

Gilman RT, Durrant C, Malpas L, Lewis RN. Does Zipf’s law of abbreviation shape birdsong? [Internet]. Animal Behavior and Cognition; 2023 Dec [cited 2024 Jan 25]. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.12.06.569773

86.

Podos J, Moseley DL, Goodwin SE, McClure J, Taft BN, Strauss AVH, et al. A fine-scale, broadly applicable index of vocal performance: frequency excursion. Animal Behaviour. 2016;116:203–12.

87.

Izsák J. Some practical aspects of fitting and testing the Zipf-Mandelbrot model: a short essay. Scientometrics [Internet]. 2006 Apr [cited 2023 Sep 6];67(1):107–20. Available from: http://link.springer.com/10.1007/s11192-006-0052-x

88.

Linders GM, Louwerse MM. Zipf’s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort. Psychon Bull Rev [Internet]. 2023 Feb [cited 2023 Sep 19];30(1):77–101. Available from: https://link.springer.com/10.3758/s13423-022-02142-9

89.

Lewis RN, Kwong A, Soma M, De Kort SR, Gilman RT. Java sparrow song conforms to Menzerath’s Law but not Zipf’s Law of Abbreviation [Internet]. Animal Behavior and Cognition; 2023 Dec [cited 2024 Jan 25]. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.12.13.571437

90.

G. Torre I, Dębowski Ł, Hernández-Fernández A. Can Menzerath’s law be a criterion of complexity in communication? Amancio DR, editor. PLoS ONE [Internet]. 2021 Aug 20 [cited 2023 Aug 9];16(8):e0256133. Available from: https://dx.plos.org/10.1371/journal.pone.0256133

91.

Tanaka-Ishii K. Menzerath’s law in the syntax of languages compared with random sentences. Entropy [Internet]. 2021 May 25 [cited 2023 Aug 9];23(6):661. Available from: https://www.mdpi.com/1099-4300/23/6/661

92.

Ferrer-i-Cancho R, Hernández-Fernández A, Baixeries J, Dębowski Ł, Mačutek J. When is Menzerath-Altmann law mathematically trivial? a new approach. Statistical Applications in Genetics and Molecular Biology [Internet]. 2014 Jan 1 [cited 2023 Aug 9];13(6). Available from: https://www.degruyter.com/document/doi/10.1515/sagmb-2013-0034/html

93.

Safryghin A, Cross C, Fallon B, Heesen R, Ferrer-i-Cancho R, Hobaiter C. Variable expression of linguistic laws in ape gesture: a case study from chimpanzee sexual solicitation. Royal Society Open Science [Internet]. 2022;9:220849. Available from: https://doi.org/10.1098/rsos.220849

94.

Grassberger P. Entropy estimates from insufficient samplings [Internet]. arXiv; 2008 [cited 2023 Aug 10]. Available from: http://arxiv.org/abs/physics/0307138

95.

Grassberger P. On generalized Schürmann entropy estimators. Entropy [Internet]. 2022 May 11 [cited 2023 Aug 15];24(5):680. Available from: https://www.mdpi.com/1099-4300/24/5/680

96.

Katahira K, Suzuki K, Okanoya K, Okada M. Complex sequencing rules of birdsong can be explained by simple hidden Markov processes. De Polavieja GG, editor. PLoS ONE [Internet]. 2011 Sep 7 [cited 2023 May 29];6(9):e24516. Available from: https://dx.plos.org/10.1371/journal.pone.0024516

97.

Motter AE, De Moura APS, Lai YC, Dasgupta P. Topology of the conceptual network of language. Phys Rev E [Internet]. 2002 Jun 25 [cited 2023 Sep 19];65(6):065102. Available from: https://link.aps.org/doi/10.1103/PhysRevE.65.065102

98.

Leonardo A, Konishi M. Decrystallization of adult birdsong by perturbation of auditory feedback. Nature [Internet]. 1999 Jun [cited 2023 Sep 14];399(6735):466–70. Available from: https://www.nature.com/articles/20933

99.

Berwick RC, Okanoya K, Beckers GJL, Bolhuis JJ. Songs to syntax: the linguistics of birdsong. Trends in Cognitive Sciences [Internet]. 2011 Mar [cited 2023 Aug 29];15(3):113–21. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1364661311000039

100.

Searcy WA, Soha J, Peters S, Nowicki S. Long-distance dependencies in birdsong syntax. Proc R Soc B [Internet]. 2022 Jan 26 [cited 2023 Sep 14];289(1967):20212473. Available from: https://royalsocietypublishing.org/doi/10.1098/rspb.2021.2473

101.

Morita T, Koda H, Okanoya K, Tachibana RO. Measuring context dependency in birdsong using artificial neural networks. PLOS Computational Biology [Internet]. 2021;17(12):e1009707. Available from: http://dx.doi.org/10.1371/journal.pcbi.1009707

102.

Lahti DC. 2020.

103.

Kershenbaum A, Blumstein DT, Roch MA, Akçay Ç, Backus G, Bee MA, et al. Acoustic sequences in non-human animals: a tutorial review and prospectus. Biol Rev [Internet]. 2016 [cited 2023 Jun 1];91:13–52. Available from: https://doi.org/10.1111/brv.12160