chatter.features
Classes
FeatureProcessor(df, config) | Post-processing class for autoencoder features and associated metadata.
- class chatter.features.FeatureProcessor(df, config)
Post-processing class for autoencoder features and associated metadata.
This class provides methods for dimensionality reduction (PaCMAP), clustering (BIRCH), computing within-sequence cosine distances, computing VAR-based surprisal scores, assigning sequence identifiers, and visualizing embedding structures.
- df
DataFrame containing latent features and associated metadata.
- Type:
pd.DataFrame
- config
Configuration dictionary containing post-processing parameters such as ‘lag_size’ and ‘seq_bound’.
- Type:
dict
- __init__(df, config)
Initialize the FeatureProcessor with a DataFrame and configuration.
- Parameters:
df (pd.DataFrame) – DataFrame containing latent features and corresponding metadata.
config (dict) – Configuration dictionary containing post-processing parameters.
- run_pacmap(**kwargs)
Run PaCMAP dimensionality reduction on latent features and add coordinates.
This method automatically identifies feature columns, runs PaCMAP to embed them into a two-dimensional space, and stores the resulting coordinates in new columns ‘pacmap_x’ and ‘pacmap_y’ in the DataFrame.
- Parameters:
**kwargs – Additional keyword arguments passed directly to the pacmap.PaCMAP constructor, allowing customization of the embedding.
- Returns:
The current instance, returned to enable method chaining.
- Return type:
FeatureProcessor
- run_birch_clustering(n_clusters_list)
Run BIRCH clustering for multiple values of ‘n_clusters’.
This method performs BIRCH clustering on the PaCMAP embeddings (columns ‘pacmap_x’ and ‘pacmap_y’) for each requested number of clusters and stores cluster labels in separate columns.
- Parameters:
n_clusters_list (list of int) – List of ‘n_clusters’ values for which to compute BIRCH cluster assignments. For each value ‘n’, a column ‘birch_n’ is added to the DataFrame.
- Returns:
The current instance, returned to enable method chaining.
- Return type:
FeatureProcessor
- compute_density_probability(use_pacmap=False, scaled=True, **kwargs)
Compute probability density estimates for embeddings using denmarf.
This method fits a Masked Autoregressive Flow (MAF) density estimator to the latent features (or PaCMAP coordinates) and assigns a log-probability density score to each unit. High scores indicate ‘typical’ points in high-density regions; low scores indicate outliers or rare examples.
- Parameters:
use_pacmap (bool, optional) – If True, computes density on the 2D ‘pacmap_x/y’ coordinates instead of the full latent space. Default is False (recommended: density estimation is more rigorous in the full latent space).
scaled (bool, optional) – If True, standardizes features (zero mean, unit variance) before fitting. Highly recommended for neural density estimators. Default is True.
**kwargs – Additional keyword arguments to pass to the denmarf.DensityEstimate constructor or fit method.
- Returns:
The current instance, returned to enable method chaining. A new column ‘density_log_prob’ is added to the DataFrame.
- Return type:
FeatureProcessor
- compute_cosine_distances()
Compute cosine distance between subsequent latent features within sequences.
This method calculates the cosine distance between each pair of consecutive rows that share the same sequence identifier ‘seq_id’. For each sequence, the first item has an undefined previous neighbor and therefore receives a distance of NaN. The results are stored in a new column ‘cosine_dist’.
Requirements
The DataFrame must contain:
- Columns representing latent features.
- A ‘seq_id’ column identifying sequences.
- An ‘onset’ column to ensure temporal ordering within each sequence.
- Returns:
The current instance, returned to enable method chaining.
- Return type:
FeatureProcessor
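The calculation described above can be sketched with NumPy and pandas. This is a minimal illustration, not the library's implementation: the helper name `consecutive_cosine_distances` and the feature column names `f0` and `f1` are hypothetical, and the sketch assumes the DataFrame index is unique.

```python
import numpy as np
import pandas as pd


def consecutive_cosine_distances(df, feature_cols):
    """Cosine distance between each row and the previous row of the same sequence."""
    # Order rows in time within each sequence before comparing neighbours.
    ordered = df.sort_values(["seq_id", "onset"])
    feats = ordered[feature_cols].to_numpy(dtype=float)
    seq_ids = ordered["seq_id"].to_numpy()

    dist = np.full(len(ordered), np.nan)
    if len(ordered) > 1:
        prev, curr = feats[:-1], feats[1:]
        dots = np.einsum("ij,ij->i", prev, curr)
        norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(curr, axis=1)
        with np.errstate(divide="ignore", invalid="ignore"):
            d = 1.0 - dots / norms
        # Keep only pairs that belong to the same sequence; the first item
        # of every sequence stays NaN.
        same_seq = seq_ids[1:] == seq_ids[:-1]
        dist[1:] = np.where(same_seq, d, np.nan)

    # Return the distances in the original row order.
    return pd.Series(dist, index=ordered.index, name="cosine_dist").reindex(df.index)


# Hypothetical toy data: two sequences with two latent features each.
demo = pd.DataFrame({
    "seq_id": [0, 0, 0, 1, 1],
    "onset": [0.0, 0.1, 0.2, 5.0, 5.1],
    "f0": [1.0, 0.0, 0.0, 1.0, 1.0],
    "f1": [0.0, 1.0, 2.0, 1.0, 1.0],
})
demo["cosine_dist"] = consecutive_cosine_distances(demo, ["f0", "f1"])
```

In the toy data, the first row of each sequence receives NaN, orthogonal neighbours receive a distance of 1, and neighbours pointing in the same direction receive a distance of (approximately) 0.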
- compute_sse_resid()
Compute VAR-based sum of squared error residuals as a surprisal proxy.
This method fits a single global vector autoregression (VAR) model with a specified lag size across all sequences while respecting sequence boundaries defined by ‘seq_id’. It then computes the per-timestep sum of squared errors (SSE) of the residuals for each sequence. Short sequences and early time steps are handled with reduced lag orders when necessary.
Requirements
The DataFrame must contain:
- Columns representing latent features.
- A ‘seq_id’ column identifying sequences.
Configuration must include:
- ‘lag_size’ : int, the lag order p of the VAR model.
- Returns:
The current instance, returned to enable method chaining. A new column ‘sse_resid’ is added to the DataFrame containing SSE values or NaN where predictions are not defined (for example, the first time step of each sequence).
- Return type:
FeatureProcessor
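The idea of a pooled VAR fit that respects sequence boundaries can be sketched with ordinary least squares in NumPy. This is a simplified illustration under stated assumptions, not the library's implementation: the function names are hypothetical, an intercept term is assumed, and the reduced-lag fallback for early time steps is omitted, so every step with fewer than `lag` predecessors simply stays NaN.

```python
import numpy as np


def _lagged_row(seq, t, lag):
    """Intercept followed by the `lag` previous vectors, most recent first."""
    return np.concatenate([[1.0], seq[t - lag:t][::-1].ravel()])


def var_sse_residuals(sequences, lag):
    """Fit one pooled VAR(lag) by least squares; return per-step SSE per sequence."""
    seqs = [np.asarray(s, dtype=float) for s in sequences]
    # Lagged predictors are built inside each sequence only, so no
    # prediction ever crosses a sequence boundary.
    rows = [(s, t) for s in seqs for t in range(lag, len(s))]
    if not rows:
        raise ValueError("no sequence is longer than the lag order")
    X = np.array([_lagged_row(s, t, lag) for s, t in rows])
    Y = np.array([s[t] for s, t in rows])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

    results = []
    for s in seqs:
        sse = np.full(len(s), np.nan)  # steps without enough history stay NaN
        for t in range(lag, len(s)):
            resid = s[t] - _lagged_row(s, t, lag) @ coef
            sse[t] = float(resid @ resid)
        results.append(sse)
    return results


# Noise-free toy sequences generated by x_t = A @ x_{t-1}, so the fitted
# model should reproduce them almost exactly.
A = np.array([[0.5, 0.1], [0.0, 0.8]])


def simulate(x0, steps):
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps - 1):
        xs.append(A @ xs[-1])
    return np.array(xs)


seq_a = simulate([1.0, 1.0], 6)
seq_b = simulate([2.0, -1.0], 4)
sse_a, sse_b = var_sse_residuals([seq_a, seq_b], lag=1)
```

With real latent features the residuals are not zero, and larger SSE values mark time steps that the pooled model predicts poorly, which is why the score serves as a surprisal proxy.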
- assign_sequence_ids()
Assign sequence identifiers to syllables based on temporal proximity.
Sequences are defined separately for each ‘source_file’. Within each file, syllables are sorted by ‘onset’ time, and a new sequence is started whenever the silent gap between the previous syllable’s ‘offset’ and the current syllable’s ‘onset’ exceeds the threshold ‘seq_bound’ in seconds.
Requirements
The DataFrame must contain:
- ‘source_file’ : identifier for the audio file.
- ‘onset’ : onset time (in seconds) for each syllable.
- ‘offset’ : offset time (in seconds) for each syllable.
The configuration must contain:
- ‘seq_bound’ : float, maximum allowed silent gap in seconds.
- Returns:
The current instance, returned to enable method chaining. A new column ‘seq_id’ is added to the DataFrame.
- Return type:
FeatureProcessor
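The gap rule described above can be sketched in pandas as follows. This is a minimal illustration, not the library's implementation: the helper name `assign_seq_ids` is hypothetical, and numbering the sequences globally across files (rather than restarting at zero in each file) is an assumption.

```python
import pandas as pd


def assign_seq_ids(df, seq_bound):
    """Start a new sequence whenever the silent gap exceeds `seq_bound` seconds."""
    out = df.sort_values(["source_file", "onset"]).copy()
    # Offset of the previous syllable in the same file (NaN for the first one).
    prev_offset = out.groupby("source_file")["offset"].shift()
    gap = out["onset"] - prev_offset
    # A sequence starts at the first syllable of a file or after a long silence.
    new_seq = gap.isna() | (gap > seq_bound)
    out["seq_id"] = new_seq.cumsum() - 1
    return out


# Hypothetical toy data: file "a" contains a 1.5 s silence, file "b" does not.
demo = pd.DataFrame({
    "source_file": ["a", "a", "a", "b", "b"],
    "onset": [0.0, 0.3, 2.0, 0.0, 0.2],
    "offset": [0.2, 0.5, 2.2, 0.1, 0.4],
})
labelled = assign_seq_ids(demo, seq_bound=0.5)
```

In the toy data, the long silence in file "a" splits it into two sequences, and file "b" forms a third one because sequences never span files.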
- compute_dtw_distance(seq_id_1, seq_id_2)
Compute Dynamic Time Warping (DTW) cosine distance between two sequences.
This method calculates the DTW distance between the latent feature sequences of two specified sequence IDs. It first computes a local distance matrix using the cosine distance between all pairs of units from the two sequences, and then uses this matrix to find the optimal alignment path cost with DTW.
- Parameters:
seq_id_1 (int) – The identifier for the first sequence.
seq_id_2 (int) – The identifier for the second sequence.
- Returns:
The total DTW distance (cost) between the two sequences.
- Return type:
float
- Raises:
ValueError – If ‘seq_id’ or feature columns are not found in the DataFrame, or if one or both of the specified seq_ids do not exist.
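The two-stage computation (local cosine distance matrix, then DTW alignment cost) can be sketched in NumPy. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the sketch assumes the classic step pattern (match, insertion, deletion) with no warping window, no path-length normalization, and no zero-norm feature vectors.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between the rows of `a` and the rows of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


def dtw_cost(local):
    """Total cost of the cheapest alignment path through a local distance matrix."""
    n, m = local.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best_prev
    return float(acc[n, m])


# Hypothetical toy sequences of 2-D latent features.
seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # seq_a with its first unit repeated
seq_c = [[0.0, 1.0], [0.0, 1.0]]

cost_ab = dtw_cost(cosine_distance_matrix(seq_a, seq_b))
cost_ac = dtw_cost(cosine_distance_matrix(seq_a, seq_c))
```

Because DTW may align one unit with several units of the other sequence, `seq_a` and `seq_b` align at zero cost even though their lengths differ, while `seq_a` and `seq_c` incur a cost for the one unit that cannot be matched.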
- compute_frequency_statistics(h5_path, return_traces=False)
Compute minimum, mean, and maximum frequency statistics for each unit.
This method loads spectrograms from the HDF5 file and calculates frequency statistics for each time bin of each unit. It always updates the internal DataFrame in place with summary statistics (global min, mean, max per unit), and can optionally return detailed per-time-bin traces.
- Parameters:
h5_path (str or Path) – Path to the HDF5 file containing the spectrograms dataset.
return_traces (bool, optional) – If True, returns a dictionary with detailed per-time-bin frequency traces for each unit. If False (default), returns the FeatureProcessor instance to support method chaining.
- Returns:
If return_traces is False (default), returns self after adding the new summary columns (‘min_freq’, ‘mean_freq’, ‘max_freq’, ‘time_bin_ms’) to self.df. If return_traces is True, returns a dictionary containing:
‘time_bins_info’: metadata about the time axis
‘units’: mapping from unit index (h5_index) to per-time-bin traces for ‘min_freq_trace’, ‘mean_freq_trace’, and ‘max_freq_trace’.
- Return type:
FeatureProcessor or dict
Notes
If the metadata contains a ‘max_unit_length_s’ column (added during segmentation), it is used to determine the effective duration represented by each spectrogram column. Otherwise, the method falls back to the configuration’s max unit length settings.
- plot_birch_sse_elbow(k_range)
Plot the sum of squared errors for a range of cluster counts in BIRCH clustering.
This method computes and visualizes the sum of squared errors (SSE) corresponding to different numbers of clusters in BIRCH clustering. It helps identify an appropriate number of clusters using the elbow method.
- Parameters:
k_range (iterable of int) – Iterable (for example, a list or range) of cluster counts ‘k’ to evaluate.
- Returns:
The current instance, returned to enable method chaining.
- Return type:
FeatureProcessor
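The quantity plotted in an elbow curve can be sketched as follows. This is an illustration under an assumption, not the library's implementation: SSE is taken here to be the within-cluster sum of squared distances to each cluster centroid, and the helper name `within_cluster_sse` is hypothetical. The labels are written by hand to keep the sketch dependency-free; in practice they would come from a clustering run for each ‘k’.

```python
import numpy as np


def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return float(total)


# Hypothetical 2-D embedding with two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
sse_k1 = within_cluster_sse(points, [0, 0, 0, 0])  # one cluster
sse_k2 = within_cluster_sse(points, [0, 0, 1, 1])  # two clusters
```

The sharp drop in SSE from one cluster to two, followed by smaller gains for larger ‘k’, is the "elbow" used to pick the number of clusters.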
- interactive_embedding_plot(h5_path, output_html_path, thumb_size=96, point_alpha=0.7, point_size=3)
Export a self-contained HTML file with an interactive embedding plot.
This method creates a standalone HTML file that uses Plotly.js in the browser (no Python backend required) to display the PaCMAP embedding. Hovering over points in the scatterplot updates a grid of pre-rendered spectrogram thumbnails for the focal unit and its nearest neighbors.
All required data (coordinates, neighbor indices, and spectrogram thumbnails encoded as base64 PNGs) are embedded directly into the HTML file so it can be shared and opened without any additional files.
- Parameters:
h5_path (str or Path) – Path to the HDF5 file containing the spectrograms dataset.
output_html_path (str or Path) – Path at which to save the resulting HTML file.
thumb_size (int, optional) – Approximate size (in pixels) of the square spectrogram thumbnails. The default is 96.
point_alpha (float, optional) – Opacity of the scatter points (0.0 to 1.0). The default is 0.7.
point_size (int, optional) – Size of the scatter points in pixels. The default is 3.
- Returns:
The current instance, returned to enable method chaining.
- Return type:
FeatureProcessor
- static_embedding_plot(h5_path, output_path=None, seed=42, focal_quantile=0.8, point_alpha=0.3, point_size=2, margin=0.02, zoom_padding=0.05, num_neighbors=3)
Create a publication-quality static plot of the embedding space.
This method generates a visualization that includes a 2D density map of the embedding and four “callouts” showing focal syllables and their nearest neighbors. Focal points are selected automatically from the fringes of each quadrant of the embedding space to ensure a representative sample of unique points. The plot is designed to be border-free with a seamless viridis background.
- Parameters:
h5_path (str or Path) – Path to the HDF5 file containing the spectrograms dataset.
output_path (str or Path, optional) – Path to save the final PNG image. If None, the plot is displayed directly using plt.show(). The default is None.
seed (int, optional) – Seed for the random number generator to ensure reproducible selection of focal points. The default is 42.
point_alpha (float, optional) – The alpha (transparency) of the scatter points in the background. The default is 0.3.
point_size (int or float, optional) – The size of the scatter points in the background. The default is 2.
margin (float, optional) – The margin from the plot edge to the nearest edge of a callout group, as a fraction of the plot’s total width/height. The default is 0.02.
zoom_padding (float, optional) – The padding to add around the data points as a fraction of the data’s range, effectively controlling the zoom level. The default is 0.05 (5%).
num_neighbors (int, optional) – The number of nearest neighbors to display for each focal point. The default is 3.