Audio Features
Beyond spectrograms, the library provides common audio features for music and speech analysis. These are extensions built on top of the core spectrogram computations.
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are widely used in speech recognition and audio classification.
Basic Usage
import spectrograms as sg
import numpy as np
# Generate or load audio
samples = np.random.randn(16000) # 1 second at 16 kHz
# Configure parameters
stft = sg.StftParams(n_fft=512, hop_size=160, window=sg.WindowType.hanning)
mfcc_params = sg.MfccParams(n_mfcc=13)
# Compute MFCCs
mfccs = sg.compute_mfcc(
samples,
stft_params=stft,
sample_rate=16000,
n_mels=40,
mfcc_params=mfcc_params
)
print(f"Shape: {mfccs.shape}") # (13, n_frames)
Standard Configuration
For speech recognition:
mfcc_params = sg.MfccParams.speech_standard() # 13 coefficients
Custom Configuration
# More coefficients for detailed analysis
mfcc_params = sg.MfccParams(n_mfcc=20)
The first coefficient (C0) represents energy. Coefficients 1-12 capture spectral envelope.
Chromagram
Chromagrams represent pitch class energy, useful for music analysis and chord recognition.
Basic Usage
import spectrograms as sg
# Configure chroma parameters
chroma_params = sg.ChromaParams(
tuning=440.0, # A4 reference frequency
f_min=32.7, # C1
f_max=4186.0, # C8
norm="l2" # Normalize each frame
)
stft = sg.StftParams(n_fft=4096, hop_size=1024, window=sg.WindowType.hanning)
# Compute chromagram
chroma = sg.compute_chromagram(
samples,
stft_params=stft,
sample_rate=44100,
chroma_params=chroma_params
)
print(f"Shape: {chroma.shape}") # (12, n_frames)
Output
Returns a (12, n_frames) array where each row corresponds to a pitch class:
Row 0: C
Row 1: C#/Db
Row 2: D
…
Row 11: B
Standard Configuration
chroma_params = sg.ChromaParams.music_standard()
Normalization Options
# L2 normalization (unit length)
chroma_params = sg.ChromaParams(norm="l2")
# L1 normalization (sum to 1)
chroma_params = sg.ChromaParams(norm="l1")
# Max normalization (peak at 1)
chroma_params = sg.ChromaParams(norm="max")
# No normalization
chroma_params = sg.ChromaParams(norm=None)
Constant-Q Transform (CQT)
CQT provides logarithmically-spaced frequency bins, matching musical scales.
Basic Usage
import spectrograms as sg
# Configure CQT
cqt_params = sg.CqtParams(
bins_per_octave=12, # Semitone resolution
n_octaves=7, # 7 octaves
f_min=55.0 # A1
)
# Compute CQT
cqt = sg.compute_cqt(
samples,
sample_rate=22050,
cqt_params=cqt_params,
hop_size=512
)
print(f"Shape: {cqt.shape}") # (num_bins, n_frames)
print(f"Complex output: {cqt.dtype}") # complex128
Output
Returns a complex-valued array. To get magnitude:
magnitude = np.abs(cqt)
power = np.abs(cqt) ** 2
db = 20 * np.log10(np.abs(cqt) + 1e-10)
Configuration
# High frequency resolution (36 bins per octave, third-tones)
cqt_params = sg.CqtParams(bins_per_octave=36, n_octaves=7, f_min=55.0)
# Standard (12 bins per octave, semitones)
cqt_params = sg.CqtParams(bins_per_octave=12, n_octaves=7, f_min=55.0)
The total number of bins is bins_per_octave * n_octaves.
Applications
Speech Recognition
# Standard ASR features
mfcc_params = sg.MfccParams.speech_standard()
stft = sg.StftParams(n_fft=512, hop_size=160, window=sg.WindowType.hamming)
mfccs = sg.compute_mfcc(samples, stft, 16000, 40, mfcc_params)
Music Analysis
# Chord recognition
chroma_params = sg.ChromaParams.music_standard()
stft = sg.StftParams(n_fft=4096, hop_size=1024, window=sg.WindowType.hanning)
chroma = sg.compute_chromagram(samples, stft, 44100, chroma_params)
Audio Classification
# Mel spectrograms for neural networks
mel_params = sg.MelParams(n_mels=128, f_min=0.0, f_max=8000.0)
db_params = sg.LogParams(floor_db=-80.0)
mel_spec = sg.compute_mel_db_spectrogram(samples, params, mel_params, db_params)