Choosing Parameters
Selecting the right parameters is crucial for spectrogram quality and performance.
STFT Parameters
FFT Size (n_fft)
Controls frequency resolution and time resolution trade-off:
Larger values (2048, 4096): Better frequency resolution, poorer time resolution
Smaller values (512, 256): Better time resolution, poorer frequency resolution
Recommendations:
Speech: 512
Music: 2048
General audio: 1024
Hop Size
Number of samples between successive frames:
Smaller hop: Better time resolution, more computation
Larger hop: Faster computation, coarser time resolution
Common ratios:
hop_size = n_fft / 4(75% overlap) - standard for speechhop_size = n_fft / 2(50% overlap) - good balance
Window Function
Affects spectral leakage:
"hanning": General purpose, good sidelobe suppression"hamming": Similar to Hanning, slightly different characteristics"blackman": Excellent sidelobe suppression, wider main lobe"kaiser=5.0": Adjustable (higher beta = less leakage, wider main lobe)
Centering
When centre=True, frames are centered by padding:
First frame centered at
t=0Last frame centered at end of signal
Recommended for most applications
When False, no padding is applied (useful for streaming).
Default Configurations
The library provides sensible defaults:
Speech Processing
import spectrograms as sg
params = sg.SpectrogramParams.speech_default(sample_rate=16000)
# Uses: n_fft=512, hop_size=160, Hanning window, centre=True
Music Processing
import spectrograms as sg
params = sg.SpectrogramParams.music_default(sample_rate=44100)
# Uses: n_fft=2048, hop_size=512, Hanning window, centre=True
Mel Scale Parameters
Number of Mel Bands
Speech recognition: 40-80 bands
Music analysis: 80-128 bands
General audio: 64 bands
Frequency Range
Set based on your signal content:
# Full range (0 Hz to Nyquist)
mel_params = sg.MelParams(n_mels=80, f_min=0.0, f_max=sample_rate/2)
# Speech range (common human voice frequencies)
mel_params = sg.MelParams(n_mels=40, f_min=80.0, f_max=8000.0)
# Music range
mel_params = sg.MelParams(n_mels=128, f_min=20.0, f_max=20000.0)
Decibel Conversion
The floor parameter clips low values:
# Standard for visualization
db_params = sg.LogParams(floor_db=-80.0)
# Higher floor for very quiet signals
db_params = sg.LogParams(floor_db=-60.0)
ERB Scale
ERB (Equivalent Rectangular Bandwidth) models human auditory perception:
# Good for psychoacoustic applications
erb_params = sg.ErbParams(
n_filters=32,
f_min=50.0,
f_max=8000.0
)
Performance Considerations
Memory Usage
Memory scales with:
n_fft: Larger FFT = more memorySignal length /
hop_size: More frames = more memory
Computation Time
Factors affecting speed:
FFT size (larger = slower)
Number of frames (signal length / hop size)
FFT backend (FFTW is fastest)
For batch processing, use the Batch Processing to reuse FFT plans.