Choosing Parameters

Selecting the right parameters is crucial for spectrogram quality and performance.

STFT Parameters

FFT Size (n_fft)

Controls the trade-off between frequency resolution and time resolution:

Larger values (2048, 4096): Better frequency resolution, poorer time resolution
Smaller values (512, 256): Better time resolution, poorer frequency resolution

Recommendations:

Speech: 512
Music: 2048
General audio: 1024
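
The trade-off can be put in numbers: adjacent FFT bins are sample_rate / n_fft Hz apart, and each frame spans n_fft / sample_rate seconds. A minimal sketch in plain Python follows; the helper names are hypothetical and are not part of the library.

```python
def bin_spacing_hz(sample_rate, n_fft):
    """Spacing between adjacent FFT bins in Hz."""
    return sample_rate / n_fft


def frame_duration_ms(sample_rate, n_fft):
    """Length of one analysis frame in milliseconds."""
    return 1000.0 * n_fft / sample_rate


# Compare a few FFT sizes at a 16 kHz sample rate.
for n_fft in (256, 512, 1024, 2048, 4096):
    spacing = bin_spacing_hz(16000, n_fft)
    duration = frame_duration_ms(16000, n_fft)
    print(f"n_fft={n_fft:5d}  bin spacing={spacing:8.3f} Hz  frame={duration:7.2f} ms")
```

At 16 kHz, n_fft=512 gives bins 31.25 Hz apart and frames 32 ms long; doubling n_fft halves the bin spacing and doubles the frame length.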

Hop Size

Number of samples between successive frames:

Smaller hop: Better time resolution, more computation
Larger hop: Faster computation, coarser time resolution

Common ratios:

hop_size = n_fft / 4 (75% overlap) - standard for speech
hop_size = n_fft / 2 (50% overlap) - good balance
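
The overlap percentages above follow from overlap = 1 - hop_size / n_fft. The sketch below computes the overlap and the frame step in milliseconds; the function names are illustrative only and do not come from the library.

```python
def overlap_percent(n_fft, hop_size):
    """Percentage of each frame shared with the next one."""
    return 100.0 * (1.0 - hop_size / n_fft)


def hop_duration_ms(sample_rate, hop_size):
    """Time step between successive frames in milliseconds."""
    return 1000.0 * hop_size / sample_rate


print(overlap_percent(512, 128))     # n_fft / 4
print(overlap_percent(2048, 1024))   # n_fft / 2
print(hop_duration_ms(16000, 160))   # frame step of a 160-sample hop at 16 kHz
```

A hop that is not a simple fraction of n_fft works the same way: n_fft=512 with hop_size=160 corresponds to 68.75% overlap and a 10 ms step at 16 kHz.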

Window Function

The window function controls spectral leakage by trading sidelobe suppression against main-lobe width:

"hanning": General purpose, good sidelobe suppression
"hamming": Similar to Hanning, with a lower first sidelobe but slower sidelobe roll-off
"blackman": Excellent sidelobe suppression, wider main lobe
"kaiser=5.0": Adjustable (higher beta = less leakage, wider main lobe)

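To compare the windows listed above, one option is to measure each window's highest sidelobe relative to its main-lobe peak. The sketch below uses NumPy's own window functions rather than the library's implementations, so the values are only indicative; peak_sidelobe_db is a hypothetical helper.

```python
import numpy as np


def peak_sidelobe_db(window, pad_factor=64):
    """Highest sidelobe level in dB relative to the main-lobe peak."""
    # Zero-pad so the window's spectrum is sampled finely.
    spectrum = np.abs(np.fft.rfft(window, len(window) * pad_factor))
    spectrum_db = 20.0 * np.log10(spectrum / spectrum[0] + 1e-12)
    # The main lobe ends where the magnitude first stops falling.
    first_null = int(np.argmax(np.diff(spectrum_db) > 0))
    return float(spectrum_db[first_null:].max())


n = 512
windows = {
    "hanning": np.hanning(n),
    "hamming": np.hamming(n),
    "blackman": np.blackman(n),
    "kaiser=5.0": np.kaiser(n, 5.0),
}
sidelobes = {name: peak_sidelobe_db(w) for name, w in windows.items()}
for name, level in sidelobes.items():
    print(f"{name:>12}: {level:6.1f} dB")
```

A more negative value means less leakage into distant bins, at the cost of a wider main lobe.
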
Centering

When centre=True, frames are centered by padding:

First frame centered at t=0
Last frame centered at end of signal
Recommended for most applications

When centre=False, no padding is applied and the first frame starts at the first sample (useful for streaming).
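
The effect of centering on frame positions and frame count can be sketched with NumPy. This is not the library's implementation: the padding amount (n_fft // 2 per side) and the reflect padding mode are assumptions borrowed from common STFT conventions, and frame_signal is a hypothetical helper.

```python
import numpy as np


def frame_signal(signal, n_fft, hop_size, centre=True):
    """Split a 1-D signal into overlapping frames of length n_fft."""
    if centre:
        # Assumed padding: reflect n_fft // 2 samples at each end.
        signal = np.pad(signal, n_fft // 2, mode="reflect")
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - n_fft) // hop_size
    starts = hop_size * np.arange(n_frames)
    # Row i holds samples starts[i] .. starts[i] + n_fft - 1.
    index = starts[:, None] + np.arange(n_fft)[None, :]
    return signal[index]


signal = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
centred = frame_signal(signal, n_fft=512, hop_size=160, centre=True)
uncentred = frame_signal(signal, n_fft=512, hop_size=160, centre=False)
print(centred.shape, uncentred.shape)
```

Under these assumptions, the middle sample of the first centred frame is the first sample of the signal, and centering yields a few more frames than the unpadded case.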

Default Configurations

The library provides sensible defaults:

Speech Processing

import spectrograms as sg

params = sg.SpectrogramParams.speech_default(sample_rate=16000)
# Uses: n_fft=512, hop_size=160, Hanning window, centre=True

Music Processing

import spectrograms as sg

params = sg.SpectrogramParams.music_default(sample_rate=44100)
# Uses: n_fft=2048, hop_size=512, Hanning window, centre=True

Mel Scale Parameters

Number of Mel Bands

Speech recognition: 40-80 bands
Music analysis: 80-128 bands
General audio: 64 bands

Frequency Range

Set based on your signal content:

import spectrograms as sg

sample_rate = 44100  # example value; use your signal's sample rate

# Full range (0 Hz to Nyquist)
mel_params = sg.MelParams(n_mels=80, f_min=0.0, f_max=sample_rate / 2)

# Speech range (common human voice frequencies)
mel_params = sg.MelParams(n_mels=40, f_min=80.0, f_max=8000.0)

# Music range (f_max must not exceed Nyquist, so this needs a sample rate of at least 40 kHz)
mel_params = sg.MelParams(n_mels=128, f_min=20.0, f_max=20000.0)
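
To see how n_mels, f_min and f_max interact, it can help to compute where the band centres fall. The sketch below assumes the HTK-style mel formula, mel = 2595 * log10(1 + f / 700); the library may use a different variant (for example the Slaney scale), and the helper names are hypothetical.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """HTK-style mel scale (assumed; the library may differ)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


def mel_centre_frequencies(n_mels, f_min, f_max):
    """Centre frequencies (Hz) of n_mels bands equally spaced in mel."""
    # n_mels triangular filters need n_mels + 2 edge points.
    edges_mel = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)
    return mel_to_hz(edges_mel[1:-1])


centres = mel_centre_frequencies(n_mels=40, f_min=80.0, f_max=8000.0)
print(np.round(centres[:5], 1))   # closely spaced at low frequencies
print(np.round(centres[-5:], 1))  # widely spaced at high frequencies
```

Narrowing the frequency range or raising n_mels packs the bands more densely, which is why the speech configuration above gets by with fewer bands than the music one.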

Decibel Conversion

The floor_db parameter sets the lowest decibel value that is kept; anything below it is clipped to the floor:

import spectrograms as sg

# Standard for visualization
db_params = sg.LogParams(floor_db=-80.0)

# Higher floor for very quiet signals
db_params = sg.LogParams(floor_db=-60.0)
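
The clipping behaviour can be illustrated with a few lines of NumPy. This is a sketch only: it assumes the floor is applied to decibel values measured relative to a reference power of 1.0, which may not match how the library defines its reference, and power_to_db is a hypothetical helper.

```python
import numpy as np


def power_to_db(power, floor_db=-80.0, ref=1.0):
    """Convert power values to dB relative to ref, clipped at floor_db."""
    tiny = np.finfo(float).tiny  # avoids taking log10 of zero
    db = 10.0 * np.log10(np.maximum(power, tiny) / ref)
    return np.maximum(db, floor_db)


power = np.array([1.0, 1e-3, 1e-7, 0.0])
print(power_to_db(power, floor_db=-80.0))
print(power_to_db(power, floor_db=-60.0))
```

With a floor of -60 dB, the 1e-7 entry (which sits at -70 dB) is clipped to -60, whereas a floor of -80 dB leaves it untouched.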

ERB Scale

ERB (Equivalent Rectangular Bandwidth) models human auditory perception:

import spectrograms as sg

# Good for psychoacoustic applications
erb_params = sg.ErbParams(
    n_filters=32,
    f_min=50.0,
    f_max=8000.0
)
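
For a sense of where the filters land, the sketch below spaces centre frequencies evenly on the ERB-rate scale using the Glasberg and Moore formulas. Whether the library uses exactly these formulas, or places filters at f_min and f_max themselves, is an assumption; the helper names are hypothetical.

```python
import numpy as np


def erb_bandwidth_hz(freq_hz):
    """Equivalent rectangular bandwidth at freq_hz (Glasberg and Moore)."""
    return 24.7 * (4.37 * np.asarray(freq_hz) / 1000.0 + 1.0)


def hz_to_erb_rate(freq_hz):
    """Map frequency in Hz to the ERB-rate scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * np.asarray(freq_hz))


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(erb_rate) / 21.4) - 1.0) / 0.00437


def erb_centre_frequencies(n_filters, f_min, f_max):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    points = np.linspace(hz_to_erb_rate(f_min), hz_to_erb_rate(f_max), n_filters)
    return erb_rate_to_hz(points)


erb_centres = erb_centre_frequencies(n_filters=32, f_min=50.0, f_max=8000.0)
for fc in erb_centres[::8]:
    print(f"centre={fc:8.1f} Hz  bandwidth={float(erb_bandwidth_hz(fc)):7.1f} Hz")
```

Filters are narrow and closely spaced at low frequencies and become wider and sparser toward f_max.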

Performance Considerations

Memory Usage

Memory scales with:

n_fft: Larger FFT = more memory
Signal length / hop_size: More frames = more memory
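
A rough size estimate for the output follows from these two factors. The sketch assumes a one-sided spectrum (n_fft // 2 + 1 bins), centred framing, and 8-byte floating-point values; the library's actual storage type and working buffers are not covered, and spectrogram_bytes is a hypothetical helper.

```python
def spectrogram_bytes(n_samples, n_fft, hop_size, bytes_per_value=8):
    """Approximate size in bytes of a linear-frequency spectrogram."""
    n_frames = 1 + n_samples // hop_size  # centred framing assumed
    n_bins = n_fft // 2 + 1               # one-sided spectrum
    return n_frames * n_bins * bytes_per_value


ten_seconds = 10 * 16000  # 10 s of audio at 16 kHz
for n_fft, hop_size in ((512, 160), (2048, 512)):
    size_mb = spectrogram_bytes(ten_seconds, n_fft, hop_size) / 1e6
    print(f"n_fft={n_fft}, hop_size={hop_size}: about {size_mb:.2f} MB")
```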

Computation Time

Factors affecting speed:

FFT size (larger = slower)
Number of frames (signal length / hop size)
FFT backend (FFTW is typically the fastest)
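
The first two factors can be combined into a back-of-the-envelope cost model: each frame costs on the order of n_fft * log2(n_fft) operations. The sketch below is only a relative measure for comparing settings, not a timing prediction, and relative_fft_cost is a hypothetical helper.

```python
import math


def relative_fft_cost(n_samples, n_fft, hop_size):
    """Unitless cost estimate: frames times n_fft * log2(n_fft)."""
    n_frames = 1 + n_samples // hop_size  # centred framing assumed
    return n_frames * n_fft * math.log2(n_fft)


baseline = relative_fft_cost(160000, 512, 160)
for n_fft, hop_size in ((512, 160), (512, 80), (2048, 160), (2048, 512)):
    ratio = relative_fft_cost(160000, n_fft, hop_size) / baseline
    print(f"n_fft={n_fft:5d} hop_size={hop_size:4d}  relative cost={ratio:5.2f}")
```

Halving the hop size roughly doubles the estimate, since it doubles the number of frames.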

For batch processing, see Batch Processing for how to reuse FFT plans.