Choosing Parameters

Selecting the right parameters is crucial for spectrogram quality and performance.

STFT Parameters

FFT Size (n_fft)

Sets the trade-off between frequency resolution and time resolution:

  • Larger values (2048, 4096): Better frequency resolution, poorer time resolution

  • Smaller values (512, 256): Better time resolution, poorer frequency resolution

Recommendations:

  • Speech: 512

  • Music: 2048

  • General audio: 1024
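The trade-off can be put in numbers: the spacing between frequency bins is sample_rate / n_fft, and one analysis frame spans n_fft / sample_rate seconds. The helper below is a minimal sketch of that arithmetic; stft_resolution is a hypothetical name for illustration and not part of the library.

```python
def stft_resolution(sample_rate: int, n_fft: int) -> tuple[float, float]:
    """Return (bin spacing in Hz, frame duration in ms) for a given FFT size."""
    bin_spacing_hz = sample_rate / n_fft        # smaller is finer frequency detail
    frame_ms = 1000.0 * n_fft / sample_rate     # smaller is finer time detail
    return bin_spacing_hz, frame_ms


# Compare the FFT sizes mentioned above at a 16 kHz sample rate.
for n_fft in (256, 512, 1024, 2048, 4096):
    hz, ms = stft_resolution(16000, n_fft)
    print(f"n_fft={n_fft:5d}: {hz:7.2f} Hz per bin, {ms:6.1f} ms per frame")
```

Doubling n_fft halves the bin spacing and doubles the frame duration, which is why no single value suits every signal.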

Hop Size

Number of samples between successive frames:

  • Smaller hop: Better time resolution, more computation

  • Larger hop: Faster computation, coarser time resolution

Common ratios:

  • hop_size = n_fft / 4 (75% overlap) - standard for speech

  • hop_size = n_fft / 2 (50% overlap) - good balance
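The overlap percentages follow directly from the ratio of hop size to FFT size. As a rough sketch (hop_stats is a hypothetical helper, not a library function), the overlap and the time step between frames can be computed like this:

```python
def hop_stats(n_fft: int, hop_size: int, sample_rate: int) -> tuple[float, float]:
    """Return (overlap as a fraction of the frame, hop duration in ms)."""
    overlap = 1.0 - hop_size / n_fft
    hop_ms = 1000.0 * hop_size / sample_rate
    return overlap, hop_ms


# n_fft / 4 and n_fft / 2 give the 75% and 50% overlaps listed above.
print(hop_stats(2048, 512, 44100))
print(hop_stats(2048, 1024, 44100))

# A hop of 160 samples at 16 kHz is a 10 ms step, a common choice for speech.
print(hop_stats(512, 160, 16000))
```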

Window Function

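The window function controls spectral leakage, trading sidelobe suppression against main-lobe width:

  • "hanning": General purpose, good sidelobe suppression

  • "hamming": Similar to Hanning, with a lower first sidelobe but slower sidelobe decay further from the main lobe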

  • "blackman": Excellent sidelobe suppression, wider main lobe

  • "kaiser=5.0": Adjustable (higher beta = less leakage, wider main lobe)
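To make the difference between the first two windows concrete, here is a small sketch that builds them from their textbook definitions. It assumes the symmetric form of each window; the library may use the periodic form or different normalisation, so treat the values as illustrative rather than as the library's exact output.

```python
import math


def hann(n: int) -> list[float]:
    """Symmetric Hann (Hanning) window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def hamming(n: int) -> list[float]:
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


# Hann tapers all the way to zero at the edges; Hamming stops at 0.08.
print(hann(9)[0], hann(9)[4])
print(hamming(9)[0], hamming(9)[4])
```

The non-zero edge of the Hamming window is what lowers its first sidelobe while making the remaining sidelobes fall off more slowly.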

Centering

When centre=True, the signal is padded so that each frame is centered on its timestamp:

  • First frame centered at t=0

  • Last frame centered at end of signal

  • Recommended for most applications

When centre=False, no padding is applied (useful for streaming).
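Centering also changes how many frames a signal produces. The sketch below assumes a common convention in which centering pads n_fft // 2 samples on each side and only complete frames are kept; frame_count is a hypothetical helper, and the library's exact frame count may differ.

```python
def frame_count(n_samples: int, n_fft: int, hop_size: int, centre: bool = True) -> int:
    """Number of complete frames under the assumed padding convention."""
    if centre:
        # Assumed: n_fft // 2 samples of padding at the start and at the end.
        n_samples += 2 * (n_fft // 2)
    if n_samples < n_fft:
        return 0
    return 1 + (n_samples - n_fft) // hop_size


# One second of audio at 16 kHz with n_fft=512 and hop_size=160.
print(frame_count(16000, 512, 160, centre=True))
print(frame_count(16000, 512, 160, centre=False))
```

Without padding, the first frame can only start once n_fft samples are available, which is why the uncentered count is lower and why that mode maps naturally onto streaming input.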

Default Configurations

The library provides sensible defaults:

Speech Processing

import spectrograms as sg

params = sg.SpectrogramParams.speech_default(sample_rate=16000)
# Uses: n_fft=512, hop_size=160, Hanning window, centre=True

Music Processing

import spectrograms as sg

params = sg.SpectrogramParams.music_default(sample_rate=44100)
# Uses: n_fft=2048, hop_size=512, Hanning window, centre=True

Mel Scale Parameters

Number of Mel Bands

  • Speech recognition: 40-80 bands

  • Music analysis: 80-128 bands

  • General audio: 64 bands

Frequency Range

Set based on your signal content:

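import spectrograms as sg

sample_rate = 44100  # example value; use your signal's sample rate

# Full range (0 Hz to Nyquist)
mel_params = sg.MelParams(n_mels=80, f_min=0.0, f_max=sample_rate / 2)

# Speech range (common human voice frequencies); f_max must not exceed Nyquist
mel_params = sg.MelParams(n_mels=40, f_min=80.0, f_max=8000.0)

# Music range; f_max must not exceed Nyquist
mel_params = sg.MelParams(n_mels=128, f_min=20.0, f_max=20000.0)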
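To see how the band count and frequency range interact, it can help to compute where the band centres fall. The sketch below assumes the HTK-style mel formula, mel = 2595 * log10(1 + f / 700); the library may use a different variant (for example the Slaney formula), so the numbers are indicative only. All function names here are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK-style Hz to mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(n_mels: int, f_min: float, f_max: float) -> list[float]:
    """Centre frequencies in Hz of n_mels bands evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


centres = mel_band_centres(n_mels=40, f_min=80.0, f_max=8000.0)
# Bands are packed densely at low frequencies and spread out at high ones.
print([round(c) for c in centres[:3]], [round(c) for c in centres[-3:]])
```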

Decibel Conversion

The floor_db parameter clips values that fall below the given decibel floor:

# Standard for visualization
db_params = sg.LogParams(floor_db=-80.0)

# Higher floor for very quiet signals
db_params = sg.LogParams(floor_db=-60.0)
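The effect of the floor can be illustrated with a plain power-to-decibel conversion. This sketch assumes an absolute floor measured against a reference power of 1.0; the library may instead apply the floor relative to the peak value, so check its documentation before relying on exact numbers. power_to_db is a hypothetical helper.

```python
import math


def power_to_db(power: float, floor_db: float = -80.0) -> float:
    """Convert a power value to dB (reference 1.0) and clip at floor_db."""
    if power <= 0.0:
        # log10 is undefined for non-positive input, so return the floor.
        return floor_db
    return max(10.0 * math.log10(power), floor_db)


for p in (1.0, 1e-5, 1e-7, 1e-12, 0.0):
    print(p, power_to_db(p, floor_db=-80.0), power_to_db(p, floor_db=-60.0))
```

With floor_db=-60.0, everything quieter than a power of 1e-6 collapses to the same value, which reduces the visible dynamic range but hides low-level noise.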

ERB Scale

ERB (Equivalent Rectangular Bandwidth) models human auditory perception:

# Good for psychoacoustic applications
erb_params = sg.ErbParams(
    n_filters=32,
    f_min=50.0,
    f_max=8000.0
)
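For intuition about what these parameters imply, the sketch below uses the widely cited Glasberg and Moore approximation, ERB(f) = 24.7 * (4.37 * f / 1000 + 1), and spaces filter centres evenly on the corresponding ERB-rate scale. This is an assumption about a typical ERB filterbank layout, not a description of the library's implementation; all names are hypothetical.

```python
import math


def erb_bandwidth(f_hz: float) -> float:
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def hz_to_erb_rate(f_hz: float) -> float:
    """Position on the ERB-rate scale for a frequency in Hz."""
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_rate_to_hz(erb_rate: float) -> float:
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def erb_centres(n_filters: int, f_min: float, f_max: float) -> list[float]:
    """Centre frequencies evenly spaced on the ERB-rate scale, endpoints included."""
    lo, hi = hz_to_erb_rate(f_min), hz_to_erb_rate(f_max)
    step = (hi - lo) / (n_filters - 1)
    return [erb_rate_to_hz(lo + step * i) for i in range(n_filters)]


centres = erb_centres(n_filters=32, f_min=50.0, f_max=8000.0)
for fc in (centres[0], centres[15], centres[-1]):
    print(f"centre {fc:8.1f} Hz, bandwidth {erb_bandwidth(fc):7.1f} Hz")
```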

Performance Considerations

Memory Usage

Memory scales with:

  • n_fft: Larger FFT = more memory

  • Signal length / hop_size: More frames = more memory
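A back-of-the-envelope estimate makes the scaling concrete. The sketch below assumes a one-sided spectrum with n_fft // 2 + 1 bins per frame, centered framing with 1 + n_samples // hop_size frames, and 8-byte float values; the library's actual storage layout and element type may differ. spectrogram_bytes is a hypothetical helper.

```python
def spectrogram_bytes(n_samples: int, n_fft: int, hop_size: int,
                      bytes_per_value: int = 8) -> int:
    """Estimated size in bytes of one magnitude or power spectrogram."""
    n_bins = n_fft // 2 + 1              # assumed one-sided spectrum
    n_frames = 1 + n_samples // hop_size  # assumed centered framing
    return n_bins * n_frames * bytes_per_value


one_minute = 44100 * 60
for hop in (1024, 512, 256):
    mib = spectrogram_bytes(one_minute, 2048, hop) / 2**20
    print(f"hop_size={hop:4d}: about {mib:.1f} MiB")
```

Halving the hop size roughly doubles the estimate, as does doubling n_fft at a fixed hop size.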

Computation Time

Factors affecting speed:

  1. FFT size (larger = slower)

  2. Number of frames (signal length / hop size)

  3. FFT backend (FFTW is fastest)
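The first two factors can be compared with a rough operation count. The sketch below assumes each frame costs on the order of n_fft * log2(n_fft) operations, the usual FFT scaling, and ignores windowing, backend differences, and constant factors, so it is only useful for comparing settings against each other. relative_stft_cost is a hypothetical helper.

```python
import math


def relative_stft_cost(n_samples: int, n_fft: int, hop_size: int) -> float:
    """Unitless cost estimate: number of frames times n_fft * log2(n_fft)."""
    n_frames = 1 + n_samples // hop_size
    return n_frames * n_fft * math.log2(n_fft)


ten_seconds = 44100 * 10
baseline = relative_stft_cost(ten_seconds, 2048, 512)
for n_fft, hop in ((2048, 512), (2048, 256), (4096, 1024), (4096, 512)):
    ratio = relative_stft_cost(ten_seconds, n_fft, hop) / baseline
    print(f"n_fft={n_fft}, hop_size={hop}: {ratio:.2f}x baseline")
```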

For batch processing, see Batch Processing for how to reuse FFT plans.