Audio basics and coding principles

1. Basic concepts

1) Bit rate: the number of bits per second needed to represent the encoded (compressed) audio data, usually expressed in kbps.

2) Loudness and intensity: Intensity is a physical property of a sound, while loudness is a subjective attribute that describes how loud the sound is perceived to be. Loudness varies mainly with intensity, but it is also affected by frequency. Generally speaking, pure mid-frequency tones sound louder than pure low-frequency or high-frequency tones of the same intensity.

3) Sampling and sampling rate: Sampling converts a continuous-time signal into a discrete-time signal. The sampling rate is the number of samples taken per second.

Nyquist sampling theorem: When the sampling rate is greater than twice the highest frequency component of the continuous signal, the sampled signal can be used to perfectly reconstruct the original continuous signal.
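
As a rough illustration of why the sampling condition matters, the sketch below shows what happens when it is violated: a tone above half the sampling rate produces the same samples (up to sign) as a lower tone, so the two cannot be told apart after sampling. The function names and the 8 kHz rate are illustrative choices, not part of the original text.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, num_samples):
    """Return num_samples samples of a unit-amplitude sine at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

def alias_frequency(freq_hz, sample_rate_hz):
    """Frequency in [0, fs/2] at which freq_hz appears after sampling."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

fs = 8000  # assumed sampling rate, so fs/2 = 4000 Hz

# 3 kHz is below fs/2 and is preserved.
print(alias_frequency(3000, fs))  # 3000
# 5 kHz is above fs/2 and folds back onto 3 kHz.
print(alias_frequency(5000, fs))  # 3000

# The 5 kHz samples are exactly the negated 3 kHz samples.
low = sample_sine(3000, fs, 16)
high = sample_sine(5000, fs, 16)
print(all(abs(a + b) < 1e-9 for a, b in zip(low, high)))  # True
```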

2. Common audio formats

1) WAV is a sound file format developed by Microsoft and IBM, also called a wave sound file. It is one of the earliest digital audio formats and is widely supported by the Windows platform and its applications. WAV files usually store uncompressed PCM audio, so they are large.

2) MIDI is the abbreviation of Musical Instrument Digital Interface, a unified international standard for digital music and electronic musical instruments. It defines how computer music programs, digital synthesizers, and other electronic devices exchange musical signals, and it specifies the cables, hardware, and data transmission protocol used to connect electronic instruments from different manufacturers to computers. It can represent the sounds of many musical instruments. A MIDI file stores a sequence of commands rather than recorded audio. When these commands are sent to the sound card, the sound card synthesizes the sound according to them.

3) The full name of MP3 is MPEG-1 Audio Layer 3, which was incorporated into the MPEG specification in 1992. MP3 can compress digital audio to a low bit rate while keeping high sound quality, and it is the most widely used format.

4) MP3Pro was developed by the Swedish company Coding Technologies. It combines two major technologies: a proprietary coding technology from Coding Technologies, and a decoding technology developed jointly by the MP3 patent holders Thomson Multimedia (France) and the Fraunhofer Institute for Integrated Circuits (Germany). MP3Pro improves on the sound quality of the original MP3 while leaving the file size essentially unchanged, so it can preserve as much of the pre-compression sound quality as possible while compressing audio at a lower bit rate.

6) WMA (Windows Media Audio) is Microsoft's flagship format in the field of Internet audio and video. WMA achieves a higher compression ratio by reducing the data rate while maintaining sound quality; the compression ratio can generally reach 1:18. In addition, WMA can protect copyright through DRM (Digital Rights Management).

7) RealAudio is a file format launched by RealNetworks. Its most notable feature is that it can transmit audio in real time; even on a slow network connection it can still deliver data smoothly, so RealAudio is mainly suited to online playback over the network. The current RealAudio file formats mainly include RA (RealAudio), RM (RealMedia, RealAudio G2), and RMX (RealAudio Secured). What these formats have in common is that the sound quality adapts to the available network bandwidth: most listeners hear smooth playback, while listeners with more bandwidth get better sound quality.

8) Audible has four different formats: Audible 1, 2, 3, and 4. The Audible.com website mainly sells audiobooks over the Internet and protects the files it sells by using one of these four dedicated Audible.com audio formats. Each format is chosen mainly according to the audio source and the listening device. Formats 1, 2, and 3 use different levels of speech compression, while format 4 uses a lower sampling rate and the same decoding method as MP3; the resulting speech is clearer and can be downloaded more efficiently from the Internet. Audible provides its own desktop playback tool, Audible Manager, which can play Audible files stored on a PC or transferred to a portable player.

9) AAC is the abbreviation of Advanced Audio Coding, an audio format jointly developed by Fraunhofer IIS-A, Dolby, and AT&T as part of the MPEG-2 specification. AAC uses a different algorithm from MP3 and combines additional techniques to improve coding efficiency, so its compression capability far exceeds that of earlier algorithms such as MP3. It also supports up to 48 audio channels, 15 low-frequency channels, more sampling rates and bit rates, multi-language compatibility, and higher decoding efficiency. In short, AAC can provide better sound quality at file sizes about 30% smaller than MP3.

10) Ogg Vorbis is an audio compression format similar to existing music formats such as MP3. One difference is that it is completely free, open, and unencumbered by patents. Vorbis is the name of the audio compression scheme, and Ogg is the name of a project that aims to design a completely open multimedia system. Vorbis is also lossy, but it uses a more advanced psychoacoustic model to reduce the audible loss. As a result, Ogg Vorbis generally sounds better than MP3 at the same bit rate.

11) APE is a lossless compressed audio format. Without any reduction in sound quality, it compresses a file to roughly half the size of the corresponding uncompressed WAV file.

12) FLAC is the abbreviation of Free Lossless Audio Codec, a well-known free codec for lossless audio compression.

3. The basic principle of audio coding

Speech coding is dedicated to reducing the channel bandwidth required for transmission while maintaining the high quality of the input speech.

The goal of speech coding is to design a low-complexity encoder to achieve high-quality data transmission at the lowest possible bit rate.

1) Threshold in quiet (absolute threshold of hearing) curve: the minimum level at which the human ear can hear a sound at each frequency in a quiet environment.

2) Critical frequency band

Because the human ear has different resolutions at different frequencies, MPEG-1 Audio divides the perceptible frequency range below 22 kHz into 23 to 26 critical bands, depending on the coding layer and the sampling frequency. The following figure lists the center frequency and bandwidth of the ideal critical bands. As can be seen in the figure, the human ear has better resolution at low frequencies.

3) Masking effect in the frequency domain: A signal with a larger amplitude will mask a signal with a similar frequency and a smaller amplitude, as shown in the figure below:

4) Masking effect in the time domain: If two sounds occur within a short period of time, the sound with the larger SPL (sound pressure level) masks the sound with the smaller SPL. Time-domain masking is divided into pre-masking (backward masking, where the quieter sound comes just before the masker) and post-masking (forward masking, where it comes just after the masker). Post-masking lasts longer, about 10 times as long as pre-masking.

The time-domain masking effect helps to make pre-echo inaudible.
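
To make the threshold-in-quiet curve and the critical bands more concrete, here is a minimal sketch using two widely cited closed-form approximations, one for the absolute threshold of hearing and one for the critical-band (Bark) scale. Neither formula comes from the text above, and real encoders use tabulated psychoacoustic models, so the numbers are only indicative.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate threshold in quiet, in dB SPL (commonly cited fit)."""
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def hz_to_bark(freq_hz):
    """Approximate critical-band number (Bark) for a frequency in Hz."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# Print frequency, approximate threshold (dB SPL), and critical-band number.
for f in (100, 1000, 4000, 16000):
    print(f, round(threshold_in_quiet_db(f), 1), round(hz_to_bark(f), 1))
```

Under this approximation the ear is most sensitive around 3 to 4 kHz, and a 100 Hz step spans about one critical band at low frequencies but only a small fraction of one near 10 kHz, which matches the better low-frequency resolution described above.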

4. The basic means of coding

1) Quantization and quantizers

Quantization and quantizers: Quantization converts a discrete-time signal with continuous amplitude into a discrete-time signal with discrete amplitude. Common quantizers are the uniform quantizer, the logarithmic quantizer, and the non-uniform quantizer. The quantization process aims to minimize both the quantization error and the complexity of the quantizer, two goals that conflict with each other.

(A) Uniform quantizer: the simplest quantizer, but also the worst performing; only suitable for telephone voice.

(B) Logarithmic quantizer: slightly more complex than the uniform quantizer but still easy to implement, and it performs better than the uniform quantizer.

(C) Non-uniform quantizer: The quantizer is designed according to the distribution of the signal. Fine quantization is used where signal values are dense, and coarse quantization is used where they are sparse.
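
The difference between the uniform and logarithmic quantizers can be sketched as follows. The logarithmic quantizer here is built from mu-law companding with mu = 255, which is one common choice and an assumption of this example rather than something specified in the text; the function names are likewise illustrative.

```python
import math

def uniform_quantize(x, bits):
    """Quantize x in [-1, 1] with a uniform mid-rise quantizer."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(int((x + 1.0) / step), levels - 1)  # clamp the x = 1.0 edge
    return -1.0 + (index + 0.5) * step

def mu_law_compress(x, mu=255.0):
    """Map x in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def log_quantize(x, bits, mu=255.0):
    """Logarithmic quantizer: compress, quantize uniformly, expand."""
    return mu_law_expand(uniform_quantize(mu_law_compress(x, mu), bits), mu)

quiet = 0.01  # a low-amplitude sample, typical of soft speech
print(abs(uniform_quantize(quiet, 8) - quiet))  # uniform quantizer error
print(abs(log_quantize(quiet, 8) - quiet))      # logarithmic quantizer error
```

With the same 8 bits, the logarithmic quantizer spends more of its levels on small amplitudes, so the error on the quiet sample is smaller, at the cost of coarser steps for loud samples.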

2) Speech encoder

There are three types of speech encoders: (a) Waveform encoder; (b) Vocoder; (c) Hybrid encoder.

The waveform encoder aims to reconstruct the analog waveform, including the background noise. Because it operates on all input signals, it produces high-quality output but requires a high bit rate. The vocoder does not reproduce the original waveform. Instead, it extracts a set of parameters and sends them to the receiving end, where they drive a speech production model. The speech quality of a vocoder is not good enough. The hybrid encoder combines the advantages of the waveform encoder and the vocoder.

2.1 Waveform encoder

The design of the waveform encoder is often independent of the signal. So it is suitable for the coding of various signals and is not limited to speech.

1) Time domain coding

a) PCM: pulse code modulation, the simplest encoding method. It only samples and quantizes the signal, and logarithmic quantization is often used.

b) DPCM: differential pulse code modulation, which encodes only the difference between samples. One or more previous samples are used to predict the current sample value; in general, the more samples used for prediction, the more accurate the predicted value. The difference between the true value and the predicted value is called the residual, and it is the residual that is encoded.

c) ADPCM: adaptive differential pulse code modulation. On the basis of DPCM, the quantizer and predictor are adjusted according to changes in the signal, so that the predicted value is closer to the real signal, the residual is smaller, and the compression efficiency is higher.
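
The DPCM idea can be sketched with the simplest possible predictor, "the current sample equals the previous reconstructed sample". The fixed step size and the function names are assumptions of this example; a practical ADPCM codec would adapt both the step size and the predictor.

```python
import math

def dpcm_encode(samples, step):
    """First-order DPCM: quantize the residual against the previous
    reconstructed sample and return the integer residual codes."""
    codes = []
    prediction = 0.0
    for s in samples:
        residual = s - prediction
        code = round(residual / step)  # quantized residual
        codes.append(code)
        prediction += code * step      # track what the decoder will see
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized residuals."""
    out = []
    prediction = 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [math.sin(2 * math.pi * n / 50) for n in range(100)]
codes = dpcm_encode(samples, step=0.05)
decoded = dpcm_decode(codes, step=0.05)
print(max(abs(c) for c in codes))                          # largest residual code
print(max(abs(a - b) for a, b in zip(samples, decoded)))   # reconstruction error
```

Because the encoder predicts from its own reconstructed values rather than from the original samples, quantization errors do not accumulate, and the residual codes span a much smaller range than directly quantized samples would.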

2) Frequency domain coding

Frequency domain coding decomposes a signal into a series of components at different frequencies and encodes them independently.

a) Sub-band coding: Sub-band coding is the simplest frequency domain coding technique. It transforms the original signal from the time domain to the frequency domain, divides it into several sub-bands, and encodes each sub-band digitally. A bank of band-pass filters (BPFs) divides the original signal into several (for example, m) sub-bands. Each sub-band is shifted to near zero frequency by a modulation equivalent to single-sideband amplitude modulation and passed through its own BPF (m in total). Each sub-band output is then sampled at the prescribed rate (the Nyquist rate), and the sampled values are digitally encoded by one of m digital encoders. The encoded signals are sent to a multiplexer, which outputs the final sub-band coded data stream.

For different subbands, different quantization methods can be used and different numbers of bits can be allocated to the subbands according to the human ear perception model.

b) Transform coding: DCT coding.
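
The sub-band idea in a) can be illustrated with a deliberately simplified two-band codec. It uses the two-tap Haar filter pair instead of a bank of m band-pass filters, and the step sizes are arbitrary assumptions standing in for a perceptual bit allocation.

```python
import math

def analyze(x):
    """Split an even-length signal into low and high sub-bands (Haar
    filters), each downsampled by 2."""
    r = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return low, high

def synthesize(low, high):
    """Recombine the two sub-bands into a full-rate signal."""
    r = math.sqrt(2.0)
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / r)
        x.append((l - h) / r)
    return x

def quantize(band, step):
    """Uniform quantization of one sub-band with the given step size."""
    return [round(v / step) * step for v in band]

def subband_codec(x, low_step, high_step):
    """Encode and decode x with a separate step size per sub-band."""
    low, high = analyze(x)
    return synthesize(quantize(low, low_step), quantize(high, high_step))

signal = [math.sin(0.3 * n) + 0.1 * math.sin(2.5 * n) for n in range(64)]
# Fine steps for the low band, coarse steps for the high band.
decoded = subband_codec(signal, low_step=0.01, high_step=0.2)
print(max(abs(a - b) for a, b in zip(signal, decoded)))
```

Without quantization, `synthesize(*analyze(x))` returns the input unchanged, so all of the coding loss comes from the per-band step sizes, which is where an encoder would apply its bit allocation.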

5. Vocoder

Channel vocoder: Exploits the insensitivity of the human ear to phase.

Homomorphic vocoder: Can effectively process composite signals, such as an excitation convolved with the vocal tract response.

Formant vocoder: Most of the information in the speech signal lies in the positions and bandwidths of the formants.

Linear predictive vocoder: The most commonly used vocoder.
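
Linear prediction models each sample as a weighted sum of the preceding samples, and the weights are the parameters a linear predictive vocoder transmits. As a minimal sketch, the coefficients can be estimated from the autocorrelation with the Levinson-Durbin recursion. The synthetic second-order test signal, its coefficients (1.3 and -0.6), and all function names are assumptions made for illustration.

```python
import random

def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations. Returns (a, err), where the
    predictor is x_hat[n] = sum(a[k] * x[n - 1 - k] for k in range(order))
    and err is the remaining prediction error energy."""
    a = []
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[k] * r[i - k] for k in range(i))
        k_i = acc / err  # reflection coefficient
        a = [a[k] - k_i * a[i - 1 - k] for k in range(i)] + [k_i]
        err *= (1.0 - k_i * k_i)
    return a, err

# Synthetic signal: x[n] = 1.3 x[n-1] - 0.6 x[n-2] + noise.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))

r = autocorrelation(x, 2)
lpc_coeffs, residual_energy = levinson_durbin(r, 2)
print(lpc_coeffs)               # close to [1.3, -0.6]
print(residual_energy / r[0])   # fraction of energy left in the residual
```

The estimated coefficients approximate the ones used to generate the signal, and the residual energy is only a fraction of the signal energy, which is why transmitting predictor parameters is cheaper than transmitting the waveform.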

6. Hybrid encoder

The waveform encoder tries to preserve the waveform of the coded signal and can provide high-quality speech at a medium bit rate (32 kbps), but it cannot be used in low-bit-rate applications. The vocoder attempts to generate a signal that sounds similar to the encoded signal and can provide intelligible speech at a low bit rate, but the resulting speech sounds unnatural. The hybrid encoder combines the advantages of both.

RELP: residual-excited linear prediction, which encodes the residual on top of linear prediction. Only a small part of the residual (the baseband) is transmitted, and the full residual is reconstructed at the receiving end by copying the baseband residual.

MPC: multi-pulse coding, which removes the correlation in the residual and compensates for the vocoder's shortcoming of simply classifying speech as voiced or unvoiced, with no intermediate states.

CELP: code-excited linear prediction, which cascades a vocal tract predictor and a pitch predictor to better approximate the original signal.

MBE: multiband excitation, which aims to avoid the heavy computation of CELP while obtaining higher quality than a vocoder.