
    Audio Processing (1): Basic Knowledge

     

    Audio


    Refers to sound waves with frequencies between 20 Hz and 20 kHz, the range that the human ear can hear.

    If a computer is fitted with an audio card (what we usually call a sound card), we can record sounds and store their acoustic characteristics, such as level, as files on the computer's hard disk. Conversely, an audio program can play back a stored audio file and reproduce the previously recorded sound.

     

    1 Audio file format
    The audio file format specifically refers to the format of the file storing the audio data. There are many different formats.

    The general way of obtaining audio data is to sample the audio voltage at fixed time intervals and quantize each sample at a certain resolution (for example, each CDDA sample is 16 bits, or 2 bytes). The sampling interval follows different standards: CDDA uses 44,100 samples per second, while DVD uses 48,000 or 96,000 samples per second. The sampling rate, the resolution (bit depth) and the number of channels (for example, 2 channels for stereo) are therefore the key parameters of an audio file format.
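    A minimal sketch of reading these three parameters from a file, assuming a PCM WAV file is available (the filename "example.wav" is only illustrative), using Python's standard-library wave module:

        import wave

        # Open a PCM WAV file and print its key format parameters.
        # "example.wav" is only a placeholder filename for illustration.
        with wave.open("example.wav", "rb") as wf:
            print("Sampling rate:", wf.getframerate(), "Hz")       # e.g. 44100
            print("Sample width:", wf.getsampwidth() * 8, "bits")  # e.g. 16 (2 bytes)
            print("Channels:", wf.getnchannels())                  # e.g. 2 for stereo
            print("Frames:", wf.getnframes())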

     

    1.1 Loss and lossless
    By the nature of digital audio, an encoding can only approximate the natural signal, at least with current technology; strictly speaking, every digitization scheme loses something, because the original analog waveform cannot be reconstructed perfectly. In computer applications, the highest level of fidelity is PCM encoding, which is widely used for archiving material and for music listening; it is what CDs, DVDs and the common WAV file use. By convention, PCM is therefore regarded as lossless encoding, because it represents the best fidelity level available in digital audio.

     

    There are two main types of audio file formats:

    Lossless formats, such as WAV, PCM, TTA, FLAC, AU, APE, TAK, WavPack(WV)
    Lossy formats, such as MP3, Windows Media Audio (WMA), Ogg Vorbis (OGG), AAC

     


    2 Parameter introduction


    2.1 Sampling rate


    Refers to the number of sound samples taken per second. Sound is a kind of energy wave, so it has both frequency and amplitude: frequency corresponds to the time axis and amplitude to the level axis. The waveform is infinitely smooth and can be regarded as consisting of countless points; because storage space is limited, only some of these points can be kept during digital encoding, and this selection is sampling.

     

    Sampling extracts the amplitude of the waveform at discrete points in time. The more points taken per second, the more information about the waveform is captured, so a higher sampling frequency restores the waveform more faithfully and gives better sound quality, at the cost of more storage. Because the resolution of the human ear is limited, excessively high rates bring no audible benefit: 22050 Hz is commonly used for lower-quality audio, 44100 Hz is already CD quality, and rates above 48000 or 96000 Hz add little that the human ear can distinguish. This is similar to the 24 frames per second used in movies. For stereo, the number of samples doubles and the file size roughly doubles with it.

     

    According to the Nyquist sampling theorem, to record a signal without distortion the sampling frequency must be at least twice the highest frequency contained in the signal. Since human hearing extends to about 20 kHz, full-bandwidth audio needs a sampling frequency of roughly 40 kHz or more. We do not need to derive the theorem here; the point to remember is that the sampling frequency must be greater than or equal to twice the maximum frequency of the audio signal.
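    As an illustrative sketch of the theorem (the tone frequency and rates below are arbitrary example values, not taken from the text), sampling a 5 kHz sine at 44.1 kHz captures it correctly, while sampling it at 8 kHz, which is below twice its frequency, folds it down to a 3 kHz alias:

        import math

        def sample_sine(freq_hz, rate_hz, n):
            """Return n samples of a sine tone of freq_hz taken at rate_hz."""
            return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

        tone = 5000                        # 5 kHz tone (illustrative)
        ok = sample_sine(tone, 44100, 10)  # 44100 > 2 * 5000: the tone is captured correctly
        bad = sample_sine(tone, 8000, 10)  # 8000 < 2 * 5000: the tone aliases to 8000 - 5000 = 3000 Hz
        print(ok[:5])
        print(bad[:5])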

     

    In the field of digital audio, commonly used sampling rates are:

    8000 Hz - used in telephony; sufficient for human speech
    11025 Hz - one quarter of the CD rate; used for lower-quality audio
    22050 Hz - one half of the CD rate; used for radio-quality audio
    32000 Hz - used by miniDV digital camcorders and DAT (LP mode)
    44100 Hz - Audio CD; also commonly used for MPEG-1 audio (VCD, SVCD, MP3)
    47250 Hz - used by early commercial PCM recorders
    48000 Hz - digital sound in miniDV, digital TV, DVD, DAT, film and professional audio
    50000 Hz - used by some early commercial digital recorders
    96000 Hz or 192000 Hz - used for DVD-Audio, some LPCM DVD audio tracks, Blu-ray Disc audio tracks and HD DVD audio tracks


    2.2 Number of sampling bits
    The number of sampling bits is also called the sample size or the number of quantization bits. It measures how finely the amplitude of the sound is resolved: the larger the value, the higher the resolution and the more faithful the recorded and played-back sound. For a sound card, the bit depth is the number of binary digits used when capturing and playing back audio, and it reflects how accurately the digital signal describes the input signal. Sound cards have traditionally been 8-bit or 16-bit; today essentially all mainstream products are 16-bit or better.

     

    Each sample records an amplitude value, and the sampling accuracy depends on the number of sampling bits (see the sketch after this list):

    1 byte (8 bit) can record only 256 values, so the amplitude is divided into 256 levels;
    2 bytes (16 bit) can record 65536 levels, which is already the CD standard;
    4 bytes (32 bit) can subdivide the amplitude into 4294967296 levels, which is rarely necessary.
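    A minimal sketch of how the bit depth determines the number of levels and the size of one quantization step (the full-scale range of -1.0 to +1.0 used here is only an illustrative choice):

        # Number of amplitude levels and step size for common bit depths,
        # assuming an illustrative full-scale range of -1.0 .. +1.0 (range = 2.0).
        FULL_SCALE = 2.0

        for bits in (8, 16, 24, 32):
            levels = 2 ** bits
            step = FULL_SCALE / levels  # smallest amplitude difference that can be represented
            print(f"{bits:2d} bit: {levels:>10d} levels, step = {step:.2e}")
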
    2.3 Number of channels
    That is, the number of sound channels. Beyond the familiar mono and stereo (two-channel) formats, four-channel surround and 5.1-channel systems are now common.

     

    2.3.1 Mono
    Mono is a relatively primitive form of sound reproduction and was common on early sound cards. Mono sound needs only one speaker, although some setups feed the same channel to two speakers. When a mono signal is played back over two speakers, the sound appears to come from a point midway between them, and the listener cannot tell the specific position of the sound source.

     

    2.3.2 Stereo
    Stereo uses two sound channels. The principle is that people judge the position of a sound source from the phase difference between what the left and right ears hear. By assigning sounds to two independent channels during recording, a good sense of localization is achieved. This is especially valuable for music listening: the listener can clearly tell which direction each instrument comes from, which makes the music more vivid and closer to a live experience.

     

    Two-channel audio is currently the most widely used. In karaoke, one channel carries the music and the other the singer's voice; on some VCDs, one channel carries the Mandarin dub and the other the Cantonese dub.

     

    2.3.3 Four-channel surround
    Four-channel surround defines four sound points: front left, front right, rear left and rear right, with the listener surrounded by them. Adding a subwoofer to reinforce low-frequency playback is also recommended (which is why 4.1-channel speaker systems became so popular). Overall, a four-channel system surrounds the listener with sound from several directions and can recreate the feeling of being in different environments. Four-channel technology has been integrated into many mid-to-high-end sound cards.

     

    2.3.4 5.1 channel
    5.1-channel sound is widely used in cinemas and home theaters. Well-known surround formats such as Dolby Digital (AC-3) and DTS are based on the 5.1 layout. The ".1" is a dedicated subwoofer (low-frequency effects) channel covering roughly 20 to 120 Hz. The 5.1 system can be seen as 4.1 surround plus a center channel; the center channel mainly carries dialogue, concentrating voices in the middle of the sound field when watching a film and strengthening the overall effect.

     

    At present, many online music players, such as QQ Music, have provided 5.1-channel music for trial listening and downloading.

     

    2.4 Frame
    The concept of a frame is less clear-cut for audio than for video. For almost all video encoding formats a frame is simply one encoded image, whereas an audio frame depends on the encoding format and is defined by each encoding standard itself.

     

    For example, PCM (unencoded audio data) does not need a frame concept at all; it can be played back directly from the sampling rate and sample size. For two-channel audio at a 44.1 kHz sampling rate with 16-bit samples, the bit rate is 44100 × 16 × 2 = 1411200 bps, and the amount of audio data per second is a fixed 44100 × 16 × 2 / 8 = 176400 bytes.
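    A minimal sketch of that calculation (the parameter values are the ones from the example above):

        def pcm_data_rate(sample_rate_hz, bits_per_sample, channels):
            """Return (bit rate in bps, bytes per second) for uncompressed PCM audio."""
            bit_rate = sample_rate_hz * bits_per_sample * channels
            return bit_rate, bit_rate // 8

        bit_rate, bytes_per_sec = pcm_data_rate(44100, 16, 2)
        print(bit_rate)       # 1411200 bps (1411.2 kbps)
        print(bytes_per_sec)  # 176400 bytes of audio data per second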

     

    An AMR frame is comparatively simple: every 20 ms of audio forms one frame, each frame is independent, and different frames may use different encoding algorithms and parameters.

     

    An MP3 frame is a bit more complicated and carries more information, such as the sampling rate, the bit rate and various other parameters.

     

    2.5 Period (cycle)
    The number of frames an audio device processes at one time; the device's data transfers and the buffering of audio data are both done in units of this size.

     

    2.6 Interleaved mode
    A storage layout for digital audio data in which the data is stored frame by frame: the left-channel and right-channel samples of frame 1 are written first, followed by those of frame 2, and so on.

     

    2.7 Non-interleaved mode
    All the left-channel samples of a period are written first, followed by all the right-channel samples, as in the sketch below.
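
    A minimal sketch of the two layouts for one period of stereo samples (the sample values are made up for illustration):

        # One period of stereo audio: 4 frames, illustrative sample values.
        left  = [10, 11, 12, 13]
        right = [20, 21, 22, 23]

        # Interleaved: the left and right samples of each frame are stored together.
        interleaved = [s for frame in zip(left, right) for s in frame]
        print(interleaved)      # [10, 20, 11, 21, 12, 22, 13, 23]

        # Non-interleaved: all left samples of the period first, then all right samples.
        non_interleaved = left + right
        print(non_interleaved)  # [10, 11, 12, 13, 20, 21, 22, 23]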

     

    2.8 Bit rate
    Bit rate (also called code rate) is the amount of data consumed per second of playback. It is expressed in bits per second (bps), where b stands for bit (a binary digit), p for per and s for second; one byte equals 8 bits. For example, the file size of a 4-minute song at 128 kbps works out as (128 / 8) kB per second × 4 × 60 seconds = 3840 kB ≈ 3.75 MB, so a typical MP3 at around 128 kbps is about 3-4 MB.
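    A minimal sketch of that file-size estimate (the bit rate and duration are the example values from the text; container overhead is ignored):

        def audio_file_size_kb(bit_rate_kbps, duration_s):
            """Approximate compressed-audio file size in kB from bit rate and duration."""
            return bit_rate_kbps / 8 * duration_s  # kbit/s -> kB/s, times seconds

        size_kb = audio_file_size_kb(128, 4 * 60)  # a 4-minute song at 128 kbps
        print(size_kb)         # 3840.0 kB
        print(size_kb / 1024)  # ~3.75 MB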

     

    As noted above, PCM is by convention regarded as lossless because it represents the best fidelity level in digital audio and is what CDs, DVDs and common WAV files use. This does not mean PCM guarantees absolute fidelity to the original signal; it can only come as close as the chosen sampling parameters allow.

     

    Calculating the bit rate of a PCM audio stream is easy: bit rate = sampling rate × sample size × number of channels (bps). A WAV file with a 44.1 kHz sampling rate, 16-bit samples and two-channel PCM encoding has a data rate of 44.1 k × 16 × 2 = 1411.2 kbps. The familiar Audio CD uses PCM encoding, and a single CD holds only about 74 minutes of music.

     

    One second of two-channel PCM-encoded audio takes 176.4 kB, and one minute about 10.34 MB, which is unacceptable for most users, especially those who listen to music on a computer. To reduce the disk footprint there are only two options: lower the sampling parameters or compress the data. Lowering the sampling parameters is undesirable, so various compression schemes were developed, from the early DPCM and ADPCM to the best known of all, MP3. After compression, the bit rate is far lower than that of the original.

     

    2.9 Example calculation
    For example, the file "Windows XP startup.wav" is 424,644 bytes long, in the format 22050 Hz / 16 bit / stereo.

    Its data rate per second (the bit rate; note that this is not the same thing as the sampling rate) is 22050 × 16 × 2 = 705600 bps, or 705600 / 8 = 88200 bytes per second, so the playback time is 424644 (total bytes) / 88200 (bytes per second) ≈ 4.8146 seconds.

     

    But this is not quite accurate. A WAV file (*.wav) in standard PCM format has at least 44 bytes of header information, which should be excluded when calculating the playback time, giving (424644 - 44) / (22050 × 16 × 2 / 8) ≈ 4.8141 seconds. This is more accurate.
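
    A minimal sketch of both calculations (the file size and format are the example values above; the 44-byte figure assumes a minimal canonical PCM WAV header):

        FILE_SIZE = 424_644               # total size of "Windows XP startup.wav" in bytes
        RATE, BITS, CHANNELS = 22050, 16, 2
        HEADER_BYTES = 44                 # minimal canonical PCM WAV header (assumption)

        bytes_per_second = RATE * BITS * CHANNELS // 8        # 88200
        print(FILE_SIZE / bytes_per_second)                   # ~4.8146 s, header included
        print((FILE_SIZE - HEADER_BYTES) / bytes_per_second)  # ~4.8141 s, header excluded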

     

    3 PCM audio encoding
    PCM stands for Pulse Code Modulation. In PCM, the input analog signal is sampled, quantized and encoded, so that a binary code represents the amplitude of the analog signal; the receiving end then converts those codes back into the original analog signal. In other words, the A/D conversion of audio consists of three steps: sampling, quantization and encoding.

     

    The sampling rate of telephone-quality voice PCM is 8 kHz with 8 bits per sample, so the bit rate of the digitally encoded voice signal is 8 bits × 8 kHz = 64 kbps = 8 KB/s.

     

    3.1 Principles of Audio Coding
    Anyone with a basic electronics background knows that the audio signal picked up by a sensor is an analog quantity, while what is actually transmitted is a digital quantity. This requires converting analog to digital: the analog signal goes through three steps, sampling, quantization and encoding, which together make up the pulse code modulation (PCM) technique for digitizing voice.

     

    (Figure: the A/D conversion process, i.e. sampling, quantization and encoding)


    3.1.1 Sampling
    Sampling is the process of taking values from the analog signal at a rate more than twice the signal bandwidth (the Nyquist sampling theorem), turning it into a signal that is discrete along the time axis.
    Sampling rate: the number of samples taken per second from a continuous signal to form a discrete signal, expressed in hertz (Hz).


    Example:
    Suppose the audio sampling rate is 8000 Hz.
    Picture the curve of voltage versus time over one second: that second is divided into 8000 equal parts, and the voltage value at each of the 8000 sample instants is read out in turn.
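
    A minimal sketch of those sample instants, reading an illustrative voltage-versus-time curve at 8000 evenly spaced points over one second:

        import math

        RATE = 8000  # samples per second

        def voltage(t):
            """Illustrative analog voltage-vs-time curve (a stand-in for the sensor signal)."""
            return 1.65 + 1.65 * math.sin(2 * math.pi * 440 * t)  # a 440 Hz tone spanning 0..3.3 V

        # One second is divided into 8000 parts; read the voltage at each instant n / RATE.
        samples = [voltage(n / RATE) for n in range(RATE)]
        print(len(samples))  # 8000 samples for 1 second of audio
        print(samples[:3])   # the first few sampled voltage values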

     

    3.1.2 Quantization
    Although the sampled signal is discrete along the time axis, it is still analog in value: each sample can take any of an infinite number of values within a certain range. A "rounding" step is therefore applied so that the sample values, instead of taking infinitely many possible values, are restricted to a finite set of levels. This process is called quantization.

     

    Number of sampling bits: the number of binary bits used to describe each quantized sample.
    8 bits (8 bit) give 2 to the 8th power = 256 levels; 16 bits (16 bit) give 2 to the 16th power = 65536 levels.

     

    Example:
    Suppose the voltage range produced by the audio sensor is 0-3.3 V and the sample size is 8 bits.
    The quantization step is then 3.3 V / 2^8 ≈ 0.0129 V.
    Dividing 0-3.3 V into steps of 0.0129 V gives the levels 0, 0.0129, 0.0258, ... up to 3.3 V.
    If a sample point measures 1.652 V (which lies between 128 × 0.0129 ≈ 1.651 V and 129 × 0.0129 ≈ 1.664 V), it is rounded to about 1.65 V, and the corresponding quantization level is 128.
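
    A minimal sketch of this quantization step (the 0-3.3 V range, the 8-bit sample size and the 1.652 V reading are the example values above):

        V_MAX = 3.3               # full-scale voltage range of the example sensor
        BITS = 8                  # sample size
        STEP = V_MAX / 2**BITS    # quantization step, about 0.0129 V

        def quantize(voltage):
            """Map an analog voltage to its quantization level and the reconstructed value."""
            level = min(int(voltage / STEP), 2**BITS - 1)  # truncate and clamp to the top level
            return level, level * STEP

        level, approx = quantize(1.652)
        print(level)   # 128
        print(approx)  # ~1.65 V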

     

    3.1.3 Encoding
    The quantized samples form a sequence of decimal level numbers arranged in sampling order, i.e. a decimal digital signal. A simple and efficient data system is the binary code system, so the decimal values are converted into binary codes. The total number of quantization levels determines how many binary bits are needed, i.e. the word length (the number of sampling bits). This conversion of the quantized sample sequence into a binary code stream of a given word length is called encoding.

     

    Example:
    The 1.65 V sample above corresponds to quantization level 128, whose binary representation is 10000000, so the encoded result for that sample point is 10000000. Of course, this simple scheme ignores positive and negative values; there are many encoding methods, and the right one depends on the specific application. (The encoding used for PCM telephony is A-law 13-segment coding.)
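
    A minimal sketch of this last step, turning the quantization level into a fixed-width unsigned binary code (the simple scheme described above, not A-law):

        def encode(level, bits=8):
            """Encode a quantization level as an unsigned fixed-width binary string."""
            return format(level, f"0{bits}b")

        print(encode(128))             # '10000000'  (the 1.652 V sample from the example)
        print(encode(0), encode(255))  # '00000000' '11111111' (lowest and highest levels)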

     

    3.2 PCM audio coding
    A PCM signal has not been compressed in any way (it is stored losslessly). Compared with an analog signal, it is far less affected by noise and distortion in the transmission system, has a wide dynamic range and offers very good sound quality.

     

    3.2.1 PCM encoding
    The coding used (for PCM telephony) is A-law 13-segment coding.
    For details, please refer to: PCM voice coding

     

    3.2.2 Channel
    Channels can be divided into mono and stereo (dual channel).

    Each PCM sample value is stored in an integer i, whose length is the minimum number of bytes required to hold the specified sample size.

     

    Sample size    Data format         Minimum    Maximum
    8-bit PCM      unsigned integer    0          255
    16-bit PCM     signed integer      -32768     32767

     

    For mono sound files with 8-bit samples, each sample is an unsigned byte (00H-FFH), and the samples are stored in chronological order.


    For a two-channel stereo file with 8-bit samples, each sample pair occupies a 16-bit integer: the upper eight bits hold the left channel and the lower eight bits the right channel, and the sample pairs are stored alternately in chronological order.
    The same interleaved layout is used when the sample size is 16 bits, in which case the storage also depends on the byte order.


    PCM data format
    Standard network protocols transmit data in big-endian order, which is why big-endian is also called network byte order. When two hosts with different native byte orders communicate, data must be converted to network byte order before it is sent.
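
    A minimal sketch of how byte order affects a 16-bit PCM sample (the value 0x1234 is only an illustrative sample):

        import struct

        sample = 0x1234  # an illustrative 16-bit sample value

        little = struct.pack("<h", sample)  # little-endian, as used in WAV files
        big    = struct.pack(">h", sample)  # big-endian, i.e. network byte order
        print(little.hex())  # '3412'
        print(big.hex())     # '1234'

        # Converting between orders is just unpacking in one order and re-packing in the other.
        converted = struct.pack(">h", struct.unpack("<h", little)[0])
        print(converted == big)  # True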

     

    4 G.711
    In general PCM, the analog signal undergoes some processing (such as amplitude compression) before being digitized. Once digitized, the PCM signal is usually processed further (such as digital data compression).

     

    G.711 is an ITU-T standard (compression/decompression) algorithm for pulse code modulation of multimedia signals. It is a sampling technique for digitizing analog signals, especially audio: the signal is sampled 8000 times per second (8 kHz) with 8 bits per sample, for a total of 64 kbps (one DS0). Two standards exist for encoding the sample levels: North America and Japan use the μ-law standard, while most other countries use the A-law standard.

     

    A-law and μ-law are the two companding variants of PCM. A-law PCM is used in Europe and China, μ-law in North America and Japan. They differ in the quantization curve: A-law compresses the equivalent of roughly 13-bit uniform quantization into 8 bits, while μ-law compresses roughly 14-bit. Both use an 8 kHz sampling frequency and 8-bit codes.

     

    Put simply, PCM is the raw audio data captured by the audio device, while G.711 and AAC are different algorithms that compress PCM data by some ratio, saving bandwidth during network transmission.
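
    A minimal sketch of this idea using Python's standard-library audioop module (deprecated since Python 3.11 and removed in 3.13), which can compress 16-bit linear PCM samples to 8-bit G.711 A-law and back:

        import audioop
        import struct

        # A few illustrative 16-bit PCM samples packed as little-endian bytes.
        pcm = struct.pack("<4h", 0, 1000, -1000, 32000)

        alaw = audioop.lin2alaw(pcm, 2)   # compress: 2 bytes per sample -> 1 byte per sample
        back = audioop.alaw2lin(alaw, 2)  # expand back to 16-bit PCM (lossy; values are approximate)

        print(len(pcm), len(alaw))         # 8 4  -> half the size
        print(struct.unpack("<4h", back))  # approximately (0, 1000, -1000, 32000)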

     

     

     

     
