Speech Coding Techniques

潘奕誠, 4/7/2003

Introduction

- Efficient speech-coding techniques
  - Advantages for VoIP
  - Digital streams of ones and zeros
  - The lower the bandwidth, the lower the quality
  - RTP payload types
- Processing power
  - The better quality (for a given bandwidth) uses a more complex algorithm
  - A balance between quality and cost

Voice Quality

- Bandwidth is easily quantified
- Voice quality is subjective
  - MOS, Mean Opinion Score
  - ITU-T Recommendation P.800
    - Excellent – 5, Good – 4, Fair – 3, Poor – 2, Bad – 1
    - A minimum of 30 people
    - Listen to voice samples or in conversations
- P.800 recommendations
  - The selection of participants
  - The test environment
  - Explanations to listeners
  - Analysis of results
- Toll quality
  - A MOS of 4.0 or higher
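A minimal sketch of how a MOS could be computed from listener ratings and compared against the toll-quality threshold. The ratings are made up for illustration, and the function names are hypothetical; P.800 itself specifies far more about the test procedure than this averaging step.

```python
# Hypothetical ratings on the five-point scale (5 = Excellent ... 1 = Bad).
RATINGS = [4, 5, 4, 3, 4, 4, 5, 4, 3, 4] * 3  # 30 listeners, invented values

def mean_opinion_score(ratings, min_listeners=30):
    """Average the listener ratings after basic sanity checks."""
    if len(ratings) < min_listeners:
        raise ValueError("too few listeners")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)

def is_toll_quality(mos, threshold=4.0):
    """Toll quality: a MOS of 4.0 or higher."""
    return mos >= threshold

mos = mean_opinion_score(RATINGS)
print(f"MOS = {mos:.2f}, toll quality: {is_toll_quality(mos)}")
```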

About Speech

- Speech
  - Air pushed from the lungs past the vocal cords and along the vocal tract
  - The basic vibrations – vocal cords
  - The sound is altered by the disposition of the vocal tract (tongue and mouth)
- Model the vocal tract as a filter
  - The shape changes relatively slowly
- The vibrations at the vocal cords
  - The excitation signal

Speech sounds

- Voiced sound
  - The vocal cords vibrate open and close
  - Quasi-periodic pulses of air
  - The rate of the opening and closing – the pitch
- Unvoiced sounds
  - Forcing air at high velocities through a constriction
  - Noise-like turbulence
  - Show little long-term periodicity
  - Short-term correlations still present
- Plosive sounds
  - A complete closure in the vocal tract
  - Air pressure is built up and released suddenly

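Voice Sampling

Discrete Time LTI Systems: The Convolution Sum

$$x[n] = \sum_{k=-\infty}^{\infty} x[k]\,\delta[n-k]$$

$$y[n] = \sum_{k=-\infty}^{\infty} x[k]\,h[n-k]$$

[Figure: stem plots of h[n] (value 1 at n = 0, 1, 2), x[n] (0.5 at n = 0, 2 at n = 1), and the output y[n] (0.5, 2.5, 2.5, 2 at n = 0, 1, 2, 3)]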
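The convolution sum can be sketched for finite-length sequences as follows. The input values are illustrative (a two-sample input and a three-sample impulse response), and the function name is hypothetical.

```python
def convolve(x, h):
    """Finite-length convolution sum y[n] = sum over k of x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for m, hm in enumerate(h):
            y[k + m] += xk * hm   # x[k] contributes to y[n] at n = k + m
    return y

x = [0.5, 2.0]        # x[0], x[1]
h = [1.0, 1.0, 1.0]   # h[0], h[1], h[2]
print(convolve(x, h))  # [0.5, 2.5, 2.5, 2.0]
```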

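Nyquist sampling theorem

$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$x_s(t) = x_c(t)\,s(t) = x_c(t) \sum_{n=-\infty}^{\infty} \delta(t - nT)$$

$$S(j\Omega) = \frac{2\pi}{T} \sum_{k=-\infty}^{\infty} \delta(\Omega - k\Omega_s)$$

[Figure: the spectrum X_c(jΩ), band-limited to -Ω_N ≤ Ω ≤ Ω_N, and the spectrum of the sampled signal, with copies of X_c(jΩ) repeated at multiples of Ω_s; the first copy starts at Ω_s - Ω_N]

The copies do not overlap when Ω_s - Ω_N > Ω_N, that is, when Ω_s > 2Ω_N.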
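A small sketch of what happens when the sampling condition is violated, assuming the 8 kHz sampling rate used for telephone speech. A 5 kHz tone lies above half the sampling rate and produces exactly the same samples as a 3 kHz tone. The function names are hypothetical.

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Apparent frequency, within 0..fs/2, of a real tone at f_hz sampled at fs_hz."""
    f = f_hz % fs_hz
    return min(f, fs_hz - f)

def sample_tone(f_hz, fs_hz, n_samples):
    """Samples of cos(2*pi*f*t) taken at t = n / fs."""
    return [math.cos(2 * math.pi * f_hz * n / fs_hz) for n in range(n_samples)]

FS = 8000
print(alias_frequency(3000, FS))  # below fs/2, unchanged
print(alias_frequency(5000, FS))  # above fs/2, folds back

# The two tones are indistinguishable after sampling.
diff = max(abs(a - b) for a, b in zip(sample_tone(3000, FS, 64),
                                      sample_tone(5000, FS, 64)))
print(diff < 1e-9)
```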

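Quantization (Scalar Quantization)

[Figure: the range [-A, A] split at the boundaries m_0 = -A, m_1, m_2, ..., m_k, m_{k+1}, ..., m_{L-1}, m_L = A; each interval carries a representative value v_1, v_2, ..., v_{k+1}, ..., v_L, and J_{k+1} is the interval between m_k and m_{k+1}]

- Assume |x[n]| ≤ A
- Divide the range [-A, A] into L quantization levels {J_1, J_2, ..., J_k, ..., J_L}, with J_k : [m_{k-1}, m_k]
- L = 2^R (R bits per sample)
- Each quantization level J_k is represented by a value v_k
- S = ∪ J_k, V = {v_1, v_2, ..., v_k, ..., v_L}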
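A minimal sketch of a uniform scalar quantizer in this notation. Equal-width intervals and midpoint representative values v_k are assumptions made for the example; the definitions above allow any choice of boundaries and values.

```python
def uniform_quantize(x, A=1.0, R=3):
    """Return (k, v_k): the interval index 1..L and its representative value."""
    L = 2 ** R                        # number of quantization levels
    step = 2 * A / L                  # width of every interval J_k
    x = max(-A, min(A, x))            # enforce |x| <= A
    k = min(int((x + A) / step), L - 1) + 1   # x falls in J_k = [m_{k-1}, m_k]
    v = -A + (k - 0.5) * step         # assumed: v_k is the midpoint of J_k
    return k, v

for sample in (-1.0, -0.6, 0.0, 0.49, 1.0):
    print(sample, uniform_quantize(sample))
```

With midpoint values the quantization error never exceeds half a step, regardless of the signal level, which is why the signal-to-noise ratio drops for small inputs.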

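Non-Uniform Quantization

[Figure: the range from m_0 = -A to m_L = A with boundaries m_1, m_2, ... spaced closely around 0 and more widely toward ±A]

- Concept: small quantization levels for small x, large quantization levels for large x
- Goal: constant SNR_Q for all x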

Companding

x[n] -> F(x) (Compressor) -> Uniform Quantization -> …1101…1101… -> Uniform Decoder -> F^{-1}(x) (Expandor) -> x̂[n]

- Compressor + Expandor -> Compandor
- F(x) specifies the non-uniform quantization characteristics

Non-Uniform Quantization

μ-law:

$$F(x) = \frac{\log(1 + \mu x)}{\log(1 + \mu)}, \quad 0 \le x \le 1$$

A-law:

$$F(x) = \begin{cases} \dfrac{Ax}{1 + \ln A}, & 0 \le x \le \dfrac{1}{A} \\[2ex] \dfrac{1 + \ln(Ax)}{1 + \ln A}, & \dfrac{1}{A} \le x \le 1 \end{cases}$$

Typical values in practice: μ = 255, A = 87.6
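A sketch of the compressor, uniform quantizer and expandor chain using the continuous μ-law curve with μ = 255. The extension to negative x by odd symmetry and the midpoint decoder are assumptions of this example; G.711 itself uses a segmented approximation of the curve, which is not reproduced here.

```python
import math

MU = 255.0

def mu_compress(x, mu=MU):
    """F(x) for -1 <= x <= 1, extended to negative x by odd symmetry."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def compand(x, bits=8, mu=MU):
    """Compress, quantize uniformly to 2**bits levels, then expand."""
    levels = 2 ** bits
    step = 2.0 / levels
    y = mu_compress(x, mu)
    k = min(int((y + 1.0) / step), levels - 1)   # uniform quantizer index
    y_hat = -1.0 + (k + 0.5) * step              # uniform decoder (midpoint)
    return mu_expand(y_hat, mu)

for x in (0.01, 0.1, 0.5):
    x_hat = compand(x)
    print(x, x_hat, abs(x_hat - x) / x)   # relative error stays of similar size
```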

Types of Speech Codecs 

Waveform codecs, source codecs (also known as vocoders), and hybrid codecs.

Speech Source Model and Source Coding

[Figure: an excitation generator (a periodic pulse train generator for voiced sounds, a random sequence generator for unvoiced sounds, selected by the v/u switch and scaled by the gain G) produces the excitation signal u[n], which drives the vocal tract model G(z), G(ω), g[n] to produce x[n]]

Vocal Tract Model

$$G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$

Excitation parameters -> excitation signal u[n]
- v/u: voiced/unvoiced
- N: pitch for voiced
- G: signal gain

Vocal Tract parameters
- {a_k}: LPC coefficients, formant structure of speech signals

A good approximation, though not precise enough

LPC Vocoder (Voice Coder)

[Figure: x[n] -> LPC Analysis -> {a_k}, N, G, v/u -> Encoder -> …11011… -> receiver -> Decoder -> {a_k}, N, G, v/u -> excitation (Ex) -> g[n], G(z) -> x[n]]

- N by pitch detection
- v/u by voicing detection
- {a_k} can be non-uniform or vector quantized to reduce bit rate
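The decoder side of such a vocoder can be sketched as an excitation generator feeding an all-pole filter. The sketch assumes the vocal tract model G(z) = 1 / (1 - Σ a_k z^-k); the coefficient values, the frame length and the function names are hypothetical.

```python
import random

def excitation(voiced, pitch_period, gain, n_samples, seed=0):
    """u[n]: a pulse train with period N if voiced, else white noise, scaled by G."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def lpc_synthesize(u, a):
    """All-pole filter: x[n] = u[n] + sum over k of a_k * x[n - k]."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x

a = [1.3, -0.8]   # hypothetical coefficients (poles inside the unit circle)
voiced_frame = lpc_synthesize(excitation(True, 80, 1.0, 160), a)
unvoiced_frame = lpc_synthesize(excitation(False, 0, 0.1, 160), a)
print(len(voiced_frame), len(unvoiced_frame))
```

Only {a_k}, N, G and the v/u flag need to be transmitted per frame, which is where the low bit rate comes from.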

G.711

- The most commonplace codec
  - Used in circuit-switched telephone network
- PCM, Pulse-Code Modulation
  - If uniform quantization: 12 bits * 8 k/sec = 96 kbps
  - Non-uniform quantization: 8 bits * 8 k/sec = 64 kbps, the DS0 rate
- μ-law
  - North America
- A-law
  - Other countries, a little friendlier to lower signal levels
- An MOS of about 4.3
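The PCM bit rates follow directly from bits per sample times samples per second. A short worked check, assuming 8000 samples per second and 8-bit companded samples for G.711 (the function name is hypothetical):

```python
def pcm_bit_rate_bps(bits_per_sample, sample_rate_hz=8000):
    """Bit rate of a PCM stream in bits per second."""
    return bits_per_sample * sample_rate_hz

print(pcm_bit_rate_bps(12))  # uniform quantization: 96000 bps = 96 kbps
print(pcm_bit_rate_bps(8))   # companded, 8 bits: 64000 bps = 64 kbps (DS0)
```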





ADPCM (Adaptive Differential PCM)

- DPCM and ADPCM
- ADPCM: adaptive prediction in DPCM, adaptive quantization
- Adaptive quantization
  - Quantization level Δ varies with local signal level
  - Δ[n] = a σ_x[n], where σ_x[n] is the locally estimated standard deviation of x[n]
- G.721: ADPCM-coded speech at 32 kbps
- G.726 (A-law or μ-law)
  - 16, 24, 32, 40 kbps
  - MOS 4.0 at 32 kbps
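A toy sketch of the idea, not the G.721 or G.726 algorithm: the difference between each sample and a prediction is quantized with a step that follows the local signal level, Δ[n] = a·σ[n]. The first-order predictor (previous reconstructed sample), the power estimate taken from reconstructed samples so that the decoder can track it without side information, and all parameter values are assumptions.

```python
import math

def _step(power, a, sigma_min):
    """Delta[n] = a * sigma[n], with a floor so the step never reaches zero."""
    return a * max(math.sqrt(power), sigma_min)

def adpcm_encode(x, bits=4, a=0.5, alpha=0.9, sigma_min=1e-3):
    """Return (codes, reconstruction) for the input samples x."""
    half = 2 ** (bits - 1)
    pred, power = 0.0, 0.0
    codes, recon = [], []
    for sample in x:
        step = _step(power, a, sigma_min)
        code = max(-half, min(half - 1, round((sample - pred) / step)))
        pred = pred + code * step                         # reconstructed sample
        power = alpha * power + (1 - alpha) * pred * pred  # local power estimate
        codes.append(code)
        recon.append(pred)
    return codes, recon

def adpcm_decode(codes, a=0.5, alpha=0.9, sigma_min=1e-3):
    """Mirror the encoder's state updates using only the received codes."""
    pred, power, out = 0.0, 0.0, []
    for code in codes:
        step = _step(power, a, sigma_min)
        pred = pred + code * step
        power = alpha * power + (1 - alpha) * pred * pred
        out.append(pred)
    return out

signal = [math.sin(2 * math.pi * n / 50) for n in range(200)]
codes, recon = adpcm_encode(signal)
print(adpcm_decode(codes) == recon)  # decoder reproduces the encoder's output
```

At 4 bits per sample and 8000 samples per second this corresponds to the 32 kbps rate listed above.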

Analysis-by-Synthesis (AbS) Codecs

- Hybrid codec
  - Fill the gap between waveform and source codecs
  - The most successful and commonly used
- Time-domain AbS codecs
  - Not a simple two-state, voiced/unvoiced
  - Different excitation signals are attempted
  - Closest to the original waveform is selected
- MPE, Multi-Pulse Excited
- RPE, Regular-Pulse Excited
- CELP, Code-Excited Linear Predictive
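The "try every excitation, keep the closest" loop can be sketched as an exhaustive codebook search. This is a simplified illustration: it uses a plain squared error rather than a perceptually weighted one, starts the synthesis filter from a zero state, and the codebook, coefficients and function names are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter: x[n] = u[n] + sum over k of a_k * x[n - k]."""
    x = []
    for n, un in enumerate(excitation):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x

def abs_search(target, codebook, a):
    """Return (index, gain, error) of the excitation that best matches target."""
    best = (None, 0.0, float("inf"))
    for index, vector in enumerate(codebook):
        synth = synthesize(vector, a)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        gain = sum(t * s for t, s in zip(target, synth)) / energy  # optimal gain
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if error < best[2]:
            best = (index, gain, error)
    return best

codebook = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, -1, 0, 0, 0]]
a = [0.5]
target = [2 * s for s in synthesize(codebook[2], a)]
print(abs_search(target, codebook, a))
```

The encoder then transmits the index and gain instead of the waveform itself.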

G.728 LD-CELP

- CELP codecs
  - A filter; its characteristics change over time
  - A codebook of acoustic vectors
    - A vector = a set of elements representing various characteristics of the excitation
  - Transmit
    - Filter coefficients, gain, a pointer to the vector chosen
- Low Delay CELP
  - Backward-adaptive coder
    - Use previous samples to determine filter coefficients
    - Operates on five samples at a time
    - Delay < 1 ms
    - Only the pointer is transmitted
  - 1024 vectors in the code book
  - 10-bit pointer (index)
  - 16 kbps
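The 16 kbps figure follows from the numbers above: one 10-bit index per five-sample vector, assuming the usual 8000 samples per second. A short worked check with hypothetical function names:

```python
def index_bits(codebook_size):
    """Bits needed to address every vector in the codebook."""
    return (codebook_size - 1).bit_length()

def bit_rate_bps(bits_per_vector, samples_per_vector, sample_rate_hz=8000):
    """Bit rate when one index is sent per vector of samples."""
    return bits_per_vector * sample_rate_hz // samples_per_vector

bits = index_bits(1024)
print(bits)                       # 10
print(bit_rate_bps(bits, 5))      # 16000 bps = 16 kbps
print(1000 * 5 / 8000)            # one vector spans 0.625 ms of speech
```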

LD-CELP encoder

- Minimize a frequency-weighted mean-square error

LD-CELP decoder

- An MOS score of about 3.9
- One-quarter of G.711 bandwidth

G.723.1 ACELP

- 6.3 or 5.3 kbps
  - Both mandatory
  - Can change from one to another during a conversation
- The coder
  - A band-limited input speech signal
  - Sampled at 8 kHz, 16-bit uniform PCM quantization
  - Operate on blocks of 240 samples at a time
  - A look-ahead of 7.5 ms
  - A total algorithmic delay of 37.5 ms + other delays
  - A high-pass filter to remove any DC component

G.723.1 Annex A

- Silence Insertion Description (SID) frames of size four octets
- The two lsbs of the first octet

  Bits  Frame type  Octets/frame
  00    6.3 kbps    24
  01    5.3 kbps    20
  10    SID frame   4

- An MOS of about 3.8
- At least 37.5 ms delay
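A sketch of how a receiver might read the frame type from the two least significant bits of the first octet, together with the delay arithmetic (240 samples at 8 kHz is 30 ms, plus 7.5 ms of look-ahead). The function names are hypothetical, and bit patterns not listed above are simply rejected in this sketch.

```python
FRAME_TYPES = {
    0b00: ("6.3 kbps", 24),   # (frame type, octets per frame)
    0b01: ("5.3 kbps", 20),
    0b10: ("SID frame", 4),
}

def classify_frame(first_octet):
    """Return (frame type, octets per frame) from the two lsbs of the first octet."""
    code = first_octet & 0b11
    if code not in FRAME_TYPES:
        raise ValueError(f"frame type bits {code:02b} not handled in this sketch")
    return FRAME_TYPES[code]

def algorithmic_delay_ms(frame_samples=240, lookahead_ms=7.5, sample_rate_hz=8000):
    """Frame duration plus look-ahead, in milliseconds."""
    return 1000.0 * frame_samples / sample_rate_hz + lookahead_ms

print(classify_frame(0b10110101))   # lsbs 01: 5.3 kbps, 20 octets
print(algorithmic_delay_ms())       # 37.5
```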

G.729

- 8 kbps
- Input frames of 10 ms, 80 samples for 8 kHz sampling rate
- 5 ms look-ahead
  - Algorithmic delay of 15 ms
- An 80-bit frame for 10 ms of speech
- A complex codec
  - G.729.A (Annex A), a number of simplifications
  - Same frame structure
  - A G.729 encoder works with a G.729.A decoder, and vice versa
  - Slightly lower quality
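The G.729 figures can be checked the same way: 80 samples at 8 kHz last 10 ms, 80 bits every 10 ms give 8 kbps, and the frame plus the look-ahead gives the algorithmic delay. Function names are hypothetical.

```python
def frame_duration_ms(samples, sample_rate_hz=8000):
    """Duration of one input frame in milliseconds."""
    return 1000.0 * samples / sample_rate_hz

def bit_rate_bps(bits_per_frame, frame_ms):
    """Bit rate when one coded frame is produced per frame_ms of speech."""
    return bits_per_frame * 1000.0 / frame_ms

frame_ms = frame_duration_ms(80)
print(frame_ms)                    # 10.0 ms
print(bit_rate_bps(80, frame_ms))  # 8000.0 bps = 8 kbps
print(frame_ms + 5.0)              # 15.0 ms algorithmic delay
```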



G.729.B

- VAD, Voice Activity Detection
  - Based on analysis of several parameters of the input
  - The current frames plus two preceding frames
- DTX, Discontinuous Transmission
  - Send nothing or send an SID frame
  - SID frame contains information to generate comfort noise
- CNG, Comfort Noise Generation
- G.729, an MOS of about 4.0
- G.729A, an MOS of about 3.7

Other Codecs

- CDMA QCELP defined in IS-733
  - Variable-rate coder
  - Two most common rates
    - The high rate, 13.3 kbps
    - A lower rate, 6.2 kbps
  - Silence suppression
  - For use with RTP, RFC 2658
- GSM Enhanced Full-Rate (EFR)
  - GSM 06.60
  - An enhanced version of GSM Full-Rate
  - ACELP-based codec
  - The same bit rate and the same overall packing structure
    - 12.2 kbps
  - Discontinuous transmission
  - For use with RTP, RFC 1890
- GSM Adaptive Multi-Rate (AMR) codec
  - GSM 06.90
  - Eight different modes
    - 4.75 kbps to 12.2 kbps
    - 12.2 kbps, GSM EFR
    - 7.4 kbps, IS-641 (TDMA cellular systems)
  - Change the mode at any time
  - Offer discontinuous transmission



- The MOS values are for laboratory conditions
  - G.711 does not deal with lost packets
  - G.729 can accommodate a lost frame by interpolating from previous frames
    - But this causes errors in subsequent speech frames

- Processing Power
  - G.728 or G.729, 40 MIPS
  - G.726, 10 MIPS



Cascaded Codecs

- E.g., G.711 stream -> G.729 encoder/decoder
  - Might not even come close to G.729
- Each coder only generates an approximation of the incoming signal

Tones, Signals, and DTMF Digits

- The hybrid codecs are optimized for human speech
  - Other data may need to be transmitted
  - Tones: fax tones, dialing tone, busy tone
  - DTMF digits for two-stage dialing or voicemail
- G.711 is OK
- G.723.1 and G.729 can be unintelligible
- The ingress gateway needs to intercept
  - The tones and DTMF digits
  - Easy at the start of a call
  - Difficult in the middle of a call
- Encode the tones differently from the speech
  - Send them along the same media path
  - An RTP packet provides the name of the tone and the duration
  - Or, a dynamic RTP profile; an RTP packet containing the frequency, volume and the duration
  - RFC 2198
    - An RTP payload format for redundant audio data
    - Sending both types of RTP payload



RTP Payload Format for DTMF Digits

- An Internet Draft
- Both methods described before
- A large number of tones and events
  - DTMF digits, a busy tone, a congestion tone, a ringing tone, etc.
- The named events
  - E: the end of the tone, R: reserved
- Payload format
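A sketch of packing and unpacking a named-event payload. It assumes the four-octet layout of RFC 2833, the RFC that this Internet Draft later became: an 8-bit event code, the E bit, the R bit, a 6-bit volume and a 16-bit duration in timestamp units. The function names are hypothetical, and only the DTMF event codes are included.

```python
import struct

# DTMF event codes as assigned in RFC 2833: digits 0-9, then *, #, A-D.
DTMF_EVENTS = {str(d): d for d in range(10)}
DTMF_EVENTS.update({"*": 10, "#": 11, "A": 12, "B": 13, "C": 14, "D": 15})

def pack_event(digit, end, volume, duration):
    """Build the 4-octet payload: event | E R volume | duration."""
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    if not 0 <= duration <= 0xFFFF:
        raise ValueError("duration must fit in 16 bits")
    flags = (0x80 if end else 0x00) | volume   # R (0x40) is reserved, left at 0
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)

def unpack_event(payload):
    """Return (event code, end flag, volume, duration)."""
    event, flags, duration = struct.unpack("!BBH", payload)
    return event, bool(flags & 0x80), flags & 0x3F, duration

payload = pack_event("5", end=True, volume=10, duration=800)
print(payload.hex())          # 058a0320
print(unpack_event(payload))  # (5, True, 10, 800)
```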

Finis



Speech Source Model and Source Coding

Vocal Tract Model

$$u[n] + \sum_{k=1}^{P} a_k\, x[n-k] = x[n]$$

$$G(z) = \frac{X(z)}{U(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$
