


ON RECTIFIED LINEAR UNITS FOR SPEECH PROCESSING

M.D. Zeiler1*, M. Ranzato2, R. Monga2, M. Mao2, K. Yang2, Q.V. Le2, P. Nguyen2, A. Senior2, V. Vanhoucke2, J. Dean2, G.E. Hinton3

1 New York University, USA
2 Google Inc., USA
3 University of Toronto, Canada

ABSTRACT

Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units. These units are linear when their input is positive and zero otherwise. In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task, achieving lower word error rates than a logistic network with the same topology. Similarly, in an unsupervised setting, we show how we can learn sparse features that are useful for discriminative tasks. All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data.

Index Terms— Rectified Linear Units, Deep Learning, Neural Networks, Unsupervised Learning, Hybrid System

1. INTRODUCTION

Recent years have seen a surge of interest in neural networks for acoustic modeling in speech recognition systems. Compared to traditional Gaussian Mixture Models (GMMs), neural networks have two main advantages: they scale better with the input dimensionality, allowing the use of larger context windows, and they automatically learn discriminative features from data, alleviating the problem of manually engineering and selecting features. Together, these two factors have yielded dramatic improvements in terms of word error rate. In their seminal work, Mohamed et al. [1] proposed a system composed of many layers of logistic units. In order to overcome the notoriously difficult problem of optimizing very deep networks, they used a layer-wise unsupervised learning algorithm, the Restricted Boltzmann Machine (RBM) [2], to provide a sensible initialization, and they demonstrated significant improvements over the baseline GMM. One issue with this training procedure is that tracking the convergence of RBMs is difficult and the overall layer-wise training procedure is laborious and time-consuming, even when using specialized hardware like GPUs.
* This work was done while M.D. Zeiler was an intern at Google.

This challenge has continued to motivate researchers to design better unsupervised algorithms [3, 4] and better optimization methods for training deep neural nets [5, 6].

Inspired by recent work on deep learning for vision applications [7, 8, 9], we propose to replace the logistic non-linearity with a half-rectification non-linearity, which is linear for positive values and zero otherwise. Because of the shape of this non-linearity, we call the resulting deep network a “hinge deep neural network” (HDNN), and the units that compose the HDNN “rectified linear units” (ReLUs) [7]. This small change brings several advantages. First, in our experience it eliminates the need for a “pretraining” phase based on unsupervised learning [8]: we demonstrate empirically that we can easily and successfully train extremely deep networks even from random initialization. Second, an HDNN converges faster than a regular logistic neural net with the same topology. Third, an HDNN is very simple to optimize: even vanilla stochastic gradient descent with a constant learning rate yields very good accuracy. Fourth, an HDNN generalizes better than its logistic counterpart. Finally, rectified linear units are faster to compute because they do not require exponentiation and division, yielding an overall speed-up of 25% on the 4-hidden-layer neural network we tested.

We conjecture that the reason why rectified linear units are so beneficial for efficient learning of deep neural nets is twofold. First, from the optimization perspective, an HDNN is piece-wise linear. If we restrict our attention to the units that are non-zero, the whole system reduces to a linear convex system whose optimization is straightforward even with first order optimizers. Second, an HDNN seems to generalize better because the internal representation produced by the network is much more regularized. Unlike logistic units, which produce small positive values when the input is not aligned with the internal weights, rectified linear units often output exact zeros; for instance, we found that on average about 80% of the units in an HDNN are zero after training. Improved generalization can be seen as an effect of the increased sparsity of the internal representation, or by interpreting an HDNN as a system of stacked binary linear SVMs (as opposed to logistic regression classifiers), and it is well known that such classifiers enjoy better generalization properties [9].
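To make this contrast concrete, here is a small Python sketch (not from the paper; it uses random Gaussian pre-activations rather than a trained network) showing that a ReLU produces exact zeros while a logistic unit only produces small positive values:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(100_000)        # random pre-activations (illustration only)

relu = np.maximum(0.0, u)               # f(u) = max(0, u)
logistic = 1.0 / (1.0 + np.exp(-u))     # f(u) = 1 / (1 + exp(-u))

# The ReLU outputs exact zeros for all negative inputs (about half here),
# while the logistic never outputs zero, only small positive values.
print("fraction of exact zeros, ReLU:     %.2f" % np.mean(relu == 0.0))
print("fraction of exact zeros, logistic: %.2f" % np.mean(logistic == 0.0))
```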

Fig. 1. The proposed non-linearity, ReLU, and the standard neural network non-linearity, logistic.

Although our work does not reveal any benefit of unsupervised learning for HDNNs, it is still interesting to learn features without any supervision for the purpose of automatically discovering phonetic elements and, potentially, for training on languages with small amounts of labeled data. We propose a very simple sparse autoencoder method that learns very interpretable and discriminative features when using ReLUs. Overall, we demonstrate a clear advantage of ReLUs over logistic units in both the supervised and the unsupervised setting. Our empirical validation uses a recently introduced distributed framework [10]. This allows us to train a network with 43 million parameters on four hundred machines using 1500 cores to process 1.2 billion frames per day, more than 6 times faster than using a single NVIDIA GeForce GTX 580 GPU.

2. SUPERVISED LEARNING

Our supervised learning set-up is conventional, except for the use of the proposed activation function. The network is given both an input x (typically a few consecutive frames of a spectrogram representation) and a label representing the state of the HMM for that input. The network processes the input through a sequence of non-linear transformations. In particular, at the i-th layer the network computes h^i = f(W^i h^{i-1} + b^i), where W^i ∈ R^{M×N} is a matrix of trainable weights, b^i ∈ R^M is a vector of trainable biases, h^{i-1} ∈ R^N is the (i-1)-th hidden layer (with h^0 = x the input), and h^i ∈ R^M is the i-th hidden layer. We propose to use as f the following point-wise non-linear function: f(u) = max(0, u). The resulting unit in the network is dubbed ReLU [7]. We have also experimented with other functions, namely the widely used logistic, f(u) = 1/(1 + exp(-u)), and the hyperbolic tangent, f(u) = tanh(u). Since the hyperbolic tangent performed slightly worse than the logistic, we only report the latter in our baseline comparisons. In order to predict the label, the topmost layer of the network uses a softmax non-linearity which outputs probability values. If the network has L layers, the predicted probability of the k-th class is p(k|x) = exp((W^L)_k h^{L-1} + b^L_k) / Σ_{j=1}^{C} exp((W^L)_j h^{L-1} + b^L_j), where (W^L)_j is the j-th row of the last-layer weight matrix, b^L_j is the j-th entry of the last-layer bias vector, and C is the number of classes. Training the parameters of the network (the weight matrices and biases at all layers) is performed by minimizing the cross-entropy loss over the training set. The contribution of each sample x to the loss is L_sup(θ) = -Σ_{j=1}^{C} t_j log p(j|x; θ), where t is a 1-of-C encoding of the target class label and θ collectively denotes all parameters of the neural network, namely the set of (W^i, b^i) at all layers. We will discuss in sec. 4 how we minimize this loss function.
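As an illustration of the forward computation just described, the following is a minimal NumPy sketch of ReLU hidden layers followed by a softmax output. The two-hidden-layer depth, the 10 output classes and the random inputs are illustrative assumptions, not the configuration used in the paper:

```python
import numpy as np

def relu(u):
    # f(u) = max(0, u), applied point-wise
    return np.maximum(0.0, u)

def softmax(u):
    # numerically stable softmax producing the class probabilities p(k|x)
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(x, weights, biases):
    """ReLU hidden layers followed by a softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                       # h^i = max(0, W^i h^{i-1} + b^i)
    return softmax(weights[-1] @ h + biases[-1])  # p(k|x)

# Illustrative configuration: 1040-dim input, two hidden layers of 2560 units,
# and 10 output classes (the paper uses 7969 context-dependent states).
rng = np.random.default_rng(0)
sizes = [1040, 2560, 2560, 10]
weights = [0.01 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

p = forward(rng.standard_normal(1040), weights, biases)
print(p.shape, p.sum())   # (10,) and the probabilities sum to 1

# cross-entropy contribution of one sample whose target class is, say, k = 3
loss = -np.log(p[3])
```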

3. UNSUPERVISED LEARNING

In our unsupervised experiments, we use a method described in appendix B of [11]. This is a very efficient and effective sparse feature learning method which can be understood as a sparse auto-encoder neural network using ReLUs as features. Training proceeds layer by layer, from the bottom to the top in sequence. For each layer, the features (that will subsequently be used as data to train the layer above) are h^i = max(0, W^i h^{i-1} + b^i). During training we couple this layer with an auxiliary layer that reconstructs the input from the features using ĥ^{i-1} = max(0, W^i_r h^i + b^i_r), where W^i_r ∈ R^{N×M} is a matrix of trainable weights, b^i_r ∈ R^N is a vector of trainable biases, and ĥ^{i-1} ∈ R^N is the reconstruction of h^{i-1}. When the layer reconstructs the input x itself (the bottom layer), the ReLUs of the reconstruction layer are replaced by linear units, because x also takes negative values. The parameters are learned by minimizing a loss function. The contribution of each sample to the loss at the i-th layer is L_unsup(θ) = ||ĥ^{i-1} − h^{i-1}||₂² + λ||h^i||₁, with λ ≥ 0. The first term measures the squared reconstruction error and guarantees that the features preserve information about the input. The second term makes the learning algorithm discover sparse features, that is, features with few non-zero values. This is important to restrict the capacity of the model (especially when there are more features than input dimensions) and force it to capture the regularities of the input data. Since the loss can be trivially decreased by scaling down W^i while scaling up W^i_r, we re-parameterize W^i_r as follows: (W^i_r)_j = (W̃^i_r)_j / ||(W̃^i_r)_j||₂, where (W̃^i_r)_j is the j-th column of W̃^i_r, and we learn the parameters of the matrix W̃^i_r. This method can be interpreted as a special case of PSD [12, 13] and sparse coding [14] when inference of the features is computed in just one step. Unlike the RBM objective function, which is intractable, this method can be optimized very efficiently, and it benefits from the use of ReLUs because they naturally produce sparse features.
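For concreteness, here is a rough single-layer sketch of this objective with hand-written gradients and plain per-sample SGD. The function name, learning rate, linear reconstruction path and the re-normalization of the columns of W_r after each step (a simple stand-in for the exact re-parameterization above) are all simplifying assumptions, not the paper's actual implementation, which uses the distributed setup of sec. 4:

```python
import numpy as np

def train_sparse_relu_autoencoder(X, n_hidden, lam=0.1, lr=0.01, epochs=10, seed=0):
    """One layer of the sparse ReLU feature learner sketched above.

    X: (n_samples, n_visible) data matrix. Returns (W, b) such that the
    features are h = max(0, W x + b). The reconstruction path is kept
    linear (as when reconstructing the raw input).
    """
    rng = np.random.default_rng(seed)
    n_vis = X.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_vis))
    b = np.zeros(n_hidden)
    Wr = 0.01 * rng.standard_normal((n_vis, n_hidden))
    br = np.zeros(n_vis)

    for _ in range(epochs):
        for x in X:
            z = W @ x + b
            h = np.maximum(0.0, z)             # sparse ReLU features
            x_hat = Wr @ h + br                # linear reconstruction of the input

            # loss = ||x_hat - x||^2 + lam * ||h||_1  (h >= 0, so the L1 term is sum(h))
            d_xhat = 2.0 * (x_hat - x)
            d_h = Wr.T @ d_xhat + lam          # subgradient of the sparsity penalty
            d_z = d_h * (z > 0)                # ReLU gate

            Wr -= lr * np.outer(d_xhat, h)
            br -= lr * d_xhat
            W -= lr * np.outer(d_z, x)
            b -= lr * d_z

            # keep the reconstruction columns at unit norm (a simple stand-in
            # for the re-parameterization of W_r described in the text)
            Wr /= np.maximum(np.linalg.norm(Wr, axis=0, keepdims=True), 1e-8)

    return W, b

# e.g.: W, b = train_sparse_relu_autoencoder(np.random.randn(500, 1040), n_hidden=2560)
```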

4. LEARNING IN A DISTRIBUTED FRAMEWORK

In order to support training on vast amounts of data in a very short time, we use our recently proposed distributed framework [10]. The hidden units of the network are partitioned across several machines, and each machine further parallelizes computation across several cores. Parallel distributed computation is used across the samples in a mini-batch as well as across the nodes of the neural network. In the experiments of sec. 5 we use this framework and learn the parameters of the system by asynchronous stochastic gradient descent (SGD) [10]. Training proceeds as follows. The network is replicated P times. Each replica is an exact copy of the model, with possibly slightly stale parameters, operating on a random subset of the training data. Besides the P replicas, there is also a sharded parameter server hosting the most up-to-date version of the parameters. Once a model replica has finished computing the gradients on its mini-batch, it sends them to the parameter server, which uses them to update the parameters. Finally, the parameter server sends back an updated copy of the parameters to that model replica. This mechanism allows many model replicas to work concurrently but asynchronously on the same training problem and to quickly update the parameters, while being tolerant to machine failures and high latency.

In this work, we investigated three different ways to update the parameters in the parameter server. Let θ_i^t be the i-th parameter after t−1 weight updates. In vanilla SGD, the parameters are updated using θ_i^{t+1} = θ_i^t − η ∂L/∂θ_i^t, where η is the learning rate. In SGD with Adagrad [10, 15], each parameter has its own adaptive learning rate; the parameter update is θ_i^{t+1} = θ_i^t − η_i^t ∂L/∂θ_i^t, with η_i^t = η / √(Σ_{s=1}^{t} (∂L/∂θ_i^s)²). Finally, in SGD with momentum the parameters are updated by θ_i^{t+1} = θ_i^t − η Δ_i^{t+1}, with Δ_i^{t+1} = 0.9 Δ_i^t + ∂L/∂θ_i^t. While Adagrad aims at gently scaling and annealing the learning rates, momentum speeds up learning along those gradient directions that are persistent during training.
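A compact sketch of the three update rules follows, written for a single parameter vector and ignoring the sharding and asynchrony of the actual system; the class names and the small epsilon added to Adagrad for numerical stability are our own additions:

```python
import numpy as np

class SGD:
    def __init__(self, lr):            # vanilla SGD: theta <- theta - lr * grad
        self.lr = lr
    def update(self, theta, grad):
        return theta - self.lr * grad

class Adagrad:
    def __init__(self, lr, eps=1e-8):  # per-parameter rate lr / sqrt(sum of squared grads)
        self.lr, self.eps = lr, eps
        self.sum_sq = None
    def update(self, theta, grad):
        if self.sum_sq is None:
            self.sum_sq = np.zeros_like(theta)
        self.sum_sq += grad ** 2
        return theta - self.lr * grad / (np.sqrt(self.sum_sq) + self.eps)

class Momentum:
    def __init__(self, lr, rho=0.9):   # delta <- rho * delta + grad; theta <- theta - lr * delta
        self.lr, self.rho = lr, rho
        self.delta = None
    def update(self, theta, grad):
        if self.delta is None:
            self.delta = np.zeros_like(theta)
        self.delta = self.rho * self.delta + grad
        return theta - self.lr * self.delta
```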

Fig. 2. Frame accuracy as a function of time for a 4-hidden-layer HNN trained with different optimizers.

Fig. 3. Frame accuracy as a function of time for a 4-hidden-layer neural net trained with either logistic units or ReLUs, using as optimizer either SGD or SGD with Adagrad (ADG).

5. EXPERIMENTS

All experiments are performed using several hundred hours of US English data collected using Voice Search, Voice Typing and read data. The test set follows the same distribution as the training set but uses independent sources. The setup for the hybrid decoding is exactly the same as the one described in earlier work [16]. In the supervised setting, a baseline GMM-HMM system is trained and used to generate 7969 context-dependent tied acoustic states. This system is also used to produce state labels for every input frame using forced alignment. These labels are the targets for the supervised network. In both the supervised and unsupervised settings, the input to the network consists of 26 consecutive frames, each comprising 40 log-energy filter bank outputs representing 25ms of speech. Consecutive frames are 10ms apart. The overall input dimensionality is 1040, although spectral analysis reveals that 95% of the variance is concentrated in the leading 100-dimensional principal components. All layers of our networks have 2560 hidden units, and training has been performed by partitioning each network across 4 machines using up to 4 CPUs each. The number of model replicas P has been set to 100. All parameters in the weight matrices are initialized at random, while the biases are initialized at zero. Learning rates have been cross-validated; typically, an HDNN uses a learning rate 10 times smaller than a logistic DNN.
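For concreteness, the 1040-dimensional input described above (26 consecutive frames of 40 log-energy filter-bank outputs) could be assembled roughly as follows; the asymmetric left/right context split and the helper name are assumptions, since the paper only states that 26 consecutive frames are used:

```python
import numpy as np

def stack_context(filterbanks, center, left=20, right=5):
    """Stack 26 consecutive 40-dim filter-bank frames into one 1040-dim input.

    filterbanks: (n_frames, 40) array of log-energy filter-bank outputs,
    one row per 10ms frame. left + 1 + right = 26 frames in total.
    """
    lo, hi = center - left, center + right + 1
    window = filterbanks[lo:hi]                   # (26, 40)
    return window.reshape(-1)                     # (1040,)

# e.g. a random stand-in for one utterance's filter-bank features
feats = np.random.randn(300, 40)
x = stack_context(feats, center=100)
print(x.shape)                                    # (1040,)
```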

5.1. Supervised Learning Experiments

The results we report are obtained by training for one week. In the first experiment, shown in Fig. 2, we compare the three optimization strategies described in sec. 4 on a 4-hidden-layer HNN initialized at random. In terms of wall clock time to reach a given frame accuracy on the validation set, Adagrad exhibits the fastest convergence, although plain SGD eventually reaches the same overall frame accuracy. Momentum performs slightly worse. Similar findings were observed using a network with logistic units; however, plain SGD does not perform as well as Adagrad in this case, see Fig. 3. Unlike an HNN, a logistic network does need accelerated first order methods to yield good frame accuracy. It seems that optimization is much harder in logistic networks than in HNNs. Fig. 3 shows that a logistic network trained with Adagrad can achieve the same accuracy as an HNN trained with either Adagrad or even plain SGD.

Fig. 4. Validation frame accuracy over time using HNNs with different numbers of hidden layers and SGD.
Table 1. Word error rate of HNN with varying number of hidden layers.

Nr. hid. layers |  1   |  2   |  4   |  8   |  10  |  12
WER %           | 16.0 | 12.8 | 11.4 | 10.9 | 11.0 | 11.1
Fig. 5. Left: Random subset of the 2560 filters learned in an unsupervised way. The vertical axis is over frequencies and the horizontal axis is over time (26 frames, each 10ms long). Right: Examples of how some filters (left part) match samples from a validation set (right part, with the corresponding phone label).
However, we found that the performance in terms of word error rate is superior when using an HNN trained with plain SGD. The word error rate of a logistic network trained with Adagrad is 11.8% (slightly better than when training with SGD), while the word error rate of the HNN is 11.7% and 11.4% when using Adagrad and SGD, respectively. Since a difference of 0.1% is statistically significant on our data set, we conclude that HNNs are not only easier to train but also generalize better. Finally, Fig. 4 shows that extremely deep HNNs (we tested up to 12 hidden layers) can be successfully trained from random initialization. Since we did not allocate more resources to the deeper networks, their computation and convergence time is slower. However, they do not get stuck in the optimization and produce among the best results. Table 1 reports the corresponding word error rates on the test set. The 8-hidden-layer network produces the best rate of 10.9%, closely followed by the 10- and 12-hidden-layer HNNs. We also tested very deep logistic networks from random initialization, but observed that the optimization gets stuck when using 8 or more hidden layers. After one week, the 8-hidden-layer logistic network achieves a word error rate of only 12.0%.

5.2. Unsupervised Learning Experiments

To validate the use of ReLUs for unsupervised learning, we trained on the same input data as in the previous section (but without making use of the labels). After learning, we first inspected the learned features. Fig. 5 shows a subset of the 2560 features, where each tile corresponds to the weights connected to a hidden unit. Some features resemble Gabor functions localized in the time-frequency domain, while others are more complex and seem to capture the structure and temporal dynamics of formants. We also tested the use of logistic units and linear units, but could not learn any structured set of features.

Table 2. Unsupervised discovery of phones: a threshold on a single feature can be a high-accuracy phone detector.

Phone label | Precision % | Recall % | Accuracy %
er          | 57.0        | 2.0      | 98.5
iy          | 49.6        | 11.0     | 96.7
r           | 52.5        | 6.0      | 97.2
Table 3. Test frame accuracy using a linear classifier on the features learned in an unsupervised way.

Nr. hid. layers  |  0   |  1   |  2   |  3
Frame accuracy % | 28.5 | 37.8 | 38.3 | 39.2
To better interpret the ReLU features, we looked for the best matching input sample in the validation set. The leftmost part of Fig. 5 shows how closely some filters resemble actual inputs, suggesting that some of these features may have discovered phonetic elements in an unsupervised way. To validate this hypothesis, we used each single feature as a threshold classifier and checked whether its output correlates with the phone label of the input. Table 2 shows that this is indeed the case for some phones. Since a large fraction of the frames has the label “silence”, there is a very large number of negative inputs, and achieving a precision of 50% at a recall greater than 1% is considered remarkable. The discrimination ability of ReLUs is expected to improve when we consider the whole feature set. We therefore trained a linear logistic regression classifier on the whole feature vector. Table 3 reports the frame accuracy as we learn more layers of features and demonstrates that the features do become more discriminative as we stack layers, although with diminishing returns. Using these features to initialize a deep HNN did not improve performance, however: we observed faster initial convergence but not better accuracy after a few hours of training.

6. CONCLUSION

In this empirical study we advocate the use of ReLUs in deep networks since a) they are easier to optimize, b) they converge faster, c) they generalize better and d) they are faster to compute. Future work will leverage unsupervised learning and ReLUs for tasks where labeled data is very scarce.

7. REFERENCES

[1] A.R. Mohamed, G.E. Dahl, and G.E. Hinton, “Deep belief networks for phone recognition,” NIPS 22 Workshop on Deep Learning for Speech Recognition, 2009.
[2] G.E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, pp. 1771–1800, 2002.
[3] C. Plahl, T.N. Sainath, B. Ramabhadran, and D. Nahamoo, “Improved pre-training of deep belief networks using sparse encoding symmetric machines,” in ICASSP, 2012.
[4] T.N. Sainath, B. Kingsbury, and B. Ramabhadran, “Auto-encoder bottleneck features using deep belief networks,” in ICASSP, 2012.
[5] B. Kingsbury, T.N. Sainath, and H. Soltau, “Scalable minimum Bayes risk training of deep neural network acoustic models using distributed Hessian-free optimization,” in Interspeech, 2012.
[6] L. Deng, B. Hutchinson, and D. Yu, “Parallel training of deep stacking networks,” in Interspeech, 2012.
[7] V. Nair and G.E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in ICML, 2010.
[8] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” Journal of Machine Learning Research - Proceedings Track, vol. 15, pp. 315–323, 2011.
[9] A. Krizhevsky, I. Sutskever, and G.E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
[10] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale distributed deep networks,” in NIPS, 2012.
[11] M. Ranzato, “Unsupervised learning of feature hierarchies,” Ph.D. thesis, ch. 1, 2009.
[12] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to object recognition,” Tech. Rep. CBLL-TR-2008-12-01, Computational and Biological Learning Lab, Courant Institute, NYU, 2008.
[13] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in ICML, 2010.
[14] C.J. Rozell, D.H. Johnson, R.G. Baraniuk, and B.A. Olshausen, “Sparse coding via thresholding and local competition in neural circuits,” Neural Computation, 2008.

[15] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” in COLT, 2010.
[16] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Interspeech, 2012.

