Pointwise v18.0 R3 x86/x64 + Activation Crack - jyvsoft

# Pointwise Activation key

(2) The pointwise convolution then applies a number of 1-dimensional […] Table 4 shows the key specifications for the FRDM K66F development […] Mish is a smooth, non-monotonic activation function. Mish is used in YOLOv4 because of its low cost and its various […] Pointwise is a program for computational fluid dynamics and 3D modeling that provides a set of tools to create […]


##### Abstract

The aim of this thesis is to study the effect that linguistic context exerts on the activation and processing of word meaning over time. Previous studies have demonstrated that a biasing context makes it possible to predict upcoming words. The context causes the pre-activation of expected words and facilitates their processing when they are encountered. The interaction of context and word meaning can be described in terms of feature overlap: as the context unfolds, the semantic features of the processed words are activated and words that match those features are pre-activated and thus processed more quickly when encountered. The aim of the experiments in this thesis is to test a key prediction of this account, viz., that the facilitation effect is additive and occurs together with the unfolding context. Our first contribution is to analyse the effect of an increasing amount of biasing context on the pre-activation of the meaning of a critical word. In a self-paced reading study, we investigate the amount of biasing information required to boost word processing: at least two biasing words are required to significantly reduce the time to read the critical word. In a complementary visual world experiment we study the effect of context as it unfolds over time. We identify a ceiling effect after the first biasing word: when the expected word has been pre-activated, an increasing amount of context does not produce any additional significant facilitation effect. Our second contribution is to model the activation effect observed in the previous experiments using a bag-of-words distributional semantic model. The similarity scores generated by the model significantly correlate with the association scores produced by humans. When we use point-wise multiplication to combine contextual word vectors, the model provides a computational implementation of feature overlap theory, successfully predicting reading times. 
Our third contribution is to analyse the effect of context on semantically similar words. In another visual world experiment, we show that words that are semantically similar generate similar eye-movements towards a related object depicted on the screen. A coherent context pre-activates the critical word and therefore increases the expectations towards it. This experiment also tested the cognitive validity of a distributional model of semantics by using this model to generate the critical words for the experimental materials used.

Source: https://era.ed.ac.uk/handle/1842/10508

## Prelims

The Transformer from “Attention is All You Need” has been on a lot of people’s minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly.

In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.

To follow along you will first need to install PyTorch. The complete notebook is also available on github or on Google Colab with free GPUs.

Note this is merely a starting point for researchers and interested developers. The code here is based heavily on our OpenNMT packages. (If helpful feel free to cite.) For other full-service implementations of the model, check out Tensor2Tensor (tensorflow) and Sockeye (mxnet).

• Alexander Rush (@harvardnlp or srush@seas.harvard.edu), with help from Vincent Nguyen and Guillaume Klein

My comments are blockquoted. The main text is all from the paper itself.

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as a basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Most competitive neural sequence transduction models have an encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations $(x_1, …, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, …, z_n)$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1,…,y_m)$ of symbols one element at a time. At each step the model is auto-regressive (cite), consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

### Encoder

The encoder is composed of a stack of $N=6$ identical layers.

We employ a residual connection (cite) around each of the two sub-layers, followed by layer normalization (cite).

That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.
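The sub-layer wrapper $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ can be sketched for a single position in plain Python. This is a minimal illustration, not the post's implementation: the names `layer_norm` and `sublayer_connection` are hypothetical, the learnable gain and bias of layer normalization are left out, and dropout is omitted.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance.

    Assumption: the learnable gain and bias of a full layer norm are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position (no dropout)."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

# Usage with a toy sub-layer that doubles its input.
print(sublayer_connection([1.0, 2.0, 3.0], lambda v: [2.0 * a for a in v]))
```

Because the sub-layer output is added to its own input, the sub-layer must preserve the vector length, which is why every sub-layer shares the same output dimension.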

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}}=512$.

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

### Decoder

The decoder is also composed of a stack of $N=6$ identical layers.

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

The attention mask shows the positions each target word (row) is allowed to look at (column). Words are blocked from attending to future words during training.
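The decoder mask described above can be sketched as a lower-triangular boolean matrix. This is an illustrative stand-in using nested lists rather than tensors; the function name `subsequent_mask` and the list-based representation are assumptions made for this example.

```python
def subsequent_mask(size):
    """Return a size x size mask where mask[i][j] is True
    when target position i (row) may attend to position j (column)."""
    return [[j <= i for j in range(size)] for i in range(size)]

# Print the mask for a 4-token target: 1 = allowed, 0 = blocked.
for row in subsequent_mask(4):
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Each row allows one more position than the row above it, so position $i$ never sees a later position.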

### Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The two most commonly used attention functions are additive attention (cite), and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ (cite). We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$.) To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
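Scaled dot-product attention as described above (dot products of each query with all keys, division by $\sqrt{d_k}$, softmax, weighted sum of values) can be sketched in plain Python. This is a minimal sketch rather than the post's PyTorch code: it uses nested lists instead of tensors, the names `softmax` and `attention` are chosen for this example, and it assumes every query has at least one unmasked key.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, mask=None):
    """Scaled dot-product attention over lists of vectors.

    mask[i][j] is True when query i may attend to key j.
    Assumption: each query has at least one allowed key.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Compatibility of the query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if mask is not None:
            # Illegal connections are set to -inf before the softmax.
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```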

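Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, …, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$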

Where the projections are parameter matrices $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.

### Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways: 1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite).

2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections.

### Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}}=512$, and the inner-layer has dimensionality $d_{ff}=2048$.
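The position-wise feed-forward network (two linear transformations with a ReLU in between, applied to every position with the same weights) can be sketched as follows. This is an illustrative version with tiny dimensions and nested lists; the helper names `linear` and `position_wise_ffn` are assumptions for this example, and dropout is omitted.

```python
def linear(x, weight, bias):
    """Compute x @ weight + bias, where weight has len(x) rows and len(bias) columns."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]

def position_wise_ffn(sequence, w1, b1, w2, b2):
    """Apply the same two-layer network to each position independently."""
    outputs = []
    for x in sequence:
        hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # ReLU
        outputs.append(linear(hidden, w2, b2))
    return outputs

# Toy sizes: d_model = 2, d_ff = 3 (the paper uses 512 and 2048).
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, 0.5]
print(position_wise_ffn([[1.0, 2.0], [-1.0, -1.0]], w1, b1, w2, b2))
```

Since no position reads another position's vector, reordering the input sequence only reorders the outputs, which is the sense in which this equals a convolution with kernel size 1.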

### Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.

### Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite).

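In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos,2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

$$PE_{(pos,2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$$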

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
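The sinusoidal encoding from the Transformer paper, $PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$, can be sketched with the standard library alone. The function name `positional_encoding` and the list-of-lists output are assumptions for this example; dropout is not applied here.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # even_index plays the role of 2i in the formula.
        for even_index in range(0, d_model, 2):
            angle = pos / (10000 ** (even_index / d_model))
            table[pos][even_index] = math.sin(angle)
            if even_index + 1 < d_model:
                table[pos][even_index + 1] = math.cos(angle)
    return table

for row in positional_encoding(3, 4):
    print([round(v, 4) for v in row])
```

The first pair of dimensions oscillates fastest and the last pair slowest, which matches the geometric progression of wavelengths described above.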

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.

The positional encoding adds a sine wave based on position. The frequency and offset of the wave are different for each dimension.

We also experimented with using learned positional embeddings (cite) instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

### Full Model

Here we define a function that takes in hyperparameters and produces a full model.

This section describes the training regime for our models.

We stop for a quick interlude to introduce some of the tools needed to train a standard encoder decoder model. First we define a batch object that holds the src and target sentences for training, as well as constructing the masks.

Next we create a generic training and scoring function to keep track of loss. We pass in a generic loss compute function that also handles parameter updates.

### Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.

Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

We will use torchtext for batching. This is discussed in more detail below. Here we create batches in a torchtext function that ensures our batch size, padded to the maximum batch size, does not surpass a threshold (25000 if we have 8 GPUs).
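The idea of capping a batch by its padded token count can be illustrated without torchtext. The sketch below is not the torchtext function the post refers to; `batch_by_tokens` is a hypothetical helper that sorts sentences by length and starts a new batch whenever padding every sentence to the longest one would exceed the threshold.

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sentence indices so that len(batch) * longest_sentence <= max_tokens.

    Assumption: a sentence longer than max_tokens still gets a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for index in order:
        # Sentences arrive in ascending length, so this one is the longest so far.
        padded_size = (len(current) + 1) * lengths[index]
        if current and padded_size > max_tokens:
            batches.append(current)
            current = []
        current.append(index)
    if current:
        batches.append(current)
    return batches

print(batch_by_tokens([5, 2, 3, 9, 2], max_tokens=10))
```

Sorting first keeps sentences of similar length together, so little of each batch is spent on padding.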

### Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models, step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

### Optimizer

We used the Adam optimizer (cite) with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. We varied the learning rate over the course of training, according to the formula:

$$lrate = d_{\text{model}}^{-0.5} \cdot \min\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)$$

This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps=4000$.
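The warmup schedule from the Transformer paper, $d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$, can be written as a small function. The name `learning_rate` and the clamp of step 0 to step 1 are assumptions for this sketch; the post's own optimizer wrapper may differ.

```python
def learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # assumption: avoid 0 ** -0.5 on the first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(step, learning_rate(step))
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the rate peaks.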

Note: This part is very important. The model needs to be trained with this setup.

Example of the curves of this model for different model sizes and for optimization hyperparameters.

### Label Smoothing

During training, we employed label smoothing of value $\epsilon_{ls}=0.1$ (cite). This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

We implement label smoothing using the KL divergence loss. Instead of using a one-hot target distribution, we create a distribution that places most of its mass (the confidence) on the correct word and distributes the rest of the mass throughout the vocabulary.

Here we can see an example of how the mass is distributed to the words based on confidence.

Label smoothing actually starts to penalize the model if it gets very confident about a given choice.
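The smoothed target distribution and its KL divergence against a prediction can be sketched as below. This is a simplified illustration: `smoothed_distribution` and `kl_divergence` are hypothetical names, the leftover mass is spread evenly over all other vocabulary entries, and padding tokens are not treated specially.

```python
import math

def smoothed_distribution(target_index, vocab_size, smoothing=0.1):
    """Put 1 - smoothing on the target and share the rest among the other words."""
    off_value = smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_index] = 1.0 - smoothing
    return dist

def kl_divergence(target_dist, predicted_probs):
    """KL(target || predicted); assumes every predicted probability is positive."""
    return sum(t * math.log(t / p)
               for t, p in zip(target_dist, predicted_probs) if t > 0)

target = smoothed_distribution(2, 5, smoothing=0.1)
matching = list(target)
overconfident = [0.001, 0.001, 0.996, 0.001, 0.001]
print(kl_divergence(target, matching), kl_divergence(target, overconfident))
```

A prediction that matches the smoothed target has zero loss, while a prediction that concentrates nearly all probability on the correct word is penalized, which is the effect described above.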

We can begin by trying out a simple copy-task. Given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols.

### Greedy Decoding

This code predicts a translation using greedy decoding for simplicity.
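Greedy decoding picks the highest-scoring token at each step and feeds it back as input for the next step. The sketch below assumes a hypothetical callable `next_token_scores(src, prefix)` standing in for the trained model, and it stops early at an end symbol, which is a choice made for this example. The toy `copy_model` mimics the copy task.

```python
def greedy_decode(next_token_scores, src, start_symbol, end_symbol, max_len):
    """Generate tokens one at a time, always taking the argmax of the scores."""
    output = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_token_scores(src, output)
        best = max(range(len(scores)), key=scores.__getitem__)
        output.append(best)
        if best == end_symbol:  # assumption: stop once the end symbol appears
            break
    return output

def copy_model(src, prefix):
    """Toy stand-in for a model: scores the next source token highest (0 = end)."""
    vocab_size = 6
    position = len(prefix) - 1
    target = src[position] if position < len(src) else 0
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]

print(greedy_decode(copy_model, [3, 4, 5], start_symbol=1, end_symbol=0, max_len=10))
```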

Now we consider a real-world example using the IWSLT German-English Translation task. This task is much smaller than the WMT task considered in the paper, but it illustrates the whole system. We also show how to use multi-gpu processing to make it really fast.

We will load the dataset using torchtext and spacy for tokenization.

Batching matters a ton for speed. We want to have very evenly divided batches, with absolutely minimal padding. To do this we have to hack a bit around the default torchtext batching. This code patches their default batching to make sure we search over enough sentences to find tight batches.

### Multi-GPU Training

Finally, to really target fast training, we will use multiple GPUs. This code implements multi-GPU word generation. It is not specific to the Transformer, so I won’t go into too much detail. The idea is to split up word generation at training time into chunks to be processed in parallel across many different GPUs. We do this using PyTorch parallel primitives:

• replicate - split modules onto different gpus.
• scatter - split batches onto different gpus
• parallel_apply - apply module to batches on different gpus
• gather - pull scattered data back onto one gpu.
• nn.DataParallel - a special module wrapper that calls these all before evaluating.

Now we create our model, criterion, optimizer, data iterators, and parallelization.

Now we train the model. I will play with the warmup steps a bit, but everything else uses the default parameters. On an AWS p3.8xlarge with 4 Tesla V100s, this runs at ~27,000 tokens per second with a batch size of 12,000.

### Training the System

Once trained we can decode the model to produce a set of translations. Here we simply translate the first sentence in the validation set. This dataset is pretty small so the translations with greedy search are reasonably accurate.

So this mostly covers the transformer model itself. There are four aspects that we didn’t cover explicitly. We also have all these additional features implemented in OpenNMT-py.

1) BPE/Word-piece: We can use a library to first preprocess the data into subword units. See Rico Sennrich’s subword-nmt implementation. These models will transform the training data to look like this:

▁Die ▁Protokoll datei ▁kann ▁ heimlich ▁per ▁E - Mail ▁oder ▁FTP ▁an ▁einen ▁bestimmte n ▁Empfänger ▁gesendet ▁werden .

2) Shared Embeddings: When using BPE with shared vocabulary we can share the same weight vectors between the source / target / generator. See the (cite) for details. To add this to the model simply do this:

3) Beam Search: This is a bit too complicated to cover here. See OpenNMT-py for a PyTorch implementation.

4) Model Averaging: The paper averages the last k checkpoints to create an ensembling effect. We can do this after the fact if we have a bunch of models:
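Averaging checkpoints amounts to taking the element-wise mean of each parameter across the saved models. The sketch below represents a checkpoint as a plain dictionary from parameter name to a flat list of floats, which is an assumption for illustration; `average_checkpoints` is a hypothetical helper, not part of any library.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters over a list of checkpoints.

    Each checkpoint maps a parameter name to a flat list of floats, and all
    checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged

last_two = [{"w": [1.0, 2.0], "b": [0.0]}, {"w": [3.0, 6.0], "b": [1.0]}]
print(average_checkpoints(last_two))
```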

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop}=0.1$, instead of 0.3.

The code we have written here is a version of the base model. There are fully trained versions of this system available here (Example Models).

With the additional extensions in the last section, the OpenNMT-py replication gets to 26.9 on EN-DE WMT. Here I have loaded in those parameters to our reimplementation.

### Attention Visualization

Even with a greedy decoder the translation looks pretty good. We can further visualize it to see what is happening at each layer of the attention.

Hopefully this code is useful for future research. Please reach out if you have any issues. If you find this code helpful, also check out our other OpenNMT tools.

Cheers, srush

Source: https://nlp.seas.harvard.edu/2018/04/03/attention.html

ANTI-DISTILLATION: IMPROVING REPRODUCIBILITY

Deep networks have been revolutionary in improving performance of machine learning and artificial intelligence systems. Their high prediction accuracy, however, comes at a price of model…

Synthesizing Irreproducibility in Deep Networks

TLDR

This study demonstrates the effects of randomness in initialization, training data shuffling window size, and activation functions on prediction irreproducibility, even under very controlled synthetic data.

Randomness In Neural Network Training: Characterizing The Impact of Tooling

TLDR

The results suggest that deterministic tooling is critical for AI safety, but also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to 746%, 241%, and 196% on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training.

Dropout Prediction Variation Estimation Using Neuron Activation Strength

• Haichao Yu, Zhe Chen, Dong Lin, G. Shamir, Jie Han
• Computer Science
• ArXiv
• 2021

TLDR

This approach provides an inference-once alternative to estimate dropout prediction variation as an auxiliary task and demonstrates that using activation features from a subset of the neural network layers can be sufficient to achieve variation estimation performance almost comparable to that of using activation features from all layers, thus reducing resources even further for variation estimation.

Smooth activations and reproducibility in deep networks

TLDR

A new family of activations, Smooth ReLU (SmeLU), is proposed, designed to give better tradeoffs while also keeping the mathematical expression simple, and thus training speed fast and implementation cheap, and demonstrating the superior accuracy-reproducibility tradeoffs with smooth activation, SmeLU in particular.

Source: https://www.semanticscholar.org/paper/Beyond-Point-Estimate%3A-Inferring-Ensemble-Variation-Chen-Wang/a5032b460626aff980a7c98c99900a5c093e3c75

Obtain an ISV settings file extendsim. Place the extendsim. Activation will either be automatic or manual. Make a copy of the extendsim. Since RLM was already running, the extendsim.

If you subsequently edit the license file, such as changing the ISV port, distribute the modified file to the Clients. Stop that Service (these instructions are for Windows 10; others differ): In the list that appears, double-click Services and Applications. Double-click Services. In the list that appears, scroll to the name of the identified RLM Service.

Stop that service (right-click and select Stop).

### Set up the RLM as a Windows Service

Close the Computer Management window. Remove the currently running RLM Service:. If successful, the Service will be removed and you can close the window. The folder contains the RLM Svc folder and three files: extendsim. Copy the all the settings and license files but not RLM. The extendsim. If you installed the License Manager without following these instructions, the log file will report something similar to:. What to do if RLM is already running on the Server.

Both cases accommodate the products that are already using RLM as their license manager.

Customers who do not have RLM will need to obtain a license and perform a complete download and install. For more information, or to see answers to the most common installation questions, please see our Get Help page.

Note that a valid license is required to run the software. Current customers will need to update to a V An activation key is a digit numeric string used during the Reprise License Manager RLM installation process to contact Pointwise for information on the number of processes licensed and their expiration date. For best results, we recommend that you have your activation key prior to beginning installation. If you do not already have an activation key, you can request one at www.

The terms of the agreement under which you may acquire the right to use the software include the "Right To Use License" attached hereto or accompanying the software. Your purchase, license, receipt, or use of the software constitutes your acceptance of all the "Right To Use License".

Please read the Right To Use License. The current production release is Pointwise V

As such, products that are developed or being developed against the RLM library cannot run with a floating license configuration. Some products are only available in a floating capacity and therefore cannot be used due to this issue.

### Pointwise and RLM Help Guide

This problem was shared with the Reprise Software team. The ticket has since been closed. Please contact me for more info or for me to run any other diagnostic tools. A full strace of the rlm test client "rlmclient" is attached. Please let us know if you need further information to troubleshoot. No indication here what was broken or what was fixed, but whatever it was, it is fixedinsiders now since 14 Sep was after Fall Creators.

Possibly something to do with select on ipv6, shrug.

FAQs for license administrators Common installation and configuration questions. Admin Corner section of our blog Common installation and configuration questions. Known issues for license administrators Known issues and how to resolve unusual behavior.

What do all these have in common?

They are all annoying, everyone hates them (with the possible exception of their spouses and mothers), all are to some extent illegal, and for them to be successful, they must be bad for your business.

What does this have to do with license management? Nothing at all. But …. Exceptional Tech Support License administrators and users of RLM-licensed applications depend primarily on their software publishers for support, but the links to the software licensing technical resources below may answer your RLM questions and help you with troubleshooting.


### Volume Tech Files

On Microsoft Windows servers, you may want to install and run the rlm server as a Windows service process.

A service process can start automatically at boot time and remain running as long as the system is up, regardless of user logins and logouts. Once installed as a service, it remains installed until it is explicitly deleted as a service. To install using the web interface, select Manage Windows Service from the main menu on the left.


You will get a form with 3 data fields. All 3 fields will be filled in with default values. By default, the logfile is put in the directory with the rlm. Also, by default, rlm will search for all license files in this directory. If you select Remove Service, the service name specified in the form will be removed. If the instance of rlm which you are running is actually running as a service, you will not be able to Remove the Service since it is running.

To remove the service, you will have to stop the service, and then either use the service control panel in Windows, or run rlm in a command window and use the Remove Service option in the web interface. Optionally, you can install RLM as a service in a command window. To do this, use the rlm program itself in a command window, with special arguments:

This parameter is required. If sname contains embedded whitespace, it must be enclosed in double quotes. Installed RLM services are also deleted with the rlm program. Services must be stopped via the service control panel before they can be deleted. Note that deleting a service deletes it from the Windows service database; it does not delete the rlm executable or associated license file(s):

Note that you must install this startup script as root. The startup script should su to a different user so that the rlm servers are not running as root. The following is an example of a script which would start rlm at boot time on Mac systems. You can add additional ProgramArguments as needed.

If you have multiple products that use the RLM License Manager, additional instructions apply. This utility can be used to create, delete and start a Windows Service and set certain User-defined parameters.

It can be something like rlm or gsrlm. User: Optional. Password: Optional. It allows you to view all available Services on your machine. Starting the Windows Service: stop all rlm processes that are running. This can be done in the Windows Task Manager by selecting Show processes from all users. If you expand the Name heading on the Services Window, you can see the Short name of the Windows Service at the end of the Friendly name.

Press the Delete Service Button. Repeat for any other rlm Windows Services remaining. When all previous rlm Windows Services have been deleted, complete the fields listed on the Manage Services Tab for the new permanent Windows Service. Press the Install Service Button. Right-click the rlm Service that you wish to start and select Start from the menu options. Go to the General Tab.

Command Line Instructions: Open a command prompt as administrator (cmd).

Thank you for coming to us with your questions and concerns. Many of our most commonly asked questions and solutions can be found here to help troubleshoot any issues you may encounter. If you don't see the solution or answer you need, feel free to send us a description of the problem using the Ask for Help form or by calling us at PTWISE.

If you have administrator privileges, you can right-click the Pointwise installer file and select Run as Administrator.

If there is, Pointwise automatically connects to that server. In certain situations, you may wish to connect to a specific RLM server running Pointwise. If Pointwise cannot find a license server in Step 1, the Pointwise License Wizard will launch automatically. Simply follow the remaining steps outlined above.

Yes, you can borrow processes from your main license server. This is most often necessary when you need to take a laptop with you on a short-term trip. Note that the maximum time for which you can borrow a process is 30 days. In the Pointwise interface, this will be displayed as 31 days because each day is counted until midnight.

For example, if you chose to borrow a process for 15 days, the first day would end at midnight on the day you borrowed the process.

Returning your borrowed Pointwise process before it expires takes just a few simple steps. Note: When a license is borrowed or roaming, it depends on the version of the license file on the license server remaining fixed.

If that file is changed, it can prevent the early return of a borrowed process. For more information, please refer to the RLM license administration documentation available at: www.

## Understanding LSTM Networks

Posted on August 27, 2015

### Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.

In the above diagram, a chunk of neural network, $$A$$, looks at some input $$x_t$$ and outputs a value $$h_t$$. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

An unrolled recurrent neural network.

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data.
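The unrolled view can be sketched in a few lines of NumPy. This is only an illustration: the choice of a single tanh layer for the module $$A$$, the weight names `W_xh` and `W_hh`, and the sizes are assumptions made for the example, not something fixed by the text above.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b):
    """Apply the same module A at every time step, passing h_t to its successor."""
    h = h0
    hs = []
    for x_t in xs:                                # one "copy" of the network per step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)    # assumed form of A: a single tanh layer
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                      # 5 time steps, 3 input features (illustrative)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(4, 4))
b = np.zeros(4)
hs = rnn_forward(xs, np.zeros(4), W_xh, W_hh, b)
print(len(hs), hs[-1].shape)
```

The loop body is the same at every step; only the hidden state `h` carries information from one step to the next.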

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

### The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!

### LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

### The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
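A gate can be sketched as a sigmoid followed by a pointwise multiplication. In this minimal example the gate's pre-activations (`logits`) are hand-picked to show the two extremes and the midpoint; in a real LSTM they would come from a learned layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(signal, gate_logits):
    """Scale each component of `signal` by a value between 0 and 1."""
    return sigmoid(gate_logits) * signal          # pointwise (element-wise) multiplication

signal = np.array([2.0, -1.0, 0.5])
logits = np.array([-50.0, 0.0, 50.0])             # hand-picked: closed, half open, open
gated = gate(signal, logits)
print(gated)                                      # first entry ~0, second halved, third unchanged
```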

An LSTM has three of these gates, to protect and control the cell state.

### Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $$h_{t-1}$$ and $$x_t$$, and outputs a number between $$0$$ and $$1$$ for each number in the cell state $$C_{t-1}$$. A $$1$$ represents “completely keep this” while a $$0$$ represents “completely get rid of this.”

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
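In the notation commonly used for LSTMs, the forget gate can be written as follows. The weight matrix $$W_f$$ and bias $$b_f$$ are names assumed here for the learned parameters of the sigmoid layer, and $$[h_{t-1}, x_t]$$ denotes the concatenation of the two vectors.

$$
f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)
$$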

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $$\tilde{C}_t$$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
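Written as equations, with $$W_i$$, $$b_i$$, $$W_C$$ and $$b_C$$ as assumed names for the learned parameters of the two layers, the input gate and the candidate values take the following form.

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)
\end{aligned}
$$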

It’s now time to update the old cell state, $$C_{t-1}$$, into the new cell state $$C_t$$. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by $$f_t$$, forgetting the things we decided to forget earlier. Then we add $$i_t*\tilde{C}_t$$. These are the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
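Put together, the update described above is a single line, where $$*$$ is pointwise multiplication:

$$
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t
$$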

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $$\tanh$$ (to push the values to be between $$-1$$ and $$1$$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
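As equations, with $$W_o$$ and $$b_o$$ as assumed names for the parameters of the output sigmoid layer:

$$
\begin{aligned}
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t * \tanh\left(C_t\right)
\end{aligned}
$$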

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
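The whole walk-through fits in one short function. The following is a minimal NumPy sketch of a single LSTM step under the description above. The parameter names (`W_f`, `b_f`, and so on), the dictionary layout, and the sizes are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step: forget, decide what to store, update the state, output."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])         # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])         # input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])     # candidate values
    C_t = f_t * C_prev + i_t * C_tilde             # new cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])         # output gate
    h_t = o_t * np.tanh(C_t)                       # filtered view of the cell state
    return h_t, C_t

hidden, inputs = 4, 3                              # illustrative sizes
rng = np.random.default_rng(0)
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + inputs))
    params[f"b_{name}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(6, inputs)):           # run six time steps
    h, C = lstm_step(x_t, h, C, params)
print(h, C)
```

Note that the cell state `C` is only ever touched by a pointwise multiplication and a pointwise addition, which is the "conveyor belt" behavior described earlier.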

### Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
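With peepholes on all three gates, each gate layer simply receives the cell state as an additional input. Using the same assumed parameter names as for the standard gates, this can be written as:

$$
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)
\end{aligned}
$$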

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
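In equation form, coupling the two gates means the input gate is replaced by $$1 - f_t$$, so the state update becomes:

$$
C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t
$$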

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.
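A GRU step can be sketched in the same style. This is a minimal illustration of the commonly used formulation with an update gate `z_t` and a reset gate `r_t`. The weight names and sizes are assumptions, and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step: a single state h plays the role of both cell and hidden state."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                        # update gate
    r_t = sigmoid(W_r @ hx)                                        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # forget and input are coupled

hidden, inputs = 4, 3                                              # illustrative sizes
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(hidden, hidden + inputs)) for _ in range(3))
h = np.zeros(hidden)
for x_t in rng.normal(size=(6, inputs)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h)
```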

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

### Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

### Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.

Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

## A depthwise separable convolutional neural network for keyword spotting on an embedded system

EURASIP Journal on Audio, Speech, and Music Processing, volume 2020, Article number: 10 (2020)

### Abstract

A keyword spotting algorithm implemented on an embedded system using a depthwise separable convolutional neural network classifier is reported. The proposed system was derived from a high-complexity system with the goal to reduce complexity and to increase efficiency. In order to meet the requirements set by hardware resource constraints, a limited hyper-parameter grid search was performed, which showed that network complexity could be drastically reduced with little effect on classification accuracy. It was furthermore found that quantization of pre-trained networks using mixed and dynamic fixed point principles could reduce the memory footprint and computational requirements without lowering classification accuracy. Data augmentation techniques were used to increase network robustness in unseen acoustic conditions by mixing training data with realistic noise recordings. Finally, the system’s ability to detect keywords in a continuous audio stream was successfully demonstrated.

### Introduction

During the last decade, deep learning algorithms have continuously improved performances in a wide range of applications, among others automatic speech recognition (ASR) [1]. Enabled by this, voice-controlled devices constitute a growing part of the market for consumer electronics. Artificial intelligence (AI) digital assistants utilize natural speech as the primary user interface and often require access to cloud computation for the demanding processing tasks. However, such cloud-based solutions are impractical for many devices and cause user concerns due to the requirement of continuous internet access and due to concerns regarding privacy when transmitting audio continuously to the cloud [2]. In contrast to these large-vocabulary ASR systems, devices with more limited functionality could be more efficiently controlled using only a few speech commands, without the need of cloud processing.

Keyword spotting (KWS) is the task of detecting keywords or phrases in an audio stream. The detection of a keyword can then trigger a specific action of the device. Wake-word detection is a specific implementation of a KWS system where only a single word or phrase is detected which can then be used to, for example, trigger a second, more complex recognition system. Early popular KWS systems have typically been based on hidden Markov models (HMMs) [3–5]. In recent years, however, neural network-based systems have dominated the area and improved the accuracies of these systems. Popular architectures include standard feedforward deep neural networks (DNNs) [6–8] and recurrent neural networks (RNNs) [9–12]. Strongly inspired by advancements in techniques used in computer vision (e.g., image classification and facial recognition), the convolutional neural network (CNN) [13] has recently gained popularity for KWS in small memory footprint applications [14]. The depthwise separable convolutional neural network (DS-CNN) [15, 16] was proposed as an efficient alternative to the standard CNN. The DS-CNN decomposes the standard 3-D convolution into 2-D convolutions followed by 1-D convolutions, which drastically reduces the number of required weights and computations. In a comparison of multiple neural network architectures for KWS on embedded platforms, the DS-CNN was found to be the best performing architecture [17].
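The saving from this decomposition is easy to check by counting weights. The sketch below compares a standard convolutional layer with its depthwise separable counterpart. The 3×3 kernel and the 64 input and output channels are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_weights(k_h, k_w, c_in, c_out):
    """Every output channel has its own k_h x k_w x c_in filter."""
    return k_h * k_w * c_in * c_out

def ds_conv_weights(k_h, k_w, c_in, c_out):
    """One k_h x k_w filter per input channel, then a 1x1 convolution across channels."""
    depthwise = k_h * k_w * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

std = standard_conv_weights(3, 3, 64, 64)
ds = ds_conv_weights(3, 3, 64, 64)
print(std, ds, round(std / ds, 2))
```

For these assumed sizes the separable layer needs 4672 weights instead of 36864, a reduction by a factor of roughly 8. The number of multiply-accumulate operations per output position scales the same way.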

For speech recognition and KWS, the most commonly used speech features are the mel-frequency cepstral coefficients (MFCCs) [17–20]. In recent years, there has, however, been a tendency to use mel-frequency spectral coefficients (MFSCs) directly with neural network-based speech recognition systems [6, 14, 21] instead of applying the discrete cosine transform (DCT) to obtain MFCCs. This is mainly because the strong correlations between adjacent time-frequency components of speech signals can be exploited efficiently by neural network architectures such as the CNN [22, 23]. An important property of MFSC features is that they attenuate the characteristics of the acoustic signal irrelevant to the spoken content, such as the intonation or accent [24].

One of the major challenges of supervised learning algorithms is the ability to generalize from training data to unseen observations [25]. Reducing the impact of speaker variability on the input features can make it easier for the network to generalize. Another way to improve the generalization is to ensure a high diversity of the training data, which can be realized by augmenting the training data. For audio data, augmentation techniques include filtering [26], time shifting and time warping [27], and adding background noise. However, the majority of KWS systems either have used artificial noises, such as white or pink noise, which are not relevant for real-life applications or have considered only a limited number of background noises [14, 17, 28].

Because of the limited complexity of KWS compared to large-vocabulary ASR, low-power embedded microprocessor systems are suitable targets for running real-time KWS without access to cloud computing [17]. Implementing neural networks on microprocessors presents two major challenges in terms of the limited resources of the platform: (1) memory capacity to store weights, activations, input/output, and the network structure itself is very limited for microprocessors; (2) computational power on microprocessors is limited. The number of computations per network inference is therefore limited by the real-time requirements of the KWS system. To meet these strict resource constraints, the size of the networks must be restricted in order to reduce the number of network parameters. Techniques like quantization can further be used to reduce the computational load and memory footprint. The training and inference of neural networks is typically done using floating-point precision for weights and layer outputs, but for implementation on mobile devices or embedded platforms, fixed point formats at low bit widths are often more efficient. Many microprocessors support single instruction, multiple data (SIMD) instructions, which perform arithmetic on multiple data points simultaneously, but typically only for 8/16 bit integers. Using low bit width representations will therefore increase the throughput and thus lower the execution time of network inference. Previous research has shown that, for image classification tasks, it is possible to quantize CNN weights and activations to 8-bit fixed point format with a minimum loss of accuracy [29, 30]. However, the impact of quantization on the performance of a DS-CNN-based KWS system has not yet been investigated.

This paper extends previous efforts [17] to implement a KWS system based on a DS-CNN by (a) identifying performance-critical elements in the system when scaling the network complexity; (b) augmenting training data with a wider variety of realistic noise recordings, using a controlled range of signal-to-noise ratios (SNRs) that are realistic for practical KWS applications during both training and testing, and testing the ability of the KWS system to generalize to unseen acoustic conditions by evaluating the system performance in both matched and mismatched background noise conditions; (c) evaluating the effect of quantizing individual network elements; and (d) evaluating the small-footprint KWS system on a continuous audio stream rather than single inferences. Specifically, the paper reports the implementation of a 10-word KWS system based on a DS-CNN classifier on a low-power embedded microprocessor (ARM Cortex M4), motivated by the system in [17]. The KWS system described in the present study is targeted at real-time applications, which can be either always on or only active when triggered by an external system, e.g., a wake-word system. To quantify the network complexity where the performance decreases relative to the system in [17], we tested a wide range of system parameters, from 2 layers with 10 filters per layer up to 9 layers with 300 filters per layer. The network was trained with keywords augmented with realistic background noises at a wide range of SNRs and the network’s ability to generalize to unseen acoustic conditions was evaluated. With the goal to reduce the memory footprint of the system, it was investigated how quantization of weights and activations affected performance by gradually lowering the bit widths using principles of mixed and dynamic fixed point representations. In this process, single-inference performance was evaluated, motivated by the smaller parameter space and the close connection between the performance in single-inference testing and continuous audio presentation. In the final step, the performance of the suggested KWS system was tested when detecting keywords in a continuous audio stream and compared to the reference system of high complexity.

### KWS system

The proposed DS-CNN-based KWS system consisted of three major building blocks as shown in Fig. 1. First, MFSC features were extracted based on short time blocks of the raw input signal stream (pre-processing stage). These MFSC features were then fed to the DS-CNN-based classifier, which generated probabilities for each of the output classes in individual time blocks. Finally, a posterior handling stage combined probabilities across time blocks to improve the confidence of the detection. Each of the three building blocks is described in detail in the following subsections.

### Feature extraction

The MFSC extraction consisted of three major steps, as shown in Fig. 2. The input signal was sampled at a rate of 16 kHz and processed by the feature extraction stage in blocks of 1000 ms. For each block, the short-time discrete Fourier transform (STFT) was computed by using a Hann window of 40-ms duration with 50 % overlap, giving a total of 49 frames. Each frame was zero-padded to a length of 1024 samples before computing a 1024-point discrete Fourier transform (DFT). Afterwards, a filterbank with 20 triangular bandpass filters with a constant Q-factor spaced equidistantly on the mel-scale between 20 and 4000 Hz [31] was applied. The mel-frequency band energies were then logarithmically compressed, producing the MFSC features, resulting in a 2-D feature matrix of size 20×49 for each inference. The number of log-mel features was derived from initial investigations on a few selected network configurations where it was found that 20 log-mel features proved most efficient in terms of performance vs the resources used.
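The steps above can be sketched in NumPy as follows. The frame count follows directly from the stated numbers: a 40-ms window is 640 samples at 16 kHz, 50 % overlap gives a hop of 320 samples, and 1 + (16000 − 640) / 320 = 49 frames. The mel formula and the plain triangular filters below are common conventions assumed for illustration. The constant-Q filterbank design of [31] is not reproduced, and the small offset inside the logarithm is an added safeguard.

```python
import numpy as np

FS = 16_000                       # sampling rate in Hz
BLOCK = FS                        # 1000 ms of audio per inference
WIN = 40 * FS // 1000             # 40-ms Hann window: 640 samples
HOP = WIN // 2                    # 50 % overlap: 320 samples
NFFT = 1024
N_MEL = 20

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)     # assumed mel convention

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mel, nfft, fs, f_lo, f_hi):
    """Triangular filters with edges spaced equidistantly on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_mel + 2))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    fb = np.zeros((n_mel, freqs.size))
    for m in range(n_mel):
        lo, mid, hi = edges[m:m + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfsc(block):
    """Log mel band energies for one block of audio, shape (N_MEL, n_frames)."""
    n_frames = 1 + (len(block) - WIN) // HOP
    window = np.hanning(WIN)
    frames = np.stack([block[i * HOP:i * HOP + WIN] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=NFFT, axis=1)) ** 2      # rfft zero-pads to NFFT
    energies = mel_filterbank(N_MEL, NFFT, FS, 20.0, 4000.0) @ power.T
    return np.log(energies + 1e-10)

features = mfsc(np.random.default_rng(0).normal(size=BLOCK))      # noise as stand-in audio
print(features.shape)
```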

### DS-CNN classifier

The classifier had one output class for each of the keywords it should detect. It furthermore had an output class for unknown speech signals and one for signals containing silence. The input to the network was a 2-dimensional feature map consisting of the extracted MFSC features. Each convolutional layer of the network then applied a number of filters, $$N_{\mathrm{filters}}$$, to detect local time-frequency patterns across input channels. The output of each network inference was a probability vector, containing the probability for each output class. The general architecture of the DS-CNN classifier is shown in Fig. 3. The first layer of the network was in all cases a standard convolutional layer. Following the convolutional layer was a batch-normalization layer with a rectified linear unit (ReLU) [32] activation function.

Batch normalization [33] was employed to accelerate training and to reduce the risk of overfitting through regularization. By equalizing the distributions of activations, higher learning rates can be used because the magnitude of the gradients of each layer is more similar, which results in faster model convergence. Because the activations of a single audio file are not normalized by the mean and variance of each audio file, but instead by the mean and variance of the mini-batch [32] in which it appears, a regularization effect is created by the random selection of audio files in the mini-batch. The batch-normalization layer was followed by a number of depthwise separable convolutions (DS-convs) [16], which each consisted of a depthwise convolution (DW-conv) and pointwise convolution (PW-conv) as illustrated in Fig. 4, both followed by a batch-normalization layer with ReLU activation. An average pooling layer then reduced the number of activations by applying an averaging window to the entire time-frequency feature map of each input channel. Finally, a fully connected (FC) layer with softmax [32] activations generated the probabilities for each output class.
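To make the DW-conv/PW-conv split concrete, here is a minimal NumPy sketch of one depthwise separable convolution. It is a simplification under stated assumptions: batch normalization is omitted, no padding or striding is used, the 3×3 kernel and channel counts are illustrative, and (as in most deep learning frameworks) the "convolution" is implemented as a cross-correlation.

```python
import numpy as np

def depthwise_conv(x, dw):
    """x: (C, H, W), dw: (C, kh, kw). Each channel is filtered by its own 2-D kernel."""
    _, H, W = x.shape
    _, kh, kw = dw.shape
    out = np.zeros((x.shape[0], H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.sum(patch * dw, axis=(1, 2))
    return out

def pointwise_conv(x, pw):
    """x: (C_in, H, W), pw: (C_out, C_in). A 1x1 convolution that mixes channels."""
    return np.einsum("oc,chw->ohw", pw, x)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 20, 49))          # 8 channels over a 20x49 feature map
dw = rng.normal(size=(8, 3, 3))           # one 3x3 kernel per input channel
pw = rng.normal(size=(16, 8))             # 16 output channels
y = relu(pointwise_conv(relu(depthwise_conv(x, dw)), pw))
print(y.shape)
```

The depthwise step never mixes channels and the pointwise step never looks at neighboring time-frequency bins, which is why the two together need far fewer weights than a full convolution.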

### Posterior handling

The classifier was run 4 times per second, resulting in a 250-ms shift and an overlap of 75 %. As the selected keywords were quite short in duration, they typically appeared in full length in multiple input blocks. In order to increase the confidence of the classification, an integration period, $$T_{\mathrm{integrate}}$$, was introduced, in which the predicted output probabilities of each output class were averaged. The system then detected a keyword if any of these averaged probabilities exceeded a predetermined detection threshold. To prevent the same word from triggering multiple detections, a refractory period, $$T_{\mathrm{refractory}}$$, was introduced. When the system detected a keyword, it was suppressed from detecting the same keyword during the refractory period. For this paper, an integration period of $$T_{\mathrm{integrate}} = 750$$ ms and a refractory period of $$T_{\mathrm{refractory}} = 1000$$ ms were used.
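The posterior handling can be sketched as a sliding average plus a per-keyword lock-out. With a 250-ms shift, the 750-ms integration period spans 3 inferences and the 1000-ms refractory period spans 4. The function name, the threshold value of 0.8, and the choice to average over fewer values while the window is still filling are assumptions made for this example.

```python
from collections import deque

SHIFT_MS = 250
N_INTEGRATE = 750 // SHIFT_MS      # 3 inferences per integration period
N_REFRACTORY = 1000 // SHIFT_MS    # 4 inferences per refractory period

def detect_keywords(prob_stream, keyword_ids, threshold):
    """Return (step, keyword_id) pairs for every detection in a stream of probability vectors."""
    window = deque(maxlen=N_INTEGRATE)
    blocked_until = {}             # keyword id -> first step at which it may fire again
    detections = []
    for step, probs in enumerate(prob_stream):
        window.append(probs)
        for k in keyword_ids:
            avg = sum(p[k] for p in window) / len(window)
            if avg > threshold and step >= blocked_until.get(k, 0):
                detections.append((step, k))
                blocked_until[k] = step + N_REFRACTORY
    return detections

# Two classes for illustration: index 0 is a keyword, index 1 is silence.
stream = [[0.1, 0.9]] * 2 + [[0.95, 0.05]] * 6 + [[0.1, 0.9]] * 2
detections = detect_keywords(stream, keyword_ids=[0], threshold=0.8)
print(detections)
```

In this stream the keyword stays active for six consecutive inferences, but it is reported only once, at step 4, because the refractory period suppresses the repeats.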

### Dataset

The Speech Commands dataset [34] was used for training and evaluation of the networks. The dataset consisted of 65000 single-speaker, single-word recordings of 30 different words. A total of 1881 speakers contributed to the dataset, ensuring a high speaker diversity. The following 10 words were used as keywords: {“Yes,” “No,” “Up,” “Down,” “Left,” “Right,” “On,” “Off,” “Go,” “Stop”}. The remaining 20 words of the dataset were used to train the category “unknown.” The dataset was split into “training,” “validation,” and “test” sets with the ratio 80:10:10, while restricting recordings of the same speaker to only appear in one of the three sets. For training, 10 % of the presented audio files were labeled silence, i.e., containing no spoken word; 10 % were unknown words; and the remaining contained keywords.

### Data augmentation

For training, validation, and testing, the speech files were mixed with 13 diverse background noises at a wide range of SNRs. The background noise signals were real-world recordings, some containing speech, obtained from two publicly available databases, the TUT database [35] and the DEMAND database [36]. The noise signals were split into two sets, reflecting matched and mismatched conditions (see Table 1). The networks were either trained on the clean speech files or trained on speech files mixed with noise signals from noise set 1 with uniformly distributed A-weighted SNRs in the range between 0 and 15 dB. To add background noise to the speech files, the filtering and noise adding tool (FaNT) [37] was used. Noise set 2 was then used to evaluate the network performance in acoustic conditions that were not included in the training. Separate recordings of each noise type were used for training and evaluation.
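Mixing at a target SNR amounts to scaling the noise before adding it. The sketch below is a simplified stand-in for that step: it uses plain (unweighted) signal power, whereas the paper uses A-weighted SNRs computed with FaNT, and random signals replace the actual recordings.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches snr_db, then add it."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = rng.normal(size=16_000)           # stand-in for a 1-s keyword recording
noise = 0.3 * rng.normal(size=16_000)      # stand-in for a background noise recording
snr_db = rng.uniform(0.0, 15.0)            # uniformly distributed between 0 and 15 dB
noisy, gain = mix_at_snr(speech, noise, snr_db)

achieved = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((gain * noise) ** 2))
print(round(snr_db, 2), round(achieved, 2))
```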

### Resource estimation

To compare the resources used by different network configurations, the following definitions were used to estimate the number of operations, memory, and execution time.

#### Operations

The number of operations is reported per inference of the network and is defined as the total number of multiplications and additions in the convolutional layers of the DS-CNN.
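A minimal sketch of how such an operation count could be computed for a single depthwise separable layer is given below. It assumes "same" padding and counts each multiply-and-accumulate as one multiplication plus one addition; bias additions and batch-norm are ignored. The function names are hypothetical, and the standard convolution is included only for comparison.

```python
def out_size(size, stride):
    """Output length along one axis, assuming 'same' padding."""
    return -(-size // stride)  # ceiling division


def ds_conv_ops(h_in, w_in, c_in, c_out, kernel=(3, 3), stride=(1, 1)):
    """Multiplications + additions for one depthwise separable conv layer."""
    h_out, w_out = out_size(h_in, stride[0]), out_size(w_in, stride[1])
    # Depthwise step: one kernel per input channel.
    dw_macs = h_out * w_out * c_in * kernel[0] * kernel[1]
    # Pointwise step: a 1x1 convolution mixing c_in channels into c_out channels.
    pw_macs = h_out * w_out * c_in * c_out
    return 2 * (dw_macs + pw_macs)  # 1 MAC counted as 1 multiply + 1 add


def standard_conv_ops(h_in, w_in, c_in, c_out, kernel=(3, 3), stride=(1, 1)):
    """Multiplications + additions for a regular convolution, for comparison."""
    h_out, w_out = out_size(h_in, stride[0]), out_size(w_in, stride[1])
    return 2 * h_out * w_out * c_in * c_out * kernel[0] * kernel[1]
```

For a 10×10 input with 4 input channels and 8 output channels, the separable layer needs 13,600 operations against 57,600 for the regular convolution under these assumptions.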

#### Memory

The memory reported is the total memory required to store the network weights/biases and layer activations, assuming 8-bit variables. As the activations of one layer are only used as input for the next layer, the memory for the activations can be reused. The total memory allocated for activations is then equal to the maximum of the required memory for inputs and outputs of a single layer.
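One way to read this definition is sketched below: the weights of all layers are stored permanently, while a single activation buffer is sized for the layer whose input and output together are largest. That reading (input plus output of the same layer held at once) is an assumption, as are the function name and the dictionary keys.

```python
def network_memory_bytes(layers, bits=8):
    """Estimate memory for weights/biases plus a reusable activation buffer.

    layers: list of dicts with element counts under the (hypothetical) keys
            'params', 'in_elems' and 'out_elems'.
    """
    weight_elems = sum(layer["params"] for layer in layers)
    # Activations are reused between layers, so only the worst layer matters.
    act_elems = max(layer["in_elems"] + layer["out_elems"] for layer in layers)
    return (weight_elems + act_elems) * bits // 8
```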

#### Execution time

The execution times reported in this paper are estimations based on measured execution times of multiple different-sized networks. The actual network inference execution time of implemented DS-CNNs on the Cortex M4 was measured using the Cortex M4’s on-chip timers, with the processor running at a clock frequency of 180 MHz. In this study, only two hyper-parameters were altered: the number of DS-conv layers, Nlayers, and the number of filters applied per layer, Nfilters. The number of layers was varied between 2 and 9, and the number of filters per layer was varied between 10 and 300. Convolutional layers after layer 7 had the same parameters as seen in the last layers in Table 2 in terms of filter size and strides.

### Quantization methods

The fixed point format represents floating-point numbers as N-bit 2’s complement signed integers, where the BI leftmost bits (including the sign-bit) represent the integer part, and the remaining BF rightmost bits represent the fractional part. The following two main concepts were applied when quantizing a pre-trained neural network effectively [29].
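The format can be illustrated with a small helper that converts a floating-point value to an N-bit two's complement integer with BI integer bits (sign bit included) and BF fractional bits, saturating at the limits of the range. The function name is hypothetical, and round-to-nearest is an assumption; the rounding mode used in the paper is not stated here.

```python
def quantize_fixed_point(x, bi, bf):
    """Quantize x to signed fixed point; return (integer code, represented value).

    bi: integer bits including the sign bit, bf: fractional bits.
    """
    scale = 1 << bf                      # 2**bf quantization steps per unit
    n_bits = bi + bf
    q_min = -(1 << (n_bits - 1))         # most negative two's complement code
    q_max = (1 << (n_bits - 1)) - 1      # most positive two's complement code
    q = max(q_min, min(q_max, round(x * scale)))  # round, then saturate
    return q, q / scale
```

With BI=1 and BF=7, the representable range is [−1, 1 − 2^−7], so a value of 1.5 saturates; moving one bit from the fractional to the integer part (BI=2, BF=6) represents it exactly at the cost of a coarser step size.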

#### Mixed fixed point precision

The fully connected and convolutional layers of a DS-CNN consist of a long series of multiply-and-accumulate (MAC) operations, where network weights multiplied with layer activations are accumulated to give the output. Using different bit widths for different parts of the network, i.e., mixed precision, has been shown to be an effective approach when quantizing CNNs [38], as the precision required to avoid performance degradation may vary in different parts of the network.

#### Dynamic fixed point

The weights and activations of different CNN layers will have different dynamic ranges. The fixed point format requires that the range of the values to represent is known beforehand, as this determines BI and BF. To ensure a high utilization of the fixed point range, dynamic fixed point [39] can be used, which assigns the weights/activations into groups of constant BI.

For faster inference, the batch-norm operations were fused into the weights of the preceding convolutional layer and quantized after this fusion. BI and BF were determined by splitting the network variables for each layer into groups of weights, biases, and activations, and estimating the dynamic range of each group. The dynamic ranges of groups with weights and biases were fixed after training, while the ranges of activations were estimated by running inference on a large number of representative audio files from the dataset and generating statistical parameters for the activations of each layer. BI and BF were then chosen such that saturation is avoided. The optimal bit widths were determined by dividing the variables in the network into separate categories based on the operation, while the layer activations were kept as one category. The effects on performance were then examined when reducing the bit width of a single category while keeping the rest of the network at floating-point precision. The precision of the weights and activations in the network was varied in experiment 3 between 32-bit floating-point precision and low bit width fixed point formats ranging from 8 to 2 bit.
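A simple way to choose BI and BF for one group of variables, once its dynamic range has been estimated, is sketched below. It takes the smallest BI whose positive limit covers the largest observed magnitude and gives the remaining bits to the fractional part. The function name is hypothetical, and using the plain maximum rather than a statistical estimate of the range is a simplifying assumption.

```python
def choose_format(values, n_bits=8):
    """Return (bi, bf) for a group so that its largest magnitude does not saturate."""
    max_abs = max(abs(v) for v in values)
    bi = 1  # sign bit only, i.e. a range of roughly [-1, 1)
    # The largest positive value for a given bi is 2**(bi-1) - 2**(bi-n_bits).
    while max_abs > 2.0 ** (bi - 1) - 2.0 ** (bi - n_bits):
        bi += 1
    return bi, n_bits - bi
```

A group with values inside ±0.5 keeps 7 fractional bits, whereas a group reaching 3.2 needs BI=3 and is left with BF=5, which is why assigning a separate format to each group makes better use of the available bits.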

### Training

All networks were trained with Google’s TensorFlow machine learning framework [40] using an Adam optimizer to minimize the cross-entropy loss. The networks were trained in 30,000 iterations with a batch size of 100. Similar to [17], an initial learning rate of 0.0005 was used; after 10,000 iterations, it was reduced to 0.0001; and for the remaining 10,000 iterations, it was reduced to 0.00002. During training, audio files were randomly shifted in time up to 100 ms to reduce the risk of overfitting.
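The learning rate schedule described above amounts to a piecewise-constant function of the iteration count. The sketch below is framework-independent; the function name is hypothetical, and the second boundary at 20,000 iterations is inferred from the total of 30,000 iterations and the statement that the last rate applied for the remaining 10,000.

```python
def learning_rate(iteration):
    """Piecewise-constant learning rate for a 30,000-iteration training run."""
    if iteration < 10_000:
        return 5e-4   # initial rate
    if iteration < 20_000:
        return 1e-4   # after the first 10,000 iterations
    return 2e-5       # final 10,000 iterations
```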

### Evaluation

To evaluate the DS-CNN classifier performance in the presence of background noise, test sets with different SNRs between −5 and 30 dB were used. Separate test sets were created for noise signals from noise set 1 and noise set 2. The system was tested by presenting single inferences (single-inference testing) to evaluate the performance of the network in isolation. In addition, the system was tested by presenting a continuous audio stream (continuous-stream testing) to approximate a more realistic application environment.

#### Single-inference testing

For single-inference testing, the system was tested without the posterior handling stage. For each inference, the maximum output probability was selected as the detected output class and compared to the label of the input signal. When testing, 10 % of the samples were silence, 10 % were unknown words, and the remainder contained keywords. Each test set consisted of 3081 audio files, and the reported test accuracy is the ratio of correctly labeled audio files to the total number of audio files in the test. To compare different network configurations, the accuracy was averaged across the range 0−20 dB SNR, as this reflects SNRs in realistic conditions [41].

#### Continuous audio stream testing

Test signals with a duration of 1000 s were created for each SNR and noise set, with words from the dataset appearing approximately every 3 s. Seventy percent of the words in the test signal were keywords. The test signals were constructed with a high ratio of keywords to reflect the use case in which the KWS system is not run in an always-on state but instead triggered externally by, e.g., a wake-word detector. A hit was counted if the system detected the keyword within 750 ms after occurrence, and the hit rate (also called true positive rate (TPR)) then corresponds to the number of hits relative to the total number of keywords in the test signal. The false alarm rate (also called false positive rate (FPR)) reported is the total number of incorrect keyword detections relative to the duration of the test signal, here reported as false alarms per hour.
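The two metrics can be computed from a list of keyword occurrences and a list of detections, as in the sketch below. Several details are assumptions made for illustration: occurrence times are treated as the reference point of the 750-ms window, each detection can be matched to at most one keyword, and every unmatched detection counts as a false alarm. The function name is hypothetical.

```python
def score_stream(keywords, detections, duration_s, window_ms=750):
    """Return (hit_rate, false_alarms_per_hour) for one test signal.

    keywords, detections: lists of (time_ms, label) tuples.
    """
    matched = set()  # indices of detections already credited as hits
    hits = 0
    for t_kw, label in keywords:
        for i, (t_det, det_label) in enumerate(detections):
            in_window = 0 <= t_det - t_kw <= window_ms
            if i not in matched and det_label == label and in_window:
                matched.add(i)
                hits += 1
                break
    false_alarms = len(detections) - len(matched)
    hit_rate = hits / len(keywords)
    return hit_rate, false_alarms * 3600 / duration_s
```

For a 1000-s test signal, two unmatched detections correspond to 7.2 false alarms per hour.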

### Network test configuration

Unless stated otherwise, the parameters summarized in Table 2 are used. The network had 7 convolutional layers with 76 filters for each layer.

As a baseline for comparison, a high-complexity network was introduced. The baseline network had 8 convolutional layers with 300 filters for each layer with hyper-parameters as summarized in Table 3. The baseline network was trained using the noise-augmented dataset and evaluated using floating-point precision weights and activations.

### Platform description

Table 4 shows the key specifications for the FRDM K66F development platform used for verification of the designed systems. The deployed network used 8-bit weights and activations, but performed feature extraction using 32-bit floating-point precision. The network was implemented using the CMSIS-NN library [42] which features neural network operations optimized for Cortex-M processors.

### Results

In experiment 1, the effect of training the network on noise-augmented speech on single-inference accuracy was investigated. The influence of network complexity was assessed in experiment 2 by systematically varying the number of convolutional layers and the number of filters per layer. Experiment 3 investigated the effects on performance when quantizing network weights and activations for fixed point implementation. Finally, the best performing network was tested on a continuous audio stream and the impact of the detection threshold on hit rate and false positive rate was evaluated.

### Experiment 1: Data augmentation

Figure 5 shows the single-inference accuracies when using noise-augmented speech files for training. For SNRs below 20 dB, the network trained on noisy data had a higher test accuracy than the network trained on clean data, while the accuracy was slightly lower for SNRs higher than 20 dB. In the range between −5 and 5 dB SNR, the average accuracy for the network trained on noisy data was increased by 11.1 % and 8.6 % for the matched and mismatched noise sets, respectively, relative to the training on clean data. Under clean test conditions, it was found that the classification accuracy of the network trained on noisy data was 4 % lower than that of the network trained on clean data. For both networks, there was a difference between the accuracy on the matched and mismatched test. The average difference in accuracy in the range from −5 to 20 dB SNR was 3.3 % and 4.4 % for the network trained on clean and noisy data, respectively. The high-complexity baseline network performed on average 3 % better than the test network trained on noisy data.

### Experiment 2: Network complexity

Figure 6 shows a selection of the most feasible networks for different numbers of layers and numbers of filters per layer. For each trained network, the table specifies single-inference average accuracy in the range 0−20 dB SNR for both test sets (accuracy in parentheses for the mismatched test). Moreover, the number of operations per inference, the memory required by the model for weights/activations, and the estimated execution time per inference on the Cortex M4 are specified. For networks with more than 5 layers, no significant improvement (< 1 %) was obtained when increasing the number of filters beyond 125. Networks with fewer than 5 layers gained larger improvements from using more than 125 filters, though none of those networks reached the accuracies obtained with networks with more layers.

Figure 7 shows the accuracies of all the layer/filter combinations of the hyper-parameter search as a function of the operations. For the complex network structures, the deviation of the accuracies was very small, while for networks using few operations, there was a large difference in accuracy depending on the specific combination of layers and filters. For networks ranging between 5 and 200 million operations, the difference in classification accuracy between the best performing models was less than 2.5 %. Depending on the configuration of the network, it is therefore possible to drastically reduce the number of operations while maintaining a high classification accuracy.

In Fig. 8, a selection of the best performing networks is shown as a function of required memory and operations per inference. The label for each network specifies the parameter configuration [Nlayers,Nfilters] and the average accuracy in 0−20 dB SNR for noise set 1 and 2. The figure illustrates the achievable performance given the platform resources and shows that high accuracy was reached with relatively simple networks. From this investigation, it was found that the best performing network fitting the resource requirements of the platform consisted of 7 DS-CNN layers with 76 filters per layer, as described in Section 3.7.

### Experiment 3: Quantization

Table 5 shows the single-inference test results of the quantized networks, where each part of the network specified in Section 3.7 was quantized separately, while the remainder of the network was kept at floating-point precision.

All of the weights and activations could be quantized to 8-bit using dynamic fixed point representation with no loss of classification accuracy, and the bit widths of the weights could be further reduced to 4 bits with only small reductions in accuracy. In contrast, reducing the bit width of activations to less than 8 bits significantly reduced classification accuracy. While the classification accuracy was substantially reduced when using only 2 bits for regular convolution parameters and FC parameters, the performance completely broke down when quantizing pointwise and depthwise convolution parameters and layer activations with 2 bits. The average test accuracy in the range of 0−20 dB SNR of the network with all weights and activations quantized to 8 bits was 83.2 % for test set 1 (matched) and 79.2 % for test set 2 (mismatched), which was the same performance as using floating-point precision.

Using 8-bit fixed point numbers instead of 32-bit floating point reduced the required memory by a factor of 4, from 366 to 92 KB, with 48 KB reserved for activations and 44 KB for storing weights. Utilizing mixed fixed point precision and quantizing activations to 8 bits and weights to 4 bits would reduce the required memory to 70 KB.
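The memory figures above follow from scaling the 8-bit footprint of each part by its bit width, which the short sketch below reproduces. The helper name is hypothetical. Note that scaling 92 KB back to 32 bits gives 368 KB, slightly above the reported 366 KB; the difference presumably comes from rounding of the 48 KB and 44 KB figures.

```python
def memory_kb(weight_kb_8bit, activation_kb_8bit, weight_bits=8, activation_bits=8):
    """Scale 8-bit memory figures (in KB) to other bit widths."""
    weights = weight_kb_8bit * weight_bits / 8
    activations = activation_kb_8bit * activation_bits / 8
    return weights + activations


print(memory_kb(44, 48))                 # 8-bit weights and activations
print(memory_kb(44, 48, weight_bits=4))  # 4-bit weights, 8-bit activations
```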

### Experiment 4: Continuous audio stream

Figure 9 shows the hit rate and false positive rate obtained by the KWS system on the continuous audio signals. The system was tested using different detection thresholds, which affected the system’s inclination towards detecting a keyword. It was found that the difference in hit rates was constant as a function of SNR when the detection threshold was altered, while the difference in false positive rates increased towards low SNRs. For both test sets, the hit rate and false positive rate saturated at SNRs higher than 15 dB. Figure 10 shows the corresponding DET curve obtained for the test network and baseline network.

### FRDM K66F implementation

Table 6 shows the distribution of execution time over network layers for a single inference for the implementation on the FRDM K66F development board. The total execution time of the network inference was 227.4 ms, which leaves sufficient time for feature extraction and audio input handling, assuming 4 inferences per second.

### Discussion

Experiment 1 showed that adding noise to the training material increased the classifier robustness in low SNR conditions. The increase in accuracy, compared to the same network trained on clean speech files, was most significant for the matched noise test, where the test data featured the same noise types as the training material. For the mismatched test, the increase in accuracy was slightly smaller. A larger difference in performance between clean and noisy training was expected, but as explained in [43], the dataset used was inherently noisy and featured invalid audio files, which could diminish the effect of adding more noise. For both test sets, the network trained on clean data performed better under high SNRs, i.e., SNR>20 dB. From the perspective of this paper, however, the performance in high SNRs was of less interest as the focus was on real-world application. If the performance in clean conditions is also of concern, [28] demonstrated that the performance decrease in clean conditions could be reduced by including clean audio files in the noisy training. As was also found in [28], it was observed that the noisy training enabled the network to adapt to the noise signals and improve the generalization ability, by forcing it to detect patterns more unique to the keywords. Even though the two noise sets consisted of different noise environment recordings, many of the basic noise types, such as speech, motor noise, or running water, were present in both noise sets. This would explain why, even though the network was only trained on data mixed with noise set 1 (matched), it also performed better on test set 2 (mismatched) than the network trained on clean data.

The main result of experiment 2 was that the classification accuracy as a function of network complexity reached a saturation point. Increasing the number of layers or the number of filters per layer beyond this point only resulted in negligible accuracy gains, <2 %. This was explicitly shown in Fig. 5 for the single-inference classification accuracy and Fig. 10 for continuous audio streaming, where the high-complexity baseline network was directly compared with the smaller network chosen for the implementation. It was also found that, given a fixed computational and memory constraint, higher accuracies were achieved by networks with many layers and few filters than by networks with few layers and many filters. In a convolutional layer, the number of filters determines how many different patterns can be detected in the input features. The first layer detects characteristic patterns of the input speech features, and each subsequent convolutional layer will detect patterns in the patterns detected by the previous layer, adding another level of abstraction. One interpretation of the grid-search results could therefore be that, if the network has sufficient levels of abstraction (layers), then the number of distinct patterns needed at each abstraction level to characterize the spoken content (number of filters) can be quite low. As the classifier should run 4 times per second, feasible network configurations were limited to inference execution times below 250 ms, which ruled out the majority of the configurations tested. In terms of the resource constraints set by the platform, execution time was the limiting factor for these networks and not the memory required for weights and activations. This was not unexpected as the DS-CNN, contrary to network architectures such as the DNN, reuses the same weights (filters) for computing multiple neurons. The DS-CNN therefore needs fewer weights relative to the number of computations it must perform, making this approach especially suitable for platforms with very limited memory capacity.

The results from experiment 3 showed that weights and activations of a network trained using floating-point precision could be quantized to low bit widths without affecting the classification accuracy. Quantizing all numbers in the network to 8 bits resulted in the same classification accuracy as using floating-point precision. It was also found that the weights of the network could all be quantized to 4 bits with no substantial loss of accuracy, which can significantly reduce the memory footprint and possibly reduce the processing time spent on fetching data from memory. These results showed that mixed fixed point precision leads to the most memory-efficient network, because the different network components (weights and activations) are robust to different reductions in bit width. For many deep CNN classifiers [29, 30, 38], it was reported that networks are very robust to the reduced resolution caused by quantization. The reason for this robustness could be that the networks are designed and trained to ignore the background noise and the deviations of the speech samples. The quantization errors are then simply another noise source for the network, which it can handle up to a certain magnitude. Gysel et al. [29] found that small accuracy decreases of CNN classifiers, caused by fixed point quantization of weights and activations, could be compensated for by partially retraining the networks using these fixed point weights and activations. A natural next step for the KWS system proposed in this paper would therefore also be to fine-tune the quantized networks. Because the network variables were categorized, the quantization effects on the overall performance could be evaluated individually for each category. Results showed that different bit widths were required for the different categories, in order to maintain the classification accuracy achieved using floating-point numbers.
It is, however, suspected that, because some of the categories span multiple network layers, a bottleneck effect could occur. For example, if the activations of a single layer require high precision, i.e., large bit width, but the other layers’ activations require fewer bits, this would be masked in the experiment because they were all in the same category. It is therefore expected that using different bit widths for each of the layers in each of the categories would potentially result in a lower memory footprint. In this paper, the fixed point representations had symmetric, zero-centered ranges. However, all of the convolutional layers use ReLU activation functions, so the activations effectively only utilize half of the available range as values below zero are cut off. By shifting the range, such that zero becomes the minimum value, the total range can be halved, i.e., BI is decreased by one. This in turn frees a bit, which could be used to increase BF by one, thereby increasing the resolution, or it could be used to reduce the total bit width by one.

Experiment 4 tested the KWS system performance on a continuous audio stream. As found in most signal detection tasks, lowering the decision criterion, i.e., the detection threshold, increases the hit rate but also the FPR, which means there is a trade-off. The detection threshold should match the intended application of the system. For always-on systems, it is crucial to keep the number of false alarms as low as possible, while for externally activated systems where the KWS is only active for a short time window in which a keyword is expected, a higher hit rate is more desirable. One method for lowering the FPR and increasing the true negative rate could be to increase the ratio of negative to positive samples in the training, i.e., use more “unknown” and “silence” samples. This has been shown as an effective method in other machine learning detection tasks [44, 45]. Another approach for lowering the FPR could be to create a loss function for the optimizer during training, which penalizes errors that cause false alarms more than errors that cause misses. There were significant discrepancies between the estimated number of operations and actual execution time of the different layers of the implemented network (see Table 6). The convolutional functions in the software library used for the implementation [42] all use highly optimized matrix multiplications (i.e., general matrix-matrix multiplication, GEMM) to compute the convolution. However, in order to compute 2D convolutions using matrix multiplications, it is necessary to first rearrange the input data and weights during run time. It was argued that, despite this time-consuming and memory-expanding data reordering, using matrix multiplications is still the most efficient implementation of convolutional layers [46, 47].
The discrepancies between operations and execution time could be explained by the fact that the reordering of data was not accounted for in the operation parameter and that the different layers required different degrees of reordering. For the pointwise convolutions and fully connected layer, the activations were stored in memory in an order such that no reordering was required to do the matrix multiplication, whereas this was not possible for the standard convolution or depthwise convolution. The number of arithmetic operations for running network inference should therefore not solely be used to assess the feasibility of implementing neural networks on embedded processors, as done in [17], as this parameter does not directly reflect the execution time. Instead, this estimate should also include the additional work for data reordering required by some network layers. Based on the results presented in this paper, there are several possible actions to take to improve performance or optimize implementation of the proposed KWS system. Increasing the size of the dataset and removing corrupt recordings, or augmenting training data with more varied background noises, such as music, could increase network accuracy and generalization. Reducing the number of weights of a trained network using techniques such as pruning [48] could be used to further reduce memory footprint and execution time.
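The data reordering discussed above is commonly known as im2col. The single-channel sketch below illustrates the idea in plain Python; it is not the CMSIS-NN implementation, the function names are hypothetical, and it assumes stride 1 without padding. Each kernel-sized patch of the input is copied into one row, after which the convolution reduces to a matrix-vector product.

```python
def im2col(image, kh, kw):
    """Copy every kh x kw patch of a 2D list into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    return [[image[r + i][c + j] for i in range(kh) for j in range(kw)]
            for r in range(h - kh + 1) for c in range(w - kw + 1)]


def conv2d_gemm(image, kernel):
    """2D convolution (cross-correlation, as used in CNNs) via im2col."""
    kh, kw = len(kernel), len(kernel[0])
    cols = im2col(image, kh, kw)                    # run-time reordering step
    flat_kernel = [v for row in kernel for v in row]
    out = [sum(a * b for a, b in zip(patch, flat_kernel)) for patch in cols]
    w_out = len(image[0]) - kw + 1
    return [out[i:i + w_out] for i in range(0, len(out), w_out)]
```

For a 3×3 input and a 2×2 kernel, the reordered matrix holds 16 values copied from 9 inputs, which is the memory expansion mentioned above. With a 1×1 kernel, as in a pointwise convolution, each row holds exactly one input value in its original order, so no duplication is needed.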

Python training scripts and FRDM K66F deployment source code, as well as a quantitative comparison of performance when using MFCC vs. MFSC features for a subset of networks, are available on GitHub [49].

### Conclusion

In this paper, methods for training and implementing a DS-CNN-based KWS system for low-resource embedded platforms were presented and evaluated. Experimental results showed that augmenting training data with realistic noise recordings increased the classification accuracy in both matched and mismatched noise conditions. By performing a limited hyper-parameter grid search, it was found that network accuracy saturated when increasing the number of layers and filters in the DS-CNN and that feasible networks for implementation on the ARM Cortex M4 processor were in this saturated region. It was also shown that using dynamic fixed point representations allowed network weights and activations to be quantized to 8-bit precision with no loss in accuracy. By quantizing different network components individually, it was found that layer activations were most sensitive to further quantization, while weights could be quantized to 4 bits with only small decreases in accuracy. The ability of the KWS system to detect keywords in a continuous audio stream was tested, and it was seen how altering the detection threshold affected the hit rate and false alarm rate. Finally, the system was verified by the implementation on the Cortex M4, where it was found that the number of arithmetic operations per inference is not directly related to execution time. Ultimately, this paper shows that the number of layers and the number of filters per layer provide useful parameters when scaling system complexity. In addition, it was shown that 8-bit quantization provides a significant reduction in memory footprint and processing time and does not result in a loss of accuracy.

### Availability of data and materials

The data that support the findings of this study are available from [34].

### Abbreviations

Artificial intelligence

Automatic speech recognition

Convolutional neural network

Discrete cosine transform

Discrete Fourier transform

Deep neural network

Depthwise separable convolutional neural network

Depthwise separable convolution

Depthwise convolution

Fully connected

False positive rate

Hidden Markov model

Keyword spotting

Multiply and accumulate

Mel-frequency cepstral coefficients

Mel-frequency spectral coefficients

Pointwise convolution

Rectified linear unit

Recurrent neural network

Single instruction, multiple data

Signal-to-noise ratio

Short-time discrete Fourier transform

True positive rate

### References

1. 1

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag.29(6), 82–97 (2012). https://doi.org/10.1109/MSP.2012.2205597.

2. 2

New Electronic Friends. https://pages.arm.com/machine-learning-voice-recognition-report.html. Accessed 30 May 2018.

3. 3

R. C. Rose, D. B. Paul, in International Conference on Acoustics, Speech, and Signal Processing. A hidden Markov model based keyword recognition system, (1990), pp. 129–1321. https://doi.org/10.1109/ICASSP.1990.115555.

4. 4

J. R. Rohlicek, W. Russell, S. Roukos, H. Gish, in International Conference on Acoustics, Speech, and Signal Processing,. Continuous hidden Markov modeling for speaker-independent word spotting, (1989), pp. 627–6301. https://doi.org/10.1109/ICASSP.1989.266505.

5. 5

J. G. Wilpon, L. G. Miller, P. Modi, in [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing. Improvements and applications for key word recognition using hidden Markov modeling techniques, (1991), pp. 309–312. https://doi.org/10.1109/ICASSP.1991.150338. http://ieeexplore.ieee.org/document/150338/.

6. 6

G. Chen, C. Parada, G. Heigold. Small-footprint keyword spotting using deep neural networks, (2014). https://doi.org/10.1109/icassp.2014.6854370.

7. 7

K. Shen, M. Cai, W. -Q. Zhang, Y. Tian, J. Liu, Investigation of DNN-based keyword spotting in low resource environments. Int. J. Future Comput. Commun.5(2), 125–129 (2016). https://doi.org/10.18178/ijfcc.2016.5.2.458.

8. 8

G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, S. Vitaladevuni. Model compression applied to small-footprint keyword spotting, (2016), pp. 1878–1882. https://doi.org/10.21437/Interspeech.2016-1393.

9. 9

S. Fernández, A. Graves, J. Schmidhuber, in Artificial Neural Networks – ICANN 2007, ed. by J. M. de Sá, L. A. Alexandre, W. Duch, and D. Mandic. An application of recurrent neural networks to discriminative keyword spotting (SpringerBerlin, Heidelberg, 2007), pp. 220–229.

10. 10

K. P. Li, J. A. Naylor, M. L. Rossen, in [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. A whole word recurrent neural network for keyword spotting, (1992), pp. 81–842. https://doi.org/10.1109/ICASSP.1992.226115.

11. 11

M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, S. Vitaladevuni, Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. CoRR. abs/1705.02411: (2017). http://arxiv.org/abs/1705.02411.

12. 12

S. Ö,. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, A. Coates, Convolutional recurrent neural networks for small-footprint keyword spotting. CoRR. abs/1703.05390: (2017). http://arxiv.org/abs/1703.05390.

13. 13

Y. LeCun, Y. Bengio, in Chap. Convolutional Networks for Images, Speech, and Time Series. The Handbook of Brain Theory and Neural Networks (Press, MITCambridge, MA, USA, 1998), pp. 255–258. http://dl.acm.org/citation.cfm?id=303568.303704.

14. 14

T. N. Sainath, C. Parada, in INTERSPEECH. Convolutional neural networks for small-footprint keyword spotting, (2015).

15. 15

F. Chollet, Xception: deep learning with depthwise separable convolutions. CoRR. abs/1610.02357: (2016). http://arxiv.org/abs/1610.02357.

16. 16

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: efficient convolutional neural networks for mobile vision applications. CoRR. abs/1704.04861: (2017). http://arxiv.org/abs/1704.04861.

17. 17

Y. Zhang, N. Suda, L. Lai, V. Chandra, Hello edge: keyword spotting on microcontrollers. CoRR. abs/1711.07128: (2017). http://arxiv.org/abs/1711.07128.

18. 18

S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process.28(4), 357–366 (1980). https://doi.org/10.1109/TASSP.1980.1163420.

19. 19

I. Chadawan, S. Siwat, Y. Thaweesak, in International Conference on Computer Graphics, Simulation and Modeling (ICGSM’2012). Speech recognition using MFCC (Pattaya (Thailand), 2012).

20. 20

Bhadragiri Jagan Mohan, Ramesh Babu N., in 2014 International Conference on Advances in Electrical Engineering (ICAEE). Speech recognition using MFCC and DTW, (2014), pp. 1–4. https://doi.org/10.1109/ICAEE.2014.6838564.

21. 21

O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process.22(10), 1533–1545 (2014). https://doi.org/10.1109/TASLP.2014.2339736.

22. 22

A. -R. Mohamed, Deep Neural Network acoustic models for ASR. PhD thesis (University of Toronto, 2014). https://tspace.library.utoronto.ca/bitstream/1807/44123/1/Mohamed_Abdel-rahman_201406_PhD_thesis.pdf.

23. 23

S. Watanabe, M. Delcroix, F. Metze, J. R. Hershey, in Springer International Publishing. New era for robust speech recognition, (2017), p. 205. https://doi.org/10.1007/978-3-319-64680-0.

24. 24

J. W. Picone, Signal modeling techniques in speech recognition. Proc. IEEE. 81:, 1215–1247 (1993). https://doi.org/10.1109/5.237532.

25. 25

X. Xiao, J. Li, E. S. Chng, H. Li, C. -H. Lee, A study on the generalization capability of acoustic models for robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 18(6), 1158–1169 (2010). https://doi.org/10.1109/TASL.2009.2031236.

26. 26

I. Rebai, Y. BenAyed, W. Mahdi, J. -P. Lorré, Improving speech recognition using data augmentation and acoustic model fusion. Procedia Comput. Sci. 112, 316–322 (2017). https://doi.org/10.1016/J.PROCS.2017.08.003.

27. 27

T. Ko, V. Peddinti, D. Povey, S. Khudanpur, in INTERSPEECH. Audio augmentation for speech recognition, (2015).

28. 28

S. Yin, C. Liu, Z. Zhang, Y. Lin, D. Wang, J. Tejedor, T. F. Zheng, Y. Li, Noisy training for deep neural networks in speech recognition. EURASIP J. Audio Speech Music Process. 2015(1), 2 (2015). https://doi.org/10.1186/s13636-014-0047-0.

29. 29

P. Gysel, M. Motamedi, S. Ghiasi, Hardware-oriented approximation of convolutional neural networks. CoRR. abs/1604.03168: (2016). http://arxiv.org/abs/1604.03168.

30. 30

D. D. Lin, S. S. Talathi, V. S. Annapureddy, Fixed point quantization of deep convolutional networks. CoRR. abs/1511.06393: (2015). http://arxiv.org/abs/1511.06393.

31. 31

D. O’Shaughnessy, Speech Communication: Human and Machine, (1987).

32. 32

M. A. Nielsen, Neural Networks and Deep Learning, (2015). http://neuralnetworksanddeeplearning.com/. Accessed 26 May 2020.

33. 33

S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR. abs/1502.03167: (2015). http://arxiv.org/abs/1502.03167.

34. 34

P. Warden, Speech commands: a public dataset for single-word speech recognition (2017). Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.

35. 35
Source: https://asmp-eurasipjournals.springeropen.com/articles/10.1186/s13636-020-00176-2


## A depthwise separable convolutional neural network for keyword spotting on an embedded system

EURASIP Journal on Audio, Speech, and Music Processing volume 2020, Article number: 10 (2020)

### Abstract

A keyword spotting algorithm implemented on an embedded system using a depthwise separable convolutional neural network classifier is reported. The proposed system was derived from a high-complexity system with the goal to reduce complexity and to increase efficiency. In order to meet the requirements set by hardware resource constraints, a limited hyper-parameter grid search was performed, which showed that network complexity could be drastically reduced with little effect on classification accuracy. It was furthermore found that quantization of pre-trained networks using mixed and dynamic fixed point principles could reduce the memory footprint and computational requirements without lowering classification accuracy. Data augmentation techniques were used to increase network robustness in unseen acoustic conditions by mixing training data with realistic noise recordings. Finally, the system’s ability to detect keywords in a continuous audio stream was successfully demonstrated.

### Introduction

During the last decade, deep learning algorithms have continuously improved performances in a wide range of applications, among others automatic speech recognition (ASR) [1]. Enabled by this, voice-controlled devices constitute a growing part of the market for consumer electronics. Artificial intelligence (AI) digital assistants utilize natural speech as the primary user interface and often require access to cloud computation for the demanding processing tasks. However, such cloud-based solutions are impractical for many devices and cause user concerns due to the requirement of continuous internet access and due to concerns regarding privacy when transmitting audio continuously to the cloud [2]. In contrast to these large-vocabulary ASR systems, devices with more limited functionality could be more efficiently controlled using only a few speech commands, without the need of cloud processing.

Keyword spotting (KWS) is the task of detecting keywords or phrases in an audio stream. The detection of a keyword can then trigger a specific action of the device. Wake-word detection is a specific implementation of a KWS system where only a single word or phrase is detected, which can then be used to, for example, trigger a second, more complex recognition system. Early popular KWS systems have typically been based on hidden Markov models (HMMs) [3–5]. In recent years, however, neural network-based systems have dominated the area and improved the accuracies of these systems. Popular architectures include standard feedforward deep neural networks (DNNs) [6–8] and recurrent neural networks (RNNs) [9–12]. Strongly inspired by advancements in techniques used in computer vision (e.g., image classification and facial recognition), the convolutional neural network (CNN) [13] has recently gained popularity for KWS in small memory footprint applications [14]. The depthwise separable convolutional neural network (DS-CNN) [15, 16] was proposed as an efficient alternative to the standard CNN. The DS-CNN decomposes the standard 3-D convolution into 2-D convolutions followed by 1-D convolutions, which drastically reduces the number of required weights and computations. In a comparison of multiple neural network architectures for KWS on embedded platforms, the DS-CNN was found to be the best performing architecture [17].
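The reduction in weights from the depthwise separable decomposition can be illustrated with a short count. This is only a sketch: the 3×3 kernel size is an assumption chosen for illustration, the 76 channels match the filter count used later in the paper, and biases are ignored.

```python
def standard_conv_weights(k_h, k_w, c_in, c_out):
    """Weights of a standard convolution: one k_h x k_w x c_in kernel per output channel."""
    return k_h * k_w * c_in * c_out


def ds_conv_weights(k_h, k_w, c_in, c_out):
    """Weights of a depthwise separable convolution (biases ignored)."""
    depthwise = k_h * k_w * c_in   # one 2-D kernel per input channel
    pointwise = c_in * c_out       # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Assumed example: 3x3 kernels, 76 input and 76 output channels.
std = standard_conv_weights(3, 3, 76, 76)
ds = ds_conv_weights(3, 3, 76, 76)
print(std, ds, round(std / ds, 2))
```

For this assumed configuration the separable layer needs roughly one eighth of the weights of the standard layer.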

For speech recognition and KWS, the most commonly used speech features are the mel-frequency cepstral coefficients (MFCCs) [17–20]. In recent years, there has, however, been a tendency to use mel-frequency spectral coefficients (MFSCs) directly with neural network-based speech recognition systems [6, 14, 21] instead of applying the discrete cosine transform (DCT) to obtain MFCCs. This is mainly because the strong correlations between adjacent time-frequency components of speech signals can be exploited efficiently by neural network architectures such as the CNN [22, 23]. An important property of MFSC features is that they attenuate the characteristics of the acoustic signal irrelevant to the spoken content, such as the intonation or accent [24].

One of the major challenges of supervised learning algorithms is the ability to generalize from training data to unseen observations [25]. Reducing the impact of speaker variability on the input features can make it easier for the network to generalize. Another way to improve the generalization is to ensure a high diversity of the training data, which can be realized by augmenting the training data. For audio data, augmentation techniques include filtering [26], time shifting and time warping [27], and adding background noise. However, the majority of KWS systems either have used artificial noises, such as white or pink noise, which are not relevant for real-life applications or have considered only a limited number of background noises [14, 17, 28].

Because of the limited complexity of KWS compared to large-vocabulary ASR, low-power embedded microprocessor systems are suitable targets for running real-time KWS without access to cloud computing [17]. Implementing neural networks on microprocessors presents two major challenges in terms of the limited resources of the platform: (1) memory capacity to store weights, activations, input/output, and the network structure itself is very limited for microprocessors; (2) computational power on microprocessors is limited. The number of computations per network inference is therefore limited by the real-time requirements of the KWS system. To meet these strict resource constraints, the size of the networks must be restricted in order to reduce the number of network parameters. Techniques like quantization can further be used to reduce the computational load and memory footprint. The training and inference of neural networks is typically done using floating-point precision for weights and layer outputs, but for implementation on mobile devices or embedded platforms, fixed point formats at low bit widths are often more efficient. Many microprocessors support single instruction, multiple data (SIMD) instructions, which perform arithmetic on multiple data points simultaneously, but typically only for 8/16 bit integers. Using low bit width representations will therefore increase the throughput and thus lower the execution time of network inference. Previous research has shown that, for image classification tasks, it is possible to quantize CNN weights and activations to 8-bit fixed point format with a minimum loss of accuracy [29, 30]. However, the impact of quantization on the performance of a DS-CNN-based KWS system has not yet been investigated.

This paper extends previous efforts [17] to implement a KWS system based on a DS-CNN by (a) identifying performance-critical elements in the system when scaling the network complexity; (b) augmenting training data with a wider variety of realistic noise recordings, using a controlled range of signal-to-noise ratios (SNRs) that are realistic for practical KWS applications during both training and testing, and testing the ability of the KWS system to generalize to unseen acoustic conditions by evaluating the system performance in both matched and mismatched background noise conditions; (c) evaluating the effect of quantizing individual network elements; and (d) evaluating the small-footprint KWS system on a continuous audio stream rather than single inferences. Specifically, the paper reports the implementation of a 10-word KWS system based on a DS-CNN classifier on a low-power embedded microprocessor (ARM Cortex M4), motivated by the system in [17]. The KWS system described in the present study is targeted at real-time applications, which can be either always on or only active when triggered by an external system, e.g., a wake-word system. To identify the network complexity at which the performance decreases relative to the system in [17], we tested a wide range of system parameters between 2 layers and 10 filters per layer up to 9 layers and 300 filters per layer. The network was trained with keywords augmented with realistic background noises at a wide range of SNRs and the network’s ability to generalize to unseen acoustic conditions was evaluated. With the goal of reducing the memory footprint of the system, it was investigated how quantization of weights and activations affected performance by gradually lowering the bit widths using principles of mixed and dynamic fixed point representations. In this process, single-inference performance was evaluated, motivated by the smaller parameter space and the close connection between the performance in single-inference testing and continuous audio presentation. In the final step, the performance of the suggested KWS system was tested when detecting keywords in a continuous audio stream and compared to the reference system of high complexity.

### KWS system

The proposed DS-CNN-based KWS system consisted of three major building blocks as shown in Fig. 1. First, MFSC features were extracted based on short time blocks of the raw input signal stream (pre-processing stage). These MFSC features were then fed to the DS-CNN-based classifier, which generated probabilities for each of the output classes in individual time blocks. Finally, a posterior handling stage combined probabilities across time blocks to improve the confidence of the detection. Each of the three building blocks is described in detail in the following subsections.

### Feature extraction

The MFSC extraction consisted of three major steps, as shown in Fig. 2. The input signal was sampled at a rate of 16 kHz and processed by the feature extraction stage in blocks of 1000 ms. For each block, the short-time discrete Fourier transform (STFT) was computed by using a Hann window of 40-ms duration with 50 % overlap, giving a total of 49 frames. Each frame was zero-padded to a length of 1024 samples before computing a 1024-point discrete Fourier transform (DFT). Afterwards, a filterbank with 20 triangular bandpass filters with a constant Q-factor spaced equidistantly on the mel-scale between 20 and 4000 Hz [31] was applied. The mel-frequency band energies were then logarithmically compressed, producing the MFSC features, resulting in a 2-D feature matrix of size 20×49 for each inference. The number of log-mel features was derived from initial investigations on a few selected network configurations where it was found that 20 log-mel features proved most efficient in terms of performance vs the resources used.
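The frame count and feature matrix size follow directly from the block, window, and overlap settings. The sketch below reproduces that arithmetic and places 20 filter centres equidistantly on the mel scale. The mel conversion formula used here is one common variant and is an assumption, as the text only cites [31] for the scale; the filter bandwidths (constant Q) are not modelled.

```python
import math

SAMPLE_RATE_HZ = 16_000
BLOCK_MS = 1000
WINDOW_MS = 40
OVERLAP = 0.5
N_FFT = 1024
N_MEL = 20


def num_frames(block_ms, window_ms, overlap):
    """Number of full windows that fit in one block for the given overlap."""
    hop_ms = window_ms * (1.0 - overlap)
    return int((block_ms - window_ms) // hop_ms) + 1


def hz_to_mel(f_hz):
    # Assumed mel formula (a common variant), not taken from the paper.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Centre frequencies spaced equidistantly on the mel scale."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # samples per 40-ms window
zero_padding = N_FFT - window_samples                 # zeros appended before the DFT
frames = num_frames(BLOCK_MS, WINDOW_MS, OVERLAP)
centers = mel_center_frequencies(N_MEL, 20.0, 4000.0)
feature_shape = (N_MEL, frames)
print(window_samples, zero_padding, frames, feature_shape)
```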

### DS-CNN classifier

The classifier had one output class for each of the keywords it should detect. It furthermore had an output class for unknown speech signals and one for signals containing silence. The input to the network was a 2-dimensional feature map consisting of the extracted MFSC features. Each convolutional layer of the network then applied a number of filters, Nfilters, to detect local time-frequency patterns across input channels. The output of each network inference was a probability vector, containing the probability for each output class. The general architecture of the DS-CNN classifier is shown in Fig. 3. The first layer of the network was in all cases a standard convolutional layer. Following the convolutional layer was a batch-normalization layer with a rectified linear unit (ReLU) [32] activation function.

Batch normalization [33] was employed to accelerate training and to reduce the risk of overfitting through regularization. By equalizing the distributions of activations, higher learning rates can be used because the magnitude of the gradients of each layer is more similar, which results in faster model convergence. Because the activations of a single audio file are not normalized by the mean and variance of each audio file, but instead by the mean and variance of the mini-batch [32] in which it appears, a regularization effect is created by the random selection of audio files in the mini-batch. The batch-normalization layer was followed by a number of depthwise separable convolutions (DS-convs) [16], which each consisted of a depthwise convolution (DW-conv) and pointwise convolution (PW-conv) as illustrated in Fig. 4, both followed by a batch-normalization layer with ReLU activation. An average pooling layer then reduced the number of activations by applying an averaging window to the entire time-frequency feature map of each input channel. Finally, a fully connected (FC) layer with softmax [32] activations generated the probabilities for each output class.

### Posterior handling

The classifier was run 4 times per second, resulting in a 250-ms shift and an overlap of 75 %. As the selected keywords were quite short in duration, they typically appeared in full length in multiple input blocks. In order to increase the confidence of the classification, an integration period, Tintegrate, was introduced, in which the predicted output probabilities of each output class were averaged. The system then detected a keyword if any of these averaged probabilities exceeded a predetermined detection threshold. To prevent the same word from triggering multiple detections, a refractory period, Trefractory, was introduced. When the system detected a keyword, it was suppressed from detecting the same keyword during the refractory period. For this paper, an integration period of Tintegrate=750 ms and a refractory period of Trefractory=1000 ms were used.
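A minimal sketch of such a posterior handling stage is given below. The class name `PosteriorHandler` and its interface are hypothetical. Two details are assumptions not stated in the text: averaging over a partially filled window at start-up, and treating the refractory period as ended exactly 1000 ms after a detection.

```python
from collections import deque

SHIFT_MS = 250            # the classifier runs 4 times per second
T_INTEGRATE_MS = 750      # integration period
T_REFRACTORY_MS = 1000    # refractory period


class PosteriorHandler:
    """Averages class probabilities over time and applies a refractory period."""

    def __init__(self, n_classes, keyword_ids, threshold):
        self.n_classes = n_classes
        self.keyword_ids = sorted(keyword_ids)
        self.threshold = threshold
        # 750 ms / 250 ms = 3 inferences per integration period.
        self.window = deque(maxlen=T_INTEGRATE_MS // SHIFT_MS)
        self.last_detection_ms = {}

    def update(self, probs, now_ms):
        """Add one probability vector and return the keyword ids detected now."""
        self.window.append(list(probs))
        n = len(self.window)
        averaged = [sum(p[c] for p in self.window) / n for c in range(self.n_classes)]
        detections = []
        for c in self.keyword_ids:
            if averaged[c] <= self.threshold:
                continue
            last = self.last_detection_ms.get(c)
            if last is not None and now_ms - last < T_REFRACTORY_MS:
                continue  # same keyword is still within its refractory period
            self.last_detection_ms[c] = now_ms
            detections.append(c)
        return detections


# Usage: class 0 is a keyword that stays active for more than one second.
handler = PosteriorHandler(n_classes=3, keyword_ids=[0], threshold=0.7)
results = [handler.update([0.9, 0.05, 0.05], t) for t in (0, 250, 500, 750, 1000)]
print(results)
```

Silence and unknown-speech classes are simply left out of `keyword_ids`, so they contribute to the averaging but never trigger a detection.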

### Dataset

The Speech Commands dataset [34] was used for training and evaluation of the networks. The dataset consisted of 65000 single-speaker, single-word recordings of 30 different words. A total of 1881 speakers contributed to the dataset, ensuring a high speaker diversity. The following 10 words were used as keywords: {“Yes,” “No,” “Up,” “Down,” “Left,” “Right,” “On,” “Off,” “Go,” “Stop”}. The remaining 20 words of the dataset were used to train the category “unknown.” The dataset was split into “training,” “validation,” and “test” sets with the ratio 80:10:10, while restricting recordings of the same speaker to only appear in one of the three sets. For training, 10 % of the presented audio files were labeled silence, i.e., containing no spoken word; 10 % were unknown words; and the remaining contained keywords.

### Data augmentation

For training, validation, and testing, the speech files were mixed with 13 diverse background noises at a wide range of SNRs. The background noise signals were real-world recordings, some containing speech, obtained from two publicly available databases, the TUT database [35] and the DEMAND database [36]. The noise signals were split into two sets, reflecting matched and mismatched conditions (see Table 1). The networks were either trained on the clean speech files or trained on speech files mixed with noise signals from noise set 1 with uniformly distributed A-weighted SNRs in the range between 0 and 15 dB. To add background noise to the speech files, the filtering and noise adding tool (FaNT) [37] was used. Noise set 2 was then used to evaluate the network performance in acoustic conditions that were not included in the training. Separate recordings of each noise type were used for training and evaluation.
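Mixing a speech file with noise at a target SNR amounts to scaling the noise before adding it. The sketch below is not FaNT and does not apply A-weighting; it uses plain broadband RMS levels, and the function names are hypothetical.

```python
import math


def rms(x):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise RMS ratio equals snr_db, then add it."""
    gain = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


def snr_db_of(speech, scaled_noise):
    return 20.0 * math.log10(rms(speech) / rms(scaled_noise))


# Toy signals; real use would load equally long speech and noise segments.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed, gain = mix_at_snr(speech, noise, snr_db=10.0)
achieved = snr_db_of(speech, [gain * n for n in noise])
print(round(achieved, 6))
```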

### Resource estimation

To compare the resources used by different network configurations, the following definitions were used to estimate number of operations, memory, and execution time.

#### Operations

The number of operations is reported per inference of the network and is defined as the total number of multiplications and additions in the convolutional layers of the DS-CNN.

#### Memory

The memory reported is the total memory required to store the network weights/biases and layer activations, assuming 8-bit variables. As the activations of one layer are only used as input for the next layer, the memory for the activations can be reused. The total memory allocated for activations is then equal to the maximum of the required memory for inputs and outputs of a single layer.
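The memory estimate can be sketched as follows. The interpretation that the activation buffer must hold the input and output of the largest layer at the same time is an assumption, and the layer sizes and weight count in the example are made up for illustration.

```python
def activation_buffer_bytes(layer_io_sizes, bytes_per_value=1):
    """Shared activation buffer: the largest (input + output) size over all layers."""
    return max(n_in + n_out for n_in, n_out in layer_io_sizes) * bytes_per_value


def total_memory_bytes(n_weights, layer_io_sizes, bytes_per_value=1):
    """Weights/biases plus the reusable activation buffer, 8-bit values by default."""
    return n_weights * bytes_per_value + activation_buffer_bytes(layer_io_sizes, bytes_per_value)


# Hypothetical network: (input values, output values) per layer.
layers = [(980, 3800), (3800, 3800), (3800, 76)]
buffer_bytes = activation_buffer_bytes(layers)
total_bytes = total_memory_bytes(n_weights=1000, layer_io_sizes=layers)
print(buffer_bytes, total_bytes)
```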

#### Execution time

The execution times reported in this paper are estimations based on measured execution times of multiple different-sized networks. The actual network inference execution time of implemented DS-CNNs on the Cortex M4 was measured using the Cortex M4’s on-chip timers, with the processor running at a clock frequency of 180 MHz. In this study, only two hyper-parameters were altered: the number of DS-conv layers, Nlayers, and the number of filters applied per layer, Nfilters. The number of layers was varied between 2 and 9, and the number of filters per layer was varied between 10 and 300. Convolutional layers after layer 7 had the same parameters as seen in the last layers in Table 2 in terms of filter size and strides.

### Quantization methods

The fixed point format represents floating-point numbers as N-bit 2’s complement signed integers, where the BI leftmost bits (including the sign-bit) represent the integer part, and the remaining BF rightmost bits represent the fractional part. The following two main concepts were applied when quantizing a pre-trained neural network effectively [29].
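As an illustration of this format, the sketch below converts a floating-point value to an N-bit signed integer with BF fractional bits and back. Rounding to nearest and saturating at the range limits are assumptions; the text does not state the rounding mode used.

```python
def quantize(value, b_i, b_f):
    """Map a float to a signed (b_i + b_f)-bit integer with b_f fractional bits."""
    n_bits = b_i + b_f
    q_min = -(2 ** (n_bits - 1))
    q_max = 2 ** (n_bits - 1) - 1
    q = round(value * 2 ** b_f)        # Python rounds exact halves to the nearest even integer
    return max(q_min, min(q_max, q))   # saturate instead of wrapping around


def dequantize(q, b_f):
    """Recover the real value represented by the stored integer."""
    return q / 2 ** b_f


# 8-bit example with BI = 1 (sign bit only) and BF = 7: range [-1, 1 - 2**-7].
print(quantize(0.5, 1, 7), dequantize(quantize(0.3, 1, 7), 7), quantize(1.5, 1, 7))
```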

#### Mixed fixed point precision

The fully connected and convolutional layers of a DS-CNN consist of a long series of multiply-and-accumulate (MAC) operations, where network weights multiplied with layer activations are accumulated to give the output. Using different bit widths for different parts of the network, i.e., mixed precision, has been shown to be an effective approach when quantizing CNNs [38], as the precision required to avoid performance degradation may vary in different parts of the network.

#### Dynamic fixed point

The weights and activations of different CNN layers will have different dynamic ranges. The fixed point format requires that the range of the values to represent is known beforehand, as this determines BI and BF. To ensure a high utilization of the fixed point range, dynamic fixed point [39] can be used, which assigns the weights/activations into groups of constant BI.
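One way to pick BI and BF for a group from its dynamic range is sketched below. The rule used here (the smallest BI for which the largest magnitude lies below 2^(BI−1)) is an assumption for illustration, and it allows BI to drop below 1 for groups whose values are all small, which some implementations of dynamic fixed point permit.

```python
import math


def split_bits(values, n_bits=8):
    """Return (BI, BF) for a group so that its largest magnitude is below 2**(BI - 1)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1, n_bits - 1
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, hence max_abs < 2**e.
    _, exponent = math.frexp(max_abs)
    b_i = exponent + 1                 # +1 for the sign bit
    return b_i, n_bits - b_i


# Hypothetical groups of one layer, each with its own dynamic range.
groups = {
    "weights": [0.1, -0.4, 0.25],
    "activations": [3.7, -1.2, 0.0],
}
formats = {name: split_bits(vals) for name, vals in groups.items()}
print(formats)
```

Small-valued weights thus receive more fractional bits than wide-ranging activations, which is the utilization benefit described above.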

For faster inference, the batch-norm operations were fused into the weights of the preceding convolutional layer and quantized after this fusion. BI and BF were determined by splitting the network variables for each layer into groups of weights, biases, and activations, and estimating the dynamic range of each group. The dynamic ranges of groups with weights and biases were fixed after training, while the ranges of activations were estimated by running inference on a large number of representative audio files from the dataset and generating statistical parameters for the activations of each layer. BI and BF were then chosen such that saturation is avoided. The optimal bit widths were determined by dividing the variables in the network into separate categories based on the operation, while the layer activations were kept as one category. The effects on performance were then examined when reducing the bit width of a single category while keeping the rest of the network at floating-point precision. The precision of the weights and activations in the network was varied in experiment 3 between 32-bit floating-point precision and low bit width fixed point formats ranging from 8 to 2 bit.
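Fusing batch normalization into the preceding convolution uses the standard identity shown below for a single output channel. The parameter names and the value of `eps` are assumptions; the paper does not list them.

```python
import math


def fuse_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-3):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_weights = [w * scale for w in weights]
    fused_bias = (bias - mean) * scale + beta
    return fused_weights, fused_bias


def dot(weights, x, bias):
    return sum(w * v for w, v in zip(weights, x)) + bias


# One output channel with two weights, checked against the unfused computation.
w, b = [1.0, -2.0], 0.5
gamma, beta, mean, var, eps = 2.0, 0.1, 0.3, 4.0, 1e-3
x = [0.7, 0.2]

unfused = gamma * (dot(w, x, b) - mean) / math.sqrt(var + eps) + beta
fw, fb = fuse_batch_norm(w, b, gamma, beta, mean, var, eps)
fused = dot(fw, x, fb)
print(abs(fused - unfused) < 1e-12)
```

After this step only `fw` and `fb` need to be quantized and stored, which matches the order described above: fuse first, then quantize.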

### Training

All networks were trained with Google’s TensorFlow machine learning framework [40] using an Adam optimizer to minimize the cross-entropy loss. The networks were trained for 30,000 iterations with a batch size of 100. Similar to [17], an initial learning rate of 0.0005 was used; after 10,000 iterations, it was reduced to 0.0001; and for the final 10,000 iterations, it was reduced to 0.00002. During training, audio files were randomly shifted in time by up to 100 ms to reduce the risk of overfitting.
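The stepwise learning rate can be written as a simple function of the iteration index. This sketch assumes the steps occur exactly at iterations 10,000 and 20,000 and is independent of any particular framework.

```python
TOTAL_ITERATIONS = 30_000


def learning_rate(iteration):
    """Piecewise-constant learning rate over the 30,000 training iterations."""
    if iteration < 10_000:
        return 5e-4
    if iteration < 20_000:
        return 1e-4
    return 2e-5


print([learning_rate(i) for i in (0, 9_999, 10_000, 19_999, 20_000, TOTAL_ITERATIONS - 1)])
```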

### Evaluation

To evaluate the DS-CNN classifier performance in the presence of background noise, test sets with different SNRs between −5 and 30 dB were used. Separate test sets were created for noise signals from noise set 1 and noise set 2. The system was tested by presenting single inferences (single-inference testing) to evaluate the performance of the network in isolation. In addition, the system was tested by presenting a continuous audio stream (continuous-stream testing) to approximate a more realistic application environment.

#### Single-inference testing

For single-inference testing, the system was tested without the posterior handling stage. For each inference, the maximum output probability was selected as the detected output class and compared to the label of the input signal. When testing, 10 % of the samples were silence, 10 % were unknown words, and the remaining contained keywords. Each test set consisted of 3081 audio files, and the reported test accuracy specified the ratio of correctly labeled audio files to the total amount of audio files in the test. To compare different network configurations, the accuracy was averaged across the range 0−20 dB SNR, as this reflects SNRs in realistic conditions [41].

#### Continuous audio stream testing

Test signals with a duration of 1000 s were created for each SNR and noise set, with words from the dataset appearing approximately every 3 s. Seventy percent of the words in the test signal were keywords. The test signals were constructed with a high ratio of keywords to reflect the use case in which the KWS system is not run in an always-on state but instead triggered externally by, e.g., a wake-word detector. A hit was counted if the system detected the keyword within 750 ms after occurrence, and the hit rate (also called true positive rate (TPR)) then corresponds to the number of hits relative to the total number of keywords in the test signal. The false alarm rate (also called false positive rate (FPR)) reported is the total number of incorrect keyword detections relative to the duration of the test signal, here reported as false alarms per hour.

### Test configuration

Unless stated otherwise, the parameters summarized in Table 2 are used. The network had 7 convolutional layers with 76 filters for each layer.

As a baseline for comparison, a high-complexity network was introduced. The baseline network had 8 convolutional layers with 300 filters for each layer with hyper-parameters as summarized in Table 3. The baseline network was trained using the noise-augmented dataset and evaluated using floating-point precision weights and activations.

### Platform description

Table 4 shows the key specifications for the FRDM K66F development platform used for verification of the designed systems. The deployed network used 8-bit weights and activations, but performed feature extraction using 32-bit floating-point precision. The network was implemented using the CMSIS-NN library [42] which features neural network operations optimized for Cortex-M processors.

### Results

In experiment 1, the effect of training the network on noise-augmented speech on single-inference accuracy was investigated. The influence of network complexity was assessed in experiment 2 by systematically varying the number of convolutional layers and the number of filters per layer. Experiment 3 investigated the effects on performance when quantizing network weights and activations for fixed point implementation. Finally, the best performing network was tested on a continuous audio stream and the impact of the detection threshold on hit rate and false positive rate was evaluated.

### Experiment 1: Data augmentation

Figure 5 shows the single-inference accuracies when using noise-augmented speech files for training. For SNRs below 20 dB, the network trained on noisy data had a higher test accuracy than the network trained on clean data, while the accuracy was slightly lower for SNRs higher than 20 dB. In the range between −5 and 5 dB SNR, the average accuracy for the network trained on noisy data was increased by 11.1 % and 8.6 % for the matched and mismatched noise sets, respectively, relative to the training on clean data. Under clean test conditions, it was found that the classification accuracy of the network trained on noisy data was 4 % lower than that of the network trained on clean data. For both networks, there was a difference between the accuracy on the matched and mismatched test. The average difference in accuracy in the range from −5 to 20 dB SNR was 3.3 % and 4.4 % for the network trained on clean and noisy data, respectively. The high-complexity baseline network performed on average 3 % better than the test network trained on noisy data.

### Experiment 2: Network complexity

Figure 6 shows a selection of the most feasible networks for different numbers of layers and numbers of filters per layer. For each trained network, the table specifies single-inference average accuracy in the range 0−20 dB SNR for both test sets (accuracy in parentheses for the mismatched test). Moreover, the number of operations per inference, the memory required by the model for weights/activations, and the estimated execution time per inference on the Cortex M4 are specified. For networks with more than 5 layers, no significant improvement (< 1 %) was obtained when increasing the number of filters beyond 125. Networks with fewer than 5 layers gained larger improvements from using more than 125 filters, though none of those networks reached the accuracies obtained with networks with more layers.

Figure 7 shows the accuracies of all the layer/filter combinations of the hyper-parameter search as a function of the operations. For the complex network structures, the deviation of the accuracies was very small, while for networks using few operations, there was a large difference in accuracy depending on the specific combination of layers and filters. For networks ranging between 5 and 200 million operations, the difference in classification accuracy between the best performing models was less than 2.5 %. Depending on the configuration of the network, it is therefore possible to drastically reduce the number of operations while maintaining a high classification accuracy.

In Fig. 8, a selection of the best performing networks is shown as a function of required memory and operations per inference. The label for each network specifies the parameter configuration [Nlayers,Nfilters] and the average accuracy in 0−20 dB SNR for noise set 1 and 2. The figure illustrates the achievable performance given the platform resources and shows that high accuracy was reached with relatively simple networks. From this investigation, it was found that the best performing network fitting the resource requirements of the platform consisted of 7 DS-CNN layers with 76 filters per layer, as described in Section 3.7.

### Experiment 3: Quantization

Table 5 shows the single-inference test results of the quantized networks, where each part of the network specified in Section 3.7 was quantized separately, while the remainder of the network was kept at floating-point precision.

All of the weights and activations could be quantized to 8-bit using dynamic fixed point representation with no loss of classification accuracy, and the bit widths of the weights could be further reduced to 4 bits with only small reductions in accuracy. In contrast, reducing the bit width of activations to less than 8 bits significantly reduced classification accuracy. While the classification accuracy was substantially reduced when using only 2 bits for regular convolution parameters and FC parameters, the performance completely broke down when quantizing pointwise and depthwise convolution parameters and layer activations with 2 bits. The average test accuracy in the range of 0−20 dB SNR of the network with all weights and activations quantized to 8 bits was 83.2 % for test set 1 (matched) and 79.2 % for test set 2 (mismatched), which was the same performance as using floating-point precision.

Using 8-bit fixed point numbers instead of 32-bit floating point reduced the required memory by a factor of 4, from 366 to 92 KB, with 48 KB reserved for activations and 44 KB for storing weights. Utilizing mixed fixed point precision and quantizing activations to 8 bits and weights to 4 bits would reduce the required memory to 70 KB.
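The reported memory figures can be checked with a few lines of arithmetic, starting from the 8-bit split of 44 KB for weights and 48 KB for activations. Scaling both parts linearly with bit width is an assumption; it gives 368 KB for 32-bit values, slightly above the reported 366 KB, presumably because the 8-bit figures are rounded.

```python
WEIGHT_KB_AT_8_BIT = 44.0
ACTIVATION_KB_AT_8_BIT = 48.0


def memory_kb(weight_bits, activation_bits):
    """Memory estimate, assuming each part scales linearly with its bit width."""
    weights = WEIGHT_KB_AT_8_BIT * weight_bits / 8
    activations = ACTIVATION_KB_AT_8_BIT * activation_bits / 8
    return weights + activations


all_8_bit = memory_kb(8, 8)        # 8-bit weights and activations
mixed = memory_kb(4, 8)            # 4-bit weights, 8-bit activations
all_32_bit = memory_kb(32, 32)     # floating-point reference
print(all_8_bit, mixed, all_32_bit)
```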

### Experiment 4: Continuous audio stream

Figure 9 shows the hit rate and false positive rate obtained by the KWS system on the continuous audio signals. The system was tested using different detection thresholds, which affected the system’s inclination towards detecting a keyword. It was found that the difference in hit rates was constant as a function of SNR when the detection threshold was altered, while the difference in false positive rates increased towards low SNRs. For both test sets, the hit rate and false positive rate saturated at SNRs higher than 15 dB. Figure 10 shows the corresponding DET curve obtained for the test network and baseline network.

### FRDM K66F implementation

Table 6 shows the distribution of execution time over network layers for a single inference for the implementation on the FRDM K66F development board. The total execution time of the network inference was 227.4 ms, which leaves sufficient time for feature extraction and audio input handling, assuming 4 inferences per second.

### Discussion

Experiment 1 showed that adding noise to the training material increased the classifier robustness in low SNR conditions. The increase in accuracy, compared to the same network trained on clean speech files, was most significant for the matched noise test, where the test data featured the same noise types as the training material. For the mismatched test, the increase in accuracy was slightly smaller. A larger difference in performance between clean and noisy training was expected, but as explained in [43], the dataset used was inherently noisy and featured invalid audio files, which could diminish the effect of adding more noise. For both test sets, the network trained on clean data performed better under high SNRs, i.e., SNR > 20 dB. From the perspective of this paper, however, the performance in high SNRs was of less interest, as the focus was on real-world application. If the performance in clean conditions is also of concern, [28] demonstrated that the performance decrease in clean conditions could be reduced by including clean audio files in the noisy training. As was also found in [28], the noisy training enabled the network to adapt to the noise signals and improved its generalization ability by forcing it to detect patterns more unique to the keywords. Even though the two noise sets consisted of different noise environment recordings, many of the basic noise types, such as speech, motor noise, or running water, were present in both noise sets. This would explain why the network, although trained only on data mixed with noise set 1 (matched), also performed better on test set 2 (mismatched) than the network trained on clean data.
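Noisy training of this kind relies on mixing a speech signal with a noise recording at a chosen SNR. The following is a minimal sketch of one common way to do that, not the paper's own procedure; the function names and the synthetic signals are illustrative assumptions.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power is p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measure_snr_db(speech, mixed):
    """Recover the SNR of a mixture, given the clean speech."""
    p_speech = sum(s * s for s in speech)
    p_noise = sum((m - s) ** 2 for s, m in zip(speech, mixed))
    return 10 * math.log10(p_speech / p_noise)

# Synthetic stand-ins for a speech clip and a noise recording (assumptions).
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [((t * 7919) % 13 - 6) / 6 for t in range(1000)]

mixed = mix_at_snr(speech, noise, snr_db=5)
print(round(measure_snr_db(speech, mixed), 3))
```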

The main result of experiment 2 was that the classification accuracy as a function of network complexity reached a saturation point. Increasing the number of layers or the number of filters per layer beyond this point only resulted in negligible accuracy gains, <2 %. This was explicitly shown in Fig. 5 for the single-inference classification accuracy and Fig. 10 for continuous audio streaming, where the high-complexity baseline network was directly compared with the smaller network chosen for the implementation. It was also found that, given a fixed computational and memory constraint, higher accuracies were achieved by networks with many layers and few filters than by networks with few layers and many filters. In a convolutional layer, the number of filters determines how many different patterns can be detected in the input features. The first layer detects characteristic patterns of the input speech features, and each subsequent convolutional layer will detect patterns in the patterns detected by the previous layer, adding another level of abstraction. One interpretation of the grid-search results could therefore be that if the network has sufficient levels of abstraction (layers), then the number of distinct patterns needed at each abstraction level to characterize the spoken content (number of filters) can be quite low. As the classifier should run 4 times per second, feasible network configurations were limited to inference execution times below 250 ms, which ruled out the majority of the configurations tested. In terms of the resource constraints set by the platform, execution time was the limiting factor for these networks and not the memory required for weights and activations. This was not unexpected as the DS-CNN, contrary to network architectures such as the DNN, reuses the same weights (filters) for computing multiple neurons. The DS-CNN therefore needs fewer weights relative to the number of computations it must perform, making this approach especially suitable for platforms with very limited memory capacity.

The results from experiment 3 showed that weights and activations of a network trained using floating-point precision could be quantized to low bit widths without affecting the classification accuracy. Quantizing all numbers in the network to 8 bit resulted in the same classification accuracy as using floating-point precision. It was also found that the weights of the network could all be quantized to 4 bit with no substantial loss of accuracy, which can significantly reduce the memory footprint and possibly reduce the processing time spent on fetching data from memory. These results showed that mixed fixed point precision leads to the most memory-efficient network, because the different network components (weights and activations) are robust to different reductions in bit width. For many deep CNN classifiers [29, 30, 38], it was reported that networks are very robust to the reduced resolution caused by quantization. The reason for this robustness could be that the networks are designed and trained to ignore the background noise and the deviations of the speech samples. The quantization errors are then simply another noise source for the network, which it can handle up to a certain magnitude. Gysel et al. [29] found that small accuracy decreases of CNN classifiers, caused by fixed point quantization of weights and activations, could be compensated for by partially retraining the networks using these fixed point weights and activations. A natural next step for the KWS system proposed in this paper would therefore also be to fine tune the quantized networks. Because the network variables were categorized, the quantization effects on the overall performance could be evaluated individually for each category. Results showed that different bit widths were required for the different categories, in order to maintain the classification accuracy achieved using floating-point numbers. 
It is, however, suspected that, because some of the categories span multiple network layers, a bottleneck effect could occur. For example, if the activations of a single layer require high precision, i.e., a large bit width, but the other layers' activations require fewer bits, this would be masked in the experiment because they were all in the same category. It is therefore expected that using different bit widths for each of the layers in each of the categories would potentially result in a lower memory footprint. In this paper, the fixed point representations had symmetric, zero-centered ranges. However, all of the convolutional layers use ReLU activation functions, so the activations effectively only utilize half of the available range, as values below zero are cut off. By shifting the range such that zero becomes the minimum value, the total range can be halved, i.e., BI is decreased by one. This in turn frees a bit, which could be used to increase BF by one, thereby increasing the resolution, or it could be used to reduce the total bit width by one.
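The effect of shifting the range for non-negative activations can be illustrated with a small fixed-point quantizer. This is a sketch under assumptions: `total_bits` and `frac_bits` are illustrative parameter names, with `frac_bits` playing the role of BF, and the rounding and saturation rules are a common choice rather than the paper's exact scheme.

```python
def quantize(x, total_bits, frac_bits, signed=True):
    """Round x to a grid with step 2**-frac_bits and saturate to the code range."""
    step = 2.0 ** -frac_bits
    if signed:
        lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    else:
        lo, hi = 0, 2 ** total_bits - 1
    code = max(lo, min(hi, round(x / step)))
    return code * step

x = 1.03  # a post-ReLU activation, so it is never negative

# Symmetric 8-bit format: covers [-8, 7.9375] in steps of 0.0625.
signed_q = quantize(x, total_bits=8, frac_bits=4, signed=True)
# Shifted 8-bit format: covers [0, 7.96875] in steps of 0.03125.
unsigned_q = quantize(x, total_bits=8, frac_bits=5, signed=False)

print(abs(x - signed_q), abs(x - unsigned_q))
```

Both formats use 8 bits and cover roughly the same positive range, but the shifted one spends the freed bit on one extra fractional bit, which halves the step size.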

Experiment 4 tested the KWS system performance on a continuous audio stream. As found in most signal detection tasks, lowering the decision criterion, i.e., the detection threshold, increases the hit rate but also the FPR, which means there is a trade-off. The detection threshold should match the intended application of the system. For always-on systems, it is crucial to keep the number of false alarms as low as possible, while for externally activated systems where the KWS is only active for a short time window in which a keyword is expected, a higher hit rate is more desirable. One method for lowering the FPR and increasing the true negative rate could be to increase the ratio of negative to positive samples in the training, i.e., use more “unknown” and “silence” samples. This has been shown as an effective method in other machine learning detection tasks [44, 45]. Another approach for lowering the FPR could be to create a loss function for the optimizer during training, which penalizes errors that cause false alarms more than errors that cause misses. There were significant discrepancies between the estimated number of operations and actual execution time of the different layers of the implemented network (see Table 6). The convolutional functions in the software library used for the implementation [42] all use highly optimized matrix multiplications (i.e. general matrix-matrix multiplication, GEMM) to compute the convolution. However, in order to compute 2D convolutions using matrix multiplications, it is necessary to first rearrange the input data and weights during run time. It was argued that, despite this time consuming and memory expanding data reordering, using matrix multiplications is still the most efficient implementation of convolutional layers [46, 47]. 
The discrepancies between operations and execution time could be explained by the fact that the reordering of data was not accounted for in the operation parameter and that the different layers required different degrees of reordering. For the pointwise convolutions and the fully connected layer, the activations were stored in memory in an order such that no reordering was required to do the matrix multiplication, whereas this was not possible for the standard convolution or depthwise convolution. The number of arithmetic operations for running network inference should therefore not solely be used to assess the feasibility of implementing neural networks on embedded processors, as done in [17], as this parameter does not directly reflect the execution time. Instead, this estimate should also include the additional work for data reordering required by some network layers. Based on the results presented in this paper, there are several possible actions to take to improve performance or optimize implementation of the proposed KWS system. Increasing the size of the dataset and removing corrupt recordings, or augmenting training data with more varied background noises, such as music, could increase network accuracy and generalization. Reducing the number of weights of a trained network using techniques such as pruning [48] could be used to further reduce memory footprint and execution time.
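The data reordering discussed above is often called im2col. The sketch below shows, for a single-channel input, how a 2D convolution becomes a matrix product once the input has been copied into patch rows. It is a plain-Python illustration with hypothetical function names, not the routine used in the library referenced as [42].

```python
def conv2d_direct(image, kernel):
    """Valid 2D cross-correlation computed with nested loops."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)] for r in range(out_h)]

def im2col(image, k):
    """Reorder the input: one row per output pixel, holding its k*k patch."""
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[image[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(out_h) for c in range(out_w)]

def conv2d_gemm(image, kernel):
    """Same result as conv2d_direct, as a matrix-vector product on reordered data."""
    k = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    patches = im2col(image, k)  # the extra run-time copy
    flat_out = [sum(p * w for p, w in zip(patch, flat_kernel)) for patch in patches]
    out_w = len(image[0]) - k + 1
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]

image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [1, 0, 1, 3]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

print(conv2d_gemm(image, kernel) == conv2d_direct(image, kernel))
```

For the 3x3 kernel the 16 input values expand to 4 patch rows of 9 values each, while for a 1x1 (pointwise) kernel `im2col` degenerates to a plain flattening of the input, which is consistent with the observation that pointwise layers need no reordering.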

Python training scripts and FRDM K66F deployment source code as well as a quantitative comparison of performance for using MFCC vs MFSC features for a subset of networks are available on Github [49].

### Conclusion

In this paper, methods for training and implementing a DS-CNN-based KWS system for low-resource embedded platforms were presented and evaluated. Experimental results showed that augmenting training data with realistic noise recordings increased the classification accuracy in both matched and mismatched noise conditions. By performing a limited hyper-parameter grid search, it was found that network accuracy saturated when increasing the number of layers and filters in the DS-CNN and that feasible networks for implementation on the ARM Cortex M4 processor were in this saturated region. It was also shown that using dynamic fixed point representations allowed network weights and activations to be quantized to 8-bit precision with no loss in accuracy. By quantizing different network components individually, it was found that layer activations were most sensitive to further quantization, while weights could be quantized to 4 bits with only small decreases in accuracy. The ability of the KWS system to detect keywords in a continuous audio stream was tested, and it was seen how altering the detection threshold affected the hit rate and false alarm rate. Finally, the system was verified by the implementation on the Cortex M4, where it was found that the number of arithmetic operations per inference is not directly related to execution time. Ultimately, this paper shows that the number of layers and the number of filters per layer are useful parameters for scaling system complexity. In addition, it was shown that 8-bit quantization provides a significant reduction in memory footprint and processing time and does not result in a loss of accuracy.

### Availability of data and materials

The data that support the findings of this study are available from [34].

### Abbreviations

Artificial intelligence

Automatic speech recognition

Convolutional neural network

Discrete cosine transform

Discrete Fourier transform

Deep neural network

Depthwise separable convolutional neural network

Depthwise separable convolution

Depthwise convolution

Fully connected

False positive rate

Hidden Markov model

Keyword spotting

Multiply and accumulate

Mel-frequency cepstral coefficients

Mel-frequency spectral coefficients

Pointwise convolution

Rectified linear unit

Recurrent neural network

Single instruction, multiple data

Signal-to-noise ratio

Short-time discrete Fourier transform

True positive rate

### References

1. 1

G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag.29(6), 82–97 (2012). https://doi.org/10.1109/MSP.2012.2205597.

2. 2

New Electronic Friends. https://pages.arm.com/machine-learning-voice-recognition-report.html. Accessed 30 May 2018.

3. 3

R. C. Rose, D. B. Paul, in International Conference on Acoustics, Speech, and Signal Processing. A hidden Markov model based keyword recognition system, vol. 1, (1990), pp. 129–132. https://doi.org/10.1109/ICASSP.1990.115555.

4. 4

J. R. Rohlicek, W. Russell, S. Roukos, H. Gish, in International Conference on Acoustics, Speech, and Signal Processing. Continuous hidden Markov modeling for speaker-independent word spotting, vol. 1, (1989), pp. 627–630. https://doi.org/10.1109/ICASSP.1989.266505.

5. 5

J. G. Wilpon, L. G. Miller, P. Modi, in [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing. Improvements and applications for key word recognition using hidden Markov modeling techniques, (1991), pp. 309–312. https://doi.org/10.1109/ICASSP.1991.150338. http://ieeexplore.ieee.org/document/150338/.

6. 6

G. Chen, C. Parada, G. Heigold. Small-footprint keyword spotting using deep neural networks, (2014). https://doi.org/10.1109/icassp.2014.6854370.

7. 7

K. Shen, M. Cai, W. -Q. Zhang, Y. Tian, J. Liu, Investigation of DNN-based keyword spotting in low resource environments. Int. J. Future Comput. Commun.5(2), 125–129 (2016). https://doi.org/10.18178/ijfcc.2016.5.2.458.

8. 8

G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, S. Vitaladevuni. Model compression applied to small-footprint keyword spotting, (2016), pp. 1878–1882. https://doi.org/10.21437/Interspeech.2016-1393.

9. 9

S. Fernández, A. Graves, J. Schmidhuber, in Artificial Neural Networks – ICANN 2007, ed. by J. M. de Sá, L. A. Alexandre, W. Duch, and D. Mandic. An application of recurrent neural networks to discriminative keyword spotting (Springer, Berlin, Heidelberg, 2007), pp. 220–229.

10. 10

K. P. Li, J. A. Naylor, M. L. Rossen, in [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. A whole word recurrent neural network for keyword spotting, (1992), pp. 81–84. https://doi.org/10.1109/ICASSP.1992.226115.

11. 11

M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, S. Vitaladevuni, Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. CoRR. abs/1705.02411: (2017). http://arxiv.org/abs/1705.02411.

12. 12

S. Ö. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, A. Coates, Convolutional recurrent neural networks for small-footprint keyword spotting. CoRR. abs/1703.05390: (2017). http://arxiv.org/abs/1703.05390.

13. 13

Y. LeCun, Y. Bengio, in Chap. Convolutional Networks for Images, Speech, and Time Series. The Handbook of Brain Theory and Neural Networks (MIT Press, Cambridge, MA, USA, 1998), pp. 255–258. http://dl.acm.org/citation.cfm?id=303568.303704.

14. 14

T. N. Sainath, C. Parada, in INTERSPEECH. Convolutional neural networks for small-footprint keyword spotting, (2015).

15. 15

F. Chollet, Xception: deep learning with depthwise separable convolutions. CoRR. abs/1610.02357: (2016). http://arxiv.org/abs/1610.02357.

16. 16

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, Mobilenets: efficient convolutional neural networks for mobile vision applications. CoRR. abs/1704.04861: (2017). http://arxiv.org/abs/1704.04861.

17. 17

Y. Zhang, N. Suda, L. Lai, V. Chandra, Hello edge: keyword spotting on microcontrollers. CoRR. abs/1711.07128: (2017). http://arxiv.org/abs/1711.07128.

18. 18

S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980). https://doi.org/10.1109/TASSP.1980.1163420.

18. 19

I. Chadawan, S. Siwat, Y. Thaweesak, in International Conference on Computer Graphics, Simulation and Modeling (ICGSM’2012). Speech recognition using MFCC (Pattaya (Thailand), 2012).

19. 20

Bhadragiri Jagan Mohan, Ramesh Babu N., in 2014 International Conference on Advances in Electrical Engineering (ICAEE). Speech recognition using MFCC and DTW, (2014), pp. 1–4. https://doi.org/10.1109/ICAEE.2014.6838564.

20. 21

O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process.22(10), 1533–1545 (2014). https://doi.org/10.1109/TASLP.2014.2339736.

21. 22

A. -R. Mohamed, Deep Neural Network acoustic models for ASR. PhD thesis (University of Toronto, 2014). https://tspace.library.utoronto.ca/bitstream/1807/44123/1/Mohamed_Abdel-rahman_201406_PhD_thesis.pdf.

22. 23

S. Watanabe, M. Delcroix, F. Metze, J. R. Hershey, in Springer International Publishing. New era for robust speech recognition, (2017), p. 205. https://doi.org/10.1007/978-3-319-64680-0.

23. 24

J. W. Picone, Signal modeling techniques in speech recognition. Proc. IEEE. 81, 1215–1247 (1993). https://doi.org/10.1109/5.237532.

24. 25

X. Xiao, J. Li, E. S. Chng, H. Li, C. -H. Lee, A study on the generalization capability of acoustic models for robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 18(6), 1158–1169 (2010). https://doi.org/10.1109/TASL.2009.2031236.

25. 26

I. Rebai, Y. BenAyed, W. Mahdi, J. -P. Lorré, Improving speech recognition using data augmentation and acoustic model fusion. Procedia Comput. Sci. 112, 316–322 (2017). https://doi.org/10.1016/J.PROCS.2017.08.003.

26. 27

T. Ko, V. Peddinti, D. Povey, S. Khudanpur, in INTERSPEECH. Audio augmentation for speech recognition, (2015).

27. 28

S. Yin, C. Liu, Z. Zhang, Y. Lin, D. Wang, J. Tejedor, T. F. Zheng, Y. Li, Noisy training for deep neural networks in speech recognition. EURASIP J. Audio Speech Music Process.2015(1), 2 (2015). https://doi.org/10.1186/s13636-014-0047-0.

28. 29

P. Gysel, M. Motamedi, S. Ghiasi, Hardware-oriented approximation of convolutional neural networks. CoRR. abs/1604.03168: (2016). http://arxiv.org/abs/1604.03168.

29. 30

D. D. Lin, S. S. Talathi, V. S. Annapureddy, Fixed point quantization of deep convolutional networks. CoRR. abs/1511.06393: (2015). http://arxiv.org/abs/1511.06393.

30. 31

D. O’Shaughnessy, Speech Communication: Human and Machine, (1987).

31. 32

M. A. Nielsen, Neural Networks and Deep Learning, (2015). http://neuralnetworksanddeeplearning.com/. Accessed 26 May 2020.

32. 33

S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR. abs/1502.03167: (2015). http://arxiv.org/abs/1502.03167.

33. 34

P. Warden, Speech commands: a public dataset for single-word speech recognition (2017). Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.

34. 35
Source: https://asmp-eurasipjournals.springeropen.com/articles/10.1186/s13636-020-00176-2

## Prelims

The Transformer from “Attention is All You Need” has been on a lot of people’s minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks. The paper itself is very clearly written, but the conventional wisdom has been that it is quite difficult to implement correctly.

In this post I present an “annotated” version of the paper in the form of a line-by-line implementation. I have reordered and deleted some sections from the original paper and added comments throughout. This document itself is a working notebook, and should be a completely usable implementation. In total there are 400 lines of library code which can process 27,000 tokens per second on 4 GPUs.

To follow along you will first need to install PyTorch. The complete notebook is also available on github or on Google Colab with free GPUs.

Note this is merely a starting point for researchers and interested developers. The code here is based heavily on our OpenNMT packages. (If helpful feel free to cite.) For other full-service implementations of the model, check out Tensor2Tensor (tensorflow) and Sockeye (mxnet).

• Alexander Rush (@harvardnlp or srush@seas.harvard.edu), with help from Vincent Nguyen and Guillaume Klein

My comments are blockquoted. The main text is all from the paper itself.

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as a basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Most competitive neural sequence transduction models have an encoder-decoder structure (cite). Here, the encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, \dots, z_n)$. Given $\mathbf{z}$, the decoder then generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step the model is auto-regressive (cite), consuming the previously generated symbols as additional input when generating the next.

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

### Encoder

The encoder is composed of a stack of $N=6$ identical layers.

We employ a residual connection (cite) around each of the two sub-layers, followed by layer normalization (cite).

That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}}=512$.
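The residual-plus-normalization pattern $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ can be sketched for a single position as follows. This is a minimal illustration, assuming a layer norm without learned gain and bias and with dropout left out; the function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_connection(x, sublayer):
    """LayerNorm(x + Sublayer(x)) for one position; dropout is omitted."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Any function that preserves the vector length can act as the sub-layer.
out = sublayer_connection([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * a for a in v])
print(out)
```

The element-wise sum only works when the sub-layer returns a vector of the same length as its input, which is why every sub-layer shares the dimension $d_{\text{model}}$.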

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

### Decoder

The decoder is also composed of a stack of $N=6$ identical layers.

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.

Below, the attention mask shows the positions each target word (row) is allowed to look at (column). Words are blocked from attending to future words during training.
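A mask of this kind is simply a lower-triangular matrix. The sketch below builds one in plain Python and prints it; the function name and the text rendering are illustrative choices.

```python
def subsequent_mask(size):
    """mask[i][j] is True when target position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

# Rows are target positions, columns are the positions they may look at.
for row in subsequent_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```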

### Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The two most commonly used attention functions are additive attention (cite), and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ (cite). We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean $0$ and variance $1$. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_ik_i$, has mean $0$ and variance $d_k$.). To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
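Scaled dot-product attention as described above can be sketched in a few lines. This version uses plain Python lists instead of tensors and leaves out masking and dropout, so it is an illustration of the computation rather than a usable layer; the toy matrices are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = attention(Q, K, V)
print(out)
```

Each output row is a convex combination of the value rows, so the first query, which matches the first and third keys best, ends up with more weight on the first value dimension.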

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

Here the projections are parameter matrices $W^Q_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K_i \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V_i \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$. In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
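One way to see why the cost stays similar is to count projection parameters, used here as a rough proxy for computation. The sketch below assumes bias-free projections and uses a hypothetical helper name.

```python
def multihead_shapes(d_model=512, h=8):
    """Per-head dimension and total projection parameter count (biases ignored)."""
    d_k = d_v = d_model // h
    per_head = 2 * d_model * d_k + d_model * d_v   # W^Q_i, W^K_i and W^V_i
    output_proj = h * d_v * d_model                # W^O
    return d_k, h * per_head + output_proj

d_k, multi_params = multihead_shapes()
single_params = multihead_shapes(h=1)[1]
print(d_k, multi_params, single_params)
```

With $d_k = d_v = d_{\text{model}}/h$, eight heads and a single full-width head both come to $4 d_{\text{model}}^2$ parameters.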

### Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways: 1) In “encoder-decoder attention” layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite).

2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections.
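Setting a score to $-\infty$ before the softmax gives the corresponding position a weight of exactly zero. A minimal sketch, assuming at least one position per row is allowed (true for a causal mask, where each position can always attend to itself):

```python
import math

def masked_softmax(scores, allowed):
    """Softmax where disallowed positions are set to -inf before normalizing."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite as long as at least one position is allowed
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# The third position has the highest raw score but is an illegal connection.
weights = masked_softmax([2.0, 1.0, 3.0], [True, True, False])
print(weights)
```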

### Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{\text{model}}=512$, and the inner-layer has dimensionality $d_{ff}=2048$.
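The position-wise feed-forward network, two linear maps with a ReLU in between, can be sketched with toy dimensions. The weights below are arbitrary illustrative values, and the sizes (2 and 3) stand in for $d_{\text{model}}=512$ and $d_{ff}=2048$.

```python
def linear(x, weight, bias):
    """x W + b for a single position; weight has shape (len(x), len(bias))."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]

def feed_forward(x, w1, b1, w2, b2):
    """max(0, x W1 + b1) W2 + b2, applied to one position."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

# Toy sizes: model dimension 2, inner dimension 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]

# The same parameters are applied independently to every position.
sequence = [[1.0, 2.0], [3.0, -1.0], [1.0, 2.0]]
outputs = [feed_forward(x, w1, b1, w2, b2) for x in sequence]
print(outputs)
```

Because no information is exchanged between positions, identical inputs at different positions produce identical outputs, which is what makes this equivalent to a convolution with kernel size 1.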

### Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{\text{model}}}$.

### Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed (cite).

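In this work, we use sine and cosine functions of different frequencies:

$$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$$

$$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$$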

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.
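As a rough sketch, the sinusoidal encoding can be tabulated with the standard library alone, using $\sin(pos / 10000^{2i/d_{\text{model}}})$ for even dimensions and the matching cosine for odd ones. The function name `positional_encoding` and the small sizes are assumptions for this example.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # Dimensions i and i+1 share one frequency: 1 / 10000^(i / d_model).
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table

pe = positional_encoding(max_len=50, d_model=8)
```

Row `pe[pos]` is what gets added to the embedding of the token at position `pos`.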

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop}=0.1$.

The positional encoding adds a sine wave based on position. The frequency and offset of the wave are different for each dimension.

We also experimented with using learned positional embeddings (cite) instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

### Full Model

Here we define a function that takes in hyperparameters and produces a full model.

This section describes the training regime for our models.

We stop for a quick interlude to introduce some of the tools needed to train a standard encoder decoder model. First we define a batch object that holds the src and target sentences for training, as well as constructing the masks.

Next we create a generic training and scoring function to keep track of loss. We pass in a generic loss compute function that also handles parameter updates.

### Training Data and Batching

We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.

Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

We will use torchtext for batching. This is discussed in more detail below. Here we create batches in a torchtext function that ensures our batch size, padded to the maximum batch size, does not surpass a threshold (25000 if we have 8 GPUs).
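The idea of capping a batch by its padded token count can be illustrated without torchtext. The sketch below is not the torchtext API; `batch_by_tokens` and `padded_size` are hypothetical helpers, and the greedy grouping strategy is an assumption made to keep the example short.

```python
def batch_by_tokens(pairs, max_tokens):
    """Group (src, tgt) token lists so that each padded batch stays within max_tokens.

    The padded size of a batch is (number of pairs) * (longest sequence in it),
    taking the larger of the source side and the target side.
    """
    batches, current = [], []
    max_src = max_tgt = 0
    # Sorting by length keeps similar-length sentences together and reduces padding.
    for src, tgt in sorted(pairs, key=lambda p: (len(p[0]), len(p[1]))):
        new_src = max(max_src, len(src))
        new_tgt = max(max_tgt, len(tgt))
        padded = (len(current) + 1) * max(new_src, new_tgt)
        if current and padded > max_tokens:
            batches.append(current)
            current, new_src, new_tgt = [], len(src), len(tgt)
        current.append((src, tgt))
        max_src, max_tgt = new_src, new_tgt
    if current:
        batches.append(current)
    return batches

def padded_size(batch):
    longest = max(max(len(s), len(t)) for s, t in batch)
    return len(batch) * longest

pairs = [(["a"] * n, ["b"] * (n + 1)) for n in [3, 9, 4, 10, 2, 8]]
batches = batch_by_tokens(pairs, max_tokens=20)
```

A single pair longer than `max_tokens` still forms its own batch in this sketch; a real pipeline would typically filter such sentences out beforehand.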

### Hardware and Schedule

We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models, step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

### Optimizer

We used the Adam optimizer (cite) with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$. We varied the learning rate over the course of training, according to the formula:

$$lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5})$$

This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps=4000$.
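The schedule is small enough to write as a single function. This is a sketch of the paper's rule $lrate = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$; the function name `noam_rate` is an assumption for this example.

```python
def noam_rate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rising during warmup, peaking at step 4000, then decaying as 1/sqrt(step).
rates = [noam_rate(s) for s in (1, 1000, 4000, 16000)]
```

The two arguments of `min` are equal exactly at `step == warmup_steps`, which is where the rate peaks. Quadrupling the step count after that point halves the rate.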

Note: This part is very important. Need to train with this setup of the model.

Example of the curves of this model for different model sizes and for optimization hyperparameters.

### Label Smoothing

During training, we employed label smoothing of value $\epsilon_{ls}=0.1$ (cite). This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

We implement label smoothing using the KL divergence loss. Instead of using a one-hot target distribution, we create a distribution that places `confidence` mass on the correct word and distributes the remaining `smoothing` mass throughout the vocabulary.
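A simplified sketch of this idea follows. It assumes the smoothing mass is spread evenly over all other words and ignores padding tokens, which a full implementation would need to handle; `smoothed_target` and `kl_divergence` are names chosen for this example.

```python
import math

def smoothed_target(vocab_size, target, smoothing=0.1):
    """Put 1 - smoothing on the target index and spread the rest over the other words."""
    confidence = 1.0 - smoothing
    dist = [smoothing / (vocab_size - 1)] * vocab_size
    dist[target] = confidence
    return dist

def kl_divergence(target_dist, log_probs):
    """KL(target || model) = sum_i t_i * (log t_i - log q_i), skipping zero-mass terms."""
    return sum(t * (math.log(t) - lq) for t, lq in zip(target_dist, log_probs) if t > 0)

target_dist = smoothed_target(vocab_size=5, target=2)

# A model that is almost certain about the right word versus one that matches the target.
overconfident = [math.log(p) for p in [1e-6, 1e-6, 1 - 4e-6, 1e-6, 1e-6]]
calibrated = [math.log(p) for p in target_dist]
loss_overconfident = kl_divergence(target_dist, overconfident)
loss_calibrated = kl_divergence(target_dist, calibrated)
```

The overconfident model is penalized even though it picks the correct word, because it assigns almost no probability to the alternatives that the smoothed target still expects.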

Here we can see an example of how the mass is distributed to the words based on confidence.

Label smoothing actually starts to penalize the model if it gets very confident about a given choice.

We can begin by trying out a simple copy-task. Given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols.

### Greedy Decoding

This code predicts a translation using greedy decoding for simplicity.
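As a hedged illustration of greedy decoding, the loop below picks the single highest-scoring token at each step. The scoring callback `next_token_scores` and the `toy_copy_model` stand-in are hypothetical; a trained Transformer would supply the scores in practice.

```python
def greedy_decode(next_token_scores, src, start_symbol, end_symbol, max_len):
    """Repeatedly append the highest-scoring next token until end_symbol or max_len."""
    output = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_token_scores(src, output)  # one score per vocabulary entry
        best = max(range(len(scores)), key=scores.__getitem__)
        output.append(best)
        if best == end_symbol:
            break
    return output

def toy_copy_model(src, prefix):
    """Hypothetical stand-in for a trained model: favours copying src, then emits token 0."""
    vocab_size = 10
    position = len(prefix) - 1  # number of tokens generated so far
    wanted = src[position] if position < len(src) else 0
    return [1.0 if token == wanted else 0.0 for token in range(vocab_size)]

decoded = greedy_decode(toy_copy_model, src=[3, 7, 5],
                        start_symbol=1, end_symbol=0, max_len=10)
```

Greedy decoding never revisits an earlier choice, which is why it is simple but can miss translations that beam search would find.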

Now we consider a real-world example using the IWSLT German-English Translation task. This task is much smaller than the WMT task considered in the paper, but it illustrates the whole system. We also show how to use multi-gpu processing to make it really fast.

We will load the dataset using torchtext and spacy for tokenization.

Batching matters a ton for speed. We want to have very evenly divided batches, with absolutely minimal padding. To do this we have to hack a bit around the default torchtext batching. This code patches their default batching to make sure we search over enough sentences to find tight batches.

### Multi-GPU Training

Finally to really target fast training, we will use multi-gpu. This code implements multi-gpu word generation. It is not specific to transformer so I won’t go into too much detail. The idea is to split up word generation at training time into chunks to be processed in parallel across many different gpus. We do this using pytorch parallel primitives:

• replicate - split modules onto different gpus.
• scatter - split batches onto different gpus
• parallel_apply - apply module to batches on different gpus
• gather - pull scattered data back onto one gpu.
• nn.DataParallel - a special module wrapper that calls these all before evaluating.

Now we create our model, criterion, optimizer, data iterators, and parallelization.

Now we train the model. I will play with the warmup steps a bit, but everything else uses the default parameters. On an AWS p3.8xlarge with 4 Tesla V100s, this runs at ~27,000 tokens per second with a batch size of 12,000.

### Training the System

Once trained we can decode the model to produce a set of translations. Here we simply translate the first sentence in the validation set. This dataset is pretty small so the translations with greedy search are reasonably accurate.

So this mostly covers the transformer model itself. There are four aspects that we didn’t cover explicitly. We also have all these additional features implemented in OpenNMT-py.

1) BPE/Word-piece: We can use a library to first preprocess the data into subword units. See Rico Sennrich’s subword-nmt implementation. These models will transform the training data to look like this:

▁Die ▁Protokoll datei ▁kann ▁ heimlich ▁per ▁E - Mail ▁oder ▁FTP ▁an ▁einen ▁bestimmte n ▁Empfänger ▁gesendet ▁werden

2) Shared Embeddings: When using BPE with shared vocabulary we can share the same weight vectors between the source / target / generator. See the (cite) for details. To add this to the model, tie the embedding and generator weight tensors together.

3) Beam Search: This is a bit too complicated to cover here. See OpenNMT-py for a PyTorch implementation.

4) Model Averaging: The paper averages the last k checkpoints to create an ensembling effect. We can do this after the fact if we have a bunch of models:
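A minimal sketch of checkpoint averaging, assuming each checkpoint is represented as a plain dictionary from parameter name to a flat list of floats (a simplification of real framework checkpoints); `average_checkpoints` is a name chosen for this example.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats;
    all checkpoints are assumed to share the same names and shapes.
    """
    names = checkpoints[0].keys()
    count = len(checkpoints)
    return {
        name: [sum(values) / count
               for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }

last_k = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [0.5]},
    {"w": [5.0, 6.0], "b": [1.0]},
]
averaged = average_checkpoints(last_k)
```

The averaged parameters are loaded into a single model, so inference costs the same as for one checkpoint.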

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop}=0.1$, instead of 0.3.

The code we have written here is a version of the base model. There are fully trained versions of this system available here (Example Models).

With the additional extensions in the last section, the OpenNMT-py replication gets to 26.9 on EN-DE WMT. Here I have loaded in those parameters to our reimplementation.

### Attention Visualization

Even with a greedy decoder the translation looks pretty good. We can further visualize it to see what is happening at each layer of the attention.

Hopefully this code is useful for future research. Please reach out if you have any issues. If you find this code helpful, also check out our other OpenNMT tools.

Cheers, srush

Source: https://nlp.seas.harvard.edu/2018/04/03/attention.html

## Understanding LSTM Networks

Posted on August 27, 2015

### Recurrent Neural Networks

Humans don’t start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can’t do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It’s unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.

In the above diagram, a chunk of neural network, $$A$$, looks at some input $$x_t$$ and outputs a value $$h_t$$. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem kind of mysterious. However, if you think a bit more, it turns out that they aren’t all that different than a normal neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

An unrolled recurrent neural network.

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They’re the natural architecture of neural network to use for such data.

And they certainly are used! In the last few years, there has been incredible success applying RNNs to a variety of problems: speech recognition, language modeling, translation, image captioning… The list goes on. I’ll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy’s excellent blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of “LSTMs,” a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

### The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!

### LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer.

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations.

### The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”
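A gate of this kind can be sketched in a few lines. In a real LSTM the pre-activations come from a learned layer; here they are hand-picked values, and the function names are assumptions for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(pre_activations, values):
    """Scale each component of `values` by a sigmoid-squashed gate in (0, 1)."""
    return [sigmoid(a) * v for a, v in zip(pre_activations, values)]

# Large negative pre-activations close the gate, large positive ones open it.
gated = gate([-20.0, 0.0, 20.0], [5.0, 5.0, 5.0])
```

The three outputs are approximately 0, 2.5 and 5: nothing, half, and everything is let through.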

An LSTM has three of these gates, to protect and control the cell state.

### Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $$h_{t-1}$$ and $$x_t$$, and outputs a number between $$0$$ and $$1$$ for each number in the cell state $$C_{t-1}$$. A $$1$$ represents “completely keep this” while a $$0$$ represents “completely get rid of this.”
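In the conventional notation, with $$W_f$$ and $$b_f$$ denoting the weights and bias of the forget gate layer, $$\sigma$$ the sigmoid function, and $$[h_{t-1}, x_t]$$ the concatenation of the two inputs, this step is usually written as:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$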

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $$\tilde{C}_t$$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.
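Using the same conventions, with $$W_i, b_i$$ for the input gate layer and $$W_C, b_C$$ for the tanh layer, the two parts are usually written as:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$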

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

It’s now time to update the old cell state, $$C_{t-1}$$, into the new cell state $$C_t$$. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by $$f_t$$, forgetting the things we decided to forget earlier. Then we add $$i_t*\tilde{C}_t$$. This is the new candidate values, scaled by how much we decided to update each state value.
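As a single equation, where $$*$$ denotes pointwise multiplication, $$f_t$$ is the forget gate output and $$i_t$$ is the input gate output:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$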

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through $$\tanh$$ (to push the values to be between $$-1$$ and $$1$$) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
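With $$W_o$$ and $$b_o$$ denoting the weights and bias of the output gate layer, this last step is usually written as:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$

$$h_t = o_t * \tanh(C_t)$$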

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
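Putting the walk-through together, a single LSTM step can be sketched in dependency-free Python. The parameter layout (a dictionary of weight and bias pairs acting on the concatenation of the previous hidden state and the input) and the `constant_gate` helper are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weight, bias, vector):
    """Compute W v + b, where weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell; params maps gate name -> (weight, bias)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["forget"], z)]
    i_t = [sigmoid(a) for a in affine(*params["input"], z)]
    c_tilde = [math.tanh(a) for a in affine(*params["candidate"], z)]
    o_t = [sigmoid(a) for a in affine(*params["output"], z)]
    # Pointwise update: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def constant_gate(hidden, inputs, bias_value):
    """Hypothetical helper: zero weights, so the layer depends only on its bias."""
    return [[0.0] * (hidden + inputs) for _ in range(hidden)], [bias_value] * hidden

# Hand-picked parameters: forget gate open, input gate closed,
# so the cell state is carried along almost unchanged.
params = {
    "forget": constant_gate(2, 3, 20.0),
    "input": constant_gate(2, 3, -20.0),
    "candidate": constant_gate(2, 3, 0.0),
    "output": constant_gate(2, 3, 20.0),
}
h, c = [0.0, 0.0], [0.7, -0.3]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
```

With these settings the cell state after three steps is still very close to its initial value, which illustrates the conveyor-belt behaviour of the cell state.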

### Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
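In equation form, with $$f_t$$ the forget gate output and $$\tilde{C}_t$$ the candidate values, the coupled variant replaces the separate input gate with $$1 - f_t$$:

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$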

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also some completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

### Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

### Acknowledgments

I’m grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I’m very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I’m also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei, and Jacob Steinhardt. I’m especially thankful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.


Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

## Understanding of LSTM Networks

This article discusses the problems of conventional RNNs, namely vanishing and exploding gradients, and presents Long Short-Term Memory (LSTM) as a convenient solution to them. Long Short-Term Memory is an advanced recurrent neural network (RNN) architecture designed to model chronological sequences and their long-range dependencies more precisely than conventional RNNs. The major highlights include the interior design of a basic LSTM cell, the variations brought into the LSTM architecture, and a few applications of LSTMs that are in high demand. It also makes a comparison between LSTMs and GRUs. The article concludes with a list of disadvantages of the LSTM network and a brief introduction of the upcoming attention-based models that are swiftly replacing LSTMs in the real world.

Introduction:


LSTM networks are an extension of recurrent neural networks (RNNs), introduced mainly to handle situations where RNNs fail. An RNN is a network that processes the present input while taking into consideration the previous output (feedback), which it stores in its memory for a short period of time (short-term memory). Among its various applications, the most popular are in speech processing, non-Markovian control, and music composition. Nevertheless, RNNs have drawbacks. First, they fail to store information for a long period of time. At times, a reference to information stored quite a long time ago is required to predict the current output, but in practice RNNs struggle to learn such “long-term dependencies”. Second, there is no fine control over which part of the context needs to be carried forward and how much of the past needs to be ‘forgotten’. Other issues with RNNs are exploding and vanishing gradients (explained later), which occur while training a network through backpropagation.

Thus, Long Short-Term Memory (LSTM) was brought into the picture. It is designed so that the vanishing gradient problem is almost completely removed, while the training model is left unaltered. LSTMs can bridge long time lags in certain problems, and they also handle noise, distributed representations, and continuous values. With LSTMs, there is no need to fix a finite number of states beforehand, as required in the hidden Markov model (HMM). LSTMs work well over a broad range of parameters such as learning rate and input and output gate biases, so little fine adjustment is needed. The complexity of updating each weight is O(1) per time step with LSTMs, similar to that of Back Propagation Through Time (BPTT), which is an advantage.

During the training process of a network, the main goal is to minimize the loss (in terms of error or cost) observed in the output when training data is sent through it. We calculate the gradient, that is, the derivative of the loss with respect to a particular set of weights, adjust the weights accordingly, and repeat this process until we get an optimal set of weights for which the loss is minimum. This is the concept of backpropagation. Sometimes the gradient is almost negligible. The gradient of a layer depends on certain components in the successive layers. If some of these components are small (less than 1), their product, which is the gradient, will be even smaller. This is known as the scaling effect. When this gradient is multiplied by the learning rate, which is itself a small value typically between 0.001 and 0.1, the result is smaller still. As a consequence, the change in the weights is quite small, producing almost the same output as before. This is the problem of vanishing gradients. Similarly, if the gradients are quite large due to large values of the components, the weights get updated to a value beyond the optimal value. This is known as the problem of exploding gradients. To avoid this scaling effect, the neural network unit was rebuilt in such a way that the scaling factor was fixed to one. The cell was then enriched by several gating units and was called LSTM.
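The scaling effect is easy to see numerically. The sketch below uses a product of identical per-step factors as a stand-in for a chained gradient, which is a deliberate simplification; `backpropagated_scale` is a name chosen for this example.

```python
def backpropagated_scale(factor, steps):
    """Product of `steps` identical per-step factors, as a stand-in for a chained gradient."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

vanishing = backpropagated_scale(0.9, 100)   # shrinks towards zero
exploding = backpropagated_scale(1.1, 100)   # grows very large
constant = backpropagated_scale(1.0, 100)    # a scaling factor of one preserves the signal
```

Even factors as mild as 0.9 or 1.1 change the signal by several orders of magnitude over 100 steps, while a factor of exactly one leaves it untouched.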

Architecture:

The basic difference between the architectures of RNNs and LSTMs is that the hidden layer of an LSTM is a gated unit or gated cell. It consists of four layers that interact with one another to produce the output of that cell along with the cell state. These two things are then passed on to the next hidden layer. Unlike RNNs, which have only a single tanh neural net layer, LSTMs comprise three logistic sigmoid gates and one tanh layer. Gates have been introduced in order to limit the information that is passed through the cell. They determine which part of the information will be needed by the next cell and which part is to be discarded. The output of a gate is in the range 0-1, where ‘0’ means ‘reject all’ and ‘1’ means ‘include all’.

Hidden layers of LSTM :

Each LSTM cell has three inputs, $h_{t-1}$, $C_{t-1}$ and $x_t$, and two outputs, $h_t$ and $C_t$. For a given time $t$, $h_t$ is the hidden state, $C_t$ is the cell state or memory, and $x_t$ is the current data point or input. The first sigmoid layer has two inputs, $h_{t-1}$ and $x_t$, where $h_{t-1}$ is the hidden state of the previous cell. It is known as the forget gate, as its output selects the amount of information of the previous cell to be included. The output is a number in [0,1] which is multiplied (point-wise) with the previous cell state $C_{t-1}$.

Conventional LSTM:

The second sigmoid layer is the input gate that decides what new information is to be added to the cell. It takes two inputs, $h_{t-1}$ and $x_t$. The tanh layer creates a vector of the new candidate values. Together, these two layers determine the information to be stored in the cell state. Their point-wise multiplication tells us the amount of information to be added to the cell state. The result is then added to the output of the forget gate multiplied by the previous cell state $C_{t-1}$ to produce the current cell state $C_t$. Next, the output of the cell is calculated using a sigmoid and a tanh layer. The sigmoid layer decides which part of the cell state will be present in the output, whereas the tanh layer maps the cell state into the range [-1,1]. The results of the two layers undergo point-wise multiplication to produce the output $h_t$ of the cell.

Variations:

With the increasing popularity of LSTMs, various alterations have been tried on the conventional LSTM architecture to simplify the internal design of cells, to make them work more efficiently, and to reduce the computational complexity. Gers and Schmidhuber introduced peephole connections, which allow gate layers to have knowledge about the cell state at every instant. Some LSTMs also use a coupled input and forget gate instead of two separate gates, which makes both decisions simultaneously. Another variation is the Gated Recurrent Unit (GRU), which reduces the design complexity by reducing the number of gates. It uses a combination of the cell state and hidden state, and an update gate into which the forget and input gates are merged.

LSTM(Figure-A), DLSTM(Figure-B), LSTMP(Figure-C) and DLSTMP(Figure-D)

1. Figure-A represents what a basic LSTM network looks like. Only one layer of LSTM between an input and output layer has been shown here.
2. Figure-B represents Deep LSTM which includes a number of LSTM layers in between the input and output. The advantage is that the input values fed to the network not only go through several LSTM layers but also propagate through time within one LSTM cell. Hence, parameters are well distributed within multiple layers. This results in thorough processing of inputs in each time step.
3. Figure-C represents LSTM with the Recurrent Projection layer where the recurrent connections are taken from the projection layer to the LSTM layer input. This architecture was designed to reduce the high learning computational complexity (O(N) for each time step) of the standard LSTM RNN.
4. Figure-D represents Deep LSTM with a Recurrent Projection Layer consisting of multiple LSTM layers where each layer has its own projection layer. The increased depth is quite useful in the case where the memory size is too large. Increased depth can also help prevent overfitting, as the inputs to the network need to go through many nonlinear functions.

GRUs Vs LSTMs

In spite of being quite similar to LSTMs, GRUs have never been as popular. But what are GRUs? GRU stands for Gated Recurrent Unit. As the name suggests, these recurrent units, proposed by Cho, et al., are also provided with a gating mechanism to effectively and adaptively capture dependencies of different time scales. They have an update gate and a reset gate. The former is responsible for selecting what piece of knowledge is to be carried forward, whereas the latter lies in between two successive recurrent units and decides how much information needs to be forgotten.

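Activation at time $t$:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Update gate:

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$

Candidate activation:

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$$

Reset gate:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$

Here $\sigma$ is the logistic sigmoid and $\odot$ denotes point-wise multiplication.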

Another striking aspect of GRUs is that they do not keep a separate cell state, hence they are unable to regulate the amount of memory content to which the next unit is exposed. LSTMs, in contrast, regulate the amount of new information being included in the cell. The GRU, on the other hand, controls the information flow from the previous activation when computing the new candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).
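A single GRU step can be sketched as follows, assuming the common formulation with an update gate $z_t$, a reset gate $r_t$, and no bias terms. The weight names in the dictionary `p` and the toy values are assumptions for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds input weights W* and recurrent weights U* (no biases)."""
    z_t = [sigmoid(a) for a in add(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]  # update gate
    r_t = [sigmoid(a) for a in add(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]  # reset gate
    reset_h = [r * h for r, h in zip(r_t, h_prev)]
    h_tilde = [math.tanh(a) for a in add(matvec(p["W"], x_t), matvec(p["U"], reset_h))]
    # Interpolate between the previous activation and the candidate activation.
    return [(1.0 - z) * h + z * c for z, h, c in zip(z_t, h_prev, h_tilde)]

# Toy weights for a 2-unit GRU with 2-dimensional input (values chosen arbitrarily).
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
p = {"Wz": zeros, "Uz": zeros, "Wr": zeros, "Ur": zeros, "W": identity, "U": zeros}
h_next = gru_step([0.5, -0.5], [1.0, 1.0], p)
```

The last line of `gru_step` shows the tied control mentioned above: the same gate `z_t` decides both how much of the old activation is kept and how much of the candidate is added.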

Applications:

LSTM models need to be trained on a training dataset prior to their use in real-world applications. Some of the most in-demand applications are discussed below:

1. Language modelling or text generation, which involves predicting the next words when a sequence of words is fed as input. Language models can operate at the character level, n-gram level, sentence level or even paragraph level.
2. Image processing (captioning), which involves analyzing a picture and summarizing the result in a sentence. This requires a dataset with a good number of pictures and their corresponding descriptive captions. A model that has already been trained is used to extract features from the images in the dataset. This is the photo data. The captions are then processed so that only the most informative words remain. This is the text data. The model is fitted on these two types of data. Its job is to generate a descriptive sentence for the picture one word at a time, taking as input the image and the words it has already predicted.
3. Speech and Handwriting Recognition
4. Music generation, which is quite similar to text generation: LSTMs predict musical notes instead of text by analyzing a combination of notes fed as input.
5. Language Translation, which involves mapping a sequence in one language to a sequence in another language. Similar to image processing, a dataset containing phrases and their translations is first cleaned, and only a part of it is used to train the model. An encoder-decoder LSTM model is used, which first converts the input sequence to its vector representation (encoding) and then decodes that representation into the translated sequence.
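
Several of these applications share the same generation loop: the model predicts one token, and that token is appended to the input for the next prediction. The sketch below shows the loop only. `next_word` stands in for a trained LSTM and is a hypothetical callable, as are the `<start>` and `<end>` tokens.

```python
def generate(next_word, start_token, end_token, max_len=20):
    """Greedy generation: feed back each predicted word until the end token."""
    words = [start_token]
    while len(words) < max_len:
        word = next_word(words)   # the model sees everything generated so far
        if word == end_token:
            break
        words.append(word)
    return words[1:]              # drop the start token


if __name__ == "__main__":
    # Toy stand-in for a trained model: a lookup on the last word only.
    table = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}
    toy_model = lambda words: table[words[-1]]
    print(generate(toy_model, "<start>", "<end>"))  # ['a', 'dog', 'runs']
```

For captioning, the stand-in would also receive the image features. For translation, it would receive the encoder's vector representation of the source sentence.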

Drawbacks:

Like any model, LSTMs come with disadvantages as well as advantages. A few of their drawbacks are discussed below:

1. LSTMs became popular because they could mitigate the problem of vanishing gradients. But it turns out that they fail to remove it completely. The problem lies in the fact that the data still has to move from cell to cell for its evaluation. Moreover, the cell has become quite complex with the additional features (such as forget gates) that have been brought into the picture.
2. They require a lot of resources and time to train before they are ready for real-world applications. In technical terms, they need high memory bandwidth because of the linear layers present in each cell, which the hardware often cannot supply. Thus, hardware-wise, LSTMs are quite inefficient.
3. With the rise of data mining, developers are looking for a model that can remember past information for longer than LSTMs can. The inspiration for such a model is the human habit of dividing a given piece of information into small parts so that it is easier to remember.
4. LSTMs are sensitive to random weight initialization, and in this respect they behave much like a feed-forward neural net. They tend to prefer small initial weights.
5. LSTMs are prone to overfitting and it is difficult to apply the dropout algorithm to curb this issue. Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network.
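
To make the dropout idea concrete, here is a minimal sketch that applies a random mask to a vector of values feeding an LSTM unit during training. The rescaling of surviving values ("inverted" dropout) is a common variant assumed for this example, and the function name and signature are hypothetical.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Dividing by `keep` leaves the expected value of each entry unchanged.
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    inputs = [0.2, -1.0, 0.7, 0.4]
    print(dropout(inputs, 0.5, rng))  # training: some entries zeroed
    print(dropout(inputs, 0.0, rng))  # rate 0: values pass through unchanged
```

At inference time no mask is applied. One reason dropout is awkward for LSTMs is that the mask has to be applied to the recurrent connections as well as the inputs.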

Source: https://www.geeksforgeeks.org/understanding-of-lstm-networks/
##### Abstract

The aim of this thesis is to study the effect that linguistic context exerts on the activation and processing of word meaning over time. Previous studies have demonstrated that a biasing context makes it possible to predict upcoming words. The context causes the pre-activation of expected words and facilitates their processing when they are encountered. The interaction of context and word meaning can be described in terms of feature overlap: as the context unfolds, the semantic features of the processed words are activated and words that match those features are pre-activated and thus processed more quickly when encountered. The aim of the experiments in this thesis is to test a key prediction of this account, viz., that the facilitation effect is additive and occurs together with the unfolding context. Our first contribution is to analyse the effect of an increasing amount of biasing context on the pre-activation of the meaning of a critical word. In a self-paced reading study, we investigate the amount of biasing information required to boost word processing: at least two biasing words are required to significantly reduce the time to read the critical word. In a complementary visual world experiment we study the effect of context as it unfolds over time. We identify a ceiling effect after the first biasing word: when the expected word has been pre-activated, an increasing amount of context does not produce any additional significant facilitation effect. Our second contribution is to model the activation effect observed in the previous experiments using a bag-of-words distributional semantic model. The similarity scores generated by the model significantly correlate with the association scores produced by humans. When we use point-wise multiplication to combine contextual word vectors, the model provides a computational implementation of feature overlap theory, successfully predicting reading times. 
Our third contribution is to analyse the effect of context on semantically similar words. In another visual world experiment, we show that words that are semantically similar generate similar eye-movements towards a related object depicted on the screen. A coherent context pre-activates the critical word and therefore increases the expectations towards it. This experiment also tested the cognitive validity of a distributional model of semantics by using this model to generate the critical words for the experimental materials used.

Source: https://era.ed.ac.uk/handle/1842/10508

