• # Bayesian neural networks

In short, a Bayesian neural network can be understood as regularization by introducing uncertainty into the weights of a neural network, which is equivalent to predicting with an ensemble of infinitely many neural networks integrated over a distribution on the weights.

This post is based on Charles Blundell et al. 2015.

Also published on Zhihu.

## Probabilistic model of a neural network

As is well known, a neural network can be regarded as a conditional distribution model $P(\mathbf{y}|\mathbf{x},\mathbf{w})$: given the input $\mathbf{x}$, it outputs the distribution of the predicted value $\mathbf{y}$, where $\mathbf{w}$ denotes the weights of the network. In classification, this distribution gives the probability of each class. In regression, it is usually taken to be a Gaussian with fixed standard deviation, whose mean is used as the prediction. Accordingly, training a neural network can be regarded as maximum likelihood estimation (MLE):

\begin{aligned} \mathbf{w}^\mathrm{MLE}&=\arg\max_ \mathbf{w}\log P(\mathcal{D}|\mathbf{w})\\ &=\arg\max_ \mathbf{w}\sum_ i\log P(\mathbf{y}_i|\mathbf{x}_i,\mathbf{w}) \end{aligned}

where $\mathcal{D}$ is the training dataset. For regression, substituting the Gaussian distribution yields the mean squared error (MSE); for classification, substituting the logistic function yields the cross-entropy. The optimum is usually sought by gradient descent, with gradients computed by back-propagation (BP).
MLE places no prior on $\mathbf{w}$; that is, $\mathbf{w}$ is assumed equally likely to take any value. If a prior on $\mathbf{w}$ is introduced, the estimate becomes maximum a posteriori (MAP):

\begin{aligned} \mathbf{w}^\mathrm{MAP}&=\arg\max_ \mathbf{w}\log P(\mathbf{w}|\mathcal{D})\\ &=\arg\max_ \mathbf{w}\log P(\mathcal{D}|\mathbf{w}) + \log P(\mathbf{w}) \end{aligned}

Substituting a Gaussian prior leads to L2 regularization (weights tend to take small values), while substituting a Laplace prior leads to L1 regularization (weights tend to be exactly 0, which makes them sparse).
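To make the substitution explicit, here is a short derivation sketch. It assumes a zero-mean isotropic Gaussian prior with variance $\sigma_p^2$ and a zero-mean Laplace prior with scale $b$; both symbols are introduced here only for illustration:

\begin{aligned} \log \mathcal{N}(\mathbf{w}|0,\sigma_p^2 I)&=-\frac{1}{2\sigma_p^2}\|\mathbf{w}\|_2^2+\mathrm{const}\\ \log \mathrm{Laplace}(\mathbf{w}|0,b)&=-\frac{1}{b}\|\mathbf{w}\|_1+\mathrm{const} \end{aligned}

Maximizing $\log P(\mathcal{D}|\mathbf{w})+\log P(\mathbf{w})$ is therefore the same as minimizing the negative log-likelihood plus $\lambda\|\mathbf{w}\|_2^2$ with $\lambda=1/(2\sigma_p^2)$, or plus $\lambda\|\mathbf{w}\|_1$ with $\lambda=1/b$.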

## Bayes is up!

Bayesian estimation also introduces a prior. The difference from MAP is that Bayesian estimation obtains the full posterior distribution $P(\mathbf{w}|\mathcal{D})$ of $\mathbf{w}$ instead of only its $\arg\max$, which lets us introduce uncertainty into the network's predictions. Since what we get is a distribution over $\mathbf{w}$, the probabilistic model for predicting $\hat{\mathbf{y}}$ from an input $\hat{\mathbf{x}}$ becomes:

\begin{aligned} P(\hat{\mathbf{y}}|\hat{\mathbf{x}})=\mathbb{E}_ {P(\mathbf{w}|\mathcal{D})}[P(\hat{\mathbf{y}}|\hat{\mathbf{x}},\mathbf{w})] \end{aligned}

So every time we predict $\hat{\mathbf{y}}$ we have to compute an expectation. The problem is that this expectation cannot actually be computed, because it requires the predictions of every possible neural network under $P(\mathbf{w}|\mathcal{D})$.
Obtaining the posterior $P(\mathbf{w}|\mathcal{D})$ is troublesome as well. By Bayes' theorem, $P(\mathbf{w}|\mathcal{D})$ is given by:

\begin{aligned} P(\mathbf{w}|\mathcal{D})= \frac{P(\mathbf{w},\mathcal{D})}{P(\mathcal{D})} =\frac{P(\mathcal{D}|\mathbf{w})P(\mathbf{w})}{P(\mathcal{D})} \end{aligned}

This is intractable too, since the evidence $P(\mathcal{D})$ requires integrating over all possible $\mathbf{w}$.

Therefore, to bring Bayesian estimation into neural networks, we need a way to approximate these quantities, preferably one that turns the problem into an optimization problem, which is much more to the taste of us alchemists (deep learning practitioners).

## Variational estimation

Using the variational approach, we can approximate the true posterior $P(\mathbf{w}|\mathcal{D})$ with a distribution $q(\mathbf{w}|\theta)$ controlled by a set of parameters $\theta$; for example, with a Gaussian approximation, $\theta$ is $(\mu,\sigma)$. The problem of finding the posterior distribution thus becomes the optimization problem of finding the best $\theta$. This is done by minimizing the Kullback-Leibler (KL) divergence between the two distributions:

\begin{aligned} \theta^*&=\arg\min_ \theta D_ \mathrm{KL}[q(\mathbf{w}|\theta)||P(\mathbf{w}|\mathcal{D})]\\ &=\arg\min_ \theta \int q(\mathbf{w}|\theta)\log \frac{q(\mathbf{w}|\theta)}{P(\mathbf{w})P(\mathcal{D}|\mathbf{w})} d\mathbf{w}\\ &=\arg\min_ \theta D_ \mathrm{KL}[q(\mathbf{w}|\theta)||P(\mathbf{w})] - \mathbb{E}_ {q(\mathbf{w}|\theta)}[\log P(\mathcal{D}|\mathbf{w})] \end{aligned}

This looks much better than before. Written out as an objective function:

\begin{aligned} \mathcal{F}(\mathcal{D},\theta)=D_ \mathrm{KL}[q(\mathbf{w}|\theta)||P(\mathbf{w})]-\mathbb{E}_ {q(\mathbf{w}|\theta)}[\log P(\mathcal{D}|\mathbf{w})] \end{aligned}\ \ \ \ \ (1)

It still cannot be computed exactly, but at least it looks more like something computable. The first term is the KL divergence between the variational posterior and the prior; the value of the second term depends on the training data. The first term is called the complexity cost and measures how well the weights agree with the prior; the second is called the likelihood cost and measures how well the data is fitted. Optimizing this objective amounts to the regularization alchemists know best: a trade-off between the two costs.

For the prior $P(\mathbf{w})$, the paper uses a scale mixture of Gaussians:

\begin{aligned} P(\mathbf{w})=\prod_ {j} \pi \mathcal{N}\left(\mathbf{w}_{j} | 0, \sigma_{1}^{2}\right)+(1-\pi) \mathcal{N}\left(\mathbf{w}_{j} | 0, \sigma_{2}^{2}\right) \end{aligned}

In other words, the prior distribution of each weight is the superposition of two Gaussian distributions with the same mean and different standard deviations.
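As a minimal sketch, the log-density of this prior can be evaluated as below. The function names and the default values of `pi`, `sigma1` and `sigma2` are illustrative assumptions, not values taken from the paper:

```python
import math

def gaussian_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def log_scale_mixture_prior(weights, pi=0.5, sigma1=1.0, sigma2=0.1):
    """log P(w): sum over weights of the log of the two-component mixture.

    pi, sigma1 and sigma2 are assumed example values (hypothetical defaults).
    """
    return sum(
        math.log(pi * gaussian_pdf(w, sigma1) + (1.0 - pi) * gaussian_pdf(w, sigma2))
        for w in weights
    )
```

With a wide component (`sigma1`) and a narrow one (`sigma2`), the prior pulls most weights toward zero while still tolerating a few large ones.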

The next step is to continue to approximate the objective function until it can be solved.

## When in doubt, Monte Carlo

The Monte Carlo method is engraved in every alchemist's DNA, and objective (1) is an ideal place to use it.

As is well known, the variational auto-encoder (VAE), which is also derived from Bayesian estimation, introduces a wonderful reparameterization trick: for $z\sim \mathcal{N}(\mu,\sigma^2)$, sampling directly from $\mathcal{N}(\mu,\sigma^2)$ makes the result non-differentiable with respect to $\mu$ and $\sigma$. To obtain their gradients, $z$ is rewritten as $z=\sigma \epsilon+\mu$ with $\epsilon\sim \mathcal{N}(0,1)$. The randomness then comes from a standard Gaussian sample, and $z$ can be differentiated with respect to $\mu$ and $\sigma$.
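The trick can be checked numerically. The sketch below is an illustration only: the test function $f(z)=z^2$ and the name `reparam_grads` are assumptions chosen because the exact answer is known ($\mathbb{E}[z^2]=\mu^2+\sigma^2$, so the gradients are $2\mu$ and $2\sigma$):

```python
import random

def reparam_grads(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo gradients of E[z^2], z ~ N(mu, sigma^2), w.r.t. mu and sigma."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on mu or sigma
        z = sigma * eps + mu        # reparameterized sample
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples
```

For example, `reparam_grads(1.5, 0.5)` should land close to the exact gradients `(3.0, 1.0)`, up to Monte Carlo noise.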

The paper generalizes this and proves that for a random variable $\epsilon$ with probability density $q(\epsilon)$, where $\mathbf{w}$ is a deterministic function of $\theta$ and $\epsilon$, as long as $q(\epsilon)d\epsilon=q(\mathbf{w}|\theta)d\mathbf{w}$ holds, a similar operation yields a differentiable, unbiased estimate of the derivative of an expectation:

\begin{aligned} \frac{\partial}{\partial \theta} \mathbb{E}_ {q(\mathbf{w} | \theta)}[f(\mathbf{w}, \theta)]=\mathbb{E}_ {q(\epsilon)}\left[\frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\partial \mathbf{w}}{\partial \theta}+\frac{\partial f(\mathbf{w}, \theta)}{\partial \theta}\right] \end{aligned}

Using this, the Monte Carlo approximation of (1) is:

\begin{aligned} \mathcal{F}(\mathcal{D}, \theta) \approx \sum_ {i=1}^{n} \log q\left(\mathbf{w}^{(i)} | \theta\right)-\log P\left(\mathbf{w}^{(i)}\right) -\log P\left(\mathcal{D} | \mathbf{w}^{(i)}\right) \end{aligned}\ \ \ \ \ (2)

where $\mathbf{w}^{(i)}$ is the $i$-th Monte Carlo sample of the weights, drawn from $q(\mathbf{w}|\theta)$.
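A toy sketch of this estimate for a one-weight model follows. Everything about the model is an assumption made for illustration: a linear model $y = wx$ with Gaussian noise, a single Gaussian prior instead of the mixture, and the names `log_normal` and `mc_objective`. The sketch also divides by the number of samples, which only rescales the sum in (2) by a constant:

```python
import math
import random

def log_normal(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def mc_objective(data, mu, sigma, n_samples=10, prior_std=1.0, noise_std=1.0, seed=0):
    """Average of log q(w) - log P(w) - log P(D|w) over samples w ~ q(w|theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = mu + sigma * rng.gauss(0.0, 1.0)        # reparameterized weight sample
        log_q = log_normal(w, mu, sigma)            # variational posterior
        log_prior = log_normal(w, 0.0, prior_std)   # assumed Gaussian prior
        log_lik = sum(log_normal(y, w * x, noise_std) for x, y in data)
        total += log_q - log_prior - log_lik
    return total / n_samples
```

On data generated by $y = 2x$, a posterior centred near 2 should score lower (better) than one centred far away.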

The paper proposes this approximation of $\mathcal{F}(\mathcal{D},\theta)$ even though, for many forms of prior, the KL term has an analytical solution. The reason is to accommodate more complex prior/posterior forms. Another paper considers only a Gaussian prior, and therefore uses the analytical solution of the KL term in the same evidence lower bound. In practice, different approximations can be derived depending on the prior used.

## Bayesian mini-batch gradient descent

The objective function in (1) and its approximation in (2) are both defined over the whole dataset. In practice, modern alchemy uses mini-batch gradient descent, so the complexity cost needs to be scaled accordingly. Suppose the whole dataset is split into $M$ batches; the simplest choice is to spread the complexity cost evenly over the mini-batches:

\begin{aligned} \mathcal{F}_ {i}^{\mathrm{EQ}}\left(\mathcal{D}_{i}, \theta\right)=\frac{1}{M} \mathrm{KL}[q(\mathbf{w} | \theta) \| P(\mathbf{w})] -\mathbb{E}_ {q(\mathbf{w} | \theta)}\left[\log P\left(\mathcal{D}_{i} | \mathbf{w}\right)\right] \end{aligned}

This makes $\sum_ {i} \mathcal{F}_ {i}^{\mathrm{EQ}}\left(\mathcal{D}_{i}, \theta\right)=\mathcal{F}(\mathcal{D}, \theta)$ hold. Building on this, the paper proposes another weighting scheme:

\begin{aligned} \mathcal{F}_ {i}^{\pi}\left(\mathcal{D}_{i}, \theta\right)=\pi_ {i} \mathrm{KL}[q(\mathbf{w} | \theta)\| P(\mathbf{w})] -\mathbb{E}_ {q(\mathbf{w} | \theta)} \left[\log P\left(\mathcal{D}_{i} | \mathbf{w}\right)\right] \end{aligned}

It suffices to take $\pi \in[0,1]^{M}$ with $\sum_ {i=1}^{M} \pi_ {i}=1$; then $\sum_ {i=1}^{M} \mathcal{F}_ {i}^{\pi}\left(\mathcal{D}_{i}, \theta\right)$ is an unbiased estimate of $\mathcal{F}(\mathcal{D}, \theta)$. In particular, with $\pi_ {i}=\frac{2^{M-i}}{2^{M}-1}$, the first mini-batches of an epoch focus on matching the prior and the later ones focus on fitting the data, which improves performance.
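The weighting scheme is easy to tabulate. A small sketch, where the helper name `kl_weights` is made up for illustration and exact fractions are used so that the sum can be checked without rounding error:

```python
from fractions import Fraction

def kl_weights(num_batches):
    """pi_i = 2^(M-i) / (2^M - 1) for mini-batches i = 1..M."""
    m = num_batches
    return [Fraction(2 ** (m - i), 2 ** m - 1) for i in range(1, m + 1)]
```

For `M = 3` this gives 4/7, 2/7 and 1/7: each mini-batch carries half the complexity cost of the previous one, and the weights sum to 1.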

## Local reparameterization

So far, the uncertainty introduced into the weights of the neural network can be regarded as global uncertainty. Sampling it separately for every data point is expensive: for a $1000\times1000$ fully connected layer and an $M\times1000$ input, it requires sampling $M\times1000\times1000$ random numbers. In an ordinary neural network, such a fully connected layer is a single multiplication between an (M, 1000) matrix and a (1000, 1000) matrix, because every input shares the same parameter matrix. Once uncertainty is introduced, M different (1000, 1000) parameter matrices have to be sampled and M separate multiplications between a (1, 1000) and a (1000, 1000) matrix performed, which is a completely different workload for an ordinary parallel matrix library.

To address this problem, it has been observed that if all the parameters follow independent Gaussian distributions, the entries of the matrix product are Gaussian as well. In other words, for $\mathbf{Y}=\mathbf{X}\mathbf{W}$, if

\begin{aligned} q\left(w_{i, j}\right)=\mathcal{N}\left(\mu_{i, j}, \sigma_{i, j}^{2}\right) \quad \forall w_ {i, j} \in \mathbf{W} \end{aligned}

then for $\mathbf{Y}$ we have

\begin{aligned} q\left(y_{m, j} | \mathbf{X}\right)=\mathcal{N}\left(\gamma_{m, j}, \delta_{m, j}\right) \end{aligned}

where $\gamma_ {m, j}=\sum_ {i} x_ {m, i} \mu_ {i, j}$ and $\delta_ {m, j}=\sum_ {i} x_ {m, i}^{2} \sigma_ {i, j}^{2}$.

With this result, there is no need to sample the parameters $\mathbf{W}$ each time: we can directly compute the mean and variance of the result $\mathbf{Y}$, sample $\mathbf{Y}$ from them, and back-propagate to the parameters of $\mathbf{W}$. The sampling in each computation is then local to the corresponding data point, which is why the technique is called local reparameterization.
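A plain-Python sketch of the forward pass, written with nested lists for readability (a real implementation would use a tensor library). The function name `local_reparam_layer` and its argument layout are assumptions for illustration:

```python
import math
import random

def local_reparam_layer(X, mu, sigma, rng):
    """Sample Y = XW by sampling the activations instead of the weights.

    X is an M x I list of inputs; mu and sigma are I x J lists holding the
    mean and standard deviation of each independent Gaussian weight.
    """
    n_in, n_out = len(mu), len(mu[0])
    Y = []
    for x in X:
        row = []
        for j in range(n_out):
            gamma = sum(x[i] * mu[i][j] for i in range(n_in))                # mean of y_{m,j}
            delta = sum(x[i] ** 2 * sigma[i][j] ** 2 for i in range(n_in))   # variance of y_{m,j}
            row.append(gamma + math.sqrt(delta) * rng.gauss(0.0, 1.0))       # one noise draw per output
        Y.append(row)
    return Y
```

Only one random number per output entry is drawn (M x J in total) instead of one per weight per data point (M x I x J).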

• # Retrofitting a Windows input method for Metro app compatibility


## TL;DR

Microsoft, annoyingly, does not allow shared memory in UWP apps, so Weasel (小狼毫, "little wolf hair") stopped working there. After switching the cross-process communication to Windows named pipes, it works.

## Background

After picking up Wubi (five-stroke input) again, I have been frustrated by the lack of a suitable Wubi input method. As a semi-beginner, one feature I really like is reverse lookup of Wubi codes; however, mixed Wubi-Pinyin input is not easy to use. The various old-school Wubi input methods are either too ugly or long unmaintained. Worse, some of them behave like rogue software.

A few years ago I tried the Rime input method and learned that it is a powerful input method that can meet all kinds of customization needs. Back then, however, I had no development skills and customizing Rime was somewhat complicated, so I did not stick with it. Now that I have the need and look back, Rime is without doubt the best fit for me.

Rime has a compatibility problem with newer versions of Windows. As one of the first Windows 10 users, I obviously need an input method that works properly in UWP applications. However, Rime's original Windows front end, Weasel ("little wolf hair"), is not compatible with Metro applications from Windows 8 onward, including the UWP apps of Windows 10. The original author is too busy to spare the time and has given up maintaining Weasel. Other front ends implemented by other developers suffer from various stability problems. For example, users of the Rime Tieba forum ported Rime to an input method framework called PIME, but its input service implementation is painful: it crashes often, has no reliable exception handling, and in many cases the input service has to be restarted manually.

Two major principles of the open source world are "if you think you can do better, do it yourself" and "show me the code". Back when I used Weasel, it never crashed abnormally, which shows that its design in this respect is excellent; the main problem now is only that it is not compatible with Metro applications. Now that I am able to develop, I might as well maintain it myself, for my own benefit and for others'.

• # Miko: Geisha's half-broken compiler


Fate is unpredictable. In the end, Geisha's compiler implementation was actually pulled out to serve as a course design project while I was preparing for the postgraduate entrance exam.

Because of various factors, the final result is far from what I had hoped. For example, there was no time to implement GC, so closures are incomplete; and there are no user-defined types at all. Still, in the end it did turn into something that produces results (even if it is half-broken).

This is a document submitted as a course report during the summer vacation. As for follow-up work, although the postgraduate entrance exam is over, a lot of things have suddenly piled up, so I can only continue improving it when I find the time.

The code is, of course, on GitHub. It's just that all the stars went to the front-end demo written in Haskell, so nobody pays attention to this one, emmmm.

## Overall design

The parser is generated with rust-peg, and LLVM is used as the code generation back end.

• # How to compile function closures


// Zhihu not only refused to give me a column, it also said my article was politically sensitive; that lousy website is doomed sooner or later

A closure, also known as a lexical closure or function closure, is a function that references free variables. The referenced free variables live on together with the function even after it has left the environment in which it was created. Hence another view: a closure is an entity composed of a function and its associated referencing environment. Closures can have multiple instances at runtime; different referencing environments combined with the same function produce different instances.

To implement a programming language that supports both lexical scoping and first-class functions, one important transformation is lambda lifting: nested function definitions (including anonymous functions) are turned into independent global function definitions, and references to free variables are turned into parameter passing.
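The transformation can be illustrated by hand in Python. This is only a sketch of the idea, not the compiler's actual output; `make_adder`, `add_lifted` and `make_adder_lifted` are made-up names:

```python
from functools import partial

# Before: a nested function that captures the free variable `n`.
def make_adder(n):
    def add(x):
        return x + n
    return add

# After lambda lifting: `add` becomes a global function and the free
# variable `n` is turned into an explicit parameter.
def add_lifted(n, x):
    return x + n

def make_adder_lifted(n):
    # The closure is now just "global function + captured values".
    return partial(add_lifted, n)
```

Both `make_adder(3)(4)` and `make_adder_lifted(3)(4)` evaluate to 7; the lifted version simply makes the captured environment explicit, which is what a compiler back end needs.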

• # Document duplicate checking based on hash and winnowing methods

This is just a lab report.

## Design ideas

Given realistic computing power, it is impractical to compare many large documents for similarity with a traditional text diff algorithm. Hash values reflect the characteristics of data to some extent; however, ordinary hash methods emphasize collision avoidance, so a small change in the source data causes a large change in the hash value. For duplicate checking, we need to extract document features that remain similar when the source data is similar.

The winnowing method, proposed by Saul Schleimer et al., extracts such document features (document fingerprints). Hashing the k-gram sequence of a document yields a sequence of feature values that reflects document similarity; comparing it with the feature value sequence of another document, the proportion of shared feature values reflects the similarity between the two documents.
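A compact sketch of the idea follows. Several choices here are assumptions rather than the report's actual settings: CRC32 as the k-gram hash, the defaults `k=5` and `window=4`, and the Jaccard index as the "proportion of shared feature values". Position information, which the full algorithm records alongside each fingerprint, is left out:

```python
import zlib

def kgram_hashes(text, k):
    """Hash every run of k consecutive characters (k-gram) of the text."""
    return [zlib.crc32(text[i:i + k].encode("utf-8")) for i in range(len(text) - k + 1)]

def winnow(text, k=5, window=4):
    """Keep the minimum hash of every window of `window` consecutive k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    n_windows = max(len(hashes) - window + 1, 1)  # short texts form a single window
    return {min(hashes[start:start + window]) for start in range(n_windows)}

def similarity(a, b, k=5, window=4):
    """Share of fingerprints the two documents have in common (Jaccard index)."""
    fa, fb = winnow(a, k, window), winnow(b, k, window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because each window keeps its minimum hash, any passage of at least `window + k - 1` characters shared by two documents contributes at least one common fingerprint.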

• # Two or three things about the LLVM environment on Windows


Honestly, as someone preparing for the postgraduate entrance exam, being harassed by this kind of environment setup problem makes this salted fish's life very unhappy.

The cause is that Haskell's LLVM binding, llvm-general, only supports version 3.5 on Hackage. There is a newer version in the repo, of course, but the author said in the issues that the newer version does not work well.

In that case, install 3.5. After downloading the LLVM 3.5.2 source code, cmake ran smoothly and opening the project in VS 2017 went smoothly too, but compilation failed with an error.

In short, a method of a certain class carries a trailing const qualifier, while the calling context does not match it in constness, so the call is rejected.

Strangely, the error is reported inside a standard library file shipped with VS. Awkward: I could either patch the VS standard library or dig into the LLVM source code.

I tried it in WSL and it compiled painlessly, so VS seems to have some annoying restriction. A quick search online turned up no similar reports (after all, back in the 3.5 days they did not officially support compiling with VS), so I decided to build with MinGW.

Generating the MinGW makefiles with cmake works fine, but make then reports a command syntax error.

Tracing it to build.make, I found the command on the offending line rather odd.

In all my years on Windows I had never noticed that `cd` can take a `/d` parameter. Apparently the shell's `cd` does not support `/d`, so I deleted it.

Then came another error: a symbol could not be found. It turned out that `LTO.def`, the file generated by the two lines just changed, was the problem. For some reason it was not written as a normal text file but as a binary file (VS Code reports that the file is too large or binary). Printing it directly with `type` shows a strange, widely spaced format that still looks like text. Infuriating.

However, since a `.def` file is just text anyway, why not build the def by hand? So I ran the command manually and used `type` to print the contents of `lto.exports`.

This yields a list of symbol names.

Create a new `LTO.def`, paste them in, and overwrite the old one.

Rerun `cmake --build .` and it finally works.

• # Using parsec to handle left recursion


When adding expression syntax to the front of the Lisp interpreter I wrote earlier, I ran into grammar rules like these:

\begin{aligned} Expr & \rightarrow Factor ... \\ Exprs & \rightarrow Expr , Exprs\\ Factor & \rightarrow Integer|Apply|Identify|...|{(} {Expr} {)} \\ Integer & \rightarrow... \\ ...\\ Apply & \rightarrow Factor ( Exprs ) \end{aligned}

Obviously, the leftmost derivation of the non-terminal $Apply$ enters $Factor$ and then comes back to $Apply$: textbook left recursion.

• # Call With Current Continuation: Coroutine

By nesting `call/cc`, coroutines can be implemented conveniently.

• # Write You a Scheme


I wrote a Scheme interpreter, which counts as finally having made something with Haskell (although it's just a toy).

The biggest takeaway is that I got familiar with Haskell and consolidated my Scheme (although I had read SICP, I did not really understand quasiquote, call/cc and the like).

I originally planned to define a language of my own (this one), but that turned out to be very troublesome (just kidding), and I cared more about the interpretation process itself, so I decided to implement Scheme. It basically follows Write Yourself a Scheme in 48 Hours, with continuations and a few other things added.

• # What is functional programming thinking?


Why am I moving my Zhihu answer here... Probably because it has been too long since I posted anything, so this is padding.

Author: name overflow

Source: Zhihu

The biggest difference between functional programming and imperative programming is that:

Functional programming is concerned with data mapping, while imperative programming is concerned with problem solving steps

The mapping here is the mathematical concept of "function" - the correspondence between one thing and another.

This is why "functional" programming is called "functional".

What does that mean?

Suppose you are interviewing at Google and the interviewer asks you to invert (mirror) a binary tree.

Almost without thinking, you can write Python code like this:
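A minimal sketch of what such code might look like, assuming a hypothetical `Node` class with `value`, `left` and `right` fields (the names are illustrative, not taken from the original answer):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def invert(node):
    # Step 1: check whether the node is empty.
    if node is None:
        return None
    # Step 2: flip the left subtree in place.
    invert(node.left)
    # Step 3: flip the right subtree in place.
    invert(node.right)
    # Step 4: swap the left and right children.
    node.left, node.right = node.right, node.left
    return node
```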

Now, stop and look at what this piece of code stands for:

Its meaning is: first, judge whether the node is empty; Then flip the left tree; Then flip the right tree; Finally, the left and right are exchanged.

This is imperative programming: whatever you want done, you have to spell out the steps to get there in detail and then hand them to the machine to run.

This is also characteristic of the Turing machine, the theoretical model behind imperative programming: a paper tape full of data, and a machine that acts according to the contents of the tape. Every move the machine makes has to be written on the tape.

So how could a binary tree be flipped without this approach?

Functional thinking offers another way of looking at it:

The so-called "flipped binary tree" can be seen as getting a new binary tree symmetrical with the original binary tree.

The defining property of this new binary tree is that each of its nodes is, recursively, the mirror of the corresponding node in the original tree.

It is expressed in Haskell code as follows:

(In case you can't read it, here it is translated into equivalent Python.)
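A sketch of the functional version in Python, assuming a hypothetical immutable `Tree` type built on `NamedTuple`; instead of mutating the old tree, `mirror` describes the new tree in terms of the old one:

```python
from typing import NamedTuple, Optional

class Tree(NamedTuple):
    value: int
    left: Optional["Tree"] = None
    right: Optional["Tree"] = None

def mirror(tree):
    # The mirror of an empty tree is an empty tree.
    if tree is None:
        return None
    # The mirror of a node has the mirrored right subtree on its left
    # and the mirrored left subtree on its right.
    return Tree(tree.value, mirror(tree.right), mirror(tree.left))
```

The original tree is left untouched, and mirroring twice gives back a tree equal to the original.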

The thinking reflected in this code is a mapping from the old tree to the new tree: the mirror of a binary tree is the tree whose left and right children are, recursively, the mirrors of the original's right and left children.

The ultimate goal of this code is to flip the binary tree, but the way it gets the result is essentially different from that of Python code: by describing the mapping of an old tree to a new tree, rather than describing "how to get a new tree from an old tree".

So what's the advantage of thinking like this?

First of all, and most visibly, functional-style code can be very concise, which greatly reduces wear on the keyboard.

More importantly, functional code is a "description of a mapping". It can describe not only correspondences between data structures such as binary trees, but correspondences between anything that can be represented in a computer: mappings between functions (such as functors), or mappings from external operations to the GUI (the so-called FRP that is hot in front-end development right now). This high level of abstraction means functional code can be reused more easily.

At the same time, writing code this way makes it easy to study mathematically (which is why deep mathematical concepts such as category theory can be brought in).

As for things like currying and immutable data, those are merely peripheral.