Bloom Filters

Burton Howard Bloom is an MIT alumnus who, while employed by the Computer Usage Company, developed and published a data structure that later became famous as the “Bloom filter” [1].

In this post we’ll talk about Bloom Filters. We’ll give a description of the data structure and its operations. We’ll then study the theory that explains the performance of this data structure. Next, we’ll describe an implementation in Python, run some experiments and finally discuss applications.

Introduction

A Bloom filter is a probabilistic data structure that supports inserting elements into a set and testing whether an element belongs to that set, sacrificing accuracy for lower memory usage.

More precisely, it can say an element is in the set when in fact it’s not (a false positive), but if it says an element is not in the set, that is always true. Also, the original variant doesn’t allow removal of elements [2].

[Figure: example of a Bloom filter (source: Wikipedia).]

Algorithm

The Bloom filter structure consists of a bit array of size m. We then define k independent hash functions, each of which maps a given value uniformly to one of the m positions.

Inserting an element consists of applying the k hash functions to the element and setting the bits in the k positions returned by these functions to 1.

Querying for an element consists of applying the same k hash functions and testing whether all the bits in the k positions returned by these functions are set to 1.

If the test returns positive, there’s a chance the element is there, but it could be a false positive, since the bits could have been set by the insertion of other elements. On the other hand, if it returns negative, we’re sure the element is not there, because we never unset a bit. Also, as the number of elements in the set grows, the probability of false positives increases.

Time complexity. Both insertion and querying take time proportional to k, which is constant for a fixed filter, assuming random access memory and O(1) hash functions.

Probability of false positives. The probability p that the algorithm will return true when the element is not there (a false positive) can be described in terms of m, k and n (the number of inserted elements) through the following equation:

p = (1 - e^{-kn/m})^k

In Appendix A, we provide an approximate derivation of this result.

Optimal number of hash functions. For given values of n and m, we have a choice of how many hash functions k to use. Intuitively, if we use too few hash functions, we increase the chance of collisions when querying for an element, whereas if we use too many, the array fills up earlier and collisions also increase.

It’s possible to prove (a proof is sketched in Appendix B) that, for fixed m and n, the value of k that minimizes p is given by:

k = ln(2)m/n
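
To make these formulas concrete, here is a small numeric check (the values of n and m are arbitrary, chosen only for illustration):

import math

n = 1000   # number of inserted elements (arbitrary, for illustration)
m = 10000  # number of bits in the filter (arbitrary, for illustration)

k_opt = math.log(2) * m / n    # optimal number of hash functions
k = int(round(k_opt))          # in practice k must be an integer

# approximate false positive probability p = (1 - e^(-kn/m))^k
p = (1 - math.exp(-float(k) * n / m)) ** k

print(k_opt, k, p)  # roughly 6.93, 7 and 0.008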

Python Implementation

For this implementation, we use the bitarray Python module. It’s a C implementation that represents a bit array efficiently, and it conveniently overloads operators, so most of the time working with it feels like working with regular arrays.

We can define a sufficiently large length m for our bit array, for example:

from bitarray import bitarray
m = 1 << 20
bit = bitarray(m)

# bitarray doesn't guarantee the bits are all set to 0
# upon initialization
bit.setall(False)

The Murmur algorithm

We need to pick a hash function to use with our Bloom filter. References such as [3] suggest using an algorithm called Murmur. This Stack Exchange thread has a nice survey of different hash functions.

Murmur is a hash algorithm developed by Austin Appleby. Even though it’s not suitable for cryptographic applications, in practice it is very fast and tends to distribute real-world instances well.

According to the author, the name comes from an early implementation detail, which used multiply-rotate-multiply-rotate operations to mix the internal state of the hash (hence MuR-MuR).

From the source code, this algorithm seems to shuffle the bits from the input key by using multiplication, shifting and xor operations.

The author mentions using simulated annealing to find some of the constants of the final mix of the algorithm. This final part is used to cause the avalanche effect, in which very similar inputs are hashed to very different values.
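
To give a flavor of this mixing step, here is the 64-bit finalizer of MurmurHash3 transcribed to Python (only the final avalanche step, not the full algorithm; the constants are the ones from the reference implementation):

MASK64 = (1 << 64) - 1

def fmix64(h):
    # final mix of MurmurHash3: xor-shift and multiply steps that
    # spread each input bit over the whole 64-bit output
    h ^= h >> 33
    h = (h * 0xff51afd7ed558ccd) & MASK64
    h ^= h >> 33
    h = (h * 0xc4ceb9fe1a85ec53) & MASK64
    h ^= h >> 33
    return h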

There is a Python library called pyhash that has an interface for several hash functions (implemented in C++).

To install it, we can use the easy_install command. Easy Install is a Python module used to download packages. pyhash in particular is available in the default repository, PyPI (the Python Package Index), so all we need to do is:

> sudo easy_install pyhash

To use it within a Python script:

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
hash_value = hasher("some key")  # any string (or bytes) to be hashed

Families of hash functions

In Jonathan Ellis‘s post, he mentions a strategy for generating a family of hash functions. He cites a paper by Kirsch and Mitzenmacher [4] which shows that we can generate a set of k hash functions from 2 independent hash functions in the following way:

h_i(x) = h_A(x) + i \cdot h_B(x), \, i = 0, \cdots, k-1

In practice, since murmur3_x64_128() generates 128-bit hashes, we can use the lower 64 bits as the first hash function and the upper 64 bits as the second.

from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()
h128 = hasher("some key")

# the lower and upper 64 bits act as the two independent hash functions
h64l = h128 & ((1 << 64) - 1)
h64u = h128 >> 64

# k positions in the bit array of size m (note the modulo is m, not k)
k = 7  # number of hash functions; m is the bit array length defined earlier
hashes = [(h64l + i * h64u) % m for i in range(k)]
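
Putting the pieces together, a minimal Bloom filter class could look like the sketch below. The class name, the helper _positions and the str() conversion of keys are my own choices, not part of the original code:

from bitarray import bitarray
from pyhash import murmur3_x64_128

class BloomFilter(object):
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bitarray(m)
        self.bits.setall(False)
        self.hasher = murmur3_x64_128()

    def _positions(self, key):
        # derive k positions from a single 128-bit hash, following the
        # Kirsch-Mitzenmacher construction described above
        h128 = self.hasher(str(key))
        h1 = h128 & ((1 << 64) - 1)
        h2 = h128 >> 64
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def query(self, key):
        # True may be a false positive; False is always correct
        return all(self.bits[pos] for pos in self._positions(key))

For example, with bf = BloomFilter(1 << 20, 7), calling bf.insert("alice") and then bf.query("alice") returns True, while bf.query("bob") is almost certainly False.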

All the code is available on github.

Experiments

Before running experiments with our Bloom filter implementation, let’s try to visualize the distribution of the Murmur hash function. For this first experiment, we compute the hash of keys ranging from 1 to 10,000 and take it modulo 10,000, so all values fall in the same range. Then we plot the keys vs. their hashes in a scatter plot, by exporting this data as a CSV file and rendering it in R (using ggplot2):

ggplot(data, aes(x=i, y=h_i)) + geom_point(size=1.5) 

[Figure: scatter plot of keys vs. their Murmur hash values.]

From the chart, we can see it apparently distributes the keys uniformly, without any obvious pattern. This is not sufficient to prove it’s a good hash function, but it’s a necessary property.
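
For reference, the CSV fed to ggplot2 can be produced with a short script along these lines (a sketch; the column names i and h_i mirror the aesthetics used in the plot):

import csv
from pyhash import murmur3_x64_128

hasher = murmur3_x64_128()

with open('murmur.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['i', 'h_i'])
    for key in range(1, 10001):
        # hash the key and reduce it to the range [0, 10000)
        writer.writerow([key, hasher(str(key)) % 10000])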

To test the family of hashes, we can add a new column, which has the value i for the i-th hash function. Then, in ggplot, we render each class with a different color with the following command:

ggplot(data, aes(x=i, y=h_i, color=class)) 
  + geom_point(size=1.5) + facet_grid(class ~ .) 

[Figure: scatter plots of the hash family, one facet per hash function.]

We can now compare the distributions between each of the hashes by plotting scatter plots for each pair i, j. Now we use another ggplot2 command to generate a 5×5 grid of charts:

ggplot(data, aes(x=i, y=h_i, color=class)) 
  + geom_point(size=1.5) + facet_wrap(~class, ncol=5) 

[Figure: grid of scatter plots for the hash functions.]

We can see each chart has a uniform pattern, which is a good sign if we’re looking for independent hash functions. For comparison, let’s plot the same chart for a set of hash functions derived only from a single hash function, for example

h_i(x) = i \cdot h_A(x), i = 1, \cdots, k

[Figure: the same plots for hash functions derived from a single hash function.]

Now, let’s analyze the false positive rate. We can do that by counting the number of false positives (FP) and true negatives (TN) and evaluating:

\dfrac{FP}{FP + TN}

We fix m and k and shuffle an array from 0 to n-1 to simulate sampling without replacement. After each inserted element of this array, we go over all non-inserted elements (FP + TN) and count how many of those our Bloom filter thinks are in the set (FP). We then plot a line chart of the number of elements inserted vs. the false positive rate, using the qplot() R function:

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements'
)
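
The data behind data$n and data$prob can be generated by a simulation along these lines (a sketch reusing the BloomFilter class sketched earlier; the parameter values are only illustrative):

import random

m, k, n = 1000, 7, 250      # illustrative parameters
bf = BloomFilter(m, k)      # the class sketched in the implementation section

elements = list(range(n))
random.shuffle(elements)    # sampling without replacement

rows = []
for count, x in enumerate(elements, start=1):
    bf.insert(x)
    not_inserted = elements[count:]   # elements not yet in the filter
    if not_inserted:
        fp = sum(1 for y in not_inserted if bf.query(y))
        rows.append((count, float(fp) / len(not_inserted)))
# rows holds (number of inserted elements, false positive rate) pairs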

[Figure: false positive rate vs. number of inserted elements.]

Let’s plot the effect of increasing the size of the bit array, using the optimal k for each case, as we increase the number of inserted elements. In this particular experiment, we use n=250 and try different values of m: 100, 250, 500 and 1000.

qplot(
  data$n, 
  data$prob, 
  ylab='Pr(false positive)', 
  xlab='N elements', 
  size=I(1), 
  color=data$m
) + scale_colour_manual(
  values=c("red", "blue", "green", "black"), 
  name="m values"
)

[Figure: false positive rate vs. number of inserted elements, for m = 100, 250, 500 and 1000.]

One strange behaviour seems to happen towards the right end of the chart for some curves: the false positive rate seems to drop. I’ve double-checked the code and it looks sane. One possible explanation is that we are calculating the probability over the non-inserted elements, and as we approach the end, the sample size is smaller, so the noise is more significant. Other than that, the probability of false positives tends to increase as we add more and more elements.

Now, let’s analyze the effect of the number of hash functions used. We fix n=250 and m=1000 and try different values of k: 1, 5, 10 and 50.

[Figure: false positive rate vs. number of inserted elements, for k = 1, 5, 10 and 50.]

We can see that using too many hash functions can be bad, because it fills up the array too fast (k=50). In general, k=5 and k=10 perform best over most of the range.

Applications

Bloom filters are suitable where false positives are tolerable. For example, they can be used as a first step to avoid lookups to a cache. Periodically, a cache server could send a Bloom filter corresponding to the items it has cached to a local machine. If the Bloom filter says an element is in the cache, that might not be true, but if it says the item is not there, the local client can skip the request to the cache and talk directly to the underlying storage.

Bloom filters use little memory, so they are suitable for network transmission.

Broder and Mitzenmacher discuss a lot more examples [5], especially network applications.

Conclusion

In this post we learned about Bloom filters. The idea itself is quite simple, and by studying the theory we were able to review some probability and calculus concepts.

The implementation made us aware of details that are usually not discussed in theoretical texts, for example which hash functions to use and how to generate a family of hash functions.

We used two Python libraries, pyhash and bitarray, and learned a little bit about the Python packaging system, PyPI. We also got some experience with the ggplot2 R library, which I plan to post about later.

What we didn’t cover were variants of Bloom filters, like the counting Bloom filter, which allows deletion. Chazelle et al. introduced Bloomier filters, which are a generalization of Bloom filters [6].

References

[1] Quora – Where can one find a photo and biographical details for Burton Howard Bloom, inventor of the Bloom filter?
[2] Wikipedia – Bloom Filter
[3] Spyced – All you ever wanted to know about writing bloom filters
[4] Less hashing, same performance: Building a better Bloom filter – A. Kirsch, M. Mitzenmacher
[5] Network Applications of Bloom Filters – A. Broder, M. Mitzenmacher
[6] Bloomier Filters – B. Chazelle et al.
[7] Probability and Computing: Randomized Algorithms and Probabilistic Analysis – M. Mitzenmacher, E. Upfal

Appendix A: Probability of false positives

Let’s consider an array of m bits and the insertion operation. The probability that a given bit is set by one of the hash functions is 1/m. Conversely, the probability of a bit not being set is 1 - 1/m. By the hypothesis of independent hash functions, the probability of a bit not being set by any of the k hash functions is

\left(1 - \frac{1}{m}\right)^k

Inserting n elements is equivalent to running the k functions n times, so after inserting n elements, the probability that a given bit is set to 0 is

\left(1 - \frac{1}{m}\right)^{kn}

and the probability that it is set to 1 is

1 - \left(1 - \frac{1}{m}\right)^{kn}

Now, the probability of a false positive, p, is the probability of all k positions being set for an element that was never inserted:

p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k

According to [2], though, this derivation is not accurate, since it assumes that the probability of each bit being set is independent of the others. The Wikipedia article sketches a more precise proof from Mitzenmacher and Upfal [7] which arrives at the same result.

To simplify the equation, we’ll use the following limit:

\lim_{x\to\infty} \left( 1 - 1/x \right)^x = 1/e

So, for large values of m, we can assume that

(1 - 1/m)^{kn} = \left((1 - 1/m)^{m}\right)^{kn/m} \approx (1/e)^{kn/m} = e^{-kn/m}

Finally we have:

p = (1 - e^{-kn/m})^k

Appendix B: Optimal number of hash functions

If we make p a function of k, it’s possible to prove it has a global minimum for positive k, so we can use the derivative to find the value of k that minimizes the false positive probability.

To find the derivative, note that minimizing p is equivalent to minimizing \ln(p) = k \ln(1 - e^{-ck}). Using the product and chain rules, its derivative with respect to k is

\ln(1 - e^{-ck}) + \frac{k}{(1 - e^{-ck})} c e^{-ck}

where c = n/m is a constant. If we substitute y = e^{-ck} (and use k = -\ln(y)/c), this becomes

\ln(1 - y) - \frac{\ln(y)}{(1 - y)} y

Setting it equal to 0 and multiplying by (1 - y), we get the equation:

\ln(1 - y) (1-y) = \ln(y) y

To solve this for y, let x = 1 - y,

\ln(x) x = \ln(y) y

Note that both x and y lie in (0, 1) and x = 1 - y. The difference \ln(x) x - \ln(y) y = \ln(1-y)(1-y) - \ln(y) y is antisymmetric around y = 1/2 (swapping y for 1-y flips its sign), and one can check that y = 1/2 is its only zero. We can thus conclude that x = y.

Replacing x back, we find that y = 1/2. Since y = e^{-ck} and c = n/m, this finally gives k = \ln(2) m/n.


Supervised Machine Learning

Andrew Ng is a Computer Science professor at Stanford. He is also one of the co-founders of the Coursera platform. Andrew recently joined the Chinese company Baidu as chief scientist.

His main research focus is on Machine Learning, Artificial Intelligence and Deep Learning. He teaches the Machine Learning class on Coursera, which lasts 10 weeks and consists of videos, multiple choice assignments and programming assignments (in Octave).

Professor Ng is a very good lecturer and makes the content accessible by relying on intuition rather than rigor, while always mentioning that a given result is backed by mathematical proofs. The material is very practical, and he often discusses real-world techniques and applications.

Throughout the classes I got curious about the math behind many of the results he presented, so this post is an attempt to work through some of that, as well as to provide an overview of the content of the course.

Supervised Learning


Suppose we have a set of data points, each encoding information as a vector of features, and for each of these vectors we are also given a value or label, which represents the desired output. We call this set a training set.

We want to find a black box that encodes some structure from this data and is able to predict the output of a new vector with certain accuracy.

It’s called supervised because we require human assistance in collecting and labeling the training set. We’ll now discuss some methods for performing supervised learning, specifically: linear regression, logistic regression, neural networks and support vector machines.

Linear regression

We are given a set X of m data points, where x_i \in R^n and a set of corresponding values Y, where y_i \in R.

Linear regression consists in finding a hyperplane in R^n, which we can represent by y = \Theta^{T} x (here the constant term is \Theta_0 and we assume x_0 = 1), such that the sum of the squared distances between each y_i and the value of the hyperplane at x_i, that is \Theta^{T} x_i, for i=1, \cdots, m, is minimized.

More formally, we can define a function:

f_{X,Y}(\Theta) = \sum_{i=1}^{m} (y_i - \Theta^{T}x_i)^2

We want to fit a hyperplane that minimizes f. We can do so by finding the optimal parameters \Theta^{*}:

\Theta^{*} = \min_{\Theta} f_{X, Y}(\Theta).

Linear regression with the least squares cost function is also known as linear least squares.

We can solve this problem numerically, because it’s possible to show that f(\Theta) is convex. One common method is gradient descent (we used this method before for solving Lagrangian relaxation).

We can also find the optimal \Theta^{*} analytically, using what is known as the normal equation:

(X^T X) \Theta^{*} = X^T y

This equation comes from the fact that a differentiable convex function attains its minimum where its derivative with respect to its input is 0. In our case, \dfrac{\partial f_{X, Y}}{\partial \Theta}(\Theta^{*}) = 0.
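
As a quick illustration, the normal equation can be solved directly with NumPy (a sketch on synthetic data; the sizes and noise level are arbitrary):

import numpy as np

# synthetic data: m points in R^n, plus a column of ones for Theta_0
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X.dot(true_theta) + 0.1 * np.random.randn(m)

# normal equation: (X^T X) Theta = X^T y
theta = np.linalg.solve(X.T.dot(X), X.T.dot(y))
# theta should be close to true_theta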

The assumption is that our data lies approximately on a hyperplane, but in reality it might be closer to, say, a quadratic curve. We can then create a new feature that is a power of one of the original features, so we can represent this curve as a hyperplane in a higher dimension.

This method also returns a real number corresponding to the estimated value of y, but many real world problems are classification problems, most commonly binary (yes/no). For this case linear regression doesn’t work very well. We’ll now discuss another method called logistic regression.

Logistic regression

In logistic regression we have y_i \in \{0, 1\}, and the idea is that, once trained, our model returns the probability that a new input belongs to the class y=1.

Here we still train our model to obtain \Theta and project a new input onto the corresponding hyperplane to obtain a numerical value. But this time we convert this value to the range [0, 1] by applying the sigmoid function, so it becomes

h_{\Theta}(x^{(i)}) = \dfrac{1}{1 + e^{-\Theta^T x^{(i)}}}

This represents the probability that given x^{(i)} parametrized by \Theta, the output will be 1, that is, p(y^{(i)}=1|x^{(i)}; \Theta). The probability of y being 0 is the complementary, that is

p(y^{(i)}=0|x^{(i)}; \Theta) = 1 - p(y^{(i)}=1|x^{(i)}; \Theta)

The likelihood of the parameter \Theta to fit the data can be given by

(1) \ell(\Theta) = \prod_{i=1}^{m} (h_{\Theta}(x^{(i)}))^{y^{(i)}} (1 - h_{\Theta}(x^{(i)}))^{({1 - y^{(i)}})}

We want to find the maximum likelihood estimate (MLE), that is, the parameter \Theta that maximizes (1). Since the logarithm is monotonic, maximizing \ell(\Theta) is equivalent to minimizing the negative of its logarithm, which gives us the cost

(2) - \left[\sum_{i=1}^{m} y^{(i)} \log (h_{\Theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\Theta}(x^{(i)}))\right]

This minimization doesn’t admit a closed-form solution, so we need a numerical algorithm such as gradient descent. This particular function happens to be convex, so gradient descent does not get stuck in local optima (which is no longer true for the neural networks discussed next).

After estimating the parameter \Theta^{*}, we can classify an input point x by:

y = \left\{ \begin{array}{ll}   1 & \mbox{if } h_{\Theta^{*}}(x) \ge 0.5 \\  0 & \mbox{otherwise}  \end{array}  \right.
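
A minimal sketch of logistic regression trained with gradient descent in NumPy (synthetic data; the learning rate and number of iterations are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# synthetic, roughly linearly separable data
m, n = 200, 2
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

theta = np.zeros(n + 1)
alpha = 0.1                          # learning rate
for _ in range(1000):
    h = sigmoid(X.dot(theta))
    grad = X.T.dot(h - y) / m        # gradient of the cost (2), averaged
    theta -= alpha * grad

predictions = (sigmoid(X.dot(theta)) >= 0.5).astype(int)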

Neural Networks

Neural networks can be seen as a generalization of the simple form of logistic regression seen above. There, we have one output (y) that is a combination of several inputs (x_j). We can use a diagram to visualize it; more specifically, we could have a directed graph like the following:

[Figure: single layer network.]

Here we have nodes representing the input data and the output, and each edge has a corresponding parameter \Theta. We can partition this graph into layers, in this case the input layer and the output layer. To generalize this, we can add intermediate layers, called hidden layers:

[Figure: multi-layer network.]

In the general case, we have L layers. Between layers \ell and \ell+1 we have a complete bipartite graph, and each edge (i,j) connecting node i at level \ell to node j at level \ell+1 is associated with a parameter \Theta_{ij}^{(\ell)}. The idea at each node is the same: perform a linear combination of the nodes incident to it and then apply the sigmoid function. Note that each layer has one extra node that is not connected to any of the previous layers; it acts as the constant term for the nodes in the following layer. The last node represents the final value of the function for the input x.

There’s a neat algorithm for computing the gradients needed to train this model, called the back-propagation algorithm.

Forward-propagation is the procedure for calculating the activation of each of the nodes. The idea is the following: for each layer \ell we’ll use variables a and z.

Initialize a:

a^{(0)} = x

For the following layers \ell=1, \cdots, L we do:

\begin{array}{ll}   z^{(\ell)} & = (\Theta^{(\ell-1)})^{T} a^{(\ell-1)}\\  a^{(\ell)} & = g(z^{(\ell)})  \end{array}

For the simple case of binary classification, the last layer contains a single node, and after running the above procedure it will contain the value of the hypothesis for an input x:

h_{\Theta}(x) = a^{(L)}
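
A sketch of this procedure in NumPy is shown below. The convention here is slightly different from the formulas above: thetas[l] is stored as the matrix that maps layer l (with a bias unit prepended) to layer l+1, and the layer sizes and weights are placeholders:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    # returns the activations a and pre-activations z of every layer
    a, z = [x], [None]
    for theta in thetas:
        a_prev = np.concatenate(([1.0], a[-1]))  # prepend the bias unit
        z.append(theta.dot(a_prev))
        a.append(sigmoid(z[-1]))
    return a, z

# example: 2 inputs -> 3 hidden units -> 1 output, random weights
thetas = [np.random.randn(3, 3), np.random.randn(1, 4)]
a, z = forward(np.array([0.5, -1.2]), thetas)
h = a[-1]  # h_Theta(x)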

Then, we can define the cost function J to be the average of (2) over the training set:

J(\Theta) = - \dfrac{1}{m} \left[\sum_{i=1}^{m} y^{(i)} \log (h_{\Theta}(x^{(i)})) + (1 - y^{(i)})\log(1 - h_{\Theta}(x^{(i)}))\right]

Back-propagation is the algorithm for finding the derivatives of the cost function with respect to \Theta. In this procedure, we use helper variables \delta, which we initialize as:

\delta^{(L)} = a^{(L)} - y^{(i)}

We iterate over the layers backwards, \ell = L-1, \cdots, 1:

\delta^{(\ell)} = (\Theta^{(\ell)})^{T} \delta^{(\ell+1)} .* g'(z^{(\ell)})

\delta is defined in such a way that

\dfrac{\partial J(\Theta)}{\partial \Theta_{j,k}^{(\ell)}} = \dfrac{1}{m} \sum_{i=1}^{m} a_{k}^{(\ell)} \delta_{j}^{(\ell+1)}
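
Continuing the sketch above, the delta recurrence and the resulting gradients for a single example could be written as follows (again a sketch under the same conventions; the bias component of delta is dropped when propagating backwards):

def backward(x, y, thetas):
    # gradients of the cost with respect to each weight matrix in thetas
    a, z = forward(x, thetas)
    L = len(thetas)
    grads = [None] * L

    delta = a[-1] - y                            # delta^(L)
    for l in range(L - 1, -1, -1):
        a_prev = np.concatenate(([1.0], a[l]))   # activations with bias unit
        grads[l] = np.outer(delta, a_prev)       # dJ/dTheta^(l)
        if l > 0:
            # propagate delta backwards, dropping the bias component,
            # and multiply elementwise by g'(z) = a * (1 - a)
            delta = thetas[l].T.dot(delta)[1:] * a[l] * (1 - a[l])
    return grads

grads = backward(np.array([0.5, -1.2]), np.array([1.0]), thetas)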

For a sketch of the derivation of these equations, see the Appendix.

Support Vector Machine

Support Vector Machines (SVMs) are a method for binary classification; they work by defining a boundary that separates points from different classes. For this model, our labels are y \in \{-1, 1\} (as opposed to y \in \{0, 1\}).

In the simple case, where we want to perform a linear separation, it consists in finding a hyperplane that partitions the space into two regions, one containing the points with y=-1 and the other the points with y=1, while maximizing the minimum distance from any point to this hyperplane.

[Figure: finding a separating hyperplane (Wikipedia).]

Instead of working with the parameters \Theta, we’ll have w and b, so we can represent our hyperplane by w^Tx + b = 0. The classification function is given by:

h_{w,b}(x) = g(w^Tx + b)

Where g is the sign function, that is, g(z) = 1 if z \ge 0 and g(z) = -1 otherwise.

Margin

For each training example x^{(i)}, we can define its functional margin as:

\gamma^{(i)} = y^{(i)} (w^Tx^{(i)} + b)

Intuitively, the functional margin represents the confidence that point x^{(i)} belongs to the class y^{(i)}. Given m input points, we define the margin of the training set as

(3) \gamma = \min_{ i = 1, \cdots, m} \gamma^{(i)}

We then want to maximize the margin, so we can write the following optimization problem:

\max \gamma

s.t. y^{(i)} (w^Tx^{(i)} + b) \ge \gamma

The constraint is to enforce (3). The problem with this model is that we can arbitrarily scale w and b without changing h_{w,b}(x), so the optimal solution is unbounded.

The alternative is to normalize the margin by dividing it by ||w||, so it becomes indifferent to scaling.

Primal form

It’s possible to show that, in order to maximize the margin, we can solve the following optimization problem, which has a quadratic objective and linear constraints:

\min \frac{1}{2} ||w||^2

s.t. y^{(i)}(w^Tx^{(i)} + b) \ge 1, for i = 1, \cdots, m

Dual form

Since this is a non-linear program, it’s possible to make use of Lagrange multipliers to obtain the dual model. In particular, the following model, obtained using this idea, together with the primal form satisfies the Karush–Kuhn–Tucker (KKT) conditions, so the optimal values of both problems are the same.

\max_{\alpha} W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)}y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle

s.t.
\begin{array}{llr}   \alpha_i & \ge 0 & i=1,\cdots,m \\  \sum_{i=1}^{m} \alpha_i y^{(i)} & = 0  \end{array}

It’s possible to show that the non-zero entries of \alpha correspond to the input points that define the margin. We call these points support vectors.

Moreover, w can be obtained by the following equation:

w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

Non-linear SVM: Kernels

A kernel is a way to generate features from the training set. For example, we can generate m features from the m points in the training set by defining feature f_i as the similarity of a point to the i-th training point. One common similarity function is the Gaussian kernel. Given points x^{(i)} and x^{(j)}, we can compute it by:

K_{\mbox{gaussian}}(x^{(i)}, x^{(j)}) = \exp \left(-\dfrac{||x^{(i)} - x^{(j)}||^2}{2\sigma^2}\right)
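
For instance, the Gaussian similarity and the induced features can be computed as follows (a short NumPy sketch; sigma is a free parameter):

import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # Gaussian (RBF) similarity between two points
    diff = xi - xj
    return np.exp(-diff.dot(diff) / (2.0 * sigma ** 2))

def kernel_features(x, X_train, sigma=1.0):
    # feature vector of x: its similarity to every training point
    return np.array([gaussian_kernel(x, xi, sigma) for xi in X_train])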

Conclusion

I had completed an Artificial Intelligence class in college and had previously taken the PGM (Probabilistic Graphical Models) course. One of the main reasons I enrolled in this course was the topic of Support Vector Machines.

Unfortunately, SVMs were discussed at a high level and, apart from kernels, the presentation was pretty confusing and mixed up with logistic regression. SVMs turned out to be a complicated topic; I ended up learning a lot about them while writing this post, but I still don’t fully understand all the details.


Appendix: Derivation of the backpropagation relations

In this appendix we’ll calculate the partial derivative of J(\Theta) with respect to a component \Theta_{j,k}^{(\ell)} to get some insight into the back-propagation algorithm. For simplicity, we’ll assume a single input x^{(i)}, so we can get rid of the summation.

By the chain rule we have:

\dfrac{\partial J(\Theta)}{\partial \Theta_{j, k}^{(\ell)}} =   \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot  \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot  \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}

Let’s analyze each of the terms:

For the term \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}, we have that z^{(\ell)} = \Theta^{(\ell)} a^{(\ell-1)}, or, breaking it down by elements:

z^{(\ell)}_{k'} = \sum_{j=1}^{n^{(\ell-1)}} \Theta^{(\ell)}_{j,k'} a^{(\ell-1)}_j

For k=k', it’s easy to see that \dfrac{\partial z^{(\ell)}_k}{\partial \Theta_{j,k}^{(\ell)}} = a^{(\ell-1)}_j. For k \ne k',

\dfrac{\partial z^{(\ell)}_{k'}}{\partial \Theta_{j,k}^{(\ell)}} = 0

For the term \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}, we have that a^{(\ell)} = g(z^{(\ell)}), thus, \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} = g'(z^{(\ell)})

Finally, for the term \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}}, we will assume that J(\Theta) \approx \dfrac{1}{2} (a^{(\ell)} - y)^2

If \ell = L, then a^{(\ell)} = h_\Theta(x), so its derivative is straightforward:

\dfrac{\partial J(\Theta)}{\partial a^{(L)}} = (h_\Theta(x) - y)

For the intermediate layers, we can find the derivative by writing it in terms of subsequent layers and then applying the chain rule. Inductively, suppose we know \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} (base is \ell = L) and we want to find \dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}}. We can write:

(4) \dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} = \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot \dfrac{\partial a^{(\ell)}}{\partial a^{(\ell - 1)}}

Given that a^{(\ell)} = g(z^{(\ell)}) = g((\Theta^{(\ell-1)})^T a^{(\ell - 1)}), we have:

\dfrac{\partial a^{(\ell)}}{\partial a^{(\ell - 1)}} =   \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot   \dfrac{\partial z^{(\ell)}}{\partial a^{(\ell - 1)}} =     (\Theta^{(\ell - 1)})^T \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}

Replacing back in (4):

\dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} =     \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot     (\Theta^{(\ell -1)})^T  \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}}    =    (\Theta^{(\ell -1)})^T  \dfrac{\partial J(\Theta)}{\partial z^{(\ell)}}

If we name \delta^{(\ell)} = \dfrac{\partial J(\Theta)}{\partial z^{(\ell)}}, we now have

(5) \dfrac{\partial J(\Theta)}{\partial a^{(\ell - 1)}} =     (\Theta^{(\ell -1)})^T \delta^{(\ell)}

“Undoing” one of the chain steps, we have

\dfrac{\partial J(\Theta)}{\partial z^{(\ell-1)}} =     \dfrac{\partial J(\Theta)}{\partial a^{(\ell-1)}} \cdot  \dfrac{\partial a^{(\ell-1)}}{\partial z^{(\ell-1)}}

Replacing (5) here,

=    (\Theta^{(\ell -1)})^T \delta^{(\ell)}  \dfrac{\partial a^{(\ell-1)}}{\partial z^{(\ell-1)}}

With that, we can get the following recurrence:

\delta^{(\ell-1)} = (\Theta^{(\ell-1)})^T \delta^{(\ell)} g'(z^{(\ell - 1)})

Finally, replacing the terms in the original equation gives us

\dfrac{\partial J(\Theta)}{\partial \Theta_{j, k}^{(\ell)}} =   \dfrac{\partial J(\Theta)}{\partial a^{(\ell)}} \cdot  \dfrac{\partial a^{(\ell)}}{\partial z^{(\ell)}} \cdot  \dfrac{\partial z^{(\ell)}}{\partial \Theta_{j,k}^{(\ell)}}     =    a_j^{(\ell)} \delta_k^{(\ell+1)}
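
One way to sanity-check these relations is a finite-difference gradient check, comparing the gradients from the backward sketch above with numerical derivatives of the cost (2); epsilon and the tolerance below are arbitrary choices:

def cost(x, y, thetas):
    # cross-entropy cost (2) for a single example
    a, _ = forward(x, thetas)
    h = a[-1]
    return float(-(y * np.log(h) + (1 - y) * np.log(1 - h)).sum())

def gradient_check(x, y, thetas, eps=1e-4):
    grads = backward(x, y, thetas)
    for l, theta in enumerate(thetas):
        numeric = np.zeros_like(theta)
        for idx in np.ndindex(*theta.shape):
            orig = theta[idx]
            theta[idx] = orig + eps
            plus = cost(x, y, thetas)
            theta[idx] = orig - eps
            minus = cost(x, y, thetas)
            theta[idx] = orig
            # central difference approximation of dJ/dTheta[idx]
            numeric[idx] = (plus - minus) / (2 * eps)
        assert np.allclose(numeric, grads[l], atol=1e-6)

gradient_check(np.array([0.5, -1.2]), np.array([1.0]), thetas)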