CapsNet: the future of Computer Vision?

You don’t have to be working in the field of Computer Vision to have heard of the Convolutional Neural Network (CNN) – the superstar neural network that became immensely popular, especially after the 2012 paper on ImageNet classification by Alex Krizhevsky et al. [1] Since then, many state-of-the-art results on computer vision problems such as semantic segmentation [2] and face detection [3] have been made possible thanks to CNNs.

However, CNNs may not necessarily be the be-all and end-all of neural networks. There are still many drawbacks and trade-offs when it comes to using a Convolutional Neural Network. One such example is the loss of information in the pooling layers, which leads to a reduction in spatial resolution.

Late last year (November 2017), Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton of Google Brain published a paper called Dynamic Routing Between Capsules [4]. Their paper introduced a network they called CapsNet, which is composed of capsules rather than neurons.

What are capsules in the first place? Capsules are basically small groups of neurons that learn to detect an object within a given region of an image and then output a vector. The output vector’s length represents the estimated probability that the object is present, whilst its orientation encodes the state of the detected feature. In other words, when a detected feature moves around the image, the length of the output vector stays the same but its orientation changes.
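To make the length-as-probability idea concrete, here is a minimal NumPy sketch of the "squashing" non-linearity the paper applies to a capsule's raw output vector (the function name and example values are my own):

```python
import numpy as np

def squash(s, eps=1e-8):
    # "Squashing" non-linearity from the paper: keeps the vector's
    # orientation but shrinks its length into [0, 1), so the length
    # can be read as the probability that the feature is present.
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + eps))

v = squash(np.array([3.0, 4.0]))  # input length 5 -> output length 25/26 ≈ 0.96
```

Note that rotating the input vector changes the output's orientation but not its length – exactly the "same probability, different pose" behaviour described above.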

Besides the concept of capsules, the paper also introduced an algorithm called routing by agreement. In CapsNet, the capsules are basically divided into two layers – the lower-level layer (also called the Primary Capsule layer) and the higher-level layer (called the Routing Capsule layer).

Figure 1: Routing algorithm as described by the original paper

The routing algorithm takes as input l, the lower-level layer; r, the number of routing iterations; and û, the prediction vectors output by the lower-level capsules. It then produces the output v_j of each higher-level capsule.

The basic idea of this algorithm is to measure the agreement between a lower-level capsule’s prediction and a higher-level capsule’s output as the dot product between the two vectors. The routing coefficients are then updated accordingly: predictions that agree strongly with a higher-level capsule’s output get routed to it more strongly.
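The loop in Figure 1 can be sketched in NumPy roughly as follows (the array shapes and variable names are my own choices; this mirrors the procedure in the paper, with the logits b, softmax coefficients c, and dot-product agreement update):

```python
import numpy as np

def squash(s, eps=1e-8):
    # Length-preserving-direction non-linearity from the paper.
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + eps))

def routing(u_hat, r=3):
    # u_hat: prediction vectors from the lower layer,
    # shape (n_lower, n_higher, dim).
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))  # routing logits, initialised to 0
    for _ in range(r):
        # Softmax over the higher-level capsules for each lower capsule.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per higher capsule
        v = squash(s)                           # higher-level outputs, (n_higher, dim)
        b += (u_hat * v[None]).sum(axis=-1)     # agreement: dot(u_hat_ij, v_j)
    return v
```

The dot-product update is what makes this "routing by agreement": a lower capsule's coefficients grow for exactly those higher capsules whose output its prediction already points towards.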

Figure 2: CapsNet architecture(taken from original paper)

Compared to a typical CNN, the CapsNet architecture is rather shallow, as it consists of only two convolutional layers and one fully connected layer. In the paper, the performance of this CapsNet architecture was compared to a standard CNN with three convolutional layers of 256, 256, and 128 channels. The 3-layer CapsNet model achieved higher test classification accuracy than the baseline convolutional model.
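As a quick sanity check on those layer sizes (assuming the 28×28 MNIST inputs, 9×9 kernels, no padding, and strides of 1 and 2 used in the paper), the shapes work out as follows:

```python
# Rough shape walk-through of the MNIST CapsNet described in the paper.
def conv_out(size, kernel, stride):
    # Output side length of a valid (no-padding) convolution.
    return (size - kernel) // stride + 1

conv1 = conv_out(28, 9, 1)          # Conv1: 28x28 -> 20x20 (256 channels)
primary = conv_out(conv1, 9, 2)     # PrimaryCaps: 20x20 -> 6x6 (32 maps of 8D capsules)
n_primary = primary * primary * 32  # 1152 primary capsules in total
# DigitCaps: 10 capsules of 16D (one per digit class), connected by routing
print(conv1, primary, n_primary)    # 20 6 1152
```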

I am well aware that I’m skipping a lot of the technical details here, because that is not the aim of this post. Rather, I would like to talk about how capsules and CapsNet could potentially be the future of computer vision.

What’s so great about capsules and CapsNet?

First of all, the amount of data! To train a CNN, one needs a huge amount of data (think of the millions of images in ImageNet, for example). For general objects that have plenty of example images already on the net (e.g. everyday objects such as chairs and tables), this might not be such a big problem. But what about images in areas such as medicine, where enough example images might not be available? Having a network that only needs a small or medium-sized dataset for training will open up many application possibilities.

Also, capsules are able to capture the spatial relationships between features – something that a CNN usually can’t do. A popular example is how a CNN will wrongly classify the picture below as a “face” because it has all the features that it knows belong to a face (eyes, nose, mouth, etc.). As CapsNet takes into account the spatial relationships between the features, it will not wrongly classify this as a face.

Figure 3: Is this a face?


The idea of CapsNets is definitely exciting because it means that we’re getting even closer to how humans intuitively see the world (a human will not describe Figure 3 as a “face”). However, there are still many challenges that CapsNet faces, such as long training time (due to the loop in the routing-by-agreement algorithm).

There is also not enough testing on large image datasets yet, but I think we can expect a rather bright future for CapsNet.

[4] Sabour, Sara, Frosst, Nicholas, and Hinton, Geoffrey E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3859–3869, 2017. (Original paper)






[1] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.

[2] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[3] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5325–5334, 2015.

