Long Text Decryption Convolutional Neural Network Architecture

Introduction

Let me be honest here. For a while, I just couldn’t really understand deep learning. I looked at research papers and articles and it seemed incredibly complex. I tried to understand neural networks and their variants, but I still had a hard time.

Then one day I decided to do it step by step, starting from the basics. I broke down the steps of how the technology worked and performed the steps (and calculations) manually until I understood how they worked. It was time-consuming and stressful, but the results were extraordinary.

Now, I not only have a comprehensive understanding of deep learning, but I also have good ideas of my own, because my foundation is solid. It is one thing to apply neural networks casually; it is another to understand what they are and the mechanisms behind them.

Today, I will share my experience and show you how I got started with Convolutional Neural Networks (CNNs) and finally figured them out. I will walk through them comprehensively so that you gain a deep understanding of how CNNs work.

In this article, I will discuss the architecture behind CNNs, which were designed to solve image recognition and classification problems. I will also assume that you have some basic familiarity with neural networks.

Table of contents

1. How do machines read images?

2. How to help neural networks recognize images?

3. Define Convolutional Neural Network

  • Convolutional Layer
  • Pooling Layer
  • Output Layer

4. Summary

5. Image classification using CNN in Keras

1. How do machines read images?

The human brain is a very powerful machine that can see (capture) multiple images per second and process them without us even realizing it. Machines are not like this. The first step in processing images is for the machine to understand how an image is represented so that it can then read it.

In simple terms, every image is a series of dots (pixels) arranged in a specific order. If you change the order or color of the pixels, the image changes. For example, consider storing and reading an image with the number 4 written on it.

Basically, the machine breaks the image into a matrix of pixels and stores the color code for each pixel at that location. In the representation below, the value 1 is white and 256 is the darkest green (for simplicity, we limit our example to one color).
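To make this concrete, below is a minimal sketch (with made-up pixel values, not the article's figure) of how a single-channel image is simply stored as a matrix of numbers:

import numpy as np

# A toy single-channel "image": each entry is a pixel intensity (values are made up).
image = np.array([
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 255, 0, 0],
])

print(image.shape)   # (5, 5) -> height x width
print(image[2, 1])   # 255 -> the intensity stored at row 2, column 1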

Once you have the image information stored in this format, the next step is to get the neural network to understand the order and pattern.

2. How to help neural networks recognize images?

The values that characterize the pixels are arranged in a specific order.

Suppose we try to recognize images using a fully connected network, how can we do it?

A fully connected network can treat the image as an array by flattening it and using the pixel values as features to predict the number in the image. However, it is very difficult for such a network to understand what is happening in the image below.

Even humans have a hard time understanding that the image above represents the number 4. We have completely lost the spatial arrangement of the pixels.

What can we do? We can try to extract features from the original image so that the spatial arrangement is preserved.

Case 1

Here we use a weight to multiply the initial pixel value.

Now it is easier for the naked eye to recognize that this is a “4”. But to pass it to a fully connected network, we still need to flatten it, and then the spatial arrangement of the image is lost.

Case 2

Now we can see that flattening the image completely destroys its arrangement. We need to come up with a way to feed the image to the network without flattening it and still preserve the spatial arrangement, which means we need to feed it a 2D/3D arrangement of pixel values.

We can try taking two pixel values of the image at a time instead of just one. This gives the network good insight into the characteristics of neighbouring pixels. Since we are taking two pixels at a time, we also need to take two weight values at a time.

You will notice that the image has gone from 4 columns of values to 3. Because we now take two pixels at a time (and each pixel is shared between consecutive shifts), the image has become smaller. Even though the image is smaller, we can still largely tell that it is a "4". Also, it is important to realize that because we take two consecutive horizontal pixels, only the horizontal arrangement is considered here.
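A rough sketch of this idea, essentially a one-dimensional convolution with a length-2 weight vector over one row of made-up pixel values:

import numpy as np

pixels = np.array([10, 20, 30, 40])   # one row of 4 pixel values (made up)
weights = np.array([1.0, 0.3])        # two weights applied to two pixels at a time

# Slide the weight pair across the row: 4 columns of pixels become 3 outputs.
output = np.array([
    np.dot(pixels[i:i + 2], weights)  # weighted sum of two neighbouring pixels
    for i in range(len(pixels) - len(weights) + 1)
])
print(output)   # [16. 29. 42.] -> three values, one fewer column than the input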

This is one way we extract features from an image. We can see the left and middle parts, but the right part doesn’t look so clear. This is mainly because of two problems:

1. The left and right corners of the image are multiplied by the weights only once.

2. The left part of the image is retained because of its higher weight value, while the right part is partially lost because of its lower weight value.

Now we have two problems, requiring two solutions.

Case 3

The problem is that the weights pass over the left and right corners of the image only once. What we need to do is make the network consider the corners just like any other pixel. We have a simple fix for this: pad the image with zeros along the sides over which the weights move.

You can see that by adding zeros, the information from the corners is retained. The image also becomes larger. This can be used in cases where we don't want the image to shrink.
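A minimal sketch of this padding step, using NumPy's np.pad to add a zero border around a toy pixel matrix:

import numpy as np

image = np.array([[1, 2],
                  [3, 4]])   # toy 2*2 "image" (values are made up)

# Add one layer of zeros around the image so that the corner pixels are
# covered by the sliding weights as often as the interior pixels.
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]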

Case 4

The problem we are trying to solve here is that the smaller weight value in the right corner is lowering the pixel value, thus making it difficult for us to recognize. What we can do is take multiple weight values and combine them.

The weight values (1, 0.3) give us an output of one form,

while the weight values (0.1, 5) give us an output of another form.

The combined version of the two outputs gives us a much clearer picture. What we have done is simply use multiple weights instead of just one, thereby retaining more information from the image. The end result is a combined version of the two outputs above.

Case 5

We have so far tried to combine horizontal pixels using weights. But in most cases we need to preserve the spatial layout in both the horizontal and vertical directions. We can take a 2D matrix of weights and combine the pixels both horizontally and vertically. Again, remember that because the weights move both horizontally and vertically, the output is one pixel smaller in each direction.

Special thanks to Jeremy Howard for inspiring me to create these images.

So what did we do?

What we did above is trying to extract features from an image by using the spatial arrangement of the image. In order to understand an image, it is extremely important for a network to understand how the pixels are arranged. What we did above is exactly what a convolutional network does. We can take an input image, define a weight matrix, and the input is convolved to extract specific features from the image without losing information about its spatial arrangement.

Another great benefit of this method is that it can reduce the number of parameters of the image. As you can see, the convolved image has fewer pixels compared to the original image.

3. Define a convolutional neural network

We need three basic elements to define a basic convolutional network:

1. Convolutional Layer

2. Pooling layer (optional)

3. Output Layer

Convolutional Layer

In this layer, what actually happens is just like what we saw in Case 5 above. Suppose we have a 6*6 image. We define a weight matrix to extract certain features from the image.

We initialize the weights as a 3*3 matrix. This weight matrix now slides over the image so that all pixels are covered at least once, producing a convolved output. The value 429 above is obtained by summing the element-wise product of the weight matrix and the highlighted 3*3 part of the input image.

Now the 6*6 image is converted into a 4*4 image. Imagine the weight matrix as a brush used to paint a wall: first it paints one row horizontally, then it moves down and paints the next row. As the weight matrix moves across the image, the pixel values are used again. In effect, this is what allows parameters to be shared in a convolutional neural network.
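Below is a minimal NumPy sketch of this operation; the image and weight values are random, and only the mechanics (a 3*3 weight matrix sliding over a 6*6 image with stride 1) match the description above:

import numpy as np

image = np.random.randint(0, 256, size=(6, 6))   # a made-up 6*6 single-channel image
weights = np.random.rand(3, 3)                   # a 3*3 weight matrix (filter)

out_size = image.shape[0] - weights.shape[0] + 1   # 6 - 3 + 1 = 4
output = np.zeros((out_size, out_size))

# Slide the 3*3 weights over the image one pixel at a time (stride 1).
for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i + 3, j:j + 3]           # the highlighted 3*3 region
        output[i, j] = np.sum(patch * weights)    # element-wise product, then sum

print(output.shape)   # (4, 4) -> the 6*6 image becomes a 4*4 convolved output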

Below we take a real image as an example.

The weight matrix acts like a filter in the image that extracts specific information from the original image matrix. One weight combination might be used to extract edge information, another might be used to extract a specific color, and the next might be used to blur unwanted noise.

The weights are learned such that the loss function is minimized, similar to a multi-layer perceptron (MLP). The parameters are therefore learned so as to extract information from the original image that helps the network make correct predictions. When we have multiple convolutional layers, the initial layers tend to extract more generic features, and as the network gets deeper, the features extracted by the weight matrices become more and more complex and more suited to the problem at hand.

The concept of stride and padding

As we have seen above, the filter or weight matrix moves across the image one pixel at a time. We can define this as a hyperparameter to indicate how we want the weight matrix to move across the image. If the weight matrix moves one pixel at a time, we call its stride 1. Let's look at what happens when the stride is 2.

You can see that as we increase the stride value, the size of the image keeps getting smaller. Padding the input image with a zero border can solve this problem. We can also add more than one layer of zero border around the image for high stride values.

We can see how the original shape of the image is maintained after we add a layer of zero padding to the image. Since the output image is the same size as the input image, this is called same padding.

If instead we only consider the valid pixels of the input image and add no padding, this is called valid padding, and the output shrinks as we saw earlier. With same padding, the middle 4*4 pixels are the same as before, but we also retain more information from the borders and preserve the original size of the image.
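A quick Keras sketch (the layer arguments are my own illustration, not the article's code) contrasting a stride of 2 with same padding on a 6*6 single-channel input:

from keras.models import Sequential
from keras.layers import Conv2D

# Stride 2 with valid padding: the 6*6 input shrinks to 2*2.
m1 = Sequential([Conv2D(1, (3, 3), strides=(2, 2), padding='valid', input_shape=(6, 6, 1))])
print(m1.output_shape)   # (None, 2, 2, 1)

# Stride 1 with same padding: the zero border keeps the output at 6*6.
m2 = Sequential([Conv2D(1, (3, 3), strides=(1, 1), padding='same', input_shape=(6, 6, 1))])
print(m2.output_shape)   # (None, 6, 6, 1)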

Multiple filters and activation maps

It is important to remember that the depth dimension of the weights is the same as the depth dimension of the input image. The weights extend across the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension. In most cases, instead of using a single filter (weight matrix), multiple filters of the same dimension are applied.

The output of each filter is stacked together to form the depth dimension of the convolution image. Suppose we have an input of 32*32*3. We use 10 filters of 5*5*3 with valid padding. The output dimension will be 28*28*10.

As shown in the following figure:

Activation maps are the output of the convolutional layer.
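This shape can be verified with a small Keras sketch (again, my own illustration rather than the article's code):

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
# 10 filters of size 5*5 applied to a 32*32*3 input with valid padding.
# Each filter implicitly spans the full depth of the input (5*5*3).
model.add(Conv2D(filters=10, kernel_size=(5, 5), padding='valid', input_shape=(32, 32, 3)))

print(model.output_shape)   # (None, 28, 28, 10) -> one 28*28 activation map per filter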

Pooling Layer

Sometimes the images are so large that we need to reduce the number of trainable parameters, which requires introducing pooling layers periodically between subsequent convolutional layers. The sole purpose of pooling is to reduce the spatial size of the image. Pooling is done independently in each depth dimension, so the depth of the image remains constant. The most common form of pooling layer is max pooling.

Here, we set the stride to 2 and the pooling size to 2. The max pooling is also applied to the depth dimension of each convolution output. As you can see, after the max pooling operation, the output of the 4*4 convolution becomes 2*2.
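A quick sketch of 2*2 max pooling with stride 2, applied to a made-up 4*4 activation map:

import numpy as np

conv_output = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 6, 8]])   # made-up 4*4 convolution output

pooled = np.zeros((2, 2))
# Take the maximum of each non-overlapping 2*2 block (pool size 2, stride 2).
for i in range(2):
    for j in range(2):
        block = conv_output[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        pooled[i, j] = block.max()

print(pooled)
# [[6. 4.]
#  [7. 9.]]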

Let's see how max pooling works on real images.

As you can see, we convolved the image and max pooled it. The max pooled image still retains the information of the car on the street. If you look closely, you will see that the size of the image has been halved. This can significantly reduce the number of parameters.

Similarly, other forms of pooling can also be applied in the system, such as average pooling and L2 norm pooling.

Output Dimensions

It can be a bit difficult to understand the size of the input and output of each convolutional layer. The following three points may give you some understanding of the output size issue. There are three hyperparameters that control the size of the output volume.

1. Number of filters - The depth of the output volume is proportional to the number of filters. Remember how the output of each filter is stacked to form an activation map. The depth of the activation map is equal to the number of filters.

2. Stride - If the stride is 1, the filter moves across the image one pixel at a time. A higher stride means the filter jumps over more pixels on each move, resulting in a smaller output.

3. Zero padding - This helps us preserve the size of the input image. With a single layer of zero padding, a filter moving with a stride of 1 keeps the output at the original image size.

We can apply a simple formula to calculate the output size. The spatial size of the output image can be calculated as ([W - F + 2P] / S) + 1. Here, W is the input size, F is the filter size, P is the amount of padding, and S is the stride. Suppose we have an input image of 32*32*3, and we use 10 filters of size 3*3*3, with a stride of 1 and no padding.

Then W=32, F=3, P=0, S=1. The output depth is equal to the number of filters applied, which is 10, and the output size is ([32-3+0]/1)+1 = 30. So the output size is 30*30*10.
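The formula is easy to wrap in a small helper; the numbers below reproduce the worked example above:

def conv_output_size(W, F, P, S):
    # Spatial size of the convolution output: ((W - F + 2P) / S) + 1.
    return (W - F + 2 * P) // S + 1

# 32*32*3 input, 3*3*3 filters, stride 1, no padding
print(conv_output_size(W=32, F=3, P=0, S=1))   # 30 -> with 10 filters the output is 30*30*10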

Output Layer

After multiple layers of convolution and padding, we need the output in the form of classes. The convolutional and pooling layers only extract features and reduce the number of parameters from the original image. To generate the final output, however, we need to apply a fully connected layer that produces an output equal to the number of classes we need. This is difficult to achieve with convolutional layers alone: convolutional layers generate 3D activation maps, whereas we simply need an output such as whether or not the image belongs to a particular class. The output layer uses a loss function such as categorical cross-entropy to compute the prediction error. Once forward propagation is complete, backpropagation begins to update the weights and biases so as to reduce the error and the loss.

4. Summary

As you can see, CNN is composed of different convolutional layers and pooling layers. Let's see what the whole network looks like:

  • We pass the input image to the first convolutional layer, which outputs an activation map. The features extracted by the filters in this layer are passed on to the next layer.
  • Each filter extracts different features that help in making the correct class prediction. If we need to keep the image size unchanged, we use same padding (zero padding); otherwise valid padding is used, since it helps reduce the number of features.
  • Pooling layers are then added to further reduce the number of parameters.
  • Before the final prediction is made, the data passes through several convolution and pooling layers. The convolutional layers help extract features; the deeper the network, the more specific the extracted features, while shallower layers extract more generic features.
  • As mentioned earlier, the output layer in a CNN is a fully connected layer, where the input from the other layers is flattened and fed in, so that the output can be transformed into the number of classes the network requires.
  • The output layer then generates predictions, which are compared with the target outputs to compute the error. A loss function (here, a mean squared loss) is defined on the fully connected output layer, and the gradient of the error is then calculated.
  • Errors are back-propagated to continuously improve the filters (weights) and bias values.
  • One training cycle is completed by a single forward and backward pass.

5. Image classification using CNN in Keras

Let's try this out by feeding in pictures of cats and dogs and asking the computer to identify them. This is a classic image recognition and classification problem, where the machine needs to see the image and recognize the different physical features of a cat and a dog. These features can be things like the outline of a shape or the whiskers of a cat, and the convolutional layers will capture them. Let's try it with our dataset.

The following images are from the dataset.

We first need to resize these images so that they are of the same shape. This is usually done before processing images, because when taking a photo, it is almost impossible to make all the images the same size.

To keep things easy to understand, we use only one convolutional layer and one pooling layer here. Note: such a simple architecture would not be used in real applications of CNNs.

#import various packages

import os
import numpy as np
import pandas as pd
import scipy
import sklearn
import keras
from keras.models import Sequential
import cv2
from skimage import io
%matplotlib inline

#Defining the File Path

cat=os.listdir("/mnt/hdd/datasets/dogs_cats/train/cat")
dog=os.listdir("/mnt/hdd/datasets/dogs_cats/train/dog")
filepath="/mnt/hdd/datasets/dogs_cats/train/cat/"
filepath2="/mnt/hdd/datasets/dogs_cats/train/dog/"

#Loading the Images

images=[]
label = []
for i in cat:
    image = io.imread(filepath + i)   # scipy.misc.imread is removed in newer SciPy; skimage.io.imread is used instead
    images.append(image)
    label.append(0)   # for cat images

for i in dog:
    image = io.imread(filepath2 + i)
    images.append(image)
    label.append(1)   # for dog images

#resizing all the images

for i in range(len(images)):
    images[i] = cv2.resize(images[i], (300, 300))

#converting images to arrays

images = np.array(images)
label = np.array(label)

# Defining the hyperparameters

filters=10
filtersize=(5,5)

epochs = 5
batchsize=128

input_shape=(300,300,3)

#Converting the target variable to the required size

from keras.utils import to_categorical
label = to_categorical(label)

#Defining the model

model = Sequential()

model.add(keras.layers.InputLayer(input_shape=input_shape))

model.add(keras.layers.Conv2D(filters, filtersize, strides=(1, 1), padding='valid', data_format="channels_last", activation='relu'))
model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
model.add(keras.layers.Flatten())

model.add(keras.layers.Dense(units=2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(images, label, epochs=epochs, batch_size=batchsize,validation_split=0.3)

model.summary()

In this model, I used only a single convolutional layer and a single pooling layer, and there are 219,801 trainable parameters. I was curious how many parameters there would be if I had used an MLP in this case. You can further reduce the number of parameters by adding more convolutional and pooling layers; the more convolutional layers we add, the more specific and complex the extracted features become.

Conclusion

Hopefully, this article has given you an idea of what Convolutional Neural Networks are. This article does not go into the intricate mathematics of CNNs. If you want to learn more, you can try building your own Convolutional Neural Network to understand how it works and makes predictions.

This article is reproduced from Synced; the original text comes from Analytics Vidhya and was written by Dishashree Gupta.
