Building a simple logistic regression model from scratch using TensorFlow

TensorFlow is a machine learning framework with a Python API. After finishing the logistic regression material in the Coursera course, I wanted to re-implement in TensorFlow what I had previously implemented in MATLAB, as a stepping stone to learning both Python and the framework.

Target audience

You know what logistic regression is, know a little Python, and have heard of TensorFlow.

Dataset

ex2data1.txt from Andrew Ng's machine learning course on Coursera. It is used to predict whether a student will be admitted based on his or her two exam scores.

Environment

Python 2.7 - 3.x

pandas, matplotlib, numpy

Install TensorFlow

Install the TensorFlow framework on your computer; the installation process is not covered here. The CPU version is easier to install, while the GPU version requires CUDA support. Choose whichever suits your needs.

Getting started

Create a folder (for example, called tensorflow), create a Python file main.py in the folder, and put the dataset file in this folder:

Data format:

The first two columns are the scores of the two exams (x1, x2), and the last column indicates whether the student was admitted (y): 1 means admitted and 0 means not admitted.
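
To make the format concrete, here is a minimal sketch that parses a few rows laid out this way using only the standard library. The three sample rows are made-up values for illustration; they are not lines copied from ex2data1.txt.

```python
import csv
import io

# Hypothetical rows in the same layout:
# exam 1 score, exam 2 score, admitted (1) or not (0).
sample_text = """35.5,78.2,0
60.1,86.3,1
79.0,75.4,1
"""

rows = []
for x1, x2, label in csv.reader(io.StringIO(sample_text)):
    rows.append((float(x1), float(x2), int(label)))

for x1, x2, label in rows:
    status = "admitted" if label == 1 else "not admitted"
    print("exam 1: {:.1f}, exam 2: {:.1f} -> {}".format(x1, x2, status))
```

The article itself reads the real file with pandas; this sketch only shows what each column means.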

In the source file main.py, we first import the required packages:

import pandas as pd  # used to read data files
import tensorflow as tf
import matplotlib.pyplot as plt  # for drawing
import numpy as np  # for subsequent calculations

Pandas is a data processing package that can read and perform various other operations on data sets; matplotlib can be used to plot our data sets into charts.

Then we read the dataset file into the program for subsequent training:

# Read data file
df = pd.read_csv("ex2data1.txt", header=None)
train_data = df.values

The pandas function read_csv reads the data in a CSV (comma-separated values) file into the df variable, and df.values converts the DataFrame into a two-dimensional array.

After we have the data, we need to put the features (x1, x2) and labels (y) into two variables respectively so that we can substitute them into the formula during training:

# Separate features and labels, and get data dimensions
train_X = train_data[:, :-1]
train_y = train_data[:, -1:]
feature_num = len(train_X[0])
sample_num = len(train_X)
print("Size of train_X: {}x{}".format(sample_num, feature_num))
print("Size of train_y: {}x{}".format(len(train_y), len(train_y[0])))


The printed sizes show that there are 100 samples in our data set and that each sample has 2 features.
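
The two slices used above differ in a way that is easy to miss, so here is a small NumPy sketch of what each one returns. The 3-row array is hypothetical data in the same layout as the real file.

```python
import numpy as np

# Hypothetical 3-sample data set: two exam scores followed by the label.
data = np.array([
    [35.5, 78.2, 0.0],
    [60.1, 86.3, 1.0],
    [79.0, 75.4, 1.0],
])

features = data[:, :-1]    # every column except the last
labels_2d = data[:, -1:]   # a slice keeps the column axis
labels_1d = data[:, -1]    # an integer index drops the column axis

print(features.shape, labels_2d.shape, labels_1d.shape)
```

Writing `-1:` rather than `-1` keeps the labels as a column, so they line up row by row with predictions that are also stored as a column.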

TensorFlow model design

In logistic regression, the prediction function (Hypothesis) we use is:

$$h_\theta(x) = \mathrm{sigmoid}(XW + b)$$

Here, sigmoid is the activation function; its output is interpreted as the probability that a student is admitted:

$$P(y = 1 \mid x, \theta)$$

The sigmoid function is an S-shaped curve that maps any real number into the interval (0, 1); plots of it are easy to find online.

W and b are the parameters we want to learn. W is the weight matrix (Weights), and b is the bias (Bias, also called the intercept).
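
For readers who prefer numbers to a plot, the following sketch evaluates the sigmoid at a few points. It uses the standard definition 1 / (1 + e^(-z)); the function name and the sample inputs are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number into the open interval (0, 1)."""
    # Fine for moderate z; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))

for z in (-6, -2, 0, 2, 6):
    print("sigmoid({:>2}) = {:.4f}".format(z, sigmoid(z)))
```

The value is 0.5 at z = 0, approaches 0 for large negative z and approaches 1 for large positive z, which is why the output can be read as a probability.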

The loss function we use is:

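$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right)\right]$$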

Since our data set has only two features, there is no need to worry about overfitting, so the regularization term in the loss function is not needed.
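
As a sanity check on the formula, here is a plain-Python sketch that evaluates the loss with an explicit loop over the samples. The data and the helper name `logistic_loss` are hypothetical and not part of the article's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(samples, labels, weights, bias):
    """Average cross-entropy J(theta) over m samples, as an explicit loop."""
    m = len(samples)
    total = 0.0
    for x, y in zip(samples, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        h = sigmoid(z)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

# Hypothetical scores and labels, for illustration only.
samples = [(35.5, 78.2), (60.1, 86.3), (79.0, 75.4)]
labels = [0, 1, 1]

# With all parameters at zero every prediction is 0.5,
# so the loss equals log(2) regardless of the data.
loss_at_zero = logistic_loss(samples, labels, weights=[0.0, 0.0], bias=0.0)
print(loss_at_zero, math.log(2))
```

Starting from log(2) is a handy reference point: a trained model should end up with a loss below it.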

First, we use TensorFlow to define two variables to store our training data:

# Dataset
X = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

Here, X and y are not ordinary variables but placeholders: their values stay unspecified until you start training the model, at which point you feed the actual data into them.

Next, we define the W and b we want to train:

# Training target
W = tf.Variable(tf.zeros([feature_num, 1]))
b = tf.Variable([-.9])

Here, their type is Variable, which means their values keep changing during the training iterations until they reach the values we want. We initialize W to a zero vector of shape feature_num x 1 and b to -0.9 (an arbitrary choice, don't mind it 😶).
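
To see what these starting values imply, the NumPy sketch below computes the model's predictions before any training. The three score pairs are hypothetical; only the initial W and b match the values above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scores for three students (two features each).
X = np.array([[35.5, 78.2], [60.1, 86.3], [79.0, 75.4]])

W = np.zeros((2, 1))   # same initial value as tf.zeros([feature_num, 1])
b = -0.9               # same initial bias as above

# With W all zeros, XW vanishes and only the bias remains.
initial_predictions = sigmoid(np.dot(X, W) + b)
print(initial_predictions.ravel())
```

Every sample starts with the same predicted probability, sigmoid(-0.9), which is roughly 0.29. Training then has to move W away from zero for the scores to matter.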

Next, we need to express the loss function using TensorFlow:

db = tf.matmul(X, tf.reshape(W, [-1, 1])) + b
hyp = tf.sigmoid(db)

cost0 = y * tf.log(hyp)
cost1 = (1 - y) * tf.log(1 - hyp)
cost = (cost0 + cost1) / -sample_num
loss = tf.reduce_sum(cost)

As you can see, I express the loss function in three steps: first compute the two terms inside the sum separately, then add them and divide by the constant -m (the factor -1/m outside the sum), and finally sum the resulting vector to get the value of the loss function.
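
The same three steps can be mirrored in NumPy to check them outside of TensorFlow. This is a sketch on hypothetical data with arbitrary demo weights; the variable names follow the TensorFlow code above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: 3 samples, 2 features; labels kept as a (3, 1) column.
X = np.array([[35.5, 78.2], [60.1, 86.3], [79.0, 75.4]])
y = np.array([[0.0], [1.0], [1.0]])
W = np.array([[0.01], [0.02]])   # arbitrary non-zero weights for the demo
b = -1.5
m = len(X)

hyp = sigmoid(np.dot(X, W) + b)        # (3, 1) predictions

cost0 = y * np.log(hyp)                # step 1a: y * log(h)
cost1 = (1 - y) * np.log(1 - hyp)      # step 1b: (1 - y) * log(1 - h)
cost = (cost0 + cost1) / -m            # step 2: add, then divide by -m
loss = np.sum(cost)                    # step 3: sum the column

# The same quantity as a single expression, as a cross-check.
direct = -np.mean(y * np.log(hyp) + (1 - y) * np.log(1 - hyp))
print(loss, direct)
```

Note that y has to be a column here. With a one-dimensional y of shape (3,), broadcasting against the (3, 1) predictions would produce a 3 x 3 matrix and the sum would no longer be the loss.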

Next, we define the optimization method to use:

optimizer = tf.train.GradientDescentOptimizer(0.001)
train = optimizer.minimize(loss)

The first line selects the optimizer, here gradient descent; the second line specifies the optimization target. As the function name suggests, the goal is to minimize the value of the loss function.
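
Conceptually, each training step computes the gradient of the loss and moves the parameters a small distance against it. The NumPy sketch below performs one such update by hand, using the standard analytic gradient of this loss. TensorFlow derives gradients automatically, so this illustrates the idea rather than what the library executes internally; the data and the helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(X, y, W, b):
    h = sigmoid(np.dot(X, W) + b)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradients(X, y, W, b):
    """Gradient of the cross-entropy loss with respect to W and b."""
    error = sigmoid(np.dot(X, W) + b) - y    # (m, 1)
    grad_W = np.dot(X.T, error) / len(X)     # (n, 1)
    grad_b = np.mean(error)
    return grad_W, grad_b

# Hypothetical data, for illustration only.
X = np.array([[35.5, 78.2], [60.1, 86.3], [79.0, 75.4]])
y = np.array([[0.0], [1.0], [1.0]])
W = np.zeros((2, 1))
b = -0.9
learning_rate = 0.001

before = loss_fn(X, y, W, b)
grad_W, grad_b = gradients(X, y, W, b)
W = W - learning_rate * grad_W               # one gradient descent update
b = b - learning_rate * grad_b
after = loss_fn(X, y, W, b)
print(before, after)
```

On this toy data the loss after the update is lower than before it, which is all a single step is expected to achieve.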

Note: Keep the learning rate (0.001 here) small. With a larger rate, the parameters can overshoot so far that the sigmoid output rounds to exactly 0 or 1, and log(0) then appears in the loss calculation.
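
The sketch below shows how log(0) can arise. Because the exam scores are not rescaled, even a moderate weight produces a large input to the sigmoid. The weight value is hypothetical, standing in for what an oversized update might produce.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Exam scores go up to 100, so a modest weight already gives a large z.
score = 100.0
weight = 0.4            # hypothetical weight after an oversized update
z = weight * score      # 40.0

h = sigmoid(z)
print(h == 1.0)         # the prediction has rounded to exactly 1
print(1.0 - h)          # so log(1 - h) would be log(0)

try:
    math.log(1.0 - h)
except ValueError as err:
    print("log(0) is undefined:", err)
```

This example uses Python's 64-bit floats. The model above uses tf.float32, which has less precision, so the rounding to exactly 0 or 1 sets in at smaller values of z.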

Training

After completing the above work, we can start training our model.

In TensorFlow, we first need to initialize the previously defined Variable:

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

Here we meet tf.Session(), which is what actually executes the work. Everything we defined above only describes the steps a model needs in order to produce results, much like a flowchart. A flowchart alone is not enough; something has to run it, and that is the role of the Session.

----------Special Tips----------

If you are using the GPU version of TensorFlow and want to train the model while the graphics card is already heavily used (for example, by a game), you must assign TensorFlow a fixed share of video memory when creating the session. Otherwise training may fail immediately with an error like this:

2017-06-27 20:39:21.955486: E c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\stream_executor\cuda\cuda_blas.cc:365] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "C:\Users\DYZ\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
    return fn(*args)
  File "C:\Users\DYZ\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
    status, run_metadata)
  File "C:\Users\DYZ\Anaconda3\envs\tensorflow\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Users\DYZ\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMV launch failed: m=2, n=100
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_arg_Placeholder_0_0/_3, Reshape)]]

In that case, create the Session as follows:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

Here, 0.333 is the fraction of your total video memory that this process is allowed to use.

----------End Special Tips----------

Now we use our dataset to train the model:

feed_dict = {X: train_X, y: train_y}

for step in range(1000000):
    sess.run(train, feed_dict)
    if step % 100 == 0:
        print(step, sess.run(W).flatten(), sess.run(b).flatten())

First, we store the data to be fed in a variable, feed_dict, and pass it to sess.run() on every training step. The loop runs 1,000,000 training steps and prints the current values of the target parameters W and b every 100 steps.

At this point, the training code is complete and you can run it with your python command. If you followed the code above exactly and no errors occur, you should see the training status printed continuously to the console.

Graphical representation of results

Once training finishes, we have values for W and b, and we can display the data set together with the fitted result in a chart.

While writing this article, I trained the model with the code above and wrote the resulting parameters directly into the code:

w = [0.12888144, 0.12310864]
b = -15.47322273

Let's first represent the data set on a chart (x1 is the horizontal axis and x2 is the vertical axis):

x1 = train_data[:, 0]
x2 = train_data[:, 1]
y = train_data[:, -1:]

for x1p, x2p, yp in zip(x1, x2, y):
    if yp == 0:
        plt.scatter(x1p, x2p, marker='x', c='r')
    else:
        plt.scatter(x1p, x2p, marker='o', c='g')

Here, a red x marks a student who was not admitted and a green o marks one who was.

Next, we plot the decision boundary XW + b = 0 obtained through training on the graph:

# Get the straight line according to the parameters
x = np.linspace(20, 100, 10)
y = []
for i in x:
    # Solve w[0] * x1 + w[1] * x2 + b = 0 for x2
    y.append((-w[0] * i - b) / w[1])

plt.plot(x, y)
plt.show()

At this point, if your code is correct, run it again and you will get the following results:

As you can see, the parameters obtained through training give a straight line that separates the two classes of samples well.
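
Beyond plotting, the trained parameters can be used directly for prediction. The sketch below uses the w and b reported above to estimate an admission probability and to confirm that points on the decision boundary score exactly 0.5. The student scores and the function names are hypothetical.

```python
import math

# Parameters reported in the article.
w = [0.12888144, 0.12310864]
b = -15.47322273

def admission_probability(x1, x2):
    """Predicted probability of admission for exam scores x1 and x2."""
    z = w[0] * x1 + w[1] * x2 + b
    return 1.0 / (1.0 + math.exp(-z))

def boundary_x2(x1):
    """x2 on the decision boundary w[0]*x1 + w[1]*x2 + b = 0 for a given x1."""
    return (-w[0] * x1 - b) / w[1]

# Hypothetical student, not taken from the data set.
p = admission_probability(45.0, 85.0)
print("P(admitted) = {:.3f}".format(p))

# A point on the boundary has z = 0, so its probability is 0.5.
x1 = 60.0
on_boundary = admission_probability(x1, boundary_x2(x1))
print(on_boundary)
```

Points above the line get a probability greater than 0.5 and would be classified as admitted; points below it would not.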

At this point, a complete and simple logistic regression model has been implemented. I hope this article gives you a first understanding of how machine learning models are implemented in TensorFlow. I am still at the beginning of my own learning, so if anything here is wrong, please point it out in the comments. If you run into problems implementing the code above, feel free to ask there as well.
