1. Description

In the past two years the live streaming industry has grown rapidly, with more than 300 live streaming platforms in total. Some platforms focus on shows and entertainment (Laifeng Live), some on games (Panda Live), and some on sports events (LeTV Live). The figure below gives a general classification of domestic live streaming platforms. I had the honor of taking part in the development of Laifeng's Android live streaming client, and in the spirit of sharing I am writing a series of articles introducing Android mobile live streaming: on the one hand to help readers understand the technologies involved, and on the other hand as a summary of my own work over this period.

2. Overall Process

If you were asked to develop an Android live streaming application right away, you might have no idea where to start, so it is important to first get an overall picture of the mobile live streaming pipeline. What mobile live streaming needs to achieve is essentially this: process the video and audio collected by the phone and send them to the server in an agreed format. The whole process is as follows:

3. Collection

Collection covers two aspects: video and audio. Video is captured through the camera, which involves camera operations and camera parameter settings; because cameras differ between phone manufacturers there are quite a few pitfalls here, which will be covered in the follow-up article on the camera. Audio is captured through the microphone. Microphones on different phones support different sampling rates, and sometimes echo cancellation has to be applied to the audio, for example to support co-hosting (connecting a guest over the microphone). A minimal capture sketch follows the key points below.

Key points of video acquisition technology:

Key points of audio acquisition technology:
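To make the audio side concrete, here is a minimal sketch of PCM capture with AudioRecord. The 44.1 kHz mono 16-bit parameters are assumed values, the read loop is reduced to a single call, and the RECORD_AUDIO permission is required.

```java
// A minimal sketch of PCM audio capture with AudioRecord.
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

public class AudioCapture {
    public void capture() {
        int sampleRate = 44100;                                  // assumed sampling frequency
        int channelConfig = AudioFormat.CHANNEL_IN_MONO;         // assumed channel count
        int audioFormat = AudioFormat.ENCODING_PCM_16BIT;        // assumed sampling bits
        int minBufferSize = AudioRecord.getMinBufferSize(sampleRate, channelConfig, audioFormat);

        AudioRecord recorder = new AudioRecord(
                MediaRecorder.AudioSource.MIC,
                sampleRate, channelConfig, audioFormat, minBufferSize);

        byte[] buffer = new byte[minBufferSize];
        recorder.startRecording();
        // Each read returns raw PCM data, which later goes to processing and encoding.
        int read = recorder.read(buffer, 0, buffer.length);
        // ... in practice this runs in a loop on a capture thread ...
        recorder.stop();
        recorder.release();
    }
}
```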
4. Processing

Video processing

A beauty filter is now almost a standard feature of mobile live streaming software: with it the host looks better on camera and is more attractive to viewers. Some Android live streaming applications can also run face detection on the host and then add fun animated effects, and sometimes a watermark needs to be added to the video.

Beautification and special effects are all done through OpenGL. Android provides GLSurfaceView, which is similar to SurfaceView but is rendered through a Renderer. A texture can be generated with OpenGL, a SurfaceTexture can be created from that texture id, and the SurfaceTexture can then be handed to the Camera, so the camera preview is connected to OpenGL through the texture and a whole series of operations can be performed on it. The beautification process itself is essentially this: use OpenGL's FBO (framebuffer object) technique to generate a new texture from the texture the Camera previews into, and then draw with the new texture in the Renderer's onDrawFrame(). Adding a watermark means converting an image into a texture first and then drawing it with OpenGL. Adding dynamic sticker effects is more complicated: the current preview image is first analyzed to locate the corresponding parts of the face, and the corresponding images are then drawn onto each part; the whole thing is somewhat difficult to implement. The figure below shows the flow of the beautification process, and the images after it show the beauty and animation effects. A minimal sketch of wiring the camera preview to an OpenGL texture is given at the end of this section.

Audio processing

In some cases the host needs to add extra sounds to liven up the broadcast, such as applause. One way is simply to play the extra sound so that the microphone picks it up and records it together with the voice, but this does not work when the host wears headphones or when echo cancellation has to be applied to the sound. Since we have not added this feature to our project, we have no experience to share for the time being; we may add it later and write about it then.
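Here is the sketch mentioned above: generating an OES texture, wrapping it in a SurfaceTexture and handing it to the (legacy) Camera API described in this article. The class name is illustrative, and the GLSurfaceView/Renderer setup and error handling are omitted.

```java
// A minimal sketch of wiring the camera preview to OpenGL through a texture.
import android.graphics.SurfaceTexture;
import android.hardware.Camera;
import android.opengl.GLES11Ext;
import android.opengl.GLES20;
import java.io.IOException;

public class CameraTextureBinder {
    private SurfaceTexture surfaceTexture;

    // Must run on the GL thread (e.g. in Renderer.onSurfaceCreated()).
    public int bindCamera(Camera camera) throws IOException {
        // 1. Generate an OES texture that the camera can render into.
        int[] tex = new int[1];
        GLES20.glGenTextures(1, tex, 0);
        GLES20.glBindTexture(GLES11Ext.GL_TEXTURE_EXTERNAL_OES, tex[0]);
        GLES20.glTexParameterf(GLES11Ext.GL_TEXTURE_EXTERNAL_OES,
                GLES20.GL_TEXTURE_MIN_FILTER, GLES20.GL_LINEAR);
        GLES20.glTexParameterf(GLES11Ext.GL_TEXTURE_EXTERNAL_OES,
                GLES20.GL_TEXTURE_MAG_FILTER, GLES20.GL_LINEAR);

        // 2. Wrap the texture in a SurfaceTexture and hand it to the Camera.
        surfaceTexture = new SurfaceTexture(tex[0]);
        camera.setPreviewTexture(surfaceTexture);
        camera.startPreview();

        // 3. In onDrawFrame(), call surfaceTexture.updateTexImage() and then draw
        //    (or run the beauty-filter FBO pass) using this texture id.
        return tex[0];
    }
}
```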
5. Encoding

Through the camera and microphone we can collect the corresponding video and audio data, but this is raw data in fixed formats: generally the camera produces images frame by frame and the microphone produces PCM audio. If this data were sent directly, the volume would be very large and a lot of bandwidth would be wasted, so video and audio usually need to be encoded before being sent.

Video encoding

1. Predictive coding

As we all know, an image is made up of many points called pixels, and a large body of statistics shows that pixels within the same image are strongly correlated: the shorter the distance between two pixels, the stronger the correlation, or, put simply, the closer their values are. This correlation between pixels can therefore be exploited for compression, which is called intra-frame predictive coding. Moreover, the correlation between adjacent frames is generally even stronger than the correlation between pixels within a frame, and the achievable compression ratio is correspondingly greater. Video compression can thus be achieved by exploiting both the correlation within a frame (between pixels) and the correlation between frames, that is, by finding suitable reference pixels or reference frames to use as prediction values.

2. Transform coding

A large body of statistics also shows that most of the energy of a video signal lies in the DC and low-frequency components, i.e. the flat parts of the image, with only a small amount in the high-frequency components, i.e. the details. Another approach is therefore to apply a mathematical transform to the image and encode it in the transform domain (as shown in the figure), where u and v are the spatial frequency coordinates.

3. Waveform-based coding

Waveform-based coding uses a block-based hybrid scheme that combines predictive coding and transform coding. To reduce complexity and make video coding easier to perform, the hybrid approach first divides an image into fixed-size blocks, such as 8×8 blocks (8 rows, 8 pixels per row) or 16×16 blocks (16 rows, 16 pixels per row), and then compresses and codes the blocks. Since ITU-T released the first digital video coding standard, H.261, in 1989, it has gone on to release video coding standards such as H.263 and multimedia terminal standards such as H.320 and H.323. The Moving Picture Experts Group (MPEG) under ISO defined international standards for entertainment and digital television compression such as MPEG-1, MPEG-2 and MPEG-4. In March 2003 ITU-T published the H.264 video coding standard, which not only improves compression significantly compared with earlier standards but also has good network friendliness, especially for networks such as the IP Internet and wireless mobile networks that are prone to errors, congestion and weak QoS guarantees. All of these video coding standards use block-based hybrid coding, i.e. waveform-based coding.

4. Content-based coding

There is also a content-based coding technique, in which a video frame is first segmented into regions corresponding to different objects, which are then encoded separately: the shape, motion and texture of each object are coded. In the simplest case the shape of an object is described by a two-dimensional contour, its motion by a motion vector, and its texture by a color waveform. When the types of objects in the video sequence are known, knowledge-based or model-based coding can be used: for human faces, for example, predefined wireframe models exist for encoding facial features, and the coding efficiency is very high because only a few bits are needed to describe them. Facial expressions (such as anger or happiness) can even be coded semantically as behaviors; since the number of possible behaviors of an object is small, very high coding efficiency can be achieved. MPEG-4 uses both block-based hybrid coding and content-based coding.

5. Software and hardware encoding

On Android there are two ways to implement video encoding: software encoding and hardware encoding. Software encoding relies on the CPU and uses its computing power to encode; for example, we can build the x264 encoding library, write the corresponding JNI interface, and pass in the image data, and the x264 library turns the raw images into H.264 video. Hardware encoding uses the MediaCodec API provided by Android itself. MediaCodec has to be fed data, which can be YUV image data or a Surface. Using a Surface is generally recommended because it is more efficient: a Surface uses the native video buffers directly without mapping or copying them into ByteBuffers. When using a Surface you usually cannot access the raw video data directly, but you can use the ImageReader class to access the (non-secure) decoded raw video frames. This can still be more efficient than using ByteBuffers, because some of the native buffers can be mapped into direct ByteBuffers. When using ByteBuffer mode, you can use the Image class and getInput/OutputImage(int) to access the raw video frames.
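As a rough illustration of the hardware path, here is a minimal sketch of configuring a MediaCodec H.264 encoder with a Surface input. Resolution, bitrate, frame rate and key-frame interval are assumed values, and draining the encoded output buffers is omitted.

```java
// A minimal sketch of a hardware H.264 encoder with Surface input.
import android.media.MediaCodec;
import android.media.MediaCodecInfo;
import android.media.MediaFormat;
import android.view.Surface;

public class VideoHardEncoder {
    public MediaCodec createEncoder() throws Exception {
        MediaFormat format = MediaFormat.createVideoFormat("video/avc", 1280, 720);
        format.setInteger(MediaFormat.KEY_COLOR_FORMAT,
                MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface);
        format.setInteger(MediaFormat.KEY_BIT_RATE, 2_000_000);   // assumed bitrate
        format.setInteger(MediaFormat.KEY_FRAME_RATE, 25);        // assumed frame rate
        format.setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 2);   // a key frame every 2 seconds

        MediaCodec encoder = MediaCodec.createEncoderByType("video/avc");
        encoder.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE);

        // With Surface input, the camera/OpenGL pipeline renders into this surface and
        // the encoder consumes it directly, avoiding a copy into ByteBuffers.
        Surface inputSurface = encoder.createInputSurface();
        encoder.start();
        // ... render frames into inputSurface, then drain H.264 from the output buffers ...
        return encoder;
    }
}
```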
Audio coding

On Android, AudioRecord can be used to record sound, and what it records is PCM audio. To represent sound in a computer it has to be digitized, and the most common way to digitize sound is Pulse Code Modulation (PCM). Sound entering the microphone is converted into a series of voltage-change signals, and turning such a signal into PCM uses three parameters to describe the sound: the number of channels, the number of sampling bits, and the sampling frequency.

1. Sampling frequency

The sampling frequency is the number of sound samples taken per second. The higher the sampling frequency, the better the sound quality and the more faithful the reproduction, but also the more resources it consumes. Since the resolution of the human ear is limited, frequencies that are too high cannot be distinguished anyway. 16-bit sound cards typically offer levels such as 22 kHz and 44 kHz; 22 kHz is roughly the quality of ordinary FM radio and 44 kHz roughly CD quality. Commonly used sampling frequencies do not exceed 48 kHz.

2. Number of sampling bits

This is the quantization of each sample's amplitude. It measures the dynamic range of the sound and can also be thought of as the resolution of the sound card: the larger the value, the finer the resolution and the more accurately the sound can be represented. In computers the number of sampling bits is generally 8 or 16. Note that 8 bits does not mean dividing the amplitude axis into 8 parts, but into 2 to the 8th power, i.e. 256 parts; likewise 16 bits divides the amplitude axis into 2 to the 16th power, i.e. 65536 parts.

3. Number of channels

This one is easy to understand: there is mono and stereo. Mono can only be reproduced by one speaker (sometimes it is also processed so that two speakers output the same channel), while stereo PCM lets both speakers produce sound (generally with different content on the left and right channels), giving a much stronger sense of space. With these three parameters we can now write down the formula for the size of a PCM file:
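Capacity (bytes) = sampling frequency (Hz) × (sampling bits ÷ 8) × number of channels × duration (seconds). For example, one second of 44.1 kHz, 16-bit, stereo PCM occupies 44100 × 2 × 2 = 176,400 bytes, i.e. roughly 176 KB per second, or about 1.4 Mbps.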
If all audio were transmitted as raw PCM it would occupy a lot of bandwidth, so audio also needs to be encoded before transmission. There are a number of widely used sound formats, such as WAV, MIDI, MP3, WMA, AAC and Ogg. Compared with raw PCM these formats compress the audio data and reduce the transmission bandwidth. Audio encoding can likewise be done in software or in hardware: software encoding means building the corresponding encoding library, writing the JNI layer and passing in the data to be encoded, while hardware encoding uses Android's own MediaCodec.

6. Packaging

During transmission the audio and video must be wrapped in an agreed format so that the other end can parse them correctly.

1. HTTP-FLV

In the Web 2.0 era, the most popular websites were video sites: YouTube abroad, Youku and Tudou in China. The video content these sites provide each has its own strengths, but without exception they all used Flash as the playback carrier, and the technical basis supporting them was Flash Video (FLV). FLV is a streaming video format that uses the Flash Player platform, already ubiquitous in web pages, to embed video into Flash animations; in other words, as long as visitors can view Flash animations they can also watch FLV video without installing any other plug-in, which made FLV very convenient for video distribution. HTTP-FLV encapsulates the audio and video data into FLV and transmits it to the client over HTTP; as the uploader you only need to send the audio and video to the server in FLV format. Generally, in an FLV stream the video is H.264 and the audio is AAC-LC. An FLV stream transmits the FLV header first, then the metadata carrying the video and audio parameters, then the video and audio parameter information, and then the video and audio data themselves (a rough sketch of the FLV tag layout appears at the end of this section).

2. RTMP

RTMP stands for Real Time Messaging Protocol. It is based on TCP and is a protocol family that includes the basic RTMP protocol and several variants such as RTMPT, RTMPS and RTMPE. RTMP is a network protocol designed for real-time data communication, mainly used for audio, video and data communication between the Flash/AIR platform and streaming or interactive servers that support RTMP. It is a real-time transport protocol introduced by Adobe and is mainly used for real-time transmission of FLV-based audio and video streams. After the encoded audio and video data is obtained, it is first packaged into FLV format and then wrapped into RTMP messages for transmission. To stream over RTMP you first connect to the server, then create a stream, then publish the stream, and then send the audio and video data. The whole exchange is defined in terms of messages: RTMP defines various kinds of messages, and to transmit them smoothly the messages are further split into chunks, which makes the protocol fairly complex. There are also other protocols such as RTP; the principles are essentially the same, so I will not go through them one by one.
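To make the container a little more concrete, here is a rough sketch of packaging one FLV tag (the 11-byte tag header, the payload, and the trailing PreviousTagSize field). The class and method names are illustrative, and the payload is assumed to be an already packaged H.264 or AAC packet.

```java
// A rough sketch of writing a single FLV tag: 11-byte header + payload + PreviousTagSize.
import java.io.ByteArrayOutputStream;

public class FlvTagWriter {
    public static final int TAG_TYPE_AUDIO = 8;
    public static final int TAG_TYPE_VIDEO = 9;
    public static final int TAG_TYPE_SCRIPT = 18; // metadata (onMetaData)

    public static byte[] writeTag(int tagType, byte[] payload, int timestampMs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(tagType);                              // TagType: 8 audio, 9 video, 18 script
        writeUInt24(out, payload.length);                // DataSize
        writeUInt24(out, timestampMs & 0xFFFFFF);        // Timestamp (lower 24 bits, ms)
        out.write((timestampMs >>> 24) & 0xFF);          // TimestampExtended (upper 8 bits)
        writeUInt24(out, 0);                             // StreamID, always 0
        out.write(payload, 0, payload.length);           // tag data (e.g. AVC or AAC packet)

        int previousTagSize = 11 + payload.length;       // header + data
        out.write((previousTagSize >>> 24) & 0xFF);
        out.write((previousTagSize >>> 16) & 0xFF);
        out.write((previousTagSize >>> 8) & 0xFF);
        out.write(previousTagSize & 0xFF);
        return out.toByteArray();
    }

    private static void writeUInt24(ByteArrayOutputStream out, int value) {
        out.write((value >>> 16) & 0xFF);
        out.write((value >>> 8) & 0xFF);
        out.write(value & 0xFF);
    }
}
```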
7. Handling Poor Networks

On a good network, audio and video can be sent out in time without piling up locally, and the live stream is smooth with low latency. On a poor network the audio and video data cannot be sent out fast enough, so it has to be managed. There are generally four techniques for this: buffer design, network probing, frame dropping, and bitrate reduction.

1. Buffer design

The audio and video data is pushed into a buffer, and the sender takes data from the buffer and sends it, which forms an asynchronous producer-consumer model: the producer only needs to push the captured and encoded audio and video data into the buffer, and the consumer takes it out and sends it. The figure above only shows video frames, but of course there are corresponding audio frames too. Java already provides a good class for building such an asynchronous producer-consumer model; since frames will later need to be dropped, inserted and removed, LinkedBlockingQueue is clearly a good choice (see the sketch at the end of this section).

2. Network probing

An important part of dealing with a poor network is detecting it: if a deteriorating network can be detected quickly and handled accordingly, the system responds more sensitively and the result is much better. We measure, in real time every second, how much data enters the buffer and how much is actually sent. If less is sent than enters the buffer, the bandwidth is insufficient, the data in the buffer will keep growing, and the corresponding countermeasures must be triggered.

3. Frame dropping

When the network is detected to be deteriorating, dropping frames is a good countermeasure. After encoding, video consists of key frames and non-key frames: a key frame is a complete picture, while a non-key frame describes the change relative to a reference frame. There are many possible dropping strategies and you can define your own, but one thing to note is that if you drop P frames (non-key frames), you must drop all the non-key frames between two key frames, otherwise the picture will show mosaic artifacts. The exact strategy depends on your needs.

4. Bitrate reduction

On Android, if hardware encoding is used, we can change the encoder's bitrate on the fly in a poor network to keep the stream smooth: when a poor network is detected we can lower the video and audio bitrate while dropping frames. On Android SDK 19 and above, the output bitrate of the hardware encoder can be changed by passing parameters to MediaCodec.
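Below is a minimal sketch combining the buffer with two of the countermeasures: a LinkedBlockingQueue-based send buffer with a naive strategy that drops the non-key frames of one GOP, plus a helper that lowers a hardware encoder's bitrate through MediaCodec.setParameters() (API 19+). The Frame class, the threshold and the drop strategy are illustrative, not our production design.

```java
// A minimal sketch of a send buffer with GOP-aware frame dropping and bitrate reduction.
import android.media.MediaCodec;
import android.os.Bundle;
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;

public class SendBuffer {
    public static class Frame {
        public final byte[] data;
        public final boolean isKeyFrame;
        public Frame(byte[] data, boolean isKeyFrame) {
            this.data = data;
            this.isKeyFrame = isKeyFrame;
        }
    }

    private static final int DROP_THRESHOLD = 60;  // assumed queue-length trigger
    private final LinkedBlockingQueue<Frame> queue = new LinkedBlockingQueue<>();

    // Producer side: called after encoding.
    public void push(Frame frame) {
        if (queue.size() > DROP_THRESHOLD) {
            dropGop();
        }
        queue.offer(frame);
    }

    // Drop all non-key frames up to the next key frame, so the decoder never
    // receives a P frame whose reference was discarded (avoids mosaic artifacts).
    private void dropGop() {
        Iterator<Frame> it = queue.iterator();
        boolean dropping = false;
        while (it.hasNext()) {
            Frame f = it.next();
            if (!f.isKeyFrame) {
                it.remove();
                dropping = true;
            } else if (dropping) {
                break; // stop at the next key frame
            }
        }
    }

    // Consumer side: the sender thread blocks here until data is available.
    public Frame take() throws InterruptedException {
        return queue.take();
    }

    // Bitrate reduction (API 19+): ask a running MediaCodec encoder for a lower bitrate.
    public static void requestBitrate(MediaCodec encoder, int bitsPerSecond) {
        Bundle params = new Bundle();
        params.putInt(MediaCodec.PARAMETER_KEY_VIDEO_BITRATE, bitsPerSecond);
        encoder.setParameters(params);
    }
}
```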
8. Sending

After all this processing the data finally has to be sent out, which is a relatively simple step. Whether HTTP-FLV or RTMP is used, the connection is established over TCP. Before the broadcast begins, a Socket connection to the server is made to verify that the server is reachable; after connecting, the same Socket is used to send data to the server, and the Socket is closed when the transmission ends.
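As a rough illustration of this step, here is a minimal sketch that connects, writes an already packaged buffer and closes the socket. The host, port and payload are placeholders; a real pusher keeps the socket open and writes from the sender thread in a loop.

```java
// A minimal sketch of the sending step over a plain TCP socket.
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class StreamSender {
    public static void send(String host, int port, byte[] packet) throws Exception {
        try (Socket socket = new Socket()) {
            // Connect first to verify that the server is reachable (5 s timeout assumed).
            socket.connect(new InetSocketAddress(host, port), 5000);
            OutputStream out = socket.getOutputStream();
            out.write(packet);   // in a real pusher this runs in a loop on the sender thread
            out.flush();
        }                        // the socket is closed when sending is finished
    }
}
```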