Exploring TikTok Android package size optimization: extreme simplification of the resource binary format

Author: Zhang Zuqiao

Preface

Many package size optimization techniques already exist for Android. The previous article in this series covered the optimization of class bytecode. This article focuses on resource files and approaches package size from a new angle: the binary format of the resource files themselves.

For resource files, the common techniques are image and file compression, resource file name obfuscation, and moving resource files out of the package so they can be downloaded later. Our approach grew out of a closer analysis of these conventional techniques.

We started with resource file name obfuscation. The best-known open source project for this in the industry is AndResGuard, which optimizes the files under the resource directory res in the following ways:

  • Detect duplicate resource files by their MD5 values and keep only one copy;
  • Shorten the resource file names, i.e. obfuscate them;
  • Recompress the contents of the APK with 7zip.

Applying this project yields savings on the order of megabytes. Once it is in place, however, further optimization of resource files hits a bottleneck.

To reduce resource size further, we first need to understand the file types and size distribution under the resource directory res. Taking Douyin as an example, the following table summarizes the subfolder names, file counts, and zipped folder sizes, sorted in descending order by file count:

From the table above, we can see:

  • The drawable-xxhdpi-v4 directory holds more than 6000 files, about 19.5MB after compression.
  • The drawable directory ranks third by file count, with 4388 files (both images and .xml files), 4.6MB after compression.
  • The layout directories rank second and fourth by file count, with 5970 and 2985 files and compressed sizes of 12.2MB and 8.5MB respectively. In total there are nearly 9K layout files, about 20.7MB.

The layout files are therefore comparable in size to the image files. Beyond obfuscating file names, is there any other way to optimize such a large set of files? And is the file name obfuscation itself as thorough as it could be?

In addition, the resources.arsc file extracted from the APK is 7.3MB. It contains all the resource file names and resource string values of the app. Does it contain redundant strings?

With nearly 9K layout files totaling more than 20MB, the layout files are worth exploring. We analyzed the binary format of the resource files and found content that is never used and can be deleted. After repeated attempts, and after solving various stability and packaging compatibility issues, we arrived at a package size optimization for the Android ARSC/XML file formats. It has been deployed in Douyin and saves more than 2MB.

Next, this article will explain in depth the implementation details of the solution.

APK resource format optimization

Our starting point is shortening resource paths. Working on the final APK, we examine the binary format of resources.arsc and of the layout files, inspect their content structure, find unused strings that can be deleted, and optimize the file names or the string pools inside the files. There are two main optimization points.

Resource path shortening

Resource format modification

After integrating AndResGuard, the resource directory res is renamed to r, and the subfolders and file names under it are obfuscated as well, for example:

 res/anim/abc_fade_in.xml -> r/a/a.xml

This shortens the resource file paths and therefore reduces the package size. Can the paths be shortened further? If every file is placed directly in the r directory and the intermediate subfolder is dropped, both the path length and the number of zip entries go down, which should reduce the package size. The file name suffix can be removed at the same time, which shortens the path even more:

 r/a/a.xml -> r/a
 r/a/b.png -> r/b

Since modifying the resource file name requires modifying the resources.arsc file, here is an analysis of the file format of resources.arsc:

As you can see, it contains 3 string pools.

Suppose we have a resource file abc_fade_in.xml in the res/anim directory, and the information in the three string pools in the resources.arsc file is as follows:

  • Global string pool (string pool 1): holds the complete file path, i.e. res/anim/abc_fade_in.xml
  • Type string pool (string pool 2): holds the resource type name, which matches the name of the subdirectory under res, i.e. anim
  • Key string pool (string pool 3): holds the file name without its suffix, i.e. abc_fade_in

Two places are therefore tied to the resource file name: the full file path in the global string pool and the file name in the key string pool. To shorten the resource path, both must be modified together: res/anim/abc_fade_in.xml -> r/f in the global string pool, and abc_fade_in -> f in the key string pool. The two string pools that need to be modified in resources.arsc are marked by the arrows in the figure below:

However, after shortening the resource paths, we found that the package size had grown by 160K+!

Key constant pool pruning

Obfuscated file names are drawn from a set of strings that are valid file names, unique, and non-overlapping. The more names we need from this set, the longer the longest name becomes.

When the resource path is not shortened, each subdirectory can start again from the beginning of the obfuscated string set, so the names stored in the key string pool always stay as short as possible.

The corresponding file name string set is: [a, b, c, d, e]

When the path is shortened, all files live in the single folder r, so every file name must come from the same run of the obfuscated string set. The names in the key string pool gradually get longer, the path strings get longer with them, and the overall result grows. As shown in the following figure:

The corresponding file name string set is: [a, b, c, d, e, f, g, h, i, j]

In short, once all files sit in one folder r, file names can no longer be reused across subdirectories. Shortening the path shrinks the global string pool but enlarges the key string pool, because by default the key name has to match the file name.
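The effect described above can be reproduced with a small sketch. It assumes a hypothetical name generator that hands out a, b, ..., z, aa, ab, ... in order (real obfuscation dictionaries differ) and collects the distinct key strings needed when each directory restarts the sequence versus when all files share one sequence.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch only: nameFor() is a hypothetical generator, not the one used by any real tool.
class ObfuscatedNames {
    // Maps 0, 1, 2, ... to a, b, ..., z, aa, ab, ... (bijective base 26).
    static String nameFor(int index) {
        StringBuilder sb = new StringBuilder();
        int n = index + 1;
        while (n > 0) {
            n--;
            sb.append((char) ('a' + n % 26));
            n /= 26;
        }
        return sb.reverse().toString();
    }

    // Collects the distinct key names needed for the given per-directory file counts.
    // flat == false: every directory restarts from "a", so names are reused across directories.
    // flat == true: all files share folder r, so every file needs its own name.
    static Set<String> keyNames(int[] filesPerDirectory, boolean flat) {
        Set<String> keys = new LinkedHashSet<>();
        int next = 0;
        for (int count : filesPerDirectory) {
            if (!flat) {
                next = 0;
            }
            for (int i = 0; i < count; i++) {
                keys.add(nameFor(next++));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        int[] filesPerDirectory = {5, 5};
        System.out.println(keyNames(filesPerDirectory, false)); // [a, b, c, d, e]
        System.out.println(keyNames(filesPerDirectory, true));  // [a, b, c, d, e, f, g, h, i, j]
    }
}
```

With thousands of files in one flat folder, the shared sequence quickly moves on to two- and three-character names, which is where the extra key pool bytes come from.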

Question: in the resources.arsc file, do the key names have to match the file names, and are the key names needed at all?

In fact, after compilation, the places where resource files are used will be replaced with specific id values, such as:

 public class MainActivity extends AppCompatActivity {
     @Override
     protected void onCreate(Bundle savedInstanceState) {
         super.onCreate(savedInstanceState);
         setContentView(R.layout.activity_main);
         // => setContentView(0x7f0b001c); // Replaced with the id value
     }
 }

So each resource file name corresponds one-to-one with an integer id value. This raises a question: can the file path be found from the integer id value alone? If so, the lookup never touches the key string.

Based on this idea, we replaced every string in the key string pool with the single value "_" and found that the APK still ran normally. Removing the key strings does not appear to affect looking up file paths by integer id value at runtime.
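Why the id alone can be enough becomes clearer from how a resource id is laid out. The sketch below splits the id from the example above into its package, type, and entry parts (the 0xPPTTEEEE layout) and then resolves a file path in a deliberately simplified, hypothetical table model in which each entry stores both a key index and a value index. The class and field names are illustrative and do not mirror the framework's data structures.

```java
// Simplified model for illustration; not the framework's actual lookup code.
class ResourceIdLookup {
    static int packageId(int resId)  { return (resId >>> 24) & 0xFF; }
    static int typeId(int resId)     { return (resId >>> 16) & 0xFF; }
    static int entryIndex(int resId) { return resId & 0xFFFF; }

    // Hypothetical entry: keyIndex points into the key string pool,
    // valueIndex points into the global string pool (the file path).
    static final class Entry {
        final int keyIndex;
        final int valueIndex;

        Entry(int keyIndex, int valueIndex) {
            this.keyIndex = keyIndex;
            this.valueIndex = valueIndex;
        }
    }

    // Resolves the file path from the entry index only; keyIndex is never read.
    static String resolvePath(int resId, Entry[] entriesOfType, String[] globalPool) {
        return globalPool[entriesOfType[entryIndex(resId)].valueIndex];
    }

    public static void main(String[] args) {
        int resId = 0x7f0b001c;
        System.out.printf("package=0x%02x type=0x%02x entry=0x%04x%n",
                packageId(resId), typeId(resId), entryIndex(resId));
        // prints: package=0x7f type=0x0b entry=0x001c
    }
}
```

In this model the key index only matters to code that asks for a resource's name, which is consistent with the observation that the app keeps running after the key strings are collapsed.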

So what are the strings in the key string pool used for? Reading the source code shows that they are only read through calls that work like "reflection on resources", such as:

 // MainActivity.java
 // Returns "_", because every string in the key string pool has been replaced with "_"
 String entryName = getResources().getResourceEntryName(R.layout.activity_main);
 // Returns 0, because no resource named "abc_fade_in" of type "anim" can be found
 int id = getResources().getIdentifier("abc_fade_in", "anim", "cn.pkg");

Our project generally does not look up resources by name in this way, so the key string pool can be collapsed to the single value "_". The known uses of name-based lookup are mostly plug-ins that are not built in the same project as the host. Where needed, those names can be kept by adding them to a whitelist.

The following figure shows the format and content of the key string pool in the resources.arsc file:

  • Offset array (marker 1): each element is the offset of one string within the key string pool data (marker 2).
  • Because every string in the key string pool is replaced with the single value "_", the pool ends up holding just one "_" string, and the offset array holds a single element that points to offset 0, where the "_" string starts.

Finally, everywhere in resources.arsc that refers to a key string by its index in the offset array, replace that index with 0, the index of the "_" string. Every original file name is thereby replaced with "_", and "_" is the only string left in the key string pool.
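The remapping can be sketched as follows. The sketch models the key string pool as a list and each resource entry as the index of its key string. The whitelist parameter and all names are illustrative assumptions rather than part of any real tool. Whitelisted names keep their own slot so that name-based lookups continue to work for them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: models the key string pool as a list, not as a binary chunk.
class KeyPoolCollapser {
    // Rewrites entryKeyIndexes in place and returns the new, smaller key pool.
    static List<String> collapse(List<String> keyPool, int[] entryKeyIndexes, Set<String> whitelist) {
        List<String> newPool = new ArrayList<>();
        newPool.add("_"); // index 0 is shared by every non-whitelisted entry
        Map<String, Integer> keptIndex = new HashMap<>();
        for (int i = 0; i < entryKeyIndexes.length; i++) {
            String name = keyPool.get(entryKeyIndexes[i]);
            if (whitelist.contains(name)) {
                Integer index = keptIndex.get(name);
                if (index == null) {
                    index = newPool.size();
                    newPool.add(name);
                    keptIndex.put(name, index);
                }
                entryKeyIndexes[i] = index;
            } else {
                entryKeyIndexes[i] = 0;
            }
        }
        return newPool;
    }

    public static void main(String[] args) {
        List<String> pool = List.of("abc_fade_in", "activity_main", "ic_launcher");
        int[] entries = {0, 1, 2, 1};
        List<String> newPool = collapse(pool, entries, Set.of("ic_launcher"));
        System.out.println(newPool);                  // [_, ic_launcher]
        System.out.println(Arrays.toString(entries)); // [0, 0, 1, 0]
    }
}
```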

Crashes and compatibility issues

A crash surfaced during the gray (staged) release. The cause turned out to be a suffix check that the framework applies to XML drawables in the drawable directory, as shown below:

frameworks/base/core/java/android/content/res/ResourcesImpl.java

 // Create a drawable
 private Drawable loadDrawableForCookie(@NonNull Resources wrapper, @NonNull TypedValue value, int id, int density) {
     ...
     if (file.endsWith(".xml")) { // Parse the XML file and create a drawable
         final String typeName = getResourceTypeName(id);
         if (typeName != null && typeName.equals("color")) {
             dr = loadColorOrXmlDrawable(wrapper, value, id, density, file);
         } else {
             dr = loadXmlDrawable(wrapper, value, id, density, file);
         }
     } else { // Decode .png and other image files and create a drawable
         final InputStream is = mAssets.openNonAsset(value.assetCookie, file, AssetManager.ACCESS_STREAMING);
         final AssetInputStream ais = (AssetInputStream) is;
         dr = decodeImageDrawable(ais, wrapper, value);
     }
     ...
 }

Therefore, we keep the .xml suffix for files in the drawable directory.

After release, users reported slow startup on some phones running Android 6.x. Investigation showed that removing the suffix from image file names caused slow app startup on some ROMs. To rule out these compatibility issues, we kept only the path shortening and key constant pool pruning and left the file name suffixes in place, i.e. r/a/a.xml -> r/a.xml. Resource path shortening saved 300K+.
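The final path rule (flatten everything into r, keep the suffix) can be sketched as a small helper. The method name, and the decision to keep everything from the first dot of the file name so that a compound suffix such as .9.png survives intact, are assumptions made for this illustration.

```java
// Sketch only: a hypothetical helper, not code from AndResGuard or the authors' tool.
class PathShortener {
    // Builds "r/<obfuscatedName><suffix>" from the original resource path.
    static String shorten(String originalPath, String obfuscatedName) {
        int slash = originalPath.lastIndexOf('/');
        // Search for the dot only inside the file name, not in directory names.
        int dot = originalPath.indexOf('.', slash + 1);
        String suffix = dot >= 0 ? originalPath.substring(dot) : "";
        return "r/" + obfuscatedName + suffix;
    }

    public static void main(String[] args) {
        System.out.println(shorten("res/anim/abc_fade_in.xml", "a"));        // r/a.xml
        System.out.println(shorten("res/drawable-xxhdpi-v4/bg.9.png", "b")); // r/b.9.png
    }
}
```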

Layout optimization

Layout files in the layout directory take up a large share of the package. The previous analysis showed that resources.arsc contains several string pools and that some of their strings are unused and can be deleted. Layout files use the same binary format as resources.arsc and also contain a string pool. Are there similar optimization opportunities? To find out, we need to look at the format and content of a layout file. Here is an arbitrary layout file; its source code and binary file format are as follows:

 <?xml version="1.0" encoding="utf-8"?>
 <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
     android:enabled="true"
     android:gravity="center"
     android:background="@color/colorAccent"
     android:layout_width="match_parent"
     android:layout_height="match_parent" />

The binary format shows that a layout file has one string pool, strPool, and one array, resMap. To illustrate their roles, suppose the layout file contains the attribute "layout_width". The file then holds the following information:

  • String offset array (marker 1): points into the string pool (marker 2) and is used to fetch a tag name (such as "LinearLayout") or an attribute name (such as "layout_width") from the pool.
  • String pool (marker 2): the only string pool in the layout file. It stores the tag and attribute names used in the file, e.g. "layout_width".
  • Attribute id array Resids (marker 3): holds the integer id values of all attributes in the file. The id value of "layout_width" is 10100F4h. The name shown next to each id value in the array (such as attr_layout_width(10100F4h)) indicates that the ids correspond one-to-one with the attribute names in the string pool.

We know that layout_width is itself an attr resource. Looking at public.xml in the system source code, we can see:

Its integer id value, 0x010100f4, is exactly the value that follows attr_layout_width in the layout file above. The id values of system attributes are fixed, and an attribute in a layout file can be uniquely identified by either its string name or its integer id value. So is the id alone enough to identify the attribute, and can the attribute's string name be deleted?

Hypothesis: each attribute has a string name and an integer id value. For performance reasons, when the attributes of each node in a layout file are parsed, an attribute is identified by its integer id value rather than its string name, and its value is fetched on that basis.

To test this conjecture, we changed one attribute name in the string pool, layout_width -> llyout_width, and confirmed that the app still ran correctly. As noted earlier, there are nearly 9K files in the layout directory, so the change has a wide impact: if it works the savings should be large, but it also calls for extra caution.

Reading the source code, we found that each attribute (attr) carries an integer id value. After parseXml() parses the layout file and reaches a tag, attribute values are fetched directly by integer id value. This is fairly low-level, performance-sensitive code, so ROM vendors are unlikely to modify it and compatibility should not be affected.

The framework code that parses the layout file, identifies attributes, and fetches their values is as follows:

frameworks/base/core/jni/android_util_AssetManager.cpp

 // Get the attribute value through the attribute's integer id value
 static jboolean android_content_AssetManager_applyStyle(...) {
     ...
     while (ix < NX && curIdent > curXmlAttr) {
         ix++;
         curXmlAttr = xmlParser->getAttributeNameResID(ix); // Get the attribute id value
     }
     if (ix < NX && curIdent == curXmlAttr) { // Identify the attribute by the id value
         block = kXmlBlock;
         xmlParser->getAttributeValue(ix, &value); // Get the attribute value
         ...
     }
     ...
 }

 uint32_t ResXMLParser::getAttributeNameResID(size_t idx) const {
     int32_t id = getAttributeNameID(idx);
     // mTree.mResIds is the Resids array; the return value is the attribute id value
     if (id >= 0 && (size_t)id < mTree.mNumResIds) {
         return dtohl(mTree.mResIds[id]);
     }
     return 0;
 }

The implementation has three key points, with a combined saving of 1.9MB+. The details follow.

Attribute string name modification

First, we replace every string in the string pool that has a corresponding entry in the Resids array with the empty string "". Applying this to all files in the layout directory saves 1.1MB+.

Extending the optimization to every ".xml" file under the resource directory res saves a further 180K+. This extra saving is smaller because the ".xml" files in other directories are mostly in drawable or anim, and their Resids arrays contain few or no attributes.
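A minimal sketch of the attribute name trimming, under the assumption (consistent with the one-to-one correspondence described earlier) that entry i of the Resids array describes string i of the pool. The pool is modeled as a plain String array, whereas a real implementation would rewrite the binary chunk. The id 0x010100f4 comes from the text; the second id is supplied only to make the example concrete.

```java
import java.util.Arrays;

// Sketch only: models the string pool as a String[] instead of a binary chunk.
class AttributeNameTrimmer {
    // Returns a copy of the pool in which every name covered by resIds is emptied.
    static String[] trim(String[] stringPool, int[] resIds) {
        String[] result = stringPool.clone();
        int covered = Math.min(resIds.length, result.length);
        for (int i = 0; i < covered; i++) {
            if (resIds[i] != 0) { // defensive: an id of 0 carries no attribute identity
                result[i] = "";
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] pool = {"layout_width", "layout_height", "android",
                "http://schemas.android.com/apk/res/android", "LinearLayout"};
        int[] resIds = {0x010100f4, 0x010100f5};
        System.out.println(Arrays.toString(trim(pool, resIds)));
        // [, , android, http://schemas.android.com/apk/res/android, LinearLayout]
    }
}
```

Tag names such as "LinearLayout" are left untouched because they have no entry in the Resids array and are still needed to instantiate views.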

Offset array modification

Looking at the string pool of a file in the layout directory, we see that it consists of an offset array and the string data. Each node reads its strings from the string data through the offset array. We can therefore change only the offset array so that several entries point to the same string, merging all the empty strings into one, which shrinks the string pool and saves space, as shown in the following figure:

In the figure above, the left side shows the pool after every string corresponding to the Resids array has been replaced with "": the offset array points to 5 separate "" strings. The right side shows the pool after the offset array is modified so that all five entries point to the first "" string and the 4 redundant "" strings are deleted. This saves 300K+.
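The merge can be sketched with a toy string pool writer. The encoding used here (one byte for the character count, one byte for the byte count, the UTF-8 data, and a terminating zero) is a simplification that only covers strings shorter than 128 bytes, and the class is a hypothetical model rather than a real ARSC/XML writer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a toy pool with an offset array and string data.
class MergedStringPool {
    final int[] offsets;
    final byte[] data;

    MergedStringPool(String[] strings, boolean mergeDuplicates) {
        offsets = new int[strings.length];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Map<String, Integer> firstOffset = new HashMap<>();
        for (int i = 0; i < strings.length; i++) {
            Integer existing = mergeDuplicates ? firstOffset.get(strings[i]) : null;
            if (existing != null) {
                offsets[i] = existing; // point at the copy that is already stored
                continue;
            }
            byte[] utf8 = strings[i].getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("sketch handles short strings only");
            }
            offsets[i] = out.size();
            firstOffset.put(strings[i], offsets[i]);
            out.write(strings[i].length()); // character count
            out.write(utf8.length);         // byte count
            out.write(utf8, 0, utf8.length);
            out.write(0);                   // terminator
        }
        data = out.toByteArray();
    }

    public static void main(String[] args) {
        String[] strings = {"", "", "", "", "", "LinearLayout"};
        MergedStringPool plain = new MergedStringPool(strings, false);
        MergedStringPool merged = new MergedStringPool(strings, true);
        System.out.println(plain.data.length + " " + Arrays.toString(plain.offsets));
        // 30 [0, 3, 6, 9, 12, 15]
        System.out.println(merged.data.length + " " + Arrays.toString(merged.offsets));
        // 18 [0, 0, 0, 0, 0, 3]
    }
}
```

The offset array keeps its length, so every string index used elsewhere in the file stays valid. Only the string data shrinks.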

Namespace removal

While tracing how attribute values are fetched, we noticed that the attribute namespace string is very long, for example "http://schemas.android.com/apk/res/android". Every layout file contains at least one namespace string, so it occurs very frequently. Our guess was that the namespace string is not consulted when fetching an attribute value either: as described above, the attribute id value alone identifies the attribute. After replacing the namespace string with an empty string, everything indeed still worked, and this optimization saved 500K+.
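Namespace removal follows the same pattern as attribute name trimming. The sketch below assumes the caller has already collected the string pool indexes that are referenced as namespace URIs. How those indexes are gathered from the binary XML is outside the sketch, and all names are illustrative.

```java
import java.util.Set;

// Sketch only: empties the pool strings that are used as namespace URIs.
class NamespaceStripper {
    // Modifies the pool in place and returns the number of characters removed.
    static int strip(String[] stringPool, Set<Integer> namespaceUriIndexes) {
        int removedChars = 0;
        for (int index : namespaceUriIndexes) {
            removedChars += stringPool[index].length();
            stringPool[index] = "";
        }
        return removedChars;
    }

    public static void main(String[] args) {
        String[] pool = {"android", "http://schemas.android.com/apk/res/android", "LinearLayout"};
        int removed = strip(pool, Set.of(1));
        System.out.println(removed); // 42
    }
}
```

Removing 42 characters per file is small in isolation, but it repeats in every one of the roughly 9K layout files.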

The final optimized form is as follows:

  • Mark 1: attribute name trimming. Each attribute name string in the string pool is replaced with the empty string "";
  • Mark 2: offset array modification. All "" strings in the string pool are merged into one;
  • Mark 3: namespace removal. The namespace string in the string pool is replaced with the empty string "".

App Bundle Compatibility

The optimizations above are general and can be applied to apps distributed in China. Overseas, however, apps on the Google Play store now use the App Bundle format (AAB), in which resources.arsc and the layout files are stored differently: Google re-encodes the earlier binary formats in protobuf to make the file contents more extensible and robust.

When an AAB is split into multiple APKs, the protobuf format is converted to the binary XML format, and this conversion runs on Google Play where we cannot change it. We can therefore only optimize the resource files inside the AAB itself. Parsing the attributes of a layout file in an App Bundle is not complicated, so we will not go into details here. The results of porting the optimizations above to AAB are as follows:

Resource path shortening:

  • Cannot be implemented, because the key constant pool pruning in resources.arsc is not possible: replacing all key strings with the same string "_" makes the conversion fail. When converting from protobuf to binary XML, the tool checks whether the current key string is a duplicate. If it is, the conversion returns an error immediately and the file cannot be parsed.

frameworks/base/tools/aapt2/format/proto/ProtoDeserialize.cpp

 // Read the resource file in protobuf format
 static bool DeserializePackageFromPb(...) {
     ...
     for (const pb::ConfigValue& pb_config_value : pb_entry.config_value()) {
         ...
         // FindOrCreateValue looks up an existing ResourceConfigValue or creates a new one.
         // The lookup determines whether the key string already exists.
         ResourceConfigValue* config_value = entry->FindOrCreateValue(config, pb_config.product());
         if (config_value->value != nullptr) { // config_value already exists, return an error
             *out_error = "duplicate configuration in resource table";
             return false;
         }
         ...
     }
     ...
 }
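The failure can be pictured with a simplified model. Assume, as a simplification of the behavior described above, that the converter locates an entry by its type and name and then stores one value per configuration. Two resources that both carry the name "_" then resolve to the same entry, and the second value for the same configuration is rejected. The class below is a hypothetical model, not aapt2 code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a toy resource table keyed by type/name, as assumed in the lead-in.
class NameKeyedTable {
    private final Map<String, Set<String>> configsByEntry = new HashMap<>();

    // Returns false when this type/name already holds a value for the configuration.
    boolean addValue(String type, String name, String config) {
        Set<String> configs = configsByEntry.computeIfAbsent(type + "/" + name, k -> new HashSet<>());
        return configs.add(config);
    }

    public static void main(String[] args) {
        NameKeyedTable original = new NameKeyedTable();
        System.out.println(original.addValue("layout", "activity_main", "default"));  // true
        System.out.println(original.addValue("layout", "activity_login", "default")); // true

        NameKeyedTable collapsed = new NameKeyedTable();
        System.out.println(collapsed.addValue("layout", "_", "default")); // true
        System.out.println(collapsed.addValue("layout", "_", "default")); // false: duplicate
    }
}
```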

Layout optimization:

  • Attribute name trimming: can be implemented, saving 400K+;
  • Offset array modification: cannot be implemented, because the final conversion from protobuf to binary XML is performed by the aapt2 command in Google Play's own environment and cannot be modified;
  • Namespace removal: can be implemented, saving 200K+.

In total, the solution saves 600K+ on one of our overseas apps. After optimizing the AAB, we split it, took the resulting base-master.apk, and inspected a layout file in it. The result is shown below:

  • Mark 1: attribute name trimming and namespace removal have taken effect;
  • Mark 2: offset array modification cannot be applied, so multiple "" strings remain instead of being merged into one as in the APK.

Summary

There is still plenty of room to optimize resource files, and many of the optimizations are general. The bulk of the work lies in finding strings and confirming that they are unused. In a compiled binary file, strings usually serve one of these purposes:

  • Required for code execution. These strings are necessary, but they can often be shortened, i.e. obfuscated;
  • Debugging aids. These strings are not strictly required and can be removed. If some must be retained, a keep rule can be used;
  • Introduced by the file format's designers for completeness or for future extension. These strings may even hurt performance and can be removed outright if unused.

The last two categories are where to look for redundant strings when optimizing package size.
