DEX file format analysis

I spent a while reverse engineering the China Mobile and China Telecom APKs, so this blog has gone without updates for some time. I am now writing a tool whose main function is to hide the types, methods, and fields in a dex file, driven by configuration, to hinder static analysis. Before writing the tool, I need a clear understanding of the dex file format. Compared with the ELF file format, the dex file format is simpler.

0x00 Introduction

The best way to analyze the dex file format is to find a reference document, write a simple demo, and then inspect the result in 010Editor. The official document is at http://source.android.com/devices/tech/dalvik/dex-format.html. If your English is not great, you can also look for a Chinese translation, like I did...

010Editor is a good tool; I used it before to analyze ELF files. With the right template installed, it can parse many file formats. It is paid software, but there is a 30-day free trial. And if you are on a Mac 😏 and the trial period is over, delete this file 🙄: ~/.config/SweetScape/010 Editor.ini.

0x01 File Layout

The dex file can be divided into three parts: the header, the index area (xxxx_ids), and the data area (data). The header describes the layout of the entire dex file, including the size and offset of each index section. The "ids" in the index area is short for identifiers: each entry identifies one piece of data. Entries in the index area mostly hold offsets into the data area.

In 010Editor, except for the data area, all other sections are displayed. In addition, link_data is defined as map_list in the template.

0x02 header

The header describes the dex file itself and locates the other areas. 010Editor (010 for short from here on) uses the structure struct header_item to describe the header.

Two data types are used: char and uint. The char here is the C/C++ char, which occupies 8 bits, not the Java char, which occupies 16 bits (the latter could stand in for short/ushort; more on that when I introduce the tool I am writing). The official document calls the 8-bit type ubyte, so let's follow the official naming. Structure description:

    // ubyte: 8-bit unsigned int
    // uint:  32-bit unsigned int, little-endian

    struct header_item
    {
        ubyte[8]  magic;
        uint      checksum;
        ubyte[20] signature;
        uint      file_size;
        uint      header_size;
        uint      endian_tag;
        uint      link_size;
        uint      link_off;
        uint      map_off;
        uint      string_ids_size;
        uint      string_ids_off;
        uint      type_ids_size;
        uint      type_ids_off;
        uint      proto_ids_size;
        uint      proto_ids_off;
        uint      field_ids_size;
        uint      field_ids_off;
        uint      method_ids_size;
        uint      method_ids_off;
        uint      class_defs_size;
        uint      class_defs_off;
        uint      data_size;
        uint      data_off;
    }

Apart from magic, checksum, signature, file_size, header_size, endian_tag, and map_off, the fields come in pairs: _off is the offset of a section and _size is the number of elements in it (for link and data, _size is a size in bytes). The seven unpaired fields mostly describe the dex file itself.

  • magic: A fixed value that identifies dex files. Converted to a string:

    {0x64, 0x65, 0x78, 0x0A, 0x30, 0x33, 0x35, 0x00} = "dex\n035\0"

    There is a line break in the middle, followed by 035, which is the version number.

  • checksum: File checksum. The Adler-32 algorithm is run over the whole file except magic and checksum; it is used to detect file corruption.
  • signature: SHA-1 hash of the whole file except magic, checksum, and signature; it uniquely identifies the file.
  • file_size: Size of the dex file in bytes.
  • header_size: Size of the header area, currently fixed at 0x70.
  • endian_tag: Endianness tag. The dex format is little-endian, so this is the constant 0x12345678.
  • map_off: Offset of map_list from the start of the file. map_list belongs to the data area, so the value must be greater than or equal to data_off; in practice it usually sits at the end of the dex file.
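
As a rough illustration of the header fields above, here is a minimal Python sketch that unpacks the 0x70-byte header and recomputes checksum and signature. The function names parse_header and verify are my own, not part of any tool mentioned here, and the sketch assumes a little-endian dex file that is already fully loaded into memory.

```python
import hashlib
import struct
import zlib

# magic (8 bytes), checksum (uint), signature (20 bytes), then 20 uint fields.
HEADER_FORMAT = "<8sI20s20I"

FIELD_NAMES = (
    "file_size", "header_size", "endian_tag", "link_size", "link_off", "map_off",
    "string_ids_size", "string_ids_off", "type_ids_size", "type_ids_off",
    "proto_ids_size", "proto_ids_off", "field_ids_size", "field_ids_off",
    "method_ids_size", "method_ids_off", "class_defs_size", "class_defs_off",
    "data_size", "data_off",
)


def parse_header(dex):
    """Unpack header_item from the start of a dex image into a dict."""
    magic, checksum, signature, *rest = struct.unpack_from(HEADER_FORMAT, dex, 0)
    header = {"magic": magic, "checksum": checksum, "signature": signature}
    header.update(zip(FIELD_NAMES, rest))
    return header


def verify(dex):
    """Return (checksum_ok, signature_ok) for a dex image."""
    header = parse_header(dex)
    # Adler-32 covers everything after magic (8 bytes) and checksum (4 bytes).
    checksum_ok = (zlib.adler32(dex[12:]) & 0xFFFFFFFF) == header["checksum"]
    # SHA-1 covers everything after magic, checksum and signature (32 bytes).
    signature_ok = hashlib.sha1(dex[32:]).digest() == header["signature"]
    return checksum_ok, signature_ok
```

Note the order if you ever patch a dex file: the signature has to be rewritten first, because the checksum also covers the signature bytes.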

0x03 string_ids

The string_ids section describes all the strings in the dex file. The format is very simple: each entry is a single offset that points to a string_data_item in the data area.

string_data_item uses the LEB128 (little-endian base 128) format, a variable-length encoding built from 1-byte units. If the highest bit of a byte is 1, another byte follows; the last byte of a value has its highest bit set to 0. The remaining 7 bits of each byte carry the data, least significant group first. In dex files LEB128 only encodes values of up to 32 bits (so at most 5 bytes), which can be seen by reading the Leb128.h source code in dalvik.
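
The LEB128 rules can be sketched in a few lines of Python. This is only an illustration of the unsigned variant (uleb128), not the code from Leb128.h; the function names are mine, and the 5-byte limit reflects the 32-bit restriction mentioned above.

```python
def read_uleb128(buf, pos=0):
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift   # low 7 bits carry data
        if (byte & 0x80) == 0:             # high bit clear: last byte
            return result, pos
        shift += 7
        if shift > 28:                     # a 6th byte would exceed 32 bits
            raise ValueError("uleb128 longer than 5 bytes")


def write_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)        # more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

For example, the two bytes 0x80 0x7f decode to 16256: the first byte contributes 0 and the second contributes 0x7f shifted left by 7 bits.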

The data structure is:

    // ubyte:   8-bit unsigned int
    // uint:    32-bit unsigned int, little-endian
    // uleb128: unsigned LEB128, variable length

    struct string_id_item
    {
        uint string_data_off;
    }

    struct string_data_item
    {
        uleb128 utf16_size;
        ubyte[] data;
    }

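data stores the string value as MUTF-8 (modified UTF-8) bytes terminated by a 0 byte; utf16_size is the length of the string in UTF-16 code units, not the number of bytes. string_ids is the most important section, since many of the later sections refer to strings by their index in string_ids. A tool that compares dex files also needs to extract string_ids first.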
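
To make the two-step lookup concrete (string_ids entry, then string_data_item), here is a small Python sketch. read_string is a hypothetical helper of my own. It decodes data as ordinary UTF-8, which is an assumption that only holds for strings without embedded NUL characters or characters outside the Basic Multilingual Plane, because MUTF-8 encodes those differently.

```python
import struct


def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_string(dex, string_ids_off, index):
    """Return the string at position `index` of the string_ids section."""
    # Each string_ids entry is one uint: the offset of a string_data_item.
    (data_off,) = struct.unpack_from("<I", dex, string_ids_off + 4 * index)
    # Skip utf16_size; the bytes that follow run up to the terminating 0.
    _utf16_size, pos = read_uleb128(dex, data_off)
    end = dex.index(b"\x00", pos)
    # Assumption: plain UTF-8 decoding is close enough to MUTF-8 here.
    return dex[pos:end].decode("utf-8", errors="replace")
```

In a real file, string_ids_off comes from the header and index ranges from 0 to string_ids_size - 1.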

0x04 type_ids

The type_ids section indexes all data types in the dex file, including class types, array types, and primitive types. Each element of the section is a type_id_item, with the following structure:

    // uint: 32-bit unsigned int, little-endian

    struct type_id_item
    {
        uint descriptor_idx;  // -->string_ids
    }

The value of descriptor_idx is an index into string_ids; the string it selects is the descriptor of this type.
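
The descriptor strings use the usual Java/Dalvik notation, for example Ljava/lang/String; for a class, I for int, and a leading [ per array dimension. As a small sketch of how a tool might turn them into readable names (describe_type is a name I made up for illustration):

```python
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "S": "short", "C": "char",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def describe_type(descriptor):
    """Convert a type descriptor such as '[Ljava/lang/String;' to Java style."""
    # Each leading '[' adds one array dimension.
    dims = len(descriptor) - len(descriptor.lstrip("["))
    base = descriptor[dims:]
    if base.startswith("L") and base.endswith(";"):
        name = base[1:-1].replace("/", ".")   # class type
    elif base in PRIMITIVES:
        name = PRIMITIVES[base]               # primitive type
    else:
        raise ValueError("not a type descriptor: %r" % descriptor)
    return name + "[]" * dims
```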

0x05 proto_ids

proto means method prototype, which represents the prototype of a method in Java language. The elements in proto_ids are proto_id_item, and the structure is as follows:

    // uint: 32-bit unsigned int, little-endian

    struct proto_id_item
    {
        uint shorty_idx;       // -->string_ids
        uint return_type_idx;  // -->type_ids
        uint parameters_off;
    }

  • shorty_idx: An index into string_ids. The string is a short-form descriptor of the method prototype: one character for the return type followed by one character per parameter, with L standing for any reference type.
  • return_type_idx: An index into type_ids, giving the return type of the method prototype.
  • parameters_off: Offset of the parameter list of the method prototype. If the method has no parameters, the value is 0. The parameter list has the format type_list, which is described below.
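
A short Python sketch of reading one proto_id_item and splitting its shorty string. Both helper names are hypothetical; the entry size of 12 bytes follows from the three uint fields in the structure.

```python
import struct


def read_proto_id(dex, proto_ids_off, index):
    """Return (shorty_idx, return_type_idx, parameters_off) for one entry."""
    # Three little-endian uints, 12 bytes per proto_id_item.
    return struct.unpack_from("<III", dex, proto_ids_off + 12 * index)


def split_shorty(shorty):
    """Split a shorty string into (return type char, list of parameter chars)."""
    return shorty[0], list(shorty[1:])
```

For instance, a method such as void main(String[] args) has the shorty "VL": V for the void return type and L for the single reference-type parameter (arrays count as references too).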

0x06 field_ids

The field_ids section contains all the fields referenced by the dex file. Each element of the section is a field_id_item, with the following structure:

    // ushort: 16-bit unsigned int, little-endian
    // uint:   32-bit unsigned int, little-endian

    struct field_id_item
    {
        ushort class_idx;  // -->type_ids
        ushort type_idx;   // -->type_ids
        uint   name_idx;   // -->string_ids
    }

  • class_idx: The class that the field belongs to. The value is an index into type_ids and must refer to a class type.
  • type_idx: The type of the field; also an index into type_ids.
  • name_idx: The name of the field; an index into string_ids.

0x07 method_ids

method_ids is the last of the xxxx_ids sections in the index area and describes all methods referenced by the dex file. Each element is a method_id_item, whose structure is very similar to that of field_ids:

    // ushort: 16-bit unsigned int, little-endian
    // uint:   32-bit unsigned int, little-endian

    struct method_id_item
    {
        ushort class_idx;  // -->type_ids
        ushort proto_idx;  // -->proto_ids
        uint   name_idx;   // -->string_ids
    }

  • class_idx: The class that the method belongs to. The value is an index into type_ids and must refer to a class type. 16-bit indexes like this one are the background to the well-known method limit: the invoke instructions address methods with a 16-bit index, so a single dex file can reference at most 65,536 methods, and an app with more methods has to be split across several dex files.
  • proto_idx: The prototype of the method; an index into proto_ids.
  • name_idx: The name of the method; an index into string_ids.
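
Because both field and method entries are 8 bytes (two ushort values and one uint), reading the whole section is a simple loop. A minimal sketch, with MethodId and read_method_ids being names I chose for illustration:

```python
import struct
from collections import namedtuple

MethodId = namedtuple("MethodId", "class_idx proto_idx name_idx")


def read_method_ids(dex, method_ids_off, method_ids_size):
    """Read every method_id_item of the section into a list."""
    return [
        # "<HHI": ushort class_idx, ushort proto_idx, uint name_idx.
        MethodId(*struct.unpack_from("<HHI", dex, method_ids_off + 8 * i))
        for i in range(method_ids_size)
    ]
```

The same format string would work for the field section, with the second ushort read as type_idx instead of proto_idx.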

0x08 class_defs

The class_defs section holds the class definitions. Its structure is very complicated: 010's structure view of it makes me dizzy just looking at it, let alone analyzing it.

class_def_item

The class_def_item structure is described as follows:

    // uint: 32-bit unsigned int, little-endian

    struct class_def_item
    {
        uint class_idx;         // -->type_ids
        uint access_flags;
        uint superclass_idx;    // -->type_ids
        uint interfaces_off;    // -->type_list
        uint source_file_idx;   // -->string_ids
        uint annotations_off;   // -->annotations_directory_item
        uint class_data_off;    // -->class_data_item
        uint static_value_off;  // -->encoded_array_item
    }

  • class_idx: The class being defined; an index into type_ids. It must refer to a class type, not an array type or a primitive type.
  • access_flags: The access flags of the class, such as public, final, or abstract. See "access_flags Definitions" in dex-format.html for the full list.
  • superclass_idx: The type of the superclass, in the same format as class_idx. If the class has no superclass (as for java.lang.Object), the value is NO_INDEX.
  • interfaces_off: An offset to the interfaces implemented by the class; the data structure it points to is type_list. If the class has no interfaces, the value is 0.
  • source_file_idx: The name of the source file; an index into string_ids. If this information is missing, the value is NO_INDEX = 0xffffffff.
  • annotations_off: An offset to the annotations of the class, located in the data area, in the format annotations_directory_item. If there are none, the value is 0.
  • class_data_off: An offset to the data used by the class, located in the data area, in the format class_data_item. If there is none, the value is 0. This structure holds a lot: the fields and methods of the class, the code of each method, and so on. class_data_item is introduced later.
  • static_value_off: An offset to a list in the data area, in the format encoded_array_item. If there is none, the value is 0.
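
A sketch of reading one class_def_item (eight uints, 32 bytes) and turning access_flags into names. The flag table below lists only the bits that apply to classes, as I read them from "access_flags Definitions"; the helper names are my own.

```python
import struct
from collections import namedtuple

NO_INDEX = 0xFFFFFFFF

ClassDef = namedtuple(
    "ClassDef",
    "class_idx access_flags superclass_idx interfaces_off "
    "source_file_idx annotations_off class_data_off static_value_off",
)

# Access flag bits that are meaningful for classes.
CLASS_FLAGS = {
    0x0001: "public", 0x0002: "private", 0x0004: "protected",
    0x0008: "static", 0x0010: "final", 0x0200: "interface",
    0x0400: "abstract", 0x1000: "synthetic", 0x2000: "annotation",
    0x4000: "enum",
}


def read_class_def(dex, class_defs_off, index):
    """Unpack the class_def_item at position `index` of the section."""
    return ClassDef(*struct.unpack_from("<8I", dex, class_defs_off + 32 * index))


def decode_class_flags(access_flags):
    """List the names of the flag bits set in access_flags."""
    return [name for bit, name in CLASS_FLAGS.items() if access_flags & bit]
```

A caller would check source_file_idx against NO_INDEX and the three offsets against 0 before following them.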

type_list

type_list is in the data area; class_def_item->interfaces_off and proto_id_item->parameters_off refer to data in this format. The data structure is as follows:

    // ushort: 16-bit unsigned int, little-endian
    // uint:   32-bit unsigned int, little-endian

    struct type_list
    {
        uint      size;
        type_item list[size];
    }

    struct type_item
    {
        ushort type_idx;  // -->type_ids
    }

  • size: The number of entries in the list.
  • type_idx: An index into type_ids.
annotations_directory_item

The data section pointed to by class_def_item->annotations_off defines the data description related to annotation. The data structure is as follows:

    // uint: 32-bit unsigned int, little-endian

    struct annotations_directory_item
    {
        uint class_annotations_off;  // -->annotation_set_item
        uint fields_size;
        uint annotated_methods_size;
        uint annotated_parameters_size;

        field_annotation     field_annotations[fields_size];
        method_annotation    method_annotations[annotated_methods_size];
        parameter_annotation parameter_annotations[annotated_parameters_size];
    }

    struct field_annotation
    {
        uint field_idx;
        uint annotations_off;  // -->annotation_set_item
    }

    struct method_annotation
    {
        uint method_idx;
        uint annotations_off;  // -->annotation_set_item
    }

    struct parameter_annotation
    {
        uint method_idx;
        uint annotations_off;  // -->annotation_set_ref_list
    }

  • class_annotations_off: An offset to an annotation_set_item holding the annotations made directly on the class. For details, see dex-format.html.
  • fields_size: The number of annotated fields.
  • annotated_methods_size: The number of annotated methods.
  • annotated_parameters_size: The number of methods whose parameter lists carry annotations.

class_data_item

class_data_off points to the class_data_item structure in the data area. class_data_item stores various data used by this class. The following is the structure of class_data_item:

    // uleb128: unsigned LEB128, variable length

    struct class_data_item
    {
        uleb128 static_fields_size;
        uleb128 instance_fields_size;
        uleb128 direct_methods_size;
        uleb128 virtual_methods_size;

        encoded_field  static_fields[static_fields_size];
        encoded_field  instance_fields[instance_fields_size];
        encoded_method direct_methods[direct_methods_size];
        encoded_method virtual_methods[virtual_methods_size];
    }

    struct encoded_field
    {
        uleb128 field_idx_diff;
        uleb128 access_flags;
    }

    struct encoded_method
    {
        uleb128 method_idx_diff;
        uleb128 access_flags;
        uleb128 code_off;
    }

The fields of class_data_item:

  • static_fields_size: The number of static fields.
  • instance_fields_size: The number of instance fields.
  • direct_methods_size: The number of direct methods (static methods, private methods, and constructors).
  • virtual_methods_size: The number of virtual methods.

The fields of encoded_method:

  • method_idx_diff: The method_idx part means the value is an index into method_ids; the _diff suffix means it is stored as a difference from the method_idx of the previous element in the encoded_method[] array. For the first element of an array, the value is the index itself. encoded_field->field_idx_diff works the same way, but the compiled Hello.dex file does not use any class fields, so I will not explain it in detail. For details, please refer to the official dex-format.html document.
  • access_flags: Access flags, such as public, private, static, final, etc.
  • code_off: An offset into the data area, pointing to the code of this method. The value is 0 if the method is abstract or native. The structure pointed to is code_item, which has nearly 10 elements.
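
The _diff encoding is easiest to see in code. The following Python sketch walks a class_data_item and rebuilds the absolute field and method indexes by keeping a running sum per array. The function names are hypothetical, and the sketch assumes well-formed input (no bounds checks).

```python
def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 value; return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_members(buf, pos, count, extra_fields):
    """Read `count` encoded_field or encoded_method entries starting at pos."""
    members = []
    index = 0                      # the running index restarts for each array
    for _ in range(count):
        diff, pos = read_uleb128(buf, pos)
        index += diff              # undo the difference encoding
        extras = []
        for _ in range(extra_fields):
            value, pos = read_uleb128(buf, pos)
            extras.append(value)
        members.append((index, *extras))
    return members, pos


def read_class_data(buf, off):
    """Parse a class_data_item into four lists of tuples."""
    sizes = []
    pos = off
    for _ in range(4):
        size, pos = read_uleb128(buf, pos)
        sizes.append(size)
    # Fields carry 1 extra value (access_flags); methods carry 2 (plus code_off).
    layout = (("static_fields", 1), ("instance_fields", 1),
              ("direct_methods", 2), ("virtual_methods", 2))
    result = {}
    for (name, extra_fields), count in zip(layout, sizes):
        result[name], pos = read_members(buf, pos, count, extra_fields)
    return result
```

Each field comes back as (field_idx, access_flags) and each method as (method_idx, access_flags, code_off).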

code_item

The code_item structure describes the specific implementation of a method. Its structure is described as follows:

    struct code_item
    {
        ushort registers_size;
        ushort ins_size;
        ushort outs_size;
        ushort tries_size;
        uint   debug_info_off;
        uint   insns_size;
        ushort insns[insns_size];
        ushort padding;                       // optional
        try_item tries[tries_size];           // optional
        encoded_catch_handler_list handlers;  // optional
    }

The last three items are marked as optional, which means they may or may not be present, depending on the specific code.

  • registers_size: The number of registers used by this code.
  • ins_size: The number of words of incoming arguments to the method (a long or double parameter takes two).
  • outs_size: The number of words of argument space this code needs when it calls other methods.
  • tries_size: The number of try_item structures.
  • debug_info_off: An offset to the debug information for this code, which is a debug_info_item structure. The value is 0 if there is none.
  • insns_size: The size of the instruction list, in 16-bit units. insns is short for instructions.
  • padding: The value is 0. It is present only when tries_size is non-zero and insns_size is odd, so that tries starts on a 4-byte boundary.
  • tries and handlers: Used for exception handling in Java, i.e. the try/catch syntax. They are present only when tries_size is non-zero.
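
Here is a sketch of reading the fixed part of code_item and the instruction array, and of working out where tries would start. read_code_item is a hypothetical helper; it stops before try_item and the handler list, which it does not parse.

```python
import struct


def read_code_item(dex, off):
    """Parse the fixed fields and instructions of the code_item at `off`."""
    (registers_size, ins_size, outs_size, tries_size,
     debug_info_off, insns_size) = struct.unpack_from("<4H2I", dex, off)
    insns_off = off + 16                        # 4 ushorts + 2 uints
    insns = struct.unpack_from("<%dH" % insns_size, dex, insns_off)
    tries_off = None
    if tries_size:
        tries_off = insns_off + 2 * insns_size
        if insns_size % 2:                      # odd count: skip 2 padding bytes
            tries_off += 2
    return {
        "registers_size": registers_size,
        "ins_size": ins_size,
        "outs_size": outs_size,
        "tries_size": tries_size,
        "debug_info_off": debug_info_off,
        "insns": insns,
        "tries_off": tries_off,
    }
```

A method whose body is only return-void has insns_size 1 and a single code unit 0x000e.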

encoded_array_item

class_def_item->static_value_off points to an encoded_array_item in the data area.

    // uleb128: unsigned LEB128, variable length

    struct encoded_array_item
    {
        encoded_array value;
    }

    struct encoded_array
    {
        uleb128       size;
        encoded_value values[size];
    }

  • size: The number of encoded_value entries.
  • encoded_value: I haven't figured out how this is done 🙄
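
For what it is worth, here is my reading of the "encoded_value encoding" section of dex-format.html, as a rough Python sketch rather than a verified implementation. Each encoded_value starts with one byte whose low 5 bits are value_type and whose high 3 bits are value_arg; for the integer-like types, value_arg + 1 bytes of little-endian data follow. The sketch handles only those simple types; floats, nested arrays, and annotations are left out, and the function name is my own.

```python
# Type codes taken from the official format document (subset).
SIGNED = {0x00: "byte", 0x02: "short", 0x04: "int", 0x06: "long"}
UNSIGNED = {0x03: "char", 0x17: "string", 0x18: "type",
            0x19: "field", 0x1A: "method", 0x1B: "enum"}
VALUE_NULL = 0x1E
VALUE_BOOLEAN = 0x1F


def read_encoded_value(buf, pos):
    """Decode one simple encoded_value; return (type_name, value, next_pos)."""
    value_type = buf[pos] & 0x1F        # low 5 bits
    value_arg = buf[pos] >> 5           # high 3 bits
    pos += 1
    if value_type == VALUE_NULL:
        return "null", None, pos
    if value_type == VALUE_BOOLEAN:
        return "boolean", bool(value_arg), pos   # the value lives in value_arg
    if value_type in SIGNED or value_type in UNSIGNED:
        size = value_arg + 1                     # number of data bytes
        signed = value_type in SIGNED
        value = int.from_bytes(buf[pos:pos + size], "little", signed=signed)
        name = SIGNED[value_type] if signed else UNSIGNED[value_type]
        return name, value, pos + size
    raise NotImplementedError("value_type 0x%02x is not handled here" % value_type)
```

For the index types (string, type, field, method, enum) the decoded number is an index into the matching xxxx_ids section, not the value itself.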

0x09 map_list

Most items in map_list are the same as the corresponding descriptions in the header, which introduce the offset and size of each area, but the description in map_list is more comprehensive, including HEADER_ITEM, TYPE_LIST, STRING_DATA_ITEM, DEBUG_INFO_ITEM and other information.

The data structure is:

    // ushort: 16-bit unsigned int, little-endian
    // uint:   32-bit unsigned int, little-endian

    struct map_list
    {
        uint     size;
        map_item list[size];
    }

    struct map_item
    {
        ushort type;
        ushort unused;
        uint   size;
        uint   offset;
    }

map_list starts with a uint giving the number of map_item entries, followed by that many map_item structures. The map_item structure has 4 elements: type is the kind of item, one of the values under "Type Codes" in the Dalvik Executable Format document; size is the number of items of that type; offset is the offset of the first such item from the start of the file; unused only pads the structure for alignment and has no other use.
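
Reading map_list takes only a few lines, and it gives a quick overview of a whole dex file. A Python sketch follows; the type code table is copied from the official document as I remember it, so double-check it against "Type Codes" before relying on it, and read_map_list is a name of my own.

```python
import struct

TYPE_CODES = {
    0x0000: "TYPE_HEADER_ITEM", 0x0001: "TYPE_STRING_ID_ITEM",
    0x0002: "TYPE_TYPE_ID_ITEM", 0x0003: "TYPE_PROTO_ID_ITEM",
    0x0004: "TYPE_FIELD_ID_ITEM", 0x0005: "TYPE_METHOD_ID_ITEM",
    0x0006: "TYPE_CLASS_DEF_ITEM", 0x1000: "TYPE_MAP_LIST",
    0x1001: "TYPE_TYPE_LIST", 0x1002: "TYPE_ANNOTATION_SET_REF_LIST",
    0x1003: "TYPE_ANNOTATION_SET_ITEM", 0x2000: "TYPE_CLASS_DATA_ITEM",
    0x2001: "TYPE_CODE_ITEM", 0x2002: "TYPE_STRING_DATA_ITEM",
    0x2003: "TYPE_DEBUG_INFO_ITEM", 0x2004: "TYPE_ANNOTATION_ITEM",
    0x2005: "TYPE_ENCODED_ARRAY_ITEM", 0x2006: "TYPE_ANNOTATIONS_DIRECTORY_ITEM",
}


def read_map_list(dex, map_off):
    """Return a list of (type_name, size, offset) tuples from map_list."""
    (count,) = struct.unpack_from("<I", dex, map_off)
    entries = []
    for i in range(count):
        # Each map_item is 12 bytes: ushort type, ushort unused, uint size, uint offset.
        item_type, _unused, size, offset = struct.unpack_from(
            "<HHII", dex, map_off + 4 + 12 * i)
        name = TYPE_CODES.get(item_type, "UNKNOWN_0x%04x" % item_type)
        entries.append((name, size, offset))
    return entries
```

map_off comes from the header; the first entry of a well-formed file should be TYPE_HEADER_ITEM with size 1 and offset 0.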
