Deep Learning System Security

As deep learning (DL) is increasingly applied to safety-critical domains, ensuring the reliability of DL systems becomes essential. As with conventional software, DL systems need to be tested for robustness by driving them to expose incorrect behaviors. Unlike conventional software, however, the logic of a DL system is not written by hand but learned during training, which can lead to unexpected behaviors caused by, for example, biased training data, overfitting, or underfitting. Because this logic is not expressed in explicit lines of code, DL models are extremely difficult to test. There are two main challenges: how to trigger all (or at least most) of the logic and the erroneous behaviors, and how to identify such erroneous behaviors when no manual checks are available.

Using differential testing and fuzz testing, we design and implement an efficient gray-box testing tool for DL systems. It improves neuron coverage when testing DL systems and discovers more incorrect behaviors in neural networks. The tool can also improve the accuracy of neural networks by retraining them on inputs produced by generation-based fuzzing. Specifically, the key features of an image are first extracted as mutation points. A mutation algorithm then applies mutations that should preserve output consistency, such as key pixel mutation or image rotation. Finally, the images before and after mutation are fed to the deep neural network (DNN) for prediction, and the distance between the prediction results, together with the difference in neuron coverage, guides further mutation. This drives the DNN toward incorrect outputs and yields adversarial examples with the smallest alteration. Compared with traditional fuzz testing, which compares multiple models, this differential fuzz testing method significantly reduces testing difficulty and improves versatility.
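The guidance loop described above can be illustrated with a minimal sketch. It is not the tool itself: the toy network and its weights, the activation threshold, the small random perturbation (standing in for key pixel mutation), and the rule for keeping a mutant are all assumptions made for illustration.

```python
import random

# Hypothetical toy network: 4 inputs -> 3 hidden ReLU neurons -> 2 output scores.
W1 = [[0.5, -0.2, 0.1, 0.4],
      [-0.3, 0.8, -0.5, 0.2],
      [0.1, 0.1, 0.9, -0.7]]
W2 = [[1.0, -1.0, 0.5],
      [-0.5, 1.0, 0.2]]

def forward(x):
    """Return (hidden activations, output scores) for one input vector."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W1]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return hidden, out

def covered_neurons(hidden, threshold=0.25):
    """Indices of hidden neurons whose activation exceeds the (assumed) threshold."""
    return {i for i, h in enumerate(hidden) if h > threshold}

def predict(out):
    """Predicted class: index of the largest output score."""
    return max(range(len(out)), key=lambda i: out[i])

def l1(a, b):
    """L1 distance between two vectors of equal length."""
    return sum(abs(p - q) for p, q in zip(a, b))

def fuzz(seed, steps=200, eps=0.05, rng=None):
    """Mutate the seed, guided by prediction distance and newly covered neurons."""
    rng = rng or random.Random(0)
    seed_hidden, seed_out = forward(seed)
    coverage = covered_neurons(seed_hidden)
    current, best_dist, failures = list(seed), 0.0, []
    for _ in range(steps):
        # Small perturbation of one input value, clipped to [0, 1].
        mutant = list(current)
        i = rng.randrange(len(mutant))
        mutant[i] = min(1.0, max(0.0, mutant[i] + rng.uniform(-eps, eps)))
        hidden, out = forward(mutant)
        new_cov = covered_neurons(hidden) - coverage
        dist = l1(out, seed_out)
        # Differential check: the same model should agree on seed and mutant.
        if predict(out) != predict(seed_out):
            failures.append((l1(mutant, seed), mutant))
        # Keep mutants that cover new neurons or move the prediction further.
        if new_cov or dist > best_dist:
            coverage |= new_cov
            best_dist = max(best_dist, dist)
            current = mutant
    failures.sort(key=lambda f: f[0])  # smallest alteration first
    return coverage, failures
```

Calling `fuzz([0.5, 0.5, 0.5, 0.5])` returns the set of covered hidden neurons and any label-changing mutants, ordered by how little they differ from the seed. A real tool would replace the toy network with the DNN under test and the perturbation with image-specific mutations.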


Blockchain System Security

Blockchain technology has become a hot topic, and its application prospects attract wide attention from governments, research institutions, and businesses. As the technology develops, blockchain applications and projects keep emerging, but their security issues cannot be ignored. In recent years, blockchain security incidents have been increasing, with serious consequences and economic losses in the hundreds of millions. Security risks are widespread in smart contracts, virtual machines, and the underlying infrastructure.

Focusing on code security in blockchain systems, we detect and evaluate vulnerabilities and deficiencies of the blockchain system, analyze smart contracts, and evaluate the security of virtual machines, the facility layer, and other components. Based on the characteristics of blockchain platform code, we apply a hierarchical directed fuzzing method to test the platform systematically. The method extends the single input of traditional fuzz testing by splitting the target program according to its hierarchy, so that different levels of the program are tested through multiple entry points. It also extends the guidance of traditional methods (i.e. coverage) by adding code information such as commits, fixes, and interfaces to the calculation of the testing target, so that significant modules are tested. Specifically, there are two feature categories. The hierarchical features support testing the blockchain platform from different entry points, such as code compilation, bytecode execution, network communication, and data storage, to identify memory safety issues efficiently. The directed features support the blockchain platform at each step of its life cycle, including the initial import of open source code, the introduction of new features, and the repair of defects.
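As a rough illustration of how code information might enter the target calculation, the sketch below scores modules by uncovered code, recent commits, bug fixes, and interface status, then picks one target per hierarchy level. The `Module` fields, the weights, and the linear scoring formula are hypothetical and are not taken from the method itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    name: str
    layer: str            # hierarchy level, e.g. "compile", "bytecode", "network", "storage"
    coverage: float       # fraction of the module already covered, 0..1
    recent_commits: int
    bug_fixes: int
    is_interface: bool

# Hypothetical weights for each source of guidance.
WEIGHTS = {"uncovered": 1.0, "commits": 0.5, "fixes": 1.5, "interface": 2.0}

def target_score(m):
    """Combine coverage with commit, fix, and interface information."""
    return (WEIGHTS["uncovered"] * (1.0 - m.coverage)
            + WEIGHTS["commits"] * m.recent_commits
            + WEIGHTS["fixes"] * m.bug_fixes
            + WEIGHTS["interface"] * (1 if m.is_interface else 0))

def pick_targets(modules):
    """Return the highest-scoring module for each hierarchy level."""
    best = {}
    for m in modules:
        if m.layer not in best or target_score(m) > target_score(best[m.layer]):
            best[m.layer] = m
    return best

modules = [
    Module("parser", "compile", 0.5, 2, 0, False),
    Module("codegen", "compile", 1.0, 0, 1, True),
    Module("interp", "bytecode", 0.0, 0, 0, False),
]
targets = pick_targets(modules)  # one fuzzing target per entry point
```

Here the fully covered `codegen` module still outranks `parser` because it has a past fix and exposes an interface, which is the kind of prioritization that coverage alone would miss.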


Firmware & Binary Application Security

Industrial control systems (ICSs) are an important part of a nation's critical infrastructure and are widely used for production control. In many industries vital to people's livelihood, such as electric power, petrochemicals, natural gas, hydraulic engineering, intelligent manufacturing, and rail transportation, more than 80% of critical infrastructure uses some type of ICS. With the rapid development of the Internet and its continuous extension into various industries, previously closed and isolated ICSs are gradually becoming open and interconnected, and a highly automated, personalized, and interactive industrial Internet is emerging. At the same time, the industrial Internet faces more serious security threats. In 2016, the Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) confirmed 290 cyber-attacks against ICSs, and critical infrastructure, including power grids, hydraulic facilities, and transportation systems, became the target of attacks. In China, the Industrial Control System - Chinese National Vulnerability Database (ICS-CNVD) of CNCERT added 351 vulnerabilities in 2017, a 104% increase over the previous year. Efficiently mining the security vulnerabilities of ICSs is therefore of great significance.

We focus on vulnerability analysis and security enhancement of the industrial control software chain in complex industrial control environments. Our research develops a framework for automated vulnerability mining and vulnerability database management for the device drivers and firmware of industrial Internet devices, with the goal of guiding industrial control security practice. The work has three parts: 1) proposing an intelligent extraction strategy for firmware and hardware drivers; 2) designing an abstract model representation for industrial control firmware and drivers, and building a software chain vulnerability database based on this representation; 3) building a fast detection platform for large-scale industrial control devices in complex environments. We aim to break through the limitations of existing model representation and vulnerability analysis methods for the industrial control software chain. First, we extract the binary representation of firmware and drivers. Second, we strengthen the binary-based semantic flow abstraction model. Third, we apply deep learning (DL) to extract feature vectors and semantic signatures. Finally, we obtain an abstract, cross-platform representation of the vulnerability database for device drivers and firmware. Based on a static analysis method (i.e. semantic DL vulnerability clone detection) and a dynamic testing method (i.e. directed parallel fuzzing), we can store and match the features of a massive number of vulnerabilities, and eventually establish automated vulnerability mining and a vulnerability database for the large-scale industrial control software chain.
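The feature matching step can be sketched as a nearest-neighbor lookup over feature vectors. In the sketch below the vectors are hand-written placeholders rather than DL-extracted features, and the cosine similarity measure, the 0.9 threshold, and all function and vulnerability names are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_clones(function_vectors, vuln_db, threshold=0.9):
    """Return (function, vulnerability, score) triples above the threshold, best first."""
    hits = []
    for fname, fvec in function_vectors.items():
        for vid, vvec in vuln_db.items():
            score = cosine(fvec, vvec)
            if score >= threshold:
                hits.append((fname, vid, score))
    return sorted(hits, key=lambda h: -h[2])

# Hypothetical database entries and firmware functions (placeholder vectors).
vuln_db = {"VULN-A": [1.0, 0.0, 1.0], "VULN-B": [0.0, 1.0, 0.0]}
firmware_functions = {"fw_parse": [2.0, 0.0, 2.0], "fw_init": [0.0, 1.0, 1.0]}

hits = match_clones(firmware_functions, vuln_db)
```

In this toy data only `fw_parse` matches `VULN-A`, since its vector points in the same direction. A database holding a massive number of vulnerabilities would likely need an indexed similarity search rather than this pairwise loop.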


Application & OS Software Security

Operating systems (OS) contain a large amount of potentially vulnerable code that seriously threatens the reliability of business applications. According to statistics on 4,080 security patches for 3,094 CVE vulnerabilities from 2005 to 2016, the Linux kernel has the highest number of vulnerabilities, and five of the ten software products with the most vulnerabilities are related to Linux and its distributions. It is therefore important to ensure the reliability of business applications by mining kernel flaws and by guaranteeing that kernels can be deployed smoothly and remain stable.

We focus on kernel vulnerability mining. Our approach starts from statistics on the actual load in business scenarios and on vulnerabilities in historical kernel versions, takes crashes from fuzz testing as input, and produces vulnerability reports as output. The work can be divided into three major parts. First, for the load in business scenarios, a performance analysis tool extracts business scenario features, which serve as the basis for selecting the entry functions for fuzzing. Second, for vulnerabilities in historical kernel versions, a detection framework based on semantic learning performs vulnerability clone detection, and the locations of potential kernel vulnerabilities become the target functions for fuzz testing. Finally, based on the business features and potential vulnerabilities, feature-driven, vulnerability-oriented kernel fuzzing is performed. Our test model differs from existing fuzz testing methods, which either use no feedback or use only kernel code coverage as feedback to guide testing, without considering actual business scenarios. Guided by the load features of the business scenario and combined with kernel vulnerability clone detection, kernel fuzzing can be performed more efficiently and quickly.
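One way to picture how the first two parts feed the third is the sketch below, which selects fuzzing entry points that are both heavily used in the business load and able to reach a function flagged by clone detection. The profile counts, the reachability map, the helper function names, and the 5% load threshold are all made-up assumptions.

```python
def select_entries(profile, reachable, suspects, min_share=0.05):
    """Pick entry points that are hot in the workload and reach a suspected function.

    profile:   entry point -> sample count from a performance analysis tool
    reachable: entry point -> set of kernel functions it can reach
    suspects:  functions flagged by vulnerability clone detection
    """
    total = sum(profile.values())
    plan = []
    for entry, samples in profile.items():
        share = samples / total if total else 0.0
        targets = sorted(reachable.get(entry, set()) & suspects)
        if share >= min_share and targets:
            plan.append((entry, share, targets))
    # Heaviest business load first.
    return sorted(plan, key=lambda p: -p[1])

# Hypothetical workload profile and call reachability.
profile = {"read": 60, "write": 30, "ioctl": 8, "mount": 2}
reachable = {
    "read": {"fs_read_helper"},
    "write": {"fs_write_helper", "buf_copy"},
    "ioctl": {"drv_ioctl_helper"},
    "mount": {"fs_mount_helper"},
}
suspects = {"fs_write_helper", "drv_ioctl_helper", "fs_mount_helper"}

plan = select_entries(profile, reachable, suspects)
```

With this data, `read` is dropped because it reaches no suspected function and `mount` is dropped because it carries too little load, leaving `write` and `ioctl` as entry functions with their respective target functions.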


Model-driven Software Development

  With the rapid development of IoT technology, embedded systems not only play an important role in traditional fields such as aerospace, industrial control, and military equipment, but also reach ever deeper into people's daily lives, for example in autonomous driving, smart homes, and health care. Because software code is complex and its defects are hidden, problems are hard to notice once embedded devices have been produced, so any code error may lead to serious consequences. Traditional development based on hand-written code can hardly guarantee the absence of errors, whereas model-driven development, which combines visual modeling with automatic code generation from the model, improves development efficiency while effectively reducing human error.

  The research group has spent years studying code generation and model quality assurance for existing model-driven development tools. The main goal is to address three problems of existing tools: inefficient generated code, slow model simulation, and insufficient model testing. To this end, the research group designed a code generation framework, as shown in the figure above, and achieved several results based on it. First, the model resolution layer transforms models built with different modeling tools into a Model Intermediate Representation (MIR), which connects those tools to the framework. The MIR is highly extensible, so new modeling tools can be connected easily. Next, the scheduling transformation layer transforms the models of different semantics in the MIR into a Code Intermediate Representation (CIR) based on control flow logic. Based on the CIR, different code translators in the code translation layer carry out various tasks in the modeling process, such as generating code for deployment to the target device, for fast simulation, or for testing. A code optimization layer sits between the scheduling transformation layer and the code translation layer and performs multi-stage optimization of the CIR to achieve high-quality code generation. This code generation framework can help raise the overall level of code generation tools in the field of model-driven development.
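The layering can be illustrated with a toy pipeline. The sketch starts from an already parsed model, so the model resolution layer is omitted, and the dictionary-based MIR, the tuple-based CIR, the single peephole optimization, and the two translators are simplifications invented for this example rather than the framework's actual formats.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical MIR: block name -> (operation, input blocks, parameter).
MIR = {
    "u": ("input", [], None),
    "g1": ("gain", ["u"], 2.0),
    "g2": ("gain", ["u"], 1.0),
    "out": ("add", ["g1", "g2"], None),
}

def schedule(mir):
    """Scheduling transformation: order blocks by data dependency, emit CIR statements."""
    deps = {name: set(spec[1]) for name, spec in mir.items()}
    cir = []
    for name in TopologicalSorter(deps).static_order():
        op, inputs, param = mir[name]
        if op == "gain":
            cir.append((name, "mul", (inputs[0], param)))
        elif op == "add":
            cir.append((name, "add", tuple(inputs)))
        elif op != "input":  # inputs become function arguments, no statement
            raise ValueError(f"unknown block type: {op}")
    return cir

def optimize(cir):
    """Code optimization: replace multiplication by 1.0 with a plain copy."""
    return [(t, "copy", (a[0],)) if op == "mul" and a[1] == 1.0 else (t, op, a)
            for t, op, a in cir]

def expr_of(op, args):
    """Render one CIR right-hand side as C-style source text."""
    if op == "mul":
        return f"{args[0]} * {args[1]!r}"
    if op == "add":
        return " + ".join(args)
    return args[0]  # copy

def to_c(cir, inputs, output):
    """Code translator for deployment: emit a C function as text."""
    params = ", ".join(f"double {n}" for n in inputs)
    body = [f"    double {t} = {expr_of(op, a)};" for t, op, a in cir]
    return "\n".join([f"double step({params}) {{", *body, f"    return {output};", "}"])

def simulate(cir, env):
    """Code translator for fast simulation: evaluate the CIR directly."""
    env = dict(env)
    for target, op, args in cir:
        if op == "mul":
            env[target] = env[args[0]] * args[1]
        elif op == "add":
            env[target] = sum(env[a] for a in args)
        else:  # copy
            env[target] = env[args[0]]
    return env

cir = optimize(schedule(MIR))
c_source = to_c(cir, ["u"], "out")
result = simulate(cir, {"u": 3.0})["out"]
```

Because both translators consume the same optimized CIR, a further backend (for example one that emits test harness code) could be added without touching the scheduling or optimization steps, which is the main point of placing the optimization layer between them.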