arXiv:2402.18945v1 Announce Type: new
Abstract: Pre-trained language models (PLMs) have been found susceptible to backdoor attacks, which can transfer vulnerabilities to various downstream tasks. However, existing PLM backdoors are conducted with explicit triggers under the manually aligned, thus failing to satisfy expectation goals simultaneously in terms of effectiveness, stealthiness, and universality. In this paper, we propose a novel approach to achieve invisible and general backdoor implantation, called \textbf{Syntactic Ghost} (synGhost for short). Specifically, the method hostilely manipulates poisoned samples with different predefined syntactic structures as stealth triggers and then implants the backdoor to pre-trained representation space without disturbing the primitive knowledge. The output representations of poisoned samples are distributed as uniformly as possible in the feature space via contrastive learning, forming a wide range of backdoors. Additionally, in light of the unique properties of syntactic triggers, we introduce an auxiliary module to drive the PLMs to learn this knowledge in priority, which can alleviate the interference between different syntactic structures. Experiments show that our method outperforms the previous methods and achieves the predefined objectives. Not only do severe threats to various natural language understanding (NLU) tasks on two tuning paradigms but also to multiple PLMs. Meanwhile, the synGhost is imperceptible against three countermeasures based on perplexity, fine-pruning, and the proposed maxEntropy.

Source link