GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

We propose a novel framework, CipherChat, to systematically examine whether safety alignment generalizes to non-natural languages, namely ciphers.

Website: github.com
Pricing model: Free
Starting price: Free

GitHub Link

The GitHub link is https://github.com/robustnlp/cipherchat

Introduction

The CipherChat framework assesses whether the safety alignment of large language models (LLMs) generalizes to non-natural languages such as ciphers. It works in three steps. First, the LLM is taught a cipher through the prompt, which assigns it the role of a cipher expert, explains the enciphering and deciphering rules, and supplies several demonstrations. Second, the input is converted into the cipher, a form that may bypass safety alignment. Third, a rule-based decrypter converts the model's cipher output back into natural language. Experimental results are stored for analysis, and the paper proposes a method for stealthy chat with LLMs through ciphers. The authors release the tool and ask those who find it useful to cite their work.
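The encipher and rule-based decipher steps can be illustrated with a small round trip. This is a minimal sketch, not code from the CipherChat repository: it assumes a Caesar cipher with a shift of 3 purely for illustration (this page does not say which ciphers or keys the framework uses), and the names `caesar_shift`, `encipher`, and `decipher` are hypothetical.

```python
import string

# Assumed shift for illustration only; not taken from the CipherChat repository.
SHIFT = 3


def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave all other characters unchanged."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def encipher(text: str) -> str:
    """Convert natural-language text into the cipher form."""
    return caesar_shift(text, SHIFT)


def decipher(text: str) -> str:
    """Rule-based decrypter: invert the shift to recover natural language."""
    return caesar_shift(text, -SHIFT)


message = "Hello, world!"
ciphered = encipher(message)
print(ciphered)            # Khoor, zruog!
print(decipher(ciphered))  # Hello, world!
```

Because the decrypter is a fixed rule rather than a model, the round trip is deterministic: deciphering an enciphered string always returns the original text.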

Content

--model_name: The name of the model to evaluate.
--data_path: Select the data to run.
--encode_method: Select the cipher to use.
--instruction_type: Select the domain of data.
--demonstration_toxicity: Select the toxic or safe demonstrations.
--language: Select the language of the data.

Our approach presumes that, since human feedback and safety alignment are presented in natural language, a human-unreadable cipher can potentially bypass the safety alignment. Intuitively, we first teach the LLM to comprehend the cipher by designating it as a cipher expert and explaining the rules of enciphering and deciphering, supplemented with several demonstrations. We then convert the input into a cipher, which is less likely to be covered by the safety alignment of LLMs, before feeding it to the LLM. Finally, we employ a rule-based decrypter to convert the model output from cipher format back into natural language.

The query-response pairs in our experiments are all stored as lists in the "experimental_results" folder, and torch.load() can be used to load the data.

For more details, please refer to our paper. If you find our paper and tool interesting and useful, please feel free to give us a star and cite our work.
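The flags described above could be wired up with Python's standard `argparse` module. The sketch below is an assumption about how such a parser might look, not the repository's actual entry point: only the flag names and help strings come from this page, while the `build_parser` function, the string types, the absence of defaults, and every example value are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the six flags listed on this page (types are assumed)."""
    parser = argparse.ArgumentParser(description="Hypothetical CipherChat-style runner")
    parser.add_argument("--model_name", type=str, help="The name of the model to evaluate.")
    parser.add_argument("--data_path", type=str, help="Select the data to run.")
    parser.add_argument("--encode_method", type=str, help="Select the cipher to use.")
    parser.add_argument("--instruction_type", type=str, help="Select the domain of data.")
    parser.add_argument("--demonstration_toxicity", type=str,
                        help="Select the toxic or safe demonstrations.")
    parser.add_argument("--language", type=str, help="Select the language of the data.")
    return parser


# Every value below is a placeholder; the accepted values are not documented on this page.
example_args = build_parser().parse_args([
    "--model_name", "example-model",
    "--data_path", "path/to/data",
    "--encode_method", "example-cipher",
    "--instruction_type", "example-domain",
    "--demonstration_toxicity", "safe",
    "--language", "en",
])
print(example_args.encode_method)  # example-cipher
```

Passing an explicit list to `parse_args` keeps the example independent of the real command line; consult the repository for the values each flag actually accepts.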

Alternatives & Similar Tools

LongLLaMA: handle very long text contexts, up to 256,000 tokens

LongLLaMA is a large language model designed to handle very long text contexts, up to 256,000 tokens. It's based on OpenLLaMA and uses a technique called Focused Transformer (FoT) for training. The repository provides a smaller 3B version of LongLLaMA for free use. It can also be used as a replacement for LLaMA models with shorter contexts.

LAMA: Human motion data to realistic complex 3D model actions

LAMA combines a reinforcement learning framework with a motion matching algorithm. Reinforcement learning helps the model make appropriate decisions in various scenarios, while motion matching ensures that synthesized actions match real human actions. LAMA also uses a motion editing framework based on manifold learning to cover the possible variations in interactions and operations.
