The system prompt discovered in the leak explicitly warns the model: “You are operating UNDERCOVER… Your commit messages… MUST NOT contain ANY Anthropic-internal information. Do not blow your cover.”
This is so incredibly stupid.
You’ve tried security.
You’ve tried security through obscurity.
Now try security through giving instructions to an LLM via a system prompt to not blow its cover.
This is so incredibly stupid.
You’ve tried security.
You’ve tried security through obscurity.
Now try security through giving instructions to an LLM via a system prompt to not blow its cover.