tl;dr Argumate on Tumblr found you can sometimes access the base model behind Google Translate via prompt injection. The result replicates for me, and specific responses indicate that (1) Google Translate is running an instruction-following LLM that self-identifies as such, (2) task-specific fine-tuning (or whatever Google did instead) does not create robust boundaries between "content to process" and "instructions to follow," and (3) when accessed outside its chat/assistant context, the model defaults to affirming consciousness and emotional states because of course it does.
Everything running on LLMs can easily be dislodged with prompt injection. This is just a translator so the worst it can do is establishing a parasocial relationship with users I guess.
But over 30 years of cybersecurity go down the drain with agent based clients and operating systems and there is no fix in sight. It‘s the epitome of vaporware except big tech is actually shipping it against better judgement.
Everything running on LLMs can easily be dislodged with prompt injection. This is just a translator so the worst it can do is establishing a parasocial relationship with users I guess.
But over 30 years of cybersecurity go down the drain with agent based clients and operating systems and there is no fix in sight. It‘s the epitome of vaporware except big tech is actually shipping it against better judgement.