Johannes Bjerva
Imagine a world where generative AI (GenAI) exists free from any concerns for safety, security, privacy, trustworthiness, misinformation, or bias. We are far from this vision today, where negative consequences from GenAI, such as deepfakes, are ever present in the news cycle. A new threat to GenAI has emerged recently, as large language models can be attacked by malicious actors, leading to leakage of private data, manipulation of end-users, and even risks of medical misdiagnosis. For instance, an attacker can modify prompts to ‘trick’ a model into releasing private data. The project aims to draw upon the field of linguistics to mitigate such attacks, relying on the hypothesis that there are identifiable linguistic patterns in signals that attempt to negatively affect a GenAI model. If we can identify such patterns, we may have the key for safe and secure language-driven AI in the future.