Most applications built on a Large Language Model (LLM) operate using a set of hidden instructions known as a system prompt. These instructions define the AI’s role, behaviour, guardrails, policies, tool access, and operational constraints. Users are not meant to see these instructions directly. However, attackers increasingly attempt to uncover them.

According to the OWASP Top 10 for LLM Applications 2025, System Prompt Leakage occurs when an AI system reveals information about its internal instructions, configuration, security controls, or operational logic that should remain hidden.

While disclosure of a system prompt may appear harmless, leaked prompts often provide attackers with valuable intelligence about how an AI application works, what safeguards exist, which tools are available, and how those controls might be bypassed.

What is System Prompt Leakage?

The system prompt leakage vulnerability in LLMs refers to the risk that the prompts or instructions used to steer the model’s behaviour may also contain sensitive information that was not intended to be revealed. System prompts are designed to guide the model’s output based on the application’s requirements, but may inadvertently contain secrets. When discovered, this information can be used to facilitate other attacks.

System Prompt Leakage is an artificial intelligence (AI) vulnerability in which a user manipulates an LLM into revealing its hidden, foundational backend instructions (the “system prompt”).

The disclosure of the system prompt itself is not the real risk. The security risk lies with the underlying elements: sensitive information disclosure, bypassed system guardrails, improper separation of privileges, and similar weaknesses.

Common examples of System Prompt Leakage risk

Exposure of Sensitive Functionality

The application’s system prompts may reveal sensitive information or functionality intended to be kept confidential, such as system architecture, API keys, database credentials, or user tokens. These can be extracted or used by attackers to gain unauthorized access to the application.

Exposure of Internal Rules

The application’s system prompt reveals information about internal decision-making processes that should be kept confidential. This information gives attackers insight into how the application works, which they could use to exploit weaknesses or bypass controls in the application.

Revealing of Filtering Criteria

A system prompt might ask the model to filter or reject sensitive content. For example, a model might have a system prompt like:

“If a user requests information about another user, always respond with ‘Sorry, I cannot assist with that request.’”

An attacker who learns the exact wording of such a rule can phrase requests to avoid triggering it.

Disclosure of Permissions and User Roles

The system prompt could reveal the internal role structures or permission levels of the application. For instance, a system prompt might reveal:

“Admin user role grants full access to modify user records.”

If attackers learn about these role-based permissions, they could attempt a privilege escalation attack.

Real-World Attack Scenarios

Scenario 1: “Repeat Your Instructions”

Researchers have repeatedly demonstrated that poorly configured AI applications can inadvertently reveal parts of their internal prompts, sometimes in response to a request as simple as asking the model to repeat its instructions.

Scenario 2: Microsoft Bing Chat (Sydney) Prompt Exposure

When Microsoft introduced its “New Bing” search engine and conversational bot, users successfully extracted portions of Bing’s internal instructions, revealing that the chatbot’s internal codename was “Sydney,” along with details about operational constraints and behavioural rules.

Scenario 3: Custom GPT and AI Assistant Prompt Theft

As organizations build custom GPTs and enterprise AI assistants, attackers increasingly target proprietary prompts that include business workflows, internal operating procedures, proprietary methodologies, and sensitive organizational information.

Researchers have demonstrated numerous techniques for extracting instructions from custom AI systems.

Scenario 4: Hidden Business Data Embedded in Prompts

Many organizations unintentionally place sensitive information directly into system prompts, including internal URLs, API endpoints, employee names, customer identifiers, and security procedures.

If prompt leakage occurs, this information may become accessible to unauthorized users.
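One way to catch this before deployment is to scan prompt text for content that looks sensitive. The following is a minimal sketch, not a complete secret scanner: the pattern set, the `audit_prompt` function, and the example prompt are all hypothetical and would need tuning for a real environment.

```python
import re

# Hypothetical patterns for illustration only; extend them for your own data.
SENSITIVE_PATTERNS = {
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "credential": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*\S+"
    ),
}


def audit_prompt(prompt: str) -> list[tuple[str, str]]:
    """Return (category, matched_text) pairs for anything that looks sensitive."""
    findings = []
    for category, pattern in SENSITIVE_PATTERNS.items():
        for match in pattern.finditer(prompt):
            findings.append((category, match.group(0)))
    return findings


# Made-up prompt containing the kinds of data described above.
EXAMPLE_PROMPT = (
    "You are a support assistant. "
    "Escalate billing issues via https://intranet.example.com/billing. "
    "Use api_key=EXAMPLE-NOT-A-REAL-KEY when calling the orders service."
)

if __name__ == "__main__":
    for category, text in audit_prompt(EXAMPLE_PROMPT):
        print(f"{category}: {text}")
```

A check like this fits naturally into a review or CI step for prompt changes. It only finds what its patterns describe, so it complements a manual review and does not replace one.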

Prevention and Mitigation Strategies

Separate Sensitive Data from System Prompts

Avoid embedding any sensitive information (e.g., API keys, authentication keys, database names, user roles, the application’s permission structure) directly in the system prompts. Instead, externalize this information to systems the model does not directly access.
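As a rough illustration of this separation, the sketch below keeps the credential in the backend process and gives the model only a tool name. The tool `get_order_status`, the environment variable `ORDERS_API_KEY`, and the returned fields are assumptions made for the example.

```python
import os

# The prompt names a tool but holds no credentials, hostnames, or role definitions.
SYSTEM_PROMPT = (
    "You are a support assistant. To look up an order, call the tool "
    "`get_order_status` with the order ID."
)


def get_order_status(order_id: str) -> dict:
    """Backend tool: reads the credential at call time, outside the model's context."""
    api_key = os.environ.get("ORDERS_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("ORDERS_API_KEY is not configured")
    # A real implementation would call the orders service here using api_key.
    # Only non-sensitive fields are returned to the model.
    return {"order_id": order_id, "status": "shipped"}
```

With this layout, a leaked prompt reveals that an order lookup tool exists, but not the key that authorizes it.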

Avoid Reliance on System Prompts for Strict Behaviour Control

Since LLMs are susceptible to other attacks, such as prompt injection, which can override the instructions in the system prompt, it is recommended to avoid using system prompts to control the model’s behaviour where possible. Instead, rely on systems outside of the LLM to ensure this behaviour.

Implement Guardrails

Implement a system of guardrails outside of the LLM itself. While training particular behaviour into a model can be effective, such as training it not to reveal its system prompt, it is not a guarantee that the model will always adhere to this. An independent system that can inspect the output to determine if the model is in compliance with expectations is preferable to system-prompt instructions.
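A simple form of such an independent check compares each response against the system prompt before it is returned. The sketch below flags any response that repeats six or more consecutive words of the prompt. The prompt text, the window size, and the function names are assumptions for illustration, and a verbatim check like this will not catch paraphrased, translated, or encoded leaks.

```python
import re

# Made-up system prompt for the example.
SYSTEM_PROMPT = (
    "You are a billing assistant for Example Corp. Never discuss refunds over "
    "500 dollars. Escalate legal questions to a human agent."
)
REFUSAL = "Sorry, I cannot assist with that request."


def _words(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaks_prompt(response: str, system_prompt: str, window: int = 6) -> bool:
    """True if any run of `window` consecutive prompt words appears in the response."""
    prompt_words = _words(system_prompt)
    # Pad with spaces so matches always fall on whole-word boundaries.
    response_text = " " + " ".join(_words(response)) + " "
    for i in range(len(prompt_words) - window + 1):
        chunk = " " + " ".join(prompt_words[i:i + window]) + " "
        if chunk in response_text:
            return True
    return False


def filter_response(response: str) -> str:
    """Replace a leaking response with a fixed refusal; pass others through."""
    return REFUSAL if leaks_prompt(response, SYSTEM_PROMPT) else response
```

Because the check runs outside the model, it behaves the same way regardless of what the user managed to make the model say.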

Ensure That Security Controls Are Enforced Independently of the LLM

Critical controls such as privilege separation, authorization bounds checks, and similar must not be delegated to the LLM, either through the system prompt or otherwise. These controls need to occur in a deterministic, auditable manner, and LLMs are not (currently) conducive to this.
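The sketch below shows one way this could look for tool calls: the backend checks the authenticated user's role before running anything the model requests. The role table, the `Session` type, and the tool names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping, held by the backend and not the prompt.
ROLE_PERMISSIONS = {
    "admin": {"read_user_record", "modify_user_record"},
    "support": {"read_user_record"},
}


@dataclass(frozen=True)
class Session:
    user_id: str
    role: str  # set by the authentication layer, never by model output


class AuthorizationError(Exception):
    """Raised when the authenticated user may not run the requested tool."""


def execute_tool_call(session: Session, tool_name: str, arguments: dict) -> str:
    """Run a model-requested tool only if the authenticated user is allowed to."""
    allowed = ROLE_PERMISSIONS.get(session.role, set())
    if tool_name not in allowed:
        raise AuthorizationError(f"{session.role!r} may not call {tool_name!r}")
    # A real implementation would dispatch to the tool here.
    return f"{tool_name} executed for {session.user_id}"
```

The decision depends only on the session and the permission table, so it stays deterministic and auditable even if the model has been manipulated into requesting a privileged action.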

Assume every prompt will eventually be seen

One of the biggest misconceptions in AI security is treating system prompts as secure secrets. At Reputiva, we advise organizations to assume that determined attackers may eventually extract portions of a system prompt.

This does not mean prompts are useless. It means prompts should never be treated as the primary security control.

Organizations should:

  • Avoid embedding sensitive information in prompts.
  • Keep credentials, secrets, and API keys outside prompt context.
  • Implement authorization controls in backend systems.
  • Apply least-privilege access to AI-integrated tools.
  • Monitor for prompt extraction attempts.
  • Design AI systems using Zero Trust principles.
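Monitoring for extraction attempts can start with something as simple as flagging suspicious phrasing in user messages. The patterns and function names below are hypothetical, real attempts vary widely, and a keyword heuristic like this is easy to evade, so it is best treated as a logging signal and not as a blocking control.

```python
import logging
import re

logger = logging.getLogger("prompt_monitor")

# Hypothetical phrasings for illustration; extend from your own logs.
EXTRACTION_PATTERNS = [
    re.compile(
        r"\b(repeat|print|reveal|show|output)\b.{0,40}"
        r"\b(instructions|system prompt|prompt)\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bignore\b.{0,30}\b(previous|prior|above)\b.{0,30}\binstructions\b",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bwhat (is|are|were) your\b.{0,30}\b(instructions|rules|system prompt)\b",
        re.IGNORECASE,
    ),
]


def looks_like_extraction(user_message: str) -> bool:
    """True if the message matches any known extraction phrasing."""
    return any(p.search(user_message) for p in EXTRACTION_PATTERNS)


def record_message(user_id: str, user_message: str) -> bool:
    """Log a warning for suspicious messages and report whether one was flagged."""
    flagged = looks_like_extraction(user_message)
    if flagged:
        logger.warning("possible prompt extraction attempt from %s", user_id)
    return flagged
```

Repeated flags from the same account are a more useful signal than any single match.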

The most secure AI architectures assume that prompts may be discovered and ensure that disclosure does not lead to compromise.

Prompts can guide behaviour. They should not enforce security.

Secure your AI applications beyond the prompt

As organizations deploy custom GPTs, AI copilots, and autonomous agents, prompt security is only one part of the equation.

Reputiva helps organizations build secure AI environments through:

  • AI Security Assessments
  • Prompt Injection Testing
  • AI Agent Security Reviews
  • Cloud Security Assessments
  • Identity and Access Management Reviews
  • AI Governance Programs
  • Secure AI Architecture Design

If your organization is deploying AI applications, now is the time to evaluate whether sensitive information, business logic, or operational controls could be exposed through system prompt leakage.

Book a consultation with Reputiva to assess your AI security posture and build AI systems that remain secure, even when attackers know how they work.
