What Changes When PostHog Trains Models on Your Workspace Data by Default?
It changes who carries the risk. If your analytics or observability vendor trains AI models on your workspace data by default, your logs, replays, transcripts, and uploads can move from “customer-controlled telemetry” to “vendor training inputs” that may influence shared model behavior, unless you have an explicit exclusion via settings or contract. In practice, that expands your privacy, IP, and trade-secret exposure, raises the bar on what you must measure and redact before ingest, and forces you to productize controls (consent, filtering, retention) rather than treat analytics as a passive pipe.
How “Training on Workspace Data” Actually Works (and What It Touches)
In this context, training means using customer-provided inputs—prompts, logs, behavioral/transactional records, support transcripts, uploaded documents—as part of datasets that update or fine-tune models (including LLMs). The important mechanism is that training can occur on:
- Raw inputs (highest risk, highest fidelity)
- Filtered/anonymized inputs (risk reduced, not eliminated)
- Aggregated signals (lower fidelity, often lower risk)
The builder consequence: you can no longer assume “we didn’t send user content to an LLM.” If your workspace data includes user-entered text, URLs, identifiers, stack traces, or “helpful” debugging payloads, those can end up inside datasets used to improve vendor models—depending on vendor policy and your contractual posture.
Why It Matters Now
The current inflection is disclosure plus default behavior. UpGuard analyzed 176 privacy notices from 250 monitored vendors using GPT-based classification and found that about 40% mention AI in a way that could involve personal data, and about 20% of those indicate they train models on user input by default—signaling a real shift toward “silent default training” as a product norm rather than an exceptional program.
At the same time, policy and legal commentary is converging on a few expectations: affirmative opt-in is increasingly favored; personal information should be filtered from chat inputs by default; and transparency about training data use is becoming a front-line obligation. For solo builders who ship fast and rely on third-party platforms, this is the point where “privacy policy reading” becomes an engineering task: you need instrumentation and controls, not just a checkbox in procurement.
What You Must Measure First: Signal Discovery
Start with an inventory of what your PostHog workspace collects that could plausibly be used for training datasets. Typical categories called out across vendor training pipelines include:
- Chat inputs and free-form text fields
- Usage logs and telemetry
- Customer-uploaded documents
- Support transcripts
Treat session replay and event payloads as high-risk by default because they often contain unintended content (typed PII, access tokens pasted into fields, URLs with query params, etc.). Then quantify contamination risk:
- Run automated scanners over representative exports (or your own mirror) using regex + entropy checks for secrets and identifiers.
- Measure frequency, not just presence: “1 event type leaks tokens 0.1% of the time” is still actionable, because training pipelines operate at scale. At that rate, ten million events of that type would carry roughly 10,000 leaked tokens.
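The scan described above can be sketched in a few dozen lines. This is a minimal illustration, not a complete secret scanner: it assumes events have already been exported as dictionaries with `event` and `properties` keys, and the patterns, the 24-character candidate length, and the 4.0 bits-per-character entropy threshold are all assumed starting points that would need tuning against your own data.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative patterns only (assumptions); a real audit needs a broader, tuned set.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "bearer_token": re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]{20,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

# Long runs of token-like characters are candidates for the entropy check.
TOKEN_CANDIDATE = re.compile(r"[A-Za-z0-9+/_=-]{24,}")
ENTROPY_THRESHOLD = 4.0  # bits per character; assumed, tune on real data


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def iter_strings(value):
    """Yield every string nested inside the dicts and lists of a payload."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_strings(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from iter_strings(v)


def findings_for(event):
    """Return the set of finding names present in one event's properties."""
    found = set()
    for text in iter_strings(event.get("properties", {})):
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                found.add(name)
        for candidate in TOKEN_CANDIDATE.findall(text):
            if shannon_entropy(candidate) >= ENTROPY_THRESHOLD:
                found.add("high_entropy_string")
    return found


def contamination_rates(events):
    """Map each event name to {finding: fraction of events containing it}."""
    totals = Counter()
    hits = defaultdict(Counter)
    for event in events:
        name = event.get("event", "<unknown>")
        totals[name] += 1
        for finding in findings_for(event):
            hits[name][finding] += 1
    return {
        name: {f: count / totals[name] for f, count in hits[name].items()}
        for name in totals
    }
```

Reporting a rate per event type, rather than a yes/no flag, lets you rank which instrumentation points to fix first. Entropy checks produce false positives on hashes and opaque IDs, so treat `high_entropy_string` as a prompt for manual review.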
Finally, track contractual scope per customer/project: which tenants are covered by DPAs/AI addenda that forbid training or restrict “derivative use,” and which are not. If you don’t map this, you can’t reliably promise customers anything.
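One way to keep that mapping honest is to store it as data and query it before making any promise. The sketch below is hypothetical throughout: `TenantScope`, `TrainingUse`, and the `agreement_ref` field are illustrative names, and the rule that a promise requires both a prohibition and a reference to a signed agreement is an assumption about how you might want to operate.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Optional


class TrainingUse(Enum):
    PROHIBITED = "prohibited"    # contract forbids vendor training
    RESTRICTED = "restricted"    # contract limits derivative use
    UNSPECIFIED = "unspecified"  # no DPA or AI addendum language on file


@dataclass(frozen=True)
class TenantScope:
    tenant_id: str
    training_use: TrainingUse
    agreement_ref: Optional[str] = None  # pointer to the signed document


def can_promise_no_training(tenant_id: str, scopes: Dict[str, TenantScope]) -> bool:
    """Only promise when a prohibition is on file with a document to back it."""
    scope = scopes.get(tenant_id)
    return (
        scope is not None
        and scope.training_use is TrainingUse.PROHIBITED
        and scope.agreement_ref is not None
    )


def unmapped_tenants(active_ids: Iterable[str], scopes: Dict[str, TenantScope]) -> List[str]:
    """Tenants with no record, or a record that says nothing about training."""
    return sorted(
        t for t in active_ids
        if t not in scopes or scopes[t].training_use is TrainingUse.UNSPECIFIED
    )
```

Running `unmapped_tenants` against your active tenant list gives a concrete backlog of contracts to review.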
Immediate Mitigations: Technical and Operational
Your first mitigation is governance: decide whether you want any of your workspace data used for vendor training at all. From there, your practical controls are:
- Consent path (opt-out/opt-in): implement a hard stance for your org, and—if you embed PostHog into a product—consider mirroring that stance in your own onboarding so customers can express a preference you can operationalize.
- Pre-ingest filtering: apply client-side or server-side redaction for PII and secrets before data reaches analytics ingestion. Don’t rely on later “training filters” you can’t verify.
- Data minimization & retention: send only what you need. Shorter retention windows reduce what exists to be repurposed into training datasets and lower breach blast radius.
- Isolation for sensitive projects: where vendors offer private fine-tuning / isolated models, prefer it for high-sensitivity work so your data doesn’t influence shared weights.
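As a rough illustration of the pre-ingest filtering item above, the following sketch redacts an event dictionary before it is handed to any analytics client. It deliberately does not call a vendor SDK. The key names in `SENSITIVE_KEYS`, the email pattern, and the `[REDACTED]` placeholder are assumptions, and dropping the whole query string and fragment from URLs is a conservative choice that may remove parameters you actually want.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed names; extend with the fields your own events actually carry.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[REDACTED]"


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def redact_value(key, value):
    """Recursively redact one value; the key decides whether it is dropped whole."""
    if str(key).lower() in SENSITIVE_KEYS:
        return REDACTED
    if isinstance(value, dict):
        return {k: redact_value(k, v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_value(key, v) for v in value]
    if isinstance(value, str):
        if value.startswith(("http://", "https://")):
            value = strip_query(value)
        return EMAIL.sub(REDACTED, value)
    return value


def redact_event(event):
    """Return a redacted copy of the event; the input is not modified."""
    props = event.get("properties", {})
    return {**event, "properties": redact_value("properties", props)}
```

Calling `redact_event` at the single point where events leave your code means the filter is something you can test and verify yourself, which is the point of not relying on downstream training filters.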
If you’re building agentic features on top of analytics and automated workflows, treat outbound actions as part of the same threat model: logs can contain the very data an agent later “helpfully” sends. See: How to Stop Agents from Silently Exfiltrating Files via Outbound Messages.
Contract and Policy Guardrails to Demand
Most of the long-term control surface is contractual. Your minimum viable AI training clause should:
- Define “use” precisely (raw inputs, derivatives, embeddings, aggregated signals).
- Require notice and consent for any training use (or prohibit it outright).
- Specify ownership and IP around derivative models and outputs—especially whether the vendor can commercialize improvements influenced by your data.
- Require security controls and auditability, including subprocessors and third-party model providers (even if the vendor claims training is “in-house,” you still want the paper trail).
Also require deletion/forget mechanisms and documented data lineage for training datasets. Without provenance, you can’t answer customer questions, and you can’t verify compliance if policies shift.
How to Productize the Change (So It’s Not a Fire Drill)
Treat this as a feature, not a legal footnote. The pattern that holds up is to expose privacy tiers that directly gate what telemetry leaves your boundary:
- “Strict”: no free-form text, aggressive redaction, minimal event payloads
- “Default”: balanced payloads, targeted filters
- “Developer”: richer debugging, explicit warnings and time-boxed retention
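The three tiers above can be expressed as data so that one function decides what leaves your boundary. This is a sketch under stated assumptions: the property names, the allowlist contents, the retention values (30, 90, and 7 days), and the `retention_days` field stamped onto the outgoing event are all hypothetical and would need to match your own schema and vendor settings.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Assumed names for properties that hold user-entered text.
FREE_TEXT_KEYS = frozenset({"message", "comment", "search_query"})


@dataclass(frozen=True)
class TierPolicy:
    allowlist: Optional[FrozenSet[str]]  # None means no allowlist is applied
    denylist: FrozenSet[str]
    retention_days: int


TIERS = {
    # Strict: minimal payload via allowlist, and no free-form text.
    "strict": TierPolicy(frozenset({"plan", "screen", "duration_ms"}), FREE_TEXT_KEYS, 30),
    # Default: keep most properties, drop a targeted set.
    "default": TierPolicy(None, frozenset({"email", "ip_address", "auth_token"}), 90),
    # Developer: richer payloads, offset by a short retention window.
    "developer": TierPolicy(None, frozenset(), 7),
}


def apply_tier(event, tier):
    """Return a filtered copy of the event; unknown tiers fall back to strict."""
    policy = TIERS.get(tier, TIERS["strict"])
    props = event.get("properties", {})
    if policy.allowlist is not None:
        props = {k: v for k, v in props.items() if k in policy.allowlist}
    props = {k: v for k, v in props.items() if k not in policy.denylist}
    return {**event, "properties": props, "retention_days": policy.retention_days}
```

Falling back to the strict policy for an unrecognized tier name is a fail-closed choice: a typo in configuration results in less data being sent, not more.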
Then automate enforcement in CI/CD and instrumentation so developers don’t make ad-hoc decisions per event. If you’re experimenting with “AI insights” features fed by analytics (summaries, anomaly explanations, replay insights), run controlled tests comparing value vs risk across tiers. The goal is to quantify what you lose when you minimize, not to guess. Related builder workflow: How a Solo Builder Can Run Multi‑Model LLM Code Reviews That Actually Improve Code.
Practical Checklist for a Solo Builder (Quick Wins)
- Audit what you’ve sent in the last 90 days for PII/secrets (regex + entropy).
- Decide your stance: allow training, prohibit it, or gate it by tier.
- Add redaction hooks for known sensitive fields (tokens, emails, IDs, URLs).
- Reduce payload granularity and retention to what your product truly needs.
- Update onboarding + legal templates so consent/prohibition isn’t implicit.
- Track product impact: measure whether any “AI improvements” justify the expanded risk surface.
Risks That Persist (Even If the Vendor Says “Anonymized”)
Anonymization helps, but it’s not a magic eraser. Aggregation and linkage can re-identify data in some settings, and trained models can sometimes reproduce fragments of inputs. Even if training is internal, you still need contractual proof of what data flows into training sets, how long it’s kept, and what remediation exists if leakage is observed.
The posture to adopt: assume mistakes will happen, and design for detection, minimization, and reversibility—because once something is in a training corpus or influences shared weights, “undo” is operationally hard.
What to Watch
- Whether vendor defaults move toward affirmative opt-in and default filtering of personal data from inputs.
- New or revised DPAs/AI addenda: look for retention, derivative-use language, subprocessors, and audit rights.
- Ongoing evidence of disclosure gaps: UpGuard’s numbers suggest many vendors still under-specify training behavior in privacy notices.
- Emerging guidance that pushes stronger transparency about training datasets and model risk classifications—expect procurement questions to become more specific, not less.
Sources:
https://www.upguard.com/blog/third-party-risk-ai-models-trained-on-user-data
https://www.terms.law/2024/05/02/legalities-of-using-customer-data-for-ai-training/
https://hai.stanford.edu/news/be-careful-what-you-tell-your-ai-chatbot
About the Author
yrzhe
AI Product Thinker & Builder. Curating and analyzing tech news at TechScan AI.