AI System Privacy Audit: Role-Based Access Controls and GDPR Compliance

System in scope: doc_quality_compliance_check (backend API, session auth, RBAC layer, PostgreSQL user/session storage).

1. System Diagram

RBAC-relevant architecture facts used in this risk sheet:

  • The backend exposes a FastAPI application with two authentication modes: browser users authenticate with HTTP-only session cookies (email/password login), and service clients authenticate with an X-API-Key or Authorization: Bearer header.
  • Role enforcement uses a require_roles(...) dependency on every protected route. Roles defined: qm_lead, architect, riskmanager, auditor, service.
  • The service role is restricted to explicitly machine-to-machine routes (/api/v1/skills/*, /api/v1/observability/*); it is no longer a blanket bypass.
  • Two routes are currently unauthenticated: /api/v1/dashboard/* and /api/v1/templates/*.
  • Session data (email, roles, org, last_seen_at) is persisted to PostgreSQL. Bootstrap/MVP credentials are configured via environment variables.
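The role-enforcement behaviour described above can be illustrated with a minimal, framework-free sketch. It is not the project's actual require_roles implementation: the Caller and AccessDenied types are hypothetical, and the real dependency would receive its inputs from FastAPI rather than as plain arguments. Only the role names, the allow_service flag, and the two service route prefixes come from this sheet.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Route prefixes on which the service role is accepted (taken from the list above).
SERVICE_ROUTE_PREFIXES = ("/api/v1/skills/", "/api/v1/observability/")
SERVICE_ROLE = "service"


class AccessDenied(Exception):
    """Hypothetical stand-in for an HTTP 403 response."""


@dataclass(frozen=True)
class Caller:
    """Hypothetical identity resolved from a session cookie or an API key."""
    identity: str
    roles: FrozenSet[str]


def require_roles(*allowed: str, allow_service: bool = False) -> Callable[[Caller, str], Caller]:
    """Build a checker that admits callers holding at least one allowed role."""
    allowed_roles = frozenset(allowed)

    def check(caller: Caller, path: str) -> Caller:
        # Human roles: any overlap with the allowed set is sufficient.
        if caller.roles & allowed_roles:
            return caller
        # Service role: accepted only if the route opted in AND the path is
        # one of the machine-to-machine prefixes. It is never a blanket bypass.
        if (
            allow_service
            and SERVICE_ROLE in caller.roles
            and path.startswith(SERVICE_ROUTE_PREFIXES)
        ):
            return caller
        raise AccessDenied(f"{caller.identity} may not access {path}")

    return check
```

In this sketch a service caller is rejected on a document route even if that route were mistakenly declared with allow_service=True, because the path prefix is checked as well.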

2. Data Flow Analysis

| Data Flow | Source | Destination | Encrypted? | Logged? | Priority |
|---|---|---|---|---|---|
| Login request (email + password) | Browser / API client | POST /api/v1/auth/login → session store (PostgreSQL) | In-transit (HTTPS/TLS) | Auth event (success/failure, timestamp, email); login throttle state | High |
| Session cookie issued to browser | FastAPI auth route | Browser (HTTP-only Set-Cookie) | In-transit (TLS); cookie is HTTP-only, secure flag enforced outside dev | Session record in PostgreSQL (token hash, email, roles, org, expiry) | High |
| Authenticated request (cookie or API key) | Browser / service client | FastAPI route → require_roles dependency | In-transit (HTTPS/TLS) | last_seen_at updated in session row; access control outcome not separately audit-logged | High |
| Service-client auth (API key / Bearer) | Orchestrator or automation tool | FastAPI /skills/* or /observability/* endpoints | In-transit (HTTPS/TLS) | API key check in require_api_auth; no per-call access log entry | High |
| Unauthenticated dashboard access | Any caller (no credentials required) | GET /api/v1/dashboard/* | In-transit (HTTPS/TLS) | No auth check; no identity logged | Medium |
| Unauthenticated template access | Any caller (no credentials required) | GET /api/v1/templates/* | In-transit (HTTPS/TLS) | No auth check; no identity logged | Medium |
| Bootstrap user provisioning | Environment config (AUTH_MVP_* vars) | FastAPI startup → app_users table | Config-level (env vars / secrets manager) | Startup event; credential in env var outside runtime log | High |
| Role and org resolution (session row) | PostgreSQL session table | resolve_user_from_cookie / require_roles | At-rest DB controls | Implicit (session row lookup); no dedicated access-decision log | Medium |
| Session revocation on logout | DELETE /api/v1/auth/logout | PostgreSQL (is_revoked = True) | In-transit (HTTPS/TLS) | Logout event in session row state; no separate audit-trail entry | Medium |

Corrected interpretation for RBAC and GDPR

  • The primary GDPR boundary in the auth flow is the persistence of identity data (user_email, user_org, user_roles, last_seen_at) in the session table — minimisation and retention obligations apply.
  • The viewer role appears in tests (test_auth_authorization_api.py) but is not part of the implemented role set documented above. A caller with the viewer role receives 403 on all tested routes, so the role currently grants no access. Whether that is intentional is undocumented, which makes it a potential misconfiguration gap.
  • The service role hardening (no blanket bypass) is a positive control. Residual risk: the observability/* endpoint remains service-accessible and can return rich trace payloads containing personal data (see also Risk Sheet 1 — model trace over-collection).
  • The two unauthenticated routes expose application data without identity context, which is inconsistent with the GDPR accountability and access-control principles (GDPR Art. 5(1)(f), Art. 25).

3. Sensitive Data

Sensitive Data: User Identity in Session Store

  • Category: Personal data (identified natural persons) — GDPR Art. 4(1)
  • Examples: user_email (primary identifier), user_org, session_id, expires_at, last_seen_at
  • Why Sensitive: Directly identifies users and links their organisational role to access patterns and audit trails; retained in PostgreSQL with no visible TTL-based purge policy
  • Current Protection: Server-side session with hashed token; DB access controls; session expiry and revocation support
  • Risk (or Harm) if Exposed: Unauthorised disclosure of user identity and role assignments; GDPR breach; profiling of users from access patterns
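The "hashed token" protection noted above can be sketched as follows. This sheet does not state which hash the codebase uses, so SHA-256 and both function names are assumptions made for illustration. The point of the pattern is that the raw token exists only in the cookie, while the database holds a digest that cannot be replayed as a cookie value.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def issue_session_token() -> Tuple[str, str]:
    """Return (raw_token, token_hash); only the hash should be persisted."""
    raw_token = secrets.token_urlsafe(32)  # high-entropy value sent in the cookie
    token_hash = hashlib.sha256(raw_token.encode("utf-8")).hexdigest()
    return raw_token, token_hash


def token_matches(presented_token: str, stored_hash: str) -> bool:
    """Hash the presented cookie value and compare it in constant time."""
    presented_hash = hashlib.sha256(presented_token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)
```

A fast unsalted hash is acceptable here only because the token is random and high-entropy; the same shortcut would not be appropriate for user passwords.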

Sensitive Data: Role Assignments and Permission Scope

  • Category: Attributes linked to natural persons — implicit GDPR personal data when combined with identity
  • Examples: user_roles array per session row (e.g., ["qm_lead"], ["auditor"]), org isolation field user_org
  • Why Sensitive: Reveals organisational responsibilities and access privileges; can be used for social engineering or targeted attacks; GDPR data minimisation applies
  • Current Protection: Stored in session row; resolved per request by require_roles; role set validated at route boundary
  • Risk (or Harm) if Exposed: Privilege mapping; lateral movement by attacker with partial DB access; GDPR accountability gap if role-to-action mapping is not audited

Sensitive Data: Bootstrap/MVP Credentials in Environment Configuration

  • Category: Authentication secrets and provisioning data
  • Examples: AUTH_MVP_EMAIL, AUTH_MVP_PASSWORD, AUTH_MVP_ROLES, AUTH_MVP_ORG (env vars); SECRET_KEY (API key secret)
  • Why Sensitive: Compromise of bootstrap credentials gives attacker a fully provisioned account with configurable roles; SECRET_KEY grants service-client access to skills/* and observability/*
  • Current Protection: Configured through environment variables and kept out of the source code
  • Risk (or Harm) if Exposed: Full account takeover; unrestricted access to trace data; GDPR breach; credential reuse risk if same password is used across environments

Sensitive Data: Access Decision and Audit Context Not Separately Persisted

  • Category: Audit/traceability gap — GDPR Art. 5(2) accountability
  • Examples: Which role accessed which route at what time; 403 access denials; service-client route usage with payload summary
  • Why Sensitive: Absence of an access-decision audit log prevents retrospective investigation of data-access incidents; such a log is needed to evidence accountability alongside the GDPR Art. 30 Record of Processing Activities and to support breach response
  • Current Protection: last_seen_at updated on session lookup (coarse-grained); login throttle state tracked per email/IP
  • Risk (or Harm) if Exposed: Inability to detect or evidence unauthorised access; weakened GDPR breach-response capability

4. Privacy Risks

Risk 1: Unauthenticated routes expose application data without identity context

  • Priority: High
  • Risk Category: Access control — missing authentication
  • GDPR Reference: Art. 5(1)(f) — integrity and confidentiality; Art. 25 — data protection by design
  • Potential Harm/Impact: Any network-accessible caller can read dashboard data and templates without establishing identity; GDPR accountability principle violated (no record of who accessed what); potential leak of structural compliance artefacts or meta-information
  • Ability to Implement Control: High
  • Recommended controls:
      • Add require_authenticated_user dependency to all /api/v1/dashboard/* and /api/v1/templates/* routes.
      • If anonymous read-only access is intentional for templates, scope it to non-personal, non-compliance-sensitive data only and document the deliberate design choice in the privacy notice.
      • Add access logging (caller IP or session ID + timestamp) for these routes until auth is enforced.
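One way to keep this gap from reopening is a regression check over the route table. The sketch below assumes a hypothetical RouteInfo summary (path plus the names of its dependencies) that a test would build from the application's registered routes; the allowlist contents are also an assumption. The dependency names are the ones mentioned in this sheet.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Dependency names referenced in this sheet that establish caller identity.
AUTH_DEPENDENCIES = frozenset(
    {"require_authenticated_user", "require_roles", "require_api_auth"}
)

# Assumption: routes that are public by design, such as the login endpoint.
PUBLIC_ALLOWLIST = frozenset({"/api/v1/auth/login"})


@dataclass(frozen=True)
class RouteInfo:
    """Hypothetical summary of one registered route."""
    path: str
    dependencies: Tuple[str, ...] = ()


def find_unauthenticated_routes(routes: Iterable[RouteInfo]) -> List[str]:
    """Return paths that carry no auth dependency and are not allowlisted."""
    return sorted(
        route.path
        for route in routes
        if route.path not in PUBLIC_ALLOWLIST
        and not AUTH_DEPENDENCIES.intersection(route.dependencies)
    )
```

A test asserting that this function returns an empty list would fail on the current dashboard and template routes, and would pass once the first control is applied or the routes are deliberately allowlisted.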

Risk 2: Session table retains personal data (email, org, last_seen_at) without visible purge policy

  • Priority: High
  • Risk Category: Data retention and minimisation — GDPR Art. 5(1)(e) and Art. 25
  • GDPR Reference: Art. 5(1)(e) — storage limitation; Art. 13/14 — data subject transparency on retention
  • Potential Harm/Impact: Expired or revoked sessions (is_revoked = True) remain in app_user_sessions with full personal data indefinitely; contradicts GDPR storage limitation principle; complicates data-subject erasure requests (Art. 17)
  • Ability to Implement Control: High
  • Recommended controls:
      • Implement a scheduled job (e.g., nightly) to hard-delete session rows where is_revoked = True OR expires_at < NOW() - <grace_period>.
      • Define and document maximum retention period for session records (e.g., 30 days post-expiry).
      • On GDPR erasure request, immediately revoke and delete all session rows for the subject's email.
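The purge and erasure controls above can be sketched with the standard-library sqlite3 module standing in for PostgreSQL. The table and column names follow this sheet, but the column types, the 30-day grace period, and storing expires_at as epoch seconds are assumptions for the sake of a runnable example.

```python
import sqlite3
from datetime import datetime, timedelta

# Stand-in schema; the real table lives in PostgreSQL and may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS app_user_sessions (
    session_id TEXT PRIMARY KEY,
    user_email TEXT NOT NULL,
    is_revoked INTEGER NOT NULL DEFAULT 0,
    expires_at INTEGER NOT NULL  -- epoch seconds (assumption)
)
"""

GRACE_PERIOD = timedelta(days=30)  # assumption: example retention window


def purge_sessions(conn: sqlite3.Connection, now: datetime,
                   grace: timedelta = GRACE_PERIOD) -> int:
    """Hard-delete revoked rows and rows expired for longer than the grace period."""
    cutoff = int((now - grace).timestamp())
    cursor = conn.execute(
        "DELETE FROM app_user_sessions WHERE is_revoked = 1 OR expires_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount


def erase_subject_sessions(conn: sqlite3.Connection, email: str) -> int:
    """Delete every session row for one data subject (erasure request)."""
    cursor = conn.execute(
        "DELETE FROM app_user_sessions WHERE user_email = ?", (email,)
    )
    conn.commit()
    return cursor.rowcount
```

Both functions return the number of rows removed, which gives the scheduled job and the erasure workflow a figure to record as evidence that the deletion ran.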

Risk 3: No dedicated access-decision audit log (GDPR accountability gap)

  • Priority: High
  • Risk Category: Audit trail and accountability — GDPR Art. 5(2), Art. 30
  • GDPR Reference: Art. 5(2) — accountability; Art. 32 — security of processing; Art. 33 — breach notification readiness
  • Potential Harm/Impact: last_seen_at on session lookup provides coarse activity signal but does not record which route was accessed, what role was used, or whether a 403 was returned; in a breach scenario, no evidence trail exists for forensic investigation or Data Protection Authority reporting
  • Ability to Implement Control: High
  • Recommended controls:
      • Add a FastAPI middleware or route-level dependency that writes structured access-decision log entries: session_id, email, roles, method, path, status_code, timestamp.
      • Include 403 denial events in the log (route attempted, role required vs role held).
      • Protect this log with the same retention and access controls as the main audit-trail table.
      • Cross-reference: access-decision log entries should link to audit_events correlation_id where available.
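A possible shape for the access-decision log entry is sketched below. The field list mirrors the first control above; the AccessDecision class, the required_roles and denied fields, and the JSON-lines serialisation are illustrative assumptions rather than the project's format. The middleware or dependency that would populate it is not shown.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AccessDecision:
    """One access-control outcome; field names follow the control above."""
    session_id: Optional[str]
    email: Optional[str]
    roles: List[str]
    method: str
    path: str
    status_code: int
    required_roles: List[str] = field(default_factory=list)  # for denials
    correlation_id: Optional[str] = None  # link to audit_events if known
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(decision: AccessDecision) -> str:
    """Serialise a decision as a single JSON line for an append-only log."""
    record = asdict(decision)
    # Flag authentication and authorisation failures for easy filtering.
    record["denied"] = decision.status_code in (401, 403)
    return json.dumps(record, sort_keys=True)
```

Because the entry contains the caller's email and roles, it is itself personal data and falls under the retention and access controls named in the third control.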

Risk 4: viewer role defined in test code but absent from authorised role set creates misconfiguration risk

  • Priority: Medium
  • Risk Category: RBAC gap and role lifecycle governance
  • GDPR Reference: Art. 25 — data protection by design; Art. 32 — appropriate technical measures
  • Potential Harm/Impact: A caller with the viewer role receives 403 on all tested routes, but the role is not part of the documented role set. If a user is assigned viewer (e.g., through bootstrap config or future user management), they receive zero access, which may be the intended behaviour but is undocumented. Alternatively, if a future route is added without explicitly excluding viewer, the role could gain unintended access
  • Ability to Implement Control: High
  • Recommended controls:
      • Define an explicit canonical role registry (enum or constants file) listing all valid roles with their intended permission scope and whether they are human or machine roles.
      • Add a startup assertion or test that no undefined role names can be provisioned via bootstrap config.
      • Document whether viewer is a planned future role (read-only access) or should be removed entirely.
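A minimal sketch of the registry and the bootstrap validation follows. The role names come from the role list in section 1. The comma-separated format assumed for AUTH_MVP_ROLES and the function name are assumptions, since this sheet does not specify how that variable is encoded.

```python
from enum import Enum
from typing import List


class Role(str, Enum):
    """Canonical role registry; values match the roles listed in section 1."""
    QM_LEAD = "qm_lead"
    ARCHITECT = "architect"
    RISKMANAGER = "riskmanager"
    AUDITOR = "auditor"
    SERVICE = "service"


# Machine identities, as opposed to human users.
MACHINE_ROLES = frozenset({Role.SERVICE})

_VALID_NAMES = frozenset(role.value for role in Role)


def parse_bootstrap_roles(raw: str) -> List[Role]:
    """Parse a comma-separated role string, rejecting any undefined role name."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    unknown = sorted(name for name in names if name not in _VALID_NAMES)
    if unknown:
        raise ValueError(f"undefined role(s) in bootstrap config: {unknown}")
    return [Role(name) for name in names]
```

With this in place, provisioning viewer through bootstrap config fails at startup instead of silently creating an account with no access. If viewer is later adopted as a read-only role, it is added to the enum in one place.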

Risk 5: Service-client (X-API-Key) can access observability endpoints containing personal data in traces

  • Priority: Medium
  • Risk Category: Least-privilege and machine-identity access to personal data
  • GDPR Reference: Art. 25 — data protection by design (least privilege); Art. 32 — appropriate technical measures
  • Potential Harm/Impact: The observability/* endpoints are explicitly marked allow_service=True, meaning that an automated orchestrator or any bearer-token holder can retrieve rich trace payloads that may contain personal data from documents and prompts (see Risk Sheet 1, Risk 2). Service clients have no session identity or org context, making attribution difficult
  • Ability to Implement Control: Medium
  • Recommended controls:
      • Apply pre-response redaction to observability payloads served to service clients: strip prompt_text, output_text, and any fields flagged as personal-data-bearing before returning.
      • Log service-client access to observability endpoints with X-API-Key identifier hash and query parameters in the access-decision log.
      • Consider restricting service-client access to operational metadata only (latency, status, counts), not the rich trace payload.
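The redaction and key-hash controls can be sketched as below. Only prompt_text and output_text are named in this sheet; the payload shape, the function names, and the choice of a truncated SHA-256 fingerprint are assumptions for illustration.

```python
import hashlib
from typing import Any, FrozenSet

# Fields named in the control above; extend with other personal-data fields.
REDACTED_FIELDS = frozenset({"prompt_text", "output_text"})


def redact_trace(payload: Any,
                 redacted_fields: FrozenSet[str] = REDACTED_FIELDS) -> Any:
    """Return a copy of the payload with redacted fields removed at any depth."""
    if isinstance(payload, dict):
        return {
            key: redact_trace(value, redacted_fields)
            for key, value in payload.items()
            if key not in redacted_fields
        }
    if isinstance(payload, list):
        return [redact_trace(item, redacted_fields) for item in payload]
    return payload  # scalars pass through unchanged


def api_key_fingerprint(api_key: str) -> str:
    """Short, non-reversible identifier for logging which key was used."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:16]
```

The function builds a new structure rather than mutating its input, so the unredacted trace remains available for callers whose role permits it.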

Risk 6: Bootstrap/MVP credential pattern must not persist to production

  • Priority: Medium
  • Risk Category: Credential management and production hardening
  • GDPR Reference: Art. 32 — appropriate technical measures for security of processing
  • Potential Harm/Impact: If AUTH_MVP_EMAIL / AUTH_MVP_PASSWORD env vars are reused or not rotated between environments, a staging credential compromise grants production access; bootstrap accounts may accumulate active sessions that are never revoked; default password (CHANGE_ME_BEFORE_USE visible in test code) presents brute-force risk if left in place
  • Ability to Implement Control: High
  • Recommended controls:
      • Enforce a startup check: if AUTH_MVP_ENABLED=true and environment is production, reject startup unless password has been explicitly changed from default.
      • Document the bootstrap account as a break-glass account: rotate immediately after initial provisioning, restrict to a dedicated bootstrap role with minimal scope, and disable after handover.
      • Use a secrets manager (e.g., Vault, AWS Secrets Manager) for all production credential injection rather than plain environment variables.
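The startup check in the first control could look roughly like this. AUTH_MVP_ENABLED, AUTH_MVP_PASSWORD and the CHANGE_ME_BEFORE_USE default appear in this sheet; the ENVIRONMENT variable name, the exception class, and the function name are assumptions, because the sheet does not say how the deployment environment is signalled.

```python
from typing import Mapping

DEFAULT_PASSWORD = "CHANGE_ME_BEFORE_USE"  # default seen in test code


class BootstrapConfigError(RuntimeError):
    """Raised to abort startup when bootstrap credentials are unsafe."""


def validate_bootstrap_config(env: Mapping[str, str]) -> None:
    """Refuse to start in production with a missing or default bootstrap password."""
    enabled = env.get("AUTH_MVP_ENABLED", "false").strip().lower() == "true"
    # Assumption: the deployment environment is exposed as ENVIRONMENT.
    in_production = env.get("ENVIRONMENT", "development").strip().lower() == "production"
    if not (enabled and in_production):
        return
    password = env.get("AUTH_MVP_PASSWORD", "")
    if not password or password == DEFAULT_PASSWORD:
        raise BootstrapConfigError(
            "AUTH_MVP_ENABLED is true in production but AUTH_MVP_PASSWORD "
            "is missing or still set to the default"
        )
```

Calling validate_bootstrap_config(os.environ) before the application begins serving requests turns a silent misconfiguration into a failed deployment. The check does not detect a password reused from staging, which still requires rotation policy.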

5. RBAC Consistency with Risk Sheet 1 (Model Providers)

The following cross-cutting controls are shared with the model-provider risk set and must remain consistent:

| Control Area | Risk Sheet 1 Finding | Risk Sheet 2 Alignment Required |
|---|---|---|
| Observability endpoint access | Over-retained trace data with personal data | Service-client access to observability must apply same redaction policy (sheet 1, Risk 2) |
| Audit trail completeness | Model/version metadata per run | Access-decision log must link to same correlation_id as audit events |
| Least-privilege principle | Prompt/output access restricted by role | Role enforcement on observability and skills endpoints must not be wider than role enforcement on document/compliance routes |
| HITL approval for high-impact outputs | HITL role required for approval actions | HITL-approver role must map to a defined RBAC role (qm_lead or auditor) with explicit route permission; no anonymous or service-client approval permitted |

Additional information from the repo

What exists

  • RBAC: require_roles(...) gates API routes (e.g. documents, skills, research).
  • Cookies: Session cookie is HttpOnly, SameSite=lax, Secure enforced outside development (session_auth.py, config.py).
  • AuthenticatedUser includes org from session (user_sessions.user_org).
  • audit_events, audit_schedules, LogEventRequest: optional tenant_id / org_id fields for labeling.

Document and HITL scope

  • SkillDocumentORM has no org_id or tenant_id column.
  • search_documents (src/doc_quality/services/skills_service.py) queries all skill_documents rows (filters: type, optional text search against extracted_text / filename), capped by limit.
  • GET /api/v1/documents uses search_documents with empty query → returns up to 100 documents for any authenticated user with an allowed role — no filter by user.org.
  • HITL (hitl_workflow.py): no org/tenant parameters in queries.

Conclusion: The codebase matches a single-tenant / shared-database MVP: isolation is role-based, not organization-row-level. Multi-customer SaaS would need schema + query changes and consistent propagation of org from JWT/session into every read/write path.
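To illustrate the kind of query change the conclusion refers to, the sketch below scopes a document search to the caller's organisation. It uses sqlite3 as a stand-in for PostgreSQL, and the org_id column is hypothetical: as noted above, SkillDocumentORM currently has no such column. The function name and the id column are likewise assumptions.

```python
import sqlite3
from typing import List, Tuple

# Stand-in schema with a HYPOTHETICAL org_id column (not in the current model).
SCHEMA = """
CREATE TABLE IF NOT EXISTS skill_documents (
    id INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    org_id TEXT NOT NULL
)
"""


def search_documents_for_org(conn: sqlite3.Connection, user_org: str,
                             limit: int = 100) -> List[Tuple[int, str]]:
    """Return at most `limit` documents belonging to the caller's organisation."""
    return conn.execute(
        "SELECT id, filename FROM skill_documents "
        "WHERE org_id = ? ORDER BY id LIMIT ?",
        (user_org, limit),
    ).fetchall()
```

The org value would have to come from the authenticated session (AuthenticatedUser.org) and never from a request parameter, otherwise the filter could be bypassed by the caller.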