Skip to content

feat(otel): advanced observability phase 2 - data layer and privacy hardening#3797

Draft
jaegeral wants to merge 16 commits into
google:masterfrom
jaegeral:otel-advanced-observability
Draft

feat(otel): advanced observability phase 2 - data layer and privacy hardening#3797
jaegeral wants to merge 16 commits into
google:masterfrom
jaegeral:otel-advanced-observability

Conversation

@jaegeral
Copy link
Copy Markdown
Collaborator

@jaegeral jaegeral commented May 5, 2026

This Pull Request implements Phase 2 of the OpenTelemetry instrumentation for Timesketch. The focus of this phase is extending distributed tracing into the data layer (OpenSearch and PostgreSQL)

Data Layer Visibility

  • OpenSearch: Implemented manual instrumentation for search() and count() methods. This captures targeted indices and internal processing latency (took_ms) without exposing the query content.

  • SQLAlchemy (PostgreSQL): Integrated automatic tracing for all database operations, providing visibility into connection health and statement execution time.

  • ACL Layer: Wrapped has_permission() checks in logical spans. This groups the many individual database metadata lookups (often 40+ per request) into a single, understandable context in the trace waterfall.

  • Content Exclusion: Raw search query strings and DSL structures are now completely excluded from span attributes. This ensures that no sensitive case data (PII, hostnames, identifiers) ever leaves the application via telemetry.

  • Sensitive Data Scrubber: A global SpanProcessor automatically redacts credentials (passwords, tokens, sessions) from any span attribute across the entire system.

  • Analyst Attribution: Authenticated user.id and user.name are captured on all API spans. This allows correlating performance bottlenecks or errors back to a specific investigator without seeing their search terms.-

Infrastructure & Documentation

  • Configuration: Relocated otel-collector-config.yaml to the central data/ directory for better deployment readiness.
  • Guides: Updated docs/OpenTelemetry.md with current architecture and configuration details.

@jaegeral
Copy link
Copy Markdown
Collaborator Author

jaegeral commented May 5, 2026

/gemini review

@jaegeral jaegeral requested a review from rgayon May 5, 2026 14:07
@jaegeral jaegeral self-assigned this May 5, 2026
@jaegeral jaegeral marked this pull request as ready for review May 5, 2026 14:07
- CHOKIDAR_USEPOLLING=true
- PROMETHEUS_MULTIPROC_DIR=/tmp/
- ENABLE_STRUCTURED_LOGGING=true
# - ENABLE_STRUCTURED_LOGGING=true
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread timesketch/lib/datastores/opensearch.py
@jaegeral jaegeral changed the title feat: Telemetry Privacy & Redaction and add a span for acl checks feat(otel): advanced observability phase 2 - data layer and privacy hardening May 6, 2026
@jaegeral
Copy link
Copy Markdown
Collaborator Author

jaegeral commented May 6, 2026

Had timesketch/models/acl.py:324:4: W9016: "permission, user" missing in parameter type documentation (missing-type-doc)

Copy link
Copy Markdown
Collaborator

@jkppr jkppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General:
Your PR says Safe Telemetry Calls: Implemented a @safe_telemetry_call decorator ... but the whole diff does not contain this decorator at all. Can you please ensure the PR message reflects what the implementation actually contains?

Comment thread docs/OpenTelemetry.md
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any opentelemetry modules we can enable on our Opensearch cluster to get better insights from the other end too?

Comment thread timesketch/lib/telemetry.py

# 2. Check for sensitive keywords in the value
lower_val = value.lower()
if any(keyword in lower_val for keyword in SENSITIVE_KEYWORDS):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the SENSITIVE_KEYWORDS contain pretty general stuff like key or auth, this will scrub data a bit too eagerly don't you think? e.g., searching for "windows_auth" or an error saying "index key_logs missing" will be redacted

Maybe limit value-based redaction to regex word boundaries or only run value-checks against suspicious keys rather than unconditionally across all string attributes. WDYT?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me see, I had a better idea yesterday evening on how to avoid any search related things to spill.

Returns:
opentelemetry.trace.Tracer: A tracer instance.
"""
return trace.get_tracer(name)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken this will lead to application crashes if HAS_OTEL=False. Either we need to return a dump tracker or ensure with other methods that the with blocks in the core application (e.g. ACL.py) don't crash.

if is_credential_key:
redacted_keys.append(key)
# pylint: disable=protected-access
span._attributes[key] = "[REDACTED]"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to mutate a protected member here?

Comment thread timesketch/models/acl.py
permission = codecs.decode(permission, "utf-8")
return self._get_ace(permission=permission, user=user)
tracer = telemetry.get_tracer(__name__)
with tracer.start_as_current_span("acl.has_permission") as span:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If what your PR message says is true (one request can generate 40+ calls), doesn't this wrapper now create 40+ tiny spans per request which will result in a bloat of traces and load on the GCP or Jaeger end?
Do we need this level of granularity or could we also aggregate AACL checks into a single parent span?

Also: How does this impact memory usage of the Timesketch services? I imagine the trace tree per request with multiple users is rather large?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it would create that many, but that is also true if we just enable sqlalchemy tracing :-/

self.assertEqual(attrs["custom_field"], "[REDACTED]")

# Verify analyst identity is NOT redacted
self.assertEqual(attrs["user.name"], "analyst@google.com")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, this is not redacted because it does not contain any of the keywords used on your list. So wouldn't this test be more accurate when you check for auth-analyst@google.com instead?

And we should not use @google.com here. Maybe use @example.com instead.

# Attributes that are explicitly exempt from PII redaction (e.g. analyst identity)
EXEMPT_PII_ATTRIBUTES = {"user.name", "user.id", "timesketch.user_id"}

class SensitiveDataScrubber(SpanProcessor):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this impact performance? If this scrubs every string doing a sub-string search for a list of keywords to redact, won't this impact the API and worker performance of Timesketch?

Would it be sufficient to work only on a list of known risk attribute keys?

@jaegeral
Copy link
Copy Markdown
Collaborator Author

jaegeral commented May 8, 2026

I think I will need to split up this PR. To messy to review

@jaegeral jaegeral marked this pull request as draft May 12, 2026 06:22
@jaegeral
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request significantly expands OpenTelemetry integration by adding manual tracing for OpenSearch queries and ACL checks, implementing SQLAlchemy auto-instrumentation, and introducing a PII scrubber for spans. It also improves log correlation by injecting trace context into application logs. The code review identified several critical issues where the application could crash if OpenTelemetry dependencies are missing, specifically due to unhandled exceptions in the logging and tracer initialization paths. Additionally, feedback highlighted that the PII scrubbing logic is overly aggressive and fails to use the defined exemption list, and noted a discrepancy in the documentation regarding search query capturing.

Comment on lines 166 to 167
except ImportError:
HAS_OTEL = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Since TraceLogFilter is defined inside the try block, it will not exist if the opentelemetry package is missing. This will cause an AttributeError in timesketch/app.py at line 286, crashing the application. You should provide a fallback implementation of TraceLogFilter in this except block.

Suggested change
except ImportError:
HAS_OTEL = False
except ImportError:
HAS_OTEL = False
class TraceLogFilter(logging.Filter):
"""Fallback filter when OpenTelemetry is not installed."""
def filter(self, record):
record.trace_id = "0" * 32
record.span_id = "0" * 16
return True

Returns:
opentelemetry.trace.Tracer: A tracer instance.
"""
return trace.get_tracer(name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This function will raise a NameError if the opentelemetry package is not installed, as the trace module is only imported within the try block at the top of the file. Since get_tracer is now called in core modules like opensearch.py and acl.py, this will crash the application for any user without the optional telemetry dependencies. This function must check HAS_OTEL and return a no-op tracer if it is False.


# 2. Check for sensitive keywords in the value
lower_val = value.lower()
if any(keyword in lower_val for keyword in SENSITIVE_KEYWORDS):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Checking for keywords like 'key' or 'session' within all attribute values is extremely aggressive for a digital forensics tool. This will likely cause significant data loss in telemetry; for example, any SQL statement containing a 'sessions' table or any registry path containing the word 'key' will be completely redacted. Consider removing these common words from the value-based check or using more precise matching (e.g., regex for tokens/passwords).

Comment on lines +154 to +155
if not HAS_OTEL:
return True
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If HAS_OTEL is False, this filter returns early without setting record.trace_id or record.span_id. This will cause a KeyError in timesketch/app.py at line 308 because the log formatter expects these attributes to exist. You should always ensure these attributes are initialized with default values (e.g., zeros) before checking HAS_OTEL.

Suggested change
if not HAS_OTEL:
return True
def filter(self, record):
record.trace_id = "0" * 32
record.span_id = "0" * 16
if not HAS_OTEL:
return True

Comment thread docs/OpenTelemetry.md
* **Async Exporting:** Spans are exported asynchronously using a `BatchSpanProcessor` to ensure minimal impact on application performance.
* **Flask:** Captures all HTTP requests, status codes, and analyst identity.
* **Celery:** Maintains trace context across background tasks (analyzers, data imports).
* **OpenSearch:** Manual instrumentation captures search query structure (`db.statement`), targeted indices, and internal processing time (`took_ms`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line states that search query structure is captured in db.statement, but the PR description and the implementation in timesketch/lib/datastores/opensearch.py indicate that query content is explicitly excluded for privacy hardening. Please update this documentation to reflect the actual behavior.


redacted_keys = []
# Create a copy of keys to avoid modification during iteration
for key in list(span.attributes.keys()):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The EXEMPT_PII_ATTRIBUTES set (defined at line 71) is never used in the on_end method. This means analyst identity attributes like user.name or user.id could be accidentally redacted if they happen to contain one of the sensitive keywords, which contradicts the goal of maintaining analyst attribution.

Suggested change
for key in list(span.attributes.keys()):
for key in list(span.attributes.keys()):
if key in EXEMPT_PII_ATTRIBUTES:
continue
# Protect our own audit info

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants