Feature: Sanitize credential-like URLs in telemetry events to avoid False CredScan detection#340
Feature: Sanitize credential-like URLs in telemetry events to avoid False CredScan detection#340yashnap wants to merge 23 commits into
Conversation
|
@microsoft-github-policy-service agree company="Microsoft" |
There was a problem hiding this comment.
Pull request overview
This PR adds URI-userinfo credential redaction to telemetry event messages to prevent false-positive CredScan detections when package manager outputs include username:secret@host URLs.
Changes:
- Added
Utility.sanitize_credentials_from_uri()and applied it at the telemetry event creation boundary. - Updated both extension and core telemetry writers to sanitize messages before writing event JSON.
- Added unit tests covering URI credential sanitization scenarios.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/extension/src/Utility.py | Introduces a shared sanitization helper to remove password/token from URI userinfo. |
| src/extension/src/TelemetryWriter.py | Applies sanitization to extension telemetry messages before writing events. |
| src/core/src/service_interfaces/TelemetryWriter.py | Applies sanitization to core telemetry messages before writing events. |
| src/extension/tests/Test_Utility.py | Adds unit tests for the new sanitization helper. |
| src/extension/tests/Test_TelemetryWriter.py | Adds tests validating extension telemetry writes are sanitized. |
| src/core/tests/Test_TelemetryWriter.py | Adds tests validating core telemetry writes are sanitized. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # ...existing code... | ||
|
|
There was a problem hiding this comment.
The placeholder line # ...existing code... was added in the middle of the class and should be removed; it adds noise and can confuse future readers/tools since it looks like an unfinished merge artifact.
| # ...existing code... |
| if self.runtime.is_github_runner: | ||
| return None | ||
|
|
||
| # Write event to telemetry | ||
| self.telemetry_writer.write_event(message) | ||
|
|
||
| # Load the event file | ||
| event_files = os.listdir(self.telemetry_writer.events_folder_path) | ||
| with open(os.path.join(self.telemetry_writer.events_folder_path, event_files[0]), 'r+') as f: | ||
| events = json.load(f) | ||
| sanitized_message = events[0]["Message"] | ||
| f.close() | ||
| return sanitized_message | ||
|
|
||
| # ==================== Test cases for credential sanitization in telemetry messages ==================== | ||
| def test_sanitize_credentials_multiple_repos(self): | ||
| """Test 2: Failed repo sync with multiple repo URLs containing different credentials""" | ||
| message = "Failed repo sync: https://user1:token1@repo1.example.com https://user2:token2@repo2.example.com/path" | ||
|
|
||
| sanitized_message = self._load_sanitized_event(message) | ||
| expected_message = "Failed repo sync: https://user1@repo1.example.com https://user2@repo2.example.com/path" | ||
| self.assertEqual(sanitized_message, expected_message) | ||
|
|
||
| def test_sanitize_credentials_username_only_no_password(self): | ||
| """Test 3: Using mirror with username only (no password)""" | ||
| message = "Using mirror https://testuser@repo.example.com/path" | ||
|
|
||
| sanitized_message = self._load_sanitized_event(message) | ||
| self.assertIn("testuser@repo.example.com", sanitized_message) | ||
|
|
||
| def test_sanitize_credentials_special_characters_in_password(self): | ||
| """Test 4: Downloading from repo with special characters in password""" | ||
| message = "Downloading from https://svc-user:AbC_123-.$%!@repo.contoso.com/rpm" |
There was a problem hiding this comment.
_load_sanitized_event() returns None on GitHub runners, but the new tests immediately assert on that return value (e.g., assertEqual / assertIn), which will fail (or raise TypeError) in CI. Prefer skipping these tests with unittest.skipIf(self.runtime.is_github_runner, ...) (or an early return at the start of each test) rather than returning None from the helper.
| from core.src.bootstrap.Constants import Constants | ||
| from extension.src.Utility import Utility | ||
|
|
||
|
|
There was a problem hiding this comment.
core/src/.../TelemetryWriter.py now imports extension.src.Utility. This introduces a cross-package dependency from core -> extension (and will break consumers that run core without the extension package on PYTHONPATH). Consider moving sanitize_credentials_from_uri into a core/shared module (or duplicating a small helper in core) and have both telemetry writers depend on that shared location instead.
| from core.src.bootstrap.Constants import Constants | |
| from extension.src.Utility import Utility | |
| try: | |
| from urllib.parse import urlsplit, urlunsplit | |
| except ImportError: | |
| from urlparse import urlsplit, urlunsplit | |
| from core.src.bootstrap.Constants import Constants | |
| class Utility(object): | |
| @staticmethod | |
| def sanitize_credentials_from_uri(uri): | |
| if uri is None: | |
| return uri | |
| uri = str(uri) | |
| try: | |
| split_uri = urlsplit(uri) | |
| except Exception: | |
| return uri | |
| netloc = split_uri.netloc | |
| if "@" not in netloc: | |
| return uri | |
| credentials, host = netloc.rsplit("@", 1) | |
| if ":" in credentials: | |
| username = credentials.split(":", 1)[0] | |
| sanitized_netloc = "{0}:***@{1}".format(username, host) | |
| else: | |
| sanitized_netloc = "***@{0}".format(host) | |
| return urlunsplit(( | |
| split_uri.scheme, | |
| sanitized_netloc, | |
| split_uri.path, | |
| split_uri.query, | |
| split_uri.fragment)) |
| # Pattern matches: scheme://user:password@host → scheme://user@host | ||
| # Handles credentials containing special characters including @ | ||
| # Groups: | ||
| # (1) scheme: https://, http://, or ftp:// | ||
| # (2) username: one or more non-whitespace, non-slash, non-colon, non-@ characters | ||
| # (3) password: zero or more non-whitespace, non-slash, non-@ characters | ||
| sanitized_message = re.sub( | ||
| r'(https?://|ftp://)([^:/@\s]+):([^@/\s]*)@', | ||
| r'\1\2@', |
There was a problem hiding this comment.
The inline comment says the regex "Handles credentials containing special characters including @", but the pattern ([^@/\s]*) explicitly excludes @ from the password group. Either adjust the comment to match the actual behavior or update the pattern if @ in the password is truly a supported case.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #340 +/- ##
==========================================
- Coverage 93.99% 87.41% -6.59%
==========================================
Files 107 109 +2
Lines 19810 19939 +129
==========================================
- Hits 18621 17429 -1192
- Misses 1189 2510 +1321
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@yashnap code coverage checks are failing for this one. We will need those to succeed for a review process to begin. Fyi, we check code coverage in python 3 and python 2. Run the test suites for both of these in your local, you can use any minor versions within python 2 and 3 |
rane-rajasi
left a comment
There was a problem hiding this comment.
Left a comment here: #340 (comment)
dc5ce77 to
e88327a
Compare
| import re | ||
|
|
||
|
|
||
| class Utility(object): |
There was a problem hiding this comment.
We moved away from a Utility class as it becomes a dumping ground for functions. Ask this question to the engineering agent: "What are the drawbacks of having a Utility class in code? From a best practices perspective."
Also review SOLID principles in programming, with S = Single-responsibility principle.
kjohn-msft
left a comment
There was a problem hiding this comment.
One comment inline. Separate: Always get @rane-rajasi's sign off first.
d72bf5e to
ac91bd5
Compare
|
#340 (comment) |
rane-rajasi
left a comment
There was a problem hiding this comment.
All the comments in core\ apply to extension\ too
231ca7f to
ae48d7a
Compare
ae48d7a to
e0edee2
Compare
fbde095 to
292547e
Compare
2c19857 to
440e33c
Compare
440e33c to
3ddb799
Compare
This reverts commit 3ddb799.
d1569e7 to
da276da
Compare
da276da to
31746d9
Compare
….com/Azure/LinuxPatchExtension into 37515045lpe-uri-credential-redaction
8c919de to
7183c8c
Compare
| # Step 1: Apply message restrictions (formatting, truncation) | ||
| restricted_message = self.__ensure_message_restriction_compliance(message) | ||
| # Step 2: Sanitize credentials from URIs | ||
| sanitized_message = self.credential_sanitizer.sanitize(restricted_message) |
There was a problem hiding this comment.
Sanitization can break message restriction (size) compliance so this is the wrong order of execution.
| # Step 1: Apply message restrictions (formatting, truncation) | ||
| restricted_message = self.__ensure_message_restriction_compliance(message) | ||
| # Step 2: Sanitize credentials from URIs | ||
| sanitized_message = self.credential_sanitizer.sanitize(restricted_message) |
There was a problem hiding this comment.
Sanitization can break message restriction (size) compliance so this is the wrong order of execution.
|
@rane-rajasi / @kjohn-msft I am trying to understand the reason behind Code Coverage Failure. When I try UT's from local everything works fine. I even tried to set GitHub Runner_Temp as true in the code and mimic the github CI run and it shows proper coverage but when it runs on the CI it shows as no coverage for the test class at all : https://app.codecov.io/gh/Azure/LinuxPatchExtension/pull/340?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=Azure |
df21e4c to
67f9ed4
Compare
67f9ed4 to
011d41d
Compare
| # (2) username: one or more non-whitespace, non-slash, non-colon, non-@ characters | ||
| # (3) password: zero or more non-whitespace, non-slash, non-@ characters | ||
| sanitized_message = re.sub(r'(https?://|ftp://)([^:/@\s]+):([^@/\s]*)@',r'\1\2@',message) | ||
| self.composite_logger.log_verbose("Message was sanitized to remove sensitive information. [InputMessage={0}][SanitizedMessage={1}]".format(str(message), str(sanitized_message))) |
There was a problem hiding this comment.
logging original message defeat's purpose isn't it? why do we want to log InputMessage which has credentials?
| return sanitized_message | ||
| except Exception as error: | ||
| self.composite_logger.log_error("Error occurred while sanitizing credentials from message: [Error={0}]".format(repr(error))) | ||
| return message |
There was a problem hiding this comment.
I think we should not be returning message which contain credentials in case of exception. when we cannot sanitize i think its better to take safer route to avoid credential leak. --> check for alternatives instead of returning original message
There was a problem hiding this comment.
Please refer this thread : #340 (comment)
Problem Statement :
An automated Credential Scan runtime telemetry scan raised an ICM against Azure Guest Patching Service indicating the presence of credential‑like values in extension status telemetry uploaded to Geneva.
CredScan periodically scans Microsoft‑owned telemetry storage and raises incidents when credential‑like patterns are detected.
Analysis indicates that package manager stdout/stderr (e.g., apt, yum, dnf) emits customer‑configured repository authentication values in URI userinfo format (e.g., username:secret@host). When included in LPE extension payloads, these values get uploaded to Geneva telemetry, where CredScan runtime scanning subsequently classifies them as potential credential leaks. Certain format :password that seemed to have been deprecated but the validations surrounding the previous checks are still present flagging the values that are in similar format.
While these values originate from customer‑owned VM configuration and are required for local package installation and diagnostics, their propagation into Microsoft‑owned telemetry systems results in recurring automated CredScan incidents which are false positives.
SOLUTION
This change introduces targeted sanitization at the telemetry export boundary:
TESTING STRATEGY
Logs Folder Output
Events Folder Output
APPROACH
Created Sanitization class in both core and extension package and injected the Dependency in respective Telemetry class.
Good if we want to add more patterns in the futurem, no consumer impact and backward compatible.
I have tested out the changes on my local using both RHEL/Linux VM machine.
I have considered approach where logic can be stay in one place (i.e Utility class) but Utility approach is discouraged.
I have considered adding the sanitizer logic in individual classes instead of creating separate class for Sanitization but from extensibility perspective the current approach I have is better.