
Add RFC-0055 Out-of-Tree Platform Build and Distribution #97

Open
afrittoli wants to merge 1 commit into pytorch:master from afrittoli:rfc0051

Conversation

@afrittoli


No description provided.

Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
@meta-cla meta-cla Bot added the cla signed label Jun 22, 2026
@groenenboomj


Red Hat is also very interested in supporting a nightly runner signal.

@albanD

albanD commented Jun 22, 2026

Contributor

Thanks for sending this RFC. I expect we'll finish the current CRCR for testing, focus on onboarding projects there, and make sure we get benefits from that before we build more pieces there.
But happy to take a closer look at this one after that!

@afrittoli afrittoli changed the title Add RFC-0051 Out-of-Tree Platform Build and Distribution Add RFC-0055 Out-of-Tree Platform Build and Distribution Jun 22, 2026
@afrittoli

Author

> Thanks for sending this RFC. I expect we'll finish the current CRCR for testing, focus on onboarding projects there, and make sure we get benefits from that before we build more pieces there. But happy to take a closer look at this one after that!

Thanks @albanD - feedback would be welcome - I believe there's plenty of design and prototyping work that I can look into in parallel to the current work on CRCR.


Each platform operates in an isolated lane:

- **Credential isolation**: Each platform has a dedicated IAM role that can only write to that platform's storage prefix. OIDC trust policies scope the role to the specific vendor repo. A compromised vendor repo cannot access another platform's storage or the main PyTorch artifact space.

Contributor


This implies AWS S3 infra. It also implies centralized management of the IAM roles, which I guess would fall to the RelEng team (which is very Meta-heavy atm).

Author


The RFC describes credential isolation within the existing AWS-based storage infrastructure (S3, IAM), which is what PyTorch uses today.

Management and payment of the S3 bucket are both handled by Meta today. Changing that is a conversation beyond the scope of the RFC, but the implications need to be considered:

  • costs: adding more platforms on the same S3 infra would add to the storage costs. Do you think this would be an issue?
  • management: the idea is to let vendors add and remove binaries themselves, to avoid extra burden on the RelEng team. The only thing required would be provisioning (and deprovisioning) the roles that grant or revoke vendor access.

Each platform operates in an isolated lane:

- **Credential isolation**: Each platform has a dedicated IAM role that can only write to that platform's storage prefix. OIDC trust policies scope the role to the specific vendor repo. A compromised vendor repo cannot access another platform's storage or the main PyTorch artifact space.
- **Upload workflow isolation**: Uploads go through the official `_binary_upload.yml` workflow, which enforces naming conventions before writing to S3. Once [Stage 3](#implementation-plan) is complete, this workflow also generates provenance attestations. If the `job_workflow_ref` dual-gate can be confirmed (see [Credentials and Publishing Access](#credentials-and-publishing-access)), vendors cannot bypass this workflow even with valid OIDC credentials.
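
The dual-gate described in the bullet above can be illustrated as a check over the claims of a GitHub Actions OIDC token. This is only a sketch of the logic, not the RFC's implementation: `repository` and `job_workflow_ref` are real claims in GitHub's OIDC tokens, but the repository name and the workflow path below are hypothetical placeholders, and whether the AWS trust policy can evaluate `job_workflow_ref` directly is exactly the open question the RFC flags.

```python
# Hypothetical values for illustration only; neither is taken from the RFC.
ALLOWED_REPOSITORY = "example-vendor/pytorch-platform"
ALLOWED_WORKFLOW_PREFIX = "pytorch/pytorch/.github/workflows/_binary_upload.yml@"


def passes_dual_gate(claims: dict) -> bool:
    """Return True only if both gates hold.

    `claims` is assumed to be the payload of an OIDC token whose signature
    has already been verified elsewhere.
    """
    # Gate 1: the token was issued to a workflow run in the expected vendor repo.
    repo_ok = claims.get("repository") == ALLOWED_REPOSITORY
    # Gate 2: the running job is defined in the official reusable upload workflow.
    workflow_ref = str(claims.get("job_workflow_ref", ""))
    workflow_ok = workflow_ref.startswith(ALLOWED_WORKFLOW_PREFIX)
    return repo_ok and workflow_ok
```

With both gates in place, a vendor workflow that obtains a valid token but does not run through the shared upload workflow would be rejected, because its `job_workflow_ref` points at the vendor's own workflow file.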

Contributor


I don't think `_binary_upload.yml` can really be used to enforce anything. Enforcement must happen at the IAM level, which is hard and implies a lot of heavy lifting from the RelEng team.

Author


The RFC is designed to reduce that heavy lifting as much as possible. I think it should be possible to reduce it to adding or removing a single line in Terraform to provision or de-provision a role.

The idea is to use OIDC for AWS/GitHub Actions. The role must be associated with the vendor workflow on the vendor repo, and it would grant access to the S3 bucket only in the vendor specific namespace.
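
As a rough sketch of what such a role could look like, the two policy documents can be generated from a few parameters. The OIDC provider host, the `sts:AssumeRoleWithWebIdentity` action, and the `aud`/`sub` condition keys follow the documented GitHub Actions and AWS setup; the account ID, bucket, repo, and prefix names are hypothetical, and `s3:PutObject` alone is an assumption (letting vendors remove binaries would also need `s3:DeleteObject`).

```python
GITHUB_OIDC_PROVIDER = "token.actions.githubusercontent.com"


def trust_policy(account_id: str, vendor_repo: str, ref: str = "refs/heads/main") -> dict:
    """Trust policy: only workflows from `vendor_repo` at `ref` may assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_PROVIDER}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{GITHUB_OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                    f"{GITHUB_OIDC_PROVIDER}:sub": f"repo:{vendor_repo}:ref:{ref}",
                }
            },
        }],
    }


def write_policy(bucket: str, platform_prefix: str) -> dict:
    """Permission policy: writes are limited to the platform's own prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject"],  # assumption: upload only
            "Resource": f"arn:aws:s3:::{bucket}/{platform_prefix}/*",
        }],
    }
```

Provisioning a vendor would then amount to supplying one `(vendor_repo, platform_prefix)` pair, which is the "single line" the infrastructure code would need per platform.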

Platform vendors are responsible for security vulnerabilities in their platform-specific code. When a vulnerability affects packages hosted at `download.pytorch.org`, the following process applies:

1. Vendor discloses the vulnerability to the PyTorch security team at security@pytorch.org (or equivalent) within 7 days of discovery.
2. PyTorch infra can yank (remove from the CDN index without deleting) the affected artifacts while a fix is prepared.

Contributor


Why without deleting? What if the affected artifact distributes malicious content?

Author


Leaving the artifact in place and only removing it from the index could help with forensics, such as post-mortem analysis, but that works only if removing the artifact from the index prevents it from being installed by end users and used by CI/CD pipelines. If not, it should be removed completely.
I'll include both options here and clarify the intent.
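
To make the distinction concrete, here is a minimal sketch of "yank from the index without deleting", assuming a simple HTML index generated from the list of stored files. The function and the wheel file names are hypothetical and do not describe how `download.pytorch.org` builds its index.

```python
from html import escape


def render_index(stored: list[str], yanked: set[str]) -> str:
    """Render a simple HTML index that omits yanked files.

    Yanked files stay in `stored` (the underlying storage); only the
    generated index stops listing them.
    """
    links = [
        f'<a href="{escape(name)}">{escape(name)}</a><br>'
        for name in sorted(stored)
        if name not in yanked
    ]
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(links) + "\n</body></html>"


# Hypothetical artifact names for illustration.
stored_files = [
    "torch_example-1.0.0-py3-none-any.whl",
    "torch_example-1.0.1-py3-none-any.whl",
]
index_html = render_index(stored_files, yanked={"torch_example-1.0.0-py3-none-any.whl"})
```

In this sketch the yanked wheel disappears from the index, so resolvers that rely on the index will not find it, but the object is still in storage. Anyone holding the direct URL could still fetch it, which is the case where full deletion would be needed.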
