Add RFC-0055 Out-of-Tree Platform Build and Distribution #97
Signed-off-by: Andrea Frittoli <andrea.frittoli@uk.ibm.com>
Red Hat is also very interested in supporting a nightly runner signal.
Thanks for sending this RFC. I expect we'll finish the current CRCR for testing, focus on onboarding projects there, and make sure we get benefits from that before we build more pieces there.
Thanks @albanD, feedback would be welcome. I believe there's plenty of design and prototyping work that I can look into in parallel with the current work on CRCR.
Each platform operates in an isolated lane:
- **Credential isolation**: Each platform has a dedicated IAM role that can only write to that platform's storage prefix. OIDC trust policies scope the role to the specific vendor repo. A compromised vendor repo cannot access another platform's storage or the main PyTorch artifact space.
This implies AWS S3 infrastructure. It also implies centralized management of the IAM roles, which I guess would fall to the RelEng team (which is very Meta-heavy at the moment).
The RFC describes credential isolation within the existing AWS-based storage infrastructure (S3, IAM), which is what PyTorch uses today.
Management and payment of the S3 bucket are both handled by Meta today. Changing that is a conversation beyond the scope of the RFC, but the implications need to be considered:
- Costs: adding more platforms on the same S3 infra would add to the storage costs. Do you think this would be an issue?
- Management: the idea is to let vendors add and remove binaries by themselves, to avoid extra burden on the RelEng team. The only thing required would be provisioning (and deprovisioning) the roles that grant or revoke vendor access.
Each platform operates in an isolated lane:
- **Credential isolation**: Each platform has a dedicated IAM role that can only write to that platform's storage prefix. OIDC trust policies scope the role to the specific vendor repo. A compromised vendor repo cannot access another platform's storage or the main PyTorch artifact space.
- **Upload workflow isolation**: Uploads go through the official `_binary_upload.yml` workflow, which enforces naming conventions before writing to S3. Once [Stage 3](#implementation-plan) is complete, this workflow also generates provenance attestations. If the `job_workflow_ref` dual-gate can be confirmed (see [Credentials and Publishing Access](#credentials-and-publishing-access)), vendors cannot bypass this workflow even with valid OIDC credentials.
I don't think `_binary_upload.yml` can really be used to enforce anything. Enforcement must be done at the IAM level, which is hard and implies a lot of heavy lifting from the RelEng team.
The RFC is designed to reduce that heavy lifting as much as possible. I think it should be possible to reduce it to adding or removing a line in Terraform to provision or deprovision a role.
The idea is to use OIDC between GitHub Actions and AWS. The role must be associated with the vendor workflow in the vendor repo, and it would grant access to the S3 bucket only within the vendor-specific namespace.
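To make the IAM-level scoping concrete, here is a minimal sketch of the two policy documents such a per-vendor role might carry. It is an illustration under stated assumptions, not the RFC's actual configuration: the helper names, account ID, repo, bucket, and prefix are all hypothetical, and the choice of allowed S3 actions is a guess based on vendors being able to add and remove binaries. It scopes on the standard `sub` and `aud` claims only; gating on `job_workflow_ref` is the open question the RFC flags and is not covered here.

```python
import json

# Hostname of the GitHub Actions OIDC issuer, used in IAM condition keys.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def trust_policy(account_id: str, vendor_repo: str) -> dict:
    """Trust policy: only workflows from `vendor_repo` may assume the role.

    `account_id` and `vendor_repo` are placeholders for illustration.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": (
                        f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                    )
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                    # The `sub` claim starts with "repo:<owner>/<name>:".
                    "StringLike": {f"{GITHUB_OIDC_HOST}:sub": f"repo:{vendor_repo}:*"},
                },
            }
        ],
    }


def upload_policy(bucket: str, prefix: str) -> dict:
    """Permission policy: object writes limited to one storage prefix.

    The action list is an assumption, not taken from the RFC.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(trust_policy("123456789012", "example-org/example-backend"), indent=2))
    print(json.dumps(upload_policy("example-bucket", "whl/example-platform"), indent=2))
```

With this shape, onboarding a vendor amounts to supplying one repo name and one prefix, which is the kind of one-line provisioning change described above.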
Platform vendors are responsible for security vulnerabilities in their platform-specific code. When a vulnerability affects packages hosted at `download.pytorch.org`, the following process applies:
1. Vendor discloses the vulnerability to the PyTorch security team at security@pytorch.org (or equivalent) within 7 days of discovery.
2. PyTorch infra can yank (remove from the CDN index without deleting) the affected artifacts while a fix is prepared.
Why without deleting? What if the affected artifact distributes malicious content?
Leaving the artifact in place and only removing it from the index could help with forensics, such as post-mortem analysis. That works only if removing the artifact from the index actually prevents end users and CI/CD pipelines from installing it. If it does not, the artifact should be removed completely.
I'll include both options here and clarify the intent.
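The difference between the two options can be sketched with a toy in-memory model. This is not how the CDN or index is implemented; the `index` set, the `storage` dict, and all function names are hypothetical, and the model only assumes that an index lists filenames while storage serves bytes by direct URL.

```python
def yank(index: set[str], filename: str) -> set[str]:
    """Option 1: drop the entry from the index, leave the stored object alone."""
    return index - {filename}


def purge(
    index: set[str], storage: dict[str, bytes], filename: str
) -> tuple[set[str], dict[str, bytes]]:
    """Option 2: drop the entry from the index and delete the stored object."""
    remaining = {name: data for name, data in storage.items() if name != filename}
    return index - {filename}, remaining


def resolvable_via_index(index: set[str], storage: dict[str, bytes], filename: str) -> bool:
    """True if an installer that consults the index could still fetch the file."""
    return filename in index and filename in storage


def fetchable_by_direct_url(storage: dict[str, bytes], filename: str) -> bool:
    """True if a pinned direct URL (bypassing the index) would still work."""
    return filename in storage


if __name__ == "__main__":
    bad = "example_pkg-1.0-py3-none-any.whl"  # hypothetical artifact name
    index = {bad, "example_pkg-1.1-py3-none-any.whl"}
    storage = {name: b"..." for name in index}

    yanked_index = yank(index, bad)
    print(resolvable_via_index(yanked_index, storage, bad))  # index lookups fail
    print(fetchable_by_direct_url(storage, bad))             # object still served

    purged_index, purged_storage = purge(index, storage, bad)
    print(fetchable_by_direct_url(purged_storage, bad))      # object gone
```

In this model a yank keeps the object available for analysis but leaves it reachable by anyone holding a direct URL, which is the gap the question above points at. A purge closes that gap at the cost of losing the stored copy.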