Add TPU nodes in new GCP project #228

Merged
QiliangCui merged 8 commits into vllm-project:main from CienetStingLin:sting/add_node_new_project on Dec 5, 2025

@CienetStingLin

Contributor

This PR adds the creation of TPU nodes in a second GCP project. It uses the same bucket, "tpu_commons_ci-infra_tf", but a different folder to store the Terraform state. The buildkite_hf_token and buildkite_agent_token are still retrieved from the Secret Manager of the original project (cloud-tpu-inference-test).

It also changes the name that the ci_v6e module assigns to each Buildkite agent: the project number is added to the name to prevent naming conflicts between projects.


  sudo sed -i "s/xxx/${var.buildkite_token_value}/g" /etc/buildkite-agent/buildkite-agent.cfg
- sudo sed -i 's/name="%hostname-%spawn"/name="vllm-tpu-${var.accelerator_type}-${count.index}"/' /etc/buildkite-agent/buildkite-agent.cfg
+ sudo sed -i 's/name="%hostname-%spawn"/name="vllm-tpu-${data.google_project.project.number}-${var.accelerator_type}-${count.index}"/' /etc/buildkite-agent/buildkite-agent.cfg
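
To make the substitution concrete, here is a minimal sketch that applies the same two replacements to sample config text on stdin rather than editing /etc/buildkite-agent/buildkite-agent.cfg in place. The helper name render_agent_cfg, the two-line sample config, and the dummy values are assumptions for illustration only; in the PR the values come from Terraform interpolation, and the real config file contains more settings.

```shell
# Sketch only: render_agent_cfg is a hypothetical helper, not part of the PR.
# It reads config text on stdin and writes the rewritten text to stdout.
# Note: a token containing "/" or "&" would need escaping before use in sed.
render_agent_cfg() {
  token="$1"
  project_number="$2"
  accelerator_type="$3"
  index="$4"
  sed -e "s/xxx/${token}/g" \
      -e "s/name=\"%hostname-%spawn\"/name=\"vllm-tpu-${project_number}-${accelerator_type}-${index}\"/"
}

# Assumed two-line sample of the stock config (placeholder token, default name).
sample_cfg='token="xxx"
name="%hostname-%spawn"'

# Render the config for agent 0 of a v6e-1 pool with made-up values.
printf '%s\n' "$sample_cfg" | render_agent_cfg "dummy-token" "443452445451" "v6e-1" "0"
```

Writing to stdout keeps the sketch runnable without sudo; the startup script in the PR uses sed -i on the real file instead.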

Contributor Author

A question here: the project number is not sensitive data, but is it acceptable to use it in the agent name to distinguish agents from different projects? (The agent name is visible on the public Buildkite page.)

Collaborator

Ha! Yes, this is the thing I want to talk about!!

The number is not sensitive, but it is hard to read in a name like vllm-tpu-443452445451-v6e-1-0. Can we change the name format instead?

vllm-tpu-${var.accelerator_type}-${count.index} --->

${var.accelerator_type}-ci-${count.index}-${var.project_short_name}-${var.zone}

Then, let's say project_short_name can be test or cicd. The machine names become:

v6e-1-ci-0-test-us-east5-b
v6e-1-ci-1-test-us-east5-b
v6e-1-ci-2-test-us-east5-b
v6e-8-ci-0-test-us-central1-b
v6e-8-ci-1-test-us-central1-b

v6e-1-ci-0-cicd-us-central1-b
v6e-1-ci-1-cicd-us-central1-b
v6e-1-ci-2-cicd-us-central1-b
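
The proposed format can be checked quickly outside Terraform. The following is a minimal sketch in which agent_name is a hypothetical helper that assembles the four parts in the same order as the suggested expression; the inputs are the example values from this comment, not values read from the actual module.

```shell
# Sketch only: agent_name is a hypothetical helper that mirrors the proposed
# Terraform expression
#   ${var.accelerator_type}-ci-${count.index}-${var.project_short_name}-${var.zone}
agent_name() {
  accelerator_type="$1"
  index="$2"
  project_short_name="$3"
  zone="$4"
  printf '%s-ci-%s-%s-%s\n' "$accelerator_type" "$index" "$project_short_name" "$zone"
}

# Print the names for three v6e-1 agents in the "test" project.
for i in 0 1 2; do
  agent_name "v6e-1" "$i" "test" "us-east5-b"
done
```

Because the project short name and zone are part of the name, two projects can reuse the same accelerator type and index without colliding.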

Contributor Author

That's a good idea; the adjustment is done. For now, it has been applied only to the cloud-ullm-inference-ci-cd project. The Terraform code for the cloud-tpu-inference-test project has also been modified, but terraform apply for that project will be run in the next iteration.

data "google_project" "project" {
  project_id = var.project_id
}

Collaborator

This is not referenced, right?

Per the agent-naming suggestion, maybe give the module a project_short_name variable instead?

Contributor Author

Yes, it is not referenced. It will be replaced by a data source that reads the zone value from the provider.

Signed-off-by: StingLin <sting.lin@cienet.com>
@CienetStingLin force-pushed the sting/add_node_new_project branch from 2c5bdcc to 465a4ea on December 4, 2025 06:27
@CienetStingLin

Contributor Author

The configured TPU agent counts are 24 for v6e-1 and 13 for v6e-8, but because of the disk quota limit we can currently create only 18 v6e-1 and 1 v6e-8 agents on cloud-ullm-inference-ci-cd. The remaining agents (6 v6e-1 and 12 v6e-8) will be deployed with a subsequent terraform apply after the quota is increased.

@QiliangCui QiliangCui merged commit edd06a0 into vllm-project:main Dec 5, 2025
1 of 2 checks passed