Add TPU nodes in new GCP project #228
sudo sed -i "s/xxx/${var.buildkite_token_value}/g" /etc/buildkite-agent/buildkite-agent.cfg
sudo sed -i 's/name="%hostname-%spawn"/name="vllm-tpu-${var.accelerator_type}-${count.index}"/' /etc/buildkite-agent/buildkite-agent.cfg
sudo sed -i 's/name="%hostname-%spawn"/name="vllm-tpu-${data.google_project.project.number}-${var.accelerator_type}-${count.index}"/' /etc/buildkite-agent/buildkite-agent.cfg
One question here: the project number is not sensitive data, but is it OK to use it in the agent name to distinguish agents from different projects? (The agent name is visible on the public Buildkite page.)
ha! Yes, this is the thing I want to talk about!!
The number is not sensitive, but it is hard to read in a name like vllm-tpu-443452445451-v6e-1-0. Can we change the pattern instead?
vllm-tpu-${var.accelerator_type}-${count.index} --->
${var.accelerator_type}-ci-${count.index}-${var.project_short_name}-${var.zone}
For example, if project_short_name is either test or cicd, the machine names become:
v6e-1-ci-0-test-us-east5-b
v6e-1-ci-1-test-us-east5-b
v6e-1-ci-2-test-us-east5-b
v6e-8-ci-0-test-us-central1-b
v6e-8-ci-1-test-us-central1-b
v6e-1-ci-0-cicd-us-central1-b
v6e-1-ci-1-cicd-us-central1-b
v6e-1-ci-2-cicd-us-central1-b
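The proposed pattern could be sketched in Terraform roughly as follows. This is only an illustration, not the module's actual code: project_short_name is the variable suggested above, while instance_count and the agent_names local and output are hypothetical names introduced here. Inside the real resource, the index would come from count.index rather than range().

```hcl
# Hypothetical variable, named after the suggestion in this thread.
variable "project_short_name" {
  type        = string
  description = "Short, readable project label for agent names, e.g. \"test\" or \"cicd\"."
}

variable "accelerator_type" {
  type = string # e.g. "v6e-1"
}

variable "zone" {
  type = string # e.g. "us-east5-b"
}

# Hypothetical: stands in for the resource's count.
variable "instance_count" {
  type    = number
  default = 3
}

locals {
  # One agent name per instance index, following the proposed pattern:
  # <accelerator_type>-ci-<index>-<project_short_name>-<zone>
  agent_names = [
    for i in range(var.instance_count) :
    "${var.accelerator_type}-ci-${i}-${var.project_short_name}-${var.zone}"
  ]
}

output "agent_names" {
  value = local.agent_names
}
```

With accelerator_type = "v6e-1", project_short_name = "test" and zone = "us-east5-b", the first generated name is v6e-1-ci-0-test-us-east5-b.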
That's a good idea; the change has been made. So far it has only been applied to the cloud-ullm-inference-ci-cd project. The Terraform code for the cloud-tpu-inference-test project has also been updated, but its terraform apply will be done in the next iteration.
data "google_project" "project" {
  project_id = var.project_id
}
This is not referenced, right?
Per the suggestion below, maybe we should give it a project_short_name variable instead?
Yes, but it will be replaced by adding a data resource that reads the zone value from the provider.
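One possible way to read the zone from the provider is the google_client_config data source, which exposes the provider's configured project, region and zone. The thread does not show which data source was actually used, so this is a sketch under that assumption; the names "current" and provider_zone are hypothetical.

```hcl
# Reads the settings of the configured google provider.
# Assumption: this is the kind of "data resource" meant in the reply above.
data "google_client_config" "current" {}

locals {
  # Only useful if the provider block (or environment) sets a zone;
  # otherwise this value is empty.
  provider_zone = data.google_client_config.current.zone
}
```

The agent name could then use local.provider_zone in place of a separate var.zone input, so the zone is defined in one place.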
Signed-off-by: StingLin <sting.lin@cienet.com>
…name Signed-off-by: StingLin <sting.lin@cienet.com>
2c5bdcc to
465a4ea
Although the TPU agent counts are currently set to 24 for v6e-1 and 13 for v6e-8, the disk quota limit means we can currently only create 18 v6e-1 and 1 v6e-8 on
Added the creation of TPU nodes in the second project. It uses the same bucket, "tpu_commons_ci-infra_tf", but a different folder to store the Terraform state. The buildkite_hf_token and buildkite_agent_token are also retrieved from the Secret Manager of the original project (cloud-tpu-inference-test).
Modified the naming when connecting to the Buildkite agent in the ci_v6e module by adding the project number to prevent naming conflicts between different projects.
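A setup like the one described might look roughly like the sketch below. The bucket name and the source project come from the description; the prefix value and the assumption that the secret IDs match the token names are guesses for illustration, not taken from the PR.

```hcl
terraform {
  backend "gcs" {
    # Same bucket as the original project's state.
    bucket = "tpu_commons_ci-infra_tf"
    # Hypothetical folder name; a distinct prefix keeps the two states separate.
    prefix = "second-project/state"
  }
}

# Read the Buildkite agent token from the original project's Secret Manager.
data "google_secret_manager_secret_version" "buildkite_agent_token" {
  project = "cloud-tpu-inference-test"
  secret  = "buildkite_agent_token" # assumed secret ID
}

# Same pattern for the Hugging Face token.
data "google_secret_manager_secret_version" "buildkite_hf_token" {
  project = "cloud-tpu-inference-test"
  secret  = "buildkite_hf_token" # assumed secret ID
}

locals {
  # secret_data holds the secret payload; treat it as sensitive.
  buildkite_agent_token = data.google_secret_manager_secret_version.buildkite_agent_token.secret_data
}
```

Reading across projects this way requires the identity running Terraform to have Secret Manager access in cloud-tpu-inference-test.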