Skip to content

V1.5.0 Release Plan #115

@yukirora

Description

@yukirora

Endgame

  • Code freeze: Dec. 26
  • Bug Bash date:
  • Testing start date:
  • Release date:

Main Features

Platform Features & Stability

    • LTP prometheus data auth
    • Containers losing access to NVIDIA GPUs
    • local storage ssh key gen consistency enhancement
    • Job exporter support on arm64 architecture GPUs

Inference

    • statistic about inference usage
    • Inference user data auth design and implementation

Security

    • Add support of Azure email communication service

TODO

Automatic Endpoint Deployment/Upgrade CI/CD
Distributed prometheus design
Job Reliability & Monitoring
Automatic Failure Detector
Interactive gracefully job exit/retry
Hardware Failure Detector
Software Failure Detector
Cluster and Job utilization metrics
LTP-Megatron Support for Project Level Metrics
Project Level Metrics Monitoring

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions