My Nodes Won't Start

What to do if your nodes won’t start

AWS vCPU Limits

Most of the time, nodes fail to start because you need to increase your vCPU limits. AWS enforces a quota on the number of vCPUs you can run concurrently, with a separate quota for each group of instance families. You can confirm that this is the issue by running:

aws autoscaling describe-scaling-activities
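If the account has many scaling activities, the full output can be long. The following is a minimal sketch, not part of the original instructions, that narrows the same command to failed activities for one Auto Scaling group. The function name `failed_scaling_activities` is hypothetical; `--auto-scaling-group-name` and `--query` are standard AWS CLI options, and the group name you pass is your own.

```shell
# Hypothetical helper: show only failed scaling activities for one group.
# $1 is the Auto Scaling group name (an assumption: you know which group
# backs the nodes that will not start).
failed_scaling_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "${1:?usage: failed_scaling_activities GROUP_NAME}" \
    --query "Activities[?StatusCode=='Failed']"
}
```

Calling `failed_scaling_activities <your-group-name>` should return only the entries whose StatusCode is Failed, in the same JSON shape as the unfiltered command.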

You should see a message like this:

        {
            "ActivityId": "2323-234234-234234",
            "AutoScalingGroupName": "saturn-cluster-demo-16xlarge20202323242342342",
            "Description": "Launching a new EC2 instance.  Status Reason: You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.",
            "Cause": "At 2020-07-14T13:13:24Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.",
            "StartTime": "2020-07-14T13:13:25.764Z",
            "EndTime": "2020-07-14T13:13:25Z",
            "StatusCode": "Failed",
            "StatusMessage": "You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.",
            "Progress": 100,
            "Details": "{\"Subnet ID\":\"subnet-24234234234234234\",\"Availability Zone\":\"eu-west-3a\"}"
        },

The StatusMessage field describes the failure. To resolve it, navigate to the AWS Service Quotas console and select EC2.

For CPU instances, you want to increase the value for Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances. For T4 GPUs, it’s Running On-Demand G instances, and for V100 GPUs it’s Running On-Demand P instances.
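The same change can also be requested from the command line instead of the console. The sketch below assumes the AWS CLI `service-quotas` commands are available to your credentials. The function names are hypothetical, and the quota code is deliberately left as an argument: look it up with the first function rather than trusting a hard-coded value.

```shell
# Hypothetical helper: list the EC2 "Running On-Demand ..." quotas with
# their quota codes and current values.
list_on_demand_quotas() {
  aws service-quotas list-service-quotas \
    --service-code ec2 \
    --query "Quotas[?starts_with(QuotaName, 'Running On-Demand')].[QuotaCode,QuotaName,Value]" \
    --output table
}

# Hypothetical helper: request an increase.
# $1 is a quota code taken from the listing above.
# $2 is the desired vCPU limit.
request_vcpu_increase() {
  aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code "${1:?usage: request_vcpu_increase QUOTA_CODE DESIRED_VALUE}" \
    --desired-value "${2:?usage: request_vcpu_increase QUOTA_CODE DESIRED_VALUE}"
}
```

Run `list_on_demand_quotas` first, find the row for the quota named above, then pass its code and the new limit to `request_vcpu_increase`. Quotas are set per region, so make sure the CLI is pointed at the region where the cluster runs.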

AWS generally takes 24 hours to respond to limit increase requests.
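While waiting, the state of the request can be checked from the command line. This is a small sketch under the same assumption that the `service-quotas` CLI commands are available; the function name `quota_request_status` is hypothetical.

```shell
# Hypothetical helper: show EC2 quota increase requests and their status.
quota_request_status() {
  aws service-quotas list-requested-service-quota-change-history \
    --service-code ec2 \
    --query "RequestedQuotas[].[QuotaName,DesiredValue,Status,Created]" \
    --output table
}
```

Once the request is reported as approved, the Auto Scaling group's next launch attempt should no longer fail with the vCPU limit message.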