AWS has offered Fargate support for EKS for more than a year now, so I figure it should be stable, and I'm investigating how to use it as our serverless ETL compute resource. Before we moved to Fargate, we had to manage a set of Auto Scaling Groups for the EKS worker nodes: in our ETL flow we would bump the ASG desired capacity to spin up a new node to join the cluster, use the compute resources, and then terminate the EC2 worker node after the ETL job finished. In a perfect world this works fine, but when anything goes wrong or the job fails, we have to make sure we still terminate the newly launched EC2 worker nodes, otherwise we waste money on unneeded compute.
Fargate changes this completely. With Fargate, I can just spin up a Job/Pod from a YAML file, Fargate will launch a new worker node with the requested compute resources (CPU and memory), and when we finish, whether the job fails or not, Fargate terminates the worker node as soon as the pod is removed. This frees us from ASG management hell. Furthermore, Fargate runs each pod on an isolated worker node, so there are no side effects between pods and it is much easier to isolate problems. Although the Fargate price for compute is higher than plain EC2, over the longer term, since we only need the compute for short periods, it will save us a lot of money.
With so many benefits, I couldn't wait to start using it. Firstly, we have to create an IAM role with the AWS managed policy AmazonEKSFargatePodExecutionRolePolicy. At the moment the policy really only allows the Fargate worker node to communicate with ECR to pull docker images, but I expect AWS will add more permissions to it later.
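If you keep infrastructure as code, a minimal CloudFormation sketch of this role could look like the below (the logical name FargatePodExecutionRole is just mine; the trust policy allows eks-fargate-pods.amazonaws.com to assume the role, which is what Fargate needs):
Resources:
  FargatePodExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      # Let the EKS Fargate service assume this role for pod execution
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: eks-fargate-pods.amazonaws.com
            Action: sts:AssumeRole
      # AWS managed policy that (currently) mainly grants ECR pull permissions
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSFargatePodExecutionRolePolicy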
Secondly, we need to create a Fargate profile. In the Fargate profile we specify a namespace and some label selectors; for my example, I want every ETL job running in namespace flow with the label infrastructure=fargate to run on a Fargate worker node.
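I create the profile with eksctl; a minimal sketch of the config is below (the cluster name, region and profile name are hypothetical, and the pod execution role ARN is the placeholder role from the previous step):
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: acme-dev            # hypothetical cluster name
  region: ap-southeast-2    # hypothetical region
fargateProfiles:
  - name: etl-flow          # hypothetical profile name
    podExecutionRoleARN: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
    selectors:
      # only pods in namespace "flow" carrying this label land on Fargate
      - namespace: flow
        labels:
          infrastructure: fargate
eksctl can create the Fargate profile from a config file like this, and the same thing can be done in the AWS console or with the aws CLI.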
When this is in place, I can spin up a pod on Fargate with YAML like the below:
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
  namespace: flow
  labels:
    infrastructure: fargate
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out; sleep 5; done"]
If we remove the label infrastructure=fargate, the pod is scheduled on an existing worker node; otherwise it is scheduled on a Fargate worker node, and that worker node is terminated as soon as the pod gets deleted (really cool).
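The pod above is named efs-app and writes to /data/out because my next test was mounting EFS, which on Fargate goes through the built-in EFS CSI driver and only supports static provisioning. A minimal sketch of the storage objects I would pair with it (the filesystem ID fs-xxxxxxxx and the efs-sc / efs-pv / efs-claim names are placeholders of mine):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi                  # required by the API, not enforced by EFS
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxxxx     # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  namespace: flow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
The pod then only needs a persistentVolumeClaim volume referencing efs-claim, mounted at /data.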
I then continued testing other Fargate features, e.g. mounting EFS into Fargate worker nodes, and in the process I had to delete an existing Fargate profile and create a new one. Everything seemed to work fine, but later that day some app developers contacted me saying they could no longer connect to our EKS (dev) environment. When I checked the cluster health, I noticed that all the normal EC2 worker nodes were stuck in "NotReady" status, and when I described any worker node I could see conditions like the below:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
All the kubelets had stopped being able to communicate with the EKS API server, so every node was stuck in "NotReady" status.
This didn't make sense, and we spent a day investigating the issue. Luckily one of our senior engineers found the root cause: the CloudWatch logs contained API authentication errors saying the role was not mapped correctly in the aws-auth ConfigMap. We then checked the aws-auth ConfigMap in the kube-system namespace, and that was indeed where the problem was.
Our aws-auth ConfigMap had been overwritten at some point, I believe by some naughty AWS process (though AWS may not admit it). The fact is our aws-auth no longer contained the default EKS node rolearn; instead it only contained the Fargate profile role:
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
      username: system:node:{{SessionName}}
while the expected aws-auth ConfigMap should look like the below:
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-worker-nodes-NodeInstanceRole-xxxxxxx
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
      username: system:node:{{SessionName}}
The first entry had been overwritten, so the API server no longer recognized the node instance role the kubelets authenticate with (it was no longer mapped to the system:nodes group), and that is what caused this issue. After we put the default rolearn mapping back, everything returned to normal.
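For reference, a full aws-auth manifest that restores both mappings (reusing the placeholder account ID and role names from above) would look something like this, applied with kubectl apply -f aws-auth.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # default mapping for the EC2 worker node instance role
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-worker-nodes-NodeInstanceRole-xxxxxxx
      username: system:node:{{EC2PrivateDNSName}}
    # mapping for the Fargate pod execution role
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
      username: system:node:{{SessionName}}
Be careful when touching aws-auth: a broken mapping locks every worker node out of the cluster, exactly as we experienced here.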