AWS has offered Fargate support for EKS for more than a year now, so I figure it should be stable, and I'm investigating how to use it as our serverless ETL compute resource. Before we moved to Fargate, we had to manage a set of Auto Scaling Groups for the EKS worker nodes: in our ETL flow we would bump the ASG desired capacity to spin up a new node to join the cluster, use the compute resources, and then terminate the EC2 worker node after the ETL job finished. In a perfect world this works fine, but when anything goes wrong or the job fails, we have to make sure we still terminate the newly launched EC2 worker nodes, otherwise we waste money on unneeded compute.
Fargate changes this completely. With Fargate, I can just spin up a Job/Pod from a YAML file, Fargate will launch a new worker node with the requested compute resources (CPU and memory), and when we finish, whether the job fails or not, Fargate terminates the worker node as soon as the pod is removed. This frees us from ASG management hell. Furthermore, Fargate runs each pod on an isolated worker node, so there are no side effects between pods and it is much easier to isolate problems. Although the Fargate price for compute is higher than plain EC2, over the longer term, since we only need the compute for short periods, it will save us a lot of money.
With so many benefits, I couldn't wait to start using it. Firstly, we have to create an IAM role with the AWS managed policy AmazonEKSFargatePodExecutionRolePolicy. At the moment the policy really only allows the Fargate worker node to communicate with ECR to pull docker images, but I expect AWS will add more permissions to it later.
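If you keep infrastructure as code, a minimal CloudFormation sketch of this role could look like the below (the logical name FargatePodExecutionRole is just mine; the trust policy allows eks-fargate-pods.amazonaws.com to assume the role, which is what Fargate needs):
Resources:
  FargatePodExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      # Let the EKS Fargate service assume this role for pod execution
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: eks-fargate-pods.amazonaws.com
            Action: sts:AssumeRole
      # AWS managed policy that (currently) mainly grants ECR pull permissions
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSFargatePodExecutionRolePolicy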
Secondly, we need to create a Fargate profile. In the Fargate profile we specify a namespace and some label selectors; for my example, I want every ETL job running in namespace flow with the label infrastructure=fargate to run on a Fargate worker node.
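I create the profile with eksctl; a minimal sketch of the config is below (the cluster name, region and profile name are hypothetical, and the pod execution role ARN is the placeholder role from the previous step):
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: acme-dev            # hypothetical cluster name
  region: ap-southeast-2    # hypothetical region
fargateProfiles:
  - name: etl-flow          # hypothetical profile name
    podExecutionRoleARN: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
    selectors:
      # only pods in namespace "flow" carrying this label land on Fargate
      - namespace: flow
        labels:
          infrastructure: fargate
eksctl can create the Fargate profile from a config file like this, and the same thing can be done in the AWS console or with the aws CLI.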
When this is in place, I can spin up a pod on Fargate with YAML like the below:
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
  namespace: flow
  labels:
    infrastructure: fargate
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out; sleep 5; done"]
If we remove the label infrastructure=fargate, the pod is scheduled on an existing worker node; otherwise it is scheduled on a Fargate worker node, and that worker node is terminated as soon as the pod gets deleted (really cool).
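The pod above is named efs-app and writes to /data/out because my next test was mounting EFS, which on Fargate goes through the built-in EFS CSI driver and only supports static provisioning. A minimal sketch of the storage objects I would pair with it (the filesystem ID fs-xxxxxxxx and the efs-sc / efs-pv / efs-claim names are placeholders of mine):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi                  # required by the API, not enforced by EFS
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-xxxxxxxx     # placeholder EFS filesystem ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
  namespace: flow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
The pod then only needs a persistentVolumeClaim volume referencing efs-claim, mounted at /data.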
I then continued testing other Fargate features, e.g. mounting EFS into Fargate worker nodes, and in the process I had to delete an existing Fargate profile and create a new one. Everything seemed to work fine, but later that day some app developers contacted me saying they could no longer connect to our EKS (dev) environment. When I checked the cluster health, I noticed that all the normal EC2 worker nodes were stuck in "NotReady" status, and when I described any worker node I could see conditions like the below:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 27 May 2021 11:51:41 +0800 Thu, 27 May 2021 11:55:52 +0800 NodeStatusUnknown Kubelet stopped posting node status.
All the kubelets had stopped being able to communicate with the EKS API server, so every node was stuck in "NotReady" status.
This didn't make sense, and we spent a day investigating the issue. Luckily one of our senior engineers found the root cause: the CloudWatch logs contained API authentication errors saying the role was not mapped correctly in the aws-auth ConfigMap. We then checked the aws-auth ConfigMap in the kube-system namespace, and that was indeed where the problem was.
Our aws-auth ConfigMap had been overwritten at some point, I believe by some naughty AWS process (though AWS may not admit it). The fact is our aws-auth no longer contained the default EKS node rolearn; instead it only contained the Fargate profile role:
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
      username: system:node:{{SessionName}}
while the expected aws-auth ConfigMap should look like the below:
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-worker-nodes-NodeInstanceRole-xxxxxxx
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
      username: system:node:{{SessionName}}
The first entry had been overwritten, so the API server no longer recognized the node instance role the kubelets authenticate with (it was no longer mapped to the system:nodes group), and that is what caused this issue. After we put the default rolearn mapping back, everything returned to normal.
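For reference, a full aws-auth manifest that restores both mappings (reusing the placeholder account ID and role names from above) would look something like this, applied with kubectl apply -f aws-auth.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # default mapping for the EC2 worker node instance role
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-worker-nodes-NodeInstanceRole-xxxxxxx
      username: system:node:{{EC2PrivateDNSName}}
    # mapping for the Fargate pod execution role
    - groups:
      - system:bootstrappers
      - system:nodes
      - system:node-proxier
      rolearn: arn:aws:iam::xxxxxxxx:role/acme-eks-fargate-pod
      username: system:node:{{SessionName}}
Be careful when touching aws-auth: a broken mapping locks every worker node out of the cluster, exactly as we experienced here.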