Airbyte Pod Intermittently Losing IRSA Permissions and Falling Back to Node IAM Role #53652

Open
talhermon opened this issue Feb 12, 2025 · 12 comments
Labels
area/platform issues related to the platform community kubernetes team/deployments type/bug Something isn't working

Comments

@talhermon

Helm Chart Version

1.3.1

What step the error happened?

Other

Relevant information

I'm encountering an intermittent issue with an Airbyte pod running on EKS, which is configured to assume an AWS IAM role using IRSA. Every few days, the pod appears to lose the service account permissions and instead starts using the IAM role associated with the underlying EKS node. This results in permission-denied errors when attempting to access an S3 bucket.

Expected Behavior:
The Airbyte pod should consistently assume the IAM role associated with its service account via IRSA and retain the expected permissions throughout its lifecycle.
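A quick way to narrow this down from inside the affected container is to check whether the IRSA inputs are still present. The sketch below assumes the standard EKS setup, where the pod identity webhook injects `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE`; the function name and return strings are illustrative only and are not part of Airbyte.

```python
import os
from typing import Mapping


def irsa_env_status(env: Mapping[str, str]) -> str:
    """Report whether the IRSA inputs an AWS SDK looks for are present.

    Assumption: the EKS pod identity webhook has injected AWS_ROLE_ARN and
    AWS_WEB_IDENTITY_TOKEN_FILE into the container environment.
    """
    role_arn = env.get("AWS_ROLE_ARN")
    token_file = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not role_arn or not token_file:
        # Without both variables the SDK cannot use web identity at all and
        # may continue down its credential chain to the node's role.
        return "missing-env"
    if not os.path.isfile(token_file):
        # Variables are set but the projected token is not mounted/readable.
        return "missing-token-file"
    return "ok"


if __name__ == "__main__":
    print(irsa_env_status(os.environ))
```

If this reports `ok` while requests are still signed with the node role, the injection itself is probably not the problem and the cause is more likely in how credentials are resolved or cached at runtime.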

Relevant log output

2025-02-11 13:49:44.286 INFO i.a.c.s.h.SchedulerHandler(createJob):602 - Found the following streams to reset for connection <CONNECTION_ID>: []
2025-02-11 13:49:44.320 INFO i.a.p.j.DefaultJobPersistence(enqueueJob):582 - Enqueuing pending job for scope: <CONNECTION_ID>
2025-02-11 13:49:44.324 INFO i.a.c.s.h.SchedulerHandler(createJob):650 - New job created, with id: <JOB_ID>
2025-02-11 13:49:44.733 WARN i.a.c.j.JsonSchemas(traverseJsonSchemaInternal):200 - The object is a properties key or a combo keyword. The traversal is silently stopped. Current schema: {"type":"object","airbyte_hidden":true,"additionalProperties":true}
2025-02-11 13:49:44.787 WARN i.a.c.j.JsonSchemas(traverseJsonSchemaInternal):200 - The object is a properties key or a combo keyword. The traversal is silently stopped. Current schema: {"type":"object","airbyte_hidden":true,"additionalProperties":true}
2025-02-11 13:50:04.777 ERROR i.a.s.a.ApiHelper(execute):46 - Unexpected Exception
software.amazon.awssdk.services.s3.model.S3Exception: User: arn:aws:sts::<AWS_ACCOUNT_ID>:assumed-role/<EKS_NODE_ROLE>/<INSTANCE_ID> is not authorized to perform: s3:ListBucket on resource: "arn:aws:s3:::<S3_BUCKET_NAME>" because no identity-based policy allows the s3:ListBucket action (Service: S3, Status Code: 403, Request ID: <REQUEST_ID>, Extended Request ID: <EXTENDED_REQUEST_ID>)
	at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
	at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
	at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
	at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93)
	at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:74)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:43)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:79)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:41)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage2.executeRequest(RetryableStage2.java:93)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage2.execute(RetryableStage2.java:56)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage2.execute(RetryableStage2.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:53)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:35)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:82)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:62)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:43)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:210)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.s3.DefaultS3Client.listObjectsV2(DefaultS3Client.java:7327)
	at software.amazon.awssdk.services.s3.paginators.ListObjectsV2Iterable$ListObjectsV2ResponseFetcher.nextPage(ListObjectsV2Iterable.java:154)
	at software.amazon.awssdk.services.s3.paginators.ListObjectsV2Iterable$ListObjectsV2ResponseFetcher.nextPage(ListObjectsV2Iterable.java:145)
	at software.amazon.awssdk.core.pagination.sync.PaginatedResponsesIterator.next(PaginatedResponsesIterator.java:58)
	at io.airbyte.commons.storage.AbstractS3StorageClient.list(StorageClient.kt:571)
	at io.airbyte.commons.logging.LogClient.getLogs(LogClient.kt:95)
	at io.airbyte.commons.logging.LogClientManager.getLogs(LogClientManager.kt:50)
	at io.airbyte.commons.server.converters.JobConverter.getAttemptLogs(JobConverter.java:306)
	at io.airbyte.commons.server.converters.JobConverter.getSynchronousJobRead(JobConverter.java:334)
	at io.airbyte.commons.server.converters.JobConverter.getSynchronousJobRead(JobConverter.java:329)
	at io.airbyte.commons.server.handlers.SchedulerHandler.retrieveDiscoveredSchema(SchedulerHandler.java:567)
	at io.airbyte.commons.server.handlers.SchedulerHandler.discoverAndGloballyDisable(SchedulerHandler.java:400)
	at io.airbyte.commons.server.handlers.SchedulerHandler.discoverSchemaForSourceFromSourceId(SchedulerHandler.java:365)
	at io.airbyte.server.apis.SourceApiController.lambda$discoverSchemaForSource$6(SourceApiController.java:114)
	at io.airbyte.server.apis.ApiHelper.execute(ApiHelper.kt:31)
	at io.airbyte.server.apis.SourceApiController.discoverSchemaForSource(SourceApiController.java:114)
	at io.airbyte.server.apis.$SourceApiController$Definition$Exec.dispatch(Unknown Source)
	at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invokeUnsafe(AbstractExecutableMethodsDefinition.java:461)
	at io.micronaut.context.DefaultBeanContext$BeanContextUnsafeExecutionHandle.invokeUnsafe(DefaultBeanContext.java:4354)
	at io.micronaut.web.router.AbstractRouteMatch.execute(AbstractRouteMatch.java:272)
	at io.micronaut.web.router.DefaultUriRouteMatch.execute(DefaultUriRouteMatch.java:38)
	at io.micronaut.http.server.RouteExecutor.executeRouteAndConvertBody(RouteExecutor.java:488)
	at io.micronaut.http.server.RouteExecutor.lambda$callRoute$5(RouteExecutor.java:465)
	at io.micronaut.core.execution.ExecutionFlow.lambda$async$1(ExecutionFlow.java:87)
	at io.micronaut.core.propagation.PropagatedContext.lambda$wrap$3(PropagatedContext.java:211)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
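
The identity in the `S3Exception` above is what shows the fallback: the request was signed as an `assumed-role` session of the node role. As a rough aid for scanning server logs for such events, the sketch below pulls the role and session name out of an error message. The helper names are hypothetical, and the instance-ID check is only a heuristic that relies on EC2 instance-profile sessions being named after the instance ID.

```python
import re
from typing import Optional, Tuple

# Matches arn:aws:sts::<account>:assumed-role/<role>/<session>
_ASSUMED_ROLE = re.compile(
    r"arn:aws[\w-]*:sts::([^:]*):assumed-role/([^/\s]+)/([^\s]+)"
)


def parse_assumed_role(message: str) -> Optional[Tuple[str, str, str]]:
    """Return (account_id, role_name, session_name) or None if no ARN is found."""
    match = _ASSUMED_ROLE.search(message)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


def looks_like_node_session(session_name: str) -> bool:
    """Heuristic: instance-profile sessions are named after the EC2 instance ID."""
    return re.fullmatch(r"i-[0-9a-f]{8,17}", session_name) is not None
```

A session name that looks like `i-0123...` points to instance-profile credentials, while IRSA sessions normally carry a name chosen by the SDK or by `AWS_ROLE_SESSION_NAME`.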
@talhermon talhermon added area/platform issues related to the platform needs-triage type/bug Something isn't working labels Feb 12, 2025
@talhermon talhermon changed the title airbyte pod permissions falls back to Node from Service account Airbyte Pod Intermittently Losing IRSA Permissions and Falling Back to Node IAM Role Feb 12, 2025
@jacoblElementor

jacoblElementor commented Feb 16, 2025

Having the same issue with helm chart version 1.4.1

@jacoblElementor

Hey @talhermon, as mentioned in my earlier comment, I had the same issue, but I managed to resolve it.
In my case, the IRSA role I was using was missing the assume-role (trust) configuration that allows the airbyte-admin service account to use it.
Also, I was upgrading from an old Airbyte chart, so I needed to update my values structure:

global:
  storage:
    type: S3
    bucket:
      log: "<your bucket>"
      state: "<your bucket>"
      workloadOutput: "<your bucket>"
      activityPayload: "<your bucket>"
    s3:
      region: <your region>
      authenticationType: instanceProfile # This is important

I am now using Helm chart version: 1.3.1
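
For readers checking the same thing, the missing assume-role configuration mentioned above is the role's trust policy. The sketch below builds the usual IRSA trust policy document; the function name and all argument values are placeholders, and the namespace and OIDC provider have to match your own cluster.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy allowing one service account to assume the role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix,
    e.g. "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE" (placeholder).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Restrict the role to exactly one namespace/service account.
                f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


if __name__ == "__main__":
    # Placeholder values; "airbyte" as the namespace is an assumption.
    policy = irsa_trust_policy("123456789012",
                               "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
                               "airbyte", "airbyte-admin")
    print(json.dumps(policy, indent=2))
```

Comparing the printed document with the role's actual trust relationship in IAM should show whether the `sub` condition names the service account the pods really run as.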

@talhermon
Author

@jacoblElementor Happy to hear that you solved the issue.
In my case I'm able to assume the role, but after some time the pod just "loses" the service account permissions and falls back to the node role permissions (which don't include access to the bucket).

So I'm still facing the problem :(

@jacoblElementor

@talhermon

Does the container restart or something? Mine has been running for an hour with no issues so far, and I have already run some connections.
Could you send your values configuration?

@talhermon
Author

talhermon commented Feb 17, 2025

global:
  serviceAccountName: airbyte-admin
  storage:
    type: "S3"
    bucket: ## S3 bucket names that you've created. We recommend storing the following all in one bucket.
      log: <s3_bucket>
      state: <s3_bucket>
      workloadOutput: <s3_bucket>
    s3:
      region: "us-east-1" ## e.g. us-east-1
      authenticationType: instanceProfile ## Use "credentials" or "instanceProfile"

  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account_id>:role/Airbyte-role"

@jacoblElementor Restarting the pod fixes the problem for a couple of hours.
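
Because a restart helps only for a while, one thing that may be worth ruling out is the projected service account token itself. That token is a JWT that the kubelet rotates on disk, so a minimal sketch like the one below can show how long the current token is still valid. It assumes the token path comes from `AWS_WEB_IDENTITY_TOKEN_FILE` and only decodes the payload without verifying the signature. Nothing in this thread confirms that token expiry is the cause.

```python
import base64
import json
import os
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the (unverified) payload of a JWT into a dict."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Seconds remaining before the token's `exp` claim; negative if expired."""
    current = time.time() if now is None else now
    return jwt_claims(token)["exp"] - current


if __name__ == "__main__":
    path = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if path and os.path.isfile(path):
        with open(path) as handle:
            print(seconds_until_expiry(handle.read().strip()))
    else:
        print("no web identity token file found")
```

Running this at the moment the errors appear, and again after a restart, would show whether the failures line up with an expired or missing token.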

@jacoblElementor

I will monitor my deployment for a few more days to see if it reproduces.
Try running with DEBUG log level for the server.

Also, what Kubernetes version are you running?

@talhermon
Author

@jacoblElementor EKS 1.31.

@marcosmarxm
Member

cc @airbytehq/platform-deployments

@dimisjim

We are using GKE and are experiencing a similar issue on Helm chart 1.4.0. We use GCS for the logs, so this may be related: could it analogously be a Workload Identity issue there, instead of IRSA on AWS?

2025-02-18 10:01:28,412 [public-api-executor-thread-1]	WARN	i.a.c.j.JsonSchemas(traverseJsonSchemaInternal):203 - The object is a properties key or a combo keyword. The traversal is silently stopped. Current schema: {"type":"object","airbyte_hidden":true,"additionalProperties":true}

We see these warnings in the server pod logs when planning with the Airbyte Terraform provider. Applying afterwards shows no errors in the server logs, but fails to persist the state, with the provider message: "failure to invoke the API unknown status code returned: Status 504 upstream request timeout". I am commenting here because there is no such issue in the airbyte-terraform-provider repo, and this issue is the closest I have found.

@talhermon
Author

/bump - This is still an ongoing issue affecting our production environment.

Would appreciate any insights from the platform team on potential causes or workarounds. Happy to provide additional debugging information if needed.

@jacoblElementor

@talhermon The issue has not recurred for me; my deployment has been up and running for a week.
I just went over your comment with the values, and wanted to make sure that the

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::<account_id>:role/Airbyte-role"

block is not under global, right? It should be at the root level of the YAML.
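
The placement question above can be checked mechanically once the values file has been loaded into a dictionary (for example with a YAML parser). The sketch below assumes the layout described in this comment, with `serviceAccount` at the root level; the function name and messages are illustrative and not part of the chart.

```python
from typing import Any, List, Mapping

ROLE_ANNOTATION = "eks.amazonaws.com/role-arn"


def check_service_account_placement(values: Mapping[str, Any]) -> List[str]:
    """Return a list of likely problems with where the IRSA annotation lives."""
    problems: List[str] = []
    global_section = values.get("global") or {}
    if "serviceAccount" in global_section:
        # Per the comment above, the block is expected at the root level.
        problems.append("serviceAccount is nested under global")
    annotations = (values.get("serviceAccount") or {}).get("annotations") or {}
    if ROLE_ANNOTATION not in annotations:
        problems.append("root-level serviceAccount.annotations has no role-arn")
    return problems
```

An empty list means the annotation sits where this comment expects it. Whether the annotation actually reached the ServiceAccount object in the cluster still has to be confirmed separately.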

@sc-yan

sc-yan commented Mar 11, 2025

Same here with Helm chart 1.5.1.
We were using an AWS IAM user (access key and secret) before and switched to an IAM role with a service account. However, this issue happens from time to time, especially when the manifest changes (we use Argo CD to sync the cluster).
Deleting the pod may make the issue disappear, but there is no telling when it will happen again.


6 participants