Substream cursor #28278

Open
maxi297 opened this issue Jul 13, 2023 · 14 comments

@maxi297
Contributor

maxi297 commented Jul 13, 2023

What area does the feature impact?

Connectors

Relevant Information

As requested in Slack: For example, the first time I call the API /v1/deals, I get 3 ids and pass each one to the API /v1/deals/{id}/flow. The second time I call /v1/deals, I get 2 new ids and pass those to /v1/deals/{id}/flow. How to do this?

As of 2023-07-13, there is no way to do this because it requires a whole new way of managing the state: it is not incremental in the sense that "the ids are incremental", but in the sense that "we have never fetched the information for those ids".

Proposed solution
Have a new component (name to be reworked) to allow the cursor to manage a substream like this:

incremental_sync:
  type: SubstreamAlreadyFetchedCursor
  parent_stream: "#/definitions/parent_stream"
  parent_key: id
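
To make the intended semantics concrete, here is a minimal Python sketch of what such a cursor might do. It is not part of the Airbyte CDK: the class name, the `fetched_ids` state key, and the method names are assumptions chosen for illustration. The state is simply the set of parent ids already synced, and only unseen ids produce slices.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


class AlreadyFetchedCursor:
    """Hypothetical cursor that remembers which parent ids were already synced."""

    def __init__(self, parent_key: str, state: Optional[Dict[str, Any]] = None) -> None:
        self.parent_key = parent_key
        # "fetched_ids" is an assumed state layout, not an Airbyte convention.
        self._fetched = set((state or {}).get("fetched_ids", []))

    def stream_slices(self, parent_records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        # Emit one slice per parent id that has never been fetched before.
        for record in parent_records:
            parent_id = record[self.parent_key]
            if parent_id not in self._fetched:
                yield {"parent_id": parent_id}

    def close_slice(self, stream_slice: Dict[str, Any]) -> None:
        # Mark the id as done only after its child records were read.
        self._fetched.add(stream_slice["parent_id"])

    def get_state(self) -> Dict[str, Any]:
        return {"fetched_ids": sorted(self._fetched)}


# First run: three deals, all new.
cursor = AlreadyFetchedCursor(parent_key="id")
for stream_slice in cursor.stream_slices([{"id": 1}, {"id": 2}, {"id": 3}]):
    cursor.close_slice(stream_slice)
saved_state = cursor.get_state()

# Second run: only the two new deals produce slices.
cursor = AlreadyFetchedCursor(parent_key="id", state=saved_state)
new_slices = list(cursor.stream_slices([{"id": n} for n in range(1, 6)]))
```

One trade-off of this sketch is that the state grows with the number of parent ids ever seen, which may matter for large parent streams.
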
@tjhiggins

@maxi297 any updates here? This seems to be a blocker for me using Airbyte for Chorus's API. I can incrementally pull conversations, but they don't include the transcript, so I need to use a stream partition to query another endpoint for each conversation. However, I haven't found a way to do that without a full refresh on the substream.

@maxi297
Contributor Author

maxi297 commented Oct 23, 2023

@tjhiggins I see that some people have shown interest in this issue. Let me bring this back to the team and see if it's enough to prioritize it.

In the meantime, if this is blocking you and you are up for the challenge, you could implement your own version of HttpStream that would:

  • have a parent stream field which would be a conversations stream
  • re-implement state getter and setter to forward this to the conversations stream
  • re-implement stream_slices to fetch the conversations stream records
  • re-implement path in order to consider the information of the slices
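
As a rough illustration of the four points above, here is a self-contained Python sketch. It deliberately does not import the Airbyte CDK: both classes are plain stand-ins, the in-memory record list, the `updated_at` cursor field, and the `conversations/{id}/transcript` path are all assumptions, and a real HttpStream subclass would have to match the CDK's actual method signatures, which take additional arguments.

```python
from typing import Any, Dict, Iterator, List


class ConversationsStream:
    """Stand-in for an incremental parent stream (not a CDK class)."""

    cursor_field = "updated_at"  # assumed cursor field

    def __init__(self, records: List[Dict[str, Any]]) -> None:
        self._records = records  # hypothetical in-memory replacement for the API
        self.state: Dict[str, Any] = {}

    def read_records(self) -> Iterator[Dict[str, Any]]:
        # Only yield records newer than the saved cursor value.
        since = self.state.get(self.cursor_field, "")
        for record in sorted(self._records, key=lambda r: r[self.cursor_field]):
            if record[self.cursor_field] > since:
                yield record
                self.state = {self.cursor_field: record[self.cursor_field]}


class TranscriptsStream:
    """Stand-in for the substream; its state is entirely the parent's state."""

    def __init__(self, parent: ConversationsStream) -> None:
        self.parent = parent  # parent stream field

    @property
    def state(self) -> Dict[str, Any]:
        return self.parent.state  # getter forwards to the parent

    @state.setter
    def state(self, value: Dict[str, Any]) -> None:
        self.parent.state = dict(value)  # setter forwards to the parent

    def stream_slices(self) -> Iterator[Dict[str, Any]]:
        # One slice per conversation the parent emits in this run.
        for conversation in self.parent.read_records():
            yield {"conversation_id": conversation["id"]}

    def path(self, stream_slice: Dict[str, Any]) -> str:
        # Hypothetical endpoint layout built from the slice.
        return f"conversations/{stream_slice['conversation_id']}/transcript"


conversations = [
    {"id": "a", "updated_at": "2023-10-01"},
    {"id": "b", "updated_at": "2023-10-02"},
]
child = TranscriptsStream(ConversationsStream(conversations))
first_run_paths = [child.path(s) for s in child.stream_slices()]
```

In this sketch the parent cursor advances as soon as a slice is produced, so a failure while reading a transcript could skip that conversation on the next run. A real implementation would probably want to update the state only after each slice has been fully read.
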

I'll keep you posted on this issue!

@maxi297
Contributor Author

maxi297 commented Oct 24, 2023

Grooming:

  • In terms of the YAML manifest, we could avoid having a new type of cursor by adding a field "forward_to_parent". The implementer can see if this makes sense.

@maxi297
Contributor Author

maxi297 commented Oct 24, 2023

@tjhiggins This has been deemed not aligned with our team's current goals. We will re-evaluate before the next cycle which is around mid-November

@tjhiggins

> @tjhiggins This has been deemed not aligned with our team's current goals. We will re-evaluate before the next cycle which is around mid-November

Thanks for the update.

@lmossman
Contributor

Another request for this feature: https://airbytehq-team.slack.com/archives/C027KKE4BCZ/p1705077322351399

@bleonard added the "frozen" (Not being actively worked on) label on Mar 22, 2024
@NAjustin
Contributor

NAjustin commented May 7, 2024

And another one as well from Slack: https://airbytehq.slack.com/archives/C027KKE4BCZ/p1715089626142729

I have one API that would align with this as well. Since the child object only changes when the parent changes, a feature like this would prevent about 100K unneeded requests per run, which would also help with their strict API limits.

@TorstenFraust

Same here, Jiminny API.
We have a parent stream, activities, which would take 17h+ for a full refresh, with a low chance of completing without timeouts. A child stream, 'summary', requests the summary of one of these activities. While the parent stream is incremental, the child stream tries to run the parent again for the whole year, which is
a) data we don't want to pull again
b) doomed to fail, as it's too much for the API

@mariana-s-fernandes

Any update on this?

@lmossman
Contributor

@mariana-s-fernandes @TorstenFraust @NAjustin
This is now supported in the Connector Builder if the substream is also configured to be incremental:
image

Can you confirm if this solves your use cases?

@TorstenFraust

@lmossman
Unfortunately not, as the substream has to be incremental. I don't understand this limitation, as I was expecting this feature to just pass the parent ids from the current run to the child stream. My substream accepts just one input, the id of the parent stream, so I cannot make it incremental.
https://arc.net/l/quote/dvqztooa

What is the behaviour if the substream is not incremental? Right now it looks to me like the substream runs before the parent stream.

Screenshot 2024-09-05 at 18 28 38

@htkapiche

htkapiche commented Dec 5, 2024

@TorstenFraust have you found a solution or workaround, specifically related to Jiminny?

@adeolaemmanuelmorren

adeolaemmanuelmorren commented Mar 7, 2025

Is there any update on the above? @lmossman it has not solved the use case that @TorstenFraust mentioned. I'm having the exact same problem.

Regardless of whether that option is selected, the child still runs a full refresh every time.

@htkapiche were you able to solve this at all?

@lmossman
Contributor

I have raised the request to support this on non-incremental child streams with the team. I can't guarantee when we will get to it, but they are hoping to tackle it soon.
