
Consider Adding Cache Layer to Reduce Requests to Trino and Improve Response Time #606

Open
digarcia78 opened this issue Jan 31, 2025 · 18 comments

@digarcia78

We’ve noticed that duplicate requests are often made to Trino, which can result in slow response times because the same results are recalculated repeatedly. To improve performance, we would like to suggest implementing a cache layer that stores intermediate results and avoids unnecessary requests to Trino for the same data.

Would love to hear if this is something that could be implemented, or if there are any existing solutions in place for this. Do you have any recommendations or suggestions on how we might approach this?

@xkrogen
Member

xkrogen commented Jan 31, 2025

Actually just two weeks ago I built something like this into the Gateway as a hackday project :) I think it's a great idea -- at LinkedIn we see the same situation as what you described, where there are many instances of duplicate queries doing redundant computation. Dashboarding solutions hooked up to Trino definitely cause this -- maybe a bunch of different people are loading the same dashboard, or you've edited one chart on a dashboard and when it refreshes it reloads all of the charts even though most of them haven't changed.

Given that Trino queries typically have very small output result sets, the cache itself can be fairly small. In my POC, I stored the results in the same DB that the GW uses for all of its other operational data. We could consider doing this only for queries with very small outputs, and use different, more scalable storage for queries with larger result sets. One area I'm also super interested in is using the new spooling client protocol to implement this cache in a zero-copy manner, at least for clients which leverage the spooled protocol.

I think the hardest part of this is probably determining how long to cache the results for and when to invalidate the cache. In my POC I just had a fixed TTL and used the exact query text as the cache key. But I think to make this work well, we'd have to add much smarter invalidation logic. We could look for nondeterministic functions (e.g. random) or time-based functions (e.g. current_time or current_date) and treat those differently. For data sources which support it, such as Iceberg, we could use a mechanism like snapshot IDs to determine if the input data sources have changed and then invalidate the cache. Lots of interesting stuff that can be explored here!
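To make that concrete, here is a minimal sketch of such a cacheability check, assuming a plain-text scan; all names here are hypothetical, and a real implementation would walk the AST from Trino's SQL parser instead:

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: refuse to cache queries whose results can change
// between otherwise identical submissions.
public final class CacheabilityChecker {
    // Functions whose output is nondeterministic
    private static final Set<String> NONDETERMINISTIC = Set.of(
            "random", "rand", "uuid", "shuffle", "now");

    // current_date and friends are valid in Trino without parentheses,
    // so they are matched as bare keywords instead of function calls
    private static final Set<String> TEMPORAL_KEYWORDS = Set.of(
            "current_date", "current_time", "current_timestamp", "localtime", "localtimestamp");

    private static final Pattern FUNCTION_CALL = Pattern.compile("\\b([a-z_]+)\\s*\\(");

    public static boolean isCacheable(String queryText) {
        String sql = queryText.toLowerCase();
        for (String keyword : TEMPORAL_KEYWORDS) {
            if (sql.contains(keyword)) {
                return false;
            }
        }
        Matcher matcher = FUNCTION_CALL.matcher(sql);
        while (matcher.find()) {
            if (NONDETERMINISTIC.contains(matcher.group(1))) {
                return false;
            }
        }
        return true;
    }
}
```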

As an even further extension, what really gets me excited is that once we're persisting the query results, we could use that to implement query-level retries in the Gateway, pulling the same functionality as Trino FTE's QUERY retry policy up to the Gateway layer to allow it to recover even from coordinator/cluster failures and do retries across clusters (including potentially retrying a query on a "larger" or differently tuned cluster).

@RoeyoOgen

@xkrogen just out of curiosity can you possibly share the code?
@digarcia78 I too think this will be an awesome feature to implement. Would love to be a part of it if you need some help.

@xkrogen
Member

xkrogen commented Feb 14, 2025

I think I should be able to share a branch, though fair warning that it is veryyyy far from being production-ready :) I will try to share it next week.

@RoeyoOgen

RoeyoOgen commented Feb 15, 2025

@xkrogen thanks!

I was thinking about implementing this as part of the routing rules engine (before stumbling upon this issue). Would it be acceptable to implement this as part of the actual gateway service?
Maybe @mosabua can give an opinion?

Was also leaning towards using Redis as the caching db...

@willmostly
Contributor

I would prefer implementing caching in Trino and making the gateway cache-aware, rather than having the gateway implement caching directly. This would have the advantage of supporting data caching (like https://trino.io/docs/current/object-storage/file-system-cache.html) and subquery caching as well as result set caching. This paper has some interesting ideas: https://storage.googleapis.com/gweb-research2023-media/pubtools/5018.pdf

Caching at the gateway layer introduces security concerns that we do not have today.

  • Minimally, the gateway would need to authenticate users instead of delegating this responsibility to Trino
  • A multi-user shared cache would require authorization, and authorization requires analyzing the query to implement view security. Analysis requires connectors - at this point we've almost turned the gateway into a Trino coordinator. If you were to do all this, you would also need to ensure consistency between the authorization rules at the gateway and all of its backend clusters. This would not be an issue if the gateway were cache-aware.

@xkrogen
Member

xkrogen commented Feb 21, 2025

Minimally, the gateway would need to authenticate users instead of delegating this responsibility to Trino

Agreed, though doing auth in the GW seems reasonable.

A multi-user shared cache would require authorization, ...

Yes, agreed that a shared cache introduces a lot of complexity. In my POC I limited the cache's scope to be per-user to sidestep such issues; a user can only access their own cached results. I'm sure other environments will vary, but I found that in our environment this only hurts the cache efficacy/hit rate by a few percent.
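To make that scoping concrete, here is a hypothetical sketch of a per-user cache key (class and method names invented for illustration): hashing the authenticated user together with the query text means one user can never be handed another user's cached results.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Hypothetical sketch: the cache key binds the authenticated user to the
// query text, so lookups are scoped per-user by construction.
public final class ResultCacheKey {
    public static String of(String user, String queryText) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(user.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // separator so ("ab", "c") != ("a", "bc")
            digest.update(queryText.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-256 is always available", e);
        }
    }
}
```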

I would prefer implementing caching at Trino and making the gateway cache-aware to having the gateway implement caching directly. ...

I would agree with you except for this point I called out in my earlier comment:

As an even further extension, what really gets me excited is that once we're persisting the query results, we could use that to implement query-level retries in the Gateway, pulling the same functionality as Trino FTE's QUERY retry policy up to the Gateway layer to allow it to recover even from coordinator/cluster failures and do retries across clusters (including potentially retrying a query on a "larger" or differently tuned cluster).

This one is really the motivating factor for me -- QUERY-level retries in the GW is my real goal, which requires persisting/spooling results, and that was what gave me the idea that if you're persisting these results anyway, you may as well build a cache on top of it as well. Caching (at least scoped to a user) is much simpler than the retries, so I started there.

@mosabua
Member

mosabua commented Feb 21, 2025

We discussed this in the dev sync as well. I strongly agree with @xkrogen .. a per-user cache will already solve a lot of dashboard and service-user refresh scenarios and is worth doing for that use case alone. Roey will write up more details and potentially work on a PoC .. but we should totally collaborate on that.

At the same time we do need to be aware of the security issues. We can probably grow functionality of it all over time.

And lastly ... we know from trino-lb that issuing the query ID in the lb, holding on to the user while clusters scale up or start up, and holding on to the request until they are ready all work .. we should eventually get there, and we have the db backend to store the queue, so we should be able to move towards that as well.

@xkrogen
Member

xkrogen commented Feb 21, 2025

@RoeyoOgen here's my branch. Lots of room for improvement!! https://github.com/xkrogen/trino-gateway/tree/xkrogen/resultset-caching-poc

@mosabua
Member

mosabua commented Feb 21, 2025

Before I forget .. we should use an open source db for caching - so probably Valkey rather than Redis
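Since Valkey is protocol-compatible with Redis, an existing Java client such as Jedis should work against it unchanged. A minimal sketch, with the host, key layout, and TTL as placeholder assumptions:

```java
import redis.clients.jedis.JedisPooled;

// Minimal sketch: Valkey speaks the Redis protocol, so Jedis connects to it
// unchanged. The key layout and TTL here are placeholders, not a settled design.
public class ValkeyResultCache {
    private final JedisPooled client = new JedisPooled("valkey.internal", 6379);

    public void put(String cacheKey, String serializedPage, long ttlSeconds) {
        client.setex(cacheKey, ttlSeconds, serializedPage); // expires after the TTL
    }

    public String get(String cacheKey) {
        return client.get(cacheKey); // null on cache miss or expiry
    }
}
```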

@RoeyoOgen

@RoeyoOgen here's my branch. Lots of room for improvement!! https://github.com/xkrogen/trino-gateway/tree/xkrogen/resultset-caching-poc

Thanks!

Taking all your comments on board and will start to tinker with a design on my part :)

@RoeyoOgen

RoeyoOgen commented Mar 3, 2025

So, a bit of an update.
After reading more about the caching functionality, it really seems it would be better to rely on the Trino cluster's caching.
I think we do not want to change the current response flow, in which the Trino cluster returns the query response to the client rather than the GW.

Regarding the point we made about doing so just for dashboard refreshes:

per user cache will already solve a lot of dashboard and service user refresh scenarios and is worth doing just for that use case only

@mosabua do you still think this should be solved at the GW level and not the cluster one?

Also @willmostly

  • This would not be an issue if the gateway were cache-aware

Why does the GW need to be "cache-aware"? Will the Trino cluster not just return the cached result immediately? Or do we want this just to minimise the number of queries arriving at the cluster?

@xkrogen
Member

xkrogen commented Mar 3, 2025

For the benefit of future readers, here is a sequence diagram showing the communication flow with the POC from my branch. The same query is submitted twice; the first run populates the cache, and the second returns responses from the cache. Note that the first flow (cache population) is unchanged from the GW's current behavior except for the addition of INSERTs into the database.

As @RoeyoOgen notes, for cached responses, the Trino backend is completely out of the loop; responses are returned directly from the GW itself.

```mermaid
sequenceDiagram
    User->>+Client: "SELECT a FROM tbl"

    Client->>+GW: POST /v1/statement "SELECT a FROM tbl"
    GW->>+Backend: POST /v1/statement "SELECT a FROM tbl"
    Backend->>-GW: {"id": "123", "state": "QUEUED", "nextUri": "/v1/statement/queued/123/xxx/1"}
    GW->>Database: INSERT {"queryID": "123", pageID: "POST", content: "..."}
    GW->>-Client: {"id": "123", "state": "QUEUED", "nextUri": "/v1/statement/queued/123/xxx/1"}

    Client->>+GW: GET /v1/statement/queued/123/xxx/1
    GW->>+Backend: GET /v1/statement/queued/123/xxx/1
    Backend->>-GW: {"id": "123", "state": "RUNNING", "nextUri": "/v1/statement/executing/123/xxx/0"}
    GW->>Database: INSERT {"queryID": "123", pageID: "queued/1", content: "..."}
    GW->>-Client: {"id": "123", "state": "RUNNING", "nextUri": "/v1/statement/executing/123/xxx/0"}

    Client->>+GW: GET /v1/statement/executing/123/xxx/0
    GW->>+Backend: GET /v1/statement/executing/123/xxx/0
    Backend->>-GW: {"id": "123", "state": "RUNNING", "nextUri": "/v1/statement/executing/123/xxx/1", "data": ["a", "b"]}
    GW->>Database: INSERT {"queryID": "123", pageID: "executing/0", content: "..."}
    GW->>-Client: {"id": "123", "state": "RUNNING", "nextUri": "/v1/statement/executing/123/xxx/1", "data": ["a", "b"]}

    Client->>+GW: GET /v1/statement/executing/123/xxx/1
    GW->>+Backend: GET /v1/statement/executing/123/xxx/1
    Backend->>-GW: {"id": "123", "state": "FINISHED", "nextUri": "/v1/statement/executing/123/xxx/2", "data": ["c", "d"]}
    GW->>Database: INSERT {"queryID": "123", pageID: "executing/1", content: "..."}
    GW->>-Client: {"id": "123", "state": "FINISHED", "nextUri": "/v1/statement/executing/123/xxx/2", "data": ["c", "d"]}

    Client->>+GW: GET /v1/statement/executing/123/xxx/2
    GW->>+Backend: GET /v1/statement/executing/123/xxx/2
    Backend->>-GW: {"id": "123", "state": "FINISHED"}
    GW->>Database: INSERT {"queryID": "123", pageID: "executing/2", content: "..."}
    GW->>Database: INSERT {"queryID": "123", completed: TRUE}
    GW->>-Client: {"id": "123", "state": "FINISHED"}

    Client->>-User: ["a", "b", "c", "d"]


    User->>+Client: "SELECT a FROM tbl"

    Client->>+GW: POST /v1/statement "SELECT a FROM tbl"
    GW->>+Database: SELECT {"queryID": "123", pageID: "POST"}
    Database->>-GW: content: {...}
    GW->>-Client: {"id": "123", "state": "QUEUED", "nextUri": "/v1/statement/queued/123/xxx/1"}

    Client->>+GW: GET /v1/statement/queued/123/xxx/1
    GW->>+Database: SELECT {"queryID": "123", pageID: "queued/1"}
    Database->>-GW: content: {...}
    GW->>-Client: {"id": "123", "state": "RUNNING", "nextUri": "/v1/statement/executing/123/xxx/0"}

    Client->>+GW: GET /v1/statement/executing/123/xxx/0
    GW->>+Database: SELECT {"queryID": "123", pageID: "executing/0"}
    Database->>-GW: content: {...}
    GW->>-Client: {"id": "123", "state": "RUNNING", "nextUri": "/v1/statement/executing/123/xxx/1", "data": ["a", "b"]}

    Client->>+GW: GET /v1/statement/executing/123/xxx/1
    GW->>+Database: SELECT {"queryID": "123", pageID: "executing/1"}
    Database->>-GW: content: {...}
    GW->>-Client: {"id": "123", "state": "FINISHED", "nextUri": "/v1/statement/executing/123/xxx/2", "data": ["c", "d"]}

    Client->>+GW: GET /v1/statement/executing/123/xxx/2
    GW->>+Database: SELECT {"queryID": "123", pageID: "executing/2"}
    Database->>-GW: content: {...}
    GW->>-Client: {"id": "123", "state": "FINISHED"}

    Client->>-User: ["a", "b", "c", "d"]
```
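Reading the replay path above as code: each protocol page the backend produced for the original query is stored under (queryID, pageID) and returned verbatim on the second run. A hypothetical sketch, with an in-memory map standing in for the GW database and all names invented for illustration:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the POC's page store: responses are recorded during
// the cache-population pass and replayed verbatim on a cache hit, so the
// client sees an identical exchange without the backend being contacted.
public class CachedPageStore {
    private final Map<String, String> pages = new ConcurrentHashMap<>();

    private static String key(String queryId, String pageId) {
        return queryId + "/" + pageId;
    }

    // Called once per proxied response while the original query runs
    public void record(String queryId, String pageId, String responseBody) {
        pages.put(key(queryId, pageId), responseBody);
    }

    // Called on the replay pass; empty means fall through to the backend
    public Optional<String> replay(String queryId, String pageId) {
        return Optional.ofNullable(pages.get(key(queryId, pageId)));
    }
}
```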

I think we do not want to change the current response flow in which the Trino cluster returns the query response to the client and not the GW.

Can you elaborate more on your opinion here, @RoeyoOgen?

@RoeyoOgen

RoeyoOgen commented Mar 3, 2025

@xkrogen I think you may have copied the first query's drawing for the 2nd query as well, but I get your point anyway :)

Let's focus on how the gateway works today at a high level (from my understanding and testing):

When the client (user) submits a query, the Gateway routes it to the correct cluster, but the Trino cluster itself returns the response to the client ("current state" below).

In the current suggestion of adding caching, the response is returned via the gateway. This can be problematic for various reasons, the main one being that the GW will need enough resources to store responses and return them to users, not just for one cluster but essentially for all of them. Meaning it needs to be as big as all the coordinators combined, becoming a "mega coordinator".

[attached image: diagram of the "current state" request flow]

@mosabua
Member

mosabua commented Mar 4, 2025

Some points:

  • The current preferred behavior is that request and response go through the Trino Gateway. The reason for this is that in a common and preferred deployment, the underlying, responding clusters are an implementation detail and should not be exposed to the user. In fact, in many deployments they are not even reachable from a network perspective .. only the Trino Gateway is .. since it is .. well .. the gateway. See config details and such in https://trinodb.github.io/trino-gateway/installation/#trino-configuration, and the config excerpt after this list.

  • In terms of load - keep in mind that Trino Gateway is a lightweight server that does NOT have to do that much - at least compared to the coordinator. In addition, it can be clustered since it is stateless .. so if you really end up with load problems .. just make Trino Gateway a cluster of machines.
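For reference, the key Trino-side setting from the linked installation docs is forwarded-header handling, so that the nextUri values a cluster returns point back at the gateway rather than at the cluster itself:

```properties
# Backend Trino cluster config.properties (per the linked docs):
# honor the X-Forwarded-* headers set by the gateway
http-server.process-forwarded=true
```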

Both of these aspects mean, in my opinion, that caching at the Trino Gateway layer does in fact make sense. Even if we just implement a naive, per-user cache. As long as it's optional, we can disable it, and we can also see which usage scenarios it works well in and which it does not.

Caching at the Trino cluster level still makes sense as well, in parallel. I think the two can be complementary, and as long as both are configurable, users can adjust to what works for them.

Now to one more point that does change things quite a bit - the new spooling protocol.

We have not really tested this much, but it will most likely work just fine. The protocol has multiple operating modes and potentially reduces the time to query completion from the coordinator's point of view quite a lot. It also reduces the load on the coordinator a lot. Potentially it also requires client tools to have access to the object storage used for spooling .. or to the workers. So depending on the scenario we might have to adjust some aspects of Trino Gateway .. all that work still has to be started .. well .. in fact .. testing and playing around with it has to be started. However I don't think it should stop us from implementing a cache for Trino Gateway.

@mosabua
Member

mosabua commented Mar 4, 2025

Also .. to be explicit:

@RoeyoOgen saying

When the client (user) submits the query the Gateway routes it to the correct cluster but the Trino cluster itself returns the response to the client ("current state" below).

that is NOT the suggested config as documented.. but yes.. it is also possible to do that.

@RoeyoOgen

@mosabua
Regarding -

that is NOT the suggested config as documented.. but yes.. it is also possible to do that.

In our case it's actually a big necessity. Since we run on-prem Kubernetes clusters, we don't mind that the client has knowledge of the Trino backends; it is, however, important that the query results are not returned via the GW.
We use the gateway to give the clients a single endpoint, which we utilise for blue-green cluster upgrades, automatic routing to the correct K8s cluster site, etc. But since the client communication can be with a different K8s site, we want the client to talk to the Trino instance and not the GW, in order to minimise traffic between the K8s clusters.

@xkrogen
Member

xkrogen commented Mar 4, 2025

@xkrogen I think you may have copied the first query's drawing for the 2nd query as well but i get your point anyways :)

I'm not sure what you mean. The 2nd query follows a different flow. The communication between client and GW is identical, which is intentional -- the responses are replayed. But the communication between GW and backend has been replaced with communication between GW and DB.

Anyway, +1 to @mosabua's commentary above. A lot of functionality, including caching, can be added by using the GW as a full proxy. I agree that if you cannot leverage this setup, then caching in the GW won't work for you.

@mosabua
Member

mosabua commented Mar 4, 2025

@RoeyoOgen I agree that we also want to continue to allow the usage mode you mention, where follow-ups go to the cluster .. but it should not be the default .. basically just what we have now.

Spooling also puts another wrinkle on that, since there are modes where it goes through the coordinator, through the workers, or straight to storage .. all of them are reasonable in some scenarios and all should ultimately work with Trino Gateway.

I think the main thing we are lacking for all that is documentation to explain and help with configuration for the different scenarios.
