-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: instrument Spanner client with OpenCensus metrics #54
Conversation
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good.
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
...-spanner/src/main/java/com/google/cloud/spanner/v1/stub/metrics/MetricRegistryConstants.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this looks good as well. Adding @olavloite to get his thoughts on the implementation as well since he's the most familiar with the Java client library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks generally good to me, but I would recommend changing the description of the measurement of maxSessionsInUse
. I think it would confuse many users if they see a value that is only increasing (or at least never decreasing) during a period of 10 minutes that is called 'The number of active sessions'.
// The Metric name and description | ||
public static final String ACTIVE_SESSIONS = "cloud.google.com/java/spanner/active_sessions"; | ||
public static final String MAX_SESSIONS = "cloud.google.com/java/spanner/max_sessions"; | ||
public static final String ACTIVE_SESSIONS_DESCRIPTION = "The number of active sessions"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the description of this metric should be different, considering it is reading maxSessionsInUse
. This value will only increase during a 10 minute interval, and then be reset to zero (and then start increasing again). Suggestion, something like:
'Max number of sessions in use during the last 10 minutes'.
I noticed the other discussion about whether to measure numSessionsInUse
or maxSessionsInUse
. I think both could be valuable for the user, and numSessionsInUse
should have a description like 'The number of sessions currently in use'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will change the description for maxSessionsInUse
. Could you please elaborate more on numSessionsInUse
metric. How is it different from maxSessionsInUse
and how do you think it is useful for users in conjunction with maxSessionsInUse
?
Just so you know, we don't want to add too many metrics due to costs associated with exporting to backend (like stackdriver). Having said that, I am more than happy to include metrics which are meaningful (and useful) for the users to resolve/understand the production issues at some extend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numSessionsInUse
reflects the number of sessions checked out from the pool at this very moment. Depending on probe intervals and average transaction duration, this number might often be measured to be zero for a system under low load. E.g. if the system executes 1 transaction per second, the average transaction duration is 300ms and the value is probed once every 10 seconds, the chances are big that you will often measure the value between two transactions and get a zero. On a system with higher load, measuring this value will give the user a better idea about how the distribution of the transactions are over time. If this value is relatively constant most of the time, but suddenly shows a spike that cannot be correlated with a similar sudden spike in user requests, it might be an indication of a concurrency problem somewhere (locks in the database, concurrency issues in the application itself etc.).maxSessionsInUse
reflects the maximum number ofnumSessionsInUse
during the current maintenance window. A maintenance window is a hard-coded 10 minute interval. After a complete maintenance window has passed, the value is reset to 0. The value is updated every time a session is checked out of the pool by the simple calculationmaxSessionsInUse = Math.max(numSessionsInUse, maxSessionsInUse)
.
As you can conclude from the above (and which I also just now realized ;-)), the maxSessionsInUse
has a slight flaw: If no new session is checked out during a maintenance window, the value will remain zero during the entire window, even if there are sessions that are kept checked out during the entire window (i.e. the sessions were checked out during a previous maintenance window).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. I have added numSessionsInUse
metric PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to get this PR merged before adding other metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think you should change the packages of the MetricsRegistryConstants and MetricsRegistryTestUtils. Unless there's a very good reason for it, I would suggest placing these in the normal com.google.cloud.spanner package. Sorry for not noticing this before :-(
For the rest; LGTM
...-spanner/src/main/java/com/google/cloud/spanner/v1/stub/metrics/MetricRegistryConstants.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/v1/MetricRegistryTestUtils.java
Outdated
Show resolved
Hide resolved
"Max number of sessions in use during the last 10 minutes"; | ||
public static final String MAX_SESSIONS_DESCRIPTION = "The number of max sessions configured"; | ||
public static final String SESSIONS_IN_USE_DESCRIPTION = | ||
"The number of sessions checked out from the pool"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
names are confusing. I would suggest
old | new | Description |
---|---|---|
active_sessions | max_in_use_session | The max number of sessions concurrently in use during the current 10 minutes interval |
max sessions | max_allowed_sessions | The Maximum number of sessions allowed. Configurable by the user. |
sessions_in_use | in_use_sessions | The number of sessions currently in use |
Alternatively, in_use can be replaced with 'active' in name and description.
checked out has different meaning than active/in_use. If that is appropriate, then it should be used in the name and the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in e6938bc, decided to use in_use. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nits and a comment about the test but LGTM otherwise.
google-cloud-spanner/src/main/java/com/google/cloud/spanner/MetricRegistryConstants.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/MetricRegistryConstants.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/MetricRegistryConstants.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/main/java/com/google/cloud/spanner/SessionPool.java
Outdated
Show resolved
Hide resolved
google-cloud-spanner/src/test/java/com/google/cloud/spanner/SessionPoolTest.java
Outdated
Show resolved
Hide resolved
Add active_sessions (The number of active sessions in use) and max_sessions (The number of max sessions configured the user) metrics
Use maxSessionsInUse instead of numSessionsInUse and update the description.
This change addes five metrics for session management: * in_use_sessions: the current number of sessions that are checked out from the session pool. * max_in_use_sessions: the maximum number of in_use_sessions in last 10 minutes. This is based on tumbling windows, instead of sliding windows. * max_allowed_sessions: the maximum number of sessions that is configured by the user. * get_sessions_timeout: the cumulative number of get sessions timeouts when pool exhaustion happens. * num_acquired_sessions: the cumulative number of sessions that are checked out from the session pool. * num_released_sessions: the cumulative number of sessions that are released back to the session pool. All metrics are tagged by: * client_id: each database has its own increasing ID sequence. For two different databases, their client IDs all start from "client-1". * database: the database ID. * instance_id: the instance ID. * library_version: the library version from google-cloud-go/internal/version which is a date in YYYYMMDD format. Notes: There are three ways to check out a session from the pool: 1) take(): get a read session called by a user 2) takeWriteSession(): get a read/write session by a user 3) getNextForTx() in healthchecker's worker (session.go:1336): healthchecker's workers convert read sessions to r/w sessions. E.g., when initializing the pool, workers check out 20 read sessions from the pool, turn them into r/w sessions, and put them back to the pool , assume that MinOpended=100 and WriteSessions=0.2. So some metrics are also emitted by case 3. This might confuse users if they are not aware of this behaviour. Related Java PRs: * googleapis/java-spanner#54 * googleapis/java-spanner#65 * googleapis/java-spanner#67 Change-Id: Ie163b08ef18ac2a47e1669fefab92d61fe8f2a82 Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/52953 Reviewed-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Knut Olav Løite <koloite@gmail.com>
This change addes five metrics for session management: * in_use_sessions: the current number of sessions that are checked out from the session pool. * max_in_use_sessions: the maximum number of in_use_sessions in last 10 minutes. This is based on tumbling windows, instead of sliding windows. * max_allowed_sessions: the maximum number of sessions that is configured by the user. * get_sessions_timeout: the cumulative number of get sessions timeouts when pool exhaustion happens. * num_acquired_sessions: the cumulative number of sessions that are checked out from the session pool. * num_released_sessions: the cumulative number of sessions that are released back to the session pool. All metrics are tagged by: * client_id: each database has its own increasing ID sequence. For two different databases, their client IDs all start from "client-1". * database: the database ID. * instance_id: the instance ID. * library_version: the library version from google-cloud-go/internal/version which is a date in YYYYMMDD format. Notes: There are three ways to check out a session from the pool: 1) take(): get a read session called by a user 2) takeWriteSession(): get a read/write session by a user 3) getNextForTx() in healthchecker's worker (session.go:1336): healthchecker's workers convert read sessions to r/w sessions. E.g., when initializing the pool, workers check out 20 read sessions from the pool, turn them into r/w sessions, and put them back to the pool , assume that MinOpended=100 and WriteSessions=0.2. So some metrics are also emitted by case 3. This might confuse users if they are not aware of this behaviour. Related Java PRs: * googleapis/java-spanner#54 * googleapis/java-spanner#65 * googleapis/java-spanner#67 Change-Id: Ie163b08ef18ac2a47e1669fefab92d61fe8f2a82 Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/52953 Reviewed-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Knut Olav Løite <koloite@gmail.com>
Updates #53
First set of Session Pool related metrics:
cloud.google.com/java/spanner/max_allowed_sessions
The maximum number of sessions allowed. Configurable by the user. This metric is labeled by database name, instance id and library version.
cloud.google.com/java/spanner/in_use_sessions
The number of sessions currently in use (or checked out from the pool at this very moment). This metric is labeled by database name, instance id and library version.
cloud.google.com/java/spanner/max_in_use_session
The maximum number of sessions in use during the last maintenance window interval. A maintenance window is a set 10 minute interval. After a complete maintenance window has passed, the value is reset to zero (and then start increasing again). This metric is labeled by database name, instance id and library version.
I will open follow-up PRs to address other metrics mentioned in the issue.