bogdan-st
Contributor

Adds an initial draft of the User Overrides API, described in https://cortexmetrics.io/docs/proposals/overrides-api/

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@bogdan-st force-pushed the overrides_api branch 2 times, most recently from 29f5299 to 1dbe402, on August 15, 2025 18:55
Signed-off-by: Bogdan Stancu <[email protected]>
@bogdan-st changed the title from "Overrides API initial draft" to "Add Overrides API component and rename old overrides to overrides-configs" on Aug 17, 2025
Member

@friedrichg friedrichg left a comment

I am liking it. I do have a couple of suggestions.

Signed-off-by: Bogdan Stancu <[email protected]>
@bogdan-st force-pushed the overrides_api branch 2 times, most recently from 90512a2 to 608f1ab, on August 20, 2025 16:01
Signed-off-by: Bogdan Stancu <[email protected]>
Member

@friedrichg friedrichg left a comment

I gave it another pass. Thanks for your hard work!

# CLI flag: -query-scheduler.grpc-client-config.connect-timeout
[connect_timeout: <duration> | default = 5s]

overrides:
Member

You added a new target "overrides", that's good. No need to add new flags. Use the flags available: https://cortexmetrics.io/docs/configuration/configuration-file/#runtime_configuration_storage_config

Comment on lines 60 to 62
if c.Backend == bucket.Filesystem {
	return errors.New(ErrFilesystemBackendNotSupported)
}
Member

Filesystem is supported by the runtime configuration overrides. The API should support it. Remove this.

)

// Config holds configuration for the overrides module
type Config struct {
Member

Remove this config. Use

type Config struct {

RuntimeConfig runtimeconfig.Config `yaml:"runtime_config"`
MemberlistKV memberlist.KVConfig `yaml:"memberlist"`
QueryScheduler scheduler.Config `yaml:"query_scheduler"`
Overrides overrides.Config `yaml:"overrides"`
Member

Suggested change (remove this line):
- Overrides overrides.Config `yaml:"overrides"`

c.RuntimeConfig.RegisterFlags(f)
c.MemberlistKV.RegisterFlags(f)
c.QueryScheduler.RegisterFlags(f)
c.Overrides.RegisterFlags(f)
Member

Suggested change (remove this line):
- c.Overrides.RegisterFlags(f)

| [Tenant delete request](#tenant-delete-request) | Purger || `POST /purger/delete_tenant` |
| [Tenant delete status](#tenant-delete-status) | Purger || `GET /purger/delete_tenant_status` |
| [Get user overrides](#get-user-overrides) | Overrides || `GET /api/v1/user-overrides` |
| [Set user overrides](#set-user-overrides) | Overrides || `PUT /api/v1/user-overrides` |
Member

Suggested change
- | [Set user overrides](#set-user-overrides) | Overrides || `PUT /api/v1/user-overrides` |
+ | [Set user overrides](#set-user-overrides) | Overrides || `POST /api/v1/user-overrides` |

Comment on lines 30 to 33
type RuntimeConfigFile struct {
	Overrides     map[string]map[string]interface{} `yaml:"overrides"`
	HardOverrides map[string]map[string]interface{} `yaml:"hard_overrides"`
}
Member

Suggested change (delete these lines):
- type RuntimeConfigFile struct {
-	Overrides     map[string]map[string]interface{} `yaml:"overrides"`
-	HardOverrides map[string]map[string]interface{} `yaml:"hard_overrides"`
- }

This overwrites the other values saved in the runtime config, such as ingester_limits.

Use

type RuntimeConfigValues struct {
instead

Also make sure the API doesn't delete ingester_limits and similar values.
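
One way to keep unrelated sections intact is to update only the tenant's entry under the overrides key and write everything else back unchanged. The following is a minimal sketch, not the PR's implementation: it assumes the runtime config file has already been decoded into a generic map, and the helper name setTenantOverrides is hypothetical.

```go
package main

// setTenantOverrides replaces the limits for one tenant under the
// "overrides" key and leaves every other top-level key (for example
// "ingester_limits") untouched. doc is the decoded runtime config file
// and must be non-nil. The function name is hypothetical.
func setTenantOverrides(doc map[string]any, tenant string, limits map[string]any) {
	overrides, ok := doc["overrides"].(map[string]any)
	if !ok {
		// Key missing or empty: start a fresh overrides section.
		overrides = map[string]any{}
		doc["overrides"] = overrides
	}
	overrides[tenant] = limits
}
```

Because the function mutates the decoded document in place, re-encoding doc afterwards should preserve any section the API does not know about.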

Comment on lines 18 to 25
var AllowedLimits = []string{
	"max_global_series_per_user",
	"max_global_series_per_metric",
	"ingestion_rate",
	"ingestion_burst_size",
	"ruler_max_rules_per_rule_group",
	"ruler_max_rule_groups_per_tenant",
}
Member

}

// RegisterFlags registers the overrides module flags
func (c *Config) RegisterFlags(f *flag.FlagSet) {
Member

Remove this.

}

// Validate validates the configuration and returns an error if validation fails
func (c *Config) Validate() error {
Member

Remove this

@friedrichg
Member

Signed-off-by: Bogdan Stancu <[email protected]>
@bogdan-st force-pushed the overrides_api branch 3 times, most recently from 86bd67e to 948fd7a, on September 8, 2025 16:21
@araiu araiu left a comment

I like this, long needed feature 🙇
Thank you!

Contributor

@dsabsay dsabsay left a comment

Nice work. I left some questions and suggestions to think about.

I didn't review all of overrides_test.go but reviewed everything else.

require.NoError(t, err)
defer resp.Body.Close()

assert.Equal(t, http.StatusOK, resp.StatusCode)
Contributor

A 4xx of some kind is probably more appropriate in this case.

)

const (
// HTTP status codes
Contributor

Why not use the existing http.StatusOK, etc. constants?


var config runtimeconfig.RuntimeConfigValues
if err := yaml.NewDecoder(reader).Decode(&config); err != nil {
	return []string{}, nil // No allowed limits if config can't be decoded
Contributor

This function returns an error but it is always nil.

I agree that a non-existing or invalid config file should be equivalent to having no allowed limits, but it could be expressed better. A couple of options:

  1. Remove the error from the return, but log each error.
  2. Return the empty slice and an appropriate error. It is then the caller's responsibility to handle the error.
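
A minimal sketch of option 2, under stated assumptions: the type runtimeConfig, its allowed_limits field, and the function name are hypothetical stand-ins, and encoding/json replaces the YAML decoder only to keep the example within the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runtimeConfig is a hypothetical stand-in for the runtime config type;
// the "allowed_limits" field name is an assumption.
type runtimeConfig struct {
	AllowedLimits []string `json:"allowed_limits"`
}

// allowedLimitsFrom returns the configured allowed limits. On a decode
// failure it still returns an empty slice, but also a wrapped error so
// the caller decides whether to log it, ignore it, or fail the request.
func allowedLimitsFrom(data []byte) ([]string, error) {
	var cfg runtimeConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return []string{}, fmt.Errorf("decode runtime config: %w", err)
	}
	if cfg.AllowedLimits == nil {
		return []string{}, nil
	}
	return cfg.AllowedLimits, nil
}
```

With this shape the "no allowed limits" fallback is still available to callers, but a corrupt file is no longer indistinguishable from an empty one.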


// Read overrides from bucket storage
overrides, err := a.getOverridesFromBucket(r.Context(), userID)
if err != nil {
Contributor

This error should be logged.

Also, it might be safer not to return the raw error to the user. A 500 and maybe a general error message would be sufficient. The raw error could expose underlying detail (like storage configurations, etc.) that some providers might consider sensitive. Also, while unlikely, attackers can sometimes learn more than they should from such error messages. This applies anywhere we are returning an error message to the user.
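
To illustrate, here is a rough sketch of logging the raw error server-side while returning only a generic message. It is not the PR's code: fetchOverrides and demoRequest are hypothetical helpers, the failure is simulated, and the standard log package stands in for whatever logger the project uses.

```go
package main

import (
	"errors"
	"log"
	"net/http"
	"net/http/httptest"
)

// fetchOverrides is a hypothetical stand-in for the bucket read; it
// always fails with an error that leaks storage details.
func fetchOverrides(userID string) (map[string]any, error) {
	return nil, errors.New("s3: access denied for bucket internal-cortex-config")
}

// getOverridesHandler logs the raw error for operators and sends the
// client a generic 500 without the underlying details.
func getOverridesHandler(w http.ResponseWriter, r *http.Request) {
	userID := r.Header.Get("X-Scope-OrgID")
	if _, err := fetchOverrides(userID); err != nil {
		log.Printf("failed to read overrides for user %q: %v", userID, err)
		http.Error(w, "internal server error", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// demoRequest exercises the handler in-process and returns the status
// code and response body the client would see.
func demoRequest() (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/api/v1/user-overrides", nil)
	req.Header.Set("X-Scope-OrgID", "tenant-a")
	rec := httptest.NewRecorder()
	getOverridesHandler(rec, req)
	return rec.Code, rec.Body.String()
}
```

The bucket name appears only in the server log; the response body carries nothing beyond the generic message.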

func (a *API) getOverridesFromBucket(ctx context.Context, userID string) (map[string]interface{}, error) {
	reader, err := a.bucketClient.Get(ctx, a.runtimeConfigPath)
	if err != nil {
		return map[string]interface{}{}, nil
Contributor

I don't know how widely this opinion is shared, but I'd prefer this error to be returned. Currently, as a user of getOverridesFromBucket() I can't distinguish between an empty configuration and a situation where there is a configuration but it couldn't be read for some reason (network error, etc.).

var invalidLimits []string

for limitName := range overrides {
	if !IsLimitAllowed(limitName, allowedLimits) {
Contributor


// GetAllowedLimits returns the allowed limits from runtime config
// If no allowed limits are configured, returns empty slice (no limits allowed)
func GetAllowedLimits(allowedLimits []string) []string {
Contributor

Why is this needed?

// Read the runtime config to get hard limits
reader, err := a.bucketClient.Get(context.Background(), a.runtimeConfigPath)
if err != nil {
	// If we can't read the config, skip hard limit validation
Contributor

Right now I think this function is too lenient when the hard limits can't be read. If we corrupt (or don't set) the hard limits, users could give themselves unreasonable limits (e.g. 5 trillion active series).

Two options:

  1. If hard limits can't be read, fail requests to update user overrides (this shouldn't really happen anyway).
  2. Create a default hard limit, maybe configured via CLI, so if a hard limit isn't set for a tenant, it will default to something that will prevent completely unreasonable values.

Also, these scenarios might be good to have in the e2e tests.

// validateHardLimits checks if the provided overrides exceed any hard limits from the runtime config
func (a *API) validateHardLimits(overrides map[string]interface{}, userID string) error {
	// Read the runtime config to get hard limits
	reader, err := a.bucketClient.Get(context.Background(), a.runtimeConfigPath)
Contributor

How do Cortex operators set these hard limits?

Contributor Author

I think the same way overrides work today: manually, in the runtime config file. This just adds the option for users to change a subset of limits, within a set threshold, through the API. Overrides set by Cortex operators stay manual.

Signed-off-by: Bogdan Stancu <[email protected]>
Signed-off-by: Bogdan Stancu <[email protected]>
Signed-off-by: Bogdan Stancu <[email protected]>
Signed-off-by: Bogdan Stancu <[email protected]>