Support for specifying log groups for subscription filters that exceed parameter value limit #219


Closed
pushred opened this issue Jan 30, 2023 · 24 comments

Comments

@pushred

pushred commented Jan 30, 2023

Describe the enhancement:

The project where I am trying to use ESF currently has 176 log groups, with more likely to come. Following the published docs for deploying ESF, all three of the options require the use of CloudFormation parameters to specify the ARNs for these groups. However, with so many groups I quickly exceed the 4,096-byte limit on parameter values.

I'm not sure that there are any other possible ways I could provide the ARNs, given the type limitations of parameter values. Ideally I could pass an actual array/list, but CF does not appear to support this.
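For context, the constraint can be checked mechanically. This is an illustrative sketch (the account ID, region, and log-group names are made up) of why a comma-delimited list of 176 ARNs cannot fit in one parameter value:

```python
# Sketch: estimate the size of a CommaDelimitedList parameter value holding
# N log-group ARNs. All ARNs here are hypothetical.
PARAM_VALUE_LIMIT = 4096  # CloudFormation's per-parameter-value byte limit

def param_value_size(arns):
    """Byte length of the ARNs joined as a single comma-delimited value."""
    return len(",".join(arns).encode("utf-8"))

arns = [
    f"arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/fn-{i}:*"
    for i in range(176)
]
size = param_value_size(arns)
print(size, size > PARAM_VALUE_LIMIT)  # 176 ARNs easily exceed 4,096 bytes
```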

So given that, the enhancement I may need is a means of deploying ESF without using SAR/CloudFormation. The only options that I can see are:

  1. "Sharding" ESF deployments. Multiple deploys that each handle a subset of the ARNs. This would currently require 5 stacks — not ideal.

  2. Building my own CloudFormation stack. Essentially forking this project to bypass parameters, decomposing the stack, and deploying it through some other means. Aside from the upfront cost of doing this, I think it would impose a maintenance burden in staying aligned with upstream. And it would almost certainly be unsupported.

Are there any other options?

Is option 1 the only viable option while AWS does not offer a solution?
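The arithmetic behind the "5 stacks" figure in option 1 can be sketched as a simple batching helper (the batch size of 40 is an assumption for illustration):

```python
# Sketch: shard a list of log-group ARNs into fixed-size batches,
# one batch per ESF deployment. The batch size of 40 is arbitrary.
def shard(arns, batch_size=40):
    return [arns[i:i + batch_size] for i in range(0, len(arns), batch_size)]

batches = shard([f"group-{i}" for i in range(176)])
print(len(batches))  # 176 groups at 40 per stack lands on 5 stacks
```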


Describe a specific use case for the enhancement or feature:

Projects with more than ~30 log groups.

@aspacca
Contributor

aspacca commented Jan 31, 2023

hello @pushred

So given that, the enhancement I may need is a means of deploying ESF without using SAR/CloudFormation

we created a script for this purpose; it is already merged in the main branch: https://github.com/elastic/elastic-serverless-forwarder/blob/main/publish_lambda.sh

it requires a yaml file for configuring the ARNs and IDs, similar to the ones you provide as parameters when publishing from SAR. It will be part of the lambda-v1.6.0 release, and you'll find the relevant documentation at https://www.elastic.co/guide/en/observability/current/aws-elastic-serverless-forwarder.html once we release

@aspacca
Contributor

aspacca commented Jan 31, 2023

  1. "Sharding" ESF deployments. Multiple deploys that each handle a subset of the ARNs. This would currently require 5 stacks — not ideal.

this is indeed not possible when deploying from SAR: our SAR templates include a macro that can be referenced only by its literal name. Basically, we cannot create multiple macros with dynamic names, one per deployment, since the functions required to compose a dynamic name cannot be used when referencing a macro. Trying to create the same macro after the first deployment will make the following deployments fail.
It's a CloudFormation limit we cannot overcome

@pushred
Author

pushred commented Jan 31, 2023

@aspacca thank you for such a quick reply, and good timing with the addition of this script! I managed to run it successfully with a few tweaks:

  • Updated the script to call python3 instead of python, as my shell alias for python doesn't work inside the script. This may not be an issue in our CI environment, but I haven't run Python there before so I'm not sure. Either way I can modify the script, but perhaps another parameter is warranted for this.

  • Had to run with AWS_PAGER="" to suppress paged output from the aws s3api command — perhaps this is something that should be configured at the environment level but a flag on that command could avoid that.

I also had to have Docker running for SAM.

With all that out of the way I successfully deployed a stack with a single log group. However once I added all 176 groups I am now encountering this error:

Successfully packaged artifacts and wrote output template to file /tmp/publish.KxVn82uCKi/.aws-sam/build/publish/packaged.yaml.
Execute the following command to deploy the packaged template
sam deploy --template-file /tmp/publish.KxVn82uCKi/.aws-sam/build/publish/packaged.yaml --stack-name <YOUR STACK NAME>

Error: Templates with a size greater than 51,200 bytes must be deployed via an S3 Bucket. Please add the --s3-bucket parameter to your command. The local template will be copied to that S3 bucket and then deployed.

My publish config file is 19k.

When this error is raised the temporary file no longer exists, so I'm not able to simply resume the process. The script may need to be revised to use the S3 bucket deploy method. I'll look into making that modification for my purposes, but I'm curious whether this would also happen with your version.
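The 51,200-byte figure is CloudFormation's limit for a template body passed inline with the request; bigger templates must be referenced from S3, which is what `sam deploy --s3-bucket` arranges. A minimal sketch of the check:

```python
# CloudFormation's limit for a template body passed inline with the request;
# larger templates must be uploaded to S3 first (e.g. `sam deploy --s3-bucket`).
INLINE_TEMPLATE_LIMIT = 51_200

def needs_s3_upload(template_size_bytes):
    """True when the packaged template must go through an S3 bucket."""
    return template_size_bytes > INLINE_TEMPLATE_LIMIT

print(needs_s3_upload(60_000))  # True for the ~60k template in this thread
```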

@pushred
Author

pushred commented Jan 31, 2023

There actually appears to be a bug in the script that is at least contributing to the file size issue. I commented out the trap call to examine the build artifacts and found that I have a 1.9 MB publish.yaml. Inspecting the file, I found that for every log group I've specified there are multiple PolicyDocument Statement instances, including a logs:DescribeLogStreams statement that includes all groups looped through up to that point, e.g. given 3 ARNs I get output like:

- Effect: Allow
  Action: logs:DescribeLogGroups
  Resource:
  - arn:aws:logs:us-east-1:123456789101:log-group:*:*
- Effect: Allow
  Action: logs:DescribeLogStreams
  Resource:
  - arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-a:log-stream:*
- Effect: Allow
  Action: logs:DescribeLogGroups
  Resource:
  - arn:aws:logs:us-east-1:123456789101:log-group:*:*
- Effect: Allow
  Action: logs:DescribeLogStreams
  Resource:
  - arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-a:log-stream:*
  - arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-b:log-stream:*
- Effect: Allow
  Action: logs:DescribeLogGroups
  Resource:
  - arn:aws:logs:us-east-1:123456789101:log-group:*:*
- Effect: Allow
  Action: logs:DescribeLogStreams
  Resource:
  - arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-a:log-stream:*
  - arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-b:log-stream:*
  - arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-c:log-stream:*

If I remove all of those duplicate instances I still exceed the template size limit at 60k, but it is far smaller.
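For illustration only (the actual template is generated by the publish script): the duplicated output above is consistent with a statement being appended on every loop iteration, whereas the intended result is a single statement whose Resource list covers all log groups. A sketch of that shape, with hypothetical ARNs:

```python
# Sketch: build ONE logs:DescribeLogStreams statement covering all log
# groups, instead of appending a growing statement on every loop iteration.
def describe_log_streams_statement(log_group_arns):
    return {
        "Effect": "Allow",
        "Action": "logs:DescribeLogStreams",
        "Resource": [
            # e.g. arn:...:log-group:/aws/lambda/foo -> ...:log-stream:*
            f"{arn}:log-stream:*" for arn in log_group_arns
        ],
    }

stmt = describe_log_streams_statement([
    "arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-a",
    "arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/log-group-b",
])
print(len(stmt["Resource"]))  # one statement, one resource per group
```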

@pushred
Author

pushred commented Jan 31, 2023

Well, I worked around that, but hit another AWS limit that will probably force a much bigger workaround. With this many log groups my ElasticServerlessForwarderPolicy is ~22k, but this exceeds the maximum policy size of 10,240 bytes. To work around that I tried specifying a wildcard to match based on a log group name prefix, e.g.

Effect: Allow
Action: logs:DescribeLogStreams
Resource:
- arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/naming-prefix-*:log-stream:*

It's unclear if that is supported, but when attempting to deploy I ran into either a side effect of that or another limit: failures to create AWS::Lambda::Permission with this error:

The final policy size (20702) is bigger than the limit (20480). (Service: AWSLambdaInternal; Status Code: 400; Error Code: PolicyLengthExceededException;

This appears to be due to the number of CloudWatch Logs events, each of which adds a resource-based policy statement to the function's permissions. Those combine into a single policy document that reaches this limit at ~42 functions for me. I'm not sure there's any possible workaround: we of course need the triggers, and CloudFormation generates these policy statements, so I don't believe I have control over their content, nor whether it is even possible to specify partial ARNs with wildcards there, as I did for the logs:DescribeLogStreams permissions.

Are there any options remaining other than deploying this fully outside of CloudFormation? Or a major refactor of our project to bring our function count way down?
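For reference, two different limits are in play here: roughly 10,240 bytes for an inline IAM role policy (the ElasticServerlessForwarderPolicy) and 20,480 bytes for a Lambda function's resource-based policy (the AWS::Lambda::Permission statements). A rough pre-deploy estimate can be made by serializing the document compactly; treat the numbers as approximations, since IAM's counting rules differ slightly:

```python
import json

# Rough sketch: estimate a policy document's size against the two limits
# hit in this thread. All ARNs below are hypothetical.
ROLE_INLINE_POLICY_LIMIT = 10_240      # inline policy on an IAM role
LAMBDA_RESOURCE_POLICY_LIMIT = 20_480  # function resource-based policy

def policy_size(policy_document):
    """Approximate size as compact JSON (no whitespace)."""
    return len(json.dumps(policy_document, separators=(",", ":")))

doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "logs:DescribeLogStreams",
        "Resource": [
            f"arn:aws:logs:us-east-1:123456789101:log-group:/aws/lambda/fn-{i}:log-stream:*"
            for i in range(176)
        ],
    }],
}
print(policy_size(doc) > ROLE_INLINE_POLICY_LIMIT)  # True with 176 groups
```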

@aspacca
Contributor

aspacca commented Feb 1, 2023

hi @pushred , thanks for your feedback

There actually appears to be a bug in the script that is at least contributing to the file size issue

thanks for reporting it: I missed that. I will push a fix as soon as possible
not only ARNs from the cloudwatch-logs might be affected

The script may need to be revised to use the S3 bucket deploy method

thanks again, I will think about making it the default behaviour or an option

The final policy size (20702) is bigger than the limit (20480). (Service: AWSLambdaInternal;

while we can optimise the size of the policy to make it the smallest possible for the number of given ARNs, in the end the final size can still hit the AWS limit. there's nothing we (or you) can do about this, except deploying multiple lambdas so that the policy is split into smaller ones

@aspacca
Contributor

aspacca commented Feb 1, 2023

To work around that I tried specifying a wildcard to match based on a log group name prefix, e.g.

if you provide a "glob" value directly as an ARN in the publish config yaml, while this might produce (just an assumption, I didn't check) a smaller policy that works properly, it will fail to attach the cloudwatch logs log groups as triggers of the forwarder

@aspacca
Contributor

aspacca commented Feb 1, 2023

thanks for reporting it: I missed that. I will push a fix as soon as possible
not only ARNs from the cloudwatch-logs might be affected

I spotted the bug in an extra level of indentation

@aspacca
Contributor

aspacca commented Feb 1, 2023

thanks again, I will think about making it the default behaviour or an option

I've decided to make it the default behaviour: could you please test the script from the branch with the fixes?

@aspacca
Contributor

aspacca commented Feb 1, 2023

  • Updated the script to call python3 instead of python, as my shell alias for python doesn't work inside the script. This may not be an issue in our CI environment, but I haven't run Python there before so I'm not sure. Either way I can modify the script, but perhaps another parameter is warranted for this.

not changing this for the moment: we'll most likely provide instructions about the fact that python is used, and suggest to run the script in a venv, so that the binary will be available

but thanks anyway for raising the issue

@pushred
Author

pushred commented Feb 1, 2023

Confirmed that with the latest script the generated publish.yaml PolicyDocument is as expected and that I'm able to deploy with 40 log groups. Thanks!

if you provide a "glob" value directly as an ARN in the publish config yaml, while this might produce (just an assumption, I didn't check) a smaller policy that works properly, it will fail to attach the cloudwatch logs log groups as triggers of the forwarder

Good to know.. there is no escaping the limits around this.

suggest to run the script in a venv, so that the binary will be available

Ah right, I haven't worked with Python lately but we do use venv in all of our Python projects so that should work well.

while we can optimise the size of the policy to make it the smallest possible for the number of given ARNs, in the end the final size can still hit the AWS limit. there's nothing we (or you) can do about this, except deploying multiple lambdas so that the policy is split into smaller ones

So based on what you mentioned earlier re: multiple deployments, my understanding is that multiple CloudFormation ESF stacks isn't possible but there could be multiple Lambdas within the deployed stack over which the log groups can be spread. Given that ESF controls the stack, is this an enhancement that could be made anytime soon?

Beyond the current project I am working on I'm also wondering about other Lambda-based services where we would like to ship logs. If everything must be routed through the same ESF stack we would need this capability regardless of how well we could decompose the project or otherwise reduce the # of functions. Multiple AWS accounts could be an option but I don't believe a viable one in our org.

@pushred
Author

pushred commented Feb 1, 2023

Thinking about this more, the prospect of having dozens of Lambdas and queues to handle our quantity of log groups across multiple environments is daunting.

Our use case is probably better suited to Elastic Agent, which I looked at initially but somehow overlooked the Kibana Fleet API for enabling log group integrations programmatically. Will be pursuing that for now.

@aspacca
Contributor

aspacca commented Feb 2, 2023

So based on what you mentioned earlier re: multiple deployments, my understanding is that multiple CloudFormation ESF stacks isn't possible but there could be multiple Lambdas within the deployed stack over which the log groups can be spread. Given that ESF controls the stack, is this an enhancement that could be made anytime soon?

please, let me try to rephrase exactly what your needs are:

  1. you would like to be able to manage with Infrastructure as Code the "stack" required to ingest in an Elasticsearch cluster the events coming from a high number of cloudwatch logs log groups
  2. you prefer to be able to have a single "definition" in the Infrastructure as Code tool that you need to use according to the solution. I mean you prefer to have a single CloudFormation template or terraform definition where to specify multiple Lambdas to deploy, vs having multiple templates/definition for each Lambda.
  3. you care for having the smallest number of infra components in the "stack" that's required for achieving 1.

Did I summarise correctly?

I assume the features provided by each of Elastic Agent and the Elastic Serverless Forwarder are enough for you, and the choice is based only on the three points above.

how do Elastic Agent and the Elastic Serverless Forwarder compare on each of the points?

  1. Elastic Agent: You have two options to run Elastic Agent, either Fleet Managed or Standalone (you can check the different capabilities between the two modes). Standalone mode is ready for an IaC solution, but you lose the capabilities provided by the Fleet Managed option. To keep both Fleet Managed capabilities and IaC, you have to build your own IaC tooling to interact with the Fleet API.
     Elastic Serverless Forwarder: Currently you have two options to deploy the forwarder, either from AWS SAR or directly with the publishing script. The difference between the two is the granularity of configuration you can set for a single ingestion source (for the specific case of cloudwatch logs log groups this does not apply at the moment, since what can be configured is the same for both; this might change if we support configuring the FilterPattern in the publishing script). Another difference is that, given AWS limits on the "information" you can provide in a single CloudFormation template, the publishing script can be used as a workaround to bypass those limits. The SAR option is available through the AWS Console, CloudFormation and Terraform; the publishing script has to be integrated, according to its requirements, into your existing IaC solution. In general IaC is a first-class citizen for us in providing deployment solutions for the forwarder, and our goal is to improve the IaC experience, both by providing a solution for the most used IaC tools and by providing as many capabilities as we can for every IaC tool we will support.

  2. Elastic Agent: In the case of a Standalone Elastic Agent this is probably achievable, depending on the specific IaC solution you adopt, even if it requires managing more VMs due to scaling factors. In the case of a Fleet Managed Elastic Agent you'll have to build your own IaC tooling, and this will be achievable according to how you build it.
     Elastic Serverless Forwarder: This is true, currently, if you don't hit the limits set by AWS. If you do hit those limits, the publishing script can be used as a workaround to bypass them, but this means having multiple publish config yaml files. Since you have to integrate the execution of the publishing script into your IaC tool, depending on the capabilities of that tool the publish config yaml files could be generated on the fly from a single "definition". Referring to our effort to make IaC a first-class citizen, we are already investigating alternative IaC solutions to deal automatically with the AWS limits without requiring users to manage this scenario. No ETA or expectation about support for a specific IaC tool can be given at the moment.

  3. Elastic Agent: Depending on scaling factors related to the number of cloudwatch logs log groups, you need one or more VMs with Elastic Agent running on them. You are not forced to manage multiple VMs just to comply with limits on provisioning too much "information", unless required by scaling factors.
     Elastic Serverless Forwarder: Depending on scaling factors related to the number of cloudwatch logs log groups, you need one or more Lambdas and two SQS queues for each Lambda (the continuing queue and the replay queue). You might be forced to have multiple Lambdas, each with its two SQS queues, given the limits AWS sets on the "information" you can provide to a single Lambda while provisioning it.

Finally I want to mention an aspect to consider in relation to scaling:

  • AWS limits the number of concurrent executions for Lambdas in a given region, for a single account, to 1000. You can ask AWS to increase this limit; I don't know if it can be increased indefinitely. There are other non-increasable limits related to Lambda, like the size of a synchronous payload, that might impact the infinite-scaling possibility of a Lambda.
  • Something similar applies to EC2 Spot/On-Demand instances (https://aws.amazon.com/ec2/faqs/#EC2_On-Demand_Instance_limits). In terms of compute-power scaling there are no non-increasable quotas as far as I can see. Still, I don't know if the existing limits can be increased indefinitely.

This is my knowledge, just looking at the information made available by AWS: if you are concerned about hitting any scaling limits, you should contact AWS in order to properly address how a solution based on Lambda and one based on EC2 instances compare.

@pushred
Author

pushred commented Feb 2, 2023

you would like to be able to manage with Infrastructure as Code the "stack" required to ingest in an Elasticsearch cluster the events coming from a high number of cloudwatch logs log groups

It's a preference in our org to do IaC via Terraform, but some exceptions have been made for projects built with the Serverless Framework, which is also built on CloudFormation. So by extension we expected ESF to be granted a similar exception.

But we also have instances of Fleet-managed Elastic Agent already in use. The agents thus far are containerized and run on the same server as the services that are sending telemetry.

you prefer to be able to have a single "definition" in the Infrastructure as Code tool that you need to use according to the solution. I mean you prefer to have a single CloudFormation template or terraform definition where to specify multiple Lambdas to deploy, vs having multiple templates/definition for each Lambda.

This would be ideal, as we otherwise have to handle the stack splitting somehow.

you care for having the smallest number of infra components in the "stack" that's required for achieving 1.

Correct, but we understand the hard AWS constraints that are preventing this from happening.


Thanks for your comparison of Elastic Agent vs. ESF and the other notes re: scaling. This has all been helpful today and spawned some lengthier discussion of our requirements. I confess that I jumped into trying ESF without much review, following a colleague's earlier effort that identified it as a solution, after a separate earlier POC using Functionbeat. After further reviewing our needs and the context of everything else running in our account, we've concluded that getting logs from CloudWatch's API, as the Elastic Agent/Filebeat integration does, won't be feasible either, due to hard limits on FilterLogEvents API requests.

I think we may have been misguided in gravitating toward Functionbeat and ESF as a solution for ingesting Lambda logs. This was partly due to their familiar deployment model; we associated the other solutions, e.g. Filebeat, with being more suited to our EC2 instances. I see the value of ESF in simpler scenarios as something that is faster to deploy and cheaper to operate.

Longer term, I believe what we actually need is something like the Elastic APM AWS Lambda extension, which would bypass the CloudWatch layer altogether. The team behind that did add some support for collecting logs last year, but it is still preliminary. The recently released AWS Lambda Telemetry API seems like it will further the possibilities, and many of Elastic's competitors already have extension-based solutions for collecting logs using an earlier iteration of that API.

Unfortunately the project I am working on has a timeline that requires observability to be in place before such a solution will exist. So we are going to pursue building our own extension in order to write logs to S3 and then ingest them using some other Elastic solution.

Thank you for all of your assistance and time!

@aspacca
Contributor

aspacca commented Feb 3, 2023

So we are going to pursue building our own extension in order to write logs to S3 and then ingest them using some other Elastic solution.

once you have the logs in S3, you can still use ESF for the ingestion (see this blog post; you can ignore the part about the integrations if you are ingesting your custom application logs).
the same considerations apply as when ingesting from cloudwatch logs

as far as I understand you already have multiple Lambdas in your stack, and you saw Functionbeat/ESF as a natural solution since they are Lambdas as well: that's true insofar as you will use something based on a technology you already know, with all the benefits that come with it

the Elastic APM AWS Lambda extension added support for forwarding logs in v1.2.0: I think that would be the best solution for you

ESF itself will switch to the extension from the Elastic APM Python Agent as soon as this issue is addressed

@rpateloak9

rpateloak9 commented Feb 3, 2023

@aspacca any timeline for an APM AWS Lambda extension for functions running on the .NET runtime? The reason I ask is that we have the same issue as @pushred and are looking for an elegant solution to push Lambda logs to Elastic Cloud. Currently we have a home-grown solution that splits Functionbeat into various Lambdas, dividing the log groups up amongst them. However, we would like to truly shift left with a Lambda extension and have this be part of our Terraform provisioning, to allow devs to toggle whether they want logging for their Lambdas.

@pushred
Author

pushred commented Feb 3, 2023

@aspacca ah right! I forgot about S3 as a possible input. We plan to consolidate all of the logs in a single bucket so that would avoid these issues we're having around limits.

We did see the addition of log forwarding in the extension in v1.2.0, however it seems that the logs are only available in the APM UI, and it doesn't appear that we can specify the data stream. But those are assumptions based on what we saw in GitHub and the limited docs.

@pushred
Author

pushred commented Feb 4, 2023

@aspacca in further research I reviewed some of our org's other Serverless projects, which currently send logs to Sumo Logic. I found that its own Lambda solution for log collection somehow bypasses the need to create a trigger for each application Lambda function's log group. Instead their shipper Lambda is triggered by a subscription filter on their DLQ Lambda's log group, and all of the Lambda functions have a subscription filter pointing to it as a destination. When I view the shipper function, however, I do not see these filters as triggers, nor are there any resource policy statements for each log group. I'm actually not sure of the difference between a trigger and a subscription filter. Their stack has a single subscription filter that seems to be shared by all functions, and permissions are similarly consolidated.

It also addresses the problem of subscribing to log groups on an ongoing basis. With ESF I intended to handle this by re-deploying ESF with a list of the current log groups at deploy time. But Sumo's Log Group Connector handles this with a Lambda function that is triggered on CreateLogGroup events and configured with a log group name prefix to match groups to subscribe. It's a nice solution because we haven't had to touch it even as projects and their deployments have changed.

I'm curious whether you're aware of their solution and why ESF has taken the approach it has in contrast?

We're committed to using Elastic for all of the benefit of Kibana, especially as it relates to structured logging. But I'm expecting some questions around why Elastic requires a more complicated solution than what we already have in place for Sumo.

@aspacca
Contributor

aspacca commented Feb 5, 2023

I'm curious whether you're aware of their solution and why ESF has taken the approach it has in contrast?

a subscription filter is the way to trigger a lambda from an event in a cloudwatch logs log group, as opposed to an event source mapping, which is used for sqs and kinesis.

the Sumo Logic shipper lambda has a subscription filter, by default, on the cloudwatch logs log group that's created when deploying their cloudformation template; you then have to send all your logs to that log group. this is something we explicitly decided to avoid: we'd like users not to have to change their existing inputs (be it cloudwatch logs, kinesis data stream, etc).

we ask for the ARNs of the inputs and we manage all the required permissions and settings at deployment time. compared to the Sumo Logic solution: if you have existing cloudwatch logs log groups, they require you to manually add all the subscription filters and permissions on your own.

the Sumo Log Group Connector works only for newly created cloudwatch logs log groups, but I can see its value for dealing automatically with new log groups without the need to update the lambda deployment.
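To illustrate the distinction being drawn here, these are the two per-log-group API calls that together make a log group appear as a Lambda "trigger": a subscription filter on the group, plus a permission statement on the function's resource-based policy (the statements that accumulate toward the 20,480-byte limit). The operation names are the real CloudWatch Logs/Lambda ones; everything else is a hypothetical sketch, shown as kwargs dicts rather than live boto3 calls:

```python
# Sketch of the two per-log-group API calls behind a "trigger" on a Lambda:
# logs:PutSubscriptionFilter plus lambda:AddPermission. All names and ARNs
# below are hypothetical.
def subscription_calls(log_group_name, function_arn, statement_id):
    put_filter = {  # logs_client.put_subscription_filter(**put_filter)
        "logGroupName": log_group_name,
        "filterName": "esf-forwarder",
        "filterPattern": "",          # empty pattern forwards every event
        "destinationArn": function_arn,
    }
    add_permission = {  # lambda_client.add_permission(**add_permission)
        "FunctionName": function_arn,
        "StatementId": statement_id,  # each statement grows the resource policy
        "Action": "lambda:InvokeFunction",
        "Principal": "logs.amazonaws.com",
    }
    return put_filter, add_permission

pf, ap = subscription_calls(
    "/aws/lambda/app-1",
    "arn:aws:lambda:us-east-1:123456789101:function:esf",
    "app-1-invoke",
)
```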

beware that you can achieve something similar to a "single cloudwatch logs log group with multiple forwarder destinations" with the Elastic Serverless Forwarder: it's just a matter of sending the logs of each of your Lambda apps to a single log group, but into a different, persistent log stream.
you can then configure the Forwarder, for each of the log stream ARNs, to ingest into a different data stream:

  - type: cloudwatch-logs
    id: "arn:aws:logs:region:account-id:log-group:single-log-group-for-esf:log-stream:log-stream-name-app1"
    outputs:
      - type: elasticsearch
        args:
          elasticsearch_url: "..."
          es_datastream_name: "logs-app1-default"
          tags:
            - "app1"
  - type: cloudwatch-logs
    id: "arn:aws:logs:region:account-id:log-group:single-log-group-for-esf:log-stream:log-stream-name-app2"
    outputs:
      - type: elasticsearch
        args:
          elasticsearch_url: "..."
          es_datastream_name: "logs-app2-default"
          tags:
            - "app2"

I forgot about S3 as a possible input. We plan to consolidate all of the logs in a single bucket so that would avoid these issues we're having around limits.

beware that if you want to ingest each app's logs into a different data stream, you have to create multiple s3 notifications to different sqs queues, one for each set of logs (identified by a prefix in the bucket or something similar), in order to be able to specify a different target, similar to the cloudwatch logs example above
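A sketch of that per-prefix routing, expressed as the S3 notification configuration it would produce; the bucket prefixes and queue ARNs are hypothetical:

```python
# Sketch: build an S3 bucket notification configuration that routes objects
# under each prefix to its own SQS queue (prefixes and queue ARNs hypothetical).
def notification_config(prefix_to_queue_arn):
    return {
        "QueueConfigurations": [
            {
                "QueueArn": queue_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
            for prefix, queue_arn in prefix_to_queue_arn.items()
        ]
    }

cfg = notification_config({
    "app1/": "arn:aws:sqs:us-east-1:123456789101:esf-app1",
    "app2/": "arn:aws:sqs:us-east-1:123456789101:esf-app2",
})
# usable as: s3.put_bucket_notification_configuration(
#     Bucket="my-logs", NotificationConfiguration=cfg)
```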

as for the DLQ, you can check on the Forwarder documentation the details about error handling: https://www.elastic.co/guide/en/observability/current/aws-serverless-troubleshooting.html#_error_handling

@aspacca
Contributor

aspacca commented Feb 15, 2023

@pushred can I close this issue or do you need more information?

@pushred
Author

pushred commented Feb 15, 2023

@aspacca Sure, you can close. We are currently testing ESF with Kinesis as an input; so far that is addressing our issue here. Thanks!
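For readers following along: a Kinesis input in the forwarder config follows the same shape as the cloudwatch-logs example earlier in this thread. This is only a sketch (the type string and ARN should be checked against the released ESF docs):

```yaml
  - type: kinesis-data-stream
    id: "arn:aws:kinesis:us-east-1:123456789101:stream/lambda-logs"
    outputs:
      - type: elasticsearch
        args:
          elasticsearch_url: "..."
          es_datastream_name: "logs-generic-default"
```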

@aspacca
Contributor

aspacca commented Mar 16, 2023

@pushred I know you moved from Cloudwatch logs input to Kinesis input due to this issue

#276 removed the calls to logs:DescribeLogGroups and logs:DescribeLogStreams
#283 removed adding the policies for those calls

Now having any number of Cloudwatch logs inputs does not produce any policy at all, so no limit can be exceeded because of that

this might let you go back to Cloudwatch logs input and avoid the extra complexity you mentioned in #258

we'll release a new version soon

@pushred
Author

pushred commented Mar 16, 2023

Great news! Thanks for the update. We haven't gone to production yet so may be able to give this a shot.

@rpateloak9

I just tried the latest with 62 log groups and ran into an issue:

```
CREATE_FAILED AWS::Lambda::Permission ApplicationElasticServerlessForwarderCloudWatchLogsEventf481e407f4Permission
The final policy size (20667) is bigger than the limit (20480). (Service: AWSLambdaInternal; Status Code: 400; Error Code: PolicyLengthExceededException; Request ID: 9febcc01-e0b9-4426-a240-1e6e1741d79c; Proxy: null)
```
