Fabrice is a cloud architect and software developer with more than 20 years of experience at Cisco, Samsung, Philips, Alcatel, and Sagem.
Elasticsearch is a powerful software solution designed to quickly search through large amounts of data. Combined with Logstash and Kibana, it forms the informally named “ELK stack”, and is often used to collect, temporarily store, analyze, and visualize log data. A few other pieces of software are usually needed, such as Filebeat to send the logs from the server to Logstash, and Elastalert to generate alerts based on the results of analyses run on the data stored in Elasticsearch.
My experience with using ELK for managing logs is quite mixed. On the one hand, it’s very powerful and the range of its capabilities is quite impressive. On the other hand, it’s tricky to set up and can be a headache to maintain.
The fact is that Elasticsearch is very good in general and can be used in a wide variety of scenarios; it can even be used as a search engine! Since it is not specialized for managing log data, it requires more configuration work to customize its behavior for the specific needs of managing such data.
Setting up an ELK cluster is rather tricky and required me to play around with a number of parameters in order to finally get it up and running. Then came the work of configuring it. In my case, I had five different pieces of software to configure (Filebeat, Logstash, Elasticsearch, Kibana, and Elastalert). This can be quite a tedious job, as I had to read through the documentation and debug whichever element of the chain wasn’t talking to the next one. Even after you finally get your cluster up and running, you still need to perform routine maintenance operations on it: patching, upgrading the OS packages, checking CPU, RAM, and disk usage, making minor adjustments as required, and so on.
My whole ELK stack stopped working after a Logstash update. Upon closer examination, it turned out that, for some reason, the ELK developers had decided to change a keyword in their config file and pluralize it. That was the last straw, and I decided to look for a better solution (at least, a better solution for my particular needs).
I wanted to store logs generated by Apache and various PHP and Node apps, and to parse them to find patterns indicative of bugs in the software. The solution I found was the following: store the logs in CloudWatch Logs, use CloudWatch subscription filters to trigger Lambda functions on the log entries of interest, and have those Lambda functions post alert messages to a Slack channel.
And, at a high level, that’s it! A 100% serverless solution that works fine without any maintenance and scales well without any extra work. The advantages of such serverless solutions over a cluster of servers are numerous:
So now let’s get into the details! Let’s explore what a CloudFormation template would look like for such a setup, complete with Slack webhooks for alerting engineers. We need to configure the Slack side first, so let’s dive into it.
AWSTemplateFormatVersion: 2010-09-09
Description: Setup log processing
Parameters:
  SlackWebhookHost:
    Type: String
    Description: Host name for Slack web hooks
    Default: hooks.slack.com
  SlackWebhookPath:
    Type: String
    Description: Path part of the Slack webhook URL
    Default: /services/YOUR/SLACK/WEBHOOK
You will need to set up your Slack workspace for this; check out the WebHooks for Slack guide for additional info. Once you have created your Slack app and configured an incoming hook, the hook URL will become a parameter of your CloudFormation stack.
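Posting to the hook only takes a few lines of Python’s standard library, which is essentially what the Lambda functions below do. Here is a minimal sketch (the helper names are mine, and the default path is the placeholder from the template parameters, not a real webhook):

```python
import json
from http.client import HTTPSConnection

def build_slack_payload(text):
    """Build the JSON body that Slack incoming webhooks expect."""
    return json.dumps({'text': text})

def post_to_slack(text, host="hooks.slack.com", path="/services/YOUR/SLACK/WEBHOOK"):
    """Post a message to a Slack incoming webhook.

    Replace the default path with the URL Slack generated for your app.
    """
    cnx = HTTPSConnection(host, timeout=5)
    cnx.request("POST", path, build_slack_payload(text))
    # Read the response; if the connection is closed too quickly,
    # Slack might not post the message.
    resp = cnx.getresponse()
    resp.read()
    return resp.status
```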
Resources:
  ApacheAccessLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100 # Or whatever is good for you
  ApacheErrorLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 100 # Or whatever is good for you
Here we created two log groups: one for the Apache access logs, the other for the Apache error logs.
I didn’t configure any lifecycle mechanism for the log data, because that’s beyond the scope of this article. In practice, you would probably want to shorten the retention window and design S3 lifecycle policies to move the data to Glacier after a certain amount of time.
Now let’s implement the Lambda function that will process the Apache access logs.
BasicLambdaExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
Here we created an IAM role that will be attached to the Lambda functions, allowing them to perform their duties. In effect, the AWSLambdaBasicExecutionRole is (despite its name) an IAM managed policy provided by AWS. It just allows the Lambda function to create a log group and log streams within that group, and then to send its own logs to CloudWatch Logs.
ProcessApacheAccessLogFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt BasicLambdaExecutionRole.Arn
    Runtime: python3.7
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
        SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
    Code:
      ZipFile: |
        import base64
        import gzip
        import json
        import os
        from http.client import HTTPSConnection

        def handler(event, context):
            tmp = event['awslogs']['data']
            # `awslogs.data` is base64-encoded gzip'ed JSON
            tmp = base64.b64decode(tmp)
            tmp = gzip.decompress(tmp)
            tmp = json.loads(tmp)
            events = tmp['logEvents']
            for event in events:
                raw_log = event['message']
                log = json.loads(raw_log)
                if log['status'][0] == "5":
                    # This is a 5XX status code
                    print(f"Received an Apache access log with a 5XX status code: {raw_log}")
                    slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                    slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                    print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                    cnx = HTTPSConnection(slack_host, timeout=5)
                    cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                    # It's important to read the response; if the cnx is closed too quickly, Slack might not post the msg
                    resp = cnx.getresponse()
                    resp_content = resp.read()
                    resp_code = resp.status
                    assert resp_code == 200
So here we are defining a Lambda function to process Apache access logs. Please note that I am not using the common log format, which is the default on Apache. I configured the access log format like this (you will notice that it essentially produces logs formatted as JSON, which makes processing further down the line a lot easier):
LogFormat "{\"vhost\": \"%v:%p\", \"client\": \"%a\", \"user\": \"%u\", \"timestamp\": \"%{%Y-%m-%dT%H:%M:%S}t\", \"request\": \"%r\", \"status\": \"%>s\", \"size\": \"%O\", \"referer\": \"%{Referer}i\", \"useragent\": \"%{User-Agent}i\"}" json
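To see what the Lambda function has to work with, here is a quick sketch of parsing one such JSON-formatted access log line (the sample values are made up):

```python
import json

# A made-up log line in the JSON access log format defined above.
sample = ('{"vhost": "example.com:443", "client": "203.0.113.7", "user": "-", '
          '"timestamp": "2020-01-15T10:23:45", "request": "GET /api/items HTTP/1.1", '
          '"status": "502", "size": "512", "referer": "-", "useragent": "curl/7.64.1"}')

log = json.loads(sample)
# The Lambda function only inspects the first character of the status field.
is_server_error = log['status'][0] == "5"
print(is_server_error)  # True for this 502 response
```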
This Lambda function is written in Python 3. It takes the log lines sent from CloudWatch and can search for patterns. In the example above, it simply detects HTTP requests that resulted in a 5XX status code and posts a message to a Slack channel.
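Since CloudWatch Logs delivers its payload base64-encoded and gzipped, it is handy to be able to build test events when developing such a function locally. A small sketch (the helper name is mine, not part of any AWS SDK):

```python
import base64
import gzip
import json

def make_cloudwatch_event(log_lines):
    """Build a test event shaped like the one CloudWatch Logs sends to
    Lambda: gzipped JSON with a `logEvents` list, base64-encoded."""
    payload = {'logEvents': [{'message': line} for line in log_lines]}
    data = base64.b64encode(gzip.compress(json.dumps(payload).encode('utf-8')))
    return {'awslogs': {'data': data.decode('ascii')}}
```

Calling the handler with such an event then exercises the exact decoding path shown in the function above.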
You can do anything you like in terms of pattern detection, and the fact that it’s a true programming language (Python), as opposed to just regex patterns in a Logstash or Elastalert config file, gives you a lot of opportunities to implement complex pattern recognition.
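For instance, instead of alerting on every single 5XX, you could aggregate over a whole batch of decoded events. A hypothetical sketch (the function name and threshold are mine, not from the article's setup):

```python
from collections import Counter

def clients_with_error_bursts(log_entries, threshold=5):
    """Return the clients that produced more than `threshold` 5XX
    responses within one batch of decoded log events.

    Each entry is expected to be a dict parsed from the JSON access
    log format used in this article."""
    errors = Counter(
        entry['client']
        for entry in log_entries
        if entry['status'].startswith('5')
    )
    return sorted(client for client, count in errors.items() if count > threshold)
```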
A quick note about revision control: I found that inlining the code in the CloudFormation template is quite acceptable and convenient for small utility Lambda functions such as this one. Of course, for a large project involving many Lambda functions and layers, this would most probably be inconvenient and you would need to use SAM.
ApacheAccessLogFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ProcessApacheAccessLogFunction
    Action: lambda:InvokeFunction
    Principal: logs.amazonaws.com
    SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*
The above gives permission to CloudWatch Logs to call your Lambda function. One word of caution: I found that using the SourceAccount property can lead to conflicts with the SourceArn. Generally speaking, I would advise against including it when the service calling the Lambda function is in the same AWS account; the SourceArn will forbid other accounts from calling the Lambda function anyway.
ApacheAccessLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn: ApacheAccessLogFunctionPermission
  Properties:
    LogGroupName: !Ref ApacheAccessLogGroup
    DestinationArn: !GetAtt ProcessApacheAccessLogFunction.Arn
    FilterPattern: "{$.status = 5*}"
The subscription filter resource is the link between CloudWatch Logs and Lambda. Here, logs sent to the ApacheAccessLogGroup will be forwarded to the Lambda function we defined above, but only those logs that pass the filter pattern. Here, the filter pattern expects some JSON as input (filter patterns start with ‘{’ and end with ‘}’), and will match the log entry only if it has a field status that starts with “5”.
This means we invoke the Lambda function only when the HTTP status code returned by Apache is a 500-level code, which usually means something quite bad is going on. This ensures we don’t invoke the Lambda function too often, thereby avoiding unnecessary costs.
More information on filter patterns can be found in Amazon CloudWatch documentation. The CloudWatch filter patterns are quite good, although obviously not as powerful as Grok.
Note the DependsOn field, which ensures that CloudWatch Logs is allowed to call the Lambda function before the subscription is created. This is just a cherry on the cake; it’s most probably unnecessary, since in a real-case scenario Apache would probably not receive requests until a few seconds later anyway (e.g., the time it takes to link the EC2 instance with a load balancer and get the load balancer to recognize the status of the EC2 instance as healthy).
Now let’s have a look at the Lambda function that will process the Apache error logs.
ProcessApacheErrorLogFunction:
  Type: AWS::Lambda::Function
  Properties:
    Handler: index.handler
    Role: !GetAtt BasicLambdaExecutionRole.Arn
    Runtime: python3.7
    Timeout: 10
    Environment:
      Variables:
        SLACK_WEBHOOK_HOST: !Ref SlackWebhookHost
        SLACK_WEBHOOK_PATH: !Ref SlackWebhookPath
    Code:
      ZipFile: |
        import base64
        import gzip
        import json
        import os
        from http.client import HTTPSConnection

        def handler(event, context):
            tmp = event['awslogs']['data']
            # `awslogs.data` is base64-encoded gzip'ed JSON
            tmp = base64.b64decode(tmp)
            tmp = gzip.decompress(tmp)
            tmp = json.loads(tmp)
            events = tmp['logEvents']
            for event in events:
                raw_log = event['message']
                log = json.loads(raw_log)
                if log['level'] in ["error", "crit", "alert", "emerg"]:
                    # This is a serious error message
                    msg = log['msg']
                    if msg.startswith("PHP Notice") or msg.startswith("PHP Warning"):
                        print(f"Ignoring PHP notices and warnings: {raw_log}")
                    else:
                        print(f"Received a serious Apache error log: {raw_log}")
                        slack_host = os.getenv('SLACK_WEBHOOK_HOST')
                        slack_path = os.getenv('SLACK_WEBHOOK_PATH')
                        print(f"Sending Slack post to: host={slack_host}, path={slack_path}, content={raw_log}")
                        cnx = HTTPSConnection(slack_host, timeout=5)
                        cnx.request("POST", slack_path, json.dumps({'text': raw_log}))
                        # It's important to read the response; if the cnx is closed too quickly, Slack might not post the msg
                        resp = cnx.getresponse()
                        resp_content = resp.read()
                        resp_code = resp.status
                        assert resp_code == 200
This second Lambda function processes Apache error logs and will post a message to Slack only when it encounters a serious error. In this case, PHP notice and warning messages are not considered serious enough to trigger an alert.
Again, this function expects the Apache error log to be JSON-formatted. So here is the error log format string I have been using:
ErrorLogFormat "{\"vhost\": \"%v\", \"timestamp\": \"%{cu}t\", \"module\": \"%-m\", \"level\": \"%l\", \"pid\": \"%-P\", \"tid\": \"%-T\", \"oserror\": \"%-E\", \"client\": \"%-a\", \"msg\": \"%M\"}"
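As a quick check of that filtering logic, here is how one such JSON error log line (sample values made up) would be classified:

```python
import json

# A made-up error log line in the JSON ErrorLogFormat defined above.
sample = ('{"vhost": "example.com", "timestamp": "2020-01-15 10:23:45", '
          '"module": "proxy", "level": "error", "pid": "1234", "tid": "5678", '
          '"oserror": "-", "client": "203.0.113.7", "msg": "AH01114: failed to connect to backend"}')

log = json.loads(sample)
is_serious = log['level'] in ["error", "crit", "alert", "emerg"]
is_php_noise = log['msg'].startswith(("PHP Notice", "PHP Warning"))
should_alert = is_serious and not is_php_noise
print(should_alert)  # True: a proxy error at level "error"
```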
ApacheErrorLogFunctionPermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !Ref ProcessApacheErrorLogFunction
    Action: lambda:InvokeFunction
    Principal: logs.amazonaws.com
    SourceArn: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:*
    SourceAccount: !Ref AWS::AccountId
This resource grants permissions to CloudWatch Logs to call your Lambda function.
ApacheErrorLogSubscriptionFilter:
  Type: AWS::Logs::SubscriptionFilter
  DependsOn: ApacheErrorLogFunctionPermission
  Properties:
    LogGroupName: !Ref ApacheErrorLogGroup
    DestinationArn: !GetAtt ProcessApacheErrorLogFunction.Arn
    FilterPattern: '{$.msg != "PHP Warning*" && $.msg != "PHP Notice*"}'
Finally, we link CloudWatch Logs with the Lambda function using a subscription filter for the Apache error log group. Note the filter pattern, which ensures that log entries whose message starts with “PHP Warning” or “PHP Notice” don’t trigger a call to the Lambda function.
One last word about costs: this solution is much cheaper than operating an ELK cluster. The logs stored in CloudWatch are priced at the same level as S3, and Lambda allows one million calls per month as part of its free tier. This is probably enough for a website with moderate traffic (provided you use CloudWatch Logs filters), especially if you coded it well and it doesn’t have too many errors!
Also, please note that Lambda functions support at most 1,000 concurrent executions. At the time of writing, this is a hard limit in AWS that can’t be changed. However, you can expect each call to the functions above to last about 30-40 ms. This should be fast enough to handle rather heavy traffic. If your workload is so intense that you hit this limit, you probably need a more complex solution based on Kinesis, which I might cover in a future article.
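As a back-of-the-envelope check of that claim, with the concurrency limit and invocation duration quoted above:

```python
# Rough throughput estimate under the figures quoted above:
# up to 1,000 concurrent executions, each lasting roughly 35 ms
# (the midpoint of the 30-40 ms range).
concurrency_limit = 1000
avg_duration_s = 0.035

max_batches_per_second = concurrency_limit / avg_duration_s
print(round(max_batches_per_second))  # ~28571 invocations per second
```

And since each invocation processes a whole batch of log events, the sustainable log throughput is higher still.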
ELK is an acronym for Elasticsearch-Logstash-Kibana. Additional software items are often needed, such as Beats (a collection of tools that send logs and metrics to Logstash) and Elastalert (which generates alerts based on time-series data stored in Elasticsearch).
The short answer is: yes. The various software items that make up the ELK stack come under various software licenses, but they generally all have a license that offers free usage without any support. It would be up to you, however, to set up and maintain the ELK cluster.
The ELK stack is highly configurable so there isn’t a single way to make it work. For example, here is the path of an Apache log entry: Filebeat reads the entry and sends it to Logstash, which parses it, and sends it to Elasticsearch, which saves and indexes it. Kibana can then retrieve the data and display it.