Disclaimer: The procedures and examples provided herein come with no support or warranty, explicit or implied. Caveat emptor!
Introduction
ServiceNow Health Log Analytics ("HLA") is a powerful lever to use for tapping into the wellspring of data which steadily pours out of your Kubernetes clusters. Its machine learning engine can speed new application onboarding and allow you to detect anomalous behavior before it turns into a user-impacting event. Here's an example of how I set up logs to flow from an EKS cluster into Health Log Analytics.
Cribl Makes it Easier
This process was made much easier by the free cloud-hosted Cribl Stream (https://cribl.cloud). Its combination of flexible log ingestion, routing, and propagation with an intuitive UI helped me streamline the creation of my HLA data inputs and keep an eye on the data flowing through them. In addition, by instantiating my log pipelines via Cribl, I was able to include support for historical archiving and querying of my logs.
Cluster-Side Setup
To capture logs from application pods, I opted to use one of the manifests provided in the Fluentd Daemonset for Kubernetes Git repo (https://github.com/fluent/fluentd-kubernetes-daemonset). Since Cribl Cloud comes with an Elastic API endpoint, I selected the fluentd-daemonset-elasticsearch-rbac manifest and applied it to my cluster, with the following tweaks:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
    version: v1
spec:
  selector:
    matchLabels:
      k8s-app: fluentd-logging
      version: v1
  template:
    metadata:
      labels:
        k8s-app: fluentd-logging
        version: v1
    spec:
      serviceAccount: fluentd
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "<cribl cloud FQDN here>"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        - name: FLUENT_ELASTICSEARCH_SCHEME
          value: "https"
        # Option to configure elasticsearch plugin with self signed certs
        # ================================================================
        - name: FLUENT_ELASTICSEARCH_SSL_VERIFY
          value: "true"
        # Option to configure elasticsearch plugin with tls
        # ================================================================
        - name: FLUENT_ELASTICSEARCH_SSL_VERSION
          value: "TLSv1_2"
        # X-Pack Authentication
        # =====================
        - name: FLUENT_ELASTICSEARCH_USER
          value: "<put user here>"
        - name: FLUENT_ELASTICSEARCH_PASSWORD
          value: "<put password here>"
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        # When actual pod logs in /var/lib/docker/containers, the following lines should be used.
        - name: dockercontainerlogdirectory
          mountPath: /var/lib/docker/containers
          readOnly: true
        # When actual pod logs in /var/log/pods, the following lines should be used.
        # - name: dockercontainerlogdirectory
        #   mountPath: /var/log/pods
        #   readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      # When actual pod logs in /var/lib/docker/containers, the following lines should be used.
      - name: dockercontainerlogdirectory
        hostPath:
          path: /var/lib/docker/containers
      # When actual pod logs in /var/log/pods, the following lines should be used.
      # - name: dockercontainerlogdirectory
      #   hostPath:
      #     path: /var/log/pods
- I defined the value of "FLUENT_ELASTICSEARCH_HOST" to be the FQDN for my Cribl Cloud tenant.
- I set "FLUENT_ELASTICSEARCH_USER" and "FLUENT_ELASTICSEARCH_PASSWORD" to match the values from my Cribl Cloud Elasticsearch API source definition.
- I uncommented the volumeMount sections referring to /var/lib/docker/containers, which is where the EKS nodes keep container logs.
- I commented out the default volumeMount sections referring to /var/log/pods.
Default Cribl Elasticsearch API input
Set a username/password for the Elasticsearch API endpoint
After applying the edited manifest to my cluster, I can see a fluentd pod running on each node and sample data from the pods flowing into the Cribl Elasticsearch API source.
whallam@WAPVg8hnvBqCBUy:~/fluentd-kubernetes-daemonset$ kubectl get -n kube-system pod | grep fluent
fluentd-g29bf 1/1 Running 0 27d
fluentd-vbcgh 1/1 Running 0 27d
whallam@WAPVg8hnvBqCBUy:~/fluentd-kubernetes-daemonset$
When I check the Elasticsearch input in Cribl, I can see it is receiving data. Notice how Cribl automatically breaks down the inbound payload into its component key/values.
Connecting to HLA
To send data into Health Log Analytics (HLA) in my ServiceNow instance, I navigated to Health Log Analytics->Data Input->Data Inputs and created a new "TCP" data input. You can see an in-depth article on this piece at Configuring a Cribl Logstream Destination for Health Log Analytics.
A TCP data input for Cribl
Cribl destination for HLA
Cribl destination for HLA - TLS tab
I tied the Cribl source and destination together using a Route, as seen below:
A Cribl route tying the Elasticsearch input with the HLA output
Something I noticed once those pod logs started flowing in: there was significant diversity in the format of logs from different microservices. This prompted me to focus initially on sending logs from one microservice in particular into HLA, which Cribl also made easier. I created a new Cribl pipeline with two functions (a rough sketch of their logic follows the list):
- A "Drop" function which dropped any log events not tagged with an "app" value of "frontend"
- An "Eval" function which generated normalized "criblSeverity" and "criblTimestamp" key/values, to make the ingestion into HLA easier.
A Cribl pipeline to filter out one microservice
I added this pipeline to my HLA destination under "Post-Processing".
Using post processing to filter what goes to HLA
Now I could focus my HLA setup efforts on one microservice at a time. The next step in this process was to publish a mapping for this data input. Navigating to Health Log Analytics->Mapping->Data Input Mapping, I could map the Kubernetes namespace into the HLA Application Service and the Kubernetes "app" tag to the underlying component. This facilitated branching each type of log into its own Source Type Structure (format). NOTE: I chose to disable header detection in this data input so I would have full access to the key/values populated by Cribl in the log events.
Data input mapping for pod logs
In the course of using the HLA sample logs to validate my data input mapping, I noticed that periodically Cribl would send along what appeared to be a keepalive event:
{"authToken":"","format":"ndjson","info":{"hostname":"ip-x.us-west-2.compute.internal","platform":"linux","architecture":"arm64","release":"5.15.69-37.134.amzn2.aarch64","cpus":2,"totalmem":4024553472,"node":"v14.18.3","cribl":{"version":"3.5.4-4bf0fd31"}}}
To eliminate these events from the logs being sent through HLA, I added some preprocessor code by navigating to Health Log Analytics->Data Input Preprocessor and selecting the record for my Cribl data input. I used a couple lines of JavaScript to drop any events coming in with the "authToken" property in them.
Using HLA preprocessing to drop keepalive events
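The exact entry point and return convention for the preprocessor script come from the template in your instance, so the snippet below is only an illustration of the check itself; the function name is my own.
// Illustration only: how you actually signal "drop this event" is dictated
// by the HLA preprocessor template, which is not reproduced here.
function isCriblKeepalive(rawLine) {
  try {
    var parsed = JSON.parse(rawLine);
    // Keepalive payloads carry an "authToken" property; real pod logs do not.
    return Object.prototype.hasOwnProperty.call(parsed, 'authToken');
  } catch (e) {
    return false; // not JSON, so treat it as a normal log line and keep it
  }
}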
After publishing the mapping and preprocessor for the data input, I saw HLA create a source type and source type structure for "boutique-frontend" within 10 minutes. It appeared under Health Log Analytics->Source Type Structures. Because I had already performed the needed transformations via my pipeline on the Cribl side, all I needed to do to finalize the source type structure was assign the correct labels to the existing payload keys. The "log" key is labeled as the "Message", the "criblSeverity" key maps to "Severity", "criblTimestamp" maps to "Timestamp", and the "kubernetes.pod_name" key is labeled with the "Host" tag, to facilitate correlation with my discovered Kubernetes pods.
Source type structure labels pt1
Source type structure labels pt2
With that final step complete, I was able to see properly labeled logs coming in from my frontend microservice inside the Boutique application/namespace. I then had some fun injecting chaos by killing various pods, both manually and via the "KubeInvaders" tool (https://github.com/lucky-sideburn/KubeInvaders), and watching HLA detect the anomalous behavior and raise correlated alerts.
Alerts from HLA based on Kubernetes pod logs
Archiving and Historical Searching
Because Health Log Analytics is focused on analysis and not long-term retention of logs, it needs to be part of an overall strategy for log management. By bringing Cribl Stream into the mix, I'm able to address all aspects of logging, both the analytics component and the overall management of logs and their life cycle. After I got my example data input working as desired in HLA, I turned my attention back to Cribl to add archiving and ad hoc searching of that archived log data.
To start, I created a Cribl S3 destination which would store logs in an S3 bucket. Since Cribl Cloud runs in AWS, I created an IAM role which could be assumed by the Cribl service and used to write into the target S3 bucket.
{
  "Resources": {
    "Policy": {
      "Type": "AWS::IAM::ManagedPolicy",
      "Properties": {
        "Description": "Cribl Cloud perms",
        "ManagedPolicyName": "cribl-cloud-1",
        "PolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:PutObject",
                "s3:GetObject"
              ],
              "Resource": [
                "arn:aws:s3:::mybucket",
                "arn:aws:s3:::mybucket/*"
              ],
              "Effect": "Allow"
            }
          ]
        }
      }
    },
    "Role": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "AWS": [
                  "arn:aws:iam::XXX:role/main-default",
                  "arn:aws:iam::XXX:role/worker-in.logstream"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "RoleName": "cribl-cloud-1",
        "ManagedPolicyArns": [
          { "Ref": "Policy" }
        ]
      }
    }
  }
}
I placed the ARN for the role in the "Assume Role" section of my Cribl S3 destination.
Cribl S3 destination pt1
Cribl S3 destination p2 (assume role)
After establishing the S3 destination, I connected it with my incoming Elasticsearch API source by creating a route.
Routing data to S3 for archiving
Reviewing my S3 bucket showed an accumulation of files, following Cribl's standard date-based partitioning.
Logs archived from Cribl into S3
To make these archived logs searchable, I turned to AWS Athena. What I discovered was that the format of the data presented some challenges: some keys contained "@" or "_", and there was also a set of key/value pairs with unpredictable keys, i.e., the Kubernetes labels for various objects. Once again, Cribl came to the rescue: by creating a pipeline, I was able to quickly transform the data before sending it to S3.
A pipeline for Athena-friendly S3 log payloads
My S3 archive pipeline consisted of two functions (a rough sketch follows the list):
- A "Code" function which changed the Kubernetes labels key into an array of key/value pairs.
- An "Eval" function which created an Athena-friendly timestamp and deleted the problematic keys.
Once the pipeline was created, I attached it to the Post-Processing section of the S3 destination.
Adding a post-processing pipeline to S3 destination
After clearing the existing contents of my S3 bucket, I was able to create an external table in Athena.
CREATE EXTERNAL TABLE `cribl-2`(
`cribl` string COMMENT 'from deserializer',
`cribl_pipe` array<string> COMMENT 'from deserializer',
`cribllabels` array<string> COMMENT 'from deserializer',
`docker` struct<container_id:string> COMMENT 'from deserializer',
`kubernetes` struct<container_image:string,container_image_id:string,container_name:string,host:string,master_url:string,namespace_id:string,namespace_name:string,pod_id:string,pod_ip:string,pod_name:string> COMMENT 'from deserializer',
`log` string COMMENT 'from deserializer',
`stream` string COMMENT 'from deserializer',
`tag` string COMMENT 'from deserializer',
`timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'case.insensitive'='TRUE',
'dots.in.keys'='FALSE',
'ignore.malformed.json'='FALSE',
'mapping'='TRUE')
LOCATION
's3://mybucket/eks'
TBLPROPERTIES (
'classification'='json',
'has_encrypted_data'='true')
Once the external table was created in Athena, I could run queries against the raw logs as needed.
Querying logs from S3 with Athena
Conclusion
After completing this exercise, I was struck by how much easier it was to plumb all these pieces together once I added Cribl Stream as the connective tissue. I heartily recommend giving it a try if you have any logging use cases you're exploring. You can self-register at https://cribl.cloud and get a free instance with significant capability.