Thanos Ruler and Alertmanager are both deployed in a Kubernetes cluster. The versions are as follows:
a. Kubernetes cluster: v1.18.5
b. Thanos Ruler: v0.11.0
c. Alertmanager: v0.20.0
The Thanos Ruler YAML manifest:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: thanos-rule
  name: thanos-rule
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos-rule
  serviceName: thanos-rules
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos-rule
    spec:
      containers:
      - image: registry.cn-shenzhen.aliyuncs.com/gzlj/thanos-reloader:v0.1
        imagePullPolicy: Always
        name: reloader
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - args:
        - rule
        - --grpc-address=0.0.0.0:10901
        - --http-address=0.0.0.0:10902
        - --rule-file=/etc/thanos/rules/*rules.yaml
        - --data-dir=/var/thanos/rule
        - --label=rule_replica="$(NAME)"
        # Note the --alert.label-drop line below: its value is wrapped in double quotes
        - --alert.label-drop="rule_replica"
        - --query=dnssrv+_http._tcp.thanos-query.monitoring.svc.cluster.local
        - --alertmanagers.url=http://alertmanager-main.monitoring.svc.cluster.local:9093
        env:
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: quay.mirrors.ustc.edu.cn/thanos/thanos:v0.11.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 24
          httpGet:
            path: /-/healthy
            port: 10902
            scheme: HTTP
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: thanos-rule
        ports:
        - containerPort: 10901
          name: grpc
          protocol: TCP
        - containerPort: 10902
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 18
          httpGet:
            path: /-/ready
            port: 10902
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        volumeMounts:
        - mountPath: /var/thanos/rule
          name: data
        - mountPath: /etc/thanos/rules
          name: thanos-rules
      restartPolicy: Always
      serviceAccount: thanos-rules
      serviceAccountName: thanos-rules
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: thanos-rules
        name: thanos-rules
      - emptyDir: {}
        name: data

The key part is shown in the screenshot below.
Alertmanager received duplicate alerts. The only difference between the two copies was the value of the custom label rule_replica, as shown in the figure:
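I tried switching the Thanos Ruler image to v0.15.0, but the behavior was the same. Just as I was about to give up, I changed the Thanos Ruler startup argument --alert.label-drop="rule_replica" to --alert.label-drop=rule_replica, that is, I only removed the double quotes, and Alertmanager stopped receiving duplicate alerts.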
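With the quotes removed, Thanos Ruler drops the rule_replica label from each alert before sending it to Alertmanager. The alerts from the two replicas then carry identical label sets, so Alertmanager holds a single alert instead of the previous two. A likely explanation for the original failure: Kubernetes passes container args to the process verbatim, with no shell to strip quotes, so the double quotes became part of the label name and no label on the alert matched it.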
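
For reference, the change described above comes down to a single line in the thanos-rule container args. The fragment below is a minimal sketch that shows only the affected lines. Every other field of the StatefulSet is assumed to stay as in the manifest above, and the commented-out line is kept only to show the before and after side by side.

```yaml
# Fragment of the thanos-rule container spec; all other fields are unchanged.
args:
- rule
- --label=rule_replica="$(NAME)"
# Before (duplicate alerts reached Alertmanager):
# - --alert.label-drop="rule_replica"
# After (duplicates gone): the same flag, without the double quotes.
- --alert.label-drop=rule_replica
```

Both replicas have to restart with the new argument before the change takes effect. Quoting the whole list item at the YAML level, as in "--alert.label-drop=rule_replica", should behave the same as the unquoted form, because the YAML parser removes those quotes before the argument reaches the container.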