A complete guide for deploying a centralized logging stack (Loki + Promtail) on a K3s cluster using GitOps with ArgoCD, integrated with Grafana for dashboards.

![[Pasted image 20260104154635.png]]

## Overview

This guide covers deploying a centralized logging stack:

- **Loki** - Log aggregation system (like Prometheus, but for logs)
- **Promtail** - Agent that ships logs to Loki (runs as a DaemonSet on all nodes)
- **Grafana Integration** - Dashboards and datasource configuration

### Architecture

```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Node 1    │   │   Node 2    │   │   Node 3    │
│  Promtail   │   │  Promtail   │   │  Promtail   │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
                         ▼
                 ┌──────────────┐
                 │     Loki     │
                 │ (StatefulSet)│
                 └──────┬───────┘
                        │
                        ▼
                 ┌──────────────┐
                 │   Grafana    │
                 └──────────────┘
```

## Prerequisites

- K3s cluster (tested on a 3-node HA setup)
- ArgoCD installed and configured
- Gitea or another Git repository for GitOps
- Longhorn or another StorageClass for persistent storage
- cert-manager for TLS (optional)

## Directory Structure

```
k3s-gitops/
└── apps/
    └── loki/
        ├── namespace.yaml
        ├── configmap.yaml
        ├── pvc.yaml
        ├── statefulset.yaml
        ├── service.yaml
        ├── promtail-rbac.yaml
        ├── promtail-configmap.yaml
        ├── promtail-daemonset.yaml
        ├── promtail-service.yaml
        ├── promtail-ingress.yaml    # Optional
        ├── servicemonitor.yaml      # Optional - for Prometheus
        └── application.yaml         # ArgoCD Application
```

## Step 1: Create Namespace

**File: `apps/loki/namespace.yaml`**

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: loki
  labels:
    app.kubernetes.io/name: loki
```

## Step 2: Loki Configuration

**File: `apps/loki/configmap.yaml`**

> **IMPORTANT**: Do NOT include `enforce_metric_name: false` - this field was removed in Loki 3.x and will cause CrashLoopBackOff.
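Since ArgoCD will happily sync a config that Loki then rejects, a quick guard against this field before committing can save a sync/crash cycle. A minimal sketch (the function name and default path are my own, matching the layout above):

```shell
# Hypothetical pre-commit guard: reject the ConfigMap if it still contains
# the Loki 2.x-only field that makes Loki 3.x crash at startup.
check_loki_config() {
  config="${1:-apps/loki/configmap.yaml}"
  if grep -q 'enforce_metric_name' "$config"; then
    echo "ERROR: remove enforce_metric_name from $config (rejected by Loki 3.x)" >&2
    return 1
  fi
  echo "OK: no enforce_metric_name in $config"
}
```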
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: loki
  labels:
    app.kubernetes.io/name: loki
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100
      grpc_listen_port: 9096
      log_level: info

    common:
      instance_addr: 127.0.0.1
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory

    query_range:
      results_cache:
        cache:
          embedded_cache:
            enabled: true
            max_size_mb: 100

    limits_config:
      retention_period: 168h
      ingestion_rate_mb: 16
      ingestion_burst_size_mb: 24
      max_streams_per_user: 10000
      max_line_size: 256kb

    schema_config:
      configs:
        - from: 2024-01-01
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h

    ruler:
      alertmanager_url: http://k8s-monitoring-kube-promet-alertmanager.monitoring:9093
      storage:
        type: local
        local:
          directory: /loki/rules
      rule_path: /loki/rules-temp
      ring:
        kvstore:
          store: inmemory
      enable_api: true

    ingester:
      wal:
        enabled: true
        dir: /loki/wal
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      chunk_idle_period: 1h
      max_chunk_age: 1h
      chunk_target_size: 1048576
      chunk_retain_period: 30s

    compactor:
      working_directory: /loki/compactor
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      delete_request_store: filesystem
```

## Step 3: Persistent Volume Claim

**File: `apps/loki/pvc.yaml`**

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-data
  namespace: loki
  labels:
    app.kubernetes.io/name: loki
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
```

## Step 4: Loki StatefulSet

**File: `apps/loki/statefulset.yaml`**

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
  namespace: loki
  labels:
    app.kubernetes.io/name: loki
spec:
  serviceName: loki-headless
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: loki
  template:
    metadata:
      labels:
        app.kubernetes.io/name: loki
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3100"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        fsGroup: 10001
        runAsGroup: 10001
        runAsNonRoot: true
        runAsUser: 10001
      containers:
        - name: loki
          image: grafana/loki:3.3.2
          args:
            - -config.file=/etc/loki/loki.yaml
            - -target=all
          ports:
            - name: http
              containerPort: 3100
              protocol: TCP
            - name: grpc
              containerPort: 9096
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 45
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: http
            initialDelaySeconds: 45
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: config
              mountPath: /etc/loki
            - name: data
              mountPath: /loki
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      volumes:
        - name: config
          configMap:
            name: loki-config
        - name: data
          persistentVolumeClaim:
            claimName: loki-data
```

## Step 5: Loki Services

**File: `apps/loki/service.yaml`**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: loki
  labels:
    app.kubernetes.io/name: loki
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 3100
      targetPort: http
      protocol: TCP
    - name: grpc
      port: 9096
      targetPort: grpc
      protocol: TCP
  selector:
    app.kubernetes.io/name: loki
---
apiVersion: v1
kind: Service
metadata:
  name: loki-headless
  namespace: loki
  labels:
    app.kubernetes.io/name: loki
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: http
      port: 3100
      targetPort: http
      protocol: TCP
  selector:
    app.kubernetes.io/name: loki
```

## Step 6: Promtail RBAC

**File: `apps/loki/promtail-rbac.yaml`**

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: promtail
  namespace: loki
  labels:
    app.kubernetes.io/name: promtail
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: promtail
  labels:
    app.kubernetes.io/name: promtail
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: promtail
  labels:
    app.kubernetes.io/name: promtail
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: promtail
subjects:
  - kind: ServiceAccount
    name: promtail
    namespace: loki
```

## Step 7: Promtail Configuration

**File: `apps/loki/promtail-configmap.yaml`**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: loki
  labels:
    app.kubernetes.io/name: promtail
data:
  promtail.yaml: |
    server:
      http_listen_port: 3101
      grpc_listen_port: 0
      log_level: info

    positions:
      filename: /run/promtail/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push
        batchwait: 1s
        batchsize: 1048576
        timeout: 10s

    scrape_configs:
      # Scrape logs from all pods
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Drop pods annotated prometheus.io/scrape: "false"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: drop
            regex: false
          # Use the node name as __host__
          - source_labels: [__meta_kubernetes_pod_node_name]
            target_label: __host__
          # Set namespace label
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          # Set pod label
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod
          # Set container label
          - source_labels: [__meta_kubernetes_pod_container_name]
            target_label: container
          # Set app label from the pod's `app` label...
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          # ...preferring app.kubernetes.io/name when present
          # (the regex guard keeps the previous value when the label is absent)
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            regex: (.+)
            target_label: app
          # Set the log file path
          - replacement: /var/log/pods/*$1/*.log
            separator: /
            source_labels:
              - __meta_kubernetes_pod_uid
              - __meta_kubernetes_pod_container_name
            target_label: __path__

      # Scrape systemd journal logs
      - job_name: journal
        journal:
          max_age: 12h
          path: /var/log/journal
        labels:
          job: systemd-journal
        relabel_configs:
          - source_labels: ['__journal__systemd_unit']
            target_label: 'unit'
          - source_labels: ['__journal__hostname']
            target_label: 'node'
```

## Step 8: Promtail DaemonSet

**File: `apps/loki/promtail-daemonset.yaml`**

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: loki
  labels:
    app.kubernetes.io/name: promtail
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: promtail
  template:
    metadata:
      labels:
        app.kubernetes.io/name: promtail
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3101"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: promtail
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: promtail
          image: grafana/promtail:3.3.2
          args:
            - -config.file=/etc/promtail/promtail.yaml
          ports:
            - name: http-metrics
              containerPort: 3101
              protocol: TCP
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: run
              mountPath: /run/promtail
            - name: pods
              mountPath: /var/log/pods
              readOnly: true
            - name: journal
              mountPath: /var/log/journal
              readOnly: true
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: run
          emptyDir: {}
        - name: pods
          hostPath:
            path: /var/log/pods
        - name: journal
          hostPath:
            path: /var/log/journal
```

## Step 9: Promtail Service (Optional)

**File: `apps/loki/promtail-service.yaml`**

```yaml
apiVersion: v1
kind: Service
metadata:
  name: promtail
  namespace: loki
  labels:
    app.kubernetes.io/name: promtail
spec:
  type: ClusterIP
  ports:
    - name: http-metrics
      port: 3101
      targetPort: http-metrics
      protocol: TCP
  selector:
    app.kubernetes.io/name: promtail
```

## Step 10: ServiceMonitor for Prometheus (Optional)

**File: `apps/loki/servicemonitor.yaml`**

```yaml
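# NOTE (assumption): ServiceMonitor is a CRD from the Prometheus Operator,
# so this file only applies if kube-prometheus-stack (or similar) is
# installed. The `release: k8s-monitoring` label below must match your
# Prometheus instance's serviceMonitorSelector - adjust it to your setup.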
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
  namespace: loki
  labels:
    app.kubernetes.io/name: loki
    release: k8s-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: loki
  endpoints:
    - port: http
      interval: 30s
      path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: promtail
  namespace: loki
  labels:
    app.kubernetes.io/name: promtail
    release: k8s-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: promtail
  endpoints:
    - port: http-metrics
      interval: 30s
      path: /metrics
```

## Step 11: ArgoCD Application

**File: `apps/loki/application.yaml`**

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki
  namespace: argocd
spec:
  project: default
  source:
    repoURL: http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops
    targetRevision: HEAD
    path: apps/loki
  destination:
    server: https://kubernetes.default.svc
    namespace: loki
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
```

## Step 12: Grafana Integration

### Add Loki Datasource

Create a ConfigMap for Grafana to auto-provision the Loki datasource:

![[Pasted image 20260104154325.png]]

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasource-loki
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasource-loki.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        uid: loki
        access: proxy
        url: http://loki.loki.svc.cluster.local:3100
        isDefault: false
        editable: true
        jsonData:
          maxLines: 1000
```

Apply and restart Grafana:

```bash
kubectl apply -f grafana-datasource-loki.yaml
kubectl rollout restart deployment <grafana-deployment> -n monitoring
```

### Add Kubernetes Logs Dashboard

Create a ConfigMap for the dashboard:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubernetes-logs-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kubernetes-logs.json: |
    {
      "title": "Kubernetes Logs",
      "uid":
        "kubernetes-logs",
      "panels": [
        {
          "datasource": {"type": "loki", "uid": "loki"},
          "title": "Log Volume by Namespace",
          "type": "timeseries",
          "targets": [{
            "expr": "sum by (namespace) (count_over_time({namespace=~\"$namespace\"}[$__interval]))"
          }]
        },
        {
          "datasource": {"type": "loki", "uid": "loki"},
          "title": "Logs",
          "type": "logs",
          "targets": [{
            "expr": "{namespace=~\"$namespace\", pod=~\"$pod\"}"
          }]
        },
        {
          "datasource": {"type": "loki", "uid": "loki"},
          "title": "Error Logs",
          "type": "logs",
          "targets": [{
            "expr": "{namespace=~\"$namespace\"} |~ \"(?i)error|exception|fail|fatal\""
          }]
        }
      ],
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "datasource": {"type": "loki", "uid": "loki"},
            "query": "label_values(namespace)",
            "includeAll": true,
            "multi": true
          },
          {
            "name": "pod",
            "type": "query",
            "datasource": {"type": "loki", "uid": "loki"},
            "query": "label_values({namespace=~\"$namespace\"}, pod)",
            "includeAll": true,
            "multi": true
          }
        ]
      }
    }
```

## Deployment

### Option 1: GitOps with ArgoCD (Recommended)

1. Push all files to your GitOps repository:

   ```bash
   cd k3s-gitops
   git add apps/loki/
   git commit -m "Add Loki + Promtail logging stack"
   git push
   ```

2. Apply the ArgoCD Application:

   ```bash
   kubectl apply -f apps/loki/application.yaml
   ```

3. ArgoCD will automatically sync and deploy all resources.
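To confirm the sync from the CLI instead of the ArgoCD UI, something like the following can poll the Application status (the helper name is my own; it assumes the Application lives in the `argocd` namespace as above):

```shell
# Hypothetical helper: poll ArgoCD until the Application reports Synced.
wait_for_sync() {
  app="${1:-loki}"
  attempts="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(kubectl -n argocd get application "$app" \
      -o jsonpath='{.status.sync.status}')
    if [ "$status" = "Synced" ]; then
      echo "$app is Synced"
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  echo "timed out waiting for $app to sync" >&2
  return 1
}
# Usage: wait_for_sync loki
```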
### Option 2: Manual Deployment

```bash
kubectl apply -f apps/loki/namespace.yaml
kubectl apply -f apps/loki/configmap.yaml
kubectl apply -f apps/loki/pvc.yaml
kubectl apply -f apps/loki/statefulset.yaml
kubectl apply -f apps/loki/service.yaml
kubectl apply -f apps/loki/promtail-rbac.yaml
kubectl apply -f apps/loki/promtail-configmap.yaml
kubectl apply -f apps/loki/promtail-daemonset.yaml
kubectl apply -f apps/loki/promtail-service.yaml
kubectl apply -f apps/loki/servicemonitor.yaml
```

![[Pasted image 20260104154739.png]]

## Verification

### Check Loki Status

```bash
# Check Loki pod
kubectl get pods -n loki -l app.kubernetes.io/name=loki

# Check Loki logs
kubectl logs -n loki loki-0

# Check Loki is ready
kubectl exec -n loki loki-0 -- wget -qO- http://localhost:3100/ready
```

### Check Promtail Status

```bash
# Check Promtail pods (should be one per node)
kubectl get pods -n loki -l app.kubernetes.io/name=promtail

# Check Promtail logs
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50

# Check targets
kubectl exec -n loki <promtail-pod> -- wget -qO- http://localhost:3101/targets
```

### Test Log Queries

```bash
# Query logs via the Loki API
kubectl exec -n loki loki-0 -- wget -qO- \
  'http://localhost:3100/loki/api/v1/query?query={namespace="kube-system"}&limit=5'
```

## Troubleshooting

### Loki CrashLoopBackOff

**Symptom**: Loki pod keeps restarting with CrashLoopBackOff.

**Common Causes**:

1. **Deprecated config field** - Check logs for:
   ```
   field enforce_metric_name not found in type validation.plain
   ```
   **Fix**: Remove `enforce_metric_name: false` from `limits_config` in configmap.yaml

2. **Permission issues** - Check logs for permission denied errors.
   **Fix**: Ensure fsGroup and runAsUser are set correctly (10001 for Loki).

3. **Storage issues** - PVC not bound or storage class unavailable.
   **Fix**: Check PVC status with `kubectl get pvc -n loki`

### Promtail Not Collecting Logs

**Symptom**: No logs appearing in Loki/Grafana.
**Check**:

```bash
# Verify Promtail can reach Loki
kubectl exec -n loki <promtail-pod> -- wget -qO- http://loki:3100/ready

# Check Promtail targets
kubectl exec -n loki <promtail-pod> -- wget -qO- http://localhost:3101/targets

# Check for errors in Promtail logs
kubectl logs -n loki <promtail-pod> | grep -i error
```

### Grafana "No Data"

**Symptom**: Dashboard shows "No data" with error icons.

**Check**:

1. Verify the Loki datasource is configured in Grafana
2. Test the datasource connection: Grafana → Connections → Data Sources → Loki → Test
3. Verify the datasource UID matches the dashboard (`loki`)
4. Check Grafana can reach Loki:
   ```bash
   kubectl exec -n monitoring <grafana-pod> -c grafana -- \
     wget -qO- http://loki.loki.svc.cluster.local:3100/ready
   ```

### Grafana Pod Stuck in Pending

**Symptom**: After a restart, the Grafana pod stays in Pending state.

**Cause**: Multi-Attach error with the PVC (ReadWriteOnce volume).

**Fix**:

```bash
# Scale down to release the PVC
kubectl scale deployment <grafana> -n monitoring --replicas=0

# Wait for pods to terminate
kubectl get pods -n monitoring -w

# Scale back up
kubectl scale deployment <grafana> -n monitoring --replicas=1
```

## Useful LogQL Queries

```logql
# All logs from a namespace
{namespace="argocd"}

# Errors across all namespaces
{namespace=~".+"} |~ "(?i)error|exception|fail"

# Logs from a specific pod
{namespace="default", pod="my-app-xyz"}

# JSON logs - extract fields
{namespace="default"} | json | level="error"

# Rate of errors per minute
rate({namespace="default"} |~ "error" [1m])

# Top 10 namespaces by log volume
topk(10, sum by (namespace) (rate({namespace=~".+"}[5m])))
```

## Configuration Reference

### Loki Retention

Default: 168h (7 days).
Adjust in configmap.yaml:

```yaml
limits_config:
  retention_period: 336h  # 14 days
```

### Resource Limits

Adjust based on log volume:

| Component | Small (<1GB/day) | Medium (1-10GB/day) | Large (>10GB/day) |
|---|---|---|---|
| Loki Memory | 256Mi-512Mi | 512Mi-1Gi | 1Gi-4Gi |
| Loki CPU | 100m-500m | 500m-1000m | 1000m-2000m |
| Loki Storage | 10Gi | 50Gi | 100Gi+ |
| Promtail Memory | 64Mi-128Mi | 128Mi-256Mi | 256Mi-512Mi |

---
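As a rough way to pick a storage tier, PVC size can be estimated from daily ingest and the retention period. This heuristic (and its factor-of-two headroom for compaction lag and chunk overhead) is my own assumption, not a Loki guarantee:

```shell
# Back-of-the-envelope PVC sizing: daily ingest (Gi) x retention (days)
# x headroom. Compaction and index overhead vary, so treat this as a floor.
estimate_loki_storage() {
  daily_gi="$1"
  retention_days="$2"
  headroom="${3:-2}"
  echo "$(( daily_gi * retention_days * headroom ))Gi"
}
# e.g. estimate_loki_storage 1 7  ->  14Gi
```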