A complete guide to deploying a Loki + Promtail logging stack on a K3s cluster using GitOps with ArgoCD, integrated with Grafana for dashboards.
![[Pasted image 20260104154635.png]]
## Overview
This guide covers deploying a centralized logging stack:
- **Loki** - Log aggregation system (like Prometheus, but for logs)
- **Promtail** - Agent that ships logs to Loki (runs as DaemonSet on all nodes)
- **Grafana Integration** - Dashboards and datasource configuration
### Architecture
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ Promtail │ │ Promtail │ │ Promtail │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└──────────────────┼──────────────────┘
│
▼
┌─────────────┐
│ Loki │
│ (StatefulSet)│
└──────┬──────┘
│
▼
┌─────────────┐
│ Grafana │
└─────────────┘
```
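Conceptually, each Promtail agent batches log lines and POSTs them to Loki's push endpoint (`/loki/api/v1/push`). A minimal sketch of what such a push body looks like, per Loki's HTTP API (one stream, nanosecond timestamps as strings); Promtail does this for you, the sketch only illustrates the data model:

```python
import json
import time

def build_push_payload(labels: dict, lines: list) -> str:
    """Build a Loki push-API body: one stream identified by its label set."""
    ts = str(time.time_ns())  # Loki expects nanosecond unix timestamps as strings
    return json.dumps({
        "streams": [{
            "stream": labels,                          # stream labels, e.g. namespace/app
            "values": [[ts, line] for line in lines],  # [timestamp, raw log line] pairs
        }]
    })

payload = build_push_payload({"namespace": "demo", "app": "web"}, ["hello loki"])
```

The label set identifies a stream; the values carry the raw lines. This is why high-cardinality labels (pod UIDs, request IDs) belong in the log line, not the label set.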
## Prerequisites
- K3s cluster (tested on 3-node HA setup)
- ArgoCD installed and configured
- Gitea or other Git repository for GitOps
- Longhorn or other StorageClass for persistent storage
- cert-manager for TLS (optional)
## Directory Structure
```
k3s-gitops/
└── apps/
└── loki/
├── namespace.yaml
├── configmap.yaml
├── pvc.yaml
├── statefulset.yaml
├── service.yaml
├── promtail-rbac.yaml
├── promtail-configmap.yaml
├── promtail-daemonset.yaml
├── promtail-service.yaml
├── promtail-ingress.yaml # Optional
├── servicemonitor.yaml # Optional - for Prometheus
└── application.yaml # ArgoCD Application
```
## Step 1: Create Namespace
**File: `apps/loki/namespace.yaml`**
```yaml
apiVersion: v1
kind: Namespace
metadata:
name: loki
labels:
app.kubernetes.io/name: loki
```
## Step 2: Loki Configuration
**File: `apps/loki/configmap.yaml`**
> **IMPORTANT**: Do NOT include `enforce_metric_name: false` - this field was removed in Loki 3.x and will cause a CrashLoopBackOff at startup.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: loki-config
namespace: loki
labels:
app.kubernetes.io/name: loki
data:
loki.yaml: |
auth_enabled: false
server:
http_listen_port: 3100
grpc_listen_port: 9096
log_level: info
common:
instance_addr: 127.0.0.1
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
query_range:
results_cache:
cache:
embedded_cache:
enabled: true
max_size_mb: 100
limits_config:
retention_period: 168h
ingestion_rate_mb: 16
ingestion_burst_size_mb: 24
max_streams_per_user: 10000
max_line_size: 256kb
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
ruler:
alertmanager_url: http://k8s-monitoring-kube-promet-alertmanager.monitoring:9093
storage:
type: local
local:
directory: /loki/rules
rule_path: /loki/rules-temp
ring:
kvstore:
store: inmemory
enable_api: true
ingester:
wal:
enabled: true
dir: /loki/wal
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
chunk_idle_period: 1h
max_chunk_age: 1h
chunk_target_size: 1048576
chunk_retain_period: 30s
compactor:
working_directory: /loki/compactor
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
delete_request_store: filesystem
```
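The `retention_period: 168h` above uses a Prometheus-style duration string. A quick helper to double-check hour values against intended days (a toy converter, not Loki's actual parser):

```python
def hours_to_days(duration: str) -> float:
    """Convert an 'Nh' duration string to days; only the 'h' suffix is handled."""
    if not duration.endswith("h"):
        raise ValueError("expected an hour-suffixed duration like '168h'")
    return int(duration[:-1]) / 24

print(hours_to_days("168h"))  # 7.0 -> the 7-day default above
```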
## Step 3: Persistent Volume Claim
**File: `apps/loki/pvc.yaml`**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: loki-data
namespace: loki
labels:
app.kubernetes.io/name: loki
spec:
accessModes:
- ReadWriteOnce
storageClassName: longhorn
resources:
requests:
storage: 10Gi
```
## Step 4: Loki StatefulSet
**File: `apps/loki/statefulset.yaml`**
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: loki
namespace: loki
labels:
app.kubernetes.io/name: loki
spec:
serviceName: loki-headless
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: loki
template:
metadata:
labels:
app.kubernetes.io/name: loki
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "3100"
prometheus.io/path: "/metrics"
spec:
securityContext:
fsGroup: 10001
runAsGroup: 10001
runAsNonRoot: true
runAsUser: 10001
containers:
- name: loki
image: grafana/loki:3.3.2
args:
- -config.file=/etc/loki/loki.yaml
- -target=all
ports:
- name: http
containerPort: 3100
protocol: TCP
- name: grpc
containerPort: 9096
protocol: TCP
livenessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 45
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 45
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: config
mountPath: /etc/loki
- name: data
mountPath: /loki
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumes:
- name: config
configMap:
name: loki-config
- name: data
persistentVolumeClaim:
claimName: loki-data
```
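With these probe values, a pod that never reports ready is restarted no sooner than `initialDelaySeconds` plus `failureThreshold` consecutive failed periods. The generous 45s delay leaves room for WAL replay on restart; shortening it on slow storage risks restart loops:

```python
def liveness_restart_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate worst-case seconds before the kubelet restarts the container."""
    return initial_delay + failure_threshold * period

print(liveness_restart_seconds(45, 10, 3))  # 75 seconds with the values above
```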
## Step 5: Loki Services
**File: `apps/loki/service.yaml`**
```yaml
apiVersion: v1
kind: Service
metadata:
name: loki
namespace: loki
labels:
app.kubernetes.io/name: loki
spec:
type: ClusterIP
ports:
- name: http
port: 3100
targetPort: http
protocol: TCP
- name: grpc
port: 9096
targetPort: grpc
protocol: TCP
selector:
app.kubernetes.io/name: loki
---
apiVersion: v1
kind: Service
metadata:
name: loki-headless
namespace: loki
labels:
app.kubernetes.io/name: loki
spec:
type: ClusterIP
clusterIP: None
ports:
- name: http
port: 3100
targetPort: http
protocol: TCP
selector:
app.kubernetes.io/name: loki
```
## Step 6: Promtail RBAC
**File: `apps/loki/promtail-rbac.yaml`**
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: promtail
namespace: loki
labels:
app.kubernetes.io/name: promtail
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: promtail
labels:
app.kubernetes.io/name: promtail
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- services
- endpoints
- pods
verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: promtail
labels:
app.kubernetes.io/name: promtail
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: promtail
subjects:
- kind: ServiceAccount
name: promtail
namespace: loki
```
## Step 7: Promtail Configuration
**File: `apps/loki/promtail-configmap.yaml`**
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
namespace: loki
labels:
app.kubernetes.io/name: promtail
data:
promtail.yaml: |
server:
http_listen_port: 3101
grpc_listen_port: 0
log_level: info
positions:
filename: /run/promtail/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
batchwait: 1s
batchsize: 1048576
timeout: 10s
scrape_configs:
# Scrape logs from all pods
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
          # Skip pods annotated prometheus.io/scrape=false
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: drop
regex: false
          # Use the node name as __host__
- source_labels: [__meta_kubernetes_pod_node_name]
target_label: __host__
# Set namespace label
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
# Set pod label
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
# Set container label
- source_labels: [__meta_kubernetes_pod_container_name]
target_label: container
# Set app label from pod label
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
          # Prefer app.kubernetes.io/name when present (regex (.+) skips empty values)
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            regex: (.+)
            target_label: app
# Set the log file path
- replacement: /var/log/pods/*$1/*.log
separator: /
source_labels:
- __meta_kubernetes_pod_uid
- __meta_kubernetes_pod_container_name
target_label: __path__
# Scrape systemd journal logs
- job_name: journal
journal:
max_age: 12h
path: /var/log/journal
labels:
job: systemd-journal
relabel_configs:
- source_labels: ['__journal__systemd_unit']
target_label: 'unit'
- source_labels: ['__journal__hostname']
target_label: 'node'
```
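The final relabel rule joins the pod UID and container name with the configured `separator` and substitutes the result for `$1` in `replacement`, so Promtail ends up tailing files matching `/var/log/pods/*<uid>/<container>/*.log`. A sketch of that substitution (hypothetical UID; this is not Promtail's actual relabel engine):

```python
def kubernetes_pod_path(pod_uid: str, container: str) -> str:
    """Mimic the __path__ rule: source labels joined by '/', substituted for $1."""
    joined = "/".join([pod_uid, container])          # separator: /
    return "/var/log/pods/*{}/*.log".format(joined)  # replacement: /var/log/pods/*$1/*.log

print(kubernetes_pod_path("0f64c2d1", "loki"))
```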
## Step 8: Promtail DaemonSet
**File: `apps/loki/promtail-daemonset.yaml`**
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: promtail
namespace: loki
labels:
app.kubernetes.io/name: promtail
spec:
selector:
matchLabels:
app.kubernetes.io/name: promtail
template:
metadata:
labels:
app.kubernetes.io/name: promtail
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "3101"
prometheus.io/path: "/metrics"
spec:
serviceAccountName: promtail
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
containers:
- name: promtail
image: grafana/promtail:3.3.2
args:
- -config.file=/etc/promtail/promtail.yaml
ports:
- name: http-metrics
containerPort: 3101
protocol: TCP
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
volumeMounts:
- name: config
mountPath: /etc/promtail
- name: run
mountPath: /run/promtail
- name: pods
mountPath: /var/log/pods
readOnly: true
- name: journal
mountPath: /var/log/journal
readOnly: true
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumes:
- name: config
configMap:
name: promtail-config
- name: run
emptyDir: {}
- name: pods
hostPath:
path: /var/log/pods
- name: journal
hostPath:
path: /var/log/journal
```
## Step 9: Promtail Service (Optional)
**File: `apps/loki/promtail-service.yaml`**
```yaml
apiVersion: v1
kind: Service
metadata:
name: promtail
namespace: loki
labels:
app.kubernetes.io/name: promtail
spec:
type: ClusterIP
ports:
- name: http-metrics
port: 3101
targetPort: http-metrics
protocol: TCP
selector:
app.kubernetes.io/name: promtail
```
## Step 10: ServiceMonitor for Prometheus (Optional)
**File: `apps/loki/servicemonitor.yaml`**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: loki
namespace: loki
labels:
app.kubernetes.io/name: loki
release: k8s-monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/name: loki
endpoints:
- port: http
interval: 30s
path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: promtail
namespace: loki
labels:
app.kubernetes.io/name: promtail
release: k8s-monitoring
spec:
selector:
matchLabels:
app.kubernetes.io/name: promtail
endpoints:
- port: http-metrics
interval: 30s
path: /metrics
```
## Step 11: ArgoCD Application
**File: `apps/loki/application.yaml`**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: loki
namespace: argocd
spec:
project: default
source:
repoURL: http://gitea-http.gitea.svc.cluster.local:3000/admin/k3s-gitops
targetRevision: HEAD
path: apps/loki
destination:
server: https://kubernetes.default.svc
namespace: loki
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- ApplyOutOfSyncOnly=true
```
## Step 12: Grafana Integration
### Add Loki Datasource
Create a ConfigMap for Grafana to auto-provision the Loki datasource:
![[Pasted image 20260104154325.png]]
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasource-loki
namespace: monitoring
labels:
grafana_datasource: "1"
data:
datasource-loki.yaml: |
apiVersion: 1
datasources:
- name: Loki
type: loki
uid: loki
access: proxy
url: http://loki.loki.svc.cluster.local:3100
isDefault: false
editable: true
jsonData:
maxLines: 1000
```
Apply and restart Grafana:
```bash
kubectl apply -f grafana-datasource-loki.yaml
kubectl rollout restart deployment <grafana-deployment> -n monitoring
```
### Add Kubernetes Logs Dashboard
Create a ConfigMap for the dashboard:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: kubernetes-logs-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
kubernetes-logs.json: |
{
"title": "Kubernetes Logs",
"uid": "kubernetes-logs",
"panels": [
{
"datasource": {"type": "loki", "uid": "loki"},
"title": "Log Volume by Namespace",
"type": "timeseries",
"targets": [{
"expr": "sum by (namespace) (count_over_time({namespace=~\"$namespace\"}[$__interval]))"
}]
},
{
"datasource": {"type": "loki", "uid": "loki"},
"title": "Logs",
"type": "logs",
"targets": [{
"expr": "{namespace=~\"$namespace\", pod=~\"$pod\"}"
}]
},
{
"datasource": {"type": "loki", "uid": "loki"},
"title": "Error Logs",
"type": "logs",
"targets": [{
"expr": "{namespace=~\"$namespace\"} |~ \"(?i)error|exception|fail|fatal\""
}]
}
],
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": {"type": "loki", "uid": "loki"},
"query": "label_values(namespace)",
"includeAll": true,
"multi": true
},
{
"name": "pod",
"type": "query",
"datasource": {"type": "loki", "uid": "loki"},
"query": "label_values({namespace=~\"$namespace\"}, pod)",
"includeAll": true,
"multi": true
}
]
}
}
```
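Grafana's dashboard provisioner skips files that are not valid JSON, so it pays to validate the payload before committing. A minimal sanity check over a trimmed-down copy of the dashboard (in practice, load the JSON straight from the file):

```python
import json

raw = """
{
  "title": "Kubernetes Logs",
  "uid": "kubernetes-logs",
  "panels": []
}
"""
dashboard = json.loads(raw)  # raises ValueError on malformed JSON
assert dashboard["uid"] == "kubernetes-logs"  # a stable uid keeps dashboard links working
print("dashboard JSON parses; uid =", dashboard["uid"])
```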
## Deployment
### Option 1: GitOps with ArgoCD (Recommended)
1. Push all files to your GitOps repository:
```bash
cd k3s-gitops
git add apps/loki/
git commit -m "Add Loki + Promtail logging stack"
git push
```
2. Apply the ArgoCD Application:
```bash
kubectl apply -f apps/loki/application.yaml
```
3. ArgoCD will automatically sync and deploy all resources.
### Option 2: Manual Deployment
```bash
kubectl apply -f apps/loki/namespace.yaml
kubectl apply -f apps/loki/configmap.yaml
kubectl apply -f apps/loki/pvc.yaml
kubectl apply -f apps/loki/statefulset.yaml
kubectl apply -f apps/loki/service.yaml
kubectl apply -f apps/loki/promtail-rbac.yaml
kubectl apply -f apps/loki/promtail-configmap.yaml
kubectl apply -f apps/loki/promtail-daemonset.yaml
kubectl apply -f apps/loki/promtail-service.yaml
kubectl apply -f apps/loki/servicemonitor.yaml
```
![[Pasted image 20260104154739.png]]
## Verification
### Check Loki Status
```bash
# Check Loki pod
kubectl get pods -n loki -l app.kubernetes.io/name=loki
# Check Loki logs
kubectl logs -n loki loki-0
# Check Loki is ready
kubectl exec -n loki loki-0 -- wget -qO- http://localhost:3100/ready
```
### Check Promtail Status
```bash
# Check Promtail pods (should be one per node)
kubectl get pods -n loki -l app.kubernetes.io/name=promtail
# Check Promtail logs
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50
# Check targets
kubectl exec -n loki <promtail-pod> -- wget -qO- http://localhost:3101/targets
```
### Test Log Queries
```bash
# Query logs via Loki API
kubectl exec -n loki loki-0 -- wget -qO- \
'http://localhost:3100/loki/api/v1/query?query={namespace="kube-system"}&limit=5'
```
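The LogQL selector in that URL must be percent-encoded (`{`, `=`, and quotes are not URL-safe). A small helper for composing query URLs against the in-cluster Service from Step 5 (a convenience sketch, not an official client):

```python
from urllib.parse import urlencode

def loki_query_url(base: str, logql: str, limit: int = 5) -> str:
    """Build a /loki/api/v1/query URL with the LogQL expression percent-encoded."""
    params = urlencode({"query": logql, "limit": limit})
    return "{}/loki/api/v1/query?{}".format(base, params)

url = loki_query_url("http://loki.loki.svc.cluster.local:3100", '{namespace="kube-system"}')
print(url)
```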
## Troubleshooting
### Loki CrashLoopBackOff
**Symptom**: Loki pod keeps restarting with CrashLoopBackOff.
**Common Causes**:
1. **Deprecated config field** - Check logs for:
```
field enforce_metric_name not found in type validation.plain
```
**Fix**: Remove `enforce_metric_name: false` from limits_config in configmap.yaml
2. **Permission issues** - Check logs for permission denied errors. **Fix**: Ensure fsGroup and runAsUser are set correctly (10001 for Loki).
3. **Storage issues** - PVC not bound or storage class unavailable. **Fix**: Check PVC status with `kubectl get pvc -n loki`
### Promtail Not Collecting Logs
**Symptom**: No logs appearing in Loki/Grafana.
**Check**:
```bash
# Verify Promtail can reach Loki
kubectl exec -n loki <promtail-pod> -- wget -qO- http://loki:3100/ready
# Check Promtail targets
kubectl exec -n loki <promtail-pod> -- wget -qO- http://localhost:3101/targets
# Check for errors in Promtail logs
kubectl logs -n loki <promtail-pod> | grep -i error
```
### Grafana "No Data"
**Symptom**: Dashboard shows "No data" with error icons.
**Check**:
1. Verify Loki datasource is configured in Grafana
2. Test datasource connection: Grafana → Connections → Data Sources → Loki → Test
3. Verify datasource UID matches dashboard (`loki`)
4. Check Grafana can reach Loki:
```bash
kubectl exec -n monitoring <grafana-pod> -c grafana -- \
  wget -qO- http://loki.loki.svc.cluster.local:3100/ready
```
### Grafana Pod Stuck in Pending
**Symptom**: After restart, Grafana pod stays in Pending state.
**Cause**: Multi-Attach error with PVC (ReadWriteOnce volume).
**Fix**:
```bash
# Scale down to release PVC
kubectl scale deployment <grafana> -n monitoring --replicas=0
# Wait for pods to terminate
kubectl get pods -n monitoring -w
# Scale back up
kubectl scale deployment <grafana> -n monitoring --replicas=1
```
## Useful LogQL Queries
```logql
# All logs from a namespace
{namespace="argocd"}
# Errors across all namespaces
{namespace=~".+"} |~ "(?i)error|exception|fail"
# Logs from specific pod
{namespace="default", pod="my-app-xyz"}
# JSON logs - extract fields
{namespace="default"} | json | level="error"
# Rate of errors per minute
rate({namespace="default"} |~ "error" [1m])
# Top 10 namespaces by log volume
topk(10, sum by (namespace) (rate({namespace=~".+"}[5m])))
```
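When generating selectors like these programmatically (scripts, alert generators), quote-escape the label values first. A minimal builder, not an official LogQL library:

```python
def logql_selector(**labels):
    """Build a LogQL stream selector with backslashes and double quotes escaped."""
    def esc(value):
        return value.replace("\\", "\\\\").replace('"', '\\"')
    pairs = ", ".join('{}="{}"'.format(k, esc(v)) for k, v in labels.items())
    return "{" + pairs + "}"

print(logql_selector(namespace="default", pod="my-app-xyz"))
```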
## Configuration Reference
### Loki Retention
Default: 168h (7 days). Adjust in configmap.yaml:
```yaml
limits_config:
retention_period: 336h # 14 days
```
### Resource Limits
Adjust based on log volume:
|Component|Small (<1GB/day)|Medium (1-10GB/day)|Large (>10GB/day)|
|---|---|---|---|
|Loki Memory|256Mi-512Mi|512Mi-1Gi|1Gi-4Gi|
|Loki CPU|100m-500m|500m-1000m|1000m-2000m|
|Loki Storage|10Gi|50Gi|100Gi+|
|Promtail Memory|64Mi-128Mi|128Mi-256Mi|256Mi-512Mi|
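The storage column tracks roughly with daily volume times retention days, plus headroom for the TSDB index, WAL, and compactor scratch space. A back-of-the-envelope estimator (the 1.5x headroom factor is my assumption, not a Loki recommendation):

```python
def loki_storage_gb(daily_gb: float, retention_days: int, headroom: float = 1.5) -> float:
    """Estimate storage in GB: raw volume x retention, padded for index/WAL/compactor."""
    return round(daily_gb * retention_days * headroom, 1)

print(loki_storage_gb(1, 7))   # ~10.5, in line with the 10Gi PVC default above
print(loki_storage_gb(10, 7))  # ~105 for a heavier cluster
```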
---