实战 OOM Killer：一次容器内存泄漏的排查全过程

Paul 收录于 Linux

2023-09-20 约 1770 字预计阅读 4 分钟

某天凌晨，线上告警：NAS 管理服务莫名重启。查日志没有任何业务错误，服务就像被人拔了电源一样突然死亡。这篇文章记录了从 dmesg 到 pprof 的完整排查过程。

一、故障现象

1.1 诡异的重启

监控显示：

服务每隔 4-6 小时自动重启一次
重启前 CPU 正常（<10%），但内存持续上涨
进程没有打印任何 panic 日志
systemd 日志只有 Main process exited, code=killed, status=9/KILL

status=9 就是 SIGKILL——进程被强制杀死，连善后的机会都没有。

1.2 凶手是谁？

在 Linux 中，能发 SIGKILL 的无非几种情况：

管理员手动 kill -9
systemd 的 WatchdogSec 超时
OOM Killer

我首先排除了前两个（没人手动操作，systemd 配置也没有 Watchdog），那就只剩 OOM Killer 了。

二、OOM Killer 原理

2.1 为什么需要 OOM Killer？

Linux 采用 Overcommit 策略：允许进程申请的内存总和超过物理内存。这样做的好处是提高内存利用率，坏处是可能真的用光。

当内存真的耗尽时，Linux 必须选择一个"替死鬼"杀掉，释放内存，这就是 OOM Killer (Out Of Memory Killer) 的职责。

2.2 谁会被杀？

Linux 为每个进程计算一个 oom_score (0-1000)，分数越高越容易被杀：

1
2
3
4
5
# 查看进程的 oom_score
cat /proc/<PID>/oom_score

# 查看 oom_score_adj (-1000 到 1000，可手动调整)
cat /proc/<PID>/oom_score_adj

计算因素：

内存占用越大，分数越高
运行时间越长，分数越低（杀老进程损失更大）
root 用户的进程分数会打折
oom_score_adj 可以人为加减分

2.3 cgroups 的局部 OOM

在容器场景下，还有一层：cgroups memory 限制

1
2
3
4
5
# 容器的内存限制
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes

# 当前内存使用
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.usage_in_bytes

当容器内存使用达到 limit，会触发容器级别的 OOM，宿主机 dmesg 会记录。

三、定位凶手

3.1 dmesg 分析

1
dmesg | grep -i "oom\|kill" | tail -20

输出：

1
2
3
4
[1842059.123456] nas-agent invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0
[1842059.123457] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=docker-abc123,mems_allowed=0
[1842059.123458] Memory cgroup out of memory: Killed process 12345 (nas-agent) total-vm:2456789kB, anon-rss:1048576kB, file-rss:1234kB
[1842059.123459] oom_reaper: reaped process 12345 (nas-agent), now anon-rss:0kB, file-rss:0kB

关键信息解读：

constraint=CONSTRAINT_MEMCG: 是 cgroup 限制触发的 OOM，不是系统级别
nas-agent: 被杀的是我的服务
anon-rss:1048576kB: 匿名内存占用 1GB（正好是容器限制）

确认了：是我的服务内存泄漏，触及容器 1GB 限制，被 OOM Killer 杀死。

3.2 为什么是 1GB？

查看 systemd service 文件：

1
2
3
[Service]
MemoryMax=1G
MemoryHigh=900M

果然设置了 1GB 限制。这是好习惯（防止单个服务拖垮整机），但也暴露了我的服务有内存泄漏。

四、定位泄漏点

4.1 启用 pprof

Go 程序自带 pprof，只需在 main 中导入：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
import _ "net/http/pprof"

func main() {
    // 启动 pprof HTTP 端点
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    
    // 正常业务逻辑...
}

4.2 观察内存增长

1
2
3
4
5
6
7
8
# 在服务运行 1 小时后采集 heap profile
curl -s http://localhost:6060/debug/pprof/heap > heap1.pprof

# 再等 1 小时
curl -s http://localhost:6060/debug/pprof/heap > heap2.pprof

# 对比差异
go tool pprof -base heap1.pprof heap2.pprof

pprof 交互界面：

1
2
3
4
5
6
(pprof) top 10
Showing nodes accounting for 256MB, 95% of 270MB total
      flat  flat%   sum%        cum   cum%
     200MB 74.07% 74.07%      200MB 74.07%  main.(*FileWatcher).Watch
      30MB 11.11% 85.19%       30MB 11.11%  bufio.NewReaderSize
      ...

找到了：FileWatcher.Watch 函数泄漏了 200MB！

4.3 代码审查

问题代码：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
func (w *FileWatcher) Watch(path string) {
    for {
        // 每次循环都创建新的 reader，但从不关闭！
        file, _ := os.Open(path)
        reader := bufio.NewReader(file)
        
        data, _ := reader.ReadBytes('\n')
        w.process(data)
        
        time.Sleep(time.Second)
        // file 没有 Close！
    }
}

问题：

os.Open 的文件句柄没有 Close，导致 fd 泄漏
bufio.Reader 内部有缓冲区，每次循环都新建，旧的靠 GC 回收但因为 file 没关闭，无法回收

4.4 Goroutine 泄漏检查

顺便检查 goroutine：

1
curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutine.txt

发现还有 goroutine 泄漏：

1
2
3
4
goroutine profile: total 1234
   500 @ 0x43e5e6 0x44a4b7 ...
      main.(*EventBus).Subscribe+0x45
      ...

500 个 goroutine 卡在 Subscribe！查代码发现是 channel 没有接收方导致发送阻塞。

五、修复方案

5.1 修复文件泄漏

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
func (w *FileWatcher) Watch(path string) {
    for {
        func() {
            file, err := os.Open(path)
            if err != nil {
                return
            }
            defer file.Close()  // 确保关闭
            
            reader := bufio.NewReader(file)
            data, _ := reader.ReadBytes('\n')
            w.process(data)
        }()
        
        time.Sleep(time.Second)
    }
}

5.2 修复 Goroutine 泄漏

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
func (bus *EventBus) Subscribe(ctx context.Context, topic string) <-chan Event {
    ch := make(chan Event, 100)  // 带缓冲，防止发送阻塞
    
    go func() {
        defer close(ch)
        for {
            select {
            case <-ctx.Done():
                return  // context 取消时退出
            case event := <-bus.internal[topic]:
                select {
                case ch <- event:
                case <-time.After(time.Second):
                    // 发送超时，丢弃事件，记录日志
                    log.Warn("event dropped due to slow consumer")
                }
            }
        }
    }()
    
    return ch
}

5.3 增加预警

在 Prometheus 中添加告警规则：

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
groups:
- name: memory
  rules:
  - alert: HighMemoryUsage
    expr: process_resident_memory_bytes / container_spec_memory_limit_bytes > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "内存使用超过 80%，可能有泄漏"

六、预防措施

6.1 开发阶段

1
2
3
4
5
6
// 使用 goleak 在测试中检测 goroutine 泄漏
import "go.uber.org/goleak"

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

6.2 部署阶段

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# systemd service 配置
[Service]
MemoryMax=1G
MemoryHigh=900M  # 达到 900M 时开始限速，给预警时间

# 或者 Kubernetes
resources:
  limits:
    memory: 1Gi
  requests:
    memory: 512Mi

6.3 运行阶段

1
2
# 定期采集 pprof 数据（可以用 cron）
0 * * * * curl -s http://localhost:6060/debug/pprof/heap > /var/log/pprof/heap_$(date +\%Y\%m\%d\%H).pprof

七、总结

阶段	关键操作
发现问题	dmesg 确认 OOM Killer
定位原因	pprof heap 对比
代码审查	检查 Close/defer 和 goroutine 生命周期
修复验证	长时间运行测试 + 内存监控
预防措施	告警 + goleak + 资源限制

最大的教训：Go 虽然有 GC，但 GC 不管文件句柄、goroutine 和外部资源。任何涉及 Open、go func() 的地方，都要想清楚"谁来收尾"。

相关文章
Go Daemon 开发：优雅处理 SIGTERM 与热重载
NAS 磁盘健康监控：从 smartctl 到自定义 Agent

目录