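Application log monitoring reports roughly 250 failed connections to Redis per day.
Tracing with strace showed that at the failure times, disk writes took more than 10s, typically between 10 and 15s. The second retry against Redis uses a 10s timeout.
Every operation on this instance is an INCR, and fdatasync blocks the writes.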
strace -Ttt -f -p 11302 -e trace=fdatasync
11309 10:21:31.153900 fdatasync(116) = 0 <0.034295>
11309 10:21:32.078747 fdatasync(116) = 0 <7.592478>
11309 10:21:39.774959 fdatasync(116) = 0 <10.098802>
11309 10:21:49.990623 fdatasync(116) = 0 <2.026147>
11309 10:21:52.129676 fdatasync(116) = 0 <0.002802>
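
To pick the slow calls out of a longer capture, the strace output can be filtered by the duration in angle brackets. The following is a minimal sketch, assuming the output format shown above; the function name slow_fdatasync and the 5s threshold are illustrative choices, not part of the original investigation.

```python
import re

# Matches lines such as: 11309 10:21:32.078747 fdatasync(116) = 0 <7.592478>
LINE_RE = re.compile(
    r"^(?P<tid>\d+)\s+(?P<start>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"fdatasync\((?P<fd>\d+)\)\s+=\s+(?P<ret>-?\d+).*<(?P<secs>\d+\.\d+)>\s*$"
)


def slow_fdatasync(lines, threshold=5.0):
    """Return (start_time, seconds) for each fdatasync call slower than threshold."""
    slow = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and float(match.group("secs")) > threshold:
            slow.append((match.group("start"), float(match.group("secs"))))
    return slow


# The five lines captured above, used as sample input.
SAMPLE = """\
11309 10:21:31.153900 fdatasync(116) = 0 <0.034295>
11309 10:21:32.078747 fdatasync(116) = 0 <7.592478>
11309 10:21:39.774959 fdatasync(116) = 0 <10.098802>
11309 10:21:49.990623 fdatasync(116) = 0 <2.026147>
11309 10:21:52.129676 fdatasync(116) = 0 <0.002802>
""".splitlines()

if __name__ == "__main__":
    for start, secs in slow_fdatasync(SAMPLE):
        print(f"{start} fdatasync took {secs:.3f}s")
```

In practice the strace output would be saved to a file (for example with strace's -o option) and its lines passed in instead of SAMPLE.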
Stopgap:
Raise the timeout to 15s.
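
A rough sketch of why the longer timeout helps: a stall of 10 to 15s outlasts a 10s attempt but not a 15s one. The helper connect_with_retries, its timeouts parameter, and the 3s first-attempt value are hypothetical; the notes do not say which client library or retry schedule the application actually uses.

```python
import socket


def connect_with_retries(host, port, timeouts, connect=socket.create_connection):
    """Try to connect once per entry in timeouts; return the socket or raise the last error."""
    if not timeouts:
        raise ValueError("timeouts must contain at least one value")
    last_error = None
    for timeout in timeouts:
        try:
            return connect((host, port), timeout=timeout)
        except OSError as exc:  # socket.timeout is a subclass of OSError
            last_error = exc
    raise last_error


# Hypothetical schedules: the first value is made up, the second mirrors the
# 10s retry from the notes and the 15s stopgap.
OLD_TIMEOUTS = [3, 10]
NEW_TIMEOUTS = [3, 15]
```

The connect parameter exists only so the retry logic can be exercised without a real server. This only hides stalls shorter than 15s; it does not remove them.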
Root cause:
Currently using the Redis watchdog to capture stack traces for stalls longer than 5s.
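
The Redis software watchdog is enabled at runtime with CONFIG SET watchdog-period, where the period is given in milliseconds. Below is a minimal sketch that sends this command over a plain socket; the helper names encode_command and set_watchdog_period are illustrative, and the 5000 ms value is an assumption derived from the 5s threshold mentioned above.

```python
import socket


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def set_watchdog_period(host, port, period_ms, timeout=5.0):
    """Send CONFIG SET watchdog-period and return the raw bytes of the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(encode_command("CONFIG", "SET", "watchdog-period", period_ms))
        return sock.recv(1024)


if __name__ == "__main__":
    # Address taken from the stack trace in these notes; 5000 ms = 5s (assumed).
    print(set_watchdog_period("10.160.86.216", 6699, 5000))
```

Setting the period back to 0 disables the watchdog once the trace has been captured.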
Stack trace:
[11302 | signal handler] (1499754857)
--- WATCHDOG TIMER EXPIRED ---
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(logStackTrace+0x3e)[0x445ace]
/lib64/libpthread.so.0(write+0x2d)[0x7f19ef3b06fd]
/lib64/libpthread.so.0(+0xf710)[0x7f19ef3b1710]
/lib64/libpthread.so.0(write+0x2d)[0x7f19ef3b06fd]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(flushAppendOnlyFile+0x4e)[0x44116e]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(serverCron+0x3b7)[0x41bb17]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(aeProcessEvents+0x1e9)[0x416b69]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(aeMain+0x2b)[0x416deb]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699(main+0x31d)[0x41e49d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f19ef02cd5d]
/usr/local/bin/redis-server-2.8 10.160.86.216:6699[0x415bd9]
[11302 | signal handler] (1499754857) --------
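The trace shows the main thread (11302) stuck in write() inside flushAppendOnlyFile, while strace shows fdatasync on thread 11309 exceeding 10s at certain moments.
So the cause was blocked disk writes. Replacing the mechanical hard disk with an SSD resolved the problem.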
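
To compare disks before and after such a swap, fdatasync latency can be sampled directly. This is a minimal probe, assuming a directory on the disk under test; the function name fdatasync_latencies and the 4 KiB payload are illustrative, and an idle probe like this will not necessarily reproduce stalls that only appear under production load.

```python
import os
import tempfile
import time


def fdatasync_latencies(directory, rounds=5, payload=b"x" * 4096):
    """Append payload to a temp file and time each sync; returns a list of seconds."""
    # os.fdatasync is not available on every platform; fall back to os.fsync.
    sync = getattr(os, "fdatasync", os.fsync)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.monotonic()
            sync(fd)
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Replace the temp directory with a path on the disk that holds the AOF file.
    for secs in fdatasync_latencies(tempfile.gettempdir()):
        print(f"sync took {secs:.6f}s")
```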