[LU-1854] system crash when reading the file /proc/fs/lustre/ost/OSS/ost_create/req_history Created: 07/Sep/12  Updated: 18/Apr/13  Resolved: 17/Sep/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Gregoire Pichon Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: crash, procfs
Environment:

lustre 2.2.93
bullxlinux distribution (based on redhat 6.2)
kernel 2.6.32-220


Severity: 3
Rank (Obsolete): 4442

 Description   

The lustre version is 2.2.93.

When reading the file /proc/fs/lustre/ost/OSS/ost_create/req_history, the system crashed with LBUG ASSERTION( !list_empty(&svcpt->scp_hist_reqs).

Here are some information from the core dump

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.23.1.bl6.Bull.28.8.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2012-09-07-09:51:01/vmcore  [PARTIAL DUMP]
        CPUS: 16
        DATE: Fri Sep  7 09:50:45 2012
      UPTIME: 1 days, 19:58:49
LOAD AVERAGE: 0.05, 0.05, 0.05
       TASKS: 1006
    NODENAME: mo88
     RELEASE: 2.6.32-220.23.1.bl6.Bull.28.8.x86_64
     VERSION: #1 SMP Thu Jul 5 17:34:18 CEST 2012
     MACHINE: x86_64  (2199 Mhz)
      MEMORY: 32 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 29617
     COMMAND: "cat"
        TASK: ffff8806e65437d0  [THREAD_INFO: ffff8804dbf1c000]
         CPU: 9
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 29617  TASK: ffff8806e65437d0  CPU: 9   COMMAND: "cat"
 #0 [ffff8804dbf1fbf0] machine_kexec at ffffffff8102895b
 #1 [ffff8804dbf1fc50] crash_kexec at ffffffff810a4622
 #2 [ffff8804dbf1fd20] panic at ffffffff81484647
 #3 [ffff8804dbf1fda0] lbug_with_loc at ffffffffa0680f6b [libcfs]
 #4 [ffff8804dbf1fdc0] ptlrpc_lprocfs_svc_req_history_seek at ffffffffa0c30104 [ptlrpc]
 #5 [ffff8804dbf1fdd0] ptlrpc_lprocfs_svc_req_history_next at ffffffffa0c301e1 [ptlrpc]
 #6 [ffff8804dbf1fe20] seq_read at ffffffff81185e9a
 #7 [ffff8804dbf1fea0] proc_reg_read at ffffffff811c84ee
 #8 [ffff8804dbf1fef0] vfs_read at ffffffff81163a15
 #9 [ffff8804dbf1ff30] sys_read at ffffffff81163b51
#10 [ffff8804dbf1ff80] system_call_fastpath at ffffffff810030f2
    RIP: 0000003dc64d83f0  RSP: 00007fff6cb0c9e0  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: ffffffff810030f2  RCX: 00000000024a7030
    RDX: 0000000000008000  RSI: 000000000249f000  RDI: 0000000000000003
    RBP: 000000000249f000   R8: 0000000000000003   R9: 0000000001000000
    R10: 0000000000008fff  R11: 0000000000000246  R12: ffffffffffff8000
    R13: 0000000000000003  R14: 0000000000008000  R15: 0000000000000003
    ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b

crash> dmesg | tail -n 50
Lustre: fsperf-OST0005: Now serving fsperf-OST0005 on /dev/dm-11 with recovery enabled
Lustre: 27386:0:(ldlm_lib.c:2110:target_recovery_init()) RECOVERY: service fsperf-OST000a, 1 recoverable clients, last_transno 1340929
Lustre: 27386:0:(ldlm_lib.c:2110:target_recovery_init()) Skipped 3 previous similar messages
Lustre: fsperf-OST000a: Now serving fsperf-OST000a on /dev/dm-26 with recovery enabled
Lustre: Skipped 3 previous similar messages
Lustre: 27419:0:(ldlm_lib.c:2110:target_recovery_init()) RECOVERY: service fsperf-OST0001, 1 recoverable clients, last_transno 1340929
Lustre: 27419:0:(ldlm_lib.c:2110:target_recovery_init()) Skipped 6 previous similar messages
Lustre: fsperf-OST0001: Now serving fsperf-OST0001 on /dev/dm-16 with recovery enabled
Lustre: Skipped 6 previous similar messages
LustreError: 137-5: UUID 'fsperf-OST000f_UUID' is not available for connect (no target)
Lustre: fsperf-OST0001: Will be in recovery for at least 5:00, or until 1 client reconnects
Lustre: fsperf-OST000b: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Lustre: fsperf-OST000b: received MDS connection from 32.0.0.39@o2ib1
Lustre: Skipped 14 previous similar messages
Lustre: fsperf-OST0006: Recovery over after 0:01, of 1 clients 1 recovered and 0 were evicted.
Lustre: fsperf-OST000e: received MDS connection from 32.0.0.39@o2ib1
Lustre: Skipped 13 previous similar messages
Lustre: Echo OBD driver; http://www.lustre.org/
mlx4_core 0000:04:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
mlx4_core 0000:82:00.0: vpd r/w failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update.
process `cat' is using deprecated sysctl (syscall) net.ipv6.neigh.default.retrans_time; Use net.ipv6.neigh.default.retrans_time_ms instead.
LustreError: 29617:0:(lproc_ptlrpc.c:431:ptlrpc_lprocfs_svc_req_history_seek()) ASSERTION( !list_empty(&svcpt->scp_hist_reqs) ) failed: 
LustreError: 29617:0:(lproc_ptlrpc.c:431:ptlrpc_lprocfs_svc_req_history_seek()) LBUG
Pid: 29617, comm: cat

Call Trace:
 [<ffffffffa0680905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0680f17>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0c30104>] ptlrpc_lprocfs_svc_req_history_seek+0xf4/0x100 [ptlrpc]
 [<ffffffffa0c301e1>] ptlrpc_lprocfs_svc_req_history_next+0x71/0x1b0 [ptlrpc]
 [<ffffffff81185e9a>] seq_read+0x24a/0x3f0
 [<ffffffff811c84ee>] proc_reg_read+0x7e/0xc0
 [<ffffffff81163a15>] vfs_read+0xb5/0x1a0
 [<ffffffff810c0e1a>] ? audit_syscall_entry+0x26a/0x290
 [<ffffffff81163b51>] sys_read+0x51/0x90
 [<ffffffff810030f2>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 29617, comm: cat Not tainted 2.6.32-220.23.1.bl6.Bull.28.8.x86_64 #1
Call Trace:
 [<ffffffff81484640>] ? panic+0x78/0x143
 [<ffffffffa0680f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa0c30104>] ? ptlrpc_lprocfs_svc_req_history_seek+0xf4/0x100 [ptlrpc]
 [<ffffffffa0c301e1>] ? ptlrpc_lprocfs_svc_req_history_next+0x71/0x1b0 [ptlrpc]
 [<ffffffff81185e9a>] ? seq_read+0x24a/0x3f0
 [<ffffffff811c84ee>] ? proc_reg_read+0x7e/0xc0
 [<ffffffff81163a15>] ? vfs_read+0xb5/0x1a0
 [<ffffffff810c0e1a>] ? audit_syscall_entry+0x26a/0x290
 [<ffffffff81163b51>] ? sys_read+0x51/0x90
 [<ffffffff810030f2>] ? system_call_fastpath+0x16/0x1b

crash> files
PID: 29617  TASK: ffff8806e65437d0  CPU: 9   COMMAND: "cat"
ROOT: /    CWD: /root
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff88045cba6180 ffff880217b5b800 ffff88046a2e89c8 FIFO 
  1 ffff88045cba6600 ffff880217b5bbc0 ffff8802ace6e148 FIFO 
  2 ffff88045cba6cc0 ffff880217b5b380 ffff88023fd6a048 FIFO 
  3 ffff880872bc5240 ffff880519514480 ffff880874370d38 REG  /proc/fs/lustre/ost/OSS/ost_create/req_history


I can provide additional information from the dump if needed.



 Comments   
Comment by Andreas Dilger [ 07/Sep/12 ]

Probably related to changes from SMP scaling

Comment by Liang Zhen (Inactive) [ 10/Sep/12 ]

I suspect this bug is there for long time: ptlrpc_server_drop_request() will never track culled request seq if the req::rq_reply_state is preallocated, it will just unlink the request from history list.

here is the patch: http://review.whamcloud.com/#change,3916

Comment by Liang Zhen (Inactive) [ 17/Sep/12 ]

patch landed

Generated at Sat Feb 10 01:20:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.