[LU-2362] MDS crash in llog_process_thread Created: 20/Nov/12  Updated: 20/Nov/12  Resolved: 20/Nov/12

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Prakash Surya (Inactive) Assignee: Li Wei (Inactive)
Resolution: Duplicate Votes: 0
Labels: topsequoia

Issue Links:
Related
is related to LU-2109 __llog_process_thread() GPF Resolved
Severity: 3
Rank (Obsolete): 5624

 Description   

On reboot of the MDS this morning, we hit the following crash:

Lustre: lstest-MDT0000: Recovery over after 5:00, of 449 clients 194 recovered and 255 were evicted.
Lustre: lstest-MDT0000: Client a449214e-338a-17fc-d926-5a3a273629af (at 172.20.17.191@o2ib500) reconnecting
Lustre: lstest-MDT0000: Client cbe0af34-1d34-affd-7602-97be89da80f8 (at 172.20.17.89@o2ib500) reconnecting
Lustre: Skipped 23 previous similar messages
Lustre: lstest-MDT0000: Client 3f7011c2-9552-065c-9d01-46c12930b0a3 (at 172.20.17.63@o2ib500) reconnecting
Lustre: Skipped 100 previous similar messages
LustreError: 11-0: lstest-OST0197-osc-MDT0000: Communicating with 172.20.3.8@o2ib500, operation ost_connect failed with -19
LustreError: Skipped 33 previous similar messages
LustreError: 11-0: lstest-MDT0000-osp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11
LustreError: Skipped 63 previous similar messages
Lustre: lstest-OST01fd-osc-MDT0000: Connection restored to lstest-OST01fd (at 172.20.3.109@o2ib500)
Lustre: lstest-OST01a3-osc-MDT0000: Connection restored to lstest-OST01a3 (at 172.20.3.19@o2ib500)
Lustre: lstest-OST01ca-osc-MDT0000: Connection restored to lstest-OST01ca (at 172.20.3.58@o2ib500)
Lustre: Skipped 1 previous similar message
Lustre: lstest-OST01cf-osc-MDT0000: Connection restored to lstest-OST01cf (at 172.20.3.63@o2ib500)
Lustre: Skipped 2 previous similar messages
Lustre: lstest-OST01ec-osc-MDT0000: Connection restored to lstest-OST01ec (at 172.20.3.92@o2ib500)
Lustre: Skipped 11 previous similar messages
Lustre: 32886:0:(llog.c:92:llog_free_handle()) Still busy: 2: 0x8d58:0x1:0: 64767 43517 1801984 1
Pid: 32886, comm: osp-syn-461


Call Trace:
 [<ffffffffa05be965>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0716e81>] llog_free_handle+0xb1/0x3c0 [obdclass]
 [<ffffffffa07171ed>] llog_close+0x5d/0x190 [obdclass]
 [<ffffffffa071cdf7>] llog_cat_cancel_records+0x137/0x360 [obdclass]
 [<ffffffffa0d30408>] osp_sync_process_committed+0x238/0x760 [osp]
 [<ffffffffa09259a7>] ? ptlrpcd_add_req+0x187/0x2e0 [ptlrpc]
 [<ffffffffa0d309c4>] osp_sync_process_queues+0x94/0x11c0 [osp]
 [<ffffffff8105ea30>] ? default_wake_function+0x0/0x20
 [<ffffffffa0718ecb>] llog_process_thread+0x8fb/0xe10 [obdclass]
 [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa071cc13>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
 [<ffffffffa0718ecb>] llog_process_thread+0x8fb/0xe10 [obdclass]
 [<ffffffffa071c940>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
 [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa071b749>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
 [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [<ffffffffa071b969>] llog_cat_process+0x19/0x20 [obdclass]
 [<ffffffffa05bf83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
 [<ffffffffa0d32aaa>] osp_sync_thread+0x22a/0x700 [osp]
 [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
 [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/pci0000:80/0000:80:02.2/0000:83:00.0/host7/port-7:1/expander-7:1/port-7:1:10/end_device-7:1:10/target7:0:30/7:0:30:0/state
CPU 18 
Modules linked in: osp(U) mdt(U) mdd(U) fid(U) fld(U) lod(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) acpi_cpufreq freq_table mperf ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath dm_mod vhost_net macvtap macvlan tun kvm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate sg ses enclosure sd_mod crc_t10dif isci libsas wmi mpt2sas scsi_transport_sas raid_class sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ahci ioatdma shpchp ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core igb dca [last unloaded: cpufreq_ondemand]

Pid: 32886, comm: osp-syn-461
 Tainted: P        W  ----------------   2.6.32-220.23.1.2chaos.ch5.x86_64 #1 appro 2620x-in/S2600GZ
RIP: 0010:[<ffffffffa071889c>]  [<ffffffffa071889c>] llog_process_thread+0x2cc/0xe10 [obdclass]
RSP: 0018:ffff881fa1b6db50  EFLAGS: 00010206
RAX: 5a5a5a5a5a5a5a5a RBX: 000000000000aa01 RCX: 0000000000000000
RDX: ffff881017306940 RSI: ffff881fa1c28000 RDI: ffff880f81308000
RBP: ffff881fa1b6dbf0 R08: ffff880fffeed800 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880f7f1bc058
R13: 000000000000fcff R14: ffff880f81306000 R15: ffff881fa1b6de80
FS:  00007ffff7fdc700(0000) GS:ffff880060740000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaafa4450 CR3: 0000000001a85000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process osp-syn-461
 (pid: 32886, threadinfo ffff881fa1b6c000, task ffff881fa1988ae0)
Stack:
 ffff880f00002000 0000000101e5b370 ffff880f00000000 ffff880f81306001
<0> 0000a9fe00000000 0000000000000000 0000000000000000 00000000002aa000
<0> 0000fd00a1b6dbd0 ffff880f81308000 ffff881017306940 ffff880fffeed800
Call Trace:
 [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa071cc13>] llog_cat_process_cb+0x2d3/0x380 [obdclass]
 [<ffffffffa0718ecb>] llog_process_thread+0x8fb/0xe10 [obdclass]
 [<ffffffffa071c940>] ? llog_cat_process_cb+0x0/0x380 [obdclass]
 [<ffffffffa071ac0d>] llog_process_or_fork+0x12d/0x660 [obdclass]
 [<ffffffffa071b749>] llog_cat_process_or_fork+0x89/0x290 [obdclass]
 [<ffffffff8104cab9>] ? __wake_up_common+0x59/0x90
 [<ffffffffa0d30930>] ? osp_sync_process_queues+0x0/0x11c0 [osp]
 [<ffffffffa071b969>] llog_cat_process+0x19/0x20 [obdclass]
 [<ffffffffa05bf83a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
 [<ffffffffa0d32aaa>] osp_sync_thread+0x22a/0x700 [osp]
 [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
 [<ffffffff8100c14a>] child_rip+0xa/0x20
 [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
 [<ffffffffa0d32880>] ? osp_sync_thread+0x0/0x700 [osp]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20
Code: 78 0c 00 01 00 00 00 e8 33 6d eb ff 48 83 7d b8 00 0f 84 f8 03 00 00 4c 8b 45 b8 49 8b 80 b0 00 00 00 48 85 c0 0f 84 e4 03 00 00 <48> 8b 40 08 48 85 c0 0f 84 67 02 00 00 4d 89 f1 c7 04 24 00 20 
RIP  [<ffffffffa071889c>] llog_process_thread+0x2cc/0xe10 [obdclass]
 RSP <ffff881fa1b6db50>
crash> bt
PID: 32886  TASK: ffff881fa1988ae0  CPU: 18  COMMAND: "osp-syn-461
"
 #0 [ffff881fa1b6d8e0] machine_kexec at ffffffff8103216b
 #1 [ffff881fa1b6d940] crash_kexec at ffffffff810b8d12
 #2 [ffff881fa1b6da10] oops_end at ffffffff814f2c00
 #3 [ffff881fa1b6da40] die at ffffffff8100f26b
 #4 [ffff881fa1b6da70] do_general_protection at ffffffff814f2792
 #5 [ffff881fa1b6daa0] general_protection at ffffffff814f1f65
    [exception RIP: llog_process_thread+716]
    RIP: ffffffffa071889c  RSP: ffff881fa1b6db50  RFLAGS: 00010206
    RAX: 5a5a5a5a5a5a5a5a  RBX: 000000000000aa01  RCX: 0000000000000000
    RDX: ffff881017306940  RSI: ffff881fa1c28000  RDI: ffff880f81308000
    RBP: ffff881fa1b6dbf0   R8: ffff880fffeed800   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff880f7f1bc058
    R13: 000000000000fcff  R14: ffff880f81306000  R15: ffff881fa1b6de80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff881fa1b6dbf8] llog_process_or_fork at ffffffffa071ac0d [obdclass]
 #7 [ffff881fa1b6dc48] llog_cat_process_cb at ffffffffa071cc13 [obdclass]
 #8 [ffff881fa1b6dc88] llog_process_thread at ffffffffa0718ecb [obdclass]
 #9 [ffff881fa1b6dd38] llog_process_or_fork at ffffffffa071ac0d [obdclass]
#10 [ffff881fa1b6dd88] llog_cat_process_or_fork at ffffffffa071b749 [obdclass]
#11 [ffff881fa1b6de18] llog_cat_process at ffffffffa071b969 [obdclass]
#12 [ffff881fa1b6de38] osp_sync_thread at ffffffffa0d32aaa [osp]
#13 [ffff881fa1b6df48] kernel_thread at ffffffff8100c14a
(gdb) l *llog_process_thread+716
0x58cc is in llog_process_thread (/builddir/build/BUILD/lustre-2.3.56/lustre/include/lustre_log.h:662).
657     /builddir/build/BUILD/lustre-2.3.56/lustre/include/lustre_log.h: No such file or directory.
        in /builddir/build/BUILD/lustre-2.3.56/lustre/include/lustre_log.h
lustre/include/lustre_log.h:

649 static inline int llog_next_block(const struct lu_env *env,                     
650                                   struct llog_handle *loghandle, int *cur_idx,  
651                                   int next_idx, __u64 *cur_offset, void *buf,   
652                                   int len)                                      
653 {                                                                               
654         struct llog_operations *lop;                                            
655         int rc;                                                                 
656                                                                                 
657         ENTRY;                                                                  
658                                                                                 
659         rc = llog_handle2ops(loghandle, &lop);                                  
660         if (rc)                                                                 
661                 RETURN(rc);                                                     
662         if (lop->lop_next_block == NULL)                                        
663                 RETURN(-EOPNOTSUPP);                                            
664                                                                                 
665         rc = lop->lop_next_block(env, loghandle, cur_idx, next_idx,             
666                                  cur_offset, buf, len);                         
667         RETURN(rc);                                                             
668 }     

This is probably related to LU-2109.



 Comments   
Comment by Li Wei (Inactive) [ 20/Nov/12 ]

Right, this should be a symptom of LU-2109. I'm closing it as dup.

Generated at Sat Feb 10 01:24:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.