Details
- Bug
- Resolution: Fixed
- Minor
Description
With a Lustre 2.5-based version, while trying to find a way to reproduce a leaked cl_object reference situation (likely to occur upon client eviction from an OST), I triggered the following LBUG by running the command/loop "while true; do echo <Client-UUID> > /proc/fs/lustre/obdfilter/<OST>/evict_client; done" on the OSS concerned:
Lustre: 702:0:(genops.c:1521:obd_export_evict_by_uuid()) lustre-OST0000: evicting c12f1e59-5f4d-9f75-bd66-7fad18ddd33f at adminstrative request
LustreError: 702:0:(genops.c:1518:obd_export_evict_by_uuid()) lustre-OST0000: can't disconnect c12f1e59-5f4d-9f75-bd66-7fad18ddd33f: no exports found
LustreError: 702:0:(tgt_lastrcvd.c:656:tgt_client_del()) lustre-OST0000: client 4294967295: bit already clear in bitmap!!
LustreError: 702:0:(tgt_lastrcvd.c:657:tgt_client_del()) LBUG
Pid: 702, comm: bash
Call Trace:
 [<ffffffffa0531895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0531e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa12e30ef>] tgt_client_del+0x50f/0x510 [ptlrpc]
 [<ffffffffa1995ace>] ? ofd_grant_discard+0x3e/0x1c0 [ofd]
 [<ffffffffa197b87d>] ofd_obd_disconnect+0xfd/0x1f0 [ofd]
 [<ffffffffa1088a2d>] class_fail_export+0x23d/0x540 [obdclass]
 [<ffffffffa1088e72>] obd_export_evict_by_uuid+0x142/0x240 [obdclass]
 [<ffffffffa0541a31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa12b7613>] lprocfs_wr_evict_client+0x2d3/0x3b0 [ptlrpc]
 [<ffffffffa10919eb>] lprocfs_fops_write+0x7b/0xa0 [obdclass]
 [<ffffffff811fa65e>] proc_reg_write+0x7e/0xc0
 [<ffffffff8118e7f8>] vfs_write+0xb8/0x1a0
 [<ffffffff8118f1c1>] sys_write+0x51/0x90
 [<ffffffff810e608e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
Kernel panic - not syncing: LBUG
Pid: 702, comm: bash Not tainted 2.6.32.504.30.3.el6_lustre #1
Call Trace:
 [<ffffffff8152a81c>] ? panic+0xa7/0x16f
 [<ffffffffa0531eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa12e30ef>] ? tgt_client_del+0x50f/0x510 [ptlrpc]
 [<ffffffffa1995ace>] ? ofd_grant_discard+0x3e/0x1c0 [ofd]
 [<ffffffffa197b87d>] ? ofd_obd_disconnect+0xfd/0x1f0 [ofd]
 [<ffffffffa1088a2d>] ? class_fail_export+0x23d/0x540 [obdclass]
 [<ffffffffa1088e72>] ? obd_export_evict_by_uuid+0x142/0x240 [obdclass]
 [<ffffffffa0541a31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa12b7613>] ? lprocfs_wr_evict_client+0x2d3/0x3b0 [ptlrpc]
 [<ffffffffa10919eb>] ? lprocfs_fops_write+0x7b/0xa0 [obdclass]
 [<ffffffff811fa65e>] ? proc_reg_write+0x7e/0xc0
 [<ffffffff8118e7f8>] ? vfs_write+0xb8/0x1a0
 [<ffffffff8118f1c1>] ? sys_write+0x51/0x90
 [<ffffffff810e608e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
The problem seems to be caused by a race between the client reconnection and forced eviction paths: the new export can already be found by the client UUID, but the client's last_rcvd index has not yet been assigned (still -1, which prints as 4294967295 when treated as unsigned), as the full Lustre debug trace from the crash dump also seems to indicate:
00000001:00000001:13.0:1463584027.765801:0:702:0:(tgt_lastrcvd.c:638:tgt_client_del()) Process entered
00000001:00000040:13.0:1463584027.765802:0:702:0:(tgt_lastrcvd.c:650:tgt_client_del()) lustre-OST0000: del client at idx 4294967295, off 0, UUID 'c12f1e59-5f4d-9f75-bd66-7fad18ddd33f'
00000001:00000001:20.0:1463584027.765802:0:27360:0:(tgt_lastrcvd.c:536:tgt_client_new()) Process entered
00000001:00000040:20.0:1463584027.765803:0:27360:0:(tgt_lastrcvd.c:565:tgt_client_new()) lustre-OST0000: client at idx 3 with UUID 'c12f1e59-5f4d-9f75-bd66-7fad18ddd33f' added
00000001:00020000:13.0:1463584027.765804:0:702:0:(tgt_lastrcvd.c:656:tgt_client_del()) lustre-OST0000: client 4294967295: bit already clear in bitmap!!
00000001:00000040:20.0:1463584027.765804:0:27360:0:(tgt_lastrcvd.c:575:tgt_client_new()) lustre-OST0000: new client at index 3 (8576) with UUID 'c12f1e59-5f4d-9f75-bd66-7fad18ddd33f'
According to the source code, the problem is also present in current master.
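One possible way to close the window (a hypothetical sketch only, not the actual Lustre patch; tgt_sim, exp_sim, client_new() and client_del() are illustrative names) is to serialize slot assignment and slot release under the same lock, and have the del path bail out cleanly when no last_rcvd slot was ever assigned, instead of asserting:

```c
#include <pthread.h>

/* Hypothetical fix sketch: one lock guards both the last_rcvd bitmap
 * and the per-export index, and an unassigned index (-1) is treated
 * as "nothing to release" rather than an LBUG. */

struct tgt_sim {
	pthread_mutex_t lock;
	unsigned long   bitmap;
};

struct exp_sim {
	int lr_idx;   /* -1 means "no last_rcvd slot assigned yet" */
};

static int client_new(struct tgt_sim *t, struct exp_sim *e, int idx)
{
	pthread_mutex_lock(&t->lock);
	t->bitmap |= 1UL << idx;
	e->lr_idx = idx;
	pthread_mutex_unlock(&t->lock);
	return 0;
}

static int client_del(struct tgt_sim *t, struct exp_sim *e)
{
	int rc = 0;

	pthread_mutex_lock(&t->lock);
	if (e->lr_idx < 0) {
		rc = -2;  /* ENOENT-like: eviction raced ahead, no slot */
	} else {
		t->bitmap &= ~(1UL << e->lr_idx);
		e->lr_idx = -1;
	}
	pthread_mutex_unlock(&t->lock);
	return rc;
}
```

With this shape, the eviction-before-assignment ordering from the trace above returns an error to the caller instead of tripping the bitmap assertion; whether the real fix takes this form or instead prevents the export from being found before tgt_client_new() completes is a design choice for the patch.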