[LU-8165] (tgt_lastrcvd.c:656:tgt_client_del()) lustre-OST0000: client 4294967295: bit already clear in bitmap!! Created: 19/May/16  Updated: 10/Sep/16  Resolved: 10/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: Bruno Faccini (Inactive) Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With a Lustre 2.5 based version, when trying to find a way to reproduce a leaked cl_object reference situation, likely to occur upon Client eviction from OST, I have triggered the following LBUG when running "while true; do echo <Client-UUID> > /proc/fs/lustre/obdfilter/<OST>/evict_client; done" cmd/loop on the concerned OSS :

Lustre: 702:0:(genops.c:1521:obd_export_evict_by_uuid()) lustre-OST0000: evicting c12f1e59-5f4d-9f75-bd66-7fad18ddd33f at adminstrative request
LustreError: 702:0:(genops.c:1518:obd_export_evict_by_uuid()) lustre-OST0000: can't disconnect c12f1e59-5f4d-9f75-bd66-7fad18ddd33f: no exports found
LustreError: 702:0:(tgt_lastrcvd.c:656:tgt_client_del()) lustre-OST0000: client 4294967295: bit already clear in bitmap!!
LustreError: 702:0:(tgt_lastrcvd.c:657:tgt_client_del()) LBUG
Pid: 702, comm: bash

Call Trace:
 [<ffffffffa0531895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0531e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa12e30ef>] tgt_client_del+0x50f/0x510 [ptlrpc]
 [<ffffffffa1995ace>] ? ofd_grant_discard+0x3e/0x1c0 [ofd]
 [<ffffffffa197b87d>] ofd_obd_disconnect+0xfd/0x1f0 [ofd]
 [<ffffffffa1088a2d>] class_fail_export+0x23d/0x540 [obdclass]
 [<ffffffffa1088e72>] obd_export_evict_by_uuid+0x142/0x240 [obdclass]
 [<ffffffffa0541a31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa12b7613>] lprocfs_wr_evict_client+0x2d3/0x3b0 [ptlrpc]
 [<ffffffffa10919eb>] lprocfs_fops_write+0x7b/0xa0 [obdclass]
 [<ffffffff811fa65e>] proc_reg_write+0x7e/0xc0
 [<ffffffff8118e7f8>] vfs_write+0xb8/0x1a0
 [<ffffffff8118f1c1>] sys_write+0x51/0x90
 [<ffffffff810e608e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Kernel panic - not syncing: LBUG
Pid: 702, comm: bash Not tainted 2.6.32.504.30.3.el6_lustre #1
Call Trace:
 [<ffffffff8152a81c>] ? panic+0xa7/0x16f
 [<ffffffffa0531eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
 [<ffffffffa12e30ef>] ? tgt_client_del+0x50f/0x510 [ptlrpc]
 [<ffffffffa1995ace>] ? ofd_grant_discard+0x3e/0x1c0 [ofd]
 [<ffffffffa197b87d>] ? ofd_obd_disconnect+0xfd/0x1f0 [ofd]
 [<ffffffffa1088a2d>] ? class_fail_export+0x23d/0x540 [obdclass]
 [<ffffffffa1088e72>] ? obd_export_evict_by_uuid+0x142/0x240 [obdclass]
 [<ffffffffa0541a31>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffffa12b7613>] ? lprocfs_wr_evict_client+0x2d3/0x3b0 [ptlrpc]
 [<ffffffffa10919eb>] ? lprocfs_fops_write+0x7b/0xa0 [obdclass]
 [<ffffffff811fa65e>] ? proc_reg_write+0x7e/0xc0
 [<ffffffff8118e7f8>] ? vfs_write+0xb8/0x1a0
 [<ffffffff8118f1c1>] ? sys_write+0x51/0x90
 [<ffffffff810e608e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b

Problem seems to occur due to a race between Client reconnection and forced eviction processes, because new export can be found using Client uuid but Client's last_rcvd index has not already been assigned (-1), and as Lustre full debug trace from crash-dump also seems to indicate :

00000001:00000001:13.0:1463584027.765801:0:702:0:(tgt_lastrcvd.c:638:tgt_client_del()) Process entered
00000001:00000040:13.0:1463584027.765802:0:702:0:(tgt_lastrcvd.c:650:tgt_client_del()) lustre-OST0000: del client at idx 4294967295, off 0, UUID 'c12f1e59-5f4d-9f75-bd66-7fad18ddd33f'
00000001:00000001:20.0:1463584027.765802:0:27360:0:(tgt_lastrcvd.c:536:tgt_client_new()) Process entered
00000001:00000040:20.0:1463584027.765803:0:27360:0:(tgt_lastrcvd.c:565:tgt_client_new()) lustre-OST0000: client at idx 3 with UUID 'c12f1e59-5f4d-9f75-bd66-7fad18ddd33f' added
00000001:00020000:13.0:1463584027.765804:0:702:0:(tgt_lastrcvd.c:656:tgt_client_del()) lustre-OST0000: client 4294967295: bit already clear in bitmap!!
00000001:00000040:20.0:1463584027.765804:0:27360:0:(tgt_lastrcvd.c:575:tgt_client_new()) lustre-OST0000: new client at index 3 (8576) with UUID 'c12f1e59-5f4d-9f75-bd66-7fad18ddd33f'

According to the source code, problem is also present in current master.



 Comments   
Comment by Gerrit Updater [ 19/May/16 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/20328
Subject: LU-8165 target: detect race by checking last_rcvd slot index
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e53872bb83410bbe7ec8e4f24af2c90daf65b16d

Comment by Gerrit Updater [ 10/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20328/
Subject: LU-8165 target: detect race by checking last_rcvd slot index
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d9869f25f25583da5dadd21c6f99d8204a041321

Comment by Peter Jones [ 10/Sep/16 ]

Landed for 2.9

Generated at Sat Feb 10 02:15:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.