[LU-8990] Failback LBUG lod_device_free() ASSERTION( atomic_read(&lu->ld_ref) == 0 ) Created: 05/Jan/17 Updated: 03/Nov/18 Resolved: 27/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Cliff White (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Environment: | Soak cluster lustre: 2.9.51_4_g39af202 - tip of master on 12/30 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Description |
|
System completed failover of lola-8 to lola-9 (MDS failover)
<4>Lustre: Failing over soaked-MDT0000
<3>LustreError: 34433:0:(osp_precreate.c:912:osp_precreate_cleanup_orphans()) soaked-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -5
<3>LustreError: 34433:0:(osp_precreate.c:912:osp_precreate_cleanup_orphans()) Skipped 10 previous similar messages
<6>Lustre: soaked-MDT0000: Not available for connect from 192.168.1.126@o2ib100 (stopping)
<6>Lustre: Skipped 3 previous similar messages
<3>LustreError: 34664:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0000-mdtlov: statfs: rc = -108
<3>LustreError: 34664:0:(lod_qos.c:208:lod_statfs_and_check()) Skipped 43 previous similar messages
<3>LustreError: 34700:0:(ldlm_resource.c:882:ldlm_resource_complain()) mdt-soaked-MDT0000_UUID: namespace resource [0x20002ac2a:0x1fbe:0x0].0xe8519c29 (ffff8803f859fec0) refcount nonzero (1) after lock cleanup; forcing cleanup.
<3>LustreError: 34700:0:(ldlm_resource.c:882:ldlm_resource_complain()) Skipped 5 previous similar messages
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac2a:0x1fbe:0x0].0xe8519c29 (ffff8803f859fec0) refcount = 2
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 34700:0:(ldlm_resource.c:1469:ldlm_resource_dump()) ### ### ns: mdt-soaked-MDT0000_UUID lock: ffff8804001e4b40/0xda913c453295768d lrc: 2/0,1 mode: PW/PW res: [0x20002ac2a:0x1fbe:0x0].0xe8519c29 bits 0x2 rrc: 2 type: IBT flags: 0x40316400000000 nid: local remote: 0x0 expref: -99 pid: 34549 timeout: 0 lvb_type: 0
<3>LustreError: 34700:0:(ldlm_resource.c:1469:ldlm_resource_dump()) Skipped 4 previous similar messages
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac14:0xd2e4:0x0].0x786cbc9c (ffff8807f5e87b00) refcount = 5
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 34700:0:(ldlm_resource.c:1484:ldlm_resource_dump()) Waiting locks:
<3>LustreError: 34700:0:(ldlm_resource.c:1486:ldlm_resource_dump()) ### ### ns: mdt-soaked-MDT0000_UUID lock: ffff880836be2b80/0xda913c4532957639 lrc: 2/0,1 mode: --/PW res: [0x20002ac14:0xd2e4:0x0].0x786cbc9c bits 0x2 rrc: 5 type: IBT flags: 0x40316400000020 nid: local remote: 0x0 expref: -99 pid: 34463 timeout: 0 lvb_type: 0
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac2a:0x1fbe:0x0].0xc7d60cfe (ffff8807f5e87a40) refcount = 2
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac2a:0x1fbe:0x0].0x0 (ffff88041165ac00) refcount = 9
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac2a:0x1fbe:0x0].0x78a9f709 (ffff880401cd9b40) refcount = 2
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac2a:0x1fbe:0x0].0xd5a94b89 (ffff8803f859fbc0) refcount = 2
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 34700:0:(ldlm_resource.c:1463:ldlm_resource_dump()) --- Resource: [0x20002ac14:0xd2e4:0x0].0x0 (ffff8807fbdccf00) refcount = 9
<3>LustreError: 34700:0:(ldlm_resource.c:1466:ldlm_resource_dump()) Granted locks (in reverse order):
<3>LustreError: 9135:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8803f0bb0c80 x1555552176176736/t0(0) o13->soaked-OST000f-osc-MDT0000@192.168.1.105@o2ib10:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
<3>LustreError: 9135:0:(client.c:1166:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
<6>Lustre: soaked-MDT0000: Not available for connect from 192.168.1.107@o2ib10 (stopping)
<6>Lustre: Skipped 7 previous similar messages
<6>Lustre: soaked-MDT0000: Not available for connect from 192.168.1.117@o2ib100 (stopping)
<6>Lustre: Skipped 15 previous similar messages
<6>Lustre: soaked-MDT0000: Not available for connect from 192.168.1.130@o2ib100 (stopping)
<6>Lustre: Skipped 6 previous similar messages
<3>LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0000_UUID namespace with 6 resources in use, (rc=-110)
<3>LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0000_UUID namespace with 6 resources in use, (rc=-110)
<6>Lustre: soaked-MDT0000: Not available for connect from 192.168.1.133@o2ib100 (stopping)
<6>Lustre: Skipped 9 previous similar messages
<6>Lustre: soaked-MDT0000: Not available for connect from 192.168.1.106@o2ib10 (stopping)
<6>Lustre: Skipped 23 previous similar messages
<3>LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0000_UUID namespace with 6 resources in use, (rc=-110)
<3>LustreError: 34661:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0000-mdtlov: statfs: rc = -108
<3>LustreError: 34661:0:(lod_qos.c:208:lod_statfs_and_check()) Skipped 201 previous similar messages
<0>LustreError: 34700:0:(mdt_handler.c:4565:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed:
<0>LustreError: 9095:0:(lod_dev.c:1654:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff8803ffd54000
<0>LustreError: 9095:0:(lod_dev.c:1654:lod_device_free()) LBUG
<4>Pid: 9095, comm: obd_zombid
A crash dump is available on lola-9 - /var/crash/127.0.0.1-2017-01-05-05:48:01 |
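For context on the assertion: an lu_device carries a reference counter (ld_ref) that every live lu_object pins, and both mdt_fini() and lod_device_free() require it to have dropped to zero before the device memory is released. A minimal sketch of that invariant follows, with hypothetical names rather than the real obdclass code:
{code}
/* Hypothetical sketch of the refcount invariant behind the LBUG:
 * a device may only be freed once every object referencing it has
 * dropped its reference. Not the actual Lustre implementation. */
#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/slab.h>

struct lu_device_sketch {
	atomic_t ld_ref;	/* one reference per live lu_object */
};

static void device_ref_get(struct lu_device_sketch *d)
{
	atomic_inc(&d->ld_ref);	/* taken when an object is set up */
}

static void device_ref_put(struct lu_device_sketch *d)
{
	atomic_dec(&d->ld_ref);	/* dropped when an object is freed */
}

static void device_free(struct lu_device_sketch *d)
{
	/* Any object that was never put keeps ld_ref > 0, and this
	 * check fires exactly like the lod_device_free() LBUG above. */
	BUG_ON(atomic_read(&d->ld_ref) != 0);
	kfree(d);
}
{code}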
| Comments |
| Comment by Peter Jones [ 05/Jan/17 ] |
|
Lai, does this seem similar to an issue you are already working on? Peter |
| Comment by Lai Siyao [ 09/Jan/17 ] |
|
This looks to be caused by a leaked lock, but 'crash struct' shows that all the listed locks and resources have already been released (their content is 5a5a..). More notes from the logs: the listed locks still have users (mostly writers), and they have not been converted into COS locks yet, so when the message is printed they should still be in MDT handling, or the unlock was forgotten somewhere. |
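To illustrate the suspected pattern, here is a hypothetical sketch (illustrative names, not the actual MDT code): a server-side operation takes a writer reference on a lock (the "1" in "lrc: 2/0,1" in the dumps above is one writer), and if any path returns without the matching decref, the lock, its resource, and ultimately the device stay pinned through failover cleanup:
{code}
/* Hypothetical illustration of a leaked server-side lock reference.
 * Names are illustrative only, not the Lustre ldlm API. */
#include <linux/atomic.h>
#include <linux/types.h>

struct lock_sketch {
	atomic_t l_writers;	/* writer refcount shown in "lrc" dumps */
};

static void reint_op_sketch(struct lock_sketch *lock, bool error_path)
{
	atomic_inc(&lock->l_writers);	/* take a writer reference */

	/* ... modify the object under the lock ... */

	if (error_path)
		return;	/* BUG: decref skipped; the lock keeps its
			 * resource (and device) refcount nonzero */

	atomic_dec(&lock->l_writers);	/* normal path drops the ref */
}
{code}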
| Comment by Peter Jones [ 12/May/17 ] |
|
Cliff, is this still a live issue on soak, or has it been replaced by new issues on more recent runs? Peter |
| Comment by Cliff White (Inactive) [ 12/May/17 ] |
|
I have not seen it recently |
| Comment by Peter Jones [ 12/May/17 ] |
|
OK, thanks Cliff. I will remove the Fix Version for now. It's still possible this is just a really rare edge case and it might come up again, but it seems like our time would be better spent on other issues for 2.10 for the time being. |
| Comment by Cliff White (Inactive) [ 11/Jul/17 ] |
|
Hit this again immediately on lustre b2_10 build 3
{code}
[ 7028.656544] Lustre: Failing over soaked-MDT0000
[ 7028.693228] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.120@o2ib (stopping)
[ 7028.704412] Lustre: Skipped 2 previous similar messages
[ 7028.723793] LustreError: 4767:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0000-mdtlov: statfs: rc = -108
[ 7028.730576] LustreError: 12528:0:(ldlm_resource.c:1094:ldlm_resource_complain()) mdt-soaked-MDT0000_UUID: namespace resource [0x200000bf1:0x357e:0x0].0x5dedd0e5 (ffff8807f3886c00) refcount nonzero (2) after lock cleanup; forcing cleanup.
[ 7028.730579] LustreError: 12528:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000bf1:0x357e:0x0].0x5dedd0e5 (ffff8807f3886c00) refcount = 3
[ 7028.730580] LustreError: 12528:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
[ 7028.730587] LustreError: 12528:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: mdt-soaked-MDT0000_UUID lock: ffff880774c4d200/0xaff450bee03d5e2e lrc: 2/0,1 mode: PW/PW res: [0x200000bf1:0x357e:0x0].0x5dedd0e5 bits 0x2 rrc: 4 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 4767 timeout: 0 lvb_type: 0
[ 7028.730589] LustreError: 12528:0:(ldlm_resource.c:1697:ldlm_resource_dump()) Waiting locks:
[ 7028.730593] LustreError: 12528:0:(ldlm_resource.c:1699:ldlm_resource_dump()) ### ### ns: mdt-soaked-MDT0000_UUID lock: ffff8804072cb800/0xaff450bee03d5e35 lrc: 3/0,1 mode: --/PW res: [0x200000bf1:0x357e:0x0].0x5dedd0e5 bits 0x2 rrc: 4 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 4652 timeout: 0 lvb_type: 0
[ 7028.732498] LustreError: 12528:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000bf1:0x357e:0x0].0x0 (ffff880777c28780) refcount = 5
[ 7028.732499] LustreError: 12528:0:(ldlm_resource.c:1697:ldlm_resource_dump()) Waiting locks:
[ 7028.850045] LustreError: 4222:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88077a390000 x1572582403710608/t0(0) o13->soaked-OST000e-osc-MDT0000@192.168.1.104@o2ib:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[ 7030.100044] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.107@o2ib (stopping)
[ 7030.100045] Lustre: soaked-MDT0000: Not available for connect from 192.168.1.107@o2ib (stopping)
[ 7030.100047] Lustre: Skipped 6 previous similar messages
[ 7030.134373] Lustre: Skipped 2 previous similar messages
[ 7031.012257] LustreError: 4154:0:(lod_dev.c:1672:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff8807878f2000
{code}
Looks like the node is wedged. |
| Comment by Cliff White (Inactive) [ 13/Dec/17 ] |
|
Hit this again. version=2.10.2_RC1_4_g2022d41
[ 2083.015096] Lustre: 2491:0:(service.c:2112:ptlrpc_server_handle_request()) Skipped 1 previous similar message
[ 2083.338528] LustreError: 2313:0:(lod_qos.c:208:lod_statfs_and_check()) Skipped 113 previous similar messages
[ 2089.418303] LustreError: 2159:0:(lod_dev.c:1672:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff8803b71b0000
[ 2089.418369] LustreError: 3007:0:(mdt_handler.c:4808:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed:
[ 2089.418371] LustreError: 3007:0:(mdt_handler.c:4808:mdt_fini()) LBUG
[ 2089.418372] Pid: 3007, comm: umount
[ 2089.418373] Call Trace:
[ 2089.418397] [<ffffffffc09717ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 2089.418405] [<ffffffffc097183c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 2089.418431] [<ffffffffc16cf832>] mdt_device_fini+0x8e2/0x920 [mdt]
[ 2089.418470] [<ffffffffc0a8c921>] class_cleanup+0x971/0xcd0 [obdclass]
[ 2089.418497] [<ffffffffc0a8ecbd>] class_process_config+0x19cd/0x23b0 [obdclass]
[ 2089.418508] [<ffffffffc097cbc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 2089.418534] [<ffffffffc0a8f866>] class_manual_cleanup+0x1c6/0x710 [obdclass]
[ 2089.418564] [<ffffffffc0abdffe>] server_put_super+0x8de/0xcd0 [obdclass]
[ 2089.418571] [<ffffffff81203692>] generic_shutdown_super+0x72/0x100
[ 2089.418574] [<ffffffff81203a62>] kill_anon_super+0x12/0x20
[ 2089.418600] [<ffffffffc0a92162>] lustre_kill_super+0x32/0x50 [obdclass]
[ 2089.418602] [<ffffffff81203e19>] deactivate_locked_super+0x49/0x60
[ 2089.418604] [<ffffffff81204586>] deactivate_super+0x46/0x60
[ 2089.418610] [<ffffffff812217cf>] cleanup_mnt+0x3f/0x80
[ 2089.418612] [<ffffffff81221862>] __cleanup_mnt+0x12/0x20
[ 2089.418618] [<ffffffff810ad275>] task_work_run+0xc5/0xf0
[ 2089.418622] [<ffffffff8102ab62>] do_notify_resume+0x92/0xb0
[ 2089.418627] [<ffffffff816b533d>] int_signal+0x12/0x17
[ 2089.418628]
[ 2089.418629] Kernel panic - not syncing: LBUG
Crash dump available on spirit. |
| Comment by Cliff White (Inactive) [ 11/Jan/18 ] |
|
Hit this again on 2.10.3-RC1 - It may be rare outside, but we seem to hit it at least once every test cycle. This time it occurred when failing back from MDS failover.
Jan 11 00:20:11 soak-9 kernel: LustreError: 22047:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0000-mdtlov: statfs: rc = -108
Jan 11 00:20:11 soak-9 kernel: LustreError: 22047:0:(lod_qos.c:208:lod_statfs_and_check()) Skipped 931 previous similar messages
Jan 11 00:20:11 soak-9 kernel: Lustre: 22047:0:(service.c:2112:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (42:58s); client may timeout. req@ffff88081c680f00 x1589236901262384/t0(0) o101->0940640e-d454-688a-a4a6-941fc69c399c@192.168.1.117@o2ib:673/0 lens 1728/544 e 0 to 0 dl 1515629953 ref 1 fl Complete:/0/0 rc -19/-19
Jan 11 00:20:11 soak-9 kernel: Lustre: 22047:0:(service.c:2112:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Jan 11 00:20:13 soak-9 kernel: LustreError: 22287:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8803ca2ca400 ns: mdt-soaked-MDT0000_UUID lock: ffff8803ca63c800/0xf21e83d58af83b34 lrc: 3/0,0 mode: CR/CR res: [0x200025e04:0x1fd72:0x0].0x0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 192.168.1.117@o2ib remote: 0xd319047699a98ef6 expref: 3 pid: 22287 timeout: 0 lvb_type: 0
Jan 11 00:20:13 soak-9 kernel: LustreError: 22044:0:(lod_qos.c:208:lod_statfs_and_check()) soaked-MDT0000-mdtlov: statfs: rc = -108
Jan 11 00:20:13 soak-9 kernel: LustreError: 22044:0:(lod_qos.c:208:lod_statfs_and_check()) Skipped 139 previous similar messages
Jan 11 00:20:13 soak-9 kernel: Lustre: 22364:0:(service.c:2112:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (42:59s); client may timeout. req@ffff8803cca4b600 x1589235586212928/t0(0) o101->55640ac0-502d-db6b-5093-b453b7dbb0cb@192.168.1.118@o2ib:673/0 lens 1728/544 e 0 to 0 dl 1515629953 ref 1 fl Complete:/0/0 rc -19/-19
Jan 11 00:20:13 soak-9 kernel: LustreError: 22287:0:(ldlm_lockd.c:1415:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Jan 11 00:20:13 soak-9 kernel: LustreError: 0-0: Forced cleanup waiting for mdt-soaked-MDT0000_UUID namespace with 1 resources in use, (rc=-110)
Jan 11 00:20:17 soak-9 kernel: Lustre: soaked-MDT0000: Not available for connect from 192.168.1.119@o2ib (stopping)
Jan 11 00:20:17 soak-9 kernel: Lustre: Skipped 14 previous similar messages
Jan 11 00:20:19 soak-9 kernel: LustreError: 22401:0:(lod_dev.c:1672:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff8803bdb8e000
Jan 11 00:20:19 soak-9 kernel: LustreError: 22401:0:(lod_dev.c:1672:lod_device_free()) LBUG
Jan 11 00:20:19 soak-9 kernel: Pid: 22401, comm: umount
Jan 11 00:20:19 soak-9 kernel: #012Call Trace:
Jan 11 00:20:19 soak-9 kernel: [<ffffffffc0de47ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jan 11 00:20:19 soak-9 kernel: [<ffffffffc0de483c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jan 11 00:20:19 soak-9 kernel: [<ffffffffc1657f76>] lod_device_free+0x296/0x2a0 [lod]
Jan 11 00:20:19 soak-9 kernel: [<ffffffffc0ee373e>] class_free_dev+0x54e/0x7f0 [obdclass]
Jan 11 00:20:19 soak-9 kernel: [<ffffffffc0ee3c00>] class_export_put+0x220/0x2f0 [obdclass]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc0ee56f5>] class_unlink_export+0x135/0x170 [obdclass]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc0efad70>] class_decref+0x80/0x160 [obdclass]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc0efb1d3>] class_detach+0x1b3/0x2e0 [obdclass]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc0f01eb8>] class_process_config+0x1a28/0x23f0 [obdclass]
Jan 11 00:20:20 soak-9 kernel: [<ffffffff810c93f5>] ? sched_clock_cpu+0x85/0xc0
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc0defbc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc0f02a60>] class_manual_cleanup+0x1e0/0x710 [obdclass]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc16579c3>] lod_obd_disconnect+0x93/0x1c0 [lod]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc16ccbc9>] mdd_process_config+0x289/0x600 [mdd]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc152a4cc>] mdt_stack_fini+0x2bc/0xd40 [mdt]
Jan 11 00:20:20 soak-9 kernel: [<ffffffffc152b423>] mdt_device_fini+0x4d3/0x920 [mdt]
Jan 11 00:20:21 soak-9 kernel: [<ffffffffc0effac1>] class_cleanup+0x971/0xcd0 [obdclass]
Jan 11 00:20:21 soak-9 kernel: [<ffffffffc0f01e22>] class_process_config+0x1992/0x23f0 [obdclass]
Jan 11 00:20:21 soak-9 kernel: [<ffffffffc0defbc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jan 11 00:20:21 soak-9 kernel: [<ffffffffc0f02a46>] class_manual_cleanup+0x1c6/0x710 [obdclass]
Jan 11 00:20:21 soak-9 kernel: [<ffffffffc0f30f8e>] server_put_super+0x8de/0xcd0 [obdclass]
Jan 11 00:20:21 soak-9 kernel: [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
Jan 11 00:20:21 soak-9 kernel: [<ffffffff812058a2>] kill_anon_super+0x12/0x20
Jan 11 00:20:21 soak-9 kernel: [<ffffffffc0f05342>] lustre_kill_super+0x32/0x50 [obdclass]
Jan 11 00:20:21 soak-9 kernel: [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
Jan 11 00:20:21 soak-9 kernel: [<ffffffff812063c6>] deactivate_super+0x46/0x60
Jan 11 00:20:21 soak-9 kernel: [<ffffffff8122376f>] cleanup_mnt+0x3f/0x80
Jan 11 00:20:21 soak-9 kernel: [<ffffffff81223802>] __cleanup_mnt+0x12/0x20
Jan 11 00:20:21 soak-9 kernel: [<ffffffff810aee05>] task_work_run+0xc5/0xf0
Jan 11 00:20:21 soak-9 kernel: [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
Jan 11 00:20:21 soak-9 kernel: [<ffffffff816b8d37>] int_signal+0x12/0x17
Jan 11 00:20:22 soak-9 kernel:
Jan 11 00:20:22 soak-9 kernel: Kernel panic - not syncing: LBUG |
| Comment by Cliff White (Inactive) [ 25/Jan/18 ] |
|
Hit this again on 2.10.57 -
Jan 25 00:15:03 soak-9 kernel: LustreError: 3830:0:(lod_qos.c:208:lod_statfs_and_check()) Skipped 46 previous similar messages
Jan 25 00:15:04 soak-9 kernel: Lustre: 3753:0:(service.c:2124:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (42:64s); client may timeout. req@ffff8803d2008c00 x1590514452745184/t0(0) o101->7f42689f-29c1-6234-8d58-8721a9f86699@192.168.1.125@o2ib:449/0 lens 1728/544 e 0 to 0 dl 1516839239 ref 1 fl Complete:/0/0 rc -19/-19
Jan 25 00:15:04 soak-9 kernel: LustreError: 3830:0:(ldlm_lockd.c:1362:ldlm_handle_enqueue0()) ### lock on destroyed export ffff8808177d1c00 ns: mdt-soaked-MDT0000_UUID lock: ffff880816f58b40/0xb5b8772267a2d8f2 lrc: 3/0,0 mode: CR/CR res: [0x20006e1f6:0x16d28:0x0].0x0 bits 0x8/0x0 rrc: 2 type: IBT flags: 0x50200000000000 nid: 192.168.1.117@o2ib remote: 0xb6333a2c30ac9e3b expref: 3 pid: 3830 timeout: 0 lvb_type: 0
Jan 25 00:15:04 soak-9 kernel: Lustre: 3753:0:(service.c:2124:ptlrpc_server_handle_request()) Skipped 1 previous similar message
Jan 25 00:15:07 soak-9 kernel: Lustre: soaked-MDT0000: Not available for connect from 192.168.1.134@o2ib (stopping)
Jan 25 00:15:07 soak-9 kernel: Lustre: Skipped 10 previous similar messages
Jan 25 00:15:09 soak-9 kernel: LustreError: 3915:0:(lod_dev.c:1687:lod_device_free()) ASSERTION( atomic_read(&lu->ld_ref) == 0 ) failed: lu is ffff88082668e000
Jan 25 00:15:09 soak-9 kernel: LustreError: 3915:0:(lod_dev.c:1687:lod_device_free()) LBUG
Jan 25 00:15:09 soak-9 kernel: Pid: 3915, comm: umount
Jan 25 00:15:09 soak-9 kernel: #012Call Trace:
Jan 25 00:15:09 soak-9 kernel: [<ffffffffc096f7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
Jan 25 00:15:09 soak-9 kernel: [<ffffffffc096f83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
Jan 25 00:15:09 soak-9 kernel: [<ffffffffc1865f76>] lod_device_free+0x296/0x2a0 [lod]
Jan 25 00:15:09 soak-9 kernel: [<ffffffffc0a7ec4c>] class_free_dev+0x4dc/0x730 [obdclass]
Jan 25 00:15:09 soak-9 kernel: [<ffffffffc0a7f0c0>] class_export_put+0x220/0x2f0 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a80bd5>] class_unlink_export+0x135/0x170 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a96570>] class_decref+0x80/0x160 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a969d3>] class_detach+0x1b3/0x2e0 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a9d5e9>] class_process_config+0x19d9/0x28a0 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc097ad47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a9e690>] class_manual_cleanup+0x1e0/0x710 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc18659c3>] lod_obd_disconnect+0x93/0x1c0 [lod]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc18e3d1e>] mdd_process_config+0x3de/0x630 [mdd]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc177827c>] mdt_stack_fini+0x2bc/0xd40 [mdt]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc1779083>] mdt_device_fini+0x383/0x970 [mdt]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a9b21c>] class_cleanup+0x8cc/0xc40 [obdclass]
Jan 25 00:15:10 soak-9 kernel: [<ffffffffc0a9c23c>] class_process_config+0x62c/0x28a0 [obdclass]
Jan 25 00:15:11 soak-9 kernel: [<ffffffffc097ad47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Jan 25 00:15:11 soak-9 kernel: [<ffffffffc0a9e676>] class_manual_cleanup+0x1c6/0x710 [obdclass]
Jan 25 00:15:11 soak-9 kernel: [<ffffffffc0ace25e>] server_put_super+0x8de/0xcd0 [obdclass]
Jan 25 00:15:11 soak-9 kernel: [<ffffffff812054d2>] generic_shutdown_super+0x72/0x100
Jan 25 00:15:11 soak-9 kernel: [<ffffffff812058a2>] kill_anon_super+0x12/0x20
Jan 25 00:15:11 soak-9 kernel: [<ffffffffc0aa10c2>] lustre_kill_super+0x32/0x50 [obdclass]
Jan 25 00:15:11 soak-9 kernel: [<ffffffff81205c59>] deactivate_locked_super+0x49/0x60
Jan 25 00:15:11 soak-9 kernel: [<ffffffff812063c6>] deactivate_super+0x46/0x60
Jan 25 00:15:11 soak-9 kernel: [<ffffffff8122376f>] cleanup_mnt+0x3f/0x80
Jan 25 00:15:11 soak-9 kernel: [<ffffffff81223802>] __cleanup_mnt+0x12/0x20
Jan 25 00:15:11 soak-9 kernel: [<ffffffff810aee05>] task_work_run+0xc5/0xf0
Jan 25 00:15:11 soak-9 kernel: [<ffffffff8102ab52>] do_notify_resume+0x92/0xb0
Jan 25 00:15:11 soak-9 kernel: [<ffffffff816b8d37>] int_signal+0x12/0x17
Jan 25 00:15:12 soak-9 kernel:
Jan 25 00:15:12 soak-9 kernel: Kernel panic - not syncing: LBUG
Jan 25 00:19:04 soak-9 systemd: Starting Stop Read-Ahead Data Collection...
Crash dumps are available on spirit |
| Comment by Cliff White (Inactive) [ 01/Feb/18 ] |
|
Hit this again on version=2.10.57_57_g98ddc99; it basically blocks most soak/failover tests. |
| Comment by Gerrit Updater [ 02/Feb/18 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: https://review.whamcloud.com/31143 |
| Comment by Lai Siyao [ 02/Feb/18 ] |
|
Hi Cliff, I uploaded a debug patch; will you test soak/failover with it? |
| Comment by Cliff White (Inactive) [ 07/Feb/18 ] |
|
Ask Minh to trigger an MLNX build and I can test that. Is your patch based on latest master? |
| Comment by Lai Siyao [ 08/Feb/18 ] |
|
Yes, it is. |
| Comment by Cliff White (Inactive) [ 08/Feb/18 ] |
|
Hit the problem with your patch; relevant syslog attached, core dumped, vmcore-dmesg.txt attached. Full data is available on Spirit; the crash dump is at /scratch/dumps/soak-9.spirit.hpdd.intel018-02-08-08\:03\:14/ |
| Comment by Lai Siyao [ 09/Feb/18 ] |
|
Thanks Cliff, that's quite helpful.
[45593.537511] LustreError: 8499:0:(lod_dev.c:1688:lod_device_free()) header@ffff8807dc04ee80[0x0, 1, [0x200000007:0x1:0x0] hash exist] {
[45593.591495] LustreError: 8499:0:(lod_dev.c:1688:lod_device_free()) ....mdt@ffff8807dc04eed0mdt-object@ffff8807dc04ee80( , writecount=0)
[45593.645796] LustreError: 8499:0:(lod_dev.c:1688:lod_device_free()) ....mdd@ffff8803cadf5820mdd-object@ffff8803cadf5820(open_count=0, valid=0, cltime=0ns, flags=0)
The message shows 'root' is not released at lod_device_free(); I'll look into the related code. |
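A rough sketch of what a debug patch like this can do (hypothetical names; the real patch is at the review link above): before the ld_ref assertion fires, walk the objects still attached to the device and print them, so the leaked object ("root" here) shows up in the crash log:
{code}
/* Hypothetical sketch of a leak-dump debug aid; names are illustrative
 * only and do not match the actual lod_dev.c patch. */
#include <linux/atomic.h>
#include <linux/bug.h>
#include <linux/list.h>
#include <linux/printk.h>

struct obj_sketch {
	struct list_head o_linkage;	/* on the device's object list */
	const char	 *o_name;	/* e.g. "mdt-object", "mdd-object" */
};

static void device_free_debug(struct list_head *objs, atomic_t *ld_ref)
{
	struct obj_sketch *o;

	if (atomic_read(ld_ref) != 0) {
		/* dump every surviving object before the LBUG fires */
		list_for_each_entry(o, objs, o_linkage)
			pr_err("leaked object: %s\n", o->o_name);
	}
	BUG_ON(atomic_read(ld_ref) != 0);
}
{code}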
| Comment by Lai Siyao [ 09/Feb/18 ] |
|
I still need to access the crash dump to verify some details, but I don't have an account on Spirit yet; I just created a ticket for it: https://jira.hpdd.intel.com/browse/DCO-7884 |
| Comment by Lai Siyao [ 11/Feb/18 ] |
|
The crash dump shows lod->lod_md_root is not NULL at lod_device_free(), which means lod_md_root was released too early (in precleanup) and some request re-initialized it after that. I'll move the release to the real cleanup time. |
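A minimal sketch of that fix direction, assuming a cached root object field like lod_md_root that pins the device (hypothetical helper names, not the merged patch itself): release the cached root only in the final fini path, after request processing has stopped, so nothing can race and re-create it:
{code}
/* Hypothetical sketch of releasing a cached root object at final
 * teardown instead of precleanup. Not the merged LU-8990 patch. */
#include <stddef.h>

struct lu_object_sketch;
/* drops the object's reference on its device (stand-in for a real
 * object-put helper) */
void object_put_sketch(struct lu_object_sketch *obj);

struct lod_device_sketch {
	struct lu_object_sketch *lod_md_root;	/* cached root object */
};

static void lod_fini_sketch(struct lod_device_sketch *lod)
{
	/* Moved here from precleanup: at this point requests can no
	 * longer run, so the root cannot be re-initialized after the
	 * put, and ld_ref reaches zero before lod_device_free(). */
	if (lod->lod_md_root != NULL) {
		object_put_sketch(lod->lod_md_root);
		lod->lod_md_root = NULL;
	}
}
{code}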
| Comment by Lai Siyao [ 11/Feb/18 ] |
|
Hi Cliff, I just updated the patch; could you run the soak test again? |
| Comment by Cliff White (Inactive) [ 13/Feb/18 ] |
|
Not seeing any hard faults yet, but many watchdogs/hanging threads.
Feb 12 22:24:35 soak-8 kernel: LNet: Service thread pid 2499 was inactive for 200.49s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Feb 12 22:24:35 soak-8 kernel: Pid: 2499, comm: mdt01_003
Feb 12 22:24:35 soak-8 kernel: #012Call Trace:
Feb 12 22:24:35 soak-8 kernel: [<ffffffff81033519>] ? sched_clock+0x9/0x10
Feb 12 22:24:35 soak-8 kernel: [<ffffffff816ab6b9>] schedule+0x29/0x70
Feb 12 22:24:35 soak-8 kernel: [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
Feb 12 22:24:35 soak-8 kernel: [<ffffffff8109a6c0>] ? process_timeout+0x0/0x10
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc0e21eb1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc17ae760>] osp_precreate_reserve+0x2e0/0x810 [osp]
Feb 12 22:24:35 soak-8 kernel: [<ffffffff810c6440>] ? default_wake_function+0x0/0x20
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc17a3c53>] osp_declare_create+0x193/0x590 [osp]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc0f404a9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16f47dc>] lod_sub_declare_create+0xdc/0x210 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16eda4e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16ee9ca>] lod_alloc_rr.constprop.18+0x70a/0x1010 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16f317d>] lod_qos_prep_create+0xced/0x1820 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16f000e>] ? lod_alloc_qos.constprop.17+0xd3e/0x1590 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16f420d>] lod_prepare_create+0x25d/0x360 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16e5f7e>] lod_declare_striped_create+0x1ee/0x970 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16f47dc>] ? lod_sub_declare_create+0xdc/0x210 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc16ea2b4>] lod_declare_create+0x204/0x590 [lod]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc0f60619>] ? lu_context_refill+0x19/0x50 [obdclass]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc175c3ef>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
Feb 12 22:24:35 soak-8 kernel: [<ffffffffc174cb63>] mdd_declare_create+0x53/0xe30 [mdd]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc1750e89>] mdd_create+0x879/0x1410 [mdd]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc1605106>] mdt_reint_open+0x2206/0x3260 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc0f73d2e>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc15e8b43>] ? ucred_set_jobid+0x53/0x70 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc15f9400>] mdt_reint_rec+0x80/0x210 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc15d8f8b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc15e5437>] mdt_intent_reint+0x157/0x420 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc15dc0b2>] mdt_intent_opc+0x442/0xad0 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc113f470>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc15e3c63>] mdt_intent_policy+0x1a3/0x360 [mdt]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc10ef202>] ldlm_lock_enqueue+0x382/0x8f0 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc1117753>] ldlm_handle_enqueue0+0x8f3/0x13e0 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc113f4f0>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc119d202>] tgt_enqueue+0x62/0x210 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc11a5405>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc114958e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc1146448>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffff810c6452>] ? default_wake_function+0x12/0x20
Feb 12 22:24:36 soak-8 kernel: [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc114cd42>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffffc114c2b0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
Feb 12 22:24:36 soak-8 kernel: [<ffffffff810b252f>] kthread+0xcf/0xe0
Feb 12 22:24:36 soak-8 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
Feb 12 22:24:36 soak-8 kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Feb 12 22:24:36 soak-8 kernel: [<ffffffff810b2460>] ? kthread+0x0/0xe0
Feb 12 22:24:36 soak-8 kernel:
Feb 12 22:24:36 soak-8 kernel: LustreError: dumping log to /tmp/lustre-log.1518474276.2499
Logs are available on soak - /scratch/logs/syslog. |
| Comment by Lai Siyao [ 13/Feb/18 ] |
|
This should be a different issue; I'll look into it later. |
| Comment by Cliff White (Inactive) [ 14/Feb/18 ] |
|
Your new patch has been running for several days. I am not seeing any output or hard failures; I have grep'd for 'lod_device' in the logs but found nothing. Are there any other strings I should search for to get you the output you need? Should I perhaps force a crash dump? If you have a login to the system now, you can also force a dump if you need one. |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31143/ |
| Comment by Peter Jones [ 27/Feb/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31431 |
| Comment by Gerrit Updater [ 05/Apr/18 ] |
|
John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31431/ |