[LU-4456] Test failure on sanity-scrub test_1c: hang on umount of mdt Created: 08/Jan/14  Updated: 11/Feb/14  Resolved: 31/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: MB, ldiskfs

Issue Links:
Duplicate
is duplicated by LU-4519 (osp_sync.c:351:osp_sync_interpret())... Resolved
Severity: 3
Rank (Obsolete): 12217

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/c2950242-75cc-11e3-b3c0-52540035b04c.

The sub-test test_1c failed with the following error:

test failed to respond and timed out

Info required for matching: sanity-scrub 1c

This test seems to be failing regularly, and always on umount of the MDT, though there seem to be multiple culprits (always in the MDT console log):

review (ldiskfs):
http://maloo.whamcloud.com/test_sets/c2950242-75cc-11e3-b3c0-52540035b04c

19:21:12:LustreError: 4484:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880053639400 x1456340331892024/t0(0) o13->lustre-OST0003-osc-MDT0000@10.10.16.81@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
19:21:12:LustreError: 4484:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 6 previous similar messages
19:21:12:BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
19:21:12:IP: [<ffffffff81510dcf>] _spin_lock_irqsave+0x1f/0x40
19:21:12:PGD 0 
19:21:12:Oops: 0002 [#1] SMP 
19:21:12:last sysfs file: /sys/devices/system/cpu/possible
19:21:12:CPU 0 
19:21:12:Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) libcfs(U) ldiskfs(U) sha512_generic sha256_generic jbd2 nfsd exportfs autofs4 nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet_selftest]
19:21:12:
19:21:12:Pid: 4485, comm: ptlrpcd_1 Not tainted 2.6.32-358.23.2.el6_lustre.gc9be53c.x86_64 #1 Red Hat KVM
19:21:12:RIP: 0010:[<ffffffff81510dcf>]  [<ffffffff81510dcf>] _spin_lock_irqsave+0x1f/0x40
19:21:12:RSP: 0018:ffff880055fc7c60  EFLAGS: 00010082
19:21:12:RAX: 0000000000010000 RBX: 0000000000000030 RCX: 0000000000000000
19:21:12:RDX: 0000000000000282 RSI: 0000000000000003 RDI: 0000000000000030
19:21:12:RBP: ffff880055fc7c60 R08: 00000000ffffff0a R09: 00000000fffffffb
19:21:12:R10: 0000000000000000 R11: 000000000000000f R12: ffff880053639400
19:21:12:R13: 0000000000000003 R14: 0000000000000001 R15: 0000000000000000
19:21:12:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
19:21:12:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
19:21:12:CR2: 0000000000000030 CR3: 0000000037aba000 CR4: 00000000000006f0
19:21:12:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
19:21:12:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
19:21:12:Process ptlrpcd_1 (pid: 4485, threadinfo ffff880055fc6000, task ffff880037b60080)
19:21:12:Stack:
19:21:12: ffff880055fc7ca0 ffffffff81055ab2 0000000000000000 ffff880037a5c000
19:21:12:<d> ffff880053639400 00000000fffffffb ffff88007c897800 00000000ebc0de03
19:21:12:<d> ffff880055fc7cd0 ffffffffa14afa66 ffff880053639400 ffff880079b29e40
19:21:12:Call Trace:
19:21:12: [<ffffffff81055ab2>] __wake_up+0x32/0x70
19:21:12: [<ffffffffa14afa66>] osp_statfs_interpret+0x86/0x3b0 [osp]
19:21:12: [<ffffffffa0b0bef4>] ptlrpc_check_set+0x2c4/0x1b40 [ptlrpc]
19:21:12: [<ffffffffa0b385db>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
19:21:12: [<ffffffffa0b38b9d>] ptlrpcd+0x2ad/0x3f0 [ptlrpc]
19:21:12: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
19:21:12: [<ffffffffa0b388f0>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
19:21:12: [<ffffffff81096a36>] kthread+0x96/0xa0
19:21:12: [<ffffffff8100c0ca>] child_rip+0xa/0x20
19:21:12: [<ffffffff810969a0>] ? kthread+0x0/0xa0
19:21:12: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
19:21:12:Code: c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c2 fa 66 0f 1f 44 00 00 b8 00 00 01 00 <f0> 0f c1 07 0f b7 c8 c1 e8 10 39 c1 74 0e f3 90 0f 1f 44 00 00 
19:21:12:RIP  [<ffffffff81510dcf>] _spin_lock_irqsave+0x1f/0x40
19:21:12: RSP <ffff880055fc7c60>
19:21:12:CR2: 0000000000000030

review-dne (ldiskfs):
https://maloo.whamcloud.com/test_sets/50a115f0-6c16-11e3-8fd6-52540035b04c

10:41:26:LustreError: 2822:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880063df4c00 x1455238614618676/t0(0) o13->lustre-OST0004-osc-MDT0000@10.10.4.237@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
10:41:26:LustreError: 2822:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 8 previous similar messages
10:41:26:Lustre: lustre-MDT0000: Not available for connect from 10.10.4.237@tcp (stopping)
10:41:26:Lustre: Skipped 6 previous similar messages
10:41:26:LustreError: 10839:0:(lfsck_lib.c:202:lfsck_tgt_descs_fini()) ASSERTION( list_empty(&ltd->ltd_layout_list) ) failed: 
10:41:26:LustreError: 10839:0:(lfsck_lib.c:202:lfsck_tgt_descs_fini()) LBUG
10:41:26:Pid: 10839, comm: umount
10:41:26:
10:41:26:Call Trace:
10:41:26: [<ffffffffa0479895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
10:41:26: [<ffffffffa0479e97>] lbug_with_loc+0x47/0xb0 [libcfs]
10:41:26: [<ffffffffa0e41bd6>] lfsck_tgt_descs_fini+0x606/0x670 [lfsck]
10:41:26: [<ffffffffa0e41d3d>] lfsck_instance_cleanup+0xfd/0x360 [lfsck]
10:41:26: [<ffffffffa0e430f0>] lfsck_degister+0xa0/0xb0 [lfsck]
10:41:27: [<ffffffffa0fc1a8e>] mdd_process_config+0x22e/0x5e0 [mdd]
10:41:27: [<ffffffffa0e8d4b3>] ? lustre_cfg_new+0x2d3/0x6e0 [mdt]
10:41:27: [<ffffffffa0e8da36>] mdt_stack_fini+0x176/0xb90 [mdt]
10:41:27: [<ffffffffa0fc05a5>] ? mdd_init_capa_ctxt+0xf5/0x130 [mdd]
10:41:27: [<ffffffffa0e8eab6>] mdt_device_fini+0x666/0xda0 [mdt]
10:41:27: [<ffffffffa05ae706>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
10:41:27: [<ffffffffa05d2813>] class_cleanup+0x573/0xd30 [obdclass]
10:41:27: [<ffffffffa05ac056>] ? class_name2dev+0x56/0xe0 [obdclass]
10:41:27: [<ffffffffa05d453a>] class_process_config+0x156a/0x1ad0 [obdclass]
10:41:27: [<ffffffffa05cc813>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
10:41:27: [<ffffffffa05d4c19>] class_manual_cleanup+0x179/0x6f0 [obdclass]
10:41:27: [<ffffffffa05ac056>] ? class_name2dev+0x56/0xe0 [obdclass]
10:41:27: [<ffffffffa060ce19>] server_put_super+0x8d9/0xe30 [obdclass]
10:41:27: [<ffffffff8118366b>] generic_shutdown_super+0x5b/0xe0
10:41:27: [<ffffffff81183756>] kill_anon_super+0x16/0x60
10:41:27: [<ffffffffa05d6ad6>] lustre_kill_super+0x36/0x60 [obdclass]
10:41:27: [<ffffffff81183ef7>] deactivate_super+0x57/0x80
10:41:27: [<ffffffff811a21ef>] mntput_no_expire+0xbf/0x110
10:41:27: [<ffffffff811a2c5b>] sys_umount+0x7b/0x3a0
10:41:27: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
10:41:27:
10:41:27:Kernel panic - not syncing: LBUG

review-dne:
https://maloo.whamcloud.com/test_sets/8e6d11fa-6a6c-11e3-9248-52540035b04c

00:29:59:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1
00:29:59:LustreError: 6423:0:(osp_precreate.c:481:osp_precreate_send()) lustre-OST0001-osc-MDT0000: can't precreate: rc = -5
00:29:59:LustreError: 6423:0:(osp_precreate.c:481:osp_precreate_send()) Skipped 1 previous similar message
00:29:59:LustreError: 6423:0:(osp_precreate.c:984:osp_precreate_thread()) lustre-OST0001-osc-MDT0000: cannot precreate objects: rc = -5
00:29:59:LustreError: 6423:0:(osp_precreate.c:984:osp_precreate_thread()) Skipped 1 previous similar message
00:29:59:LustreError: 6427:0:(osp_precreate.c:734:osp_precreate_cleanup_orphans()) lustre-OST0003-osc-MDT0000: cannot cleanup orphans: rc = -5
00:29:59:Lustre: lustre-MDT0000: Not available for connect from 10.10.16.178@tcp (stopping)
00:29:59:Lustre: Skipped 66 previous similar messages
00:29:59:LustreError: 3144:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff88006635c000 x1454998270941364/t0(0) o13->lustre-OST0007-osc-MDT0000@10.10.16.184@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
00:29:59:LustreError: 3144:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 19 previous similar messages
00:29:59:LustreError: 166-1: MGC10.10.16.179@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
00:29:59:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
00:29:59:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
00:29:59:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
00:29:59:Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
00:29:59:Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?

Though this last one may be better put under LU-4449.



 Comments   
Comment by Oleg Drokin [ 15/Jan/14 ]

I agree that the third trace is an unrelated issue we know about already.

The first two seem to be some sort of race on shutdown, potentially related to LFSCK.
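
For illustration, the first trace looks consistent with the following kind of pattern (a rough sketch with made-up fake_* names, not the actual osp code): an interpret callback running on a ptlrpcd thread wakes a wait queue embedded in a per-device structure that the umount path has already torn down, so wake_up() takes a spinlock through a stale pointer, matching the NULL dereference in _spin_lock_irqsave() above.

/* Hypothetical sketch only -- names and layout are invented. */
#include <linux/wait.h>
#include <linux/slab.h>

struct fake_osp_device {
        wait_queue_head_t opd_pre_waitq;        /* embedded wait queue */
        /* ... */
};

/* runs on a ptlrpcd thread whenever the statfs reply arrives */
static int fake_statfs_interpret(struct fake_osp_device *d)
{
        /* if umount already ran fake_device_fini(), 'd' is stale */
        wake_up(&d->opd_pre_waitq);             /* spinlock inside freed/cleared memory */
        return 0;
}

/* umount path: frees the device without draining in-flight callbacks */
static void fake_device_fini(struct fake_osp_device *d)
{
        kfree(d);
}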

Comment by nasf (Inactive) [ 17/Jan/14 ]

The "(lfsck_lib.c:202:lfsck_tgt_descs_fini()) ASSERTION( list_empty(&ltd->ltd_layout_list) ) failed:" failure was fixed weeks ago when the LFSCK patch series was updated. As for the other two stack traces, I have no idea yet.
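
For reference, the tripped assertion checks that the target descriptor is no longer linked into the layout list when the descriptor table is finalized during umount; roughly (a sketch with made-up fake_* names, not the real lfsck structures):

/* Hypothetical sketch only -- names and layout are invented. */
#include <linux/list.h>
#include <libcfs/libcfs.h>      /* LASSERT() */

struct fake_lfsck_tgt_desc {
        struct list_head ltd_layout_list;       /* linkage into the LFSCK layout list */
        /* ... */
};

static void fake_tgt_desc_fini(struct fake_lfsck_tgt_desc *ltd)
{
        /* the LBUG above means the target was still linked into the
         * layout list when the descriptor table was torn down */
        LASSERT(list_empty(&ltd->ltd_layout_list));
        /* ... free the descriptor ... */
}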

Comment by Di Wang [ 17/Jan/14 ]

It seems osp_pre should be freed last. I will push a fix soon.
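
For illustration only, one way such an ordering could be enforced (a rough sketch with made-up fake_* names, not necessarily what the actual patch does) is to count in-flight statfs RPCs and free the precreate state only after the last interpret callback has finished:

/* Hypothetical sketch only -- names and layout are invented. */
#include <linux/atomic.h>
#include <linux/completion.h>
#include <linux/slab.h>

struct fake_osp_device {
        atomic_t          opd_inflight;         /* in-flight RPCs + one fini reference */
        struct completion opd_drained;          /* fired when the count hits zero */
};

static void fake_device_init(struct fake_osp_device *d)
{
        atomic_set(&d->opd_inflight, 1);        /* the fini reference */
        init_completion(&d->opd_drained);
}

/* each statfs RPC takes a reference before it is sent ... */
static void fake_rpc_sent(struct fake_osp_device *d)
{
        atomic_inc(&d->opd_inflight);
}

/* ... and drops it from its interpret callback */
static void fake_rpc_done(struct fake_osp_device *d)
{
        if (atomic_dec_and_test(&d->opd_inflight))
                complete(&d->opd_drained);      /* we were the last user */
}

/* umount path: assumes no new RPCs are issued once fini starts */
static void fake_device_fini(struct fake_osp_device *d)
{
        if (!atomic_dec_and_test(&d->opd_inflight))
                wait_for_completion(&d->opd_drained);
        kfree(d);                               /* nothing left that can wake us */
}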

Comment by Di Wang [ 17/Jan/14 ]

http://review.whamcloud.com/8890

Comment by Jodi Levi (Inactive) [ 31/Jan/14 ]

Patch has landed to Master. Please reopen this ticket if any further work is needed.
