[LU-7707] general protection fault in qsd_reint_main Created: 25/Jan/16  Updated: 15/Mar/17  Resolved: 23/Mar/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

autotest review-dne-part-2


Issue Links:
Duplicate
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The insanity test suite hangs while unmounting ost2 after all tests have completed successfully. The suite_stdout log shows:

08:51:30:CMD: shadow-22vm7 grep -c /mnt/ost2' ' /proc/mounts
08:51:30:Stopping /mnt/ost2 (opts:-f) on shadow-22vm7
08:51:30:CMD: shadow-22vm7 umount -d -f /mnt/ost2
09:50:45:********** Timeout by autotest system **********

If we look at the test_complete log for shadow-22vm7, we see:

08:51:41:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
08:51:41:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
08:51:41:LustreError: 4149:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880065e860c0 x1524146469973308/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.1.5.16@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
08:51:41:LustreError: 4149:0:(qsd_reint.c:55:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
08:51:41:LustreError: 4149:0:(qsd_reint.c:55:qsd_reint_completion()) Skipped 1 previous similar message
08:51:41:Lustre: server umount lustre-OST0000 complete
08:51:41:Lustre: Skipped 6 previous similar messages
08:51:41:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
08:51:41:Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts
08:51:41:Lustre: DEBUG MARKER: umount -d -f /mnt/ost2
08:51:41:general protection fault: 0000 [#1] SMP 
08:51:41:last sysfs file: /sys/devices/system/cpu/online
08:51:41:CPU 1 
08:51:41:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
08:51:41:
08:51:41:Pid: 4320, comm: qsd_reint_0.lus Not tainted 2.6.32-573.8.1.el6_lustre.gea97898.x86_64 #1 Red Hat KVM
08:51:41:RIP: 0010:[<ffffffff81059911>]  [<ffffffff81059911>] __wake_up_common+0x31/0x90
08:51:41:RSP: 0018:ffff88006201bd80  EFLAGS: 00010096
08:51:41:RAX: 5a5a5a5a5a5a5a42 RBX: ffff88002cd908a0 RCX: 0000000000000000
08:51:41:RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000003 RDI: ffff88002cd908a0
08:51:41:RBP: ffff88006201bdc0 R08: 0000000000000000 R09: 0000000000000000
08:51:41:R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000282
08:51:41:R13: ffff88002cd908a8 R14: 0000000000000000 R15: 0000000000000000
08:51:41:FS:  0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
08:51:41:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
08:51:41:CR2: 00007fd119ac7000 CR3: 000000004aba3000 CR4: 00000000000006e0
08:51:41:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
08:51:41:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
08:51:41:Process qsd_reint_0.lus (pid: 4320, threadinfo ffff880062018000, task ffff88005cbc0ab0)
08:51:41:Stack:
08:51:41: ffff88006201bda0 0000000300000001 ffff88006201be00 ffff88002cd908a0
08:51:41:<d> 0000000000000282 0000000000000003 0000000000000001 0000000000000000
08:51:41:<d> ffff88006201be00 ffffffff8105e168 ffff88006201be00 ffff88002cd90800
08:51:41:Call Trace:
08:51:41: [<ffffffff8105e168>] __wake_up+0x48/0x70
08:51:41: [<ffffffffa0c0a1a3>] qsd_reint_main+0x73/0x1950 [lquota]
08:51:41: [<ffffffff81538dde>] ? thread_return+0x4e/0x7d0
08:51:41: [<ffffffff810672c2>] ? default_wake_function+0x12/0x20
08:51:41: [<ffffffffa0c0a130>] ? qsd_reint_main+0x0/0x1950 [lquota]
08:51:41: [<ffffffff810a0fce>] kthread+0x9e/0xc0
08:51:41: [<ffffffff8100c28a>] child_rip+0xa/0x20
08:51:41: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
08:51:41: [<ffffffff8100c280>] ? child_rip+0x0/0x20
08:51:41:Code: 41 56 41 55 41 54 53 48 83 ec 18 0f 1f 44 00 00 89 75 cc 89 55 c8 4c 8d 6f 08 48 8b 57 08 41 89 cf 4d 89 c6 48 8d 42 e8 49 39 d5 <48> 8b 58 18 74 3f 48 83 eb 18 eb 0a 0f 1f 00 48 89 d8 48 8d 5a 
08:51:41:RIP  [<ffffffff81059911>] __wake_up_common+0x31/0x90
08:51:41: RSP <ffff88006201bd80>
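
A note on reading the register dump above: RAX and RDX are filled almost entirely with the byte 0x5a, and the faulting instruction in the Code line (`<48> 8b 58 18`) loads from RAX+0x18, which is a non-canonical address. That would be consistent with __wake_up_common walking a wait queue whose memory had already been released and overwritten with a fill pattern, though the trace alone does not prove it. The following is a minimal sketch for spotting such values in an oops dump. The helper names and the choice of 0x5a as the byte to look for are assumptions made for illustration; they are not taken from any Lustre or kernel tool.

```python
import re

# Matches "RAX: 5a5a5a5a5a5a5a42"-style pairs: a 2-3 character register
# name followed by a full 16-digit hex value.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}):\s*([0-9a-f]{16})\b")


def parse_registers(oops_text):
    """Return {register_name: int_value} for 64-bit values in an oops dump."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


def poison_suspects(registers, poison_byte=0x5A, min_bytes=7):
    """Return registers in which at least min_bytes of the 8 bytes equal poison_byte.

    poison_byte is an assumption: it is simply the byte that repeats in
    this particular dump.
    """
    suspects = {}
    for name, value in registers.items():
        matching = sum(1 for b in value.to_bytes(8, "big") if b == poison_byte)
        if matching >= min_bytes:
            suspects[name] = value
    return suspects


# Two lines copied from the dump in this ticket.
SAMPLE = (
    "08:51:41:RAX: 5a5a5a5a5a5a5a42 RBX: ffff88002cd908a0 RCX: 0000000000000000\n"
    "08:51:41:RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000003 RDI: ffff88002cd908a0\n"
)

if __name__ == "__main__":
    regs = parse_registers(SAMPLE)
    for name, value in sorted(poison_suspects(regs).items()):
        print(f"{name} = {value:#018x}")
```

In this dump RAX equals RDX minus 0x18, which matches the `lea -0x18(%rdx),%rax` (`48 8d 42 e8`) just before the faulting load, so both values derive from the same list pointer read out of the wait queue head.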

Logs are at https://testing.hpdd.intel.com/test_sets/2ed0029e-c202-11e5-92cf-5254006e85c2

This is the first time we've seen insanity fail in this way.



 Comments   
Comment by Joseph Gmitter (Inactive) [ 25/Jan/16 ]

Hi Niu,
Can you please have a look at this issue?
Thanks.
Joe

Comment by Niu Yawei (Inactive) [ 26/Jan/16 ]

This could be a race when stopping the quota reint thread. We should put the qqi reference only after all other operations are done. I'll cook a patch soon.
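
The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Lustre code or the actual patch: `ToyQqi`, `thread_exit_buggy` and `thread_exit_fixed` are hypothetical names, and the model assumes that the wait queue being woken lives inside the refcounted object, so that the object must still be alive when the wake-up happens.

```python
class FreedObjectError(RuntimeError):
    """Raised when the toy object is touched after its last reference is dropped."""


class ToyQqi:
    """Hypothetical stand-in for a refcounted object that embeds a wait queue."""

    def __init__(self):
        self.refcount = 1   # reference held by the creator
        self.freed = False
        self.woken = False

    def _check_alive(self):
        if self.freed:
            raise FreedObjectError("object used after final put")

    def get(self):
        self._check_alive()
        self.refcount += 1

    def put(self):
        self._check_alive()
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True   # models the memory being released

    def wake_up(self):
        # Touches state inside the object, so the object must still be alive.
        self._check_alive()
        self.woken = True


def thread_exit_buggy(qqi):
    """Drop the thread's reference first, then wake waiters (unsafe order)."""
    qqi.put()
    qqi.wake_up()


def thread_exit_fixed(qqi):
    """Wake waiters first, drop the reference as the very last step."""
    qqi.wake_up()
    qqi.put()


def racing_object():
    """Build an object whose creator already dropped its reference,
    leaving the thread's reference as the last one."""
    qqi = ToyQqi()
    qqi.get()   # reference taken for the thread
    qqi.put()   # creator drops its reference, e.g. during umount
    return qqi
```

With `racing_object()`, `thread_exit_buggy` raises `FreedObjectError` because the wake-up runs after the final put, while `thread_exit_fixed` completes and only then releases the object.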

Comment by Gerrit Updater [ 26/Jan/16 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/18142
Subject: LU-7707 quota: put qqi reference after all things done
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cc83a02541442758c249af4022176325bac36063

Comment by Gerrit Updater [ 26/Jan/16 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/18150
Subject: LU-7707 kernel: kernel update RHEL7.2 [3.10.0-327.4.5.el7]
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8491f56ab317f66ef439f94c119c2666d864054b

Comment by Gerrit Updater [ 23/Mar/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18142/
Subject: LU-7707 quota: put qqi reference after all things done
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6fc9046a6a595eef80780ae2c1739cbd67ca827f

Comment by Joseph Gmitter (Inactive) [ 23/Mar/16 ]

Landed for 2.9.0
