[LU-7707] general protection fault in qsd_reint_main Created: 25/Jan/16 Updated: 15/Mar/17 Resolved: 23/Mar/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | autotest review-dne-part-2 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
insanity test suite hangs on unmounting ost2 after all tests have completed successfully.

In the suite_stdout log, we see:

08:51:30:CMD: shadow-22vm7 grep -c /mnt/ost2' ' /proc/mounts
08:51:30:Stopping /mnt/ost2 (opts:-f) on shadow-22vm7
08:51:30:CMD: shadow-22vm7 umount -d -f /mnt/ost2
09:50:45:********** Timeout by autotest system **********

If we look at the test_complete log for shadow-22vm7, we see:

08:51:41:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
08:51:41:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
08:51:41:LustreError: 4149:0:(client.c:1130:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880065e860c0 x1524146469973308/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.1.5.16@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
08:51:41:LustreError: 4149:0:(qsd_reint.c:55:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
08:51:41:LustreError: 4149:0:(qsd_reint.c:55:qsd_reint_completion()) Skipped 1 previous similar message
08:51:41:Lustre: server umount lustre-OST0000 complete
08:51:41:Lustre: Skipped 6 previous similar messages
08:51:41:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
08:51:41:Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts
08:51:41:Lustre: DEBUG MARKER: umount -d -f /mnt/ost2
08:51:41:general protection fault: 0000 [#1] SMP
08:51:41:last sysfs file: /sys/devices/system/cpu/online
08:51:41:CPU 1
08:51:41:Modules linked in: osp(U) ofd(U) lfsck(U) ost(U) mgc(U) osd_ldiskfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic libcfs(U) ldiskfs(U) jbd2 nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
08:51:41:
08:51:41:Pid: 4320, comm: qsd_reint_0.lus Not tainted 2.6.32-573.8.1.el6_lustre.gea97898.x86_64 #1 Red Hat KVM
08:51:41:RIP: 0010:[<ffffffff81059911>] [<ffffffff81059911>] __wake_up_common+0x31/0x90
08:51:41:RSP: 0018:ffff88006201bd80 EFLAGS: 00010096
08:51:41:RAX: 5a5a5a5a5a5a5a42 RBX: ffff88002cd908a0 RCX: 0000000000000000
08:51:41:RDX: 5a5a5a5a5a5a5a5a RSI: 0000000000000003 RDI: ffff88002cd908a0
08:51:41:RBP: ffff88006201bdc0 R08: 0000000000000000 R09: 0000000000000000
08:51:41:R10: 0000000000000000 R11: 000000000000000f R12: 0000000000000282
08:51:41:R13: ffff88002cd908a8 R14: 0000000000000000 R15: 0000000000000000
08:51:41:FS: 0000000000000000(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
08:51:41:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
08:51:41:CR2: 00007fd119ac7000 CR3: 000000004aba3000 CR4: 00000000000006e0
08:51:41:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
08:51:41:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
08:51:41:Process qsd_reint_0.lus (pid: 4320, threadinfo ffff880062018000, task ffff88005cbc0ab0)
08:51:41:Stack:
08:51:41: ffff88006201bda0 0000000300000001 ffff88006201be00 ffff88002cd908a0
08:51:41:<d> 0000000000000282 0000000000000003 0000000000000001 0000000000000000
08:51:41:<d> ffff88006201be00 ffffffff8105e168 ffff88006201be00 ffff88002cd90800
08:51:41:Call Trace:
08:51:41: [<ffffffff8105e168>] __wake_up+0x48/0x70
08:51:41: [<ffffffffa0c0a1a3>] qsd_reint_main+0x73/0x1950 [lquota]
08:51:41: [<ffffffff81538dde>] ? thread_return+0x4e/0x7d0
08:51:41: [<ffffffff810672c2>] ? default_wake_function+0x12/0x20
08:51:41: [<ffffffffa0c0a130>] ? qsd_reint_main+0x0/0x1950 [lquota]
08:51:41: [<ffffffff810a0fce>] kthread+0x9e/0xc0
08:51:41: [<ffffffff8100c28a>] child_rip+0xa/0x20
08:51:41: [<ffffffff810a0f30>] ? kthread+0x0/0xc0
08:51:41: [<ffffffff8100c280>] ? child_rip+0x0/0x20
08:51:41:Code: 41 56 41 55 41 54 53 48 83 ec 18 0f 1f 44 00 00 89 75 cc 89 55 c8 4c 8d 6f 08 48 8b 57 08 41 89 cf 4d 89 c6 48 8d 42 e8 49 39 d5 <48> 8b 58 18 74 3f 48 83 eb 18 eb 0a 0f 1f 00 48 89 d8 48 8d 5a
08:51:41:RIP [<ffffffff81059911>] __wake_up_common+0x31/0x90
08:51:41: RSP <ffff88006201bd80>

Logs are at https://testing.hpdd.intel.com/test_sets/2ed0029e-c202-11e5-92cf-5254006e85c2

This is the first time we've seen insanity fail in this way. |
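Note that RAX and RDX in the trace are filled with 0x5a bytes, a freed-memory poison pattern (Lustre's memory debugging overwrites freed allocations with 0x5a), which suggests __wake_up_common() was walking a wait queue embedded in an object that had already been freed. Below is a minimal sketch of that failure mode; the names (reint_state, reint_state_put, reint_thread_buggy) are hypothetical stand-ins, not the actual lquota structures:

#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/wait.h>

/* Hypothetical stand-in for the quota slave's qqi: a refcounted,
 * heap-allocated object with an embedded wait queue. */
struct reint_state {
        wait_queue_head_t       rs_waitq;
        atomic_t                rs_ref;
};

static void reint_state_put(struct reint_state *rs)
{
        /* When the last reference drops, the object is freed and,
         * with memory debugging enabled, poisoned (e.g. with 0x5a). */
        if (atomic_dec_and_test(&rs->rs_ref))
                kfree(rs);
}

/* Buggy ordering: the thread drops what may be the last reference,
 * then touches the embedded wait queue.  If the umount path raced in
 * and dropped its own reference first, rs_waitq now lives in freed,
 * poisoned memory, and wake_up() walks 0x5a5a... list pointers and
 * faults just as in the trace above. */
static int reint_thread_buggy(void *arg)
{
        struct reint_state *rs = arg;

        /* ... reintegration work, aborted here because the import
         * was closed by the umount ... */

        reint_state_put(rs);            /* may free rs */
        wake_up(&rs->rs_waitq);         /* use-after-free -> GPF */
        return 0;
}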
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 25/Jan/16 ] |
|
Hi Niu, |
| Comment by Niu Yawei (Inactive) [ 26/Jan/16 ] |
|
This could be a race when stopping the quota reint thread: the thread should put its qqi reference only after all other operations on the qqi have completed. I'll cook a patch soon. |
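In terms of the sketch in the description above, the reordering Niu describes would look like the following: the final wake-up of anyone waiting on the thread happens while the thread still holds its reference, and the putref is the last thing the thread does with the pointer. This is a hedged outline with the same hypothetical names, not the merged patch:

/* Fixed ordering for the same sketch: every access to state embedded
 * in the object, including the final wake-up, happens while the
 * thread still holds its reference; the putref comes last. */
static int reint_thread_fixed(void *arg)
{
        struct reint_state *rs = arg;

        /* ... reintegration work ... */

        wake_up(&rs->rs_waitq);         /* safe: rs pinned by our ref */
        reint_state_put(rs);            /* last use: rs may be freed here */
        return 0;
}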
| Comment by Gerrit Updater [ 26/Jan/16 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/18142 |
| Comment by Gerrit Updater [ 26/Jan/16 ] |
|
Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/18150 |
| Comment by Gerrit Updater [ 23/Mar/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18142/ |
| Comment by Joseph Gmitter (Inactive) [ 23/Mar/16 ] |
|
Landed for 2.9.0 |