[LU-3270] ptlrpcd strnlen crash trying to log a message - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.7.0
Affects Version/s: None
Labels:
None

Severity:
3
Rank (Obsolete):
8107

Description

Just hit this (use after free, I think) after a few days of recovery-small:

[307783.003678] Lustre: DEBUG MARKER: == recovery-small test complete, duration 1814 sec == 08:30:02 (1367584202)
[307783.153758] BUG: unable to handle kernel paging request at ffff880097114fa8
[307783.154227] IP: [<ffffffff8127cc39>] strnlen+0x9/0x40
[307783.154604] PGD 1a26063 PUD 501067 PMD 5ba067 PTE 8000000097114163
[307783.155028] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[307783.155408] last sysfs file: /sys/devices/system/cpu/possible
[307783.155791] CPU 0 
[307783.155847] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs]
[307783.156905] 
[307783.156905] Pid: 17616, comm: ptlrpcd_3 Not tainted 2.6.32-rhe6.4-debug #2 Bochs Bochs
[307783.156905] RIP: 0010:[<ffffffff8127cc39>]  [<ffffffff8127cc39>] strnlen+0x9/0x40
[307783.156905] RSP: 0018:ffff8800af8e9980  EFLAGS: 00010206
[307783.156905] RAX: ffffffff817a6cc1 RBX: ffff8800b00e1000 RCX: 0000000000000001
[307783.156905] RDX: ffff880097114fa8 RSI: 0000000000000017 RDI: ffff880097114fa8
[307783.156905] RBP: ffff8800af8e9980 R08: 000000000000002e R09: 0000000000000053
[307783.156905] R10: 0000000000000001 R11: 0000000000000104 R12: ffff8800b00e0a4b
[307783.156905] R13: ffff880097114fa8 R14: 00000000ffffffff R15: 0000000000000000
[307783.156905] FS:  00007fbed06dc700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000
[307783.156905] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[307783.156905] CR2: ffff880097114fa8 CR3: 0000000001a25000 CR4: 00000000000006f0
[307783.156905] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[307783.156905] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[307783.156905] Process ptlrpcd_3 (pid: 17616, threadinfo ffff8800af8e8000, task ffff8800b08aa5c0)
[307783.156905] Stack:
[307783.156905]  ffff8800af8e99c0 ffffffff8127def0 ffff880099facdf0 ffff8800b00e0a4b
[307783.156905] <d> ffffffffa13328e0 ffffffffa13328df ffff8800af8e9b40 ffff8800b00e1000
[307783.156905] <d> ffff8800af8e9a60 ffffffff8127f338 0000000000000004 0000000affffffff
[307783.156905] Call Trace:
[307783.156905]  [<ffffffff8127def0>] string+0x40/0x100
[307783.156905]  [<ffffffff8127f338>] vsnprintf+0x218/0x5e0
[307783.156905]  [<ffffffffa0a9127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
[307783.156905]  [<ffffffffa0aa1b6b>] libcfs_debug_vmsg2+0x2fb/0xab0 [libcfs]
[307783.156905]  [<ffffffffa0aa2361>] libcfs_debug_msg+0x41/0x50 [libcfs]
[307783.156905]  [<ffffffffa0a9cde8>] ? libcfs_log_return+0x28/0x40 [libcfs]
[307783.156905]  [<ffffffffa132c2e5>] mdc_finish_intent_lock+0x535/0x7f0 [mdc]
[307783.156905]  [<ffffffffa0a9cde8>] ? libcfs_log_return+0x28/0x40 [libcfs]
[307783.156905]  [<ffffffffa132d388>] ? mdc_finish_enqueue+0x208/0x1180 [mdc]
[307783.156905]  [<ffffffffa132e626>] mdc_intent_getattr_async_interpret+0x326/0x4c0 [mdc]
[307783.156905]  [<ffffffffa0ea46b2>] ptlrpc_check_set+0x2b2/0x1da0 [ptlrpc]
[307783.156905]  [<ffffffffa0ed1e1b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
[307783.156905]  [<ffffffffa0ed2363>] ptlrpcd+0x233/0x390 [ptlrpc]
[307783.156905]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
[307783.156905]  [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc]
[307783.156905]  [<ffffffff8100c10a>] child_rip+0xa/0x20
[307783.156905]  [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc]
[307783.156905]  [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc]
[307783.156905]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
[307783.156905] Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00

Crashdump and modules are in /exports/crashdumps/192.168.10.217-2013-05-03-08\:30\:08/
source tag is 20130430

Attachments

Issue Links

is duplicated by

LU-6439 sanity test_120g: panic on client

Resolved

is related to

LU-8312 (statahead.c:351:do_sa_entry_fini()) ASSERTION( !ll_sa_entry_unhashed(entry) )

Open

LU-4624 ll_stop_statahead deadlock

Resolved

LU-4451 Kernel Oops with NFS reexport using mainline 3.12 client

Resolved

Activity

[LU-3270] ptlrpcd strnlen crash trying to log a message

Lai Siyao added a comment - 24/Apr/14 2:17 AM

Thank you, I'll fix it in related patch soon.

Lai Siyao added a comment - 24/Apr/14 2:17 AM Thank you, I'll fix it in related patch soon.

Ann Koehler (Inactive) added a comment - 23/Apr/14 5:50 PM

>> Any idea why statahead may not have gotten started? The clue is probably this error message:
>>
>> (statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080

Finally figured out the cause of this bug. The 64-bit output of kthread_run is being assigned to an int and truncated. In the cases where a valid task pointer, when interpreted as an int, falls between 0 and -4095 it is treated as an error. The code in ll_start_agl handles the return from kthread_run correctly.

Ann Koehler (Inactive) added a comment - 23/Apr/14 5:50 PM >> Any idea why statahead may not have gotten started? The clue is probably this error message: >> >> (statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080 Finally figured out the cause of this bug. The 64-bit output of kthread_run is being assigned to an int and truncated. In the cases where a valid task pointer, when interpreted as an int, falls between 0 and -4095 it is treated as an error. The code in ll_start_agl handles the return from kthread_run correctly.

Ann Koehler (Inactive) added a comment - 07/Feb/14 10:06 PM

With patchset 20 backported to 2.5, an ll_agl_XXXX thread occasionally hits a general protection fault in __wake_up. In all cases, the corresponding ll_sa_XXXX thread has terminated. Looks like the sai struct has been deallocated so the ll_agl_XXXX thread panics when it goes to access it.

Dump uploaded to: ftp.whamcloud.com:/upload/~~LU-3270~~/~~LU-3270~~_ll_agl_gpf.tgz
Client is Lustre 2.5 with ~~LU-3270~~ PS20 running on SLES11 SP3

crash> ps | grep ll_sa
  20979      2   2  ffff88080f12f7e0  IN   0.0       0      0  [ll_sa_20914]
crash> ps | grep ll_agl
> 20978      2   0  ffff8807d9783080  RU   0.0       0      0  [ll_agl_20915]
  20980      2   3  ffff88080f12f040  IN   0.0       0      0  [ll_agl_20914]
crash> bt
PID: 20978  TASK: ffff8807d9783080  CPU: 0   COMMAND: "ll_agl_20915"
 #0 [ffff8807d9c1fcc0] die at ffffffff81005f6b
 #1 [ffff8807d9c1fcf0] do_general_protection at ffffffff8100361a
 #2 [ffff8807d9c1fd20] general_protection at ffffffff813628df
    [exception RIP: __wake_up_common+49]
    RIP: ffffffff8102fc91  RSP: ffff8807d9c1fdd0  RFLAGS: 00010096
    RAX: 5a5a5a5a5a5a5a5a  RBX: ffff88080095b118  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff88080095b118
    RBP: ffff8807d9c1fe00   R8: 5a5a5a5a5a5a5a42   R9: 0000000000000000
    R10: 0000000000000001  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000001  R14: ffff88080095b120  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #3 [ffff8807d9c1fe08] __wake_up at ffffffff81032b68
 #4 [ffff8807d9c1fe48] ll_agl_thread at ffffffffa0795aa2 [lustre]
 #5 [ffff8807d9c1fee8] kthread at ffffffff8106610e
 #6 [ffff8807d9c1ff48] kernel_thread_helper at ffffffff81364034

crash> ll_inode_info ffff8806325a9580 | grep magic
  lli_inode_magic = 287116773,
crash> ll_inode_info ffff8806325a9580 | grep sai
      d_sai = 0x0,

Ann Koehler (Inactive) added a comment - 07/Feb/14 10:06 PM With patchset 20 backported to 2.5, an ll_agl_XXXX thread occasionally hits a general protection fault in __wake_up. In all cases, the corresponding ll_sa_XXXX thread has terminated. Looks like the sai struct has been deallocated so the ll_agl_XXXX thread panics when it goes to access it. Dump uploaded to: ftp.whamcloud.com:/upload/ LU-3270 / LU-3270 _ll_agl_gpf.tgz Client is Lustre 2.5 with LU-3270 PS20 running on SLES11 SP3 crash> ps | grep ll_sa 20979 2 2 ffff88080f12f7e0 IN 0.0 0 0 [ll_sa_20914] crash> ps | grep ll_agl > 20978 2 0 ffff8807d9783080 RU 0.0 0 0 [ll_agl_20915] 20980 2 3 ffff88080f12f040 IN 0.0 0 0 [ll_agl_20914] crash> bt PID: 20978 TASK: ffff8807d9783080 CPU: 0 COMMAND: "ll_agl_20915" #0 [ffff8807d9c1fcc0] die at ffffffff81005f6b #1 [ffff8807d9c1fcf0] do_general_protection at ffffffff8100361a #2 [ffff8807d9c1fd20] general_protection at ffffffff813628df [exception RIP: __wake_up_common+49] RIP: ffffffff8102fc91 RSP: ffff8807d9c1fdd0 RFLAGS: 00010096 RAX: 5a5a5a5a5a5a5a5a RBX: ffff88080095b118 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88080095b118 RBP: ffff8807d9c1fe00 R8: 5a5a5a5a5a5a5a42 R9: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000001 R14: ffff88080095b120 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #3 [ffff8807d9c1fe08] __wake_up at ffffffff81032b68 #4 [ffff8807d9c1fe48] ll_agl_thread at ffffffffa0795aa2 [lustre] #5 [ffff8807d9c1fee8] kthread at ffffffff8106610e #6 [ffff8807d9c1ff48] kernel_thread_helper at ffffffff81364034 crash> ll_inode_info ffff8806325a9580 | grep magic lli_inode_magic = 287116773, crash> ll_inode_info ffff8806325a9580 | grep sai d_sai = 0x0,

Phil Schwan (Inactive) added a comment - 28/Dec/13 11:55 PM

For what it's worth, we've also hit the "can't start ll_sa thread" bug four times, though with a slightly different followup assertion:

LustreError: 14399:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread: rc: -2752
LustreError: 14399:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed

I don't have all four crashes, but I believe that rc was different each time.

Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-279.14.1.el6.x86_64
Linux hnod0032 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Phil Schwan (Inactive) added a comment - 28/Dec/13 11:55 PM For what it's worth, we've also hit the "can't start ll_sa thread" bug four times, though with a slightly different followup assertion: LustreError: 14399:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread: rc: -2752 LustreError: 14399:0:(statahead.c:1704:do_statahead_enter()) ASSERTION( lli->u.d.d_sai == ((void *)0) ) failed I don't have all four crashes, but I believe that rc was different each time. Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-279.14.1.el6.x86_64 Linux hnod0032 2.6.32-279.19.1.el6.x86_64 #1 SMP Wed Dec 19 07:05:20 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Ann Koehler (Inactive) added a comment - 06/Dec/13 3:27 PM

The client, where the LBUG occurred, was running 2.5 on SLES 11 SP3 (Linux 3.0.82).
Servers were Lustre 2.5 on CentOS (2.6.32).

Ann Koehler (Inactive) added a comment - 06/Dec/13 3:27 PM The client, where the LBUG occurred, was running 2.5 on SLES 11 SP3 (Linux 3.0.82). Servers were Lustre 2.5 on CentOS (2.6.32).

Lai Siyao added a comment - 06/Dec/13 2:45 AM

Hmm, 2080 doesn't look to be a valid error number. What kernel version are you using?

Lai Siyao added a comment - 06/Dec/13 2:45 AM Hmm, 2080 doesn't look to be a valid error number. What kernel version are you using?

Ann Koehler (Inactive) added a comment - 05/Dec/13 9:27 PM

"you hit that assert because statahead is not started successfully"

Any idea why statahead may not have gotten started? The clue is probably this error message:

(statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080

We do have a dump for the LBUG Patrick reported. I just haven't had time to do the analysis. Any guesses would be appreciated.

Ann Koehler (Inactive) added a comment - 05/Dec/13 9:27 PM "you hit that assert because statahead is not started successfully" Any idea why statahead may not have gotten started? The clue is probably this error message: (statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080 We do have a dump for the LBUG Patrick reported. I just haven't had time to do the analysis. Any guesses would be appreciated.

Lai Siyao added a comment - 05/Dec/13 2:38 AM

Patrick, you hit that assert because statahead is not started successfully. By design, sai has a refcount in allocation, and it will be put in ll_stop_statahead(), which will set opendir_pid to 0. And in normal situation, ll_stop_statahead() will be called in file release or statahead ratio is too low. So it means this function will be called before sai put its last refcount, which means this assert should be valid.

I'll update the patch to handle this corner case.

Lai Siyao added a comment - 05/Dec/13 2:38 AM Patrick, you hit that assert because statahead is not started successfully. By design, sai has a refcount in allocation, and it will be put in ll_stop_statahead(), which will set opendir_pid to 0. And in normal situation, ll_stop_statahead() will be called in file release or statahead ratio is too low. So it means this function will be called before sai put its last refcount, which means this assert should be valid. I'll update the patch to handle this corner case.

Patrick Farrell (Inactive) added a comment - 04/Dec/13 4:34 PM

This is the cover message from my Gerrit review, placed here so it's more readable:

We hit the following assertion while running with this patch:

> 02:54:14 LustreError: 17617:0:(statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080
> 02:54:14 LustreError: 17617:0:(statahead.c:473:ll_sai_put()) ASSERTION( lli->u.d.d_opendir_pid == 0 ) failed:
> 02:54:14 LustreError: 17617:0:(statahead.c:473:ll_sai_put()) LBUG
> 02:54:14 Pid: 17617, comm: du
> 02:54:14 Call Trace:
> 02:54:14 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
> 02:54:14 [<ffffffff81004919>] dump_trace+0x89/0x450
> 02:54:14 [<ffffffffa022a8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
> 02:54:14 [<ffffffffa022ae37>] lbug_with_loc+0x47/0xc0 [libcfs]
> 02:54:14 [<ffffffffa085fd14>] ll_sai_put+0x574/0x6e0 [lustre]
> 02:54:14 [<ffffffffa0862e48>] do_statahead_enter+0x1598/0x19c0 [lustre]
> 02:54:14 [<ffffffffa08003d0>] ll_revalidate_it+0xb10/0x1b10 [lustre]
> 02:54:14 [<ffffffffa08015ce>] ll_revalidate_nd+0x1fe/0x3e0 [lustre]
> 02:54:14 [<ffffffff81158f64>] do_lookup+0x134/0x3b0
> 02:54:14 [<ffffffff8115b913>] path_lookupat+0xc3/0x5f0
> 02:54:14 [<ffffffff8115be75>] do_path_lookup+0x35/0xd0
> 02:54:14 [<ffffffff8115cbc3>] user_path_at_empty+0x83/0xb0
> 02:54:14 [<ffffffff8115cc01>] user_path_at+0x11/0x20
> 02:54:14 [<ffffffff81151775>] vfs_fstatat+0x55/0x90
> 02:54:14 [<ffffffff811517d4>] sys_newfstatat+0x24/0x50
> 02:54:14 [<ffffffff813c836b>] system_call_fastpath+0x16/0x1b
> 02:54:14 [<00002aaaaad977ee>] 0x2aaaaad977ee

This assertion was added in the statahead patch we took in the patch from 803507.

(Note: #define lli_opendir_pid u.d.d_opendir_pid)

In particular, without the patch, ll_sai_put sets lli->u.d.d_opendir_pid to 0. The patch changes it to assert that lli->u.d.d_opendir_pid is 0.

Looking at some of the uses of ll_sai_put, we can see that lli->u.d.d_opendir_pid is sometimes set to 0 before ll_sai_put is called called, but also that this is not done in do_statahead_enter.

The inline comments give more detail. I've pointed out some of the ll_sai_put calls that appear problematic, but it's also called from other places and I'm not sure the PID is necessarily zero in all of them.

Patrick Farrell (Inactive) added a comment - 04/Dec/13 4:34 PM This is the cover message from my Gerrit review, placed here so it's more readable: We hit the following assertion while running with this patch: > 02:54:14 LustreError: 17617:0:(statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080 > 02:54:14 LustreError: 17617:0:(statahead.c:473:ll_sai_put()) ASSERTION( lli->u.d.d_opendir_pid == 0 ) failed: > 02:54:14 LustreError: 17617:0:(statahead.c:473:ll_sai_put()) LBUG > 02:54:14 Pid: 17617, comm: du > 02:54:14 Call Trace: > 02:54:14 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0 > 02:54:14 [<ffffffff81004919>] dump_trace+0x89/0x450 > 02:54:14 [<ffffffffa022a8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs] > 02:54:14 [<ffffffffa022ae37>] lbug_with_loc+0x47/0xc0 [libcfs] > 02:54:14 [<ffffffffa085fd14>] ll_sai_put+0x574/0x6e0 [lustre] > 02:54:14 [<ffffffffa0862e48>] do_statahead_enter+0x1598/0x19c0 [lustre] > 02:54:14 [<ffffffffa08003d0>] ll_revalidate_it+0xb10/0x1b10 [lustre] > 02:54:14 [<ffffffffa08015ce>] ll_revalidate_nd+0x1fe/0x3e0 [lustre] > 02:54:14 [<ffffffff81158f64>] do_lookup+0x134/0x3b0 > 02:54:14 [<ffffffff8115b913>] path_lookupat+0xc3/0x5f0 > 02:54:14 [<ffffffff8115be75>] do_path_lookup+0x35/0xd0 > 02:54:14 [<ffffffff8115cbc3>] user_path_at_empty+0x83/0xb0 > 02:54:14 [<ffffffff8115cc01>] user_path_at+0x11/0x20 > 02:54:14 [<ffffffff81151775>] vfs_fstatat+0x55/0x90 > 02:54:14 [<ffffffff811517d4>] sys_newfstatat+0x24/0x50 > 02:54:14 [<ffffffff813c836b>] system_call_fastpath+0x16/0x1b > 02:54:14 [<00002aaaaad977ee>] 0x2aaaaad977ee This assertion was added in the statahead patch we took in the patch from 803507. (Note: #define lli_opendir_pid u.d.d_opendir_pid) In particular, without the patch, ll_sai_put sets lli->u.d.d_opendir_pid to 0. The patch changes it to assert that lli->u.d.d_opendir_pid is 0. Looking at some of the uses of ll_sai_put, we can see that lli->u.d.d_opendir_pid is sometimes set to 0 before ll_sai_put is called called, but also that this is not done in do_statahead_enter. The inline comments give more detail. I've pointed out some of the ll_sai_put calls that appear problematic, but it's also called from other places and I'm not sure the PID is necessarily zero in all of them.

Ann Koehler (Inactive) added a comment - 21/Nov/13 6:56 PM

The patch http://review.whamcloud.com/#change,6392 has not caused any regressions when run through Cray's vers system test suite.

We adopted the patch because we encountered the LBUG:

> 14:06:40 LustreError: 16861:0:(statahead.c:351:do_sa_entry_fini()) ASSERTION( !ll_sa_entry_unhashed(entry) ) failed: 
> 14:06:40 LustreError: 16861:0:(statahead.c:351:do_sa_entry_fini()) LBUG
> 14:06:40 Pid: 16861, comm: DeletionService
> 14:06:40 Call Trace:
> 14:06:40 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0
> 14:06:40 [<ffffffff81004849>] dump_trace+0x89/0x450
> 14:06:40 [<ffffffffa02278d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
> 14:06:40 [<ffffffffa0227e37>] lbug_with_loc+0x47/0xc0 [libcfs]
> 14:06:40 [<ffffffffa0888b02>] ll_sai_unplug+0x4b2/0x580 [lustre]
> 14:06:40 [<ffffffffa088b45b>] do_statahead_enter+0x32b/0x1900 [lustre]
> 14:06:40 [<ffffffffa08792bc>] ll_lookup_it+0x9cc/0xb00 [lustre]
> 14:06:40 [<ffffffffa087947c>] ll_lookup_nd+0x8c/0x3e0 [lustre]
> 14:06:41 [<ffffffff8114dacc>] d_alloc_and_lookup+0x4c/0x80
> 14:06:41 [<ffffffff8114f13e>] do_lookup+0x2ae/0x3b0
> 14:06:41 [<ffffffff811519b3>] path_lookupat+0xc3/0x5f0
> 14:06:41 [<ffffffff81151f15>] do_path_lookup+0x35/0xd0
> 14:06:41 [<ffffffff81152c63>] user_path_at_empty+0x83/0xb0
> 14:06:41 [<ffffffff81152ca1>] user_path_at+0x11/0x20
> 14:06:41 [<ffffffff81147905>] vfs_fstatat+0x55/0x90
> 14:06:41 [<ffffffff81147a6b>] vfs_stat+0x1b/0x20
> 14:06:41 [<ffffffff81147a94>] sys_newstat+0x24/0x50
> 14:06:41 [<ffffffff8138ce6b>] system_call_fastpath+0x16/0x1b
> 14:06:41 [<00002aaaab5dc325>] 0x2aaaab5dc325
> 14:06:41 Kernel panic - not syncing: LBUG

Analysis of the associated dump showed that lli_sai was not NULL on entry to do_statahead_enter() but was NULL at the time of the ASSERTION. In the version of Lustre being run, do_statahead_enter() did not acquire the ll_sa_lock allowing the lli_sai to be changed. Patch 6392 adds lock/unlock calls on ll_sa_lock. With the patch, the LBUG has not been reproduced. The rest of the patch looks like it might have a positive effect on a ll_sa hang that we see occasionally but the verdict is still out.

Ann Koehler (Inactive) added a comment - 21/Nov/13 6:56 PM The patch http://review.whamcloud.com/#change,6392 has not caused any regressions when run through Cray's vers system test suite. We adopted the patch because we encountered the LBUG: > 14:06:40 LustreError: 16861:0:(statahead.c:351:do_sa_entry_fini()) ASSERTION( !ll_sa_entry_unhashed(entry) ) failed: > 14:06:40 LustreError: 16861:0:(statahead.c:351:do_sa_entry_fini()) LBUG > 14:06:40 Pid: 16861, comm: DeletionService > 14:06:40 Call Trace: > 14:06:40 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0 > 14:06:40 [<ffffffff81004849>] dump_trace+0x89/0x450 > 14:06:40 [<ffffffffa02278d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs] > 14:06:40 [<ffffffffa0227e37>] lbug_with_loc+0x47/0xc0 [libcfs] > 14:06:40 [<ffffffffa0888b02>] ll_sai_unplug+0x4b2/0x580 [lustre] > 14:06:40 [<ffffffffa088b45b>] do_statahead_enter+0x32b/0x1900 [lustre] > 14:06:40 [<ffffffffa08792bc>] ll_lookup_it+0x9cc/0xb00 [lustre] > 14:06:40 [<ffffffffa087947c>] ll_lookup_nd+0x8c/0x3e0 [lustre] > 14:06:41 [<ffffffff8114dacc>] d_alloc_and_lookup+0x4c/0x80 > 14:06:41 [<ffffffff8114f13e>] do_lookup+0x2ae/0x3b0 > 14:06:41 [<ffffffff811519b3>] path_lookupat+0xc3/0x5f0 > 14:06:41 [<ffffffff81151f15>] do_path_lookup+0x35/0xd0 > 14:06:41 [<ffffffff81152c63>] user_path_at_empty+0x83/0xb0 > 14:06:41 [<ffffffff81152ca1>] user_path_at+0x11/0x20 > 14:06:41 [<ffffffff81147905>] vfs_fstatat+0x55/0x90 > 14:06:41 [<ffffffff81147a6b>] vfs_stat+0x1b/0x20 > 14:06:41 [<ffffffff81147a94>] sys_newstat+0x24/0x50 > 14:06:41 [<ffffffff8138ce6b>] system_call_fastpath+0x16/0x1b > 14:06:41 [<00002aaaab5dc325>] 0x2aaaab5dc325 > 14:06:41 Kernel panic - not syncing: LBUG Analysis of the associated dump showed that lli_sai was not NULL on entry to do_statahead_enter() but was NULL at the time of the ASSERTION. In the version of Lustre being run, do_statahead_enter() did not acquire the ll_sa_lock allowing the lli_sai to be changed. Patch 6392 adds lock/unlock calls on ll_sa_lock. With the patch, the LBUG has not been reproduced. The rest of the patch looks like it might have a positive effect on a ll_sa hang that we see occasionally but the verdict is still out.

Lai Siyao added a comment - 20/May/13 1:40 AM

My patch is on http://review.whamcloud.com/#change,6392

Lai Siyao added a comment - 20/May/13 1:40 AM My patch is on http://review.whamcloud.com/#change,6392

People

Assignee:: Lai Siyao

Reporter:: Oleg Drokin

Votes:: 1 Vote for this issue

Watchers:: 18 Start watching this issue

Dates

Created:: 03/May/13 6:17 PM

Updated:: 10/Sep/18 5:26 PM

Resolved:: 21/Nov/14 4:10 PM