[LU-3270] ptlrpcd strnlen crash trying to log a message Created: 03/May/13 Updated: 10/Sep/18 Resolved: 21/Nov/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Oleg Drokin | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Severity: | 3 | ||||||||||||||||||||||||
| Rank (Obsolete): | 8107 | ||||||||||||||||||||||||
| Description |
|
Just hit this (use after free, I think) after a few days of recovery-small: [307783.003678] Lustre: DEBUG MARKER: == recovery-small test complete, duration 1814 sec == 08:30:02 (1367584202) [307783.153758] BUG: unable to handle kernel paging request at ffff880097114fa8 [307783.154227] IP: [<ffffffff8127cc39>] strnlen+0x9/0x40 [307783.154604] PGD 1a26063 PUD 501067 PMD 5ba067 PTE 8000000097114163 [307783.155028] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [307783.155408] last sysfs file: /sys/devices/system/cpu/possible [307783.155791] CPU 0 [307783.155847] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs] [307783.156905] [307783.156905] Pid: 17616, comm: ptlrpcd_3 Not tainted 2.6.32-rhe6.4-debug #2 Bochs Bochs [307783.156905] RIP: 0010:[<ffffffff8127cc39>] [<ffffffff8127cc39>] strnlen+0x9/0x40 [307783.156905] RSP: 0018:ffff8800af8e9980 EFLAGS: 00010206 [307783.156905] RAX: ffffffff817a6cc1 RBX: ffff8800b00e1000 RCX: 0000000000000001 [307783.156905] RDX: ffff880097114fa8 RSI: 0000000000000017 RDI: ffff880097114fa8 [307783.156905] RBP: ffff8800af8e9980 R08: 000000000000002e R09: 0000000000000053 [307783.156905] R10: 0000000000000001 R11: 0000000000000104 R12: ffff8800b00e0a4b [307783.156905] R13: ffff880097114fa8 R14: 00000000ffffffff R15: 0000000000000000 [307783.156905] FS: 00007fbed06dc700(0000) GS:ffff880006200000(0000) knlGS:0000000000000000 [307783.156905] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [307783.156905] CR2: ffff880097114fa8 CR3: 0000000001a25000 CR4: 00000000000006f0 [307783.156905] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [307783.156905] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [307783.156905] Process ptlrpcd_3 (pid: 17616, threadinfo ffff8800af8e8000, task ffff8800b08aa5c0) [307783.156905] Stack: [307783.156905] ffff8800af8e99c0 ffffffff8127def0 ffff880099facdf0 ffff8800b00e0a4b [307783.156905] <d> ffffffffa13328e0 ffffffffa13328df ffff8800af8e9b40 ffff8800b00e1000 [307783.156905] <d> ffff8800af8e9a60 ffffffff8127f338 0000000000000004 0000000affffffff [307783.156905] Call Trace: [307783.156905] [<ffffffff8127def0>] string+0x40/0x100 [307783.156905] [<ffffffff8127f338>] vsnprintf+0x218/0x5e0 [307783.156905] [<ffffffffa0a9127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs] [307783.156905] [<ffffffffa0aa1b6b>] libcfs_debug_vmsg2+0x2fb/0xab0 [libcfs] [307783.156905] [<ffffffffa0aa2361>] libcfs_debug_msg+0x41/0x50 [libcfs] [307783.156905] [<ffffffffa0a9cde8>] ? libcfs_log_return+0x28/0x40 [libcfs] [307783.156905] [<ffffffffa132c2e5>] mdc_finish_intent_lock+0x535/0x7f0 [mdc] [307783.156905] [<ffffffffa0a9cde8>] ? libcfs_log_return+0x28/0x40 [libcfs] [307783.156905] [<ffffffffa132d388>] ? mdc_finish_enqueue+0x208/0x1180 [mdc] [307783.156905] [<ffffffffa132e626>] mdc_intent_getattr_async_interpret+0x326/0x4c0 [mdc] [307783.156905] [<ffffffffa0ea46b2>] ptlrpc_check_set+0x2b2/0x1da0 [ptlrpc] [307783.156905] [<ffffffffa0ed1e1b>] ptlrpcd_check+0x55b/0x590 [ptlrpc] [307783.156905] [<ffffffffa0ed2363>] ptlrpcd+0x233/0x390 [ptlrpc] [307783.156905] [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20 [307783.156905] [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc] [307783.156905] [<ffffffff8100c10a>] child_rip+0xa/0x20 [307783.156905] [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc] [307783.156905] [<ffffffffa0ed2130>] ? ptlrpcd+0x0/0x390 [ptlrpc] [307783.156905] [<ffffffff8100c100>] ? child_rip+0x0/0x20 [307783.156905] Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 Crashdump and modules are in /exports/crashdumps/192.168.10.217-2013-05-03-08\:30\:08/ |
| Comments |
| Comment by Oleg Drokin [ 05/May/13 ] |
|
Hit it once more [111421.423886] Lustre: DEBUG MARKER: == recovery-small test complete, duration 1966 sec == 21:24:24 (1367717064) [111426.185618] BUG: unable to handle kernel paging request at ffff8800968dafa8 [111426.185917] IP: [<ffffffff8127cc39>] strnlen+0x9/0x40 [111426.186162] PGD 1a26063 PUD 501067 PMD 5b6067 PTE 80000000968da163 [111426.186438] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [111426.186686] last sysfs file: /sys/devices/system/cpu/possible [111426.186934] CPU 7 [111426.186970] Modules linked in: lustre ofd osp lod ost mdt osd_ldiskfs fsfilt_ldiskfs ldiskfs mdd mgs lquota obdecho mgc lov osc mdc lmv fid fld ptlrpc obdclass lvfs ksocklnd lnet libcfs exportfs jbd sha512_generic sha256_generic ext4 mbcache jbd2 virtio_balloon virtio_console i2c_piix4 i2c_core virtio_blk virtio_net virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod nfs lockd fscache auth_rpcgss nfs_acl sunrpc be2iscsi bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: libcfs] [111426.189508] [111426.189508] Pid: 26777, comm: ptlrpcd_1 Not tainted 2.6.32-rhe6.4-debug #2 Bochs Bochs [111426.189508] RIP: 0010:[<ffffffff8127cc39>] [<ffffffff8127cc39>] strnlen+0x9/0x40 [111426.189508] RSP: 0018:ffff8800b787d980 EFLAGS: 00010202 [111426.189508] RAX: ffffffff817a6cc1 RBX: ffff8800650bc000 RCX: 0000000000000001 [111426.189508] RDX: ffff8800968dafa8 RSI: 000000000000001a RDI: ffff8800968dafa8 [111426.189508] RBP: ffff8800b787d980 R08: 000000000000002e R09: 0000000000000053 [111426.189508] R10: 0000000000000001 R11: 000000000000006e R12: ffff8800650bb205 [111426.189508] R13: ffff8800968dafa8 R14: 00000000ffffffff R15: 0000000000000000 [111426.189508] FS: 00007fbb99048700(0000) GS:ffff8800063c0000(0000) knlGS:0000000000000000 [111426.189508] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [111426.189508] CR2: ffff8800968dafa8 CR3: 0000000001a25000 CR4: 00000000000006e0 [111426.189508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [111426.189508] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [111426.189508] Process ptlrpcd_1 (pid: 26777, threadinfo ffff8800b787c000, task ffff8800b706e440) [111426.189508] Stack: [111426.189508] ffff8800b787d9c0 ffffffff8127def0 ffff8800b06c3df0 ffff8800650bb205 [111426.189508] <d> ffffffffa08488e0 ffffffffa08488df ffff8800b787db40 ffff8800650bc000 [111426.189508] <d> ffff8800b787da60 ffffffff8127f338 0000000000000004 0000000affffffff [111426.189508] Call Trace: [111426.189508] [<ffffffff8127def0>] string+0x40/0x100 [111426.189508] [<ffffffff8127f338>] vsnprintf+0x218/0x5e0 [111426.189508] [<ffffffffa08b627b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs] [111426.189508] [<ffffffffa08c6b6b>] libcfs_debug_vmsg2+0x2fb/0xab0 [libcfs] [111426.189508] [<ffffffffa08c7361>] libcfs_debug_msg+0x41/0x50 [libcfs] [111426.189508] [<ffffffffa08c1de8>] ? libcfs_log_return+0x28/0x40 [libcfs] [111426.189508] [<ffffffffa08422e5>] mdc_finish_intent_lock+0x535/0x7f0 [mdc] [111426.189508] [<ffffffffa08c1de8>] ? libcfs_log_return+0x28/0x40 [libcfs] [111426.189508] [<ffffffffa0843388>] ? mdc_finish_enqueue+0x208/0x1180 [mdc] [111426.189508] [<ffffffffa0844626>] mdc_intent_getattr_async_interpret+0x326/0x4c0 [mdc] [111426.189508] [<ffffffffa06346b2>] ptlrpc_check_set+0x2b2/0x1da0 [ptlrpc] [111426.189508] [<ffffffffa0661e2b>] ptlrpcd_check+0x55b/0x590 [ptlrpc] [111426.189508] [<ffffffffa0662373>] ptlrpcd+0x233/0x390 [ptlrpc] [111426.189508] [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20 [111426.189508] [<ffffffffa0662140>] ? ptlrpcd+0x0/0x390 [ptlrpc] [111426.189508] [<ffffffff8100c10a>] child_rip+0xa/0x20 [111426.189508] [<ffffffffa0662140>] ? ptlrpcd+0x0/0x390 [ptlrpc] [111426.189508] [<ffffffffa0662140>] ? ptlrpcd+0x0/0x390 [ptlrpc] [111426.189508] [<ffffffff8100c100>] ? child_rip+0x0/0x20 [111426.189508] Code: 66 90 48 83 c2 01 80 3a 00 75 f7 48 89 d0 48 29 f8 c9 c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 85 f6 48 89 e5 74 2e <80> 3f 00 74 29 48 83 ee 01 48 89 f8 eb 12 66 0f 1f 84 00 00 00 Crashdump in /exports/crashdumps/192.168.10.223-2013-05-04-21\:24\:32 |
| Comment by Oleg Drokin [ 15/May/13 ] |
|
Well, this one seems to have turned into a very frequent occurence for me, so I dug into it some more. statahead thread creates a statahead request and pushes it to ptlrpcd, then after a while it decides to exit before the reply to that request is received. sa_entry refcount at 1 then dropped to 0 (decreased from 2 at ll_statehaed_one): static void ll_sa_entry_cleanup(struct ll_statahead_info *sai,
struct ll_sa_entry *entry)
{
struct md_enqueue_info *minfo = entry->se_minfo;
struct ptlrpc_request *req = entry->se_req;
if (minfo) {
entry->se_minfo = NULL;
ll_intent_release(&minfo->mi_it);
iput(minfo->mi_dir);
OBD_FREE_PTR(minfo);
}
if (req) {
entry->se_req = NULL;
ptlrpc_req_finished(req);
}
}
This code frees minfo that is still accessed in mdc_intent_getattr_async_interpret: static int mdc_intent_getattr_async_interpret(const struct lu_env *env,
struct ptlrpc_request *req,
void *args, int rc)
{
struct mdc_getattr_args *ga = args;
struct obd_export *exp = ga->ga_exp;
struct md_enqueue_info *minfo = ga->ga_minfo;
...
rc = mdc_finish_intent_lock(exp, req, &minfo->mi_data, it, lockh);
...
}
mdc_finish_intent_lock tries to print op_data from minfo and dies deep in vsprintf stepping on freed memory. And here's a bit from the log while I have it: 00000100:00000040:6.0:1368112087.663178:0:6676:0:(ptlrpcd.c:271:ptlrpcd_add_req()) @@@ add req [ffff88005d64f7f0] to pc [ptlrpcd_7:7] req@ffff88005d64f7f0 x1434567312028328/t0(0) o101->lustre-MDT0000-mdc-ffff8800682927f0@0@lo:12/10 lens 600/1080 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1 00000080:00400000:6.0:1368112087.664185:0:6676:0:(statahead.c:1279:ll_statahead_thread()) stop agl thread: [pid 6676] 00000080:00000001:6.0:1368112087.664247:0:6676:0:(statahead.c:554:ll_sai_put()) Process leaving 00000080:00400000:6.0:1368112087.664247:0:6676:0:(statahead.c:1306:ll_statahead_thread()) statahead thread stopped: [pid 6676] [parent d51] … 00000002:00100000:2.0:1368112087.664825:0:19458:0:(mdc_locks.c:579:mdc_finish_enqueue()) @@@ op: 8 disposition: 7, status: 0 req@ffff88005d64f7f0 x1434567312028328/t0(0) o101->lustre-MDT0000-mdc-ffff8800682927f0@0@lo:12/10 lens 600/536 e 0 to 0 dl 1368112104 ref 1 fl Interpret:R/0/0 rc 301/301 |
| Comment by Oleg Drokin [ 15/May/13 ] |
|
here's my patch created after discussion with Fan Yong: http://review.whamcloud.com/6348 |
| Comment by nasf (Inactive) [ 16/May/13 ] |
|
Hi Oleg, I found that the race may be NOT exist. Sorry I did NOT realise that when discussed with you. static void ll_sa_entry_cleanup(struct ll_statahead_info *sai, struct ll_sa_entry *entry) { ==> struct md_enqueue_info *minfo = entry->se_minfo; struct ptlrpc_request *req = entry->se_req; ==> if (minfo) { entry->se_minfo = NULL; ll_intent_release(&minfo->mi_it); iput(minfo->mi_dir); OBD_FREE_PTR(minfo); } if (req) { entry->se_req = NULL; ptlrpc_req_finished(req); } } That means if the statahead thread can free the "minfo", then the "entry->se_minfo" must be set. But in the whole logic, only the "ll_statahead_interpret()" can set the "entry->se_minfo" which is called by "mdc_intent_getattr_async_interpret()". But before "mdc_intent_getattr_async_interpret" calls "ll_statahead_interpret()" to set "entry->se_minfo", it has already hold reference count on the "entry". So at that time, the statahead thread should not trigger entry release processing. static void ll_sa_entry_put(struct ll_statahead_info *sai, struct ll_sa_entry *entry) { ==> if (cfs_atomic_dec_and_test(&entry->se_refcount)) { CDEBUG(D_READA, "free sa entry %.*s(%p) index "LPU64"\n", entry->se_qstr.len, entry->se_qstr.name, entry, entry->se_index); LASSERT(cfs_list_empty(&entry->se_link)); LASSERT(cfs_list_empty(&entry->se_list)); LASSERT(ll_sa_entry_unhashed(entry)); ll_sa_entry_cleanup(sai, entry); if (entry->se_inode) iput(entry->se_inode); OBD_FREE(entry, entry->se_size); cfs_atomic_dec(&sai->sai_cache_count); } } So there should be other issues we did not found yet... |
| Comment by Oleg Drokin [ 16/May/13 ] |
|
Yes, you are right. |
| Comment by nasf (Inactive) [ 17/May/13 ] |
|
This is the new patch: |
| Comment by Lai Siyao [ 17/May/13 ] |
|
Separating statahead thread (exit) from statahead RPCs makes code complicated. I'd prefer statahead thread wait for all inflight statahead RPCs finish, and clean them up and then exit. By this way statahead is not guaranteed to quit when ll_file_release() is called, but this should not be an issue. I'll implement this and if all goes well, I'll commit a patch separately. |
| Comment by Oleg Drokin [ 18/May/13 ] |
|
also one of the solutions could be to just embedd the name into opdata instead of just hte pointeer. |
| Comment by Lai Siyao [ 20/May/13 ] |
|
My patch is on http://review.whamcloud.com/#change,6392 |
| Comment by Ann Koehler (Inactive) [ 21/Nov/13 ] |
|
The patch http://review.whamcloud.com/#change,6392 has not caused any regressions when run through Cray's vers system test suite. We adopted the patch because we encountered the LBUG: > 14:06:40 LustreError: 16861:0:(statahead.c:351:do_sa_entry_fini()) ASSERTION( !ll_sa_entry_unhashed(entry) ) failed: > 14:06:40 LustreError: 16861:0:(statahead.c:351:do_sa_entry_fini()) LBUG > 14:06:40 Pid: 16861, comm: DeletionService > 14:06:40 Call Trace: > 14:06:40 [<ffffffff81005db9>] try_stack_unwind+0x169/0x1b0 > 14:06:40 [<ffffffff81004849>] dump_trace+0x89/0x450 > 14:06:40 [<ffffffffa02278d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs] > 14:06:40 [<ffffffffa0227e37>] lbug_with_loc+0x47/0xc0 [libcfs] > 14:06:40 [<ffffffffa0888b02>] ll_sai_unplug+0x4b2/0x580 [lustre] > 14:06:40 [<ffffffffa088b45b>] do_statahead_enter+0x32b/0x1900 [lustre] > 14:06:40 [<ffffffffa08792bc>] ll_lookup_it+0x9cc/0xb00 [lustre] > 14:06:40 [<ffffffffa087947c>] ll_lookup_nd+0x8c/0x3e0 [lustre] > 14:06:41 [<ffffffff8114dacc>] d_alloc_and_lookup+0x4c/0x80 > 14:06:41 [<ffffffff8114f13e>] do_lookup+0x2ae/0x3b0 > 14:06:41 [<ffffffff811519b3>] path_lookupat+0xc3/0x5f0 > 14:06:41 [<ffffffff81151f15>] do_path_lookup+0x35/0xd0 > 14:06:41 [<ffffffff81152c63>] user_path_at_empty+0x83/0xb0 > 14:06:41 [<ffffffff81152ca1>] user_path_at+0x11/0x20 > 14:06:41 [<ffffffff81147905>] vfs_fstatat+0x55/0x90 > 14:06:41 [<ffffffff81147a6b>] vfs_stat+0x1b/0x20 > 14:06:41 [<ffffffff81147a94>] sys_newstat+0x24/0x50 > 14:06:41 [<ffffffff8138ce6b>] system_call_fastpath+0x16/0x1b > 14:06:41 [<00002aaaab5dc325>] 0x2aaaab5dc325 > 14:06:41 Kernel panic - not syncing: LBUG Analysis of the associated dump showed that lli_sai was not NULL on entry to do_statahead_enter() but was NULL at the time of the ASSERTION. In the version of Lustre being run, do_statahead_enter() did not acquire the ll_sa_lock allowing the lli_sai to be changed. Patch 6392 adds lock/unlock calls on ll_sa_lock. With the patch, the LBUG has not been reproduced. The rest of the patch looks like it might have a positive effect on a ll_sa hang that we see occasionally but the verdict is still out. |
| Comment by Patrick Farrell (Inactive) [ 04/Dec/13 ] |
|
This is the cover message from my Gerrit review, placed here so it's more readable: We hit the following assertion while running with this patch: > 02:54:14 LustreError: 17617:0:(statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080 This assertion was added in the statahead patch we took in the patch from 803507. (Note: #define lli_opendir_pid u.d.d_opendir_pid) In particular, without the patch, ll_sai_put sets lli->u.d.d_opendir_pid to 0. The patch changes it to assert that lli->u.d.d_opendir_pid is 0. Looking at some of the uses of ll_sai_put, we can see that lli->u.d.d_opendir_pid is sometimes set to 0 before ll_sai_put is called called, but also that this is not done in do_statahead_enter. The inline comments give more detail. I've pointed out some of the ll_sai_put calls that appear problematic, but it's also called from other places and I'm not sure the PID is necessarily zero in all of them. |
| Comment by Lai Siyao [ 05/Dec/13 ] |
|
Patrick, you hit that assert because statahead is not started successfully. By design, sai has a refcount in allocation, and it will be put in ll_stop_statahead(), which will set opendir_pid to 0. And in normal situation, ll_stop_statahead() will be called in file release or statahead ratio is too low. So it means this function will be called before sai put its last refcount, which means this assert should be valid. I'll update the patch to handle this corner case. |
| Comment by Ann Koehler (Inactive) [ 05/Dec/13 ] |
|
"you hit that assert because statahead is not started successfully" Any idea why statahead may not have gotten started? The clue is probably this error message: (statahead.c:1660:do_statahead_enter()) can't start ll_sa thread, rc: -2080 We do have a dump for the LBUG Patrick reported. I just haven't had time to do the analysis. Any guesses would be appreciated. |
| Comment by Lai Siyao [ 06/Dec/13 ] |
|
Hmm, 2080 doesn't look to be a valid error number. What kernel version are you using? |
| Comment by Ann Koehler (Inactive) [ 06/Dec/13 ] |
|
The client, where the LBUG occurred, was running 2.5 on SLES 11 SP3 (Linux 3.0.82). |
| Comment by Phil Schwan (Inactive) [ 28/Dec/13 ] |
|
For what it's worth, we've also hit the "can't start ll_sa thread" bug four times, though with a slightly different followup assertion: LustreError: 14399:0:(statahead.c:1698:do_statahead_enter()) can't start ll_sa thread: rc: -2752 I don't have all four crashes, but I believe that rc was different each time. Lustre: Lustre: Build Version: 2.5.0-RC1--PRISTINE-2.6.32-279.14.1.el6.x86_64 |
| Comment by Ann Koehler (Inactive) [ 07/Feb/14 ] |
|
With patchset 20 backported to 2.5, an ll_agl_XXXX thread occasionally hits a general protection fault in __wake_up. In all cases, the corresponding ll_sa_XXXX thread has terminated. Looks like the sai struct has been deallocated so the ll_agl_XXXX thread panics when it goes to access it. Dump uploaded to: ftp.whamcloud.com:/upload/ crash> ps | grep ll_sa
20979 2 2 ffff88080f12f7e0 IN 0.0 0 0 [ll_sa_20914]
crash> ps | grep ll_agl
> 20978 2 0 ffff8807d9783080 RU 0.0 0 0 [ll_agl_20915]
20980 2 3 ffff88080f12f040 IN 0.0 0 0 [ll_agl_20914]
crash> bt
PID: 20978 TASK: ffff8807d9783080 CPU: 0 COMMAND: "ll_agl_20915"
#0 [ffff8807d9c1fcc0] die at ffffffff81005f6b
#1 [ffff8807d9c1fcf0] do_general_protection at ffffffff8100361a
#2 [ffff8807d9c1fd20] general_protection at ffffffff813628df
[exception RIP: __wake_up_common+49]
RIP: ffffffff8102fc91 RSP: ffff8807d9c1fdd0 RFLAGS: 00010096
RAX: 5a5a5a5a5a5a5a5a RBX: ffff88080095b118 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffff88080095b118
RBP: ffff8807d9c1fe00 R8: 5a5a5a5a5a5a5a42 R9: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: ffff88080095b120 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#3 [ffff8807d9c1fe08] __wake_up at ffffffff81032b68
#4 [ffff8807d9c1fe48] ll_agl_thread at ffffffffa0795aa2 [lustre]
#5 [ffff8807d9c1fee8] kthread at ffffffff8106610e
#6 [ffff8807d9c1ff48] kernel_thread_helper at ffffffff81364034
crash> ll_inode_info ffff8806325a9580 | grep magic
lli_inode_magic = 287116773,
crash> ll_inode_info ffff8806325a9580 | grep sai
d_sai = 0x0,
|
| Comment by Ann Koehler (Inactive) [ 23/Apr/14 ] |
|
>> Any idea why statahead may not have gotten started? The clue is probably this error message: Finally figured out the cause of this bug. The 64-bit output of kthread_run is being assigned to an int and truncated. In the cases where a valid task pointer, when interpreted as an int, falls between 0 and -4095 it is treated as an error. The code in ll_start_agl handles the return from kthread_run correctly. |
| Comment by Lai Siyao [ 24/Apr/14 ] |
|
Thank you, I'll fix it in related patch soon. |
| Comment by Ryan Haasken [ 30/Apr/14 ] |
|
Lai, I have created |
| Comment by Lai Siyao [ 04/May/14 ] |
|
Thanks Ryan, and as is pointed out in |
| Comment by James A Simmons [ 27/Aug/14 ] |
|
Has this work been delayed until 2.8? |
| Comment by James A Simmons [ 08/Sep/14 ] |
|
Refreshed http://review.whamcloud.com/#/c/9663. Patch is ready for inspection. |
| Comment by James A Simmons [ 03/Nov/14 ] |
|
Many patches have been landed. Only one patch left at http://review.whamcloud.com/#/c/9667. Please inspect. |
| Comment by Gerrit Updater [ 19/Nov/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9667/ |
| Comment by James A Simmons [ 20/Nov/14 ] |
|
Last patch has been merged. This ticket can be closed. |
| Comment by Jodi Levi (Inactive) [ 21/Nov/14 ] |
|
Patches landed to Master. |
| Comment by Gerrit Updater [ 02/Dec/14 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/12901 |