[LU-1714] crash upon loading libcfs module Created: 07/Aug/12  Updated: 04/Feb/15  Resolved: 20/Feb/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.4.0, Lustre 2.1.5

Type: Bug Priority: Minor
Reporter: Alexander Boyko Assignee: Emoly Liu
Resolution: Fixed Votes: 0
Labels: patch
Environment:

debug kernel & lustre with LU-1201


Issue Links:
Duplicate
duplicates LU-2427 Hit "kernel BUG" when running on debu... Resolved
Related
is related to LU-6020 Bugfixes for GSS/Kerberos Resolved
Severity: 3
Rank (Obsolete): 8136

 Description   
RIP: 0010:[<ffffffffa06a2807>]  [<ffffffffa06a2807>] cfs_crypto_hash_digest+0xf6/0x166 [libcfs]
RSP: 0018:ffff88007c5fbda8  EFLAGS: 00010293
RAX: ffffea0001a55e00 RBX: ffff880078640000 RCX: 0000000087654321
RDX: 0000000000000000 RSI: ffffffffa06c6398 RDI: ffff880078640000
RBP: ffff88007c5fbe28 R08: 0000000000000000 R09: d84156c5635688c0
R10: d84156c5635688c0 R11: 0000000000000020 R12: 0000000000000000
R13: ffff88007c5fbe78 R14: 0000000000020000 R15: ffff88007c5fbe74
FS:  00007f2ef3eb6700(0000) GS:ffff880005000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f4777fa5000 CR3: 000000007c62c000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 8489, threadinfo ffff88007c5fa000, task ffff88007c6f00c0)
Stack:
 0000000000000000 0000000000000000 0000000000000000 0000000000000000
<0> 0000000000000000 0000000000000000 ffff88007c321988 0000000000000000
<0> 00007f2ef1eee010 ffffffffa06c6390 ffff88007c5fbe28 0000000000000000
Call Trace:
 [<ffffffffa06a29c8>] cfs_crypto_register+0x151/0x35a [libcfs]
 [<ffffffff8151cbcb>] ? _spin_unlock+0x2b/0x40
 [<ffffffffa06b3120>] ? cfs_wi_sched_create+0x3bf/0x4d4 [libcfs]
 [<ffffffffa06a5d34>] ? init_libcfs_module+0x0/0x377 [libcfs]
 [<ffffffffa06a5d34>] ? init_libcfs_module+0x0/0x377 [libcfs]
 [<ffffffffa06a5f5f>] init_libcfs_module+0x22b/0x377 [libcfs]
 [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
 [<ffffffff810c1333>] sys_init_module+0xe3/0x260
 [<ffffffff8100b0b2>] system_call_fastpath+0x16/0x1b
Code: c1 e8 0c 48 8d 0c c5 00 00 00 00 48 c1 e0 06 48 29 c8 48 b9 00 00 00 00 00 ea ff ff 48 8d 04 08 b9 21 43 65 87 48 39 4d 80 74 04 <0f> 0b eb fe f6 c2 01 74 04 0f 0b eb fe 83 e2 03 48 09 c2 48 89 
RIP  [<ffffffffa06a2807>] cfs_crypto_hash_digest+0xf6/0x166 [libcfs]
 RSP <ffff88007c5fbda8>
---[ end trace 90f8b1994767d769 ]---
Kernel panic - not syncing: Fatal exception
Pid: 8489, comm: insmod Tainted: G      D    ----------------   2.6.32-220.7.1.el6.lustre.2015.x86_64.debug #1
Call Trace:
 [<ffffffff81518f20>] ? panic+0x78/0x148
 [<ffffffff8151e564>] ? oops_end+0xe4/0x100
 [<ffffffff8100f3bb>] ? die+0x5b/0x90
 [<ffffffff8151dc54>] ? do_trap+0xc4/0x160
 [<ffffffff8100d005>] ? do_invalid_op+0xa5/0xb0
 [<ffffffff8100cff5>] ? do_invalid_op+0x95/0xb0
 [<ffffffffa06a2807>] ? cfs_crypto_hash_digest+0xf6/0x166 [libcfs]
 [<ffffffff8151c661>] ? trace_hardirqs_off_thunk+0x3a/0x3c
 [<ffffffff8100bf9b>] ? invalid_op+0x1b/0x20
 [<ffffffffa06a2807>] ? cfs_crypto_hash_digest+0xf6/0x166 [libcfs]
 [<ffffffffa06a27d7>] ? cfs_crypto_hash_digest+0xc6/0x166 [libcfs]
 [<ffffffffa06a29c8>] ? cfs_crypto_register+0x151/0x35a [libcfs]
 [<ffffffff8151cbcb>] ? _spin_unlock+0x2b/0x40
 [<ffffffffa06b3120>] ? cfs_wi_sched_create+0x3bf/0x4d4 [libcfs]
 [<ffffffffa06a5d34>] ? init_libcfs_module+0x0/0x377 [libcfs]
 [<ffffffffa06a5d34>] ? init_libcfs_module+0x0/0x377 [libcfs]
 [<ffffffffa06a5f5f>] ? init_libcfs_module+0x22b/0x377 [libcfs]
 [<ffffffff8100204c>] ? do_one_initcall+0x3c/0x1d0
 [<ffffffff810c1333>] ? sys_init_module+0xe3/0x260
 [<ffffffff8100b0b2>] ? system_call_fastpath+0x16/0x1b


 Comments   
Comment by Alexander Boyko [ 07/Aug/12 ]

http://review.whamcloud.com/3552

Comment by Andreas Dilger [ 07/Aug/12 ]

This is the same as ORI-722, recently found on the Orion development branch. Please see http://review.whamcloud.com/3423 for the current patch.

Comment by Alexander Boyko [ 07/Aug/12 ]

Sure, check for sg_init_table looks the same. ORI-722 and http://review.whamcloud.com/3423 relate to the lnet fix. And this issue and http://review.whamcloud.com/3552 - relate to libcfs crypto hash fix. Both has the similar change at autoconf.

Comment by Andreas Dilger [ 07/Aug/12 ]

OK, looking more closely, this has the same cause as ORI-722, but affects different code.

Alexander and Prakash, could you please get together to make a single common patch that fixes these issues, or at least have a single autoconf check. I prefer the approach in http://review.whamcloud.com/3423 that has a wrapper that can be used in all the code, instead of #ifdefs inline in the code as in http://review.whamcloud.com/3552.

Comment by Prakash Surya (Inactive) [ 07/Aug/12 ]

Sure, I'm open to merging the two or landing one and reworking the other. It's probably a good idea to scan the rest of the code for areas which need similar fixes as well.

Comment by Alexander Boyko [ 17/Aug/12 ]

Prakash Surya, how about merging? Or you did not do it?

Comment by Prakash Surya (Inactive) [ 17/Aug/12 ]

Alexander, sure I'll merge your fix into mine. I'll update this ticket when I push the new version and add you to the review list. I should be able to do that later today.

Comment by Prakash Surya (Inactive) [ 17/Aug/12 ]

Some other places which may need similar fixes:

lnet/klnds/o2iblnd/o2iblnd_cb.c:730
lnet/klnds/o2iblnd/o2iblnd_cb.c:773
lustre/obdclass/capa.c:266
lustre/obdclass/capa.c:308
lustre/obdclass/capa.c:311
lustre/obdclass/capa.c:261
lustre/obdclass/capa.c:364

Those locations all make a direct call to sg_set_page which might need to change to sg_init_one.

Comment by Prakash Surya (Inactive) [ 17/Aug/12 ]

Please see: http://review.whamcloud.com/3709

Comment by Alexander Boyko [ 17/Aug/12 ]

I think it`s better to add something like this to libcfs headers.

#ifndef HAVE_SCATTERLIST_INITTABLE
#define sg_init_table(sg, nents) memset(sg, 0, sizeof(*(sg))*(nents))
#endif
#ifndef HAVE_SCATTERLIST_SETPAGE
sg_set_page(sg, p, len, off)   sg_set_buf(sg, page_address(p) + (off & ~CFS_PAGE_MASK), len)
#endif

Or maybe cfs_ prefix for this functions.
lnet/klnds/o2iblnd/o2iblnd_cb.c already use initialized sg, and set_page is faster then set_buf, because it skip addr to page conversion.

Comment by Prakash Surya (Inactive) [ 17/Aug/12 ]

I decided to drop the "cfs_" prefix since there are other "sg_" calls in the code which do not use any "cfs_" wrappers. So to be consistent with those calls, I think it's best to leave them with the "sg_" prefix.

Comment by Peter Jones [ 04/Sep/12 ]

Yujian

Could you please help out with this one?

Thanks

Peter

Comment by Lustre Bull [ 19/Oct/12 ]

Prakash,
there are 5 calls to sg_set_page() in lustre/obdclass/capa.c
I think you should also insert a call to sg_init_table() above each of them.

Index: b/lustre/obdclass/capa.c
===================================================================
--- a/lustre/obdclass/capa.c
+++ b/lustre/obdclass/capa.c
@@ -265,6 +265,7 @@ int capa_hmac(__u8 *hmac, struct lustre_
         }
         keylen = alg->ha_keylen;

+        sg_init_table(&sl, 1);
         sg_set_page(&sl, virt_to_page(capa),
                     offsetof(struct lustre_capa, lc_hmac),
                     (unsigned long)(capa) % CFS_PAGE_SIZE);
@@ -306,9 +307,11 @@ int capa_encrypt_id(__u32 *d, __u32 *s,
                 GOTO(out, rc);
         }

+        sg_init_table(&sd, 1);
         sg_set_page(&sd, virt_to_page(d), 16,
                     (unsigned long)(d) % CFS_PAGE_SIZE);

+        sg_init_table(&ss, 1);
         sg_set_page(&ss, virt_to_page(s), 16,
                     (unsigned long)(s) % CFS_PAGE_SIZE);
         desc.tfm   = tfm;
@@ -358,9 +361,11 @@ int capa_decrypt_id(__u32 *d, __u32 *s,
                 GOTO(out, rc);
         }

+        sg_init_table(&sd, 1);
         sg_set_page(&sd, virt_to_page(d), 16,
                     (unsigned long)(d) % CFS_PAGE_SIZE);
+        sg_init_table(&ss, 1);
         sg_set_page(&ss, virt_to_page(s), 16,
                     (unsigned long)(s) % CFS_PAGE_SIZE);

You have also added the following new definition for sg_set_page() in libcfs/include/libcfs/linux/libcfs.h

#ifndef HAVE_SCATTERLIST_SETPAGE
#define sg_set_page(sg, p, len, off) \
       sg_set_buf(sg, page_address(p) + ((off) & ~CFS_PAGE_MASK), len)
#endif

This file also includes libcfs/include/libcfs/linux/kp30.h in which there was already an older definition for sg_set_page()

#ifndef HAVE_SCATTERLIST_SETPAGE
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
                               unsigned int len, unsigned int offset)
{
        sg->page = page;
        sg->offset = offset;
        sg->length = len;
}
#endif

I think this older definition should be removed.

Comment by Nathan Rutman [ 21/Nov/12 ]

Xyratex-bug-id: MRP-611

Comment by Emoly Liu [ 18/Feb/13 ]

b2_1 port is at http://review.whamcloud.com/5452

Comment by Peter Jones [ 20/Feb/13 ]

The existing patches have landed for 2.1.5 and 2.4. Please open a new ticket to track any further changes needed in this area of code.

Generated at Sat Feb 10 01:19:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.