[LU-1484] Test failure on test suite recovery-small, subtest test_57 Created: 05/Jun/12 Updated: 18/Apr/13 Resolved: 21/Feb/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0, Lustre 2.1.2, Lustre 2.1.3, Lustre 2.1.4, Lustre 1.8.8 |
| Fix Version/s: | Lustre 2.4.0, Lustre 2.1.5, Lustre 1.8.9 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Lustre Tag: v2_1_2_RC2 |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4529 |
| Description |
|
This issue was created by maloo for yujian <yujian@whamcloud.com>.

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/743bea58-af48-11e1-a585-52540035b04c. The sub-test test_57 failed with the following error:
Info required for matching: recovery-small 57 |
| Comments |
| Comment by Jian Yu [ 05/Jun/12 ] |
|
Console log on Client 4 (client-28vm6) showed that:

05:43:48:Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 05:43:48 (1338900228)
05:43:53:LustreError: 28843:0:(fail.c:126:__cfs_fail_timeout_set()) cfs_fail_timeout id b00 sleeping for 10000ms
05:44:01:LustreError: 28843:0:(fail.c:130:__cfs_fail_timeout_set()) cfs_fail_timeout id b00 awake
05:46:17:INFO: task lctl:28843 blocked for more than 120 seconds.
05:46:17:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
05:46:18:lctl D 0000000000001000 0 28843 28631 (NOTLB)
05:46:18: ffff8100517dfe38 0000000000000086 ffffffff800cfa4c ffff810037d56d40
05:46:18: 0000000000000282 0000000000000007 ffff810067000080 ffff8100668c47e0
05:46:18: 000005e137f5c0e1 000000000001b643 ffff810067000268 0000000000000001
05:46:18:Call Trace:
05:46:18: [<ffffffff800cfa4c>] zone_statistics+0x3e/0x6d
05:46:18: [<ffffffff8000f40b>] __alloc_pages+0x78/0x308
05:46:19: [<ffffffff8006468c>] __down_read+0x7a/0x92
05:46:19: [<ffffffff888d90e2>] :obdclass:lprocfs_fops_read+0x82/0x1e0
05:46:19: [<ffffffff8010ab77>] proc_reg_read+0x7e/0x99
05:46:19: [<ffffffff8000b721>] vfs_read+0xcb/0x171
05:46:19: [<ffffffff80011d15>] sys_read+0x45/0x6e
05:46:19: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
05:46:19:
05:46:19:INFO: task umount:28863 blocked for more than 120 seconds.
05:46:22:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
05:46:24:umount D ffff810002536420 0 28863 28862 (NOTLB)
05:46:24: ffff810057f81828 0000000000000082 0000000000000000 0000000100000002
05:46:24: 0000000000000000 0000000000000007 ffff8100668c47e0 ffffffff80319b60
05:46:24: 000005e137f5e08c 0000000000001fab ffff8100668c49c8 0000000000000000
05:46:24:Call Trace:
05:46:24: [<ffffffff80064cb5>] __reacquire_kernel_lock+0x2e/0x47
05:46:24: [<ffffffff80063171>] wait_for_completion+0x79/0xa2
05:46:24: [<ffffffff8008ee74>] default_wake_function+0x0/0xe
05:46:24: [<ffffffff8010dcdb>] remove_proc_entry+0xfb/0x1c7
05:46:25: [<ffffffff888d7603>] :obdclass:lprocfs_remove+0x103/0x130
05:46:25: [<ffffffff888d6a46>] :obdclass:lprocfs_free_stats+0x1e6/0x230
05:46:25: [<ffffffff888d7a1f>] :obdclass:lprocfs_obd_cleanup+0x6f/0x80
05:46:28: [<ffffffff88b7ca32>] :osc:osc_precleanup+0x292/0x370
05:46:28: [<ffffffff888ff13c>] :obdclass:lu_context_fini+0x1c/0x50
05:46:28: [<ffffffff888e303f>] :obdclass:class_cleanup+0xc6f/0xe30
05:46:28: [<ffffffff888e6d8c>] :obdclass:class_process_config+0x1e5c/0x3200
05:46:28: [<ffffffff888e97f7>] :obdclass:class_manual_cleanup+0xad7/0xe80
05:46:28: [<ffffffff8002ff6f>] __up_write+0x27/0xf2
05:46:30: [<ffffffff88bba26c>] :lov:lov_putref+0xb0c/0xb90
05:46:30: [<ffffffff88bc2b98>] :lov:lov_disconnect+0x308/0x3e0
05:46:30: [<ffffffff88c66d94>] :lustre:client_common_put_super+0x894/0xed0
05:46:30: [<ffffffff88c676e5>] :lustre:ll_put_super+0x195/0x310
05:46:30: [<ffffffff800f079e>] invalidate_inodes+0xce/0xe0
05:46:30: [<ffffffff800e77ab>] generic_shutdown_super+0x79/0xfb
05:46:30: [<ffffffff800e787b>] kill_anon_super+0x9/0x35
05:46:30: [<ffffffff800e792c>] deactivate_super+0x6a/0x82
05:46:31: [<ffffffff800f1e8b>] sys_umount+0x245/0x27b
05:46:31: [<ffffffff800ba767>] audit_syscall_entry+0x1a8/0x1d3
05:46:31: [<ffffffff8005d28d>] tracesys+0xd5/0xe0

For Lustre 2.1.1, we also hit recovery-small test 57 hanging, but it's a different failure on the client: |
| Comment by Jian Yu [ 06/Jun/12 ] |
|
Another instance: And here are the historical reports with status "TIMEOUT" on Maloo: |
| Comment by Peter Jones [ 23/Jul/12 ] |
|
Bobijam, could you please look into this one? Thanks. Peter |
| Comment by Zhenyu Xu [ 24/Jul/12 ] |
|
there's a deadlock here (when !HAVE_PROCFS_USERS && HAVE_PROCFS_DELETED):

proc reader                              proc remover
-----------                              ------------
proc_reg_read()
    pdeaux->pde_users++
    lprocfs_fops_read()
                                         LPROCFS_WRITE_ENTRY()
                                         // down_write _lprocfs_lock semaphore -----> (1)
        LPROCFS_ENTRY_AND_CHECK()
        // down_read _lprocfs_lock semaphore, waits here ---------------------------> (2)
                                         remove_proc_entry()
                                             if (pdeaux->pde_users > 0)
                                                 wait_for_completion() -------------> (3)
    ...
    pde_users_dec()
    // pdeaux->pde_users--, complete() ------------------------------------------------> (4)

The issue is that when the remover gets to (3), the proc reader is waiting at (2); the remover cannot move on until the reader reaches (4), so a deadlock ensues. |
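To make the ordering above concrete, here is a minimal user-space sketch of the same lock pattern; pthreads stand in for the kernel primitives (the rwlock plays _lprocfs_lock, the counter plus condition variable play pde_users plus the completion), and all names are illustrative rather than Lustre code:

/* lu1484_demo.c - build with: cc lu1484_demo.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lprocfs_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t  unload_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   unload_done  = PTHREAD_COND_INITIALIZER;
static int pde_users;

static void *reader(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&unload_lock);
        pde_users++;                            /* proc_reg_read() bumps the use count */
        pthread_mutex_unlock(&unload_lock);

        sleep(1);                               /* let the remover win the write lock */

        pthread_rwlock_rdlock(&lprocfs_lock);   /* (2) blocks behind the writer */
        pthread_rwlock_unlock(&lprocfs_lock);

        pthread_mutex_lock(&unload_lock);       /* (4) is never reached */
        if (--pde_users == 0)
                pthread_cond_signal(&unload_done);
        pthread_mutex_unlock(&unload_lock);
        return NULL;
}

static void *remover(void *arg)
{
        (void)arg;
        /* wait until a read is in flight, as in the kernel scenario */
        pthread_mutex_lock(&unload_lock);
        while (pde_users == 0) {
                pthread_mutex_unlock(&unload_lock);
                usleep(1000);
                pthread_mutex_lock(&unload_lock);
        }
        pthread_mutex_unlock(&unload_lock);

        pthread_rwlock_wrlock(&lprocfs_lock);   /* (1) LPROCFS_WRITE_ENTRY() */

        pthread_mutex_lock(&unload_lock);
        while (pde_users > 0)                   /* (3) waits for a reader that cannot advance */
                pthread_cond_wait(&unload_done, &unload_lock);
        pthread_mutex_unlock(&unload_lock);

        pthread_rwlock_unlock(&lprocfs_lock);
        return NULL;
}

int main(void)
{
        pthread_t r, w;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, remover, NULL);
        sleep(3);                               /* neither thread finishes */
        puts("deadlock: reader stuck at (2), remover stuck at (3)");
        _exit(0);
}

Run it and neither worker thread ever exits: the reader is parked at (2) and the remover at (3), exactly the lctl/umount pair seen in the hung-task traces above.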
| Comment by Zhenyu Xu [ 24/Jul/12 ] |
|
Patch tracked at http://review.whamcloud.com/3455. Commit message:

LU-1484 lprocfs: fix a deadlock

There is a deadlock between the proc reader and the proc remover (see the proc reader / proc remover sequence in the previous comment): when the remover gets to (3), the proc reader will wait at (2), while the remover cannot move on until the reader reaches (4). |
| Comment by Zhenyu Xu [ 24/Jul/12 ] |
|
Updated patch. Commit message:

LU-1484 lprocfs: refine LC_PROCFS_USERS check

In some RHEL-patched 2.6.18 kernels, the pde_users member is added in a separate struct proc_dir_entry_aux instead of in struct proc_dir_entry as in kernel versions 2.6.23 and later. |
| Comment by Peter Jones [ 26/Jul/12 ] |
|
Landed for 2.3 |
| Comment by Jian Yu [ 13/Aug/12 ] |
|
Lustre Tag: v2_1_3_RC1

The same issue exists in Lustre 2.1.3: https://maloo.whamcloud.com/test_sets/89731b4c-e415-11e1-b6d3-52540035b04c

Will the patch for this ticket be cherry-picked/ported to the b2_1 branch? |
| Comment by Zhenyu Xu [ 13/Aug/12 ] |
|
b2_1 patch tracking at http://review.whamcloud.com/3471 |
| Comment by Peng Tao [ 14/Nov/12 ] |
|
With commit 76bf16d1e12cd3c2d2f48a31e3e6c1ad66523638 ( CC [M] /home/bergwolf/src/lustre-testing/libcfs/libcfs/linux/linux-tracefile.o |
| Comment by Peng Tao [ 14/Nov/12 ] |
|
config.h shows that both HAVE_PROCFS_DELETED and HAVE_PROCFS_USERS are defined:

460 /* kernel has deleted member in procfs entry struct */ |
| Comment by Stephen Champion [ 15/Nov/12 ] |
|
Building b2_1, I have verified that the RHEL 5.8 updates 2.6.18-308.11.1.el5 and 2.6.18-308.16.1.el5 include both proc_dir_entry.deleted and proc_dir_entry_aux.pde_users. As Peng reported, this leads to

libcfs/include/libcfs/params_tree.h:107:2: error: #error proc_dir_entry->deleted is conflicted with proc_dir_entry->pde_user

anytime the source for one of these RHEL 5.8 updates is used for --with-linux=. |
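From the error message, the guard in params_tree.h presumably reads something like the following (a reconstruction from the quoted build error, not the verbatim source):

/* sketch of the guard; line numbers and surrounding code are assumptions */
#if defined(HAVE_PROCFS_DELETED) && defined(HAVE_PROCFS_USERS)
#error proc_dir_entry->deleted is conflicted with proc_dir_entry->pde_user
#endif

With configure defining both macros on these kernels, any build against them necessarily stops here.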
| Comment by Stephen Champion [ 15/Nov/12 ] |
|
Follow up in |
| Comment by Jian Yu [ 10/Dec/12 ] |
|
Lustre Branch: b2_1

The same issue occurred again on recovery-small test 57:

== recovery-small test 57: read procfs entries causes kernel crash =================================== 19:13:57 (1354936437)
fail_loc=0x80000B00
CMD: fat-intel-3vm6.lab.whamcloud.com grep -c /mnt/lustre' ' /proc/mounts
Stopping client fat-intel-3vm6.lab.whamcloud.com /mnt/lustre (opts:)
CMD: fat-intel-3vm6.lab.whamcloud.com lsof -t /mnt/lustre
CMD: fat-intel-3vm6.lab.whamcloud.com umount /mnt/lustre 2>&1

Console log on client (fat-intel-3vm6):

19:14:02:Lustre: DEBUG MARKER: == recovery-small test 57: read procfs entries causes kernel crash =================================== 19:13:57 (1354936437)
19:14:02:Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
19:14:02:LustreError: 9173:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id b00 sleeping for 10000ms
19:14:02:Lustre: DEBUG MARKER: lsof -t /mnt/lustre
19:14:02:Lustre: DEBUG MARKER: umount /mnt/lustre 2>&1
19:14:13:LustreError: 9173:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id b00 awake
19:16:57:INFO: task lctl:9173 blocked for more than 120 seconds.
19:16:57:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
19:16:57:lctl D 0000000000001000 0 9173 8913 (NOTLB)
19:16:57: ffff810044841e38 0000000000000086 ffffffff800cfa80 ffff810037d56d40
19:16:57: 0000000000000282 0000000000000007 ffff81004de4f860 ffff81005f2120c0
19:16:57: 00002a16f848196c 0000000000020344 ffff81004de4fa48 0000000000000001
19:16:57:Call Trace:
19:16:57: [<ffffffff800cfa80>] zone_statistics+0x3e/0x6d
19:16:58: [<ffffffff8000f47b>] __alloc_pages+0x78/0x308
19:16:58: [<ffffffff8006468c>] __down_read+0x7a/0x92
19:16:58: [<ffffffff889170c2>] :obdclass:lprocfs_fops_read+0x82/0x200
19:16:58: [<ffffffff8010b73e>] proc_reg_read+0x7e/0x99
19:16:58: [<ffffffff8000b72f>] vfs_read+0xcb/0x171
19:16:58: [<ffffffff80011d85>] sys_read+0x45/0x6e
19:16:58: [<ffffffff8005d28d>] tracesys+0xd5/0xe0
19:16:58:
19:16:58:INFO: task umount:9195 blocked for more than 120 seconds.
19:16:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
19:16:58:umount D ffff810002536420 0 9195 9194 (NOTLB)
19:16:58: ffff810058613a08 0000000000000086 00000000ffffffff 0000000000000020
19:16:58: 00000000ffffffff 0000000000000007 ffff81005f2120c0 ffffffff8031ab60
19:16:58: 00002a16f8483cb6 000000000000234a ffff81005f2122a8 0000000000000000
19:16:58:Call Trace:
19:16:58: [<ffffffff80064cb5>] __reacquire_kernel_lock+0x2e/0x47
19:16:58: [<ffffffff80063171>] wait_for_completion+0x79/0xa2
19:16:58: [<ffffffff8008ee97>] default_wake_function+0x0/0xe
19:16:58: [<ffffffff8010e8a2>] remove_proc_entry+0xfb/0x1c7
19:16:58: [<ffffffff88915573>] :obdclass:lprocfs_remove+0x103/0x130
19:16:58: [<ffffffff889159d0>] :obdclass:lprocfs_obd_cleanup+0x90/0xa0
19:16:58: [<ffffffff88caf665>] :osc:osc_precleanup+0x2e5/0x3a0
19:16:58: [<ffffffff88920c35>] :obdclass:class_cleanup+0xc55/0xda0
19:16:58: [<ffffffff889241f6>] :obdclass:class_process_config+0x1b46/0x2cc0
19:16:58: [<ffffffff88926a0b>] :obdclass:class_manual_cleanup+0x9bb/0xd70
19:16:58: [<ffffffff88d03c5d>] :lov:lov_putref+0xa7d/0xaf0
19:16:58: [<ffffffff88cfe623>] :lov:lov_del_target+0x6d3/0x720
19:16:58: [<ffffffff88d0b78b>] :lov:lov_disconnect+0x39b/0x440
19:16:58: [<ffffffff88de73ea>] :lustre:client_common_put_super+0x83a/0xe10
19:16:58: [<ffffffff88de7d15>] :lustre:ll_put_super+0x1a5/0x330
19:16:58: [<ffffffff800f120a>] invalidate_inodes+0xce/0xe0
19:16:58: [<ffffffff800e78ae>] generic_shutdown_super+0x79/0xfb
19:16:58: [<ffffffff800e797e>] kill_anon_super+0x9/0x35
19:16:58: [<ffffffff800e7a2f>] deactivate_super+0x6a/0x82
19:16:58: [<ffffffff800f28f7>] sys_umount+0x245/0x27b
19:16:58: [<ffffffff800ba78a>] audit_syscall_entry+0x1a8/0x1d3
19:16:58: [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Maloo report: https://maloo.whamcloud.com/test_sets/a59d9126-41bc-11e2-a653-52540035b04c |
| Comment by Zhenyu Xu [ 10/Dec/12 ] |
|
Lustre Branch: b2_1

In the RHEL5 kernel build log, the configure test result does not match the expected result for the 2.6.18-308.20.1.el5 kernel source; my local test shows pde_users being detected, and the build failed as Stephen pointed out. This means something is wrong with the RHEL5 kernel build. The hung stack also reveals that proc_dir_entry_aux::pde_users is used.

remove_proc_entry() in the 2.6.18-308.20.1.el5 source:

/* Wait until all existing callers into module are done. */
if (pdeaux->pde_users > 0) {
        DECLARE_COMPLETION_ONSTACK(c);

        pdeaux = to_pde_aux(de);
        if (!pdeaux->pde_unload_completion)
                pdeaux->pde_unload_completion = &c;
        spin_unlock(&pdeaux->pde_unload_lock);
        spin_unlock(&proc_subdir_lock);
        wait_for_completion(pdeaux->pde_unload_completion);
        spin_lock(&proc_subdir_lock);
        goto continue_removing;
}

I'll update a patch to make the 2.6.18-308.20.1.el5 build pass. |
| Comment by Zhenyu Xu [ 10/Dec/12 ] |
|
The b2_1 patch handling the build error for the 2.6.18-308 RHEL5 kernel is tracked at http://review.whamcloud.com/4794. Commit message:

LU-1484 kernel: pass RHEL5 build for 2.6.18-308

For the vanilla kernel, proc_dir_entry::deleted and ::pde_users co-exist from 2.6.23 to 2.6.23.17. Some RHEL5 kernels define the co-existing proc_dir_entry::deleted and proc_dir_entry_aux::pde_users. |
| Comment by Zhenyu Xu [ 11/Dec/12 ] |
|
Strangely, the RHEL5 build in http://build.whamcloud.com/job/lustre-reviews/11128/arch=x86_64,build_type=server,distro=el5,ib_stack=inkernel/ still shows that the pde_users test failed. Chris Gearing, could you help me check why the RHEL5 (b2_1) build does not detect the pde_users member? Thanks. Check items:
|
| Comment by Peter Jones [ 19/Dec/12 ] |
|
Landed for 2.1.4. RHEL5 is not supported in 2.4, so this latest change is not needed on master. |
| Comment by Jian Yu [ 22/Dec/12 ] |
|
Lustre Tag: v2_1_4_RC2

The issue occurred again: https://maloo.whamcloud.com/test_sets/baaad7ac-4c1d-11e2-875d-52540035b04c |
| Comment by Jian Yu [ 31/Dec/12 ] |
|
Lustre Branch: b1_8

The same issue occurred: https://maloo.whamcloud.com/test_sets/e6be996c-51b5-11e2-a904-52540035b04c |
| Comment by Chris Gearing (Inactive) [ 03/Jan/13 ] |
|
Zhenyu Xu: I'm not sure what you are asking of me, but perhaps you could tell me what this line indicates?

after configure-ed, the config.h contains "#define HAVE_PROCFS_DELETED 1"
which config.h file is this? |
| Comment by Zhenyu Xu [ 03/Jan/13 ] |
|
| Comment by Jian Yu [ 06/Jan/13 ] |
|
Lustre Branch: b1_8

The same issue occurred again: https://maloo.whamcloud.com/test_sets/6ed434e6-57ca-11e2-9cc9-52540035b04c |
| Comment by Chris Gearing (Inactive) [ 08/Jan/13 ] |
|
I presume this is a build issue. If I look at the latest b1_8 head build on the server, BUILD/BUILD/lustre-1.8.8.60/config.h does have

#define HAVE_PROCFS_DELETED 1

but for pde_users it has only the comment

/* kernel has pde_users member in procfs entry struct */

(i.e. HAVE_PROCFS_USERS is not defined). I've attached the config file and the config.log. |
| Comment by Zhenyu Xu [ 08/Jan/13 ] |
|
I need to port the relevant patches to the b1_8 branch then. It's tracked at http://review.whamcloud.com/4976. Commit message:

LU-1484 lprocfs: refine LC_PROCFS_USERS check

In some RHEL-patched 2.6.18 kernels, the pde_users member is added in a separate struct proc_dir_entry_aux instead of in struct proc_dir_entry as in kernel versions 2.6.23 and later. |
| Comment by Peter Jones [ 14/Jan/13 ] |
|
Landed to b1_8 |
| Comment by Jian Yu [ 19/Jan/13 ] |
|
Lustre Branch: b1_8

The issue still occurred: |
| Comment by Jian Yu [ 20/Jan/13 ] |
|
Another instance on Lustre build http://build.whamcloud.com/job/lustre-b1_8/249 : |
| Comment by Zhenyu Xu [ 20/Jan/13 ] |
|
b1_8 still needs another patch landed: http://review.whamcloud.com/5129. Commit message:

LU-1484 kernel: pass RHEL5 build for 2.6.18-308

For the vanilla kernel, proc_dir_entry::deleted and ::pde_users co-exist from 2.6.23 to 2.6.23.17. Some RHEL5 kernels define the co-existing proc_dir_entry::deleted and proc_dir_entry_aux::pde_users. |
| Comment by Jian Yu [ 29/Jan/13 ] |
|
Lustre Branch: b1_8

recovery-small test 57 failed again: https://maloo.whamcloud.com/test_sets/68c48694-6a28-11e2-85d4-52540035b04c |
| Comment by Zhenyu Xu [ 29/Jan/13 ] |
|
Chris, would you mind checking the build system again for the following info from the latest b1_8 build (http://build.whamcloud.com/job/lustre-b1_8/252)?
|
| Comment by Chris Gearing (Inactive) [ 30/Jan/13 ] |
|
Xu, can you provide me a lot more info please? I really do not know what you are asking me to check. Does fs/proc/internal.h come as part of the Lustre source? If so, how does the build system affect the presence of pde_users; and if not, where does fs/proc/internal.h come from? I guess I'm just not understanding how the build system affects these things. What do I need to provide to help you debug this? |
| Comment by Zhenyu Xu [ 30/Jan/13 ] |
|
Sorry, fs/proc/internal.h is a kernel file, e.g. linux-2.6.32-xxx/fs/proc/internal.h. I'll also check my local 2.6.18-348.1.1.el5 kernel. |
| Comment by Zhenyu Xu [ 30/Jan/13 ] |
|
Chris,

This is the confirmation procedure from my local VM test environment (CentOS 5.9, 2.6.18-348.1.1.el5 kernel); can you confirm these steps on the build node?

$ tail -n 15 linux-2.6.18-308.11.1.el5-b18/fs/proc/internal.h
/*
* RHEL internal wrapper to extend struct proc_dir_entry
*/
struct proc_dir_entry_aux {
struct proc_dir_entry pde;
int pde_users; /* number of callers into module in progress */
spinlock_t pde_unload_lock; /* proc_fops checks and pde_users bumps */
struct completion *pde_unload_completion;
char name[]; /* PDE name */
};
static inline struct proc_dir_entry_aux *to_pde_aux(struct proc_dir_entry *d)
{
return container_of(d, struct proc_dir_entry_aux, pde);
}
$ ./configure --with-linux=/path-to/linux-2.6.18-308.11.1.el5-b18
...
...
checking if kernel has pde_users member in procfs entry struct... yes
...
checking if kernel has deleted member in procfs entry struct... yes
...
$ grep PROCFS ~/work/lustre-b18/config.h
440:#define HAVE_PROCFS_DELETED 1
443:#define HAVE_PROCFS_USERS 1
$ ONLY=57 bash recovery-small.sh
...
...
== test 57: read procfs entries causes kernel crash == 20:41:25
fail_loc=0x80000B00
Stopping client test3 /mnt/lustre (opts:)
fail_loc=0x80000B00
Stopping /mnt/mds (opts:)
Failover mds to test3
20:41:49 (1359549709) waiting for test3 network 900 secs ...
20:41:49 (1359549709) network interface is UP
Starting mds: -o loop -o abort_recovery /tmp/lustre-mdt /mnt/mds
lnet.debug=0x33f1504
lnet.subsystem_debug=0xffb7e3ff
lnet.debug_mb=32
Started lustre-MDT0000
recovery-small.sh: line 992: kill: (1143) - No such process
fail_loc=0
Starting client: test3: -o user_xattr,acl,flock test3@tcp:/lustre /mnt/lustre
lnet.debug=0x33f1504
lnet.subsystem_debug=0xffb7e3ff
lnet.debug_mb=32
Filesystem 1K-blocks Used Available Use% Mounted on
test3@tcp:/lustre 562408 53656 478752 11% /mnt/lustre
Resetting fail_loc on all nodes...done.
PASS 57 (26s)
...===== recovery-small.sh test complete, duration 28 sec ======================
|
| Comment by Zhenyu Xu [ 31/Jan/13 ] |
|
in the client build log "config.log" I found this:

| #include <linux/kernel.h>
|
| #include "/var/lib/jenkins/workspace/lustre-b1_8/arch/x86_64/build_type/client/distro/el5/ib_stack/inkernel/BUILD/reused/usr/src/kernels/2.6.18-348.1.1.el5-x86_64/fs/proc/internal.h"
|
| int
| main (void)
| {
|
| struct proc_dir_entry_aux pde_aux;
|
| pde_aux.pde_users = 0;
|
| ;
| return 0;
| }

configure:14056: result: no
And I checked the client build environment Chris copied for debugging:

bobijam@brent:/scratch/help-bob-jam/client/BUILD/reused/usr/src/kernels/2.6.18-3

There are no files under fs/proc/, while this RHEL kernel (vanilla kernel + RHEL patches) should have C/H files under fs/proc/.

Brian, does the build process at this stage use only the vanilla kernel (i.e. has it not yet applied the RHEL patches)? The RHEL kernel src rpm only provides the vanilla kernel source plus RHEL's patches, and the patches only get applied when rpmbuild executes its "%prep" stage. |
| Comment by Brian Murrell (Inactive) [ 01/Feb/13 ] |
|
Does the client build even use full kernel source at all? It shouldn't need it, I don't think; just kernel-devel. Yes, looking in lbuild in build_with_srpm() where $PATCHLESS == true:

if ! kernelrpm=$(find_linux_rpm "-$DEVEL_KERNEL_TYPE"); then
    fatal 1 "Could not find the kernel-$DEVEL_KERNEL_TYPE RPM in ${KERNELRPMSBASE}/${lnxmaj}/${DISTRO}"
fi
if ! lnxrel="$lnxrel" unpack_linux_devel_rpm "$kernelrpm" "-"; then
    fatal 1 "Could not find the Linux tree in $kernelrpm"
fi

and we can see in the build log:

+ kernelrpm=/var/lib/jenkins/lbuild-data/kernelrpm/2.6.18/rhel5/x86_64/kernel-devel-2.6.18-348.1.1.el5.x86_64.rpm
...
+ unpack_linux_devel_rpm /var/lib/jenkins/lbuild-data/kernelrpm/2.6.18/rhel5/x86_64/kernel-devel-2.6.18-348.1.1.el5.x86_64.rpm -

So indeed, it's kernel-devel that lustre's configure is pointed at in build_with_srpm()->build_lustre():

++ ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/usr/com --mandir=/usr/share/man --infodir=/usr/share/info --with-linux=/var/lib/jenkins/workspace/lustre-b1_8/arch/x86_64/build_type/client/distro/el5/ib_stack/inkernel/BUILD/reused/usr/src/kernels/2.6.18-348.1.1.el5-x86_64 --with-linux-obj=/var/lib/jenkins/workspace/lustre-b1_8/arch/x86_64/build_type/client/distro/el5/ib_stack/inkernel/BUILD/reused/usr/src/kernels/2.6.18-348.1.1.el5-x86_64 --disable-server --enable-liblustre --enable-liblustre-tests --with-release=wc1_2.6.18_348.1.1.el5_g3480bb0 --enable-tests --enable-liblustre-tests

If you look in kernel-devel you will find that fs/proc/ is empty, because kernel-devel is "kernel headers", not full kernel source. Ultimately, what this means is that you need to re-cook your test so as not to need kernel source, only kernel headers. This is the standard for external kernel modules: they should be buildable with kernel headers only, since the kernel headers represent the API, and reaching behind the API is "cheating". |
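For comparison, a header-only probe of the kind lbuild can satisfy looks something like this sketch: it compiles against the public include/linux/proc_fs.h shipped in kernel-devel, so it can detect the vanilla 2.6.23+ proc_dir_entry::pde_users, but it has no way to reach RHEL5's proc_dir_entry_aux, which lives only in the unshipped fs/proc/internal.h:

/* conftest-style sketch: compiles only when the member is in the public header */
#include <linux/proc_fs.h>

int main(void)
{
        struct proc_dir_entry pde;

        pde.pde_users = 0;      /* compile error on kernels without the member */
        return 0;
}

This is exactly why the proc_dir_entry_aux check quoted above cannot work in a patchless client build.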
| Comment by Zhenyu Xu [ 01/Feb/13 ] |
|
Andreas, RHEL 5.9 doesn't reveal pde_users in its devel package, and I can find no other way to detect proc_dir_entry_aux::pde_users. Since pde_users is used in all later kernels, is it OK to change the lprocfs_status.[c|h] code to assume HAVE_PROCFS_USERS is always defined? |
| Comment by Andreas Dilger [ 01/Feb/13 ] |
|
I can't find any way to check for proc_dir_entry_aux, so we can't depend on checking it for patchless clients. I think what needs to change here is two things:
At worst this causes some small race where a /proc entry will not be shown when it is just loaded or unloaded, but it should be safe against crashing.

static inline int LPROCFS_ENTRY_AND_CHECK(struct proc_dir_entry *dp)
{
        int deleted = 0;

#ifdef HAVE_PROCFS_USERS
        spin_lock(&dp->pde_unload_lock);
#endif
        if (unlikely(dp->proc_fops == NULL))
                deleted = 1;
#ifdef HAVE_PROCFS_USERS
        spin_unlock(&dp->pde_unload_lock);
#endif
        LPROCFS_ENTRY();
#if defined(HAVE_PROCFS_DELETED)
        if (unlikely(dp->deleted)) {
                LPROCFS_EXIT();
                deleted = 1;
        }
#endif
        return deleted ? -ENODEV : 0;
}

I haven't tested this at all, nor even compiled it yet. |
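For context, here is a sketch of how a read path would use this helper; lprocfs_fops_read() itself is real (it appears in the stack traces above), but this body is illustrative only, not the actual Lustre implementation:

static ssize_t lprocfs_fops_read(struct file *f, char __user *buf,
                                 size_t size, loff_t *ppos)
{
        struct proc_dir_entry *dp = PDE(f->f_dentry->d_inode);
        ssize_t rc;

        rc = LPROCFS_ENTRY_AND_CHECK(dp);  /* -ENODEV if the entry is going away */
        if (rc < 0)
                return rc;

        /* ... format the entry's data, copy_to_user() it, and set rc to
         * the number of bytes copied ... */

        LPROCFS_EXIT();                    /* drop the read side of _lprocfs_lock */
        return rc;
}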
| Comment by Andreas Dilger [ 01/Feb/13 ] |
|
Patch at http://review.whamcloud.com/5253, let's hope it builds and tests OK. |
| Comment by Jian Yu [ 05/Feb/13 ] |
|
Lustre Branch: b1_8

The issue still occurred: https://maloo.whamcloud.com/test_sets/583b7710-7009-11e2-a955-52540035b04c |
| Comment by Zhenyu Xu [ 06/Feb/13 ] |
|
Since recovery-small test_57 is intended to test removing a proc entry while it is being read, the patch (review #5253) cannot avoid the test hanging with a patchless client built against kernels that keep the proc_dir_entry users accounting hidden. Since later kernels all use proc_dir_entry users, I think we can presume it and define LPROCFS_{ENTRY,EXIT} as empty ops. |
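A sketch of what that could look like, assuming (as the comment says) the kernel always tracks proc_dir_entry users and therefore already keeps an entry alive across a read; the macro names follow the ones used earlier in this ticket, and the fallback branch is illustrative rather than the actual patch:

#ifdef HAVE_PROCFS_USERS
/* the kernel's pde_users accounting already serializes readers
 * against removal, so Lustre needs no locking of its own here */
# define LPROCFS_ENTRY() do { } while (0)
# define LPROCFS_EXIT()  do { } while (0)
#else
# define LPROCFS_ENTRY() down_read(&_lprocfs_lock)
# define LPROCFS_EXIT()  up_read(&_lprocfs_lock)
#endif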
| Comment by Nathaniel Clark [ 14/Feb/13 ] |
|
Patch to assume proc_dir_entry users for RHEL kernels: http://review.whamcloud.com/5439 |
| Comment by Peter Jones [ 18/Feb/13 ] |
|
Nathaniel, Is this patch needed for b2_1 also? Peter |
| Comment by Jian Yu [ 18/Feb/13 ] |
|
Per http://wiki.whamcloud.com/display/ENG/Lustre+2.1.4+release+testing+tracker, the issue still exists in Lustre 2.1.4, so we need the patch on the current b2_1 branch for Lustre 2.1.5. |
| Comment by Nathaniel Clark [ 19/Feb/13 ] |
|
Peter, yes. This patch applies cleanly to b2_1 (all the way through master). It should be applied to anything on which we want to support RHEL 5. Should I submit additional patches? |
| Comment by Nathaniel Clark [ 19/Feb/13 ] |
|
b2_1 patch: |
| Comment by Peter Jones [ 21/Feb/13 ] |
|
Landed for 1.8.9 and 2.1.5 |