[LU-10804] obdfilter-survey test_3a: lprocfs_alloc_obd_stats()) ASSERTION( obd->obd_proc_entry != ((void *)0) ) failed; LBUG Created: 12/Mar/18  Updated: 17/Mar/18  Resolved: 17/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for elena <c17455@cray.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5d126b54-2233-11e8-9ec4-52540065bddc

[ 3072.347530] LustreError: 30713:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) ASSERTION( obd->obd_proc_entry != ((void *)0) ) failed: 
[ 3072.351782] LustreError: 30713:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) LBUG
[ 3072.354022] Pid: 30713, comm: lctl
[ 3072.355821] 
[ 3072.355821] Call Trace:
[ 3072.359137]  [<ffffffffc06917ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 3072.361177]  [<ffffffffc069183c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 3072.363336]  [<ffffffffc0cbe06f>] lprocfs_alloc_obd_stats+0x18f/0x200 [obdclass]
[ 3072.365454]  [<ffffffffc0c81730>] ? lprocfs_obd_setup+0x130/0x320 [obdclass]
[ 3072.367602]  [<ffffffffc12b13ad>] echo_srv_device_alloc+0x48d/0x540 [obdecho]
[ 3072.369738]  [<ffffffffc0c91cb4>] obd_setup+0x114/0x2a0 [obdclass]
[ 3072.371892]  [<ffffffffc0c920e8>] class_setup+0x2a8/0x840 [obdclass]
[ 3072.373960]  [<ffffffffc0c9651f>] class_process_config+0x1b3f/0x27e0 [obdclass]
[ 3072.376145]  [<ffffffffc0c86ff9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 3072.378240]  [<ffffffffc0c7053a>] class_handle_ioctl+0x56a/0x1df0 [obdclass]
[ 3072.380342]  [<ffffffff8121f7b8>] ? destroy_inode+0x38/0x60
[ 3072.382250]  [<ffffffff812b3ea8>] ? security_capable+0x18/0x20
[ 3072.384178]  [<ffffffffc0c567f2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[ 3072.386089]  [<ffffffff8121730d>] do_vfs_ioctl+0x33d/0x540
[ 3072.387929]  [<ffffffff81224a04>] ? mntput+0x24/0x40
[ 3072.389676]  [<ffffffff81205013>] ? __fput+0x183/0x260
[ 3072.391419]  [<ffffffff812175b1>] SyS_ioctl+0xa1/0xc0
[ 3072.393128]  [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[ 3072.394862]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[ 3072.396554]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
[ 3072.398245] 
[ 3072.399550] Kernel panic - not syncing: LBUG


 Comments   
Comment by Cory Spitz [ 12/Mar/18 ]

Obdfilter-survey ought to work very reliably.  We should consider this a major issue (or greater).

Comment by Oleg Drokin [ 12/Mar/18 ]

The report referenced is for a gerrit patch that is based on another patch that seems to change relevant functionality. We looked and this sort of failure was not observed in the latest master test results, so the primary suspicion is it's the patch in question that broke things, esp. since if you look, the baseline patch is "move procfs stuff to debugfs" and the assertion you hit is in procfs-related code.

Comment by James Nunez (Inactive) [ 12/Mar/18 ]

I've reviewed all obdfilter-survey test 3a hangs over all branches starting from January 1, 2018 to today and I've found the following test sessions that have the same assertion as in this ticket:

2018-03-09 22:59:29 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/67d21086-240d-11e8-b74b-52540065bddc

2018-03-09 23:53:56 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/6d7aa550-240f-11e8-b3c6-52540065bddc

2018-03-10 02:13:04 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/8d7b15dc-242e-11e8-b74b-52540065bddc

2018-03-09 00:54:41 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/7c8f22cc-2348-11e8-9852-52540065bddc

2018-03-08 18:53:42 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/965a62b4-231d-11e8-9852-52540065bddc

2018-03-08 21:41:26 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/15d7653a-233e-11e8-8d2f-52540065bddc

2018-03-08 22:23:08 UTC full-patchless – https://testing.hpdd.intel.com/test_sets/0f8acdfa-2340-11e8-88de-52540065bddc

All are from full test sessions for our patchless Lustre builds/testing.

Comment by Oleg Drokin [ 13/Mar/18 ]

hm, so that's all after Mar 6th round of landings, I guess.

Comment by Elena Gryaznova [ 13/Mar/18 ]

This was not observed in the latest master test results because of test_3a was not functional, see LU-7420 :

> this test passes at Maloo, seems it does just nothing.

LU-7420 patch https://review.whamcloud.com/#/c/18443/ fixes the test to be executed. The reported issue hit with this patch.

Comment by Peter Jones [ 13/Mar/18 ]

Mike

Can you please investigate?

Thanks

Peter

Comment by Andreas Dilger [ 13/Mar/18 ]

Looking at the assertion, this appears like it is related to changes in /proc, possibly related to /sysfs changes? Possibly James already has a patch to fix this part of the code?

Comment by Sarah Liu [ 13/Mar/18 ]

Hit this in interop testing

https://testing.hpdd.intel.com/test_sets/baa94546-2709-11e8-b3c6-52540065bddc

 

client: 2.10.3

server: lustre-master tag-2.10.59

OSS console

[75396.029755] Lustre: DEBUG MARKER: == obdfilter-survey test 3a: Network survey ========================================================== 18:48:29 (1520966909)
[75401.745879] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.4.73@tcp) was lost; in progress operations using this service will wait for recovery to complete
[75401.749951] Lustre: Skipped 9 previous similar messages
[75410.250117] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts
[75410.567342] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
[75410.784095] Lustre: server umount lustre-OST0000 complete
[75410.950964] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75411.276952] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts
[75411.590521] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost2
[75411.965347] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75412.289128] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts
[75412.601354] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost3
[75412.988541] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75413.310862] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts
[75413.620037] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost4
[75413.745052] Lustre: 26996:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520966920/real 1520966920] req@ffff880037402400 x1594796080200464/t0(0) o400->MGC10.9.4.73@tcp@10.9.4.73@tcp:26/25 lens 224/224 e 0 to 1 dl 1520966927 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[75413.751282] Lustre: 26996:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[75413.753466] LustreError: 166-1: MGC10.9.4.73@tcp: Connection to MGS (at 10.9.4.73@tcp) was lost; in progress operations using this service will fail
[75413.977797] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75414.299315] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts
[75414.606635] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost5
[75414.967344] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75415.289338] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts
[75415.598284] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost6
[75415.793961] Lustre: server umount lustre-OST0005 complete
[75415.795939] Lustre: Skipped 4 previous similar messages
[75415.954904] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75416.274434] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts
[75416.583237] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost7
[75416.965393] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
[75419.722863] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @
[75421.172060] LustreError: 14767:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) ASSERTION( obd->obd_proc_entry != ((void *)0) ) failed: 
[75421.176024] LustreError: 14767:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) LBUG
[75421.178166] Pid: 14767, comm: lctl
[75421.179941] 
[75421.179941] Call Trace:
[75421.207941] [<ffffffffc06ae7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
[75421.210086] [<ffffffffc06ae83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[75421.212244] [<ffffffffc08c867f>] lprocfs_alloc_obd_stats+0x18f/0x200 [obdclass]
[75421.214387] [<ffffffffc088b920>] ? lprocfs_obd_setup+0x130/0x320 [obdclass]
[75421.237295] [<ffffffffc10a8ced>] echo_setup+0x1bd/0x2c0 [obdecho]
[75421.273090] [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[75421.275072] [<ffffffffc089bfda>] obd_setup+0x18a/0x2b0 [obdclass]
[75421.276950] [<ffffffffc06c0a66>] ? cfs_hash_create+0x366/0xa20 [libcfs]
[75421.278899] [<ffffffffc089c3a8>] class_setup+0x2a8/0x840 [obdclass]
[75421.280725] [<ffffffffc08a081c>] class_process_config+0x1b5c/0x2810 [obdclass]
[75421.282654] [<ffffffffc0891219>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[75421.284527] [<ffffffffc087a71a>] class_handle_ioctl+0x56a/0x1de0 [obdclass]
[75421.286438] [<ffffffff8122239b>] ? destroy_inode+0x3b/0x60
[75421.288202] [<ffffffff812b72be>] ? security_capable+0x1e/0x20
[75421.290020] [<ffffffffc08607f2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[75421.291824] [<ffffffff81219e90>] do_vfs_ioctl+0x350/0x560
[75421.293554] [<ffffffff812275e4>] ? mntput+0x24/0x40
[75421.295289] [<ffffffff812079b6>] ? __fput+0x186/0x260
[75421.296973] [<ffffffff8121a141>] SyS_ioctl+0xa1/0xc0
[75421.298680] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
[75421.300454] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[75421.302229] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
[75421.304033] 
[75421.305435] Kernel panic - not syncing: LBUG
[75421.306428] CPU: 1 PID: 14767 Comm: lctl Tainted: G OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1
[75421.306428] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[75421.306428] Call Trace:
[75421.306428] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
[75421.306428] [<ffffffff816a8634>] panic+0xe8/0x21f
[75421.306428] [<ffffffffc06ae854>] lbug_with_loc+0x64/0xb0 [libcfs]
[75421.306428] [<ffffffffc08c867f>] lprocfs_alloc_obd_stats+0x18f/0x200 [obdclass]
[75421.306428] [<ffffffffc088b920>] ? lprocfs_obd_setup+0x130/0x320 [obdclass]
[75421.306428] [<ffffffffc10a8ced>] echo_setup+0x1bd/0x2c0 [obdecho]
[75421.306428] [<ffffffffc0ab6810>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[75421.306428] [<ffffffffc089bfda>] obd_setup+0x18a/0x2b0 [obdclass]
[75421.306428] [<ffffffffc06c0a66>] ? cfs_hash_create+0x366/0xa20 [libcfs]
[75421.306428] [<ffffffffc089c3a8>] class_setup+0x2a8/0x840 [obdclass]
[75421.306428] [<ffffffffc08a081c>] class_process_config+0x1b5c/0x2810 [obdclass]
[75421.306428] [<ffffffffc0891219>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[75421.306428] [<ffffffffc087a71a>] class_handle_ioctl+0x56a/0x1de0 [obdclass]
[75421.306428] [<ffffffff8122239b>] ? destroy_inode+0x3b/0x60
[75421.306428] [<ffffffff812b72be>] ? security_capable+0x1e/0x20
[75421.306428] [<ffffffffc08607f2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[75421.306428] [<ffffffff81219e90>] do_vfs_ioctl+0x350/0x560
[75421.306428] [<ffffffff812275e4>] ? mntput+0x24/0x40
[75421.306428] [<ffffffff812079b6>] ? __fput+0x186/0x260
[75421.306428] [<ffffffff8121a141>] SyS_ioctl+0xa1/0xc0
[75421.306428] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
[75421.306428] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[75421.306428] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
[ 0.000000] Initializing cgroup subsys cpuset

Comment by Mikhail Pershin [ 15/Mar/18 ]

This is LU-8066, commit 0100ab268c and this particular change:

diff --git a/lustre/obdecho/echo_client.c b/lustre/obdecho/echo_client.c
index 081a339..6e25a02 100644
--- a/lustre/obdecho/echo_client.c
+++ b/lustre/obdecho/echo_client.c
@@ -3071,7 +3071,7 @@ static int __init obdecho_init(void)
         if (rc != 0)
                 goto failed_0;
 
-       rc = class_register_type(&echo_obd_ops, NULL, true, NULL,
+       rc = class_register_type(&echo_obd_ops, NULL, false, NULL,
                                 LUSTRE_ECHO_NAME, NULL);
        if (rc != 0)
                goto failed_1;

that change causes empty lprocfs root for echo server device and later it asserts on it in lprocfs_alloc_obd_stats() as I can see.

Comment by Gerrit Updater [ 15/Mar/18 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/31664
Subject: LU-10804 echo: allow echo server to setup procfs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8685e8db0b55651fab21170ed75bc9df454c0e58

Comment by Gerrit Updater [ 17/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31664/
Subject: LU-10804 echo: allow echo server to setup procfs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: af94e4b7afd3fa724d161a5127dc9a89d898e73c

Comment by Peter Jones [ 17/Mar/18 ]

Landed for 2.11

Generated at Sat Feb 10 02:38:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.