[LU-10804] obdfilter-survey test_3a: lprocfs_alloc_obd_stats()) ASSERTION( obd->obd_proc_entry != ((void *)0) ) failed; LBUG Created: 12/Mar/18 Updated: 17/Mar/18 Resolved: 17/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for elena <c17455@cray.com> This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5d126b54-2233-11e8-9ec4-52540065bddc [ 3072.347530] LustreError: 30713:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) ASSERTION( obd->obd_proc_entry != ((void *)0) ) failed: [ 3072.351782] LustreError: 30713:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) LBUG [ 3072.354022] Pid: 30713, comm: lctl [ 3072.355821] [ 3072.355821] Call Trace: [ 3072.359137] [<ffffffffc06917ae>] libcfs_call_trace+0x4e/0x60 [libcfs] [ 3072.361177] [<ffffffffc069183c>] lbug_with_loc+0x4c/0xb0 [libcfs] [ 3072.363336] [<ffffffffc0cbe06f>] lprocfs_alloc_obd_stats+0x18f/0x200 [obdclass] [ 3072.365454] [<ffffffffc0c81730>] ? lprocfs_obd_setup+0x130/0x320 [obdclass] [ 3072.367602] [<ffffffffc12b13ad>] echo_srv_device_alloc+0x48d/0x540 [obdecho] [ 3072.369738] [<ffffffffc0c91cb4>] obd_setup+0x114/0x2a0 [obdclass] [ 3072.371892] [<ffffffffc0c920e8>] class_setup+0x2a8/0x840 [obdclass] [ 3072.373960] [<ffffffffc0c9651f>] class_process_config+0x1b3f/0x27e0 [obdclass] [ 3072.376145] [<ffffffffc0c86ff9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [ 3072.378240] [<ffffffffc0c7053a>] class_handle_ioctl+0x56a/0x1df0 [obdclass] [ 3072.380342] [<ffffffff8121f7b8>] ? destroy_inode+0x38/0x60 [ 3072.382250] [<ffffffff812b3ea8>] ? security_capable+0x18/0x20 [ 3072.384178] [<ffffffffc0c567f2>] obd_class_ioctl+0xd2/0x170 [obdclass] [ 3072.386089] [<ffffffff8121730d>] do_vfs_ioctl+0x33d/0x540 [ 3072.387929] [<ffffffff81224a04>] ? mntput+0x24/0x40 [ 3072.389676] [<ffffffff81205013>] ? __fput+0x183/0x260 [ 3072.391419] [<ffffffff812175b1>] SyS_ioctl+0xa1/0xc0 [ 3072.393128] [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214 [ 3072.394862] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b [ 3072.396554] [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214 [ 3072.398245] [ 3072.399550] Kernel panic - not syncing: LBUG |
| Comments |
| Comment by Cory Spitz [ 12/Mar/18 ] |
|
Obdfilter-survey ought to work very reliably. We should consider this a major issue (or greater). |
| Comment by Oleg Drokin [ 12/Mar/18 ] |
|
The report referenced is for a gerrit patch that is based on another patch that seems to change relevant functionality. We looked and this sort of failure was not observed in the latest master test results, so the primary suspicion is it's the patch in question that broke things, esp. since if you look, the baseline patch is "move procfs stuff to debugfs" and the assertion you hit is in procfs-related code. |
| Comment by James Nunez (Inactive) [ 12/Mar/18 ] |
|
I've reviewed all obdfilter-survey test 3a hangs over all branches starting from January 1, 2018 to today and I've found the following test sessions that have the same assertion as in this ticket: 2018-03-09 22:59:29 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/67d21086-240d-11e8-b74b-52540065bddc 2018-03-09 23:53:56 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/6d7aa550-240f-11e8-b3c6-52540065bddc 2018-03-10 02:13:04 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/8d7b15dc-242e-11e8-b74b-52540065bddc 2018-03-09 00:54:41 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/7c8f22cc-2348-11e8-9852-52540065bddc 2018-03-08 18:53:42 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/965a62b4-231d-11e8-9852-52540065bddc 2018-03-08 21:41:26 UTC full-patchless - https://testing.hpdd.intel.com/test_sets/15d7653a-233e-11e8-8d2f-52540065bddc 2018-03-08 22:23:08 UTC full-patchless – https://testing.hpdd.intel.com/test_sets/0f8acdfa-2340-11e8-88de-52540065bddc All are from full test sessions for our patchless Lustre builds/testing. |
| Comment by Oleg Drokin [ 13/Mar/18 ] |
|
hm, so that's all after Mar 6th round of landings, I guess. |
| Comment by Elena Gryaznova [ 13/Mar/18 ] |
|
This was not observed in the latest master test results because of test_3a was not functional, see > this test passes at Maloo, seems it does just nothing.
|
| Comment by Peter Jones [ 13/Mar/18 ] |
|
Mike Can you please investigate? Thanks Peter |
| Comment by Andreas Dilger [ 13/Mar/18 ] |
|
Looking at the assertion, this appears like it is related to changes in /proc, possibly related to /sysfs changes? Possibly James already has a patch to fix this part of the code? |
| Comment by Sarah Liu [ 13/Mar/18 ] |
|
Hit this in interop testing https://testing.hpdd.intel.com/test_sets/baa94546-2709-11e8-b3c6-52540065bddc
client: 2.10.3 server: lustre-master tag-2.10.59 OSS console [75396.029755] Lustre: DEBUG MARKER: == obdfilter-survey test 3a: Network survey ========================================================== 18:48:29 (1520966909) [75401.745879] Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.9.4.73@tcp) was lost; in progress operations using this service will wait for recovery to complete [75401.749951] Lustre: Skipped 9 previous similar messages [75410.250117] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts [75410.567342] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1 [75410.784095] Lustre: server umount lustre-OST0000 complete [75410.950964] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75411.276952] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost2' ' /proc/mounts [75411.590521] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost2 [75411.965347] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75412.289128] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost3' ' /proc/mounts [75412.601354] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost3 [75412.988541] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75413.310862] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost4' ' /proc/mounts [75413.620037] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost4 [75413.745052] Lustre: 26996:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1520966920/real 1520966920] req@ffff880037402400 x1594796080200464/t0(0) o400->MGC10.9.4.73@tcp@10.9.4.73@tcp:26/25 lens 224/224 e 0 to 1 dl 1520966927 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 [75413.751282] Lustre: 26996:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 1 previous similar message [75413.753466] LustreError: 166-1: MGC10.9.4.73@tcp: Connection to MGS (at 10.9.4.73@tcp) was lost; in progress operations using this service will fail [75413.977797] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75414.299315] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost5' ' /proc/mounts [75414.606635] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost5 [75414.967344] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75415.289338] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost6' ' /proc/mounts [75415.598284] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost6 [75415.793961] Lustre: server umount lustre-OST0005 complete [75415.795939] Lustre: Skipped 4 previous similar messages [75415.954904] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75416.274434] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost7' ' /proc/mounts [75416.583237] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost7 [75416.965393] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' [75419.722863] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d @ [75421.172060] LustreError: 14767:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) ASSERTION( obd->obd_proc_entry != ((void *)0) ) failed: [75421.176024] LustreError: 14767:0:(lprocfs_status_server.c:647:lprocfs_alloc_obd_stats()) LBUG [75421.178166] Pid: 14767, comm: lctl [75421.179941] [75421.179941] Call Trace: [75421.207941] [<ffffffffc06ae7ae>] libcfs_call_trace+0x4e/0x60 [libcfs] [75421.210086] [<ffffffffc06ae83c>] lbug_with_loc+0x4c/0xb0 [libcfs] [75421.212244] [<ffffffffc08c867f>] lprocfs_alloc_obd_stats+0x18f/0x200 [obdclass] [75421.214387] [<ffffffffc088b920>] ? lprocfs_obd_setup+0x130/0x320 [obdclass] [75421.237295] [<ffffffffc10a8ced>] echo_setup+0x1bd/0x2c0 [obdecho] [75421.273090] [<ffffffffc0ab6810>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc] [75421.275072] [<ffffffffc089bfda>] obd_setup+0x18a/0x2b0 [obdclass] [75421.276950] [<ffffffffc06c0a66>] ? cfs_hash_create+0x366/0xa20 [libcfs] [75421.278899] [<ffffffffc089c3a8>] class_setup+0x2a8/0x840 [obdclass] [75421.280725] [<ffffffffc08a081c>] class_process_config+0x1b5c/0x2810 [obdclass] [75421.282654] [<ffffffffc0891219>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [75421.284527] [<ffffffffc087a71a>] class_handle_ioctl+0x56a/0x1de0 [obdclass] [75421.286438] [<ffffffff8122239b>] ? destroy_inode+0x3b/0x60 [75421.288202] [<ffffffff812b72be>] ? security_capable+0x1e/0x20 [75421.290020] [<ffffffffc08607f2>] obd_class_ioctl+0xd2/0x170 [obdclass] [75421.291824] [<ffffffff81219e90>] do_vfs_ioctl+0x350/0x560 [75421.293554] [<ffffffff812275e4>] ? mntput+0x24/0x40 [75421.295289] [<ffffffff812079b6>] ? __fput+0x186/0x260 [75421.296973] [<ffffffff8121a141>] SyS_ioctl+0xa1/0xc0 [75421.298680] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146 [75421.300454] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21 [75421.302229] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146 [75421.304033] [75421.305435] Kernel panic - not syncing: LBUG [75421.306428] CPU: 1 PID: 14767 Comm: lctl Tainted: G OE ------------ 3.10.0-693.21.1.el7_lustre.x86_64 #1 [75421.306428] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [75421.306428] Call Trace: [75421.306428] [<ffffffff816ae7c8>] dump_stack+0x19/0x1b [75421.306428] [<ffffffff816a8634>] panic+0xe8/0x21f [75421.306428] [<ffffffffc06ae854>] lbug_with_loc+0x64/0xb0 [libcfs] [75421.306428] [<ffffffffc08c867f>] lprocfs_alloc_obd_stats+0x18f/0x200 [obdclass] [75421.306428] [<ffffffffc088b920>] ? lprocfs_obd_setup+0x130/0x320 [obdclass] [75421.306428] [<ffffffffc10a8ced>] echo_setup+0x1bd/0x2c0 [obdecho] [75421.306428] [<ffffffffc0ab6810>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc] [75421.306428] [<ffffffffc089bfda>] obd_setup+0x18a/0x2b0 [obdclass] [75421.306428] [<ffffffffc06c0a66>] ? cfs_hash_create+0x366/0xa20 [libcfs] [75421.306428] [<ffffffffc089c3a8>] class_setup+0x2a8/0x840 [obdclass] [75421.306428] [<ffffffffc08a081c>] class_process_config+0x1b5c/0x2810 [obdclass] [75421.306428] [<ffffffffc0891219>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] [75421.306428] [<ffffffffc087a71a>] class_handle_ioctl+0x56a/0x1de0 [obdclass] [75421.306428] [<ffffffff8122239b>] ? destroy_inode+0x3b/0x60 [75421.306428] [<ffffffff812b72be>] ? security_capable+0x1e/0x20 [75421.306428] [<ffffffffc08607f2>] obd_class_ioctl+0xd2/0x170 [obdclass] [75421.306428] [<ffffffff81219e90>] do_vfs_ioctl+0x350/0x560 [75421.306428] [<ffffffff812275e4>] ? mntput+0x24/0x40 [75421.306428] [<ffffffff812079b6>] ? __fput+0x186/0x260 [75421.306428] [<ffffffff8121a141>] SyS_ioctl+0xa1/0xc0 [75421.306428] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146 [75421.306428] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21 [75421.306428] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146 [ 0.000000] Initializing cgroup subsys cpuset |
| Comment by Mikhail Pershin [ 15/Mar/18 ] |
|
This is LU-8066, commit 0100ab268c and this particular change: diff --git a/lustre/obdecho/echo_client.c b/lustre/obdecho/echo_client.c index 081a339..6e25a02 100644 --- a/lustre/obdecho/echo_client.c +++ b/lustre/obdecho/echo_client.c @@ -3071,7 +3071,7 @@ static int __init obdecho_init(void) if (rc != 0) goto failed_0; - rc = class_register_type(&echo_obd_ops, NULL, true, NULL, + rc = class_register_type(&echo_obd_ops, NULL, false, NULL, LUSTRE_ECHO_NAME, NULL); if (rc != 0) goto failed_1; that change causes empty lprocfs root for echo server device and later it asserts on it in lprocfs_alloc_obd_stats() as I can see. |
| Comment by Gerrit Updater [ 15/Mar/18 ] |
|
Mike Pershin (mike.pershin@intel.com) uploaded a new patch: https://review.whamcloud.com/31664 |
| Comment by Gerrit Updater [ 17/Mar/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31664/ |
| Comment by Peter Jones [ 17/Mar/18 ] |
|
Landed for 2.11 |