[LU-6665] Interop 2.7.0<->master conf-sanity test_80: (import.c:293:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed
Created: 29/May/15  Updated: 16/Jan/22  Resolved: 16/Jan/22

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Mikhail Pershin
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

server: 2.7.0
client: lustre-master #3029


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/6d359bc8-0035-11e5-a922-5254006e85c2.

The sub-test test_80 failed with the following error:

test failed to respond and timed out

OST console shows:

03:25:51:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=10 fail_loc=0x906
03:25:51:LustreError: 11-0: MGC10.1.4.201@tcp: operation obd_ping to node 10.1.4.201@tcp failed: rc = -107
03:25:51:LustreError: Skipped 7 previous similar messages
03:26:22:LustreError: 166-1: MGC10.1.4.201@tcp: Connection to MGS (at 10.1.4.201@tcp) was lost; in progress operations using this service will fail
03:26:22:Lustre: 14127:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1432178732/real 1432178732]  req@ffff880077ed96c0 x1501746053519508/t0(0) o250->MGC10.1.4.201@tcp@10.1.4.201@tcp:26/25 lens 400/544 e 0 to 1 dl 1432178738 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
03:26:22:Lustre: 14127:0:(client.c:1939:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
03:26:22:Lustre: Evicted from MGS (at 10.1.4.201@tcp) after server handle changed from 0x7a35d8e1992e29fd to 0x7a35d8e1992e2aac
03:26:22:LustreError: 4602:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 sleeping for 15000ms
03:26:22:Lustre: DEBUG MARKER: mkdir -p /mnt/ost2
03:26:22:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P2
03:26:22:Lustre: DEBUG MARKER: mkdir -p /mnt/ost2; mount -t lustre   		                   /dev/lvm-Role_OSS/P2 /mnt/ost2
03:26:22:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
03:26:22:LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
03:26:22:LustreError: 4746:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 sleeping for 15000ms
03:26:22:LustreError: 4602:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
03:26:22:Lustre: MGC10.1.4.201@tcp: Connection restored to MGS (at 10.1.4.201@tcp)
03:26:22:LustreError: 4746:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
03:26:22:LustreError: 4746:0:(import.c:293:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: 
03:26:22:LustreError: 4746:0:(import.c:293:ptlrpc_invalidate_import()) LBUG
03:26:22:Pid: 4746, comm: mount.lustre
03:26:22:
03:26:22:Call Trace:
03:26:22: [<ffffffffa0820895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
03:26:22: [<ffffffffa0820e97>] lbug_with_loc+0x47/0xb0 [libcfs]
03:26:22: [<ffffffffa0c3f06d>] ptlrpc_invalidate_import+0x85d/0x930 [ptlrpc]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa0c440f6>] ? ptlrpc_set_import_discon+0xf6/0x5b0 [ptlrpc]
03:26:22: [<ffffffffa0c445e3>] ptlrpc_reconnect_import+0x33/0x1b0 [ptlrpc]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa12ea2ea>] mgc_set_info_async+0x5ea/0x1940 [mgc]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa0a006d1>] obd_set_info_async.clone.2+0xf1/0x360 [obdclass]
03:26:22: [<ffffffffa0a06c18>] lustre_start_mgc+0x14c8/0x1e00 [obdclass]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa0a356f2>] server_fill_super+0x5c2/0x1690 [obdclass]
03:26:22: [<ffffffffa082b818>] ? libcfs_log_return+0x28/0x40 [libcfs]
03:26:22: [<ffffffffa0a07ab0>] lustre_fill_super+0x560/0xa80 [obdclass]
03:26:22: [<ffffffffa0a07550>] ? lustre_fill_super+0x0/0xa80 [obdclass]
03:26:22: [<ffffffff811917af>] get_sb_nodev+0x5f/0xa0
03:26:22: [<ffffffffa09feb05>] lustre_get_sb+0x25/0x30 [obdclass]
03:26:22: [<ffffffff81190deb>] vfs_kern_mount+0x7b/0x1b0
03:26:22: [<ffffffff81190f92>] do_kern_mount+0x52/0x130
03:26:22: [<ffffffff811b2b9b>] do_mount+0x2fb/0x930
03:26:22: [<ffffffff811b3260>] sys_mount+0x90/0xe0
03:26:22: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
03:26:22:
03:26:22:Kernel panic - not syncing: LBUG
03:26:22:Pid: 4746, comm: mount.lustre Not tainted 2.6.32-504.8.1.el6_lustre.x86_64 #1
03:26:22:Call Trace:
03:26:22: [<ffffffff81529b76>] ? panic+0xa7/0x16f
03:26:22: [<ffffffffa0820eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
03:26:22: [<ffffffffa0c3f06d>] ? ptlrpc_invalidate_import+0x85d/0x930 [ptlrpc]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa0c440f6>] ? ptlrpc_set_import_discon+0xf6/0x5b0 [ptlrpc]
03:26:22: [<ffffffffa0c445e3>] ? ptlrpc_reconnect_import+0x33/0x1b0 [ptlrpc]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa12ea2ea>] ? mgc_set_info_async+0x5ea/0x1940 [mgc]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa0a006d1>] ? obd_set_info_async.clone.2+0xf1/0x360 [obdclass]
03:26:22: [<ffffffffa0a06c18>] ? lustre_start_mgc+0x14c8/0x1e00 [obdclass]
03:26:22: [<ffffffffa08311c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
03:26:22: [<ffffffffa0a356f2>] ? server_fill_super+0x5c2/0x1690 [obdclass]
03:26:22: [<ffffffffa082b818>] ? libcfs_log_return+0x28/0x40 [libcfs]
03:26:22: [<ffffffffa0a07ab0>] ? lustre_fill_super+0x560/0xa80 [obdclass]
03:26:22: [<ffffffffa0a07550>] ? lustre_fill_super+0x0/0xa80 [obdclass]
03:26:22: [<ffffffff811917af>] ? get_sb_nodev+0x5f/0xa0
03:26:22: [<ffffffffa09feb05>] ? lustre_get_sb+0x25/0x30 [obdclass]
03:26:22: [<ffffffff81190deb>] ? vfs_kern_mount+0x7b/0x1b0
03:26:22: [<ffffffff81190f92>] ? do_kern_mount+0x52/0x130
03:26:22: [<ffffffff811b2b9b>] ? do_mount+0x2fb/0x930
03:26:22: [<ffffffff811b3260>] ? sys_mount+0x90/0xe0
03:26:22: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
03:26:22:Initializing cgroup subsys cpuset
03:26:22:Initializing cgroup subsys cpu
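
For triage across several console captures, the assertion site in output like the above can be extracted mechanically. The following is a minimal sketch, not part of any Lustre tooling: the names `LINE_RE`, `find_assertions`, and `SAMPLE` are hypothetical, and the regular expression assumes only the `HH:MM:SS:Level: pid:0:(file:line:func()) message` layout visible in this ticket's log.

```python
import re

# Assumed line layout, taken from the console output in this ticket:
#   HH:MM:SS:LustreError: PID:0:(file.c:LINE:func()) message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<level>Lustre(?:Error)?): "
    r"(?P<pid>\d+):\d+:\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)$"
)

# Pulls the asserted expression out of "ASSERTION( expr ) failed".
EXPR_RE = re.compile(r"ASSERTION\(\s*(.*?)\s*\)\s*failed")


def find_assertions(console_text):
    """Return one dict per failed-assertion line found in console_text."""
    hits = []
    for raw in console_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # not a pid-tagged Lustre line (e.g. stack frames)
        expr = EXPR_RE.search(m.group("msg"))
        if not expr:
            continue  # a Lustre line, but not an assertion failure
        hits.append({
            "time": m.group("time"),
            "pid": int(m.group("pid")),
            "file": m.group("file"),
            "line": int(m.group("line")),
            "func": m.group("func"),
            "expr": expr.group(1),
        })
    return hits


# Three lines copied from the console output in this ticket.
SAMPLE = """\
03:26:22:LustreError: 4746:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 906 awake
03:26:22:LustreError: 4746:0:(import.c:293:ptlrpc_invalidate_import()) ASSERTION( imp->imp_invalid ) failed: 
03:26:22:LustreError: 4746:0:(import.c:293:ptlrpc_invalidate_import()) LBUG
"""

if __name__ == "__main__":
    for hit in find_assertions(SAMPLE):
        print(hit)
```

Grouping the results by `(file, line, expr)` would be one way to check whether the later test runs linked in the comments hit the same assertion site.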


 Comments   
Comment by Andreas Dilger [ 01/Jun/15 ]

Sarah, is this a repeatable failure or only intermittent?

Comment by Sarah Liu [ 08/Jul/15 ]

Hi Andreas,

This is a repeatable issue:
https://testing.hpdd.intel.com/test_sets/b07b41e2-1211-11e5-a1d3-5254006e85c2
https://testing.hpdd.intel.com/test_sets/e36e29c2-250b-11e5-8009-5254006e85c2

Comment by Patrick Farrell (Inactive) [ 10/Aug/15 ]

This sure looks like https://jira.hpdd.intel.com/browse/LU-4913. It's being reproduced by the test added for that issue.

It seems the race there is not completely closed. (And I suspect this isn't related to interop.) Cray has seen this in our testing of 2.5 with the patch from LU-4913.

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for interop tag 2.7.66 - 2.7.1 Server/EL7 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/7f66f230-ccde-11e5-8b0e-5254006e85c2

Another instance found for interop tag 2.7.66 - 2.7.1 Server/EL6.7 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/ddac30e0-ccdd-11e5-b80c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ]

Another instance found for interop - 2.7.1 Server/EL6.7 Client, tag 2.7.90.
https://testing.hpdd.intel.com/test_sessions/f371534e-d573-11e5-bc47-5254006e85c2

Comment by Mikhail Pershin [ 16/Jan/22 ]

outdated
