[LU-11363] sanity-sec test 31 fails with 'unable to remount client' Created: 11/Sep/18  Updated: 15/Dec/18  Resolved: 16/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: DNE, zfs
Environment:

DNE/ZFS


Issue Links:
Related
is related to LU-11057 Client mount option "-o network=net" ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-sec test_31 was added by the patch at https://review.whamcloud.com/#/c/32590/ and merged to master on September 10, 2018. So far, the test fails or crashes only in review-dne-zfs-part-2 test sessions.

Looking at the logs for the failure at https://testing.whamcloud.com/test_sets/c7881c1e-b5b7-11e8-8c12-52540065bddc, the test_log shows a problem for every target when tunefs.lustre is called:

CMD: trevis-5vm8 tunefs.lustre --quiet --writeconf lustre-mdt1/mdt1
trevis-5vm8: 
trevis-5vm8: tunefs.lustre FATAL: Device lustre-mdt1/mdt1 has not been formatted with mkfs.lustre
trevis-5vm8: tunefs.lustre: exiting with 19 (No such device)
checking for existing Lustre data: not found
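
For context, this is the per-target writeconf step issued by the test framework; a minimal sketch of the equivalent bash loop follows (the facet list and the facet_device helper are illustrative stand-ins, not the exact test-framework.sh internals):

# sketch: run a writeconf on each server target through the framework's do_facet
for facet in mds1 mds2 mds3 mds4 ost1 ost2; do
    dev=$(facet_device $facet)    # e.g. lustre-mdt1/mdt1 (a ZFS dataset); illustrative helper
    do_facet $facet "tunefs.lustre --quiet --writeconf $dev"
done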

From there, we see a variety of other errors:

Started lustre-MDT0003
CMD: trevis-5vm9 lctl get_param -n mdt.lustre-MDT0003.identity_upcall
/usr/lib64/lustre/tests/test-framework.sh: line 4452: mdt.lustre-MDT0000.identity_upcall: command not found
CMD: trevis-5vm9 lctl set_param -n mdt.lustre-MDT0003.identity_upcall "NONE"
CMD: trevis-5vm9 lctl set_param -n mdt/lustre-MDT0003/identity_flush=-1
…
CMD: trevis-5vm5.trevis.whamcloud.com lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
error: get_param: param_path 'mdc/*/connect_flags': No such file or directory
jobstats not supported by server
disable quota as required
CMD: trevis-5vm8 /usr/sbin/lctl list_nids | grep tcp999
Starting client: trevis-5vm5.trevis.whamcloud.com:  -o user_xattr,flock,network=tcp999 10.9.5.8@tcp999:/lustre /mnt/lustre
CMD: trevis-5vm5.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-5vm5.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,network=tcp999 10.9.5.8@tcp999:/lustre /mnt/lustre
mount.lustre: mount 10.9.5.8@tcp999:/lustre at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Is 'lustre' the correct filesystem name?
Are the mount options correct?
Check the syslog for more info.
unconfigure:
    - lnet:
          errno: -16
          descr: "LNet unconfigure error: Device or resource busy"
Starting client: trevis-5vm5.trevis.whamcloud.com:  -o user_xattr,flock,network=tcp999 10.9.5.8@tcp999:/lustre /mnt/lustre
CMD: trevis-5vm5.trevis.whamcloud.com mkdir -p /mnt/lustre
CMD: trevis-5vm5.trevis.whamcloud.com mount -t lustre -o user_xattr,flock,network=tcp999 10.9.5.8@tcp999:/lustre /mnt/lustre
mount.lustre: mount 10.9.5.8@tcp999:/lustre at /mnt/lustre failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
 sanity-sec test_31: @@@@@@ FAIL: unable to remount client 

The following are links to logs from other test sessions where this test failed:
https://testing.whamcloud.com/test_sets/6d51eee0-b54f-11e8-b86b-52540065bddc
https://testing.whamcloud.com/test_sets/a0a5d418-b555-11e8-a7de-52540065bddc
https://testing.whamcloud.com/test_sets/6070a87e-b59f-11e8-8c12-52540065bddc

When sanity-sec test_31 crashes, we see the following in the kernel-crash log:

[ 9311.019503] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,network=tcp999 10.2.8.122@tcp999:/lustre /mnt/lustre
[ 9311.029516] LustreError: 21790:0:(obd_mount.c:1422:lmd_parse()) LNet Dynamic Peer Discovery is enabled on this node. 'network' mount option cannot be taken into account.
[ 9311.031037] LustreError: 21790:0:(obd_mount.c:1520:lmd_parse()) Bad mount options user_xattr,flock,network=tcp999,device=10.2.8.122@tcp999:/lustre
[ 9311.032361] LustreError: 21790:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-22)
[ 9312.035556] LNet: Removed LNI 10.2.8.119@tcp999
[ 9312.170496] Key type lgssc unregistered
[ 9312.171026] Lustre: 21892:0:(gss_mech_switch.c:80:lgss_mech_unregister()) Unregister krb5 mechanism
[ 9314.495561] LNet: Removed LNI 10.2.8.119@tcp
[ 9314.657567] LNet: HW NUMA nodes: 1, HW CPU cores: 2, npartitions: 1
[ 9314.661048] alg: No test for adler32 (adler32-zlib)
[ 9315.459156] Lustre: Lustre: Build Version: 2.11.54_104_gd365ea2
[ 9315.529642] LNet: Added LNI 10.2.8.119@tcp [8/256/0/180]
[ 9315.530284] LNet: Accept all, port 7988
[ 9315.537592] LNet: Added LNI 10.2.8.119@tcp999 [8/256/0/180]
[ 9315.541706] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[ 9315.550513] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,network=tcp999 10.2.8.122@tcp999:/lustre /mnt/lustre
[ 9315.605193] LustreError: 22006:0:(ldlm_lib.c:492:client_obd_setup()) can't add initial connection
[ 9315.606173] LustreError: 22006:0:(obd_config.c:559:class_setup()) setup lustre-MDT0000-mdc-ffff8c373b3f5000 failed (-2)
[ 9315.607252] LustreError: 22006:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.2.8.122@tcp999: cfg command failed: rc = -2
[ 9315.608409] Lustre:    cmd=cf003 0:lustre-MDT0000-mdc  1:lustre-MDT0000_UUID  2:10.2.8.122@tcp  

[ 9315.609546] LustreError: 108:0:(connection.c:96:ptlrpc_connection_put()) ASSERTION( atomic_read(&conn->c_refcount) > 1 ) failed: 
[ 9315.609934] LustreError: 15c-8: MGC10.2.8.122@tcp999: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[ 9315.613151] LustreError: 108:0:(connection.c:96:ptlrpc_connection_put()) LBUG
[ 9315.613864] Pid: 108, comm: kworker/1:2 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
[ 9315.614783] Call Trace:
[ 9315.615088]  [<ffffffffc07847cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 9315.615779]  [<ffffffffc078487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 9315.616419]  [<ffffffffc0a7aac3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc]
[ 9315.617180]  [<ffffffffc08b4c15>] obd_zombie_imp_cull+0x65/0x3e0 [obdclass]
[ 9315.617705] LustreError: 21994:0:(obd_config.c:610:class_cleanup()) Device 3 not setup
[ 9315.617739] Lustre: Unmounted lustre-client
[ 9315.619443]  [<ffffffffbd8b35ef>] process_one_work+0x17f/0x440
[ 9315.620210]  [<ffffffffbd8b4686>] worker_thread+0x126/0x3c0
[ 9315.620798]  [<ffffffffbd8bb621>] kthread+0xd1/0xe0
[ 9315.621336]  [<ffffffffbdf205f7>] ret_from_fork_nospec_end+0x0/0x39
[ 9315.622164]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 9315.622720] Kernel panic - not syncing: LBUG
[ 9315.623235] CPU: 1 PID: 108 Comm: kworker/1:2 Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7.x86_64 #1
[ 9315.624371] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 9315.624956] Workqueue: obd_zombid obd_zombie_imp_cull [obdclass]
[ 9315.625577] Call Trace:
[ 9315.625859]  [<ffffffffbdf0e84e>] dump_stack+0x19/0x1b
[ 9315.626383]  [<ffffffffbdf08b50>] panic+0xe8/0x21f
[ 9315.626868]  [<ffffffffc07848cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 9315.627502]  [<ffffffffc0a7aac3>] ptlrpc_connection_put+0x213/0x220 [ptlrpc]
[ 9315.628222]  [<ffffffffc08b4c15>] obd_zombie_imp_cull+0x65/0x3e0 [obdclass]
[ 9315.628918]  [<ffffffffbd8b35ef>] process_one_work+0x17f/0x440
[ 9315.629498]  [<ffffffffbd8b4686>] worker_thread+0x126/0x3c0
[ 9315.630059]  [<ffffffffbd8b4560>] ? manage_workers.isra.24+0x2a0/0x2a0
[ 9315.630732]  [<ffffffffbd8bb621>] kthread+0xd1/0xe0
[ 9315.631234]  [<ffffffffbd8bb550>] ? insert_kthread_work+0x40/0x40
[ 9315.631839]  [<ffffffffbdf205f7>] ret_from_fork_nospec_begin+0x21/0x21
[ 9315.632490]  [<ffffffffbd8bb550>] ? insert_kthread_work+0x40/0x40
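
Note the first LustreError in the crash log above: with LNet Dynamic Peer Discovery enabled, the client rejects the 'network=' mount option outright, which is the interaction tracked in the related ticket LU-11057. A hedged workaround sketch for the client node, reusing the addresses from the log (this sidesteps the option rejection only; it is not a fix for the LBUG itself):

lnetctl set discovery 0    # disable dynamic peer discovery so 'network=' is accepted
mount -t lustre -o user_xattr,flock,network=tcp999 10.2.8.122@tcp999:/lustre /mnt/lustre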

Logs for sessions where sanity-sec test 31 crashed are at:
https://testing.whamcloud.com/test_sets/4ec4717a-b5b6-11e8-b86b-52540065bddc
https://testing.whamcloud.com/test_sets/fe8c7708-b569-11e8-a7de-52540065bddc



 Comments   
Comment by Mikhail Pershin [ 11/Sep/18 ]

+1 on master
https://testing.whamcloud.com/test_sets/4e6f80fa-b5c2-11e8-b86b-52540065bddc

Comment by Mikhail Pershin [ 13/Sep/18 ]

It has a failure rate of more than 60% now.

Comment by James Nunez (Inactive) [ 13/Sep/18 ]

Patch https://review.whamcloud.com/#/c/33139/ reverted the patch that added sanity-sec test 31. Thus, all patches should be rebased to get this update.

Comment by Sebastien Buisson [ 26/Sep/18 ]

Hi,

In patch https://review.whamcloud.com/33189, I modified writeconf_all() so that it uses the vdev instead of the device name.
Unfortunately, test_31 still fails, with the following messages:

CMD: trevis-33vm7 tunefs.lustre --quiet --writeconf /dev/lvm-Role_OSS/P1
trevis-33vm7: 
trevis-33vm7: tunefs.lustre FATAL: Device /dev/lvm-Role_OSS/P1 has not been formatted with mkfs.lustre
trevis-33vm7: tunefs.lustre: exiting with 19 (No such device)

However, the target was formatted with:

CMD: trevis-33vm7 mkfs.lustre --mgsnode=trevis-33vm8@tcp --fsname=lustre --ost --index=0 --param=sys.timeout=20 --backfstype=zfs --device-size=9950986 --reformat lustre-ost1/ost1 /dev/lvm-Role_OSS/P1

(as seen in the lustre-initialization-1 logs).

So /dev/lvm-Role_OSS/P1 should be a valid device.

Or does the problem stem from the fact that tunefs.lustre cannot be used on targets with a ZFS backend?

Comment by James Nunez (Inactive) [ 27/Sep/18 ]

I think I understand the issue with the original patch 32590. In sanity-sec test 31, we call stopall() to stop all servers, which in turn calls stop() for each target. For ZFS, stop() exports the zpool, and tunefs.lustre --writeconf fails on an exported zpool. We need to either set KEEP_ZPOOL to true or import the zpools after calling stopall().
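
A minimal sketch of the second option, assuming the framework's do_facet helper and the MDT pool/dataset names from the logs above:

stopall                                      # on a ZFS backend this exports each target's zpool
do_facet mds1 "zpool import lustre-mdt1"     # re-import the pool so the dataset is visible again
do_facet mds1 "tunefs.lustre --quiet --writeconf lustre-mdt1/mdt1"

Alternatively, setting KEEP_ZPOOL=true in the test environment should keep the pools imported across stop().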

Comment by Sebastien Buisson [ 16/Oct/18 ]

I think this ticket can be closed now that the patch at https://review.whamcloud.com/33189 has landed.

Comment by Peter Jones [ 16/Oct/18 ]

ok sure.
