[LU-10893] all conf-sanity tests failed: format mgs: mkfs.lustre FATAL: Unable to build fs Created: 10/Apr/18  Updated: 18/Jul/18  Resolved: 18/Jul/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Major
Reporter: Alexander Boyko Assignee: Jian Yu
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-684 replace dev_rdonly kernel patch with ... Resolved
Epic/Theme: test
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After LU-684 (https://review.whamcloud.com/#/c/7200/), which added the dm-flakey layer to test-framework, conf-sanity does not pass with real devices.
Example configuration in local.sh:

MDSCOUNT=1
OSTCOUNT=2
mds1_HOST=fre0101
MDSDEV1=/dev/vdb
mds_HOST=fre0101
MDSDEV=/dev/vdb
ost1_HOST=fre0102
OSTDEV1=/dev/vdb
ost2_HOST=fre0102
OSTDEV2=/dev/vdc
.....

Errors:

CMD: fre0205,fre0206,fre0208 PATH=/usr/lib64/lustre/tests/../tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/../tests/mpi:/usr/lib64/lustre/tests/../tests/racer:/usr/lib64/lustre/tests/../../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests/../tests:/usr/lib64/lustre/tests/../utils/gss:/root//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/usr/lib64/mpich/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin::/sbin:/bin:/usr/sbin: NAME=ncli sh rpc.sh set_hostid 
fre0208: fre0208: executing set_hostid
fre0205: fre0205: executing set_hostid
fre0206: fre0206: executing set_hostid
CMD: fre0205 [ -e "/dev/vdb" ]
CMD: fre0205 grep -c /mnt/lustre-mgs' ' /proc/mounts || true
CMD: fre0205 lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
CMD: fre0205 e2label /dev/vdb
CMD: fre0205 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=ldiskfs --device-size=0 --mkfsoptions=\"-E lazy_itable_init\" --reformat /dev/vdb
fre0205: 
fre0205: mkfs.lustre FATAL: Unable to build fs /dev/vdb (256)
fre0205: 
fre0205: mkfs.lustre FATAL: mkfs failed 256

A quick look shows that the reformat in conf-sanity works with the following change to test-framework:

formatall() {
	CLEANUP_DM_DEV=true stopall -f

Since conf-sanity calls stopall in many places, those call sites probably need the same fix as well (a sketch of such a call site follows).
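For illustration only, a hypothetical conf-sanity call site with the same environment prefix applied (not a quote from conf-sanity.sh; the exact cleanup path is an assumption):

# hypothetical cleanup before a reformat: pass the same switch so the
# dm-flakey targets are removed as well, otherwise the raw device stays
# claimed by device-mapper and the next mkfs.lustre on it fails
CLEANUP_DM_DEV=true stopall -f
reformat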

== conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) ====================================================================================================== 15:36:46 (1522942606)
start mds service on fre0113
Starting mds1: -o rw,user_xattr  /dev/mapper/mds1_flakey /mnt/lustre-mds1
fre0113: fre0113: executing set_default_debug -1 all 4
pdsh@fre0115: fre0113: ssh exited with exit code 1
pdsh@fre0115: fre0113: ssh exited with exit code 1
Started lustre-MDT0000
start mds service on fre0113
Starting mds2: -o rw,user_xattr  /dev/mapper/mds2_flakey /mnt/lustre-mds2
fre0113: fre0113: executing set_default_debug -1 all 4
pdsh@fre0115: fre0113: ssh exited with exit code 1
pdsh@fre0115: fre0113: ssh exited with exit code 1
Started lustre-MDT0001
start ost1 service on fre0114
Starting ost1: -o user_xattr  /dev/mapper/ost1_flakey /mnt/lustre-ost1
fre0114: fre0114: executing set_default_debug -1 all 4
pdsh@fre0115: fre0114: ssh exited with exit code 1
pdsh@fre0115: fre0114: ssh exited with exit code 1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: fre0115:  -o user_xattr,flock fre0113@tcp:/lustre /mnt/lustre
setup single mount lustre success
umount lustre on /mnt/lustre.....
Stopping client fre0115 /mnt/lustre (opts:)
stop ost1 service on fre0114
Stopping /mnt/lustre-ost1 (opts:-f) on fre0114
stop mds service on fre0113
Stopping /mnt/lustre-mds1 (opts:-f) on fre0113
stop mds service on fre0113
Stopping /mnt/lustre-mds2 (opts:-f) on fre0113
modules unloaded.
Remove mds config log
Stopping /mnt/lustre-mgs (opts:) on fre0113
fre0113: debugfs 1.42.13.x6 (01-Mar-2018)
start mgs service on fre0113
Loading modules from /usr/lib64/lustre/tests/..
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'accept=all'
../lnet/klnds/socklnd/ksocklnd options: 'sock_timeout=10'
gss/krb5 is not supported
Starting mgs:   /dev/mapper/mgs_flakey /mnt/lustre-mgs
fre0113: fre0113: executing set_default_debug -1 all 4
pdsh@fre0115: fre0113: ssh exited with exit code 1
pdsh@fre0115: fre0113: ssh exited with exit code 1
Started MGS
start ost1 service on fre0114
Starting ost1: -o user_xattr  /dev/mapper/ost1_flakey /mnt/lustre-ost1
fre0114: fre0114: executing set_default_debug -1 all 4
pdsh@fre0115: fre0114: ssh exited with exit code 1
pdsh@fre0115: fre0114: ssh exited with exit code 1
Started lustre-OST0000
start mds service on fre0113
Starting mds1: -o rw,user_xattr  /dev/mapper/mds1_flakey /mnt/lustre-mds1
fre0113: mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: No such file or directory
fre0113: Is the MGS specification correct?
fre0113: Is the filesystem name correct?
fre0113: If upgrading, is the copied client log valid? (see upgrade docs)
pdsh@fre0115: fre0113: ssh exited with exit code 2
Start of /dev/mapper/mds1_flakey on mds1 failed 2
Stopping clients: fre0115,fre0116 /mnt/lustre (opts:-f)
Stopping clients: fre0115,fre0116 /mnt/lustre2 (opts:-f)
Stopping /mnt/lustre-ost1 (opts:-f) on fre0114
pdsh@fre0115: fre0114: ssh exited with exit code 1
Stopping /mnt/lustre-mgs (opts:) on fre0113
fre0114: fre0114: executing set_hostid
fre0116: fre0116: executing set_hostid
fre0113: fre0113: executing set_hostid
Loading modules from /usr/lib64/lustre/tests/..
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Formatting mgs, mds, osts
Format mgs: /dev/mapper/mgs_flakey
pdsh@fre0115: fre0113: ssh exited with exit code 1
 conf-sanity test_17: @@@@@@ FAIL: mgs: device '/dev/mapper/mgs_flakey' does not exist 
  Trace dump:
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:5734:error()
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:4314:__touch_device()
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:4331:format_mgs()
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:4384:formatall()
  = /usr/lib64/lustre/tests/conf-sanity.sh:109:reformat()
  = /usr/lib64/lustre/tests/conf-sanity.sh:91:reformat_and_config()
  = /usr/lib64/lustre/tests/conf-sanity.sh:605:test_17()
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:6010:run_one()
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:6049:run_one_logged()
  = /usr/lib64/lustre/tests/../tests/test-framework.sh:5848:run_test()
  = /usr/lib64/lustre/tests/conf-sanity.sh:607:main()
Dumping lctl log to /tmp/test_logs/1522942566/conf-sanity.test_17.*.1522942656.log
fre0114: Warning: Permanently added 'fre0115,192.168.101.15' (ECDSA) to the list of known hosts.

fre0116: Warning: Permanently added 'fre0115,192.168.101.15' (ECDSA) to the list of known hosts.

fre0113: Warning: Permanently added 'fre0115,192.168.101.15' (ECDSA) to the list of known hosts.

Resetting fail_loc on all nodes...done.
FAIL 17 (51s)


 Comments   
Comment by Jian Yu [ 12/Apr/18 ]

Hi Alexander,

The following error is not related to dm-flakey device:

CMD: fre0205 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=ldiskfs --device-size=0 --mkfsoptions=\"-E lazy_itable_init\" --reformat /dev/vdb
fre0205: 
fre0205: mkfs.lustre FATAL: Unable to build fs /dev/vdb (256)
fre0205: 
fre0205: mkfs.lustre FATAL: mkfs failed 256

The format command was run on /dev/vdb and failed. What error messages are in dmesg or syslog? Could you please manually run the mkfs.lustre command on /dev/vdb to see if it passes?

Comment by Alexander Boyko [ 03/May/18 ]

I've played a bit with test-framework and found that the dm-flakey patch broke its typical usage. A simple workflow for ad-hoc testing is:

1) llmount.sh
2) ONLY=xxx sanity.sh
3) ONLY=xxx conf-sanity.sh
4) etc.
5) llmountcleanup.sh

[test@devvm-centos-1 lustre-release]$ sudo MDSDEV=/dev/sdb MDSDEV1=/dev/sdb sh lustre/tests/llmount.sh
Stopping clients: devvm-centos-1 /mnt/lustre (opts:-f)
Stopping clients: devvm-centos-1 /mnt/lustre2 (opts:-f)
Loading modules from /home/test/lustre-release/lustre/tests/..
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'networks=tcp0(eth0) accept=all'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Formatting mgs, mds, osts
Format mds1: /dev/sdb
Format ost1: /tmp/lustre-ost1
Format ost2: /tmp/lustre-ost2
Checking servers environments
Checking clients devvm-centos-1 environments
Loading modules from /home/test/lustre-release/lustre/tests/..
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Setup mgs, mdt, osts
Starting mds1:   /dev/mapper/mds1_flakey /mnt/lustre-mds1
Commit the device label on /dev/sdb
Started lustre-MDT0000
Starting ost1:   /dev/mapper/ost1_flakey /mnt/lustre-ost1
Commit the device label on /tmp/lustre-ost1
Started lustre-OST0000
Starting ost2:   /dev/mapper/ost2_flakey /mnt/lustre-ost2
Commit the device label on /tmp/lustre-ost2
Started lustre-OST0001
Starting client: devvm-centos-1:  -o user_xattr,flock devvm-centos-1@tcp:/lustre /mnt/lustre
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       125368        1904      112228   2% /mnt/lustre[MDT:0]
lustre-OST0000_UUID       325368       13508      284700   5% /mnt/lustre[OST:0]
lustre-OST0001_UUID       325368       13508      284700   5% /mnt/lustre[OST:1]

filesystem_summary:       650736       27016      569400   5% /mnt/lustre

Using TIMEOUT=20
seting jobstats to procname_uid
Setting lustre.sys.jobid_var from disable to procname_uid
Waiting 90 secs for update
Updated after 6s: wanted 'procname_uid' got 'procname_uid'
disable quota as required
[test@devvm-centos-1 lustre-release]$ sudo MDSDEV=/dev/sdb MDSDEV1=/dev/sdb ONLY=0 sh lustre/tests/conf-sanity.sh
devvm-centos-1: executing check_logdir /tmp/test_logs/1525349199
Logging to shared log directory: /tmp/test_logs/1525349199
devvm-centos-1: executing yml_node
Client: Lustre version: 2.11.51_20_g9ac477c
MDS: Lustre version: 2.11.51_20_g9ac477c
OSS: Lustre version: 2.11.51_20_g9ac477c
excepting tests: 32newtarball 101
skipping tests SLOW=no: 45 69
Stopping clients: devvm-centos-1 /mnt/lustre (opts:-f)
Stopping client devvm-centos-1 /mnt/lustre opts:-f
Stopping clients: devvm-centos-1 /mnt/lustre2 (opts:-f)
Stopping /mnt/lustre-mds1 (opts:-f) on devvm-centos-1
Stopping /mnt/lustre-ost1 (opts:-f) on devvm-centos-1
Stopping /mnt/lustre-ost2 (opts:-f) on devvm-centos-1
Loading modules from /home/test/lustre-release/lustre/tests/..
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Formatting mgs, mds, osts
Format mds1: /dev/sdb

mkfs.lustre FATAL: Unable to build fs /dev/sdb (256)

mkfs.lustre FATAL: mkfs failed 256

The real problem is that the setup/mount steps export the dm-flakey device variables, but the next shell invocation knows nothing about them. Before this patch, all of the configuration lived in a separate file and worked fine.
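A rough way to see this on the server node (a diagnostic sketch, not part of the test suite; the target names follow the /dev/mapper/*_flakey names from the logs above):

sudo dmsetup ls                   # lists mds1_flakey, ost1_flakey, ... if the targets are still present
sudo dmsetup table mds1_flakey    # shows the target still sits on top of /dev/sdb
sudo dmsetup remove mds1_flakey   # frees /dev/sdb so mkfs.lustre can reformat it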

Comment by Alexander Boyko [ 03/May/18 ]

The following error is not related to dm-flakey device:

It is directly related, because /dev/sdb is still held by dm-flakey, and the MDS reformat uses /dev/sdb instead of /dev/mapper/mds1_flakey.

Comment by Jian Yu [ 03/May/18 ]

[test@devvm-centos-1 lustre-release]$ sudo MDSDEV=/dev/sdb MDSDEV1=/dev/sdb ONLY=0 sh lustre/tests/conf-sanity.sh

What about specifying the dm-flakey devices in MDSDEV{n} and OSTDEV{n} here?
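For example (a sketch based on the command above, with the /dev/mapper name from the llmount.sh output substituted for the raw device; OSTDEV{n} can be pointed at the ost*_flakey devices the same way):

sudo MDSDEV=/dev/mapper/mds1_flakey MDSDEV1=/dev/mapper/mds1_flakey ONLY=0 sh lustre/tests/conf-sanity.sh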

Comment by Alexander Boyko [ 03/May/18 ]

The test works fine when the flakey devices are specified.

Comment by Jian Yu [ 03/May/18 ]

Thank you Alexander for verifying this.

Comment by Alexander Boyko [ 04/May/18 ]

@Jian Yu, will you fix the t-f issue?

Comment by Alexander Zarochentsev [ 04/May/18 ]

What about specifying the dm-flakey devices to MDSDEV{n} and OSTDEV{n} here?

Should a user assume that the same OSTDEV / MDSDEV parameters work with both llmount.sh and individual test scripts (e.g. conf-sanity.sh)? I think that is expected.

Comment by Gerrit Updater [ 07/Jun/18 ]

Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/32658
Subject: LU-10893 tests: allow to disable dm-flakey layer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5b521f7a46a52cd19d0286ca33e50b63c4f435e6
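Based on the patch subject, the intended usage is presumably along these lines (the toggle name FLAKEY is an assumption about this patch; check the review for the exact variable):

# assumption: the patch adds a switch (called FLAKEY here) that bypasses the
# dm-flakey layer, so tests format and mount the real devices directly
sudo FLAKEY=false MDSDEV1=/dev/vdb ONLY=0 sh lustre/tests/conf-sanity.sh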

Comment by Gerrit Updater [ 18/Jul/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32658/
Subject: LU-10893 tests: allow to disable dm-flakey layer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f4618338643441970131f957f2a346ae3a455197

Comment by Peter Jones [ 18/Jul/18 ]

Landed for 2.12
