[LU-10893] all conf-sanity tests failed: format mgs: mkfs.lustre FATAL: Unable to build fs Created: 10/Apr/18 Updated: 18/Jul/18 Resolved: 18/Jul/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Boyko | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Epic/Theme: | test |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After running with the configuration:

MDSCOUNT=1 OSTCOUNT=2 mds1_HOST=fre0101 MDSDEV1=/dev/vdb mds_HOST=fre0101 MDSDEV=/dev/vdb ost1_HOST=fre0102 OSTDEV1=/dev/vdb ost2_HOST=fre0102 OSTDEV2=/dev/vdc
.....

Errors:

CMD: fre0205,fre0206,fre0208 PATH=/usr/lib64/lustre/tests/../tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/../tests/mpi:/usr/lib64/lustre/tests/../tests/racer:/usr/lib64/lustre/tests/../../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests/../tests:/usr/lib64/lustre/tests/../utils/gss:/root//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/usr/lib64/mpich/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin::/sbin:/bin:/usr/sbin: NAME=ncli sh rpc.sh set_hostid
fre0208: fre0208: executing set_hostid
fre0205: fre0205: executing set_hostid
fre0206: fre0206: executing set_hostid
CMD: fre0205 [ -e "/dev/vdb" ]
CMD: fre0205 grep -c /mnt/lustre-mgs' ' /proc/mounts || true
CMD: fre0205 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' || true
CMD: fre0205 e2label /dev/vdb
CMD: fre0205 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=ldiskfs --device-size=0 --mkfsoptions=\"-E lazy_itable_init\" --reformat /dev/vdb
fre0205: fre0205: mkfs.lustre FATAL: Unable to build fs /dev/vdb (256)
fre0205: fre0205: mkfs.lustre FATAL: mkfs failed 256

A quick look shows that reformat is fine in conf-sanity with the next change to t-f. Since there are a lot of stopall calls in conf-sanity, they probably require a fix as well.

== conf-sanity test 17: Verify failed mds_postsetup won't fail assertion (2936) (should return errs) ================================================== 15:36:46 (1522942606)
start mds service on fre0113
Starting mds1: -o rw,user_xattr /dev/mapper/mds1_flakey /mnt/lustre-mds1
fre0113: fre0113: executing set_default_debug -1 all 4
pdsh@fre0115: fre0113: ssh exited with exit code 1
pdsh@fre0115: fre0113: ssh exited with exit code 1
Started lustre-MDT0000
start mds service on fre0113
Starting mds2: -o rw,user_xattr /dev/mapper/mds2_flakey /mnt/lustre-mds2
fre0113: fre0113: executing set_default_debug -1 all 4
pdsh@fre0115: fre0113: ssh exited with exit code 1
pdsh@fre0115: fre0113: ssh exited with exit code 1
Started lustre-MDT0001
start ost1 service on fre0114
Starting ost1: -o user_xattr /dev/mapper/ost1_flakey /mnt/lustre-ost1
fre0114: fre0114: executing set_default_debug -1 all 4
pdsh@fre0115: fre0114: ssh exited with exit code 1
pdsh@fre0115: fre0114: ssh exited with exit code 1
Started lustre-OST0000
mount lustre on /mnt/lustre.....
Starting client: fre0115: -o user_xattr,flock fre0113@tcp:/lustre /mnt/lustre
setup single mount lustre success
umount lustre on /mnt/lustre.....
Stopping client fre0115 /mnt/lustre (opts:)
stop ost1 service on fre0114
Stopping /mnt/lustre-ost1 (opts:-f) on fre0114
stop mds service on fre0113
Stopping /mnt/lustre-mds1 (opts:-f) on fre0113
stop mds service on fre0113
Stopping /mnt/lustre-mds2 (opts:-f) on fre0113
modules unloaded.
Remove mds config log
Stopping /mnt/lustre-mgs (opts:) on fre0113
fre0113: debugfs 1.42.13.x6 (01-Mar-2018)
start mgs service on fre0113
Loading modules from /usr/lib64/lustre/tests/..
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'accept=all'
../lnet/klnds/socklnd/ksocklnd options: 'sock_timeout=10'
gss/krb5 is not supported
Starting mgs: /dev/mapper/mgs_flakey /mnt/lustre-mgs
fre0113: fre0113: executing set_default_debug -1 all 4
pdsh@fre0115: fre0113: ssh exited with exit code 1
pdsh@fre0115: fre0113: ssh exited with exit code 1
Started MGS
start ost1 service on fre0114
Starting ost1: -o user_xattr /dev/mapper/ost1_flakey /mnt/lustre-ost1
fre0114: fre0114: executing set_default_debug -1 all 4
pdsh@fre0115: fre0114: ssh exited with exit code 1
pdsh@fre0115: fre0114: ssh exited with exit code 1
Started lustre-OST0000
start mds service on fre0113
Starting mds1: -o rw,user_xattr /dev/mapper/mds1_flakey /mnt/lustre-mds1
fre0113: mount.lustre: mount /dev/mapper/mds1_flakey at /mnt/lustre-mds1 failed: No such file or directory
fre0113: Is the MGS specification correct?
fre0113: Is the filesystem name correct?
fre0113: If upgrading, is the copied client log valid? (see upgrade docs)
pdsh@fre0115: fre0113: ssh exited with exit code 2
Start of /dev/mapper/mds1_flakey on mds1 failed 2
Stopping clients: fre0115,fre0116 /mnt/lustre (opts:-f)
Stopping clients: fre0115,fre0116 /mnt/lustre2 (opts:-f)
Stopping /mnt/lustre-ost1 (opts:-f) on fre0114
pdsh@fre0115: fre0114: ssh exited with exit code 1
Stopping /mnt/lustre-mgs (opts:) on fre0113
fre0114: fre0114: executing set_hostid
fre0116: fre0116: executing set_hostid
fre0113: fre0113: executing set_hostid
Loading modules from /usr/lib64/lustre/tests/..
detected 2 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Formatting mgs, mds, osts
Format mgs: /dev/mapper/mgs_flakey
pdsh@fre0115: fre0113: ssh exited with exit code 1
conf-sanity test_17: @@@@@@ FAIL: mgs: device '/dev/mapper/mgs_flakey' does not exist
Trace dump:
= /usr/lib64/lustre/tests/../tests/test-framework.sh:5734:error()
= /usr/lib64/lustre/tests/../tests/test-framework.sh:4314:__touch_device()
= /usr/lib64/lustre/tests/../tests/test-framework.sh:4331:format_mgs()
= /usr/lib64/lustre/tests/../tests/test-framework.sh:4384:formatall()
= /usr/lib64/lustre/tests/conf-sanity.sh:109:reformat()
= /usr/lib64/lustre/tests/conf-sanity.sh:91:reformat_and_config()
= /usr/lib64/lustre/tests/conf-sanity.sh:605:test_17()
= /usr/lib64/lustre/tests/../tests/test-framework.sh:6010:run_one()
= /usr/lib64/lustre/tests/../tests/test-framework.sh:6049:run_one_logged()
= /usr/lib64/lustre/tests/../tests/test-framework.sh:5848:run_test()
= /usr/lib64/lustre/tests/conf-sanity.sh:607:main()
Dumping lctl log to /tmp/test_logs/1522942566/conf-sanity.test_17.*.1522942656.log
fre0114: Warning: Permanently added 'fre0115,192.168.101.15' (ECDSA) to the list of known hosts.
fre0116: Warning: Permanently added 'fre0115,192.168.101.15' (ECDSA) to the list of known hosts.
fre0113: Warning: Permanently added 'fre0115,192.168.101.15' (ECDSA) to the list of known hosts.
Resetting fail_loc on all nodes...done.
FAIL 17 (51s)
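For reference, a minimal sketch of the suspected failure mechanism, assuming a stale dm-flakey mapping from an earlier run still holds the backing device (the mgs_flakey name matches the logs above; the table values are illustrative). Device-mapper opens the backing device exclusively, so mke2fs cannot open /dev/vdb and mkfs.lustre reports "Unable to build fs":

    # illustrative reproduction, not taken from the original run
    SIZE=$(blockdev --getsz /dev/vdb)    # device size in 512-byte sectors
    dmsetup create mgs_flakey --table "0 $SIZE flakey /dev/vdb 0 1800 0"

    # /dev/vdb is now claimed by device-mapper; a direct format fails
    mkfs.lustre --mgs --backfstype=ldiskfs --reformat /dev/vdb
    # ...while formatting through the mapping succeeds
    mkfs.lustre --mgs --backfstype=ldiskfs --reformat /dev/mapper/mgs_flakey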
| Comments |
| Comment by Jian Yu [ 12/Apr/18 ] |
|
Hi Alexander,

The following error is not related to the dm-flakey device:

CMD: fre0205 mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=ldiskfs --device-size=0 --mkfsoptions=\"-E lazy_itable_init\" --reformat /dev/vdb
fre0205: fre0205: mkfs.lustre FATAL: Unable to build fs /dev/vdb (256)
fre0205: fre0205: mkfs.lustre FATAL: mkfs failed 256

The format command was run on /dev/vdb and failed. What error messages are in dmesg or syslog? Could you please manually run the mkfs.lustre command on /dev/vdb to see if it passes?
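Such a manual run could look like the following (the mkfs.lustre command is copied verbatim from the log above; checking dmesg afterwards is just one illustrative way to find the kernel-side error):

    # run the failing format by hand and capture the exit status
    mkfs.lustre --mgs --param=sys.timeout=20 --backfstype=ldiskfs \
        --device-size=0 --mkfsoptions="-E lazy_itable_init" --reformat /dev/vdb
    echo "exit status: $?"
    # then look for kernel messages such as "device or resource busy"
    dmesg | tail -n 50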
| Comment by Alexander Boyko [ 03/May/18 ] |
|
I've played a bit with t-f and found that the dm-flakey patch broke the typical usage of test-framework. A simple testing workflow is:

1) llmount.sh

[test@devvm-centos-1 lustre-release]$ sudo MDSDEV=/dev/sdb MDSDEV1=/dev/sdb sh lustre/tests/llmount.sh
Stopping clients: devvm-centos-1 /mnt/lustre (opts:-f)
Stopping clients: devvm-centos-1 /mnt/lustre2 (opts:-f)
Loading modules from /home/test/lustre-release/lustre/tests/..
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
../libcfs/libcfs/libcfs options: 'cpu_npartitions=2'
../lnet/lnet/lnet options: 'networks=tcp0(eth0) accept=all'
gss/krb5 is not supported
quota/lquota options: 'hash_lqs_cur_bits=3'
Formatting mgs, mds, osts
Format mds1: /dev/sdb
Format ost1: /tmp/lustre-ost1
Format ost2: /tmp/lustre-ost2
Checking servers environments
Checking clients devvm-centos-1 environments
Loading modules from /home/test/lustre-release/lustre/tests/..
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Setup mgs, mdt, osts
Starting mds1: /dev/mapper/mds1_flakey /mnt/lustre-mds1
Commit the device label on /dev/sdb
Started lustre-MDT0000
Starting ost1: /dev/mapper/ost1_flakey /mnt/lustre-ost1
Commit the device label on /tmp/lustre-ost1
Started lustre-OST0000
Starting ost2: /dev/mapper/ost2_flakey /mnt/lustre-ost2
Commit the device label on /tmp/lustre-ost2
Started lustre-OST0001
Starting client: devvm-centos-1: -o user_xattr,flock devvm-centos-1@tcp:/lustre /mnt/lustre
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 125368 1904 112228 2% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 325368 13508 284700 5% /mnt/lustre[OST:0]
lustre-OST0001_UUID 325368 13508 284700 5% /mnt/lustre[OST:1]
filesystem_summary: 650736 27016 569400 5% /mnt/lustre
Using TIMEOUT=20
seting jobstats to procname_uid
Setting lustre.sys.jobid_var from disable to procname_uid
Waiting 90 secs for update
Updated after 6s: wanted 'procname_uid' got 'procname_uid'
disable quota as required

2) then an individual test script in a new shell:

[test@devvm-centos-1 lustre-release]$ sudo MDSDEV=/dev/sdb MDSDEV1=/dev/sdb ONLY=0 sh lustre/tests/conf-sanity.sh
devvm-centos-1: executing check_logdir /tmp/test_logs/1525349199
Logging to shared log directory: /tmp/test_logs/1525349199
devvm-centos-1: executing yml_node
Client: Lustre version: 2.11.51_20_g9ac477c
MDS: Lustre version: 2.11.51_20_g9ac477c
OSS: Lustre version: 2.11.51_20_g9ac477c
excepting tests: 32newtarball 101
skipping tests SLOW=no: 45 69
Stopping clients: devvm-centos-1 /mnt/lustre (opts:-f)
Stopping client devvm-centos-1 /mnt/lustre opts:-f
Stopping clients: devvm-centos-1 /mnt/lustre2 (opts:-f)
Stopping /mnt/lustre-mds1 (opts:-f) on devvm-centos-1
Stopping /mnt/lustre-ost1 (opts:-f) on devvm-centos-1
Stopping /mnt/lustre-ost2 (opts:-f) on devvm-centos-1
Loading modules from /home/test/lustre-release/lustre/tests/..
detected 4 online CPUs by sysfs
Force libcfs to create 2 CPU partitions
gss/krb5 is not supported
Formatting mgs, mds, osts
Format mds1: /dev/sdb
mkfs.lustre FATAL: Unable to build fs /dev/sdb (256)
mkfs.lustre FATAL: mkfs failed 256

The real problem is that setup/mount exports the dm-flakey device variables, but the next shell invocation doesn't know anything about them. Before this patch, all configuration was located in a separate file and worked fine.
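A minimal sketch of the pre-patch idea described here (purely illustrative, not the actual fix; the helper names and the /tmp/lustre-dm-devs.conf path are hypothetical): persist the dm-flakey device names so a later, independent shell invocation can pick them up instead of falling back to the raw device:

    DM_CONF=/tmp/lustre-dm-devs.conf    # hypothetical shared config file

    save_dm_devices() {
        # record each facet's dm-flakey device for later shells
        : > "$DM_CONF"
        local facet
        for facet in mds1 ost1 ost2; do
            [ -b "/dev/mapper/${facet}_flakey" ] &&
                echo "${facet}_dm_dev=/dev/mapper/${facet}_flakey" >> "$DM_CONF"
        done
    }

    load_dm_devices() {
        # a fresh shell (e.g. conf-sanity.sh after llmount.sh) sources the
        # saved mapping instead of formatting /dev/sdb directly
        [ -r "$DM_CONF" ] && . "$DM_CONF"
    }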
| Comment by Alexander Boyko [ 03/May/18 ] |
It is directly related, because /dev/sdb is still in use by dm-flakey, and the MDS reconfiguration uses /dev/sdb instead of /dev/mapper/mds1_flakey.
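This is easy to confirm with standard device-mapper tooling (the mds1_flakey name matches the logs above; output shapes are abbreviated):

    dmsetup table               # lists active targets, e.g. "mds1_flakey: 0 ... flakey 8:16 0 ..."
    dmsetup deps mds1_flakey    # shows the backing device by major:minor, e.g. (8, 16)
    lsblk /dev/sdb              # shows the mds1_flakey mapping stacked on top of sdb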
| Comment by Jian Yu [ 03/May/18 ] |
What about specifying the dm-flakey devices as MDSDEV{n} and OSTDEV{n} here?
| Comment by Alexander Boyko [ 03/May/18 ] |
|
The test works fine when the flakey devices are specified. |
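For anyone hitting the same problem before a fix lands, the verified workaround looks roughly like this (device names depend on the facets in use; the variables mirror the ones from the reproduction above):

    sudo MDSDEV=/dev/mapper/mds1_flakey MDSDEV1=/dev/mapper/mds1_flakey \
         OSTDEV1=/dev/mapper/ost1_flakey OSTDEV2=/dev/mapper/ost2_flakey \
         ONLY=0 sh lustre/tests/conf-sanity.sh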
| Comment by Jian Yu [ 03/May/18 ] |
|
Thank you Alexander for verifying this. |
| Comment by Alexander Boyko [ 04/May/18 ] |
|
@Jian Yu, will you fix the t-f issue? |
| Comment by Alexander Zarochentsev [ 04/May/18 ] |
Should a user assume that the same OSTDEV / MDSDEV parameters work with both llmount.sh and individual test scripts (e.g. conf-sanity.sh)? I think it is expected.
| Comment by Gerrit Updater [ 07/Jun/18 ] |
|
Alexandr Boyko (c17825@cray.com) uploaded a new patch: https://review.whamcloud.com/32658 |
| Comment by Gerrit Updater [ 18/Jul/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32658/ |
| Comment by Peter Jones [ 18/Jul/18 ] |
|
Landed for 2.12 |