[LU-10753] sanity test 300c fails with 'create 5k files failed' Created: 01/Mar/18 Updated: 04/Mar/21 Resolved: 19/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.1, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.14.0 |
| Fix Version/s: | Lustre 2.14.0, Lustre 2.12.7 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test_300c fails to create 5,000 files for Lustre file systems with DNE configured and ZFS server targets. Actually, it can’t even create one file. The client test log has the following output:

== sanity test 300c: chown && check ls under striped directory ======================================= 07:16:38 (1508915798)
CMD: trevis-49vm7 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
running as uid/gid/euid/egid 500/500/500/500, groups:
[createmany] [-o] [/mnt/lustre/d300c.sanity/striped_dir/f] [5000]
open(/mnt/lustre/d300c.sanity/striped_dir/f0) error: Permission denied
total: 0 open/close in 0.01 seconds: 0.00 ops/second
sanity test_300c: @@@@@@ FAIL: create 5k files failed

There’s nothing interesting in the console and dmesg logs on any of the nodes.

This test started failing with the ‘create 5k files failed’ error on 2017-09-29 for master tag 2.10.53. We’ve also seen this fail once for full test session DNE/ZFS testing on 2.10.1.

Logs for test sessions with this failure are at:

We see the other sanity 300* tests fail in similar ways and can open separate tickets if we think the following failures have distinct root causes.
sanity test 300d fails with

== sanity test 300d: check default stripe under striped directory ==================================== 06:31:06 (1516775466)
CMD: trevis-13vm4 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
open(/mnt/lustre/d300d.sanity/striped_dir/f1) error: Permission denied
total: 1 open/close in 0.01 seconds: 189.21 ops/second
sanity test_300d: @@@@@@ FAIL: create 10 files failed

with logs at

sanity test 300e fails with

== sanity test 300e: check rename under striped directory ============================================ 01:22:13 (1519521733)
CMD: trevis-66vm8 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
touch: cannot touch '/mnt/lustre/d300e.sanity/striped_dir/a': Permission denied
touch: cannot touch '/mnt/lustre/d300e.sanity/striped_dir/c': Permission denied
mkdir: cannot create directory '/mnt/lustre/d300e.sanity/striped_dir/dir_a': Permission denied
mkdir: cannot create directory '/mnt/lustre/d300e.sanity/striped_dir/dir_c': Permission denied
rename returned -1: No such file or directory
sanity test_300e: @@@@@@ FAIL: rename dir under striped dir fails

with logs at: |
| Comments |
| Comment by Bob Glossman (Inactive) [ 05/Mar/18 ] |
|
another on master: |
| Comment by Andreas Dilger [ 18/Nov/18 ] |
|
Still being seen on master occasionally: |
| Comment by Andreas Dilger [ 23/Mar/19 ] |
|
Failed on test_300j as well: |
| Comment by Andreas Dilger [ 23/Mar/19 ] |
|
A number of separate failures: |
| Comment by Patrick Farrell (Inactive) [ 26/Mar/19 ] |
|
We're getting some sort of unrecognized error in mdt_intent_open, resulting in:

00010000:00010000:0.0:1553279507.468810:0:27025:0:(ldlm_lockd.c:1413:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=301, rc=0) ns: mdt-lustre-MDT0003_UUID lock: ffff90a202783680/0x5a2c7807f4f13313 lrc: 1/0,0 mode: --/CW res: [0x2c00013a3:0x8:0x0].0x0 bits 0x1/0x0 rrc: 2 type: IBT flags: 0x44000000000000 nid: 10.9.6.212@tcp remote: 0x1eeb154cc2268f57 expref: 6 pid: 27025 timeout: 0 lvb_type: 0

ELDLM_LOCK_ABORTED (301), and then I think the client is picking the error out of the intent...?

I'm guessing, but it kind of looks like the ucred checks in mdt_intent_open are failing, since they could generate EACCES. No idea why. |
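When scanning MDS debug logs for this signature, a small script can pull the reply fields out of each ldlm_handle_enqueue0() line. The following is a minimal sketch, not an official Lustre tool: the regular expression is derived only from the two log lines quoted in this ticket (one with and one without a space before the parenthesis), and the function name `parse_enqueue_reply` is hypothetical.

```python
import re

# Pattern based only on the log lines quoted in this ticket; other Lustre
# versions may format the message differently.
ENQUEUE_REPLY = re.compile(
    r"ldlm_handle_enqueue0\(\)\).*?sending reply\s*"
    r"\(err=(?P<err>-?\d+), rc=(?P<rc>-?\d+)\)"
    r".*?ns: (?P<ns>\S+).*?mode: (?P<mode>\S+).*?nid: (?P<nid>\S+)"
)

ELDLM_LOCK_ABORTED = 301  # value quoted in the comment above


def parse_enqueue_reply(line):
    """Return the reply fields from one debug-log line, or None if absent."""
    m = ENQUEUE_REPLY.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["err"] = int(fields["err"])
    fields["rc"] = int(fields["rc"])
    return fields


# The log line from the comment above, split for readability.
sample = (
    "00010000:00010000:0.0:1553279507.468810:0:27025:0:"
    "(ldlm_lockd.c:1413:ldlm_handle_enqueue0()) ### server-side enqueue handler, "
    "sending reply(err=301, rc=0) ns: mdt-lustre-MDT0003_UUID "
    "lock: ffff90a202783680/0x5a2c7807f4f13313 lrc: 1/0,0 mode: --/CW "
    "res: [0x2c00013a3:0x8:0x0].0x0 bits 0x1/0x0 rrc: 2 type: IBT "
    "flags: 0x44000000000000 nid: 10.9.6.212@tcp remote: 0x1eeb154cc2268f57"
)
reply = parse_enqueue_reply(sample)
if reply is not None and reply["err"] == ELDLM_LOCK_ABORTED:
    print("lock aborted on", reply["ns"], "for client", reply["nid"])
```

Filtering on `err == 301` only locates the aborted enqueues; it does not by itself confirm that the ucred checks are the source of the EACCES.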
| Comment by Minh Diep [ 02/Apr/19 ] |
|
+1 on b2_12: https://testing.whamcloud.com/test_sets/62a29848-552a-11e9-a256-52540065bddc |
| Comment by James Nunez (Inactive) [ 23/May/19 ] |
|
We see this issue with sanity test_300f. Looking at https://testing.whamcloud.com/test_sets/a017e772-7d35-11e9-a028-52540065bddc, the test fails with

== sanity test 300f: check rename cross striped directory ============================================ 03:15:34 (1558581334)
rename returned 0: Success
rename returned 0: Success
rename returned -1: Permission denied
sanity test_300f: @@@@@@ FAIL: rename file under diff striped dirs fails

In the MDS2, 4 debug log, we see the ldlm_handle_enqueue0() error a couple of times that Patrick commented on

00010000:00010000:1.0:1558581331.340655:0:342:0:(ldlm_lockd.c:1481:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply (err=301, rc=0) ns: mdt-lustre-MDT0001_UUID lock: ffff9a76c0edc000/0x891b2a9bcbdbcf88 lrc: 1/0,0 mode: --/CR res: [0x240001b76:0x9dc:0x0].0x0 bits 0x1/0x0 rrc: 4 type: IBT flags: 0x44000000000000 nid: 10.9.5.163@tcp remote: 0xda582325b93faf54 expref: 22 pid: 342 timeout: 0 lvb_type: 0 |
| Comment by Andreas Dilger [ 15/Jul/19 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/a7f45c8a-a6e0-11e9-8fc1-52540065bddc |
| Comment by James Nunez (Inactive) [ 12/Sep/19 ] |
|
When one or more of the sanity 300* tests fail, all test suites that run after sanity fail because they can't clean up the file system. For example, in a recent failure where sanity tests 300e, 300h and 300i fail, sanityn then fails to run. In the suite_log at https://testing.whamcloud.com/test_sets/93b42e24-d4eb-11e9-a2b6-52540065bddc, we see

rm: cannot remove '/mnt/lustre/d300e.sanity/striped_dir/stp_a': Permission denied
rm: cannot remove '/mnt/lustre/d300e.sanity/striped_dir/stp_c': Permission denied
rm: cannot remove '/mnt/lustre/d300h.sanity/striped_dir/test3': Permission denied
Trace dump:
= /usr/lib64/lustre/tests/sanityn.sh:56:main()
sanityn: FAIL: test-framework exiting on error |
| Comment by Andreas Dilger [ 03/Dec/19 ] |
|
+6 on master in the past 4 weeks: |
| Comment by Mikhail Pershin [ 16/Feb/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/35856130-504a-11ea-bcf8-52540065bddc |
| Comment by Chris Horn [ 25/Feb/20 ] |
|
+1 on master |
| Comment by Bruno Faccini (Inactive) [ 04/Mar/20 ] |
|
+1 with recent master at https://testing.whamcloud.com/test_sets/21a96f11-1563-4baa-9f9e-9b3dbb59f19e
| Comment by Chris Horn [ 04/May/20 ] |
|
+1 on master https://testing.whamcloud.com/test_sessions/4496d6f6-4562-485b-a62f-51bf7bd2b14a |
| Comment by Emoly Liu [ 05/Jun/20 ] |
|
more on master: |
| Comment by Chris Horn [ 09/Jun/20 ] |
|
+1 on b2_12: https://testing.whamcloud.com/test_sessions/48af3de1-6698-480f-bf8b-9c03a0777418 |
| Comment by Gerrit Updater [ 27/Sep/20 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40062 |
| Comment by Gerrit Updater [ 19/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40062/ |
| Comment by Peter Jones [ 19/Oct/20 ] |
|
Landed for 2.14 |
| Comment by Olaf Faaland [ 02/Nov/20 ] |
|
Lai, Should this be backported to b2_12? It also uses la_flags without first checking la_valid. |
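The pattern referred to here, reading a field only after confirming that its validity bit is set, can be sketched as follows. This is a toy Python model and not the code from either patch: `LuAttr`, `flag_is_set`, `HYPOTHETICAL_FLAG` and the bit values are placeholders for illustration, and only the field names `la_valid` and `la_flags` come from the comment above.

```python
from dataclasses import dataclass

# Placeholder bit values for illustration only; the real constants are defined
# in the Lustre headers and are not quoted in this ticket.
LA_FLAGS = 1 << 0          # "la_flags has been filled in" validity bit
HYPOTHETICAL_FLAG = 0x1    # some flag a caller might want to test


@dataclass
class LuAttr:
    """Toy stand-in for an attribute struct that carries a validity mask."""
    la_valid: int = 0  # bitmask of fields that were actually filled in
    la_flags: int = 0  # only meaningful when LA_FLAGS is set in la_valid


def flag_is_set(attr, flag):
    """Test a flag only if la_flags is marked valid; otherwise report False."""
    if not (attr.la_valid & LA_FLAGS):
        # la_flags may hold stale or uninitialized data, so do not trust it.
        return False
    return bool(attr.la_flags & flag)


# la_flags holds leftover bits, but la_valid does not vouch for them.
stale = LuAttr(la_valid=0, la_flags=HYPOTHETICAL_FLAG)
# la_flags was filled in and marked valid.
fresh = LuAttr(la_valid=LA_FLAGS, la_flags=HYPOTHETICAL_FLAG)

print(flag_is_set(stale, HYPOTHETICAL_FLAG), flag_is_set(fresh, HYPOTHETICAL_FLAG))
```

Without the guard, the `stale` case would report the flag as set, which is the kind of spurious result an unchecked read of la_flags could produce.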
| Comment by Gerrit Updater [ 10/Nov/20 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40585 |
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40585/ |