[LU-10753] sanity test 300c fails with 'create 5k files failed' Created: 01/Mar/18  Updated: 04/Mar/21  Resolved: 19/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.1, Lustre 2.11.0, Lustre 2.12.0, Lustre 2.13.0, Lustre 2.14.0
Fix Version/s: Lustre 2.14.0, Lustre 2.12.7

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: dne, zfs

Issue Links:
Duplicate
is duplicated by LU-11487 sanity: 300b failed with 'touch error 0' Open
Related
is related to LU-10592 sanity test_300h: create files failed Open
is related to LU-13400 sanity test_300d: createmany 10 under... Resolved
Severity: 3

 Description   

sanity test_300c fails to create 5,000 files on Lustre file systems with DNE configured and ZFS server targets. In fact, it can’t create even one file. The client test log has the following output:

== sanity test 300c: chown && check ls under striped directory ======================================= 07:16:38 (1508915798)
CMD: trevis-49vm7 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
running as uid/gid/euid/egid 500/500/500/500, groups:
 [createmany] [-o] [/mnt/lustre/d300c.sanity/striped_dir/f] [5000]
open(/mnt/lustre/d300c.sanity/striped_dir/f0) error: Permission denied
total: 0 open/close in 0.01 seconds: 0.00 ops/second
 sanity test_300c: @@@@@@ FAIL: create 5k files failed

There’s nothing interesting in the console and dmesg logs on any of the nodes.

This test started failing with the ‘create 5k files failed’ error on 2017-09-29 for master tag 2.10.53. We’ve also seen it fail once in a full DNE/ZFS test session on 2.10.1.

Logs for test sessions with this failure are at:
https://testing.hpdd.intel.com/test_sets/7ce33f5a-ba7b-11e7-9abd-52540065bddc
https://testing.hpdd.intel.com/test_sets/336e7196-12b9-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/d61ba008-167e-11e8-bd00-52540065bddc

We see the other sanity 300* tests fail in similar ways and can open separate tickets if the following failures turn out to have distinct root causes.

sanity test 300d fails with

== sanity test 300d: check default stripe under striped directory ==================================== 06:31:06 (1516775466)
CMD: trevis-13vm4 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
open(/mnt/lustre/d300d.sanity/striped_dir/f1) error: Permission denied
total: 1 open/close in 0.01 seconds: 189.21 ops/second
 sanity test_300d: @@@@@@ FAIL: create 10 files failed 

with logs at:
https://testing.hpdd.intel.com/test_sets/46afbcfe-00e6-11e8-a10a-52540065bddc

sanity test 300e fails with

== sanity test 300e: check rename under striped directory ============================================ 01:22:13 (1519521733)
CMD: trevis-66vm8 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
touch: cannot touch '/mnt/lustre/d300e.sanity/striped_dir/a': Permission denied
touch: cannot touch '/mnt/lustre/d300e.sanity/striped_dir/c': Permission denied
mkdir: cannot create directory '/mnt/lustre/d300e.sanity/striped_dir/dir_a': Permission denied
mkdir: cannot create directory '/mnt/lustre/d300e.sanity/striped_dir/dir_c': Permission denied
rename returned -1: No such file or directory
 sanity test_300e: @@@@@@ FAIL: rename dir under striped dir fails 

with logs at:
https://testing.hpdd.intel.com/test_sets/cceb366a-19df-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/46afbcfe-00e6-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/4b90ec14-ffb7-11e7-bd00-52540065bddc



 Comments   
Comment by Bob Glossman (Inactive) [ 05/Mar/18 ]

another on master:
https://testing.hpdd.intel.com/test_sets/0b50fc48-20b9-11e8-9ec4-52540065bddc

Comment by Andreas Dilger [ 18/Nov/18 ]

Still being seen on master occasionally:
https://testing.whamcloud.com/test_sets/beabbaf8-e3c3-11e8-b67f-52540065bddc

Comment by Andreas Dilger [ 23/Mar/19 ]

Failed on test_300j as well:
https://testing.whamcloud.com/test_sets/1e634084-4cf5-11e9-92fe-52540065bddc

Comment by Andreas Dilger [ 23/Mar/19 ]

A number of separate failures:
https://testing.whamcloud.com/test_sets/9662f4e4-4147-11e9-b98a-52540065bddc

Comment by Patrick Farrell (Inactive) [ 26/Mar/19 ]

We're getting some sort of unrecognized error in mdt_intent_open, resulting in:

00010000:00010000:0.0:1553279507.468810:0:27025:0:(ldlm_lockd.c:1413:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=301, rc=0) ns: mdt-lustre-MDT0003_UUID lock: ffff90a202783680/0x5a2c7807f4f13313 lrc: 1/0,0 mode: --/CW res: [0x2c00013a3:0x8:0x0].0x0 bits 0x1/0x0 rrc: 2 type: IBT flags: 0x44000000000000 nid: 10.9.6.212@tcp remote: 0x1eeb154cc2268f57 expref: 6 pid: 27025 timeout: 0 lvb_type: 0 

ELDLM_LOCK_ABORTED (301), and then I think the client is picking the error out of the intent...?

I'm guessing, but it kind of looks like the ucred checks in mdt_intent_open are failing, since they could generate EACCES.  No idea why.
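
For anyone triaging similar debug logs, here is a minimal Python sketch that pulls the err/rc pair, namespace and lock mode out of an ldlm_handle_enqueue0() reply line like the one quoted above. The helper names (parse_enqueue_reply, aborted_replies) and the regular expression are assumptions made for this illustration; they are not part of Lustre or its test framework, and the pattern has only been written against the lines quoted in this ticket.

```python
import re

# Numeric value quoted in this ticket for ELDLM_LOCK_ABORTED.
ELDLM_LOCK_ABORTED = 301

# Accepts both "sending reply(err=..." and "sending reply (err=...".
REPLY_RE = re.compile(
    r"ldlm_handle_enqueue0\(\)\).*?sending reply\s*"
    r"\(err=(?P<err>-?\d+), rc=(?P<rc>-?\d+)\)"
    r".*?\bns: (?P<ns>\S+)"
    r".*?\bmode: (?P<mode>\S+)"
)


def parse_enqueue_reply(line):
    """Return a dict with err, rc, ns and mode, or None if the line does not match."""
    m = REPLY_RE.search(line)
    if m is None:
        return None
    return {
        "err": int(m.group("err")),
        "rc": int(m.group("rc")),
        "ns": m.group("ns"),
        "mode": m.group("mode"),
    }


def aborted_replies(lines):
    """Yield parsed replies whose err field equals ELDLM_LOCK_ABORTED."""
    for line in lines:
        info = parse_enqueue_reply(line)
        if info is not None and info["err"] == ELDLM_LOCK_ABORTED:
            yield info


# The log line from this comment, with the tail after the lock mode dropped for brevity.
sample = (
    "00010000:00010000:0.0:1553279507.468810:0:27025:0:"
    "(ldlm_lockd.c:1413:ldlm_handle_enqueue0()) ### server-side enqueue handler, "
    "sending reply(err=301, rc=0) ns: mdt-lustre-MDT0003_UUID "
    "lock: ffff90a202783680/0x5a2c7807f4f13313 lrc: 1/0,0 mode: --/CW"
)

for reply in aborted_replies([sample]):
    print(reply)
```

Filtering on err rather than rc matters here because, as the quoted line shows, the handler reports rc=0 while carrying the abort in err.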

Comment by Minh Diep [ 02/Apr/19 ]

+1 on b2_12: https://testing.whamcloud.com/test_sets/62a29848-552a-11e9-a256-52540065bddc

Comment by James Nunez (Inactive) [ 23/May/19 ]

We see this issue with sanity test_300f.

Looking at https://testing.whamcloud.com/test_sets/a017e772-7d35-11e9-a028-52540065bddc, the test fails with

== sanity test 300f: check rename cross striped directory ============================================ 03:15:34 (1558581334)
rename returned 0: Success
rename returned 0: Success
rename returned -1: Permission denied
 sanity test_300f: @@@@@@ FAIL: rename file under diff striped dirs fails 

In the debug log for MDS2 and MDS4, the ldlm_handle_enqueue0() error that Patrick commented on appears a couple of times:

00010000:00010000:1.0:1558581331.340655:0:342:0:(ldlm_lockd.c:1481:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply (err=301, rc=0) ns: mdt-lustre-MDT0001_UUID lock: ffff9a76c0edc000/0x891b2a9bcbdbcf88 lrc: 1/0,0 mode: --/CR res: [0x240001b76:0x9dc:0x0].0x0 bits 0x1/0x0 rrc: 4 type: IBT flags: 0x44000000000000 nid: 10.9.5.163@tcp remote: 0xda582325b93faf54 expref: 22 pid: 342 timeout: 0 lvb_type: 0

Comment by Andreas Dilger [ 15/Jul/19 ]

+1 on master https://testing.whamcloud.com/test_sets/a7f45c8a-a6e0-11e9-8fc1-52540065bddc

Comment by James Nunez (Inactive) [ 12/Sep/19 ]

When one or more of the sanity 300* tests fail, all test suites that run after sanity also fail because they can't clean up the file system. For example, in a recent session where sanity tests 300e, 300h and 300i failed, sanityn then failed to run. Its suite_log at https://testing.whamcloud.com/test_sets/93b42e24-d4eb-11e9-a2b6-52540065bddc shows:

rm: cannot remove '/mnt/lustre/d300e.sanity/striped_dir/stp_a': Permission denied
rm: cannot remove '/mnt/lustre/d300e.sanity/striped_dir/stp_c': Permission denied
rm: cannot remove '/mnt/lustre/d300h.sanity/striped_dir/test3': Permission denied
  Trace dump:
  = /usr/lib64/lustre/tests/sanityn.sh:56:main()
sanityn: FAIL: test-framework exiting on error

Comment by Andreas Dilger [ 03/Dec/19 ]

+6 on master in the past 4 weeks:
https://testing.whamcloud.com/test_sets/8dc83f6c-0148-11ea-bbc3-52540065bddc
https://testing.whamcloud.com/test_sets/f1be49c0-0180-11ea-a9d7-52540065bddc
https://testing.whamcloud.com/test_sets/3c91882a-0656-11ea-9487-52540065bddc
https://testing.whamcloud.com/test_sets/29c39e52-0f63-11ea-b934-52540065bddc
https://testing.whamcloud.com/test_sets/afbb3650-1030-11ea-a9d7-52540065bddc
https://testing.whamcloud.com/test_sets/d662d80a-1556-11ea-a9d7-52540065bddc

Comment by Mikhail Pershin [ 16/Feb/20 ]

+1 on master

https://testing.whamcloud.com/test_sets/35856130-504a-11ea-bcf8-52540065bddc

Comment by Chris Horn [ 25/Feb/20 ]

+1 on master
https://testing.whamcloud.com/test_sets/8ee226b4-e8e5-43e0-ad13-5b8fd72fd225

Comment by Bruno Faccini (Inactive) [ 04/Mar/20 ]

+1 with recent master at https://testing.whamcloud.com/test_sets/21a96f11-1563-4baa-9f9e-9b3dbb59f19e

Comment by Chris Horn [ 04/May/20 ]

+1 on master https://testing.whamcloud.com/test_sessions/4496d6f6-4562-485b-a62f-51bf7bd2b14a

Comment by Emoly Liu [ 05/Jun/20 ]

more on master:
https://testing.whamcloud.com/test_sets/53081ec8-1310-4cc9-9215-e17a0ce8a432
https://testing.whamcloud.com/test_sets/887d2132-33fe-4a5e-b850-5c5ae64107de

Comment by Chris Horn [ 09/Jun/20 ]

+1 on b2_12: https://testing.whamcloud.com/test_sessions/48af3de1-6698-480f-bf8b-9c03a0777418

Comment by Gerrit Updater [ 27/Sep/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40062
Subject: LU-10753 osd-zfs: initialize obj attr correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2a89e9090b57a333e9bff89ce34f315f99ff49f1

Comment by Gerrit Updater [ 19/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40062/
Subject: LU-10753 osd-zfs: initialize obj attr correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf395c2507e80717e7468456e9959d432b6accc8

Comment by Peter Jones [ 19/Oct/20 ]

Landed for 2.14

Comment by Olaf Faaland [ 02/Nov/20 ]

Lai,

Should this be backported to b2_12?  It also uses la_flags without first checking la_valid.
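
The pattern in question (reading la_flags without first checking la_valid) can be illustrated with a small model. This is a Python sketch, not Lustre's C code: the Attr class, the bit values and both helper functions are hypothetical and chosen only for the example. The ticket does not state which stale flag, if any, produced the Permission denied errors, so the "immutable" bit here is purely an assumption for illustration.

```python
from dataclasses import dataclass

# Hypothetical bit values for this sketch; they are not copied from the Lustre headers.
VALID_FLAGS = 1 << 0      # "the flags field has been filled in"
FLAG_IMMUTABLE = 1 << 4   # example of a flag that would make modifications fail


@dataclass
class Attr:
    valid: int = 0   # bitmask naming which fields are meaningful
    flags: int = 0   # only meaningful when VALID_FLAGS is set in `valid`


def is_immutable_unchecked(attr):
    """Risky pattern: trusts `flags` even if it was never marked valid."""
    return bool(attr.flags & FLAG_IMMUTABLE)


def is_immutable_checked(attr):
    """Safer pattern: consult `valid` before interpreting `flags`."""
    return bool(attr.valid & VALID_FLAGS) and bool(attr.flags & FLAG_IMMUTABLE)


# An attribute whose `flags` holds leftover bits but was never marked valid.
stale = Attr(valid=0, flags=FLAG_IMMUTABLE)

print(is_immutable_unchecked(stale))  # acts on the leftover bits
print(is_immutable_checked(stale))    # ignores them because `valid` is clear
```

The two helpers disagree only when flags carries bits that valid does not vouch for, which is the situation an incorrectly initialized attribute structure would create.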

Comment by Gerrit Updater [ 10/Nov/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40585
Subject: LU-10753 osd-zfs: initialize obj attr correctly
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: aa62d909598d89d13bf564680ca2258daa1c0edf

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40585/
Subject: LU-10753 osd-zfs: initialize obj attr correctly
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 201efe7eea3f7d6fb6410db8b78d0c9187c274ce
