[LU-11984] Intermittent file create or rm fail with EINVAL Created: 21/Feb/19  Updated: 13/Mar/19  Resolved: 13/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Patrick Farrell (Inactive)
Resolution: Duplicate Votes: 0
Labels: llnl
Environment:

linux 3.10.0-957.1.3.1chaos.ch6.x86_64
lustre-2.12.0_1.chaos-1.ch6.x86_64
Clients OmniPath <> routers <> Servers mlx5
See https://github.com/LLNL/lustre/releases for contents of 2.12.0_1.chaos.


Attachments: dk.jet3.1550709103.gz, dk.opal67.1550709082.gz
Issue Links:
Duplicate
duplicates LU-11827 Race between llog_cat_declare_add_rec... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

mdtest intermittently fails, reporting EINVAL when trying to create or remove a file.

mdtest-1.8.3 was launched with 1024 total task(s) on 64 nodes
Command line used: /g/g0/faaland1/projects/mdtest/mdtest/mdtest -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v
Path: /p/lquake/faaland1                                                                                                    
FS: 1867.3 TiB   Used FS: 34.2%   Inodes: 765.8 Mi   Used Inodes: 57.1%                                                     
1024 tasks, 1048576 files

 Operation               Duration              Rate
   ---------               --------              ----
 * iteration 1 02/20/2019 13:37:43 *                 
   Tree creation     :      0.076 sec,     13.191 ops/sec

02/20/2019 13:39:00: Process 158(opal119): FAILED in create_remove_items_helper, unable to unlink file file.mdtest.158.223 (cwd=/p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.158.0): Invalid argument                                                                         
--------------------------------------------------------------------------                                                                       
MPI_ABORT was invoked on rank 158 in communicator MPI_COMM_WORLD with errorcode 1.

Seen with:
no DoM
no PFL
16 MDTs in the file system, but the directory mdtest is using is not striped.
64 nodes x 16 ppn



 Comments   
Comment by Olaf Faaland [ 21/Feb/19 ]

All on MDT0002

[root@opal67:toss-4452]# lfs getdirstripe /p/lquake/faaland1/lustre-212-reconnects
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
[root@opal67:toss-4452]# lfs getdirstripe /p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
[root@opal67:toss-4452]# lfs getdirstripe /p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.550.0
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
Comment by Olaf Faaland [ 21/Feb/19 ]

Reproduced with +rpctrace and gathered debug log from the client, opal67. Log attached as dk.opal67.1550709082.gz.

mdtest-1.8.3 was launched with 1024 total task(s) on 64 nodes
Command line used: /g/g0/faaland1/projects/mdtest/mdtest/mdtest -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v                                                                                                  
Path: /p/lquake/faaland1                                                                                       
FS: 1867.3 TiB   Used FS: 34.2%   Inodes: 757.4 Mi   Used Inodes: 57.7%                                        
1024 tasks, 1048576 files
   Operation               Duration              Rate
   ---------               --------              ----
 * iteration 1 02/20/2019 16:26:38 *                 
   Tree creation     :      0.077 sec,     13.047 ops/sec

02/20/2019 16:27:23: Process 550(opal67): FAILED in create_remove_items_helper, unable to create file file.mdtest.550.791 (cwd=/p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.550.0): Invalid argument
Comment by Olaf Faaland [ 21/Feb/19 ]

Not sure if it's relevant, but the debug log from opal67 includes:

dk.opal67.1550709082:00000002:00100000:13.0:1550708843.710600:0:172721:0:(mdc_locks.c:627:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -22  req@ffff888518842400 x1625568471279792/t0(0) o101->lquake-MDT0002-mdc-ffff8895b3f6f800@172.19.1.113@o2ib100:12/10 lens 720/560 e 0 to 0 dl 1550708949 ref 1 fl Complete:R/0/0 rc 301/301
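(A minimal sketch, not from the original report: assuming the attached filename and the standard dk log formatting, lines like the one above can be pulled out of the client log as follows.)

# find MDC enqueue completions that returned a negative status (e.g. -22 / EINVAL)
zcat dk.opal67.1550709082.gz | grep -E 'mdc_finish_enqueue.*status: -[0-9]+'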
Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ]

Olaf,

It's definitely relevant, but unfortunately, all the client logs show us is that the server didn't like your open.

Can you get dmesg and the output of lctl dk from the MDS you're using here?  Start with that, but if you can reproduce with any of the trace, inode, or info debug flags (or all three) enabled on the MDS, that would be great.  Inode and info are lighter, but likely less helpful.
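(A minimal sketch of the MDS-side collection, not from the original comment; the debug_mb size and output paths are arbitrary example values.)

# on the MDS, before reproducing: add the extra debug flags and grow the
# debug buffer so the failure window is not overwritten
lctl set_param debug="+trace +inode +info"
lctl set_param debug_mb=1024
lctl clear
# ... reproduce the mdtest failure, then save the kernel debug log and dmesg
lctl dk > /tmp/dk.mds.$(date +%s)
dmesg > /tmp/dmesg.mds.$(date +%s)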

Comment by Olaf Faaland [ 21/Feb/19 ]

Patrick,

I attached dk.jet3.1550709103.gz, which is from the node running MDT0002 at the time.  This is from yesterday, so it does not have any of the debug flags you asked for set.  When I get time on that machine again, hopefully today, I'll try again with your flags.  Thank you.

Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ]

Unfortunately, the MDS dk log starts just a few seconds after that failure:

 00000100:00100000:6.0:1550708942.344015:0:56668:0:(ptlrpcd.c:409:ptlrpcd_check()) transfer 1 async RPCs [1->0]
1550708942 <--

00000002:00100000:13.0:1550708843.710600:0:172721:0:(mdc_locks.c:627:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -22 req@ffff888518842400 x1625568471279792/t0(0) o101->lquake-MDT0002-mdc-ffff8895b3f6f800@172.19.1.113@o2ib100:12/10 lens 720/560 e 0 to 0 dl 1550708949 ref 1 fl Complete:R/0/0 rc 301/301

1550708843 <--

So it contains no useful info.
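(A minimal sketch, not from the original comment: assuming the attached filename and the subsystem:mask:cpu:timestamp:... dk line layout, the earliest timestamp a dk log covers can be checked like this, which is how the gap above shows up.)

# print the oldest timestamp present in the MDS debug log
zcat dk.jet3.1550709103.gz | awk -F: '$4 ~ /^[0-9]+\.[0-9]+$/ { if (min == "" || $4 + 0 < min) min = $4 + 0 } END { printf "%.6f\n", min }'
# per the excerpt above this prints ~1550708942, i.e. after the failing enqueue at 1550708843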

Dmesg from the MDS might have something...?

Comment by Olaf Faaland [ 22/Feb/19 ]

"Unfortunately, the MDS dk log starts just a few seconds after that failure:"

Argh, I should have seen that.

There's nothing in dmesg from the MDS anywhere near the time of the failure.   I should be able to get the cluster back later today or tomorrow to try again.

Comment by Patrick Farrell (Inactive) [ 22/Feb/19 ]

No problem.

All right, I'll wait for more from you.  Thanks.

Comment by Alexander Boyko [ 27/Feb/19 ]

This is probably a duplicate of LU-11827. We saw unlink fail with "Invalid argument" during a regular mdtest run.

@Olaf Faaland, could you check with the LU-11827 patch? It landed on master yesterday.

Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ]

Alex,

Thank you very much!  This does indeed look like LU-11827.  ofaaland, when you get a chance, it would be good to try that out.

Comment by Olaf Faaland [ 08/Mar/19 ]

Hi Patrick,

I got the cluster back.  I applied the patch from LU-11827 to Lustre 2.12.0 and am using that build on both client and server.  Creates now fail, but much more consistently and with different symptoms.  The user process gets back ENOENT instead of EINVAL, and the server's console shows a LustreError which did not occur before:

LustreError: 49910:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0001-mdtlov: Can not locate [0x700000bd5:0x16:0x0]: rc = -2

This was produced running a single-node x 16 ppn mdtest, without DoM or PFL.

This seems to me like a different problem entirely, so I am not uploading the debug logs. If you agree it's distinct, I can create a new ticket and put them there.

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Whoops, missed this update...  But this most recent report is (your new ticket) LU-12063.  Let's close this one out as a duplicate of LU-11827 and move discussion there.

Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ]

Duplicate of LU-11827
