[LU-11984] Intermittent file create or rm fail with EINVAL Created: 21/Feb/19 Updated: 13/Mar/19 Resolved: 13/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
linux 3.10.0-957.1.3.1chaos.ch6.x86_64 |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
mdtest intermittently fails and reports an EINVAL error when trying to create or remove a file.

mdtest-1.8.3 was launched with 1024 total task(s) on 64 nodes
Command line used: /g/g0/faaland1/projects/mdtest/mdtest/mdtest -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v
Path: /p/lquake/faaland1
FS: 1867.3 TiB   Used FS: 34.2%
Inodes: 765.8 Mi   Used Inodes: 57.1%

1024 tasks, 1048576 files

Operation      Duration      Rate
---------      --------      ----
* iteration 1 02/20/2019 13:37:43 *
   Tree creation : 0.076 sec, 13.191 ops/sec

02/20/2019 13:39:00: Process 158(opal119): FAILED in create_remove_items_helper, unable to unlink file file.mdtest.158.223 (cwd=/p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.158.0): Invalid argument
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 158 in communicator MPI_COMM_WORLD with errorcode 1.

Seen with: |
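The "Invalid argument" strings mdtest prints are the userspace rendering of errno EINVAL (22), which matches the status: -22 that appears in the client debug log quoted later in this ticket. A quick sanity check of that mapping:

```python
import errno
import os

# EINVAL is errno 22; a Lustre RPC that fails with status: -22 surfaces to
# userspace (and so to mdtest) as "Invalid argument".
assert errno.EINVAL == 22
print(os.strerror(errno.EINVAL))  # Invalid argument
```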
| Comments |
| Comment by Olaf Faaland [ 21/Feb/19 ] |
|
All on MDT0002:

[root@opal67:toss-4452]# lfs getdirstripe /p/lquake/faaland1/lustre-212-reconnects
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
[root@opal67:toss-4452]# lfs getdirstripe /p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none
[root@opal67:toss-4452]# lfs getdirstripe /p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.550.0
lmv_stripe_count: 0 lmv_stripe_offset: 2 lmv_hash_type: none |
| Comment by Olaf Faaland [ 21/Feb/19 ] |
|
Reproduced with +rpctrace and gathered a debug log from the client, opal67. Log attached as dk.opal67.1550709082.gz.

mdtest-1.8.3 was launched with 1024 total task(s) on 64 nodes
Command line used: /g/g0/faaland1/projects/mdtest/mdtest/mdtest -d /p/lquake/faaland1/lustre-212-reconnects -n 1024 -F -u -v
Path: /p/lquake/faaland1
FS: 1867.3 TiB   Used FS: 34.2%
Inodes: 757.4 Mi   Used Inodes: 57.7%

1024 tasks, 1048576 files

Operation      Duration      Rate
---------      --------      ----
* iteration 1 02/20/2019 16:26:38 *
   Tree creation : 0.077 sec, 13.047 ops/sec

02/20/2019 16:27:23: Process 550(opal67): FAILED in create_remove_items_helper, unable to create file file.mdtest.550.791 (cwd=/p/lquake/faaland1/lustre-212-reconnects/#test-dir.0/mdtest_tree.550.0): Invalid argument |
| Comment by Olaf Faaland [ 21/Feb/19 ] |
|
Not sure if it's relevant, but the debug log from opal67 includes:

dk.opal67.1550709082:00000002:00100000:13.0:1550708843.710600:0:172721:0:(mdc_locks.c:627:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -22 req@ffff888518842400 x1625568471279792/t0(0) o101->lquake-MDT0002-mdc-ffff8895b3f6f800@172.19.1.113@o2ib100:12/10 lens 720/560 e 0 to 0 dl 1550708949 ref 1 fl Complete:R/0/0 rc 301/301 |
| Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ] |
|
Olaf,

It's definitely relevant, but unfortunately all the client log shows us is that the server didn't like your open. Can you get dmesg and the output of lctl dk from the MDS you're using here? Start with that, but if you can reproduce with any of the trace, inode, or info debug flags (or all of them) enabled on the MDS, that would be great. Inode and info are lighter, but likely less helpful. |
| Comment by Olaf Faaland [ 21/Feb/19 ] |
|
Patrick, I attached dk.jet3.1550709103.gz which is from the node running MDT0002 at the time. This is from yesterday, so does not have any of the debug flags you asked for set. When I get time on that machine again, hopefully today, I'll try again with your flags. Thank you. |
| Comment by Patrick Farrell (Inactive) [ 21/Feb/19 ] |
|
Unfortunately, the MDS dk log starts just after that failure:

00000100:00100000:6.0:1550708942.344015:0:56668:0:(ptlrpcd.c:409:ptlrpcd_check()) transfer 1 async RPCs [1->0]    1550708942 <--
00000002:00100000:13.0:1550708843.710600:0:172721:0:(mdc_locks.c:627:mdc_finish_enqueue()) @@@ op: 3 disposition: 17, status: -22 req@ffff888518842400 x1625568471279792/t0(0) o101->lquake-MDT0002-mdc-ffff8895b3f6f800@172.19.1.113@o2ib100:12/10 lens 720/560 e 0 to 0 dl 1550708949 ref 1 fl Complete:R/0/0 rc 301/301    1550708843 <--

So it contains no useful info. Dmesg from the MDS might have something...? |
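The timestamp mismatch Patrick points out can be checked mechanically. Below is a small sketch that pulls the epoch timestamp, pid, and message out of dk-style log lines; the regex is a best-effort guess at the field layout visible in the lines quoted in this ticket, not an official parser of the lctl dk format.

```python
import re

# Lustre debug-log ("lctl dk") lines, as quoted in this ticket, look like:
#   subsys:mask:cpu:epoch.usec:0:pid:0:(file.c:line:func()) message
# Best-effort regex for that layout (an assumption, not an official grammar).
DK_LINE = re.compile(
    r'^(?P<subsys>[0-9a-f]+):(?P<mask>[0-9a-f]+):(?P<cpu>[\d.]+):'
    r'(?P<stamp>\d+\.\d+):\d+:(?P<pid>\d+):\d+:'
    r'\((?P<src>[^)]*\(\))\)\s*(?P<msg>.*)$'
)

def parse_dk_line(line):
    """Return (epoch_seconds, pid, source, message), or None on no match."""
    m = DK_LINE.match(line.strip())
    if m is None:
        return None
    return (float(m.group('stamp')), int(m.group('pid')),
            m.group('src'), m.group('msg'))

# First line captured in the MDS log vs. the client-side enqueue failure:
mds_first = ("00000100:00100000:6.0:1550708942.344015:0:56668:0:"
             "(ptlrpcd.c:409:ptlrpcd_check()) transfer 1 async RPCs [1->0]")
client_fail = ("00000002:00100000:13.0:1550708843.710600:0:172721:0:"
               "(mdc_locks.c:627:mdc_finish_enqueue()) @@@ op: 3 "
               "disposition: 17, status: -22")

mds_ts = parse_dk_line(mds_first)[0]
fail_ts = parse_dk_line(client_fail)[0]
print(f"MDS capture began {mds_ts - fail_ts:.1f}s after the failure")
# MDS capture began 98.6s after the failure
```

Comparing the two stamps this way shows the MDS debug buffer had already wrapped (or capture started) well after the failing enqueue, which is why the log holds nothing useful.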
| Comment by Olaf Faaland [ 22/Feb/19 ] |
Argh, I should have seen that. There's nothing in dmesg from the MDS anywhere near the time of the failure. I should be able to get the cluster back later today or tomorrow to try again. |
| Comment by Patrick Farrell (Inactive) [ 22/Feb/19 ] |
|
No problem. All right, I'll wait for more from you. Thanks. |
| Comment by Alexander Boyko [ 27/Feb/19 ] |
|
This is probably a duplicate of [the linked issue]. @Olaf Faaland, could you check with [its patch] applied? |
| Comment by Patrick Farrell (Inactive) [ 28/Feb/19 ] |
|
Alex,

Thank you very much! This does indeed look like [the issue you linked]. |
| Comment by Olaf Faaland [ 08/Mar/19 ] |
|
Hi Patrick,

I got the cluster back. I applied the patch from [the linked ticket], and this was the result:

LustreError: 49910:0:(lod_lov.c:896:lod_gen_component_ea()) lquake-MDT0001-mdtlov: Can not locate [0x700000bd5:0x16:0x0]: rc = -2

This was produced running a single-node x 16tpp mdtest, without DoM or PFL. This seems to me like an entirely different problem, so I am not uploading the debug logs. If you agree it's distinct, I can create a new ticket and put them there. |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Whoops, missed this update... But this most recent report is covered by [the linked ticket] (your new ticket). |
| Comment by Patrick Farrell (Inactive) [ 13/Mar/19 ] |
|
Duplicate of |