[LU-7821] job crash in complex error scenario Created: 26/Feb/16 Updated: 20/Jul/20 Resolved: 20/Jul/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Frank Heckes (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | soak | ||
| Environment: |
lola |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Error happens during soak testing of build '20160224' (b2_8 RC2) (see: Error happened several times during execution of application mdtest (1 file per process) on client nodes lola-[33,34] and reads as: JOBID ERROR - MESSAGE 445852 : 201602 25 21:12:01 : Process 1(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error Lustre Error messages that can be correlated to the event are: lola-10.log:Feb 25 21:12:01 lola-10 kernel: Lustre: soaked-MDT0003-osp-MDT0005: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0006: operation out_update to node 192.168.1.109@o2ib10 failed: rc = -107 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0006: This client was evicted by soaked-MDT0003; in progress operations using this service will fail. lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:887:ldlm_resource_complain()) soaked-MDT0003-osp-MDT0006: namespace resource [0x2c000e6a3:0xa8a2:0x0].0x0 (ffff8806fb30dbc0) refcount nonzero (1) after lock cleanup; forcing cleanup. lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa8a2:0x0].0x0 (ffff8806fb30dbc0) refcount = 2 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1508:ldlm_resource_dump()) ### ### ns: soaked-MDT0003-osp-MDT0006 lock: ffff8806c99d0880/0xaacb8c6ebe9816d2 lrc: 2/0,1 mode: EX/EX res: [0x2c000e6a3:0xa8a2:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x1106401000000 nid: local remote: 0x4af49d2c5913727e expref: -99 pid: 4773 timeout: 0 lvb_type: 0 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa92a:0x0].0x0 (ffff88077f835e40) refcount = 2 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa433:0x0].0x0 (ffff8806b09aa780) refcount = 2 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x38000dec1:0x14895:0x0].0x0 (ffff8807ff657180) refcount = 2 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa8b9:0x0].0x0 (ffff8807d74972c0) refcount = 2 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa89b:0x0].0x0 (ffff8807e5c5f2c0) refcount = 2 lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order): lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x400000403:1549433 to 0x400000403:1549585 lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST0004: deleting orphan objects from 0x500000405:1544672 to 0x500000405:1544849 lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x600000402:1548216 to 0x600000402:1548417 lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x700000401:1545068 to 0x700000401:1545153 lola-30.log:Feb 25 21:12:01 lola-30 kernel: Lustre: soaked-MDT0003-mdc-ffff88106fa1f800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-31.log:Feb 25 21:12:01 lola-31 kernel: Lustre: soaked-MDT0003-mdc-ffff88086597e800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-32.log:Feb 25 21:12:01 lola-32 kernel: Lustre: soaked-MDT0003-mdc-ffff88082f4c4000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-32.log:Feb 25 21:12:01 lola-32 kernel: Lustre: Skipped 1 previous similar message lola-33.log:Feb 25 21:12:01 lola-33 kernel: LustreError: 11-0: soaked-MDT0003-mdc-ffff881032461c00: operation mds_reint to node 192.168.1.109@o2ib10 failed: rc = -107 lola-33.log:Feb 25 21:12:01 lola-33 kernel: LustreError: 167-0: soaked-MDT0003-mdc-ffff881032461c00: This client was evicted by soaked-MDT0003; in progress operations using this service will fail. lola-33.log:Feb 25 21:12:01 lola-33 kernel: LustreError: 157072:0:(lmv_obd.c:1325:lmv_fid_alloc()) Can't alloc new fid, rc -19 lola-33.log:Feb 25 21:12:01 lola-33 kernel: Lustre: soaked-MDT0003-mdc-ffff881032461c00: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-34.log:Feb 25 21:12:01 lola-34 kernel: Lustre: soaked-MDT0003-mdc-ffff88102fa38000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST000d: deleting orphan objects from 0x740000405:1544703 to 0x740000405:1544833 lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST0005: deleting orphan objects from 0x540000403:1536015 to 0x540000403:1536097 lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST0001: deleting orphan objects from 0x440000401:1553755 to 0x440000401:1553873 lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST0009: deleting orphan objects from 0x640000402:1547689 to 0x640000402:1547777 lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x780000403:1542237 to 0x780000403:1542337 lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST000a: deleting orphan objects from 0x6c0000401:1544440 to 0x6c0000401:1544513 lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x480000401:1548270 to 0x480000401:1548385 lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST0006: deleting orphan objects from 0x580000405:1541804 to 0x580000405:1541889 lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST0003: deleting orphan objects from 0x4c0000401:1539783 to 0x4c0000401:1540289 lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST000f: deleting orphan objects from 0x7c0000403:1549006 to 0x7c0000403:1549265 lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST000b: deleting orphan objects from 0x680000401:1548710 to 0x680000401:1548801 lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST0007: deleting orphan objects from 0x5c0000405:1544139 to 0x5c0000405:1544513 lola-8.log:Feb 25 21:12:01 lola-8 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0001: operation out_update to node 192.168.1.109@o2ib10 failed: rc = -107 lola-8.log:Feb 25 21:12:01 lola-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0001: This client was evicted by soaked-MDT0003; in progress operations using this service will fail. lola-8.log:Feb 25 21:12:01 lola-8 kernel: Lustre: soaked-MDT0003-osp-MDT0000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10) lola-8.log:Feb 25 21:12:01 lola-8 kernel: Lustre: Skipped 2 previous similar messages lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961802 flags = 0 ops = 4 params = 7 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961809 flags = 0 ops = 4 params = 7 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961816 flags = 0 ops = 4 params = 7 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961822 flags = 0 ops = 4 params = 7 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961830 flags = 0 ops = 4 params = 7 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366576 batchid = 94491702660 flags = 0 ops = 53 params = 38 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366584 batchid = 94491702661 flags = 0 ops = 53 params = 38 lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366585 batchid = 94491702662 flags = 0 ops = 53 params = 38 lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: soaked-MDT0003: disconnecting 7 stale clients lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: 4484:0:(ldlm_lib.c:1586:abort_req_replay_queue()) @@@ aborted: req@ffff88040a497c80 x1527085464220748/t0(98785366530) o101->bf8a1d5c-0dc5-b3c9-6b26-84d56ad880b2@192.168.1.126@o2ib100:585/0 lens 976/0 e 2 to 0 dl 1456463535 ref 1 fl Complete:/4/ffffffff rc 0/-1 lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: 4484:0:(ldlm_lib.c:2011:target_recovery_overseer()) recovery is aborted, evict exports in recovery lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: soaked-MDT0002: Client 2cb76067-9b42-1736-a64a-e2cc0037f63b (at 192.168.1.132@o2ib100) reconnecting, waiting for 16 clients in recovery for 2:09 lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: Skipped 5 previous similar messages Immediately before a restart of MDT lola-9 finished mds_restart : 2016-02-25 20:59:00,754 - 2016-02-25 21:11:17,795 lola-9 |
| Comments |
| Comment by Cliff White (Inactive) [ 03/Mar/16 ] |
|
We are seeing this failure again on 2.8.0-RC4. 449195-mdtestfpp.out:03/03/2016 09:51:14: Process 0(lola-26.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error 449195-mdtestfpp.out:03/03/2016 09:51:14: Process 1(lola-26.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error 449195-mdtestfpp.out:03/03/2016 09:51:14: Process 6(lola-34.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error Server errors: lola-10.log:Mar 3 09:51:14 lola-10 kernel: LustreError: 4805:0:(llog_cat.c:712:llog_cat_cancel_records()) soaked-MDT0006-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 lola-11.log:Mar 3 09:51:14 lola-11 kernel: Lustre: 4655:0:(ldlm_lib.c:2001:target_recovery_overseer()) soaked-MDT0006 recovery is aborted by hard timeout lola-11.log:Mar 3 09:51:14 lola-11 kernel: Lustre: 4655:0:(ldlm_lib.c:2011:target_recovery_overseer()) recovery is aborted, evict exports in recovery lola-11.log:Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted. lola-16.log:Mar 3 09:51:14 lola-16 kernel: Lustre: soaked-MDT0006-mdc-ffff8807eb5cd400: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10) lola-16.log:Mar 3 09:51:14 lola-16 kernel: LustreError: 36131:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2 lola-26.log:Mar 3 09:51:14 lola-26 kernel: Lustre: soaked-MDT0006-mdc-ffff88081ab31400: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10) lola-2.log:Mar 3 09:51:14 lola-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x400000406:4125908 to 0x400000406:4127809 lola-2.log:Mar 3 09:51:14 lola-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x700000406:4144394 to 0x700000406:4147105 |
| Comment by Oleg Drokin [ 04/Mar/16 ] |
|
so of those 8 clients that did not rejoin - did they crash? any interesting logs on them? What about server logs before the eviction, anything interesting there? Unfortunately now we only know tht recovery failed because half of the clients failed to rejoin - but no idea why did that happen at all |
| Comment by Frank Heckes (Inactive) [ 07/Mar/16 ] |
|
The clients didn't crash, but slurm jobs did (do).
The error might be correlated with |
| Comment by Andreas Dilger [ 20/Jul/20 ] |
|
Close old issue that has not been hit again. |