Details

    • 3
    • 9223372036854775807

    Description

      Error happens during soak testing of build '20160224' (b2_8 RC2) (see:
      https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola& spaceKey=Releases#SoakTestingonLola-20150224). DNE is enabled.
      MDSes had been formatted using ldiskfs, OSTs using zfs. MDSes are configured in active-active HA failover configuration.

      Error happened several times during execution of application mdtest (1 file per process) on client nodes lola-[33,34] and reads as:

       JOBID             ERROR - MESSAGE
      445852 :  201602 25 21:12:01 : Process 1(lola-33.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to remove directory: Input/output error
      

      Lustre Error messages that can be correlated to the event are:

      lola-10.log:Feb 25 21:12:01 lola-10 kernel: Lustre: soaked-MDT0003-osp-MDT0005: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0006: operation out_update to node 192.168.1.109@o2ib10 failed: rc = -107
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0006: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:887:ldlm_resource_complain()) soaked-MDT0003-osp-MDT0006: namespace resource [0x2c000e6a3:0xa8a2:0x0].0x0 (ffff8806fb30dbc0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa8a2:0x0].0x0 (ffff8806fb30dbc0) refcount = 2
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1508:ldlm_resource_dump()) ### ### ns: soaked-MDT0003-osp-MDT0006 lock: ffff8806c99d0880/0xaacb8c6ebe9816d2 lrc: 2/0,1 mode: EX/EX res: [0x2c000e6a3:0xa8a2:0x0].0x0 bits 0x2 rrc: 2 type: IBT flags: 0x1106401000000 nid: local remote: 0x4af49d2c5913727e expref: -99 pid: 4773 timeout: 0 lvb_type: 0
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa92a:0x0].0x0 (ffff88077f835e40) refcount = 2
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa433:0x0].0x0 (ffff8806b09aa780) refcount = 2
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x38000dec1:0x14895:0x0].0x0 (ffff8807ff657180) refcount = 2
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa8b9:0x0].0x0 (ffff8807d74972c0) refcount = 2
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1502:ldlm_resource_dump()) --- Resource: [0x2c000e6a3:0xa89b:0x0].0x0 (ffff8807e5c5f2c0) refcount = 2
      lola-11.log:Feb 25 21:12:01 lola-11 kernel: LustreError: 8253:0:(ldlm_resource.c:1505:ldlm_resource_dump()) Granted locks (in reverse order):
      lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x400000403:1549433 to 0x400000403:1549585
      lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST0004: deleting orphan objects from 0x500000405:1544672 to 0x500000405:1544849
      lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST0008: deleting orphan objects from 0x600000402:1548216 to 0x600000402:1548417
      lola-2.log:Feb 25 21:12:01 lola-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x700000401:1545068 to 0x700000401:1545153
      lola-30.log:Feb 25 21:12:01 lola-30 kernel: Lustre: soaked-MDT0003-mdc-ffff88106fa1f800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-31.log:Feb 25 21:12:01 lola-31 kernel: Lustre: soaked-MDT0003-mdc-ffff88086597e800: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-32.log:Feb 25 21:12:01 lola-32 kernel: Lustre: soaked-MDT0003-mdc-ffff88082f4c4000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-32.log:Feb 25 21:12:01 lola-32 kernel: Lustre: Skipped 1 previous similar message
      lola-33.log:Feb 25 21:12:01 lola-33 kernel: LustreError: 11-0: soaked-MDT0003-mdc-ffff881032461c00: operation mds_reint to node 192.168.1.109@o2ib10 failed: rc = -107
      lola-33.log:Feb 25 21:12:01 lola-33 kernel: LustreError: 167-0: soaked-MDT0003-mdc-ffff881032461c00: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
      lola-33.log:Feb 25 21:12:01 lola-33 kernel: LustreError: 157072:0:(lmv_obd.c:1325:lmv_fid_alloc()) Can't alloc new fid, rc -19
      lola-33.log:Feb 25 21:12:01 lola-33 kernel: Lustre: soaked-MDT0003-mdc-ffff881032461c00: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-34.log:Feb 25 21:12:01 lola-34 kernel: Lustre: soaked-MDT0003-mdc-ffff88102fa38000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST000d: deleting orphan objects from 0x740000405:1544703 to 0x740000405:1544833
      lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST0005: deleting orphan objects from 0x540000403:1536015 to 0x540000403:1536097
      lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST0001: deleting orphan objects from 0x440000401:1553755 to 0x440000401:1553873
      lola-3.log:Feb 25 21:12:01 lola-3 kernel: Lustre: soaked-OST0009: deleting orphan objects from 0x640000402:1547689 to 0x640000402:1547777
      lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST000e: deleting orphan objects from 0x780000403:1542237 to 0x780000403:1542337
      lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST000a: deleting orphan objects from 0x6c0000401:1544440 to 0x6c0000401:1544513
      lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST0002: deleting orphan objects from 0x480000401:1548270 to 0x480000401:1548385
      lola-4.log:Feb 25 21:12:01 lola-4 kernel: Lustre: soaked-OST0006: deleting orphan objects from 0x580000405:1541804 to 0x580000405:1541889
      lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST0003: deleting orphan objects from 0x4c0000401:1539783 to 0x4c0000401:1540289
      lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST000f: deleting orphan objects from 0x7c0000403:1549006 to 0x7c0000403:1549265
      lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST000b: deleting orphan objects from 0x680000401:1548710 to 0x680000401:1548801
      lola-5.log:Feb 25 21:12:01 lola-5 kernel: Lustre: soaked-OST0007: deleting orphan objects from 0x5c0000405:1544139 to 0x5c0000405:1544513
      lola-8.log:Feb 25 21:12:01 lola-8 kernel: LustreError: 11-0: soaked-MDT0003-osp-MDT0001: operation out_update to node 192.168.1.109@o2ib10 failed: rc = -107
      lola-8.log:Feb 25 21:12:01 lola-8 kernel: LustreError: 167-0: soaked-MDT0003-osp-MDT0001: This client was evicted by soaked-MDT0003; in progress operations using this service will fail.
      lola-8.log:Feb 25 21:12:01 lola-8 kernel: Lustre: soaked-MDT0003-osp-MDT0000: Connection restored to 192.168.1.109@o2ib10 (at 192.168.1.109@o2ib10)
      lola-8.log:Feb 25 21:12:01 lola-8 kernel: Lustre: Skipped 2 previous similar messages
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961802 flags = 0 ops = 4 params = 7
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961809 flags = 0 ops = 4 params = 7
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961816 flags = 0 ops = 4 params = 7
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961822 flags = 0 ops = 4 params = 7
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366528 batchid = 73014961830 flags = 0 ops = 4 params = 7
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366576 batchid = 94491702660 flags = 0 ops = 53 params = 38
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366584 batchid = 94491702661 flags = 0 ops = 53 params = 38
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: LustreError: 4484:0:(update_records.c:72:update_records_dump()) master transno = 98785366585 batchid = 94491702662 flags = 0 ops = 53 params = 38
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: soaked-MDT0003: disconnecting 7 stale clients
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: 4484:0:(ldlm_lib.c:1586:abort_req_replay_queue()) @@@ aborted:  req@ffff88040a497c80 x1527085464220748/t0(98785366530) o101->bf8a1d5c-0dc5-b3c9-6b26-84d56ad880b2@192.168.1.126@o2ib100:585/0 lens 976/0 e 2 to 0 dl 1456463535 ref 1 fl Complete:/4/ffffffff rc 0/-1
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: 4484:0:(ldlm_lib.c:2011:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: soaked-MDT0002: Client 2cb76067-9b42-1736-a64a-e2cc0037f63b (at 192.168.1.132@o2ib100) reconnecting, waiting for 16 clients in recovery for 2:09
      lola-9.log:Feb 25 21:12:01 lola-9 kernel: Lustre: Skipped 5 previous similar messages
      

      Immediately before a restart of MDT lola-9 finished

      mds_restart      : 2016-02-25 20:59:00,754 - 2016-02-25 21:11:17,795    lola-9
      

      Attachments

        Activity

          [LU-7821] job crash in complex error scenario

          Close old issue that has not been hit again.

          adilger Andreas Dilger added a comment - Close old issue that has not been hit again.
          heckes Frank Heckes (Inactive) added a comment - - edited

          The clients didn't crash, but slurm jobs did (do).

          • I attached an extracted content ( - 1hour) from the messages files of the MDS nodes (lola-[8-11]) and clients (lola-[30-34] for first event. (--> logfiles *-20160225)
            I'm not sure about the relevance of the errors happened before for the problem
          • Same for second event added by Cliff (--> logfiles *-20160303)
          • Sorry, no debug logs had been written after the events happened containing items for the time intervals of interest.

          The error might be correlated with LU-7848 as the evictions always happens during or shortly after MDS restarts and failover.

          heckes Frank Heckes (Inactive) added a comment - - edited The clients didn't crash, but slurm jobs did (do). I attached an extracted content ( - 1hour) from the messages files of the MDS nodes ( lola- [8-11] ) and clients ( lola- [30-34] for first event. (--> logfiles *-20160225) I'm not sure about the relevance of the errors happened before for the problem Same for second event added by Cliff (--> logfiles *-20160303) Sorry, no debug logs had been written after the events happened containing items for the time intervals of interest. The error might be correlated with LU-7848 as the evictions always happens during or shortly after MDS restarts and failover.
          green Oleg Drokin added a comment -

          so of those 8 clients that did not rejoin - did they crash? any interesting logs on them? What about server logs before the eviction, anything interesting there?

          Unfortunately now we only know tht recovery failed because half of the clients failed to rejoin - but no idea why did that happen at all

          green Oleg Drokin added a comment - so of those 8 clients that did not rejoin - did they crash? any interesting logs on them? What about server logs before the eviction, anything interesting there? Unfortunately now we only know tht recovery failed because half of the clients failed to rejoin - but no idea why did that happen at all

          We are seeing this failure again on 2.8.0-RC4.
          Job errors:

          449195-mdtestfpp.out:03/03/2016 09:51:14: Process 0(lola-26.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
          449195-mdtestfpp.out:03/03/2016 09:51:14: Process 1(lola-26.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
          449195-mdtestfpp.out:03/03/2016 09:51:14: Process 6(lola-34.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error
          

          Server errors:

          lola-10.log:Mar  3 09:51:14 lola-10 kernel: LustreError: 4805:0:(llog_cat.c:712:llog_cat_cancel_records()) soaked-MDT0006-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116
          lola-11.log:Mar  3 09:51:14 lola-11 kernel: Lustre: 4655:0:(ldlm_lib.c:2001:target_recovery_overseer()) soaked-MDT0006 recovery is aborted by hard timeout
          lola-11.log:Mar  3 09:51:14 lola-11 kernel: Lustre: 4655:0:(ldlm_lib.c:2011:target_recovery_overseer()) recovery is aborted, evict exports in recovery
          lola-11.log:Mar  3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted.
          lola-16.log:Mar  3 09:51:14 lola-16 kernel: Lustre: soaked-MDT0006-mdc-ffff8807eb5cd400: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
          lola-16.log:Mar  3 09:51:14 lola-16 kernel: LustreError: 36131:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2
          lola-26.log:Mar  3 09:51:14 lola-26 kernel: Lustre: soaked-MDT0006-mdc-ffff88081ab31400: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10)
          lola-2.log:Mar  3 09:51:14 lola-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x400000406:4125908 to 0x400000406:4127809
          lola-2.log:Mar  3 09:51:14 lola-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x700000406:4144394 to 0x700000406:4147105
          
          cliffw Cliff White (Inactive) added a comment - We are seeing this failure again on 2.8.0-RC4. Job errors: 449195-mdtestfpp.out:03/03/2016 09:51:14: Process 0(lola-26.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error 449195-mdtestfpp.out:03/03/2016 09:51:14: Process 1(lola-26.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error 449195-mdtestfpp.out:03/03/2016 09:51:14: Process 6(lola-34.lola.whamcloud.com): FAILED in create_remove_items_helper, unable to create directory: Input/output error Server errors: lola-10.log:Mar 3 09:51:14 lola-10 kernel: LustreError: 4805:0:(llog_cat.c:712:llog_cat_cancel_records()) soaked-MDT0006-osp-MDT0005: fail to cancel 1 of 1 llog-records: rc = -116 lola-11.log:Mar 3 09:51:14 lola-11 kernel: Lustre: 4655:0:(ldlm_lib.c:2001:target_recovery_overseer()) soaked-MDT0006 recovery is aborted by hard timeout lola-11.log:Mar 3 09:51:14 lola-11 kernel: Lustre: 4655:0:(ldlm_lib.c:2011:target_recovery_overseer()) recovery is aborted, evict exports in recovery lola-11.log:Mar 3 09:51:14 lola-11 kernel: Lustre: soaked-MDT0006: Recovery over after 7:31, of 16 clients 8 recovered and 8 were evicted. lola-16.log:Mar 3 09:51:14 lola-16 kernel: Lustre: soaked-MDT0006-mdc-ffff8807eb5cd400: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10) lola-16.log:Mar 3 09:51:14 lola-16 kernel: LustreError: 36131:0:(llite_lib.c:2309:ll_prep_inode()) new_inode -fatal: rc -2 lola-26.log:Mar 3 09:51:14 lola-26 kernel: Lustre: soaked-MDT0006-mdc-ffff88081ab31400: Connection restored to 192.168.1.111@o2ib10 (at 192.168.1.111@o2ib10) lola-2.log:Mar 3 09:51:14 lola-2 kernel: Lustre: soaked-OST0000: deleting orphan objects from 0x400000406:4125908 to 0x400000406:4127809 lola-2.log:Mar 3 09:51:14 lola-2 kernel: Lustre: soaked-OST000c: deleting orphan objects from 0x700000406:4144394 to 0x700000406:4147105

          People

            wc-triage WC Triage
            heckes Frank Heckes (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: