
hung threads on MDT and MDT won't umount

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.13.0, Lustre 2.12.1
    • Lustre 2.10.4
    • None
    • x86_64, zfs, 3 MDTs, all on 1 MDS, 2.10.4 + many patches ~= 2.10.5 to 2.12
    • 2
    • 9223372036854775807

    Description

      Hi,

      unfortunately once again similar/same symptoms as LU-11082 and LU-11301.

      chgrp/chmod sweep across files and directories results in eventual total hang of the filesystem. hung MDT threads. one MDT won't umount. MDS has to be powered off to fix the fs.

      processes that are stuck on the client doing the sweep are

      root     142716  0.0  0.0 108252   116 pts/1    S    01:33   0:34 xargs -0 -n5 chgrp -h oz044
      root     236217  0.0  0.0 108252   116 pts/1    S    01:15   0:25 xargs -0 -n5 chgrp -h oz065
      root     385816  0.0  0.0 108252   116 pts/1    S    05:34   0:15 xargs -0 -n5 chgrp -h oz100
      root     418923  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz100 oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_ranked.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/catalogs/candidates_full.cat oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd5/images oz100/pipes/DWF_PIPE/MARY_WORK/Antlia_170206_msystembis5_8/ccd46
      root     418944  0.0  0.0 120512   136 pts/1    S    09:34   0:01 chgrp -h oz044 oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/msdriv.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grexec.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys_msdos/grdos.f oz044/mbernet/c_cpp/dust_prc/src/pgplot/makemake oz044/mbernet/c_cpp/dust_prc/src/pgplot/sys
      root     418947  0.0  0.0 120512   136 pts/1    S    09:34   0:00 chgrp -h oz065 oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/uniform/functionObjects/functionObjectProperties oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/alpha.water oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/Ur oz065/OpenFOAM/szhu-v1806/run/Deen/LES/run09_multiperforation_periodic/3x3/fine_GraceDrag_constantLift_defaultLES_probe2_u_0p005_ozstar/processor14/85/p...
      

      I can't see any rc=-116 in the logs this time.

      first hung thread is

      Sep 22 09:37:39 warble2 kernel: LNet: Service thread pid 458124 was inactive for 200.31s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Sep 22 09:37:39 warble2 kernel: Pid: 458124, comm: mdt01_095 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018
      Sep 22 09:37:39 warble2 kernel: Call Trace:
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc159c047>] top_trans_wait_result+0xa6/0x155 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc157d91b>] top_trans_stop+0x42b/0x930 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc16d65f9>] lod_trans_stop+0x259/0x340 [lod]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc177423a>] mdd_trans_stop+0x2a/0x46 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1769bcb>] mdd_attr_set+0x5eb/0xce0 [mdd]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff65f5>] mdt_reint_setattr+0xba5/0x1060 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0ff6b33>] mdt_reint_rec+0x83/0x210 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fd836b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc0fe3f07>] mdt_reint+0x67/0x140 [mdt]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc156a38a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1512e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffc1516592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb64bb621>] kthread+0xd1/0xe0
      Sep 22 09:37:39 warble2 kernel: [<ffffffffb6b205dd>] ret_from_fork_nospec_begin+0x7/0x21
      Sep 22 09:37:39 warble2 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
      Sep 22 09:37:39 warble2 kernel: LustreError: dumping log to /tmp/lustre-log.1537573059.458124
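When scanning a long syslog for the first stalled thread, the watchdog lines can be picked out mechanically. The following is a minimal Python sketch and not part of the original report; the function name `hung_threads` and the regular expression are illustrative assumptions based only on the message format shown above.

```python
import re

# Matches the watchdog line printed when a service thread stalls, e.g.
# "... kernel: LNet: Service thread pid 458124 was inactive for 200.31s."
HUNG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"LNet: Service thread pid (?P<pid>\d+) was inactive for (?P<secs>[\d.]+)s"
)


def hung_threads(lines):
    """Return (timestamp, host, pid, inactive_seconds) per watchdog line, in input order."""
    found = []
    for line in lines:
        m = HUNG_RE.match(line.strip())
        if m:
            found.append((m["stamp"], m["host"], int(m["pid"]), float(m["secs"])))
    return found
```

Because results keep the input order, the first element corresponds to the first reported hung thread when the input is a chronologically ordered syslog.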
      

      there was a subnet manager crash and restart about 15 minutes before the MDS threads hung this time, but I don't think that's related.

      first lustre-log for warble2 and syslog for the cluster are attached.

      I also did a sysrq 't' and 'w' before resetting warble2, so that may be of help to you.
      those start at
      Sep 22 16:26:15
      in messages.

      please let us know if you'd like anything else.
      would a kernel crashdump help?
      we are getting closer to being able to capture one of these.

      cheers,
      robin

          Activity

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34326/
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 6a412a8671d3d76b5da55c08ada011e7aeea1e8c

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34327
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 694a92ec774d5bd958a61f457fc64380feb95db2

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34326
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: ec0944fc22a351b44332984050606f0efb1d3b63

            laisiyao Lai Siyao added a comment -

            Peter, it's tracked under LU-11681. Once the patches pass review, I'll backport them to 2.10.

            pjones Peter Jones added a comment -

            The existing patch has landed for 2.13 and could now potentially be included in 2.10.x or 2.12.x maintenance releases. Lai, if you're still working on a further patch, what ticket is it being tracked under?

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33661/
            Subject: LU-11418 mdd: delete name if orphan doesn't exist
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: fffef5c29e3bdf0f96168abc3d0488bad06f33bb

            laisiyao Lai Siyao added a comment -

            Robin, I'm still working on a patch to fix bugs in striped-directory lfsck, and I'll update you when it's finished. I think it may help, because many of the inconsistencies in your system are in striped directories. Also, the next time you think the system has stalled, can you use sysrq to dump backtraces of all processes on all servers? It also helps to dump the debug logs.
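For reference, the sysrq dumps requested here can be triggered without a console keyboard by writing the command letter to `/proc/sysrq-trigger` as root; the output goes to the kernel log. Below is a minimal Python sketch assuming a Linux server. The helper name `trigger_sysrq` is hypothetical, and `t` (all tasks) and `w` (blocked tasks) are the two letters mentioned in the description.

```python
def trigger_sysrq(keys, trigger_path="/proc/sysrq-trigger"):
    """Write each sysrq command letter to the trigger file, one write per letter.

    Must run as root on the server being examined. The path is a parameter
    only so the function can be exercised against an ordinary file.
    """
    written = []
    for key in keys:
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        written.append(key)
    return written


# Example (as root): dump all task backtraces, then blocked tasks.
# trigger_sysrq("tw")
```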

            scadmin SC Admin added a comment -

            Hi Peter and Lai,

            not sure if it's good or bad
            the lfsck just takes a while 'cos I set it to a slow rate...

            this is namespace which has completed now

            [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags: inconsistent
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1544252914
            time_since_last_completed: 75442 seconds
            latest_start_time: 1544075850
            time_since_latest_start: 252506 seconds
            last_checkpoint_time: 1544252914
            time_since_last_checkpoint: 75442 seconds
            latest_start_position: 266, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: 5603849, [0x200011573:0x1:0x0], 0x3d0b02e56e00000
            checked_phase1: 150469049
            checked_phase2: 21386549
            inconsistent_phase1: 6
            inconsistent_phase2: 22
            failed_phase1: 0
            failed_phase2: 1
            directories: 10159430
            dirent_inconsistent: 0
            linkea_inconsistent: 0
            nlinks_inconsistent: 0
            multiple_linked_checked: 886537
            multiple_linked_inconsistent: 0
            unknown_inconsistency: 0
            unmatched_pairs_inconsistent: 0
            dangling_inconsistent: 35
            multiple_referenced_inconsistent: 0
            bad_file_type_inconsistent: 0
            lost_dirent_inconsistent: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 1
            striped_dirs_inconsistent: 1
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 4492951
            striped_shards_inconsistent: 0
            striped_shards_failed: 0
            striped_shards_skipped: 261
            name_hash_inconsistent: 0
            linkea_overflow_inconsistent: 0
            success_count: 14
            run_time_phase1: 164481 seconds
            run_time_phase2: 12404 seconds
            average_speed_phase1: 914 items/sec
            average_speed_phase2: 1724 objs/sec
            average_speed_total: 971 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags: inconsistent
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1544251904
            time_since_last_completed: 76452 seconds
            latest_start_time: 1544075851
            time_since_latest_start: 252505 seconds
            last_checkpoint_time: 1544251904
            time_since_last_checkpoint: 76452 seconds
            latest_start_position: 266, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: 557060, [0x280020f73:0x1:0x0], 0x198ca8b508be8000
            checked_phase1: 48829719
            checked_phase2: 20829132
            inconsistent_phase1: 5
            inconsistent_phase2: 0
            failed_phase1: 0
            failed_phase2: 2
            directories: 5965672
            dirent_inconsistent: 0
            linkea_inconsistent: 0
            nlinks_inconsistent: 0
            multiple_linked_checked: 478400
            multiple_linked_inconsistent: 0
            unknown_inconsistency: 0
            unmatched_pairs_inconsistent: 0
            dangling_inconsistent: 163
            multiple_referenced_inconsistent: 0
            bad_file_type_inconsistent: 0
            lost_dirent_inconsistent: 0
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 2
            striped_dirs_inconsistent: 2
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 4474054
            striped_shards_inconsistent: 0
            striped_shards_failed: 0
            striped_shards_skipped: 261
            name_hash_inconsistent: 0
            linkea_overflow_inconsistent: 0
            success_count: 18
            run_time_phase1: 56418 seconds
            run_time_phase2: 11394 seconds
            average_speed_phase1: 865 items/sec
            average_speed_phase2: 1828 objs/sec
            average_speed_total: 1027 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_namespace
            magic: 0xa0621a0b
            version: 2
            status: completed
            flags: inconsistent
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1544251913
            time_since_last_completed: 76443 seconds
            latest_start_time: 1544075851
            time_since_latest_start: 252505 seconds
            last_checkpoint_time: 1544251913
            time_since_last_checkpoint: 76443 seconds
            latest_start_position: 266, N/A, N/A
            last_checkpoint_position: 35184372088832, N/A, N/A
            first_failure_position: 24644277, [0x68003ff89:0x20:0x0], 0x1a11d7ef27b38000
            checked_phase1: 49406594
            checked_phase2: 20846709
            inconsistent_phase1: 26
            inconsistent_phase2: 1
            failed_phase1: 0
            failed_phase2: 1
            directories: 5984370
            dirent_inconsistent: 0
            linkea_inconsistent: 0
            nlinks_inconsistent: 0
            multiple_linked_checked: 479388
            multiple_linked_inconsistent: 0
            unknown_inconsistency: 0
            unmatched_pairs_inconsistent: 0
            dangling_inconsistent: 160
            multiple_referenced_inconsistent: 0
            bad_file_type_inconsistent: 0
            lost_dirent_inconsistent: 1
            local_lost_found_scanned: 0
            local_lost_found_moved: 0
            local_lost_found_skipped: 0
            local_lost_found_failed: 0
            striped_dirs_scanned: 1
            striped_dirs_inconsistent: 1
            striped_dirs_failed: 0
            striped_dirs_disabled: 0
            striped_dirs_skipped: 0
            striped_shards_scanned: 4474199
            striped_shards_inconsistent: 0
            striped_shards_failed: 0
            striped_shards_skipped: 261
            name_hash_inconsistent: 0
            linkea_overflow_inconsistent: 0
            success_count: 18
            run_time_phase1: 57069 seconds
            run_time_phase2: 11403 seconds
            average_speed_phase1: 865 items/sec
            average_speed_phase2: 1828 objs/sec
            average_speed_total: 1026 items/sec
            real_time_speed_phase1: N/A
            real_time_speed_phase2: N/A
            current_position: N/A
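With three targets and several dozen counters each, the non-zero problem counters are easy to miss. The sketch below is one possible way to pull them out of the concatenated output above. It is illustrative only: the function name `nonzero_problem_counters` is an assumption, as is the rule that a "problem" counter is any field containing `inconsistent` or starting with `failed_`.

```python
def nonzero_problem_counters(text):
    """Return one {counter: value} dict per target, keeping only non-zero
    counters whose name contains "inconsistent" or starts with "failed_"."""
    per_target = []
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        value = value.strip()
        if not sep:
            continue  # blank line or anything that is not "key: value"
        if key == "name":  # each target's record begins with a `name:` line
            per_target.append({})
            continue
        is_problem = "inconsistent" in key or key.startswith("failed_")
        if per_target and is_problem and value.isdigit() and int(value) > 0:
            per_target[-1][key] = int(value)
    return per_target
```

The `flags: inconsistent` line is skipped because its key is `flags` and its value is not numeric.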
            

            layout is still running. I think it's stuck

            [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_layout
            name: lfsck_layout
            magic: 0xb17371b9
            version: 2
            status: scanning-phase2
            flags: scanned-once
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1541789825
            time_since_last_completed: 2538610 seconds
            latest_start_time: 1544075847
            time_since_latest_start: 252588 seconds
            last_checkpoint_time: 1544240510
            time_since_last_checkpoint: 87925 seconds
            latest_start_position: 266
            last_checkpoint_position: 35184372088832
            first_failure_position: 436180
            success_count: 1
            inconsistent_dangling: 0
            inconsistent_unmatched_pair: 0
            inconsistent_multiple_referenced: 0
            inconsistent_orphan: 0
            inconsistent_inconsistent_owner: 532160
            inconsistent_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 140103851
            checked_phase2: 0
            run_time_phase1: 164481 seconds
            run_time_phase2: 87926 seconds
            average_speed_phase1: 851 items/sec
            average_speed_phase2: 0 items/sec
            real-time_speed_phase1: N/A
            real-time_speed_phase2: 0 items/sec
            current_position: [0x0:0x0:0x0]
            name: lfsck_layout
            magic: 0xb17371b9
            version: 2
            status: scanning-phase2
            flags: scanned-once
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1541789834
            time_since_last_completed: 2538601 seconds
            latest_start_time: 1544075851
            time_since_latest_start: 252584 seconds
            last_checkpoint_time: 1544132317
            time_since_last_checkpoint: 196118 seconds
            latest_start_position: 266
            last_checkpoint_position: 35184372088832
            first_failure_position: 100570
            success_count: 1
            inconsistent_dangling: 0
            inconsistent_unmatched_pair: 0
            inconsistent_multiple_referenced: 0
            inconsistent_orphan: 0
            inconsistent_inconsistent_owner: 519932
            inconsistent_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 42500398
            checked_phase2: 0
            run_time_phase1: 56418 seconds
            run_time_phase2: 196119 seconds
            average_speed_phase1: 753 items/sec
            average_speed_phase2: 0 items/sec
            real-time_speed_phase1: N/A
            real-time_speed_phase2: 0 items/sec
            current_position: [0x0:0x0:0x0]
            name: lfsck_layout
            magic: 0xb17371b9
            version: 2
            status: scanning-phase2
            flags: scanned-once
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1541789846
            time_since_last_completed: 2538589 seconds
            latest_start_time: 1544075851
            time_since_latest_start: 252584 seconds
            last_checkpoint_time: 1544132971
            time_since_last_checkpoint: 195464 seconds
            latest_start_position: 266
            last_checkpoint_position: 35184372088832
            first_failure_position: 103878
            success_count: 1
            inconsistent_dangling: 2
            inconsistent_unmatched_pair: 0
            inconsistent_multiple_referenced: 0
            inconsistent_orphan: 0
            inconsistent_inconsistent_owner: 521695
            inconsistent_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 43095736
            checked_phase2: 0
            run_time_phase1: 57069 seconds
            run_time_phase2: 195464 seconds
            average_speed_phase1: 755 items/sec
            average_speed_phase2: 0 items/sec
            real-time_speed_phase1: N/A
            real-time_speed_phase2: 0 items/sec
            current_position: [0x0:0x0:0x0]
            

            the only thing changing in the output above is the time counters. the lfsck/lfsck_layout processes are still there but not showing up as being active in 'top'.
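The observation above (only the time counters advance) can be checked mechanically by comparing two snapshots of the `lctl get_param` output taken some minutes apart. A minimal Python sketch follows. The helper names `parse_targets` and `progress_keys`, and the choice to treat every `time_since_*` and `run_time_*` field as a pure clock counter, are assumptions made for illustration.

```python
def parse_targets(text):
    """Split concatenated lfsck status output into one dict per target.

    A new record starts at every `name:` line; lines before the first
    `name:` (such as a shell prompt) are ignored.
    """
    targets = []
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # blank line or anything that is not "key: value"
        if key == "name":
            targets.append({})
        if targets:
            targets[-1][key] = value.strip()
    return targets


# Fields that advance with wall-clock time even when no work is done.
CLOCK_PREFIXES = ("time_since_", "run_time_")


def progress_keys(before, after):
    """Return the sorted keys that differ between two snapshots of one target,
    ignoring the clock counters."""
    return sorted(
        key
        for key in before.keys() | after.keys()
        if not key.startswith(CLOCK_PREFIXES) and before.get(key) != after.get(key)
    )
```

An empty result for a target still in `scanning-phase2` would be consistent with the scan making no progress between the two snapshots, although it does not by itself prove a hang.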

            I've stopped it now ->

            [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_layout
            name: lfsck_layout
            magic: 0xb17371b9
            version: 2
            status: stopped
            flags: scanned-once
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1541789825
            time_since_last_completed: 2539497 seconds
            latest_start_time: 1544075847
            time_since_latest_start: 253475 seconds
            last_checkpoint_time: 1544329255
            time_since_last_checkpoint: 67 seconds
            latest_start_position: 266
            last_checkpoint_position: 35184372088832
            first_failure_position: 436180
            success_count: 1
            inconsistent_dangling: 0
            inconsistent_unmatched_pair: 0
            inconsistent_multiple_referenced: 0
            inconsistent_orphan: 0
            inconsistent_inconsistent_owner: 532160
            inconsistent_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 140103851
            checked_phase2: 0
            run_time_phase1: 164481 seconds
            run_time_phase2: 88745 seconds
            average_speed_phase1: 851 items/sec
            average_speed_phase2: 0 objs/sec
            real-time_speed_phase1: N/A
            real-time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_layout
            magic: 0xb17371b9
            version: 2
            status: stopped
            flags: scanned-once
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1541789834
            time_since_last_completed: 2539488 seconds
            latest_start_time: 1544075851
            time_since_latest_start: 253471 seconds
            last_checkpoint_time: 1544329255
            time_since_last_checkpoint: 67 seconds
            latest_start_position: 266
            last_checkpoint_position: 35184372088832
            first_failure_position: 100570
            success_count: 1
            inconsistent_dangling: 0
            inconsistent_unmatched_pair: 0
            inconsistent_multiple_referenced: 0
            inconsistent_orphan: 0
            inconsistent_inconsistent_owner: 519932
            inconsistent_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 42500398
            checked_phase2: 0
            run_time_phase1: 56418 seconds
            run_time_phase2: 196938 seconds
            average_speed_phase1: 753 items/sec
            average_speed_phase2: 0 objs/sec
            real-time_speed_phase1: N/A
            real-time_speed_phase2: N/A
            current_position: N/A
            name: lfsck_layout
            magic: 0xb17371b9
            version: 2
            status: stopped
            flags: scanned-once
            param: dryrun,all_targets,create_mdtobj
            last_completed_time: 1541789846
            time_since_last_completed: 2539476 seconds
            latest_start_time: 1544075851
            time_since_latest_start: 253471 seconds
            last_checkpoint_time: 1544329255
            time_since_last_checkpoint: 67 seconds
            latest_start_position: 266
            last_checkpoint_position: 35184372088832
            first_failure_position: 103878
            success_count: 1
            inconsistent_dangling: 2
            inconsistent_unmatched_pair: 0
            inconsistent_multiple_referenced: 0
            inconsistent_orphan: 0
            inconsistent_inconsistent_owner: 521695
            inconsistent_others: 0
            skipped: 0
            failed_phase1: 0
            failed_phase2: 0
            checked_phase1: 43095736
            checked_phase2: 0
            run_time_phase1: 57069 seconds
            run_time_phase2: 196283 seconds
            average_speed_phase1: 755 items/sec
            average_speed_phase2: 0 objs/sec
            real-time_speed_phase1: N/A
            real-time_speed_phase2: N/A
            current_position: N/A
            

            I should point out that we're still mounting these 3 MDTs with -o skip_lfsck.
            should we mount them in the normal way first, before doing these lfsck's? I'm not sure of the difference.

            cheers,
            robin

            scadmin SC Admin added a comment - Hi Peter and Lai, not sure if it's good or bad the lfsck just takes a while 'cos I set it to a slow rate... this is namespace which has completed now [warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_namespace name: lfsck_namespace magic: 0xa0621a0b version: 2 status: completed flags: inconsistent param: dryrun,all_targets,create_mdtobj last_completed_time: 1544252914 time_since_last_completed: 75442 seconds latest_start_time: 1544075850 time_since_latest_start: 252506 seconds last_checkpoint_time: 1544252914 time_since_last_checkpoint: 75442 seconds latest_start_position: 266, N/A, N/A last_checkpoint_position: 35184372088832, N/A, N/A first_failure_position: 5603849, [0x200011573:0x1:0x0], 0x3d0b02e56e00000 checked_phase1: 150469049 checked_phase2: 21386549 inconsistent_phase1: 6 inconsistent_phase2: 22 failed_phase1: 0 failed_phase2: 1 directories: 10159430 dirent_inconsistent: 0 linkea_inconsistent: 0 nlinks_inconsistent: 0 multiple_linked_checked: 886537 multiple_linked_inconsistent: 0 unknown_inconsistency: 0 unmatched_pairs_inconsistent: 0 dangling_inconsistent: 35 multiple_referenced_inconsistent: 0 bad_file_type_inconsistent: 0 lost_dirent_inconsistent: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 1 striped_dirs_inconsistent: 1 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 4492951 striped_shards_inconsistent: 0 striped_shards_failed: 0 striped_shards_skipped: 261 name_hash_inconsistent: 0 linkea_overflow_inconsistent: 0 success_count: 14 run_time_phase1: 164481 seconds run_time_phase2: 12404 seconds average_speed_phase1: 914 items/sec average_speed_phase2: 1724 objs/sec average_speed_total: 971 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A name: lfsck_namespace magic: 0xa0621a0b version: 2 status: completed flags: inconsistent 
param: dryrun,all_targets,create_mdtobj last_completed_time: 1544251904 time_since_last_completed: 76452 seconds latest_start_time: 1544075851 time_since_latest_start: 252505 seconds last_checkpoint_time: 1544251904 time_since_last_checkpoint: 76452 seconds latest_start_position: 266, N/A, N/A last_checkpoint_position: 35184372088832, N/A, N/A first_failure_position: 557060, [0x280020f73:0x1:0x0], 0x198ca8b508be8000 checked_phase1: 48829719 checked_phase2: 20829132 inconsistent_phase1: 5 inconsistent_phase2: 0 failed_phase1: 0 failed_phase2: 2 directories: 5965672 dirent_inconsistent: 0 linkea_inconsistent: 0 nlinks_inconsistent: 0 multiple_linked_checked: 478400 multiple_linked_inconsistent: 0 unknown_inconsistency: 0 unmatched_pairs_inconsistent: 0 dangling_inconsistent: 163 multiple_referenced_inconsistent: 0 bad_file_type_inconsistent: 0 lost_dirent_inconsistent: 0 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 2 striped_dirs_inconsistent: 2 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 4474054 striped_shards_inconsistent: 0 striped_shards_failed: 0 striped_shards_skipped: 261 name_hash_inconsistent: 0 linkea_overflow_inconsistent: 0 success_count: 18 run_time_phase1: 56418 seconds run_time_phase2: 11394 seconds average_speed_phase1: 865 items/sec average_speed_phase2: 1828 objs/sec average_speed_total: 1027 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A name: lfsck_namespace magic: 0xa0621a0b version: 2 status: completed flags: inconsistent param: dryrun,all_targets,create_mdtobj last_completed_time: 1544251913 time_since_last_completed: 76443 seconds latest_start_time: 1544075851 time_since_latest_start: 252505 seconds last_checkpoint_time: 1544251913 time_since_last_checkpoint: 76443 seconds latest_start_position: 266, N/A, N/A last_checkpoint_position: 35184372088832, N/A, N/A 
first_failure_position: 24644277, [0x68003ff89:0x20:0x0], 0x1a11d7ef27b38000 checked_phase1: 49406594 checked_phase2: 20846709 inconsistent_phase1: 26 inconsistent_phase2: 1 failed_phase1: 0 failed_phase2: 1 directories: 5984370 dirent_inconsistent: 0 linkea_inconsistent: 0 nlinks_inconsistent: 0 multiple_linked_checked: 479388 multiple_linked_inconsistent: 0 unknown_inconsistency: 0 unmatched_pairs_inconsistent: 0 dangling_inconsistent: 160 multiple_referenced_inconsistent: 0 bad_file_type_inconsistent: 0 lost_dirent_inconsistent: 1 local_lost_found_scanned: 0 local_lost_found_moved: 0 local_lost_found_skipped: 0 local_lost_found_failed: 0 striped_dirs_scanned: 1 striped_dirs_inconsistent: 1 striped_dirs_failed: 0 striped_dirs_disabled: 0 striped_dirs_skipped: 0 striped_shards_scanned: 4474199 striped_shards_inconsistent: 0 striped_shards_failed: 0 striped_shards_skipped: 261 name_hash_inconsistent: 0 linkea_overflow_inconsistent: 0 success_count: 18 run_time_phase1: 57069 seconds run_time_phase2: 11403 seconds average_speed_phase1: 865 items/sec average_speed_phase2: 1828 objs/sec average_speed_total: 1026 items/sec real_time_speed_phase1: N/A real_time_speed_phase2: N/A current_position: N/A layout is still running. 
I think it's stuck

[warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_layout
name: lfsck_layout
magic: 0xb17371b9
version: 2
status: scanning-phase2
flags: scanned-once
param: dryrun,all_targets,create_mdtobj
last_completed_time: 1541789825
time_since_last_completed: 2538610 seconds
latest_start_time: 1544075847
time_since_latest_start: 252588 seconds
last_checkpoint_time: 1544240510
time_since_last_checkpoint: 87925 seconds
latest_start_position: 266
last_checkpoint_position: 35184372088832
first_failure_position: 436180
success_count: 1
inconsistent_dangling: 0
inconsistent_unmatched_pair: 0
inconsistent_multiple_referenced: 0
inconsistent_orphan: 0
inconsistent_inconsistent_owner: 532160
inconsistent_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 140103851
checked_phase2: 0
run_time_phase1: 164481 seconds
run_time_phase2: 87926 seconds
average_speed_phase1: 851 items/sec
average_speed_phase2: 0 items/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: 0 items/sec
current_position: [0x0:0x0:0x0]

name: lfsck_layout
magic: 0xb17371b9
version: 2
status: scanning-phase2
flags: scanned-once
param: dryrun,all_targets,create_mdtobj
last_completed_time: 1541789834
time_since_last_completed: 2538601 seconds
latest_start_time: 1544075851
time_since_latest_start: 252584 seconds
last_checkpoint_time: 1544132317
time_since_last_checkpoint: 196118 seconds
latest_start_position: 266
last_checkpoint_position: 35184372088832
first_failure_position: 100570
success_count: 1
inconsistent_dangling: 0
inconsistent_unmatched_pair: 0
inconsistent_multiple_referenced: 0
inconsistent_orphan: 0
inconsistent_inconsistent_owner: 519932
inconsistent_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 42500398
checked_phase2: 0
run_time_phase1: 56418 seconds
run_time_phase2: 196119 seconds
average_speed_phase1: 753 items/sec
average_speed_phase2: 0 items/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: 0 items/sec
current_position: [0x0:0x0:0x0]

name: lfsck_layout
magic: 0xb17371b9
version: 2
status: scanning-phase2
flags: scanned-once
param: dryrun,all_targets,create_mdtobj
last_completed_time: 1541789846
time_since_last_completed: 2538589 seconds
latest_start_time: 1544075851
time_since_latest_start: 252584 seconds
last_checkpoint_time: 1544132971
time_since_last_checkpoint: 195464 seconds
latest_start_position: 266
last_checkpoint_position: 35184372088832
first_failure_position: 103878
success_count: 1
inconsistent_dangling: 2
inconsistent_unmatched_pair: 0
inconsistent_multiple_referenced: 0
inconsistent_orphan: 0
inconsistent_inconsistent_owner: 521695
inconsistent_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 43095736
checked_phase2: 0
run_time_phase1: 57069 seconds
run_time_phase2: 195464 seconds
average_speed_phase1: 755 items/sec
average_speed_phase2: 0 items/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: 0 items/sec
current_position: [0x0:0x0:0x0]

the only thing changing in the output above is the time counters.
the lfsck/lfsck_layout processes are still there but not showing up as being active in 'top'.
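The "only the time counters change" check can be automated by comparing two snapshots of the `lctl get_param -n mdd.dagg-MDT000*.lfsck_layout` output taken some minutes apart. The following is a minimal sketch, not part of Lustre: the function names are hypothetical, and the choice of which fields count as time counters (`time_since_*` and `run_time_*`) is an assumption based only on the output quoted above.

```python
import re

# Assumption: these prefixes cover the fields that tick with wall-clock time
# even when LFSCK is making no progress (taken from the output quoted above).
TIME_FIELDS = re.compile(r"^(time_since_|run_time_)")


def parse_blocks(text):
    """Split lfsck_layout output into one dict per MDT (a block starts at 'name:')."""
    blocks = []
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # blank line, shell prompt, or a key with no value
        if key == "name":
            blocks.append({})
        if blocks:
            blocks[-1][key] = value.strip()
    return blocks


def progress_fields(block):
    """Drop the wall-clock counters, keep everything that reflects real progress."""
    return {k: v for k, v in block.items() if not TIME_FIELDS.match(k)}


def looks_stuck(before, after):
    """True if every non-time field is identical across the two snapshots."""
    a = [progress_fields(b) for b in parse_blocks(before)]
    b = [progress_fields(b) for b in parse_blocks(after)]
    return bool(a) and a == b
```

Feeding it two captures of the command output as strings returns `True` when nothing except the time counters has moved. This only shows that the counters are static between the two samples; it cannot tell a hung thread apart from a very slow phase 2.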
I've stopped it now ->

[warble2]root: lctl get_param -n mdd.dagg-MDT000*.lfsck_layout
name: lfsck_layout
magic: 0xb17371b9
version: 2
status: stopped
flags: scanned-once
param: dryrun,all_targets,create_mdtobj
last_completed_time: 1541789825
time_since_last_completed: 2539497 seconds
latest_start_time: 1544075847
time_since_latest_start: 253475 seconds
last_checkpoint_time: 1544329255
time_since_last_checkpoint: 67 seconds
latest_start_position: 266
last_checkpoint_position: 35184372088832
first_failure_position: 436180
success_count: 1
inconsistent_dangling: 0
inconsistent_unmatched_pair: 0
inconsistent_multiple_referenced: 0
inconsistent_orphan: 0
inconsistent_inconsistent_owner: 532160
inconsistent_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 140103851
checked_phase2: 0
run_time_phase1: 164481 seconds
run_time_phase2: 88745 seconds
average_speed_phase1: 851 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

name: lfsck_layout
magic: 0xb17371b9
version: 2
status: stopped
flags: scanned-once
param: dryrun,all_targets,create_mdtobj
last_completed_time: 1541789834
time_since_last_completed: 2539488 seconds
latest_start_time: 1544075851
time_since_latest_start: 253471 seconds
last_checkpoint_time: 1544329255
time_since_last_checkpoint: 67 seconds
latest_start_position: 266
last_checkpoint_position: 35184372088832
first_failure_position: 100570
success_count: 1
inconsistent_dangling: 0
inconsistent_unmatched_pair: 0
inconsistent_multiple_referenced: 0
inconsistent_orphan: 0
inconsistent_inconsistent_owner: 519932
inconsistent_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 42500398
checked_phase2: 0
run_time_phase1: 56418 seconds
run_time_phase2: 196938 seconds
average_speed_phase1: 753 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

name: lfsck_layout
magic: 0xb17371b9
version: 2
status: stopped
flags: scanned-once
param: dryrun,all_targets,create_mdtobj
last_completed_time: 1541789846
time_since_last_completed: 2539476 seconds
latest_start_time: 1544075851
time_since_latest_start: 253471 seconds
last_checkpoint_time: 1544329255
time_since_last_checkpoint: 67 seconds
latest_start_position: 266
last_checkpoint_position: 35184372088832
first_failure_position: 103878
success_count: 1
inconsistent_dangling: 2
inconsistent_unmatched_pair: 0
inconsistent_multiple_referenced: 0
inconsistent_orphan: 0
inconsistent_inconsistent_owner: 521695
inconsistent_others: 0
skipped: 0
failed_phase1: 0
failed_phase2: 0
checked_phase1: 43095736
checked_phase2: 0
run_time_phase1: 57069 seconds
run_time_phase2: 196283 seconds
average_speed_phase1: 755 items/sec
average_speed_phase2: 0 objs/sec
real-time_speed_phase1: N/A
real-time_speed_phase2: N/A
current_position: N/A

I should point out that we're still mounting these 3 MDTs with -o skip_lfsck.
should we mount them in the normal way first, before doing these lfsck's? I'm not sure of the difference.

cheers,
robin
            pjones Peter Jones added a comment -

            Robin

            Is no news good news?

            Peter

            scadmin SC Admin added a comment -

            Hi Lai,

            lustre's been stable for the last 2 weeks since that patch.
            all OSS's were rebooted to match the MDS version about 10 days ago.
            so looking really good.

            I haven't run the lfsck yet, or explicitly tried to run stress tests to try to break anything.
            just enjoying the stability for a while

            I'll kick off a lfsck dry run today.

            cheers,
            robin


            People

              laisiyao Lai Siyao
              scadmin SC Admin