Lustre / LU-10851

parallel-scale-nfsv4 hangs on unmount

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.14.0, Lustre 2.12.5, Lustre 2.12.6, Lustre 2.15.0

    Description

      parallel-scale-nfsv4 hangs on unmount after all tests have run. In the suite_log, the last thing we see is

      == parallel-scale-nfsv4 test complete, duration 2088 sec ============================================= 22:07:24 (1521868044)
       
      Unmounting NFS clients...
      CMD: trevis-8vm1,trevis-8vm2 umount -f /mnt/lustre
       
      Unexporting Lustre filesystem...
      CMD: trevis-8vm1,trevis-8vm2 chkconfig --list rpcidmapd 2>/dev/null |
                                          grep -q rpcidmapd && service rpcidmapd stop ||
                                          true
      CMD: trevis-8vm4 { [[ -e /etc/SuSE-release ]] &&
                                             service nfsserver stop; } ||
                                             service nfs stop
      CMD: trevis-8vm4 sed -i '/^lustre/d' /etc/exports
      CMD: trevis-8vm4 exportfs -v
      CMD: trevis-8vm4 grep -c /mnt/lustre' ' /proc/mounts
      Stopping client trevis-8vm4 /mnt/lustre (opts:-f)
      CMD: trevis-8vm4 lsof -t /mnt/lustre
      CMD: trevis-8vm4 umount -f /mnt/lustre 2>&1 
      

      Looking at the console logs for vm4, MDS1 and 3, we see

      [ 2216.385890] Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test complete, duration 2088 sec ============================================= 22:07:24 (1521868044)
      [ 2216.698201] Lustre: DEBUG MARKER: { [[ -e /etc/SuSE-release ]] &&
      [ 2216.698201]                                service nfsserver stop; } ||
      [ 2216.698201]                                service nfs stop
      [ 2216.805093] nfsd: last server has exited, flushing export cache
      [ 2216.819487] Lustre: DEBUG MARKER: sed -i '/^lustre/d' /etc/exports
      [ 2216.885266] Lustre: DEBUG MARKER: exportfs -v
      [ 2216.945098] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
      [ 2216.982526] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
      [ 2217.170422] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
      [ 2217.192827] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
      [ 2217.193476] LustreError: 410:0:(file.c:205:ll_close_inode_openhandle()) lustre-clilmv-ffff880060b4e800: inode [0x200000406:0x3c1b:0x0] mdc close failed: rc = -108
      [ 2217.218709] Lustre: 4066:0:(llite_lib.c:2676:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.84@tcp:/lustre/fid: [0x200000406:0x3e42:0x0]/ may get corrupted (rc -108)
      [ 2217.218732] Lustre: 4066:0:(llite_lib.c:2676:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.4.84@tcp:/lustre/fid: [0x200000406:0x3e7b:0x0]/ may get corrupted (rc -108)
      …
      [ 5541.474664]
      [ 5541.474667] umount          D 0000000000000000     0   410    409 0x00000000
      [ 5541.474669]  ffff88004365fda8 ffff88004365fde0 ffff880048e5ce00 ffff880043660000
      [ 5541.474670]  ffff88004365fde0 000000010013feb9 ffff88007fc10840 0000000000000000
      [ 5541.474671]  ffff88004365fdc0 ffffffff81612a95 ffff88007fc10840 ffff88004365fe68
      [ 5541.474672] Call Trace:
      [ 5541.474674]  [<ffffffff81612a95>] schedule+0x35/0x80
      [ 5541.474677]  [<ffffffff81615851>] schedule_timeout+0x161/0x2d0
      [ 5541.474689]  [<ffffffffa1457cc7>] ll_kill_super+0x77/0x150 [lustre]
      [ 5541.474723]  [<ffffffffa09f3a94>] lustre_kill_super+0x34/0x40 [obdclass]
      [ 5541.474734]  [<ffffffff8120cf5f>] deactivate_locked_super+0x3f/0x70
      [ 5541.474742]  [<ffffffff812283fb>] cleanup_mnt+0x3b/0x80
      [ 5541.474745]  [<ffffffff8109d198>] task_work_run+0x78/0x90
      [ 5541.474748]  [<ffffffff8107b5cf>] exit_to_usermode_loop+0x91/0xc2
      [ 5541.474760]  [<ffffffff81003ae5>] syscall_return_slowpath+0x85/0xa0
      [ 5541.474768]  [<ffffffff81616ca7>] int_ret_from_sys_call+0x25/0x9f
      [ 5541.476903] DWARF2 unwinder stuck at int_ret_from_sys_call+0x25/0x9f
      [ 5541.476904] 
      

      We see this problem with unmount on the master and b2_10 branches for SLES12 SP2 and SP3 testing only.

      Logs for the test suite failures are at

      https://testing.whamcloud.com/test_sets/4bce5a66-2f2f-11e8-9e0e-52540065bddc 

      https://testing.whamcloud.com/test_sets/103f280e-2fac-11e8-b3c6-52540065bddc

      https://testing.whamcloud.com/test_sets/044a75f0-2eba-11e8-b6a0-52540065bddc

      Attachments

      Issue Links

      Activity

            [LU-10851] parallel-scale-nfsv4 hangs on unmount
            Deiter Alex Deiter added a comment - - edited

            Hello adilger,

            Alex, can you please test if removing the "-f" allows this testing to pass more reliably?

            Done, but result is the same: https://testing.whamcloud.com/test_sessions/related?job=lustre-reviews&build=91221#redirect

            [ 1737.450139] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test complete, duration 1480 sec ========================================================== 23:08:24 (1671145704)
            [ 1738.826052] Lustre: DEBUG MARKER: systemctl stop nfs-server
            [ 1739.068162] nfsd: last server has exited, flushing export cache
            [ 1739.311751] Lustre: DEBUG MARKER: sed -i '\|^/mnt/lustre|d' /etc/exports
            [ 1739.702372] Lustre: DEBUG MARKER: exportfs -v
            [ 1740.089608] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
            [ 1740.448974] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
            [ 1740.993566] Lustre: DEBUG MARKER: umount  /mnt/lustre 2>&1
            [ 5093.425686] sysrq: SysRq : Changing Loglevel
            ...
            [ 5096.722623] task:umount          state:D stack:    0 pid:56280 ppid: 56279 flags:0x00004080
            [ 5096.724108] Call Trace:
            [ 5096.724610]  __schedule+0x2bd/0x760
            [ 5096.725284]  schedule+0x37/0xa0
            [ 5096.725903]  schedule_timeout+0x197/0x300
            [ 5096.726669]  ? __next_timer_interrupt+0xf0/0xf0
            [ 5096.727535]  ? __radix_tree_delete+0x92/0xa0
            [ 5096.728360]  ll_kill_super+0x63/0x130 [lustre]
            [ 5096.729239]  lustre_kill_super+0x28/0x40 [lustre]
            [ 5096.730133]  deactivate_locked_super+0x34/0x70
            [ 5096.730970]  cleanup_mnt+0x3b/0x70
            [ 5096.731645]  task_work_run+0x8a/0xb0
            [ 5096.732344]  exit_to_usermode_loop+0xeb/0xf0
            [ 5096.733163]  do_syscall_64+0x198/0x1a0
            [ 5096.733888]  entry_SYSCALL_64_after_hwframe+0x65/0xca
            

            Details: umount hangs only for NFSv3 and may be caused by tests using locks:

            [ 1231.270056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test connectathon: connectathon == 22:58:47 \(1671145127\)
            [ 1231.633442] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test connectathon: connectathon == 22:58:47 (1671145127)
            [ 1232.072645] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -b -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
            [ 1232.437529] Lustre: DEBUG MARKER: sh ./runtests -N 2 -b -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
            [ 1234.227996] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -g -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
            [ 1234.581110] Lustre: DEBUG MARKER: sh ./runtests -N 2 -g -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
            [ 1241.533359] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -s -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
            [ 1241.914629] Lustre: DEBUG MARKER: sh ./runtests -N 2 -s -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
            [ 1250.386938] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sh .\/runtests -N 2 -l -f \/mnt\/lustre\/d0.parallel-scale-nfs\/d0.connectathon
            [ 1250.784748] Lustre: DEBUG MARKER: sh ./runtests -N 2 -l -f /mnt/lustre/d0.parallel-scale-nfs/d0.connectathon
            [ 1285.778542] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
            [ 1325.200970] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
            [ 1355.406386] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
            [ 1395.339987] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
            [ 1434.762512] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
            [ 1464.967844] LustreError: 18862:0:(file.c:4836:ll_file_flock()) unknown fcntl lock command: 1029
            [ 1473.104321] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test iorssf: iorssf ============== 23:02:49 \(1671145369\)
            [ 1473.474325] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorssf: iorssf ============== 23:02:49 (1671145369)
            [ 1473.951047] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 test_iorssf: @@@@@@ FAIL: ior failed! 1 
            [ 1474.334915] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_iorssf: @@@@@@ FAIL: ior failed! 1
            [ 1474.729606] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2022-12-15/lustre-reviews_custom_91221_102_231e7dd2-7190-4e2c-8d24-93aa05c646b7//parallel-scale-nfsv3.test_iorssf.debug_log.$(hostname -s).1671145371.log;
            		dmesg > /autotest/autotest-2/2022-12-15/lustre-reviews_cu
            [ 1476.359538] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test iorfpp: iorfpp ============== 23:02:52 \(1671145372\)
            [ 1476.726679] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test iorfpp: iorfpp ============== 23:02:52 (1671145372)
            [ 1477.201082] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 test_iorfpp: @@@@@@ FAIL: ior failed! 1 
            [ 1477.578628] Lustre: DEBUG MARKER: parallel-scale-nfsv3 test_iorfpp: @@@@@@ FAIL: ior failed! 1
            [ 1477.984261] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-2/2022-12-15/lustre-reviews_custom_91221_102_231e7dd2-7190-4e2c-8d24-93aa05c646b7//parallel-scale-nfsv3.test_iorfpp.debug_log.$(hostname -s).1671145374.log;
            		dmesg > /autotest/autotest-2/2022-12-15/lustre-reviews_cu
            [ 1479.340152] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ========================================================== 23:02:55 \(1671145375\)
            [ 1479.702251] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test racer_on_nfs: racer on NFS client ========================================================== 23:02:55 (1671145375)
            [ 1787.824195] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == parallel-scale-nfsv3 test complete, duration 1466 sec ========================================================== 23:08:04 \(1671145684\)
            [ 1788.199820] Lustre: DEBUG MARKER: == parallel-scale-nfsv3 test complete, duration 1466 sec ========================================================== 23:08:04 (1671145684)
            [ 1789.526631] Lustre: DEBUG MARKER: systemctl stop nfs-server
            [ 1789.944393] nfsd: last server has exited, flushing export cache
            [ 1790.227067] Lustre: DEBUG MARKER: sed -i '\|^/mnt/lustre|d' /etc/exports
            [ 1790.872269] Lustre: DEBUG MARKER: exportfs -v
            [ 1791.514326] Lustre: DEBUG MARKER: grep -c /mnt/lustre' ' /proc/mounts
            [ 1792.126821] Lustre: DEBUG MARKER: lsof -t /mnt/lustre
            [ 1792.941836] Lustre: DEBUG MARKER: umount  /mnt/lustre 2>&1
            [ 5153.799694] sysrq: SysRq : Changing Loglevel
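            As a side note (an assumption for context, not stated in the ticket): lock command 1029 in the ll_file_flock() messages above appears to be F_CANCELLK (F_LINUX_SPECIFIC_BASE + 5 = 1024 + 5), which lockd/nfsd use to cancel blocked NLM lock requests on the exported filesystem. A quick check of the constant:

            /* check_f_cancellk.c - print the fcntl lock command constants */
            #include <stdio.h>
            #include <linux/fcntl.h>        /* F_LINUX_SPECIFIC_BASE, F_CANCELLK */

            int main(void)
            {
                    printf("F_LINUX_SPECIFIC_BASE = %d\n", F_LINUX_SPECIFIC_BASE); /* 1024 */
                    printf("F_CANCELLK            = %d\n", F_CANCELLK);            /* 1029 */
                    return 0;
            }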
            

            We can disable NLM locks on the server side and repeat the same tests to confirm the guess. What do you think?
            Reference: man nfs

                  lock / nolock  Selects whether to use the NLM sideband protocol to lock  files
                                  on  the  server.  If neither option is specified (or if lock is
                                  specified), NLM locking is used for  this  mount  point.   When
                                  using  the nolock option, applications can lock files, but such
                                  locks provide exclusion only against other applications running
                                  on  the  same  client.  Remote applications are not affected by
                                  these locks.
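
            A minimal sketch of that experiment (hostnames and export path assumed from the original description above; the real change would go through the test framework's NFS mount options rather than manual commands):

            # on the NFS clients (trevis-8vm1 / trevis-8vm2 in the original description)
            umount /mnt/lustre
            mount -t nfs -o vers=3,nolock trevis-8vm4:/mnt/lustre /mnt/lustre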
            

            Thank you!

            Deiter Alex Deiter added a comment - - edited

            Hello adilger,

            Thank you very much for the detailed explanation!
            Please let me test umount without force.

            Thank you!


            adilger Andreas Dilger added a comment -

            Alex, can you please test if removing the "-f" allows this testing to pass more reliably?

            Yingjin, I think you are most familiar with statahead these days. I'm thinking that the patch to fix this may be quite different before/after your statahead patches.

            For master, I would prefer that we prioritize landing the statahead patch series, so if your series doesn't already fix this, the fix should go in a patch as early in the series as possible to avoid conflicts with the later patches. Then a separate patch for backporting to earlier branches could keep the statahead threads on a list/waitqueue on the superblock so they can be woken up, check whether the filesystem is being unmounted, and exit immediately.

            adilger Andreas Dilger added a comment - - edited

            Looking at the messages here, it seems the client still has cached files at the time that it is being unmounted, but the connection to the MDS is stopped shortly thereafter and that gives the client problems.

            [ 2217.170422] Lustre: DEBUG MARKER: umount -f /mnt/lustre 2>&1
            [ 2217.192827] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
            

            The "-f" flag means "force unmount and don't wait gracefully" and should be able to clean up regardless of whether the MDS connection is available or not, but I wonder if it is part of the issue here? If this was just "umount /mnt/lustre" the client would flush the dirty cache (avoiding the "dirty page discard" errors) and probably work more reliably.

            That said, "umount -f" should also work reliably, but that likely needs some changes to the code on the client. According to the stack traces, the unmount is stuck on:

            void ll_kill_super(struct super_block *sb)
            {
                    struct ll_sb_info *sbi = ll_s2sbi(sb);
                    /* ... */
                    /* wait running statahead threads to quit */
                    while (atomic_read(&sbi->ll_sa_running) > 0)
                            schedule_timeout_uninterruptible(
                                    cfs_time_seconds(1) >> 3);
            }
            

            so something is preventing the statahead threads from exiting.

            Likely the statahead threads need to be woken up here during the unmount process so that they can exit. It isn't even clear whether there is a list of statahead threads on the superblock that can be signaled to wake up/exit; it looks like there is only a count/limit of statahead threads. I recall some changes in Yingjin's recent statahead patches so that these threads will exit by themselves after a short time, but I'm not sure whether that is enough to fix this problem. Definitely having a direct notification, with the threads checking that the filesystem is being unmounted, is better than a timeout.
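
            A minimal sketch of that direct-notification idea (illustration only, assuming a wait queue head ll_sa_waitq and an unmount flag ll_sa_umounting are added to struct ll_sb_info; these names and the ll_sa_thread_exit() helper are hypothetical, not actual Lustre code):

            /* unmount path: wake the statahead threads and sleep until they exit,
             * instead of polling ll_sa_running every 1/8 second */
            void ll_kill_super(struct super_block *sb)
            {
                    struct ll_sb_info *sbi = ll_s2sbi(sb);

                    sbi->ll_sa_umounting = 1;               /* hypothetical flag */
                    wake_up_all(&sbi->ll_sa_waitq);         /* hypothetical waitqueue */
                    wait_event(sbi->ll_sa_waitq,
                               atomic_read(&sbi->ll_sa_running) == 0);
            }

            /* statahead thread exit path: drop the running count and notify
             * any waiter sleeping in ll_kill_super() */
            static void ll_sa_thread_exit(struct ll_sb_info *sbi)
            {
                    if (atomic_dec_and_test(&sbi->ll_sa_running))
                            wake_up_all(&sbi->ll_sa_waitq);
            }

            For this to help, the statahead thread's own wait loop would also need to check the unmount flag so that the wake-up actually causes it to stop and exit.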


            adilger Andreas Dilger added a comment -

            Colin, the llite* messages would definitely be from the client, regardless of where it is mounted.

            Alex, the reason for the NFS server to be mounted on the MDS is twofold:

            • fewer test nodes needed
            • faster performance because client is local to MDS, where many of the RPCs are sent, for lower latency metadata operations

            I don't think that would be a contributor to the unmount problem being seen, but hard to say for sure.

            Deiter Alex Deiter added a comment - - edited

            Hello cfaber,

            Is this dmesg from the MDS or a client?

            For some reason the Lustre test suite uses the MDS host as a Lustre client. The test flow is:

            • Setup Lustre targets: host4 - MGS/MDS, host3 - OSS:
              host4# mount -t lustre
              /dev/mapper/mds1_flakey on /mnt/lustre-mds1 type lustre (rw,svname=lustre-MDT0000,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
              
              host3# mount -t lustre
              /dev/mapper/ost1_flakey on /mnt/lustre-ost1 type lustre (rw,svname=lustre-OST0000,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              /dev/mapper/ost2_flakey on /mnt/lustre-ost2 type lustre (rw,svname=lustre-OST0001,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              /dev/mapper/ost3_flakey on /mnt/lustre-ost3 type lustre (rw,svname=lustre-OST0002,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              /dev/mapper/ost4_flakey on /mnt/lustre-ost4 type lustre (rw,svname=lustre-OST0003,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              /dev/mapper/ost5_flakey on /mnt/lustre-ost5 type lustre (rw,svname=lustre-OST0004,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              /dev/mapper/ost6_flakey on /mnt/lustre-ost6 type lustre (rw,svname=lustre-OST0005,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              /dev/mapper/ost7_flakey on /mnt/lustre-ost7 type lustre (rw,svname=lustre-OST0006,mgsnode=10.240.43.3@tcp,osd=osd-ldiskfs,errors=remount-ro)
              
            • Mount the Lustre filesystem on host4 (i.e. as a Lustre client):
              host4# mount -t lustre
              10.240.43.3@tcp:/lustre on /mnt/lustre type lustre (rw,checksum,flock,user_xattr,lruresize,lazystatfs,32bitapi,nouser_fid2path,verbose,encrypt)
              

            Note - 10.240.43.3 is a local address on the same host.

            • Export /mnt/lustre via NFS
              host4# showmount -e
              Export list for trevis-84vm4.trevis.whamcloud.com:
              /mnt/lustre *
              
            • Mount /mnt/lustre on host1 and host2 NFS clients
            • Start workload(s) on host1 and host2 NFS clients on top of mounted NFS share
            • Umount NFS share /mnt/lustre on host1 and host2
            • Stop NFS server on host4
            • Try to umount Lustre client mountpoint /mnt/lustre on host4:
              host4# umount -f /mnt/lustre
              ==> hang here
              
            cfaber Colin Faber added a comment -

            Is this dmesg from the MDS or a client?

            Deiter Alex Deiter added a comment -

            We hit the same issue on the master branch:

            https://testing.whamcloud.com/test_sessions/1442a73f-2793-4e28-9124-a81b4ae65262
            https://testing.whamcloud.com/test_sessions/f0561029-d1d1-4d61-9eea-7f2130ad9784

            All test sessions hang on:

            CMD: trevis-70vm4 umount -f /mnt/lustre 2>&1
            
            [ 5154.892582] task:umount          state:D stack:    0 pid:37125 ppid: 37124 flags:0x00004080
            [ 5154.894509] Call Trace:
            [ 5154.895153]  __schedule+0x2bd/0x760
            [ 5154.896010]  schedule+0x37/0xa0
            [ 5154.896681]  schedule_timeout+0x197/0x300
            [ 5154.897504]  ? __next_timer_interrupt+0xf0/0xf0
            [ 5154.898422]  ? __radix_tree_delete+0x92/0xa0
            [ 5154.899286]  ll_kill_super+0x63/0x130 [lustre]
            [ 5154.900268]  lustre_kill_super+0x28/0x40 [lustre]
            [ 5154.901221]  deactivate_locked_super+0x34/0x70
            [ 5154.902133]  cleanup_mnt+0x3b/0x70
            [ 5154.902869]  task_work_run+0x8a/0xb0
            [ 5154.903640]  exit_to_usermode_loop+0xeb/0xf0
            [ 5154.904525]  do_syscall_64+0x198/0x1a0
            [ 5154.905288]  entry_SYSCALL_64_after_hwframe+0x65/0xca
            

            dmesg:

            [ 1796.367975] Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
            [ 1796.370695] LustreError: 37125:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff95a92ffe3800: inode [0x200000403:0x153d:0x0] mdc close failed: rc = -108
            [ 1796.411489] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28cf:0x0]/ may get corrupted (rc -108)
            [ 1796.411662] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29ec:0x0]/ may get corrupted (rc -108)
            [ 1796.414759] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x293d:0x0]/ may get corrupted (rc -108)
            [ 1796.421120] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a1b:0x0]/ may get corrupted (rc -108)
            [ 1796.436808] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29ed:0x0]/ may get corrupted (rc -108)
            [ 1796.442761] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a05:0x0]/ may get corrupted (rc -108)
            [ 1796.443103] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28f9:0x0]/ may get corrupted (rc -108)
            [ 1796.445919] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2940:0x0]/ may get corrupted (rc -108)
            [ 1796.449051] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28f0:0x0]/ may get corrupted (rc -108)
            [ 1796.452119] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29b1:0x0]/ may get corrupted (rc -108)
            [ 1796.458583] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x298d:0x0]/ may get corrupted (rc -108)
            [ 1796.463828] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29bc:0x0]/ may get corrupted (rc -108)
            [ 1796.466124] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2955:0x0]/ may get corrupted (rc -108)
            [ 1796.466961] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x298e:0x0]/ may get corrupted (rc -108)
            [ 1796.472725] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a0f:0x0]/ may get corrupted (rc -108)
            [ 1796.473142] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x294a:0x0]/ may get corrupted (rc -108)
            [ 1796.480757] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29e1:0x0]/ may get corrupted (rc -108)
            [ 1796.481037] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28f3:0x0]/ may get corrupted (rc -108)
            [ 1796.483887] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a10:0x0]/ may get corrupted (rc -108)
            [ 1796.494909] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2901:0x0]/ may get corrupted (rc -108)
            [ 1796.495139] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2945:0x0]/ may get corrupted (rc -108)
            [ 1796.498049] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x28d7:0x0]/ may get corrupted (rc -108)
            [ 1796.501156] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a19:0x0]/ may get corrupted (rc -108)
            [ 1796.504452] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x29d2:0x0]/ may get corrupted (rc -108)
            [ 1796.519318] Lustre: 5617:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x296b:0x0]/ may get corrupted (rc -108)
            [ 1796.525781] Lustre: 5618:0:(llite_lib.c:3674:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.240.41.235@tcp:/lustre/fid: [0x200000403:0x2a02:0x0]/ may get corrupted (rc -108)
            [ 1826.914042] Lustre: lustre-MDT0000: haven't heard from client 309ad38c-8e48-430d-ae7a-a65d0196c74a (at 0@lo) in 31 seconds. I think it's dead, and I am evicting it. exp 00000000f489d0d3, cur 1671076587 expire 1671076557 last 1671076556
            
            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment -

            +1 on master: https://testing.whamcloud.com/test_sets/bc47e7ed-ce27-4fab-ad48-8ab438dc70b8
            sarah Sarah Liu added a comment -

            Similar error on master; it failed to stop the NFS service after all tests finished.
            https://testing.whamcloud.com/test_sets/e370a684-05f9-41cc-9e78-0b061119344f

            onyx-71vm13: Redirecting to /bin/systemctl stop nfs.service
            onyx-71vm13: Failed to stop nfs.service: Unit nfs.service not loaded.
            onyx-71vm13: Redirecting to /bin/systemctl stop nfs-server.service
            

            People

              Assignee: qian_wc Qian Yingjin
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 10
