Details

    • 3
    • 8471

    Description

      It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for stack overflow in LU-2859 changed the same lines where the fix was applied, so perhaps that reintroduced the problem?

      We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:

      2013-05-29 16:15:36.363547  sysiod        S 00000fffa85363bc     0  4111   3070 0x00000000
      2013-05-29 16:15:36.363582  Call Trace:
      2013-05-29 16:15:36.363617  [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
      2013-05-29 16:15:36.363653  [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
      2013-05-29 16:15:36.363688  [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
      2013-05-29 16:15:36.363723  [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
      2013-05-29 16:15:36.363758  [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
      2013-05-29 16:15:36.363793  [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
      2013-05-29 16:15:36.363828  [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
      2013-05-29 16:15:36.363863  [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363898  [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
      2013-05-29 16:15:36.363934  [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
      2013-05-29 16:15:36.363968  [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
      2013-05-29 16:15:36.364003  [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
      2013-05-29 16:15:36.364038  [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
      2013-05-29 16:15:36.364074  [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
      2013-05-29 16:15:36.364109  [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
      2013-05-29 16:15:36.364143  [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
      2013-05-29 16:15:36.364178  [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
      2013-05-29 16:15:36.364213  [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
      2013-05-29 16:15:36.364248  [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
      2013-05-29 16:15:36.364283  [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
      2013-05-29 16:15:36.364317  [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
      2013-05-29 16:15:36.364352  [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
      2013-05-29 16:15:36.364387  [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
      2013-05-29 16:15:36.364422  [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
      2013-05-29 16:15:36.364458  [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
      2013-05-29 16:15:36.364492  [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
      

      This was with Lustre 2.4.0-RC1_3chaos.

      Attachments

        1. rzuseqio13_drop_caches.txt.bz2
          0.2 kB
        2. rzuseqio14_drop_caches.txt.bz2
          4.62 MB
        3. rzuseqio14_lustre_log.txt
          0.2 kB
        4. rzuseqio15_console.txt.bz2
          79 kB
        5. rzuseqio15_drop_caches.txt.bz2
          0.2 kB
        6. rzuseqio16_drop_caches.txt.bz2
          3.46 MB
        7. rzuseqio16_dump_page_cache.txt.bz2
          783 kB

        Issue Links

          Activity

            [LU-3416] Hangs on write in osc_enter_cache()
            pjones Peter Jones made changes -
            Comment [ Hi Team, 

            Clients are in hang state and  are seeing these corresponding errors on the clients: 
            [Thu Oct 20 04:03:17 2022] LustreError: 11-0: data-MDT0000-mdc-ffff9f51f6751000: operation mds_close to node 172.27.0.45@o2ib failed: rc = -107 [Thu Oct 20 04:03:17 2022] Lustre: data-MDT0000-mdc-ffff9f51f6751000: Connection to data-MDT0000 (at 172.27.0.45@o2ib) was lost; in progress operations using this service will wait for recovery to complete [Thu Oct 20 04:03:17 2022] LustreError: Skipped 2 previous similar messages [Thu Oct 20 04:03:19 2022] LustreError: 11-0: data-MDT0000-mdc-ffff9f51f6751000: operation ldlm_enqueue to node 172.27.0.45@o2ib failed: rc = -107 [Thu Oct 20 04:03:19 2022] LustreError: Skipped 1 previous similar message [Thu Oct 20 04:03:20 2022] LustreError: 11-0: data-MDT0000-mdc-ffff9f51f6751000: operation ldlm_enqueue to node 172.27.0.45@o2ib failed: rc = -107 [Thu Oct 20 04:03:20 2022] LustreError: Skipped 2 previous similar messages [Thu Oct 20 04:07:05 2022] LustreError: 9131:0:(client.c:3067:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff9f3c2ff59680 x1747146287852800/t347910890929(347910890929) o101->data-MDT0000-mdc-ffff9f51f6751000@172.27.0.45@o2ib:12/10 lens 864/568 e 0 to 0 dl 1666257015 ref 2 fl Interpret:RP/4/0 rc 301/301 [Thu Oct 20 05:56:33 2022] LustreError: 65849:0:(file.c:4601:ll_inode_revalidate_fini()) data: revalidate FID [0x200000007:0x1:0x0] error: rc = -4

             

            Regards,

            Hithesh Kumar ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.4.2 [ 10605 ]
            pjones Peter Jones made changes -
            Fix Version/s Original: Lustre 2.4.0 [ 10154 ]
            niu Niu Yawei (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]
            adilger Andreas Dilger made changes -
            Fix Version/s New: Lustre 2.5.0 [ 10295 ]
            morrone Christopher Morrone (Inactive) made changes -
            Attachment New: rzuseqio13_drop_caches.txt.bz2 [ 13061 ]
            Attachment New: rzuseqio14_drop_caches.txt.bz2 [ 13062 ]
            Attachment New: rzuseqio15_drop_caches.txt.bz2 [ 13063 ]
            Attachment New: rzuseqio16_drop_caches.txt.bz2 [ 13064 ]
            morrone Christopher Morrone (Inactive) made changes -
            Attachment New: rzuseqio15_console.txt.bz2 [ 13038 ]
            pjones Peter Jones made changes -
            Labels New: mq313
            pjones Peter Jones made changes -
            Affects Version/s New: Lustre 2.4.1 [ 10294 ]
            pjones Peter Jones made changes -
            Priority Original: Blocker [ 1 ] New: Critical [ 2 ]

            People

              niu Niu Yawei (Inactive)
              morrone Christopher Morrone (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: