
1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: None
    • Affects Version: Lustre 2.4.1
    • Components: None
    • Environment: RHEL6.4/distro OFED/1.8.9 clients/2.4.1 servers
    • Severity: 3

    Description

      We have 2 separate 1.8.9 clients with processes hung in the D state, with the clients endlessly looping from FULL to DISCONN and then reestablishing connectivity. One appears to be looping on a bulk IO write to a stale file handle (10.36.202.142@o2ib), and the other on a bulk IO read after a kiblnd failure (10.36.202.138@o2ib). The timeouts affect filesystem availability, but other activity proceeds in between these disconnections.
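
      The FULL/DISCONN cycling can be watched from the client's import state; a minimal sketch, assuming the 1.8 /proc layout (the osc device name below is just an example):

        # show the current import state and recent state history for one OSC device
        cat /proc/fs/lustre/osc/atlas2-OST009b-osc-*/state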

      Just yesterday we identified 2 bad IB cables with high symbol error rates in our fabric that have since been disconnected. They were likely the cause for at least one of the issues.
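
      As a sketch, counters like these can be surveyed with the standard infiniband-diags tools (run from a node with fabric access; what counts as "high" is a judgment call):

        # scan all switch/HCA ports for elevated error counters, including SymbolErrors
        ibqueryerrors
        # or query the counters of a single port directly (LID and port are placeholders)
        perfquery <lid> <port>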

      server logs relevant to 10.36.202.138@o2ib issue:
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.863173] Lustre: atlas2-OST009b: Client 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib) refused reconnection, still busy with 1 active RPCs
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032387.919690] LustreError: 11590:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk PUT req@ffff880761cd2800 x1453425689933409/t0(0) o3->3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2@10.36.202.138@o2ib:0/0 lens 448/432 e 0 to 0 dl 1386523292 ref 1 fl Interpret:/2/0 rc 0/0
      Dec 8 12:11:52 atlas-oss3b4 kernel: [1032388.030434] Lustre: atlas2-OST009b: Bulk IO read error with 3b57a9ed-bec6-9b0d-7da8-04d696e1a7f2 (at 10.36.202.138@o2ib), client will retry: rc -110

      client log on 10.36.202.138@o2ib:
      Dec 8 12:11:51 dtn04.ccs.ornl.gov kernel: LustreError: 24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689933409 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 19s ago has failed due to network error (567s prior to deadline).
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: 24622:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: atlas2-OST009b-osc-ffff880c392f9400: Connection to service atlas2-OST009b via nid 10.36.225.185@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: Lustre: Skipped 7 previous similar messages
      Dec 8 12:11:52 dtn04.ccs.ornl.gov kernel: LustreError: 11-0: an error occurred while communicating with 10.36.225.185@o2ib. The ost_connect operation failed with -16

      lctl dk output:
      00000100:00080000:9:1386522391.428708:0:24622:0:(client.c:1392:ptlrpc_check_set()) resend bulk old x1453425689698153 new x1453425689814678
      00000100:02000400:4:1386522391.428708:0:24623:0:(import.c:1016:ptlrpc_connect_interpret()) Server atlas2-OST009b_UUID version (2.4.1.0) is much newer than client version (1.8.9)
      00000800:00000100:8:1386522411.140346:0:1644:0:(o2iblnd_cb.c:1813:kiblnd_close_conn_locked()) Closing conn to 10.36.225.185@o2ib: error 0(waiting)
      00000100:00020000:6:1386522411.140744:0:24615:0:(events.c:199:client_bulk_callback()) event type 1, status -103, desc ffff880499288000
      00000100:00000400:3:1386522411.151303:0:24622:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1453425689814678 sent from atlas2-OST009b-osc-ffff880c392f9400 to NID 10.36.225.185@o2ib 20s ago has failed due to network error (567s prior to deadline).

      server logs relevant to 10.36.202.142@o2ib issue:
      Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123761] LustreError: 113676:0:(ldlm_lib.c:2722:target_bulk_io()) @@@ network error on bulk GET 0(1048576) req@ffff880ff253d000 x1453430541439222/t0(0) o4->10c57078-ce10-72c6-b97d-e7c9a32a240c@10.36.202.142@o2ib:0/0 lens 448/448 e 0 to 0 dl 1386513410 ref 1 fl Interpret:/2/0 rc 0/0
      Dec 8 09:27:15 atlas-oss4h1 kernel: [1022525.123900] Lustre: atlas2-OST03e0: Bulk IO write error with 10c57078-ce10-72c6-b97d-e7c9a32a240c (at 10.36.202.142@o2ib), client will retry: rc -110

      client logs from 10.36.202.142@o2ib:
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection to service atlas2-OST03e0 via nid 10.36.226.48@o2ib was lost; in progress operations using this service will wait for recovery to complete.
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 39 previous similar messages
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: atlas2-OST03e0-osc-ffff881039813c00: Connection restored to service atlas2-OST03e0 using nid 10.36.226.48@o2ib.
      Dec 8 09:26:50 dtn-sch01.ccs.ornl.gov kernel: Lustre: Skipped 41 previous similar messages

      The lctl dk output is less revealing for this case, but I will attach it. What flags are desirable for getting more relevant information?
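
      For the next reproduction, a sketch of widening the client-side debug mask beforehand (the /proc/sys/lnet interface is assumed for the 1.8 clients; the flag selection is only a suggestion):

        # add network, RPC, and lock tracing to the default debug mask
        echo "+neterror +net +rpctrace +dlmtrace" > /proc/sys/lnet/debug
        # enlarge the in-memory trace buffer so the window of interest is not overwritten
        echo 256 > /proc/sys/lnet/debug_mb
        # ...reproduce the timeout, then dump the kernel debug log
        lctl dk > /tmp/lustre-dk.$(hostname).log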

      From the OS, if I try to kill the process or stat the inode, I see:
      00000080:00020000:3:1386519720.291103:0:16284:0:(file.c:3348:ll_inode_revalidate_fini()) failure -116 inode 144117425485470173

      [root@dtn-sch01 ~]# fuser -k -m /lustre/atlas2
      Cannot stat file /proc/975/fd/25: Stale file handle
      [root@dtn-sch01 ~]# ls -l /proc/975/fd/25
      l-wx------ 1 cfuson ccsstaff 64 Dec 7 22:46 /proc/975/fd/25 -> /lustre/atlas1/stf007/scratch/cfuson/TestDir/SubDir2/13G-3.tar
      [root@dtn-sch01 ~]#

      Attachments

        1. kiblnd_timeout_client_dtn04.log
          45 kB
        2. stale_fh_loop_client_dtn-sch01.log
          25 kB
        3. stale_fh_loop_server_logs
          153 kB
        4. lustre_dtn-sch01_0.log.gz
          2.99 MB
        5. lustre_dtn04_0.log.gz
          1.46 MB
        6. lustre_dtn04_20131209_fail.log.gz
          544 kB
        7. lustre_logs_atlas-oss3b4_flush.log.gz
          0.3 kB
        8. lst-wrapper.sh
          1 kB


          Activity

            [LU-4359] 1.8.9 clients have endless bulk IO timeouts with 2.4.1 servers

            Blake,

            Thank you for the update. We're glad the root cause of the problem was found.

            jamesanunez James Nunez (Inactive) added a comment

            The problem has been solved. We found the issue within the IB fabric and fixed it by adjusting the weights of certain links. Routes between certain OSSes and clients were taking an unexpected path through an older DDR switch with MTU 2048. Since the hosts with the RHEL6 OFED, along with the newer switches, supported 4096, this worked fine most of the time. Setting the ko2iblnd parameter ib_mtu=2048 gave us instant relief before the routes could be adjusted.

            Thanks for the help narrowing the issue down on this case.
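
            For reference, a sketch of making that workaround persistent on a client (the modprobe.d filename is an assumption, and ko2iblnd must be reloaded, i.e. Lustre unmounted and LNet stopped, before the parameter takes effect):

                # /etc/modprobe.d/ko2iblnd.conf
                options ko2iblnd ib_mtu=2048

                # after the module is reloaded, confirm the active value
                cat /sys/module/ko2iblnd/parameters/ib_mtu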

            blakecaldwell Blake Caldwell added a comment
            Jan 8 08:35:14 client kernel: LustreError: 9877:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff88039adf6cd0) ns: linkfarm-MDT0000-mdc-ffff8803f4fd1000 lock: ffff88082165bc00/0x
            

            This is LU-1488.

            niu Niu Yawei (Inactive) added a comment

            The failure was present with 1.8.8. Below are the logs from starting the Lustre 1.8.8 client, mounting the filesystems, and then running file * in the directory that has been causing the issue. The symptoms were the same: the application would hang. Recognizing the problem, I went to unmount to try the next case (this has never worked in the past, and before LU-3067 we had to reboot the client). As usual that failed, but the logs have some different messages with 1.8.8.

                1. data mismatch with ino 144115188193296385/0

            Does that tell us anything new?

            Jan 8 08:33:17 client kernel: Lustre: Listener bound to ib0:10.36.202.142:987:mlx4_0
            Jan 8 08:33:17 client kernel: Lustre: Added LNI 10.36.202.142@o2ib [63/2560/0/180]
            Jan 8 08:33:17 client kernel: Lustre: Build Version: lustre/scripts--PRISTINE-2.6.32-358.23.2.el6.x86_64
            Jan 8 08:33:18 client kernel: Lustre: Lustre Client File System; http://www.lustre.org/
            Jan 8 08:33:18 client kernel: Lustre: MGC10.36.226.77@o2ib: Reactivating import
            Jan 8 08:33:18 client kernel: Lustre: Server MGS version (2.4.1.0) is much newer than client version (1.8.8)
            Jan 8 08:33:18 client kernel: Lustre: 8437:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf001,clilmv:lmv
            Jan 8 08:33:18 client kernel: Lustre: 8437:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf014,clilmv:linkfarm-MDT0000_UUID
            Jan 8 08:33:18 client kernel: Lustre: Server linkfarm-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)
            Jan 8 08:33:18 client kernel: Lustre: client supports 64-bits dir hash/offset!
            Jan 8 08:33:18 client kernel: Lustre: Client linkfarm-client(ffff8803f4fd1000) mount complete
            Jan 8 08:33:19 client kernel: Lustre: Server atlas1-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)
            Jan 8 08:33:19 client kernel: Lustre: client supports 64-bits dir hash/offset!
            Jan 8 08:33:19 client kernel: Lustre: Client atlas1-client(ffff8803f4fd7400) mount complete
            Jan 8 08:33:19 client kernel: Lustre: 8445:0:(obd_config.c:1127:class_config_llog_handler()) skipping 'lmv' config: cmd=cf001,clilmv:lmv
            Jan 8 08:33:21 client kernel: Lustre: Server atlas2-MDT0000_UUID version (2.4.1.0) is much newer than client version (1.8.8)
            Jan 8 08:33:21 client kernel: Lustre: client supports 64-bits dir hash/offset!
            Jan 8 08:33:21 client kernel: Lustre: Client atlas2-client(ffff880c397d6400) mount complete
            Jan 8 08:33:22 client kernel: Lustre: MGC10.36.227.200@o2ib: Reactivating import
            Jan 8 08:33:22 client kernel: Lustre: Client widow1-client(ffff880fd01dec00) mount complete
            Jan 8 08:35:14 client kernel: LustreError: 8410:0:(events.c:198:client_bulk_callback()) event type 1, status -103, desc ffff88080e1c4000
            Jan 8 08:35:14 client kernel: Lustre: setting import linkfarm-MDT0000_UUID INACTIVE by administrator request
            Jan 8 08:35:14 client kernel: LustreError: 9877:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff88039adf6cd0) ns: linkfarm-MDT0000-mdc-ffff8803f4fd1000 lock: ffff88082165bc00/0xf06137b3b58e545f lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0xdca99b6a04cb5ba3 expref: -99 pid: 8466 timeout: 0
            Jan 8 08:35:14 client kernel: Lustre: setting import linkfarm-OST0000_UUID INACTIVE by administrator request
            Jan 8 08:35:14 client kernel: Lustre: client linkfarm-client(ffff8803f4fd1000) umount complete
            Jan 8 08:35:14 client kernel: LustreError: 9879:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff8804249bad10) ns: atlas1-MDT0000-mdc-ffff8803f4fd7400 lock: ffff88082165ba00/0xf06137b3b58e5466 lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0xdff80ba35f064331 expref: -99 pid: 8466 timeout: 0
            Jan 8 08:35:16 client kernel: Lustre: client atlas1-client(ffff8803f4fd7400) umount complete
            Jan 8 08:35:16 client kernel: Lustre: setting import atlas2-MDT0000_UUID INACTIVE by administrator request
            Jan 8 08:35:16 client kernel: LustreError: 9881:0:(namei.c:256:ll_mdc_blocking_ast()) ### data mismatch with ino 144115188193296385/0 (ffff880c39246cd0) ns: atlas2-MDT0000-mdc-ffff880c397d6400 lock: ffff88082165b800/0xf06137b3b58e546d lrc: 3/0,0 mode: PR/PR res: 8589934599/1 bits 0x13 rrc: 2 type: IBT flags: 0x2002c90 remote: 0x7f4ba8a4cefab31f expref: -99 pid: 8466 timeout: 0
            Jan 8 08:35:16 client kernel: LustreError: 2866:0:(mdc_locks.c:653:mdc_enqueue()) ldlm_cli_enqueue error: -108
            Jan 8 08:35:16 client kernel: LustreError: 2866:0:(file.c:3331:ll_inode_revalidate_fini()) failure -108 inode 144115188193296385
            Jan 8 08:35:17 client kernel: Lustre: client widow-client(ffff88081da07c00) umount complete
            Jan 8 08:35:17 client kernel: Lustre: setting import widow2-MDT0000_UUID INACTIVE by administrator request
            Jan 8 08:35:18 client kernel: Lustre: client widow2-client(ffff88038e15b400) umount complete
            Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
            Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
            Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1039:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
            Jan 8 08:35:18 client kernel: LustreError: 10316:0:(ldlm_request.c:1597:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
            Jan 8 08:35:18 client kernel: LustreError: 10318:0:(mdc_locks.c:653:mdc_enqueue()) ldlm_cli_enqueue error: -108
            Jan 8 08:35:18 client kernel: LustreError: 10318:0:(file.c:3331:ll_inode_revalidate_fini()) failure -108 inode 144115188193296385
            Jan 8 08:35:32 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:35:48 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:36:04 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:36:19 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:36:35 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:36:50 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:37:06 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:37:37 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:38:24 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:39:42 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108
            Jan 8 08:42:02 client kernel: LustreError: 2866:0:(llite_lib.c:1774:ll_statfs_internal()) mdc_statfs fails: rc = -108

            blakecaldwell Blake Caldwell added a comment

            I wanted to update this case: we are still experiencing the timeouts, however the patch for LU-2703 took care of the LBUG. That has significantly sped up the process of reproducing the error under different software stacks and on different systems. We noticed that 1.8.6 clients on SLES do not have this issue, so we are trying to roll back the RHEL clients to earlier Lustre versions.

            blakecaldwell Blake Caldwell added a comment

            Hi Niu, this issue is tracked in LU-3067 and there is a patch there. Thanks!

            hongchao.zhang Hongchao Zhang added a comment

            The ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) in brw_interpret was introduced by LU-2703.

            Hongchao, could you take a look? What's the intention of this assert?

            niu Niu Yawei (Inactive) added a comment

            Evicting the client from the OST it is connect-cycling with causes an LBUG:

            <3>LustreError: 167-0: This client was evicted by atlas2-OST009b; in progress operations using this service will fail.
            <0>LustreError: 2405:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
            <3>LustreError: 17528:0:(rw.c:1339:ll_issue_page_read()) page ffffea001ba4b0d0 map ffff880839a9aa30 index 0 flags c0000000000821 count 5 priv ffff880822484d80: read queue failed: rc -5
            <0>LustreError: 2405:0:(osc_request.c:2357:brw_interpret()) LBUG
            <4>Pid: 2405, comm: ptlrpcd
            <4>
            <4>Call Trace:
            <1>BUG: unable to handle kernel NULL pointer dereference at (null)
            <1>IP: [<(null)>] (null)
            <4>PGD 81984f067 PUD 8392fc067 PMD 0
            <4>Oops: 0010 [#1] SMP
            <4>last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI000D:00/power1_average_interval
            <4>CPU 10
            <4>Modules linked in: iptable_nat nf_nat mptctl mptbase nfs lockd fscache auth_rpcgss nfs_acl mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ptlrpc(U) obdclass(U) lvfs(U) ko2iblnd(U) lnet(U) libcfs(U) autofs4 sunrpc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_REJECT xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 tcp_bic power_meter sg mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core hpilo hpwdt bnx2 myri10ge(U) dca microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa pata_acpi ata_generic pata_atiixp ahci radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
            <4>
            <4>Pid: 2405, comm: ptlrpcd Not tainted 2.6.32-358.23.2.el6.x86_64 #1 HP ProLiant DL385 G7
            <4>RIP: 0010:[<0000000000000000>] [<(null)>] (null)
            <4>RSP: 0018:ffff88041b9ddb48 EFLAGS: 00010246
            <4>RAX: ffff88041b9ddbac RBX: ffff88041b9ddba0 RCX: ffffffffa0558260
            <4>RDX: ffff88041b9ddbe0 RSI: ffff88041b9ddba0 RDI: ffff88041b9dc000
            <4>RBP: ffff88041b9ddbe0 R08: 0000000000000000 R09: 0000000000000000
            <4>R10: 0000000000000003 R11: 0000000000000000 R12: 000000000000cbe0
            <4>R13: ffffffffa0558260 R14: 0000000000000000 R15: ffff88044e483fc0
            <4>FS: 00007f0e92ef9700(0000) GS:ffff88044e480000(0000) knlGS:00000000f77026c0
            <4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
            <4>CR2: 0000000000000000 CR3: 0000000815d93000 CR4: 00000000000007e0
            <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
            <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
            <4>Process ptlrpcd (pid: 2405, threadinfo ffff88041b9dc000, task ffff880438ed3500)
            <4>Stack:
            <4> ffffffff8100e4a0 ffff88041b9ddbac ffff880438ed3500 ffffffffa07cdf78
            <4><d> 00000000a07ce9a8 ffff88041b9dc000 ffff88041b9ddfd8 ffff88041b9dc000
            <4><d> 000000000000000a ffff88044e480000 ffff88041b9ddbe0 ffff88041b9ddbb0
            <4>Call Trace:
            <4> [<ffffffff8100e4a0>] ? dump_trace+0x190/0x3b0
            <4> [<ffffffffa054c835>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            <4> [<ffffffffa054ce65>] lbug_with_loc+0x75/0xe0 [libcfs]
            <4> [<ffffffffa05555d6>] libcfs_assertion_failed+0x66/0x70 [libcfs]
            <4> [<ffffffffa07c43ff>] brw_interpret+0xcff/0xe90 [osc]
            <4> [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
            <4> [<ffffffff8105ea69>] ? find_busiest_queue+0x69/0x150
            <4> [<ffffffffa0702a9a>] ptlrpc_check_set+0x24a/0x16b0 [ptlrpc]
            <4> [<ffffffff81081b5b>] ? try_to_del_timer_sync+0x7b/0xe0
            <4> [<ffffffff81081be2>] ? del_timer_sync+0x22/0x30
            <4> [<ffffffffa07397ad>] ptlrpcd_check+0x18d/0x270 [ptlrpc]
            <4> [<ffffffffa0739a50>] ptlrpcd+0x160/0x270 [ptlrpc]
            <4> [<ffffffff81063990>] ? default_wake_function+0x0/0x20
            <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
            <4> [<ffffffffa07398f0>] ? ptlrpcd+0x0/0x270 [ptlrpc]
            <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            <4>Code: Bad RIP value.
            <1>RIP [<(null)>] (null)
            <4> RSP <ffff88041b9ddb48>
            <4>CR2: 0000000000000000

            blakecaldwell Blake Caldwell added a comment

            Unfortunately, the issue is still present independent of the IB hardware.

            I'm unsure what contributed to the string of tests succeeding, but we are now back in the same pattern of reconnects on many different 1.8.9 clients. When this happens, processes that access the filesystem remain in the D state and the log messages given above repeat.

            I believe the best way to pick this up is to try the 2.4 client on identical hardware to compare, and then also run an LNet selftest on a dedicated client. We would have to deal with other activity on the server, but if I read Liang's comment correctly, the worry is client activity.
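
            Along those lines, a minimal lnet_selftest bulk-write sketch (the NIDs are taken from the logs above purely as placeholders; the lnet_selftest module must be loaded on both ends):

                export LST_SESSION=$$
                lst new_session bulk_check
                lst add_group clients 10.36.202.138@o2ib
                lst add_group servers 10.36.225.185@o2ib
                lst add_batch bulk
                lst add_test --batch bulk --from clients --to servers brw write size=1M
                lst run bulk
                lst stat clients servers    # watch error/drop counters while it runs
                lst stop bulk
                lst end_session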

            blakecaldwell Blake Caldwell added a comment - edited

            As an update to this case, we tried running the rsync test with a Lustre 1.8.9 client, but on an FDR ConnectX-3 card, and could not reproduce the issue. Right now it appears to be an issue with the IB card or associated driver. It does not appear to be a Lustre issue.

            blakecaldwell Blake Caldwell added a comment

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: blakecaldwell Blake Caldwell
              Votes: 0
              Watchers: 9
