
NFS server not responding when running parallel-scale test_iorssf

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.2.0, Lustre 2.1.2
    • Affects Version/s: Lustre 2.2.0
    • Labels: None
    • Environment: server/client: 2.1.55 RHEL6-x86_64
    • Severity: 3
    • Rank: 4710

    Description

      Hit the following error when running iorssf over NFSv3:

      Lustre: DEBUG MARKER: == parallel-scale test iorssf: iorssf == 13:45:23 (1329342323)
      nfs: server 10.10.4.15 not responding, still trying
      nfs: server 10.10.4.15 not responding, still trying
      nfs: server 10.10.4.15 not responding, still trying
      nfs: server 10.10.4.15 not responding, still trying
      nfs: server 10.10.4.15 not responding, still trying

          Activity

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #493
            LU-1109 llite: do splice read stripe by stripe (Revision 211b00d651bbc57d9ab9d24d6d7e94b013957cf1)

            Result = SUCCESS
            Oleg Drokin : 211b00d651bbc57d9ab9d24d6d7e94b013957cf1
            Files :

            • lustre/llite/vvp_io.c

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,client,el5,ofa #493
            LU-1109 llite: do splice read stripe by stripe (Revision 211b00d651bbc57d9ab9d24d6d7e94b013957cf1)

            Result = SUCCESS
            Oleg Drokin : 211b00d651bbc57d9ab9d24d6d7e94b013957cf1
            Files :

            • lustre/llite/vvp_io.c

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » i686,server,el5,ofa #493
            LU-1109 llite: do splice read stripe by stripe (Revision 211b00d651bbc57d9ab9d24d6d7e94b013957cf1)

            Result = SUCCESS
            Oleg Drokin : 211b00d651bbc57d9ab9d24d6d7e94b013957cf1
            Files :

            • lustre/llite/vvp_io.c

            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,client,el5,inkernel #493
            LU-1109 llite: do splice read stripe by stripe (Revision 211b00d651bbc57d9ab9d24d6d7e94b013957cf1)

            Result = SUCCESS
            Oleg Drokin : 211b00d651bbc57d9ab9d24d6d7e94b013957cf1
            Files :

            • lustre/llite/vvp_io.c

            pjones Peter Jones added a comment -

            Landed for 2.2


            hudson Build Master (Inactive) added a comment -

            Integrated in lustre-master » x86_64,server,el5,inkernel #493
            LU-1109 llite: do splice read stripe by stripe (Revision 211b00d651bbc57d9ab9d24d6d7e94b013957cf1)

            Result = SUCCESS
            Oleg Drokin : 211b00d651bbc57d9ab9d24d6d7e94b013957cf1
            Files :

            • lustre/llite/vvp_io.c

            sarah Sarah Liu added a comment -

            Hi, I reran the test on Toro instead of Juelich; both NFSv3 and NFSv4 passed:
            https://maloo.whamcloud.com/test_sets/395bfe5a-61d7-11e1-b462-5254004bbbd3
            https://maloo.whamcloud.com/test_sets/00c195ae-61d8-11e1-b462-5254004bbbd3

            sarah Sarah Liu added a comment -

            I didn't run recovery test at that time and will keep you updated if I have any more information.


            jay Jinshan Xiong (Inactive) added a comment -

            Hi Sarah,

            This doesn't look like the same problem. Was a recovery test running while IOR was running? The suspicious log entries are these:

            LustreError: 31719:0:(file.c:2221:ll_inode_revalidate_fini()) failure -95 inode 1045761
            LustreError: 31768:0:(mdc_locks.c:719:mdc_enqueue()) ldlm_cli_enqueue: -95
            LustreError: 31768:0:(file.c:2221:ll_inode_revalidate_fini()) failure -95 inode 1045761

            But I can't be sure until I get the full log.

            Can you please rerun the test with the following debug settings on the NFS server (Lustre client):
            1. lctl set_param debug=-1
            2. lctl set_param debug=-trace
            3. lctl set_param debug_mb=200
            4. lctl mark "XXXX IOR test starting..."

            After you notice that nfsd is hung, please do the following in addition to collecting the Lustre logs:
            5. echo t > /proc/sysrq-trigger
            6. dmesg > dmesg.txt and upload the dmesg.txt file.

            Thanks.
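
The debug-collection steps requested in the comment above could be wrapped in a small script along these lines. This is only a sketch: the DRY_RUN and OUT_DIR variables, the function names, and the output file names other than dmesg.txt are illustrative assumptions, not part of the ticket. The comment does not say how to collect the Lustre logs; `lctl dk` is used here as one common way to dump the kernel debug buffer to a file. By default the script only prints the commands, and it would need to run as root on the NFS server (Lustre client) with DRY_RUN=0 to take effect.

```shell
#!/bin/sh
# Sketch of the debug-collection procedure from the comment above.
# Assumptions (not from the ticket): DRY_RUN, OUT_DIR, the function names,
# and the use of `lctl dk` to dump the Lustre debug log.

DRY_RUN="${DRY_RUN:-1}"     # 1 = only print commands, 0 = execute them
OUT_DIR="${OUT_DIR:-/tmp}"  # where log files are written

run() {
    # Print the command line; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$*"
    fi
}

prepare_debug() {
    # Steps 1-4: enable all debug flags except trace, enlarge the
    # debug buffer to 200 MB, and mark the start of the test in the log.
    run "lctl set_param debug=-1"
    run "lctl set_param debug=-trace"
    run "lctl set_param debug_mb=200"
    run "lctl mark 'XXXX IOR test starting...'"
}

collect_after_hang() {
    # Dump the Lustre debug buffer (assumed collection method), then
    # steps 5-6: dump task states via sysrq and save dmesg output.
    run "lctl dk $OUT_DIR/lustre-debug.log"
    run "echo t > /proc/sysrq-trigger"
    run "dmesg > $OUT_DIR/dmesg.txt"
}

case "${1:-}" in
    prepare) prepare_debug ;;
    collect) collect_after_hang ;;
esac
```

Saved under a hypothetical name such as collect-debug.sh, it would be invoked as `DRY_RUN=0 sh collect-debug.sh prepare` before starting IOR, and as `DRY_RUN=0 sh collect-debug.sh collect` once nfsd is observed to hang.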
            sarah Sarah Liu added a comment - edited

            Hi Xiong,

            I ran your patch on the Juelich cluster and there still seems to be a problem with the IOR test. The NFS client hangs; below is the dmesg output from the NFS server (Lustre client):
            ------------------------------------------------------------------------
            Lustre: MGC192.168.119.12@tcp: Reactivating import
            Lustre: MGC192.168.119.12@tcp: Connection restored to service MGS using nid 192.168.119.12@tcp.
            LustreError: 31587:0:(genops.c:311:class_newdev()) Device lustre-OST0001-osc-ffff8806256ba000 already exists at 5, won't add
            LustreError: 31587:0:(obd_config.c:327:class_attach()) Cannot create device lustre-OST0001-osc-ffff8806256ba000 of type osc : -17
            LustreError: 31587:0:(obd_config.c:1363:class_config_llog_handler()) Err -17 on cfg command:
            Lustre: cmd=cf001 0:lustre-OST0001-osc 1:osc 2:lustre-clilov_UUID
            LustreError: 31719:0:(mdc_locks.c:719:mdc_enqueue()) ldlm_cli_enqueue: -95
            LustreError: 31719:0:(file.c:2221:ll_inode_revalidate_fini()) failure -95 inode 1045761
            LustreError: 31768:0:(mdc_locks.c:719:mdc_enqueue()) ldlm_cli_enqueue: -95
            LustreError: 31768:0:(file.c:2221:ll_inode_revalidate_fini()) failure -95 inode 1045761

            jay Jinshan Xiong (Inactive) added a comment -

            Please try patch: http://review.whamcloud.com/2182

            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: sarah Sarah Liu