Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • None
    • Lustre 1.8.x (1.8.0 - 1.8.5)
    • None
    • lustre 1.8.8 RHEL5
    • 3
    • 5754

    Description

      We have a network problem at a customer site: the clients are still running, but the network is unstable. In that situation, the Lustre servers sometimes refuse new connections because they are still waiting for some active RPCs to finish.

      e.g.)
      Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs
      Nov 6 10:51:16 oss212 kernel: LustreError: 21337:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (107) req@ffff8106a3c46400 x1415646605273905/t0 o400><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1352166761 ref 1 fl Interpret:H/0/0 rc -107/0

      In some cases we can identify the bad client and reboot it, or evict it from the servers so it reconnects, and then the situation recovers.

      However, in most cases it is hard to find the bad client, and the error messages keep appearing. If we cannot find the bad client, new clients cannot reconnect until all clients are rebooted, and that is not a good solution.

      Is there a good way to easily find the bad client when the above logs appear?
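      One low-impact way to narrow this down is to parse the server console logs: the "refuse reconnection" message already carries the client UUID and NID. A minimal sketch, assuming the message format shown in the example above (field positions may vary between Lustre versions; in practice the input would be /var/log/messages rather than an inline sample):

```shell
# Extract the client UUID@NID from "refuse reconnection" console messages
# and count how often each client is refused; frequent offenders are the
# likely bad clients. The sample line is copied from the ticket description.
log='Nov 6 10:51:00 oss212 kernel: Lustre: 21280:0:(ldlm_lib.c:874:target_handle_connect()) LARGE01-OST004c: refuse reconnection from 6279e611-9d6b-3d6a-bab4-e76cf925282f@560@gni to 0xffff81043d807a00; still busy with 1 active RPCs'

echo "$log" | grep 'refuse reconnection' \
  | sed -n 's/.*refuse reconnection from \([^ ]*\) to .*/\1/p' \
  | sort | uniq -c | sort -rn
```

      Running the same pipeline over the real syslog on each OSS gives a per-client refusal count, so the NID that keeps being refused stands out.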

      Attachments

      Issue Links

      Activity

            [LU-2429] easy to find bad client

            Hi Bruno,

            Sorry for the delayed update on this. We haven't seen the same problem, nor been able to reproduce it, since the last crash we took...

            ihara Shuichi Ihara (Inactive) added a comment

            Ihara, any news? Can you please provide a status for this ticket?

            bfaccini Bruno Faccini (Inactive) added a comment

            Hello Ihara,
            Any news on this issue?
            Have you been able to apply the work-around and/or get a new crash-dump?
            Bruno.

            bfaccini Bruno Faccini (Inactive) added a comment

            No, I am afraid that only "data=writeback" can be considered a work-around for the problem you encounter. But again, it is only a work-around; we need to understand your problem's root cause, because even running with it you may eventually end up in another hung situation ...

            bfaccini Bruno Faccini (Inactive) added a comment

            Bruno,
            Unfortunately, we couldn't get a crash dump. You need the same jbd2 stack situation, right? If so, we hope to capture one when the same problem happens again.
            Are there any other ideas we can test before we decide to change to data=writeback?

            ihara Shuichi Ihara (Inactive) added a comment

            Ihara, do you think you can take an OSS crash-dump? Even if "data=writeback" seems to be a good work-around candidate and finally works, we need to understand how we get into a situation where the jbd2 thread finds dirty pages to flush when it should not!

            bfaccini Bruno Faccini (Inactive) added a comment
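            For reference, the usual way to take such an OSS crash-dump on a RHEL5-era system is kdump plus a forced panic. The following is a generic sketch, not something specific to this ticket: it assumes kexec-tools is installed and that a `crashkernel=` reservation is already present on the kernel command line.

```shell
# Generic RHEL5 kdump recipe: capture a vmcore from a hung OSS.
# Assumes kexec-tools is installed and crashkernel=... is on the boot line.
chkconfig kdump on
service kdump start              # loads the capture kernel

# When the hang reproduces, force a panic so kdump writes the dump:
echo 1 > /proc/sys/kernel/sysrq  # enable SysRq if it is not already
echo c > /proc/sysrq-trigger     # panic -> vmcore saved under /var/crash/
```

            The resulting vmcore can then be inspected with the crash utility to look at the jbd2 thread's stack.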

            Ihara, it is safe to use data=writeback since Lustre already pushes data to disk before committing, so you already have the ordering guarantee.

            Bruno, the stack trace shows that the jbd2 thread in charge of the commit is waiting for some dirty pages to be flushed, which should never happen on the OSS. The issue is that we wait for the commit with the pages locked, so there is a deadlock between the service threads and the jbd2 thread. Therefore, we should try to understand how we can end up with dirty pages in the page cache.

            johann Johann Lombardi (Inactive) added a comment
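            If the work-around is adopted, one way to make data=writeback persistent on an OST is via tunefs.lustre. This is a hedged sketch: /dev/sdX and /mnt/ost0 are placeholders, and note that --mountfsoptions replaces, rather than appends to, the existing option string, so the current defaults must be repeated in the new value.

```shell
# Hedged sketch: make data=writeback a persistent mount option of an OST.
# /dev/sdX and /mnt/ost0 are placeholders for the real device/mount point.
umount /mnt/ost0

# NOTE: --mountfsoptions REPLACES the current option string, so keep the
# existing defaults (e.g. errors=remount-ro) in the new value.
tunefs.lustre --mountfsoptions="errors=remount-ro,data=writeback" /dev/sdX

mount -t lustre /dev/sdX /mnt/ost0
```

            Checking `tunefs.lustre --print /dev/sdX` before and after the change is a safe way to confirm what the current option string is.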

            BTW, LU-1219 is still waiting for the Alt+SysRq+T logs you provided there!

            It is strange that the SysRq output only shows 11 running task stacks for your 12-core OSS! But this may come from the fact (an option?) that the swapper/idle task stacks are not dumped ...

            I agree with you Johann, task/pid 16413 is the one blocking all the others, but don't you think there could be some issue on the disk/storage/back-end side?

            bfaccini Bruno Faccini (Inactive) added a comment

            Johann,
            With data=writeback on a standard ext3/4 filesystem there is no ordering guarantee (the journal may sometimes commit before the data is flushed). So, is data=writeback safe with Lustre, with no re-ordering even when writeback mode is enabled on an OST/MDT?
            https://bugzilla.lustre.org/show_bug.cgi?id=21406 .. why isn't data=writeback the default option in Lustre even today?

            ihara Shuichi Ihara (Inactive) added a comment

            This might be the same problem? http://jira.whamcloud.com/browse/LU-1219

            Yes, it looks similar.

            Also, data=writeback might help to prevent this kind of problem?

            Yes, although I really would like to understand how we can end up with dirty pages in the inode mapping ...

            johann Johann Lombardi (Inactive) added a comment

            This might be the same problem? http://jira.whamcloud.com/browse/LU-1219
            Also, data=writeback might help to prevent this kind of problem?

            ihara Shuichi Ihara (Inactive) added a comment

            People

              bfaccini Bruno Faccini (Inactive)
              ihara Shuichi Ihara (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: