SCOM alert – Max concurrent API reached

NOTE: while I’m still keeping the current posts live as they still seem to help, currently my focus has changed and new activity moved to the new site iternia.be

EDIT (11/03/2014): 2nd possible cause found for the SCOM alert and added to the article (at the bottom).

If you have a recently patched Operations Manager environment, the current version of the base OS management pack includes new logic to check for problems caused by the maximum number of NTLM or Kerberos PAC password validations a particular server can handle at a time.

Symptoms

Performance issues: these can be very hard to troubleshoot because of the large number of variables in your environment (storage, networking, server hardware, virtualization performance and so on). If your storage engineers, network specialists and Hyper-V or VMware gurus have already run all the tests they can think of, try looking at the following as well (or better: SCOM could already have done it preventively 😉).

Besides performance issues, which are not only difficult to pin down but often subjective as well, you can see some strange application behaviour. I guess just about everyone has seen their Outlook 2007 client freeze for a while without an apparent reason (Outlook 2010 seems to have squashed that bug somehow, I think). Another symptom is applications giving you “failed to authenticate” errors after about 45 seconds, although you are using integrated authentication or you are sure the password you supplied had no typos.

Cause

When a client sends an NTLM authentication request to a server, the server will build a secure channel with a domain controller to forward the challenge and try to authenticate the user. This is called NTLM pass-through authentication. Each server can have only one secure channel with a domain controller. Inside this channel there are 1 or more threads allowed for simultaneous requests. Depending on the type of machine and the version of Windows, the number of concurrent authentication request threads is limited:

  • Windows Server, pre-Windows 2012: 2 concurrent threads
  • Windows Server 2012: 10
  • Windows client: 1
  • Domain controllers, pre-Windows-2012: 1
  • Domain controllers, Windows-2012: 10

If the number of requests is larger than the number of concurrent authentication requests allowed inside a particular channel, the extra requests are queued. This delays the application, so the user is left waiting. If a request has to wait too long (45 seconds), it fails and the client has to try again; for example, a user has to enter their credentials again to log on to Exchange, SharePoint or similar.
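To make the queueing behaviour concrete, here is a minimal Python sketch. It is a toy model and not how Netlogon is actually implemented: the function name `simulate_channel`, the constant service time per request and the burst of ten simultaneous requests are all assumptions made for illustration. Only the slot limit and the 45-second timeout come from the explanation above.

```python
import heapq

def simulate_channel(arrivals, service_time, max_concurrent, timeout=45.0):
    """Toy model of one secure channel with a fixed number of request slots.

    arrivals       -- sorted, non-negative arrival times in seconds (hypothetical workload)
    service_time   -- seconds the domain controller needs per request (assumed constant)
    max_concurrent -- number of requests allowed in the channel at the same time
    Returns (served, timed_out, longest_wait_of_a_served_request).
    """
    # Each heap entry is the time at which one slot becomes free again.
    slot_free_at = [0.0] * max_concurrent
    heapq.heapify(slot_free_at)
    served = timed_out = 0
    max_wait = 0.0
    for arrival in arrivals:
        start = max(arrival, slot_free_at[0])  # earliest moment a slot is available
        wait = start - arrival
        if wait > timeout:
            timed_out += 1  # the request gives up before it ever gets a slot
            continue
        heapq.heapreplace(slot_free_at, start + service_time)
        served += 1
        max_wait = max(max_wait, wait)
    return served, timed_out, max_wait

# Ten requests arrive at once while the DC is (deliberately) very slow.
burst = [0.0] * 10
print(simulate_channel(burst, service_time=30.0, max_concurrent=2))   # (4, 6, 30.0)
print(simulate_channel(burst, service_time=30.0, max_concurrent=10))  # (10, 0, 0.0)
```

With two slots, four requests are served (two of them after a 30-second wait) and six give up; with ten slots nothing has to queue at all.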

Impact

Any application using NTLM can potentially be slowed down by this problem. Exchange is a particularly good example because the “Outlook Anywhere” protocol (RPC over HTTPS) performs a lot of authentications, and users typically keep their mailbox open the entire day, resulting in a lot of authentication requests.

Detecting the problem

Of course SCOM has already told you that something is going on, but you need concrete figures to better understand the situation. Performance Monitor on your domain controllers can do the trick.

  • Open perfmon (start – run – “perfmon.msc”)
  • Add the Netlogon counters. You can view the counters per instance or as a total. This only matters if you have multiple trusted domains or a forest trust with external domains: you’ll get counters for each domain, or you can choose to see only the sum of the values.
  • You’ll get the following view:
[Screenshot: performance monitor – Netlogon]

Semaphore?

To explain the counters, let’s first look at what a semaphore is. In this case the semaphore is the gatekeeper of the secure channel: it hands out the limited number of slots in which authentication requests can be processed at the same time. Requests that cannot get a slot wait in a queue until an earlier request has been answered.

Available counters:

  • Semaphore waiters: the number of threads that are waiting to obtain the semaphore. This means the semaphore is busy with another request and there is a queue buildup. This is a warning signal that performance is possibly deteriorating.
  • Semaphore holders: the number of threads that currently hold the semaphore, i.e. that are actively doing authentication requests. For the total instance this is the sum over all active secure channels. Kind of “the active users”.
  • Semaphore acquires: the total number of times the semaphore has been obtained over the lifetime of a secure channel. If a secure channel is gone (no longer active), its count is reset to 0.
  • Semaphore timeouts: Where the “waiters” was a warning signal, this is a critical check! If connections are in the queue longer than the timeout value, then the request is dropped and the counter is increased. Although the request has failed, the secure channel is still established and new requests can still come in from that channel. Once the secure channel is closed as well because there are no new requests from a client, then the number of timeouts for this particular channel is reset and the semaphore timeouts counter only lists the sum of all the other active channels still present.
  • Average semaphore hold time: the average time a request holds the semaphore, in other words how long it takes to get served. A rising value is also an indicator of performance issues.

If the Semaphore Holders and Semaphore Hold Time are zero, as in this screenshot, there is no need to change the MaxConcurrentApi setting at this time. It can be difficult to spot the problem at the moment it actually occurs, though, so set up a data collector set to log over an extended period and review the results later.
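Once you have a log, the “waiters are a warning, timeouts are critical” rule can be applied mechanically. The sketch below is only an illustration of that rule: the function name and the dictionary keys `waiters` and `timeouts` are my own assumptions (for example, values copied by hand from a perfmon log), not an actual export format.

```python
def assess_netlogon_samples(samples):
    """Classify a series of Netlogon counter samples as ok / warning / critical.

    samples -- list of dicts with the assumed keys 'waiters' and 'timeouts',
               ordered from oldest to newest.
    """
    status = "ok"
    previous_timeouts = None
    for sample in samples:
        # The timeouts counter only ever increases while a channel stays up,
        # so a rise between two samples means requests were dropped in between.
        if previous_timeouts is not None and sample["timeouts"] > previous_timeouts:
            return "critical"
        previous_timeouts = sample["timeouts"]
        # Waiting threads mean a queue is building up.
        if sample["waiters"] > 0:
            status = "warning"
    return status

quiet = [{"waiters": 0, "timeouts": 0}, {"waiters": 0, "timeouts": 0}]
busy = [{"waiters": 0, "timeouts": 3}, {"waiters": 2, "timeouts": 3}]
failing = [{"waiters": 1, "timeouts": 3}, {"waiters": 0, "timeouts": 5}]
print(assess_netlogon_samples(quiet))    # ok
print(assess_netlogon_samples(busy))     # warning
print(assess_netlogon_samples(failing))  # critical
```

Timeouts already present in the first sample are deliberately ignored, because they may date from before the logging started.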

Resolution

The first and preferable option is to take a look at your application server or web proxy and try to switch to Kerberos. When using Kerberos authentication instead of NTLM, the problem does not occur according to Microsoft. However, the alert I got from SCOM actually mentions “Kerberos PAC password validations” as well, so I’m not completely sure about this (I’ll try to get you that info in an update later on). Kerberos is supported by a lot of applications, so switching to it where possible should improve the situation.

When Kerberos is not an option, you could try to increase the default value of MaxConcurrentApi to a higher number. However it is advisable to run perfmon in advance to get a baseline and again afterwards to see if you’re actually improving the situation or not.
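For picking a starting value from such a baseline, Microsoft’s tuning guidance for this setting (KB2688798, as far as I know) uses the perfmon counters directly: (semaphore acquires + semaphore timeouts) × average semaphore hold time ÷ length of the collection period in seconds. A minimal sketch of that calculation follows; the function name and the sample numbers are made up for illustration, and the acquires and timeouts are meant as the increase during the collection period, not the absolute counter values.

```python
import math

def estimate_max_concurrent_api(acquires, timeouts, avg_hold_time_s, collection_length_s):
    """Estimate a MaxConcurrentApi value from one perfmon collection period.

    acquires            -- increase of 'Semaphore Acquires' during the period
    timeouts            -- increase of 'Semaphore Timeouts' during the period
    avg_hold_time_s     -- 'Average Semaphore Hold Time' in seconds
    collection_length_s -- length of the collection period in seconds
    """
    if collection_length_s <= 0:
        raise ValueError("collection length must be positive")
    estimate = (acquires + timeouts) * avg_hold_time_s / collection_length_s
    # Round up: a fractional thread does not exist, and never go below 1.
    return max(1, math.ceil(estimate))

# Hypothetical 90-second sample: 1700 acquires, 100 timeouts, 0.5 s hold time.
print(estimate_max_concurrent_api(1700, 100, 0.5, 90))  # 10
```

Treat the result as a starting point to verify with a second perfmon run, not as a guarantee.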

The setting is located in
HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters\MaxConcurrentApi

You’re allowing more threads to use a particular secure channel, so the backlog should decrease. Remember to restart the machine because this setting is only loaded at startup.
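If you have to roll the value out to several servers, a .reg file can be handy. The sketch below only builds the text of such a file for the key mentioned above; the function name is hypothetical, the value 10 is just an example, and it assumes the setting is stored as a DWORD. Saving the text to a file and importing it on the server is left out.

```python
NETLOGON_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters"
)

def build_reg_file(max_concurrent_api):
    """Return the text of a .reg file that sets MaxConcurrentApi (as a DWORD)."""
    if not isinstance(max_concurrent_api, int) or max_concurrent_api < 1:
        raise ValueError("MaxConcurrentApi must be a positive integer")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{NETLOGON_PARAMETERS_KEY}]",
        # .reg files write DWORD values as eight hexadecimal digits
        f'"MaxConcurrentApi"=dword:{max_concurrent_api:08x}',
        "",
    ]
    return "\r\n".join(lines)

print(build_reg_file(10))
```

Test the result on a single non-critical server first, and keep the perfmon baseline at hand to compare before and after.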

You can also find the NTLM authentication delays and failures in the event log on an application server (not on the domain controller itself). You might have to install the hotfix in KB2654097 first.
After that you should be able to track the following event IDs in the event viewer:

Log name: system
Source: netlogon
Event id: 5816
Event id: 5817
Event id: 5818
Event id: 5819
(see http://support.microsoft.com/kb/2654097 for more information)
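If you export the System log for later review, counting these events is straightforward. The sketch below assumes the export has already been parsed into dictionaries with the keys `source` and `event_id`; those key names and the function name are my own invention, not a real export format.

```python
from collections import Counter

# The Netlogon event IDs listed above.
NETLOGON_EVENT_IDS = {5816, 5817, 5818, 5819}

def count_netlogon_events(events):
    """Count the NTLM delay/failure events per event ID.

    events -- iterable of dicts with the assumed keys 'source' and 'event_id'
    """
    counts = Counter()
    for event in events:
        is_netlogon = event["source"].lower() == "netlogon"
        if is_netlogon and event["event_id"] in NETLOGON_EVENT_IDS:
            counts[event["event_id"]] += 1
    return dict(counts)

sample = [
    {"source": "NETLOGON", "event_id": 5816},
    {"source": "NETLOGON", "event_id": 5816},
    {"source": "NETLOGON", "event_id": 5719},                 # other Netlogon event, ignored
    {"source": "Service Control Manager", "event_id": 5818},  # wrong source, ignored
    {"source": "netlogon", "event_id": 5819},
]
print(count_netlogon_events(sample))  # {5816: 2, 5819: 1}
```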

—————————————————————————————————

EDIT 11/03/2014:

It seems that although my original article is correct, there is another possible cause for the monitoring alert: a bug in the SCOM monitoring agent that caused the alert to go off without there being an actual problem. This bug was squashed in one of the SCOM updates / fixes that Microsoft released in a large batch a few months ago.

—————————————————————————————————

btw: after starting a new initiative called Iternia, we can now assist you with any IT architectural challenges: Iternia.be

About Geert Baeten
IT service architect - cloud infrastructure solutions - datacenter infrastructure solutions - service design / governing processes

7 Responses to SCOM alert – Max concurrent API reached

  1. siddhu says:

    CAUSE:
    When a client sends an NTLM authentication request to a server, the server will build a secure channel with a domain controller to forward the challenge and try to authenticate the user. This is called NTLM pass-through authentication. Each server can have only one secure channel with a domain controller. Inside this channel there are 1 or more threads allowed for simultaneous requests. Depending on the type of machine and the version of Windows, the number of concurrent authentication request threads is limited:

    Your explanation of the cause is not correct. NTLM authentication is for local authentication, and it is encrypted and stored in the SAM (Security Accounts Manager) database.

    The Kerberos authentication protocol provides a mechanism for authentication, and mutual authentication, between a client and a server, or between one server and another server. A server can have only one secure channel with a domain controller to authenticate. It uses the LDAP protocol, as NTLM does. Thanks.

    • geertbaeten says:

      Thanks a lot for your info! I’ll check it out and adjust accordingly

    • geertbaeten says:

      Hello Siddhu,

      I did some checks and I am not sure your explanation for challenging the sentence “NTLM authentication is for local authentication…” is correct. But please challenge me further 😉

      As far as I understand, NTLM is in fact a cross-machine authentication mechanism and is not only used locally.

      On official Microsoft MSDN articles I found:
      —————
      “Windows Challenge/Response (NTLM) is the authentication protocol used on networks that include systems running the Windows operating system and on stand-alone systems.
      The Microsoft Kerberos security package adds greater security than NTLM to systems on a network. Although Microsoft Kerberos is the protocol of choice, NTLM is still supported. NTLM must also be used for logon authentication on stand-alone systems. ”
      —————

      I found the explanation on the following blog very interesting, although it is not first hand information so it might be dangerous to quote. I’m going for it nevertheless:
      http://davenport.sourceforge.net/ntlm.html#whatIsNtlm
      —————
      NTLM is a suite of authentication and session security protocols used in various Microsoft network protocol implementations and supported by the NTLM Security Support Provider (“NTLMSSP”). Originally used for authentication and negotiation of secure DCE/RPC, NTLM is also used throughout Microsoft’s systems as an integrated single sign-on mechanism. It is probably best recognized as part of the “Integrated Windows Authentication” stack for HTTP authentication; however, it is also used in Microsoft implementations of SMTP, POP3, IMAP (all part of Exchange), CIFS/SMB, Telnet, SIP, and possibly others.

      The NTLM Security Support Provider provides authentication, integrity, and confidentiality services within the Window Security Support Provider Interface (SSPI) framework. SSPI specifies a core set of security functionality that is implemented by supporting providers; the NTLMSSP is such a provider. The SSPI specifies, and the NTLMSSP implements, the following core operations:

      Authentication — NTLM provides a challenge-response authentication mechanism, in which clients are able to prove their identities without sending a password to the server.

      Signing — The NTLMSSP provides a means of applying a digital “signature” to a message. This ensures that the signed message has not been modified (either accidentally or intentionally) and that that signing party has knowledge of a shared secret. NTLM implements a symmetric signature scheme (Message Authentication Code, or MAC); that is, a valid signature can only be generated and verified by parties that possess the common shared key.

      Sealing — The NTLMSSP implements a symmetric-key encryption mechanism, which provides message confidentiality. In the case of NTLM, sealing also implies signing (a signed message is not necessarily sealed, but all sealed messages are signed).

      NTLM has been largely supplanted by Kerberos as the authentication protocol of choice for domain-based scenarios. However, Kerberos is a trusted-third-party scheme, and cannot be used in situations where no trusted third party exists; for example, member servers (servers that are not part of a domain), local accounts, and authentication to resources in an untrusted domain. In such scenarios, NTLM continues to be the primary authentication mechanism (and likely will be for a long time).
      —————

  2. Sam says:

    Hi Geert,

    Thanks for the great article, way better than anything that I read on any one of Microsoft’s sites.

    All the DCs in our domain are 2008R2 which according to what you say are, by default, limited to 1 concurrent authentication to any member server. Does that mean that I should raise the Max Concurrent API (in the registry) on all DCs before I raise it on any member server? Or, to ask the same question in different words. What is the point in raising the Max Concurrent API on a member server if all the DCs are limited to 1?
    So far, as needed, I have been raising the Max Concurrent API on member servers and, as far as I can tell in SCOM, seems to work without me doing anything on the DCs so I’m a bit confused.

    Again, thanks a lot.
    Sam

    • geertbaeten says:

      Hello Sam,

      There is only one authentication channel between a client and a server, but inside this channel there can be multiple requests at the same time. The number of allowed simultaneous requests is what the setting adjusts.

      I don’t think you will need to increase the value on the domain controller as it actually already supports higher values for different Windows versions. I believe it is only a requestor problem caused by a preset in each different Windows version, not a requestee problem.

      That is however not something I have tested so can’t be 100% sure.

      Best regards,
      Geert

  3. sal says:

    If you have to raise the “Max Concurrent API” , on member servers, what will be acceptable value you have to set it to? Can it be set to, say, 100? 🙂

    • geertbaeten says:

      Hello sal,

      In theory as long as your domain controller can follow the increased number of simultaneous requests then there is no issue. The queue however has been built into the secure channel for a reason. An increased number of simultaneous requests could have a latency penalty for each request still being processed. So at some point the benefit will reach a tipping point and the decreased queue wait time will be nullified by an increased overhead due to too many simultaneous requests.

      I know it’s a crappy answer but I have to say: increase, but with caution. Keep it as low as possible, so that you eventually only hit the limit very sporadically instead of systematically, and keep an eye on your performance (e.g. set up application monitoring and have a baseline in advance so you can also check for negative effects once you hit a tipping point).

      Make sure your domain controller always has a higher value than the member servers. Otherwise the members will send more simultaneous requests than the DC will accept, which could lead to completely different problems. I’m not sure how the servers would handle this (e.g. with an error, or with a timeout and a retry?).

      Going from 1 or 2 to 100 does not seem like a good idea to me, but I’m not entirely sure. If we’re lucky then the parameter is some legacy implementation that is long overcome by increased hardware performance. But it’s there for a reason so use small increments at a time.

      Regards,
      Geert
