Tuesday, 20 September 2011

Possible Oracle Service Bus/Weblogic Server SSL bug

We have encountered 'Connection reset' exceptions in production environments when the Oracle Service Bus sends web requests to IIS/WCF-based web services over an SSL connection. By capturing packets and decrypting the SSL stream we were able to establish the following chain of events:

  • The OSB opens an SSL connection with the IIS/WCF web service
  • The OSB sends many HTTP/SOAP requests over this connection
  • Eventually the OSB sends a HTTP/SOAP request for which the IIS/WCF web service fails to reply - it appears to be a bug in IIS/WCF (or the application) that occurs after n number of requests are sent over the same SSL connection
  • IIS times out the TCP connection by sending a TCP RST packet to the OSB
  • The OSB re-establishes an SSL connection to the IIS/WCF web service, it then proceeds to send the HTTP header to the web service but fails to send the HTTP payload. Since the IIS/WCF webservice is expecting more data everything halts waiting for more data. After some time IIS closes the connection by sending a TCP RST packet
  • The OSB continues to re-establish the SSL connection and only send the HTTP header, this occurs 5 times in total
  • On the last TCP SSL connection tear down the OSB throws a 'Connection reset' exception
  • Now the OSB re-establishes a SSL connection and continues to send the other requests in the queue - this works as expected until we reach n number of requests for the same SSL connection where the same thing happens again
The following exception is encountered after 5 SSL connection tear downs.
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at weblogic.utils.io.ChunkedInputStream.read(ChunkedInputStream.java:159)
at java.io.InputStream.read(InputStream.java:85)
at com.certicom.tls.record.ReadHandler.readFragment(Unknown Source)
at com.certicom.tls.record.ReadHandler.readRecord(Unknown Source)
at com.certicom.tls.record.ReadHandler.read(Unknown Source)
at com.certicom.io.InputSSLIOStreamWrapper.read(Unknown Source)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at weblogic.net.http.MessageHeader.isHTTP(MessageHeader.java:220)
at weblogic.net.http.MessageHeader.parseHeader(MessageHeader.java:143)
at weblogic.net.http.HttpClient.parseHTTP(HttpClient.java:431)
at weblogic.net.http.HttpURLConnection.getInputStream(HttpURLConnection.java:359)
at weblogic.net.http.SOAPHttpsURLConnection.getInputStream(SOAPHttpsURLConnection.java:37)
at weblogic.net.http.HttpURLConnection.getResponseCode(HttpURLConnection.java:947)
at com.bea.wli.sb.transports.http.HttpOutboundMessageContext.getResponse(HttpOutboundMessageContext.java:574)
at com.bea.wli.sb.transports.http.HttpOutboundMessageContext.access$000(HttpOutboundMessageContext.java:76)
at com.bea.wli.sb.transports.http.HttpOutboundMessageContext$RetrieveHttpResponseWork.handleResponse(HttpOutboundMessageContext.ja\
va:898)
at weblogic.net.http.AsyncResponseHandler$MuxableSocketHTTPAsyncResponse$RunnableCallback.run(AsyncResponseHandler.java:531)
at weblogic.work.ContextWrap.run(ContextWrap.java:41)
at weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:516)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:201)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:173
Our workaround was to set a timeout of 30 seconds in the OSB business service, and a retry count of 1. Since the IIS time out was approximately two minutes this meant that when it failed to respond, the OSB would tear down the connection and try again once more. This meant we never arrived in the state where the OSB was sending the header alone.

8 comments:

  1. I'm having the exact same issue, unfortunately as our WCF service is not idempotent the retry is not a viable solution. Just curious since you posted this if you found any additional information or workarounds about this issue?

    ReplyDelete
  2. I never did any testing on the WCF side so I'm not sure if the message is actually received by the application more than once because I believe the IIS connection died before passing anything to the application. In saying that I'm not terribly familiar with the .net stack.

    I've had some communication with Oracle support but not much luck... they suggested disabling Keep-Alive on HTTP so the connection was closed after every request but this was going to be prohibitively expensive for the number of requests we were doing. Since Oracle support didn't really understand the problem it never got escalated to someone that could actually help...

    FYI I've been looking at the incoming Google search requests and this seems to be a fairly common problem so maybe if you lean on Oracle support they'll do something but I didn't have the patience! Best of luck - please let me know how you get on.

    ReplyDelete
  3. Thanks for the quick reply. I wonder if OSB is sending just the headers because it is using the binary exponential backoff algorithm described in section 8.2.4 of the HTTP specification "Client Behavior if Server Prematurely Closes Connection" as shown here http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html.

    It basically states that if the client sends a request that does not include the Expect header with 100-continue (which OSB does not in this situation) and the client sees the connection close before receiving any status it can use this algorithm for a more reliable response of which the first part of the algorithm is to send the headers only.

    Any thoughts on this?

    ReplyDelete
  4. I've had a dive around through the OSB code (using the handy Java Decompiler) and there seems to be some kind of backoff in terms of retries (albeit linear).

    In saying that, this doesn't seem to be the problem here, rather It seems that there's some corner case where it just sends the header, expects some kind of InputStream and waits before sending off the body. Although I couldn't quite figure out how it all tied together due to the asynchronous nature of it all. I *think* it's a bug in the OSB as this doesn't appear to be expected behaviour based on looking at the code but I'm not sure.

    Have you given Oracle Support a try? You might have more luck than me... it's a bit of a lottery.

    ReplyDelete
  5. Hi David,

    Could it be that the IIS/WCF Web service is taking too long to respond because the response its returning is too big and is taking time to compose the whole document on the WCF side?

    Ramnarayan

    ReplyDelete
  6. I wouldn't think so... it's been a while since I looked at this but if I remember correctly the time out took approximately 2 minutes. This wouldn't be typical of the application trying to form the response, further more, the periodicity of the error seems to imply that it is a protocol or memory error.

    ReplyDelete
  7. Hi David,

    Saw a similar problem with one of our application which is calling a external webservice. By external i mean its a webservice call via ESB. Below are the stack traces of the weblogic thread where they got stuck.

    [STUCK] ExecuteThread: '38' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x03e85760 nid=0x642a runnable [0x4a4eb000..0x4a4ef8f0]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at weblogic.utils.io.ChunkedInputStream.read(ChunkedInputStream.java:159)
    at java.io.InputStream.read(InputStream.java:89)
    at com.certicom.tls.record.ReadHandler.readFragment(Unknown Source)
    at com.certicom.tls.record.ReadHandler.readRecord(Unknown Source)
    at com.certicom.tls.record.ReadHandler.readUntilHandshakeComplete(Unknown Source)
    at com.certicom.tls.interfaceimpl.TLSConnectionImpl.completeHandshake(Unknown Source)
    - locked <0x94e6cdb0> (a com.certicom.tls.interfaceimpl.TLSConnectionImpl)
    at com.certicom.tls.record.WriteHandler.write(Unknown Source)
    at com.certicom.io.OutputSSLIOStreamWrapper.write(Unknown Source)
    at com.certicom.net.ssl.HttpsClient.doHandshake(Unknown Source)
    at com.certicom.net.ssl.internal.HttpURLConnection.getInputStream(Unknown Source)
    - locked <0x94e79fe0> (a com.certicom.net.ssl.HttpsURLConnection)
    at weblogic.webservice.client.https.HttpsURLConnection.getInputStream(HttpsURLConnection.java:231)
    at weblogic.webservice.wsdl.DefinitionFactory.createDefinition(DefinitionFactory.java:113)
    at weblogic.webservice.wsdl.WSDLParser.(WSDLParser.java:50)
    at weblogic.webservice.WebServiceFactory.createFromWSDL(WebServiceFactory.java:109)
    at weblogic.webservice.core.rpc.ServiceImpl.(ServiceImpl.java:94)

    ReplyDelete
  8. Hi,

    This error/stuck thread makes sense given what appears to be happening behind the scenes. It appears Weblogic Server 11g+ has deprecated Certicom and shifted to JSSE and the SSL implementation so hopefully we shouldn't be seeing this problem in the future...

    ReplyDelete