June 2, 2010

Why WS Clients Should RETRY Web Service Exceptions

Almost every developer works with web services, either as a provider or as a client. Web services are usually exposed via SOAP, and these days RESTful web services are common too. One thing all types of web services share is HTTP: it's the base protocol wrapped by SOAP and REST implementations. So all web services are prone to HTTP, Internet and LAN issues, as well as problems on the web service provider's side. This post tries to explain which of these errors should be retried by client code before failing.

Development Mode Approach – A fairy tale !

When we develop client-side code for a web service, we go through flows that rarely fail because of Internet/LAN errors or mistakes on the WS provider's side. So we never think about retrying on such failures, and we also have a feeling that once the solution runs in production on A-grade, super-fast Internet lines, these issues will not be there. But this fairy tale turns into a nightmare when, in production, we get strange errors that are completely unexpected and hard to reproduce and debug.

Realistic Approach – Lesser Production Nightmares !

Life is not a fairy tale, so we should RETRY certain errors on web service calls. This topic is pretty wide and it's not easy to cover every known issue, so I picked common error codes and failure reasons from two popular cloud web service providers, Salesforce and Amazon.

Below is a table that lists these error codes and explains which of them should be retried by web service client code.

| Error | Code | Retry | “Why” or “Why Not” Retry? |
|---|---|---|---|
| Unknown Host | | YES | An unknown host error might come from temporary network issues, so we should wait and try to reconnect. |
| Service Unavailable | 503 | YES | Providers push updates or run maintenance in windows that are usually short and known in advance, so we should wait out the known period and retry. |
| Temporary Redirect | 307 | YES | As the name says, it's a temporary redirect, so we can retry on the same endpoint again. |
| Request Timeout | 400 | YES | A request can time out because of network issues or because you are querying too much data from the web service. If the failure is because of network issues, we should retry; otherwise, tune the web service request to reduce the queried data. |
| Internal Errors at Server Side | 500 | YES | Whether to RETRY or not depends on the web service provider. Many providers, like Amazon, document which internal errors to retry. For others without such documentation, we should wait for a while and then retry. |
| Conflicting Request | 409 | NO | We should optimize the client code to ensure proper locking, so that multiple threads don't race for the same resource. |
| Slowdown | 503 | NO | We should fix the client to slow down or queue requests. Another cool option offered by many providers is the ability to batch multiple requests, so those options should be tried on the client side. |
| Token Refresh/Expire | 400 | YES | Most web service providers give a login token in the form of keys, session IDs, etc. Sometimes these tokens have a limited life, so client code should try renewing them on such errors. |
| Bad Request/Digest, Incomplete Body, Invalid Argument, Malformed XML, Malformed POST Request, Missing Content Length | 400 | NO | Client code should be fixed to form correct requests. |
| Access Denied | 403 | NO | Try different credentials. |
| Wrong End Point, PermanentRedirect | 301 / 400 | NO | Client code is calling an incorrect endpoint. The endpoint URL should be fixed. |
| MethodNotAllowed | 405 | NO | Client code should be fixed. |
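
The YES and NO decisions in the table above could be wired into client code roughly as follows. This is only a minimal sketch, not a provider API: `WebServiceError`, the error names in `RETRYABLE_ERRORS`, `ExpiredToken`, `call_with_retry` and the `renew_token` callback are all hypothetical names chosen for illustration, and the doubling delay between attempts is just one possible reading of "wait for a while and retry".

```python
import time


class WebServiceError(Exception):
    """Hypothetical exception that carries a provider error name."""

    def __init__(self, error_name):
        super().__init__(error_name)
        self.error_name = error_name


# Illustrative names for rows marked YES in the table (assumed, not provider constants).
RETRYABLE_ERRORS = {
    "UnknownHost",
    "ServiceUnavailable",
    "TemporaryRedirect",
    "RequestTimeout",
    "InternalError",
}
TOKEN_EXPIRED = "ExpiredToken"  # assumed name for the Token Refresh/Expire row


def call_with_retry(operation, renew_token=None, max_attempts=4,
                    base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying only the errors listed as retryable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except WebServiceError as err:
            last_attempt = attempt == max_attempts
            if err.error_name == TOKEN_EXPIRED and renew_token and not last_attempt:
                renew_token()  # fetch a fresh token, then retry right away
                continue
            if err.error_name not in RETRYABLE_ERRORS or last_attempt:
                raise  # fail fast: the client code or credentials need fixing
            # Wait longer after each failure: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting. Errors from the NO rows, such as an access denied error, fall through to `raise` on the first attempt.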

In general, one can also follow some simple rules based on HTTP status codes. These rules do not apply to every web service provider, but most of the time they lead to the right RETRY-or-not decision. The table below explains them.

| HTTP Status Code | Meaning | Retry? | “Why” or “Why Not” Retry? |
|---|---|---|---|
| 307 | Temporary Redirect | YES | Retry after a while. |
| 400 | Bad Request | NO | Fix the client-side code to form the request correctly. |
| 403 | Forbidden | NO | Fix the credentials in the client-side code. |
| 405 | Method Not Allowed | NO | Fix the HTTP call to use the correct method. |
| 409 | Conflict | YES | The client is racing for the same resource. Add locking in client code so that it does not request conflicting actions on the same resource. If locking is not possible, wait for a while and retry. |
| 411 | Length Required | NO | The client must provide the Content-Length HTTP header. |
| 500 | Internal Server Error | YES | The client-side code is correct; something on the web service provider's side failed. So retry after a while. |
| 501 | Not Implemented | NO | The client is trying to use functionality that is not yet implemented. |
| 503 | Service Unavailable | YES | You may retry here if the service is down for a while, for example for maintenance. |
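
As a rough illustration, the status code rules above fit in a small lookup. The names `RETRY_BY_STATUS` and `should_retry` are hypothetical, and treating unlisted codes as "do not retry" is my assumption here rather than a rule from any provider.

```python
# Retry decision per HTTP status code, copied from the table above.
RETRY_BY_STATUS = {
    307: True,   # temporary redirect
    400: False,  # bad request: fix the client
    403: False,  # forbidden: fix the credentials
    405: False,  # method not allowed: fix the HTTP method
    409: True,   # conflict: lock if possible, otherwise wait and retry
    411: False,  # length required: send Content-Length
    500: True,   # internal server error
    501: False,  # not implemented
    503: True,   # service unavailable
}


def should_retry(status_code, default=False):
    """Return True if the table suggests retrying this status code."""
    # Codes missing from the table fall back to `default` (an assumption).
    return RETRY_BY_STATUS.get(status_code, default)
```

A client could call `should_retry(response_status)` after a failed request and only enter its wait-and-retry path when the result is true.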

Open Source Project Coming Soon !

I am about to release an open source project for Salesforce and Amazon that helps you write retryable client-side code easily. Stay connected, I will post updates.