Friday, September 07, 2012

Diagnosing SOA Suite issues



What tools are available for diagnosing SOA Suite issues?
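There are a variety of tools available to help you and Support diagnose SOA Suite issues in 11g, but it can be confusing as to which tool is appropriate for a particular situation and how the tools relate to one another. This blog post introduces the various tools and attempts to clarify what each is for and how they are related. Let's first list the tools we'll be addressing:
  • RDA (Remote Diagnostic Agent)
  • DFW (Diagnostic Framework)
  • Selective Tracing
  • DMS (Dynamic Monitoring Service)
  • ODL (Oracle Diagnostic Logging)
  • ADR (Automatic Diagnostics Repository)
  • ADRCI (Automatic Diagnostics Repository Command Interpreter)
  • WLDF (WebLogic Diagnostic Framework)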

This overview is not meant to be a comprehensive guide to using all of these tools; however, extensive reference materials are included that provide many more details on their execution. Note also that all of these tools are applicable to Fusion Middleware as a whole, but specific products may or may not have implemented features to leverage them.

A couple of the tools have a WebLogic Scripting Tool or 'WLST' interface. WLST is a command interface for executing pre-built functions and custom scripts against a domain. A detailed WLST tutorial is beyond the scope of this post but you can find general information here. There are more specific resources in the sections below.

In this post when we refer to 'Enterprise Manager' or 'EM' we are referring to Enterprise Manager Fusion Middleware Control. 


RDA (Remote Diagnostic Agent)

RDA is a standalone tool that is used to collect both static configuration and dynamic runtime information from the SOA environment. RDA is generally run manually from the command line against a domain or single server. When opening a new Service Request, including an RDA collection can dramatically decrease the back and forth required to collect logs and configuration information for Support. 

After installing RDA you configure it to use the SOA Suite module as described in the referenced resources. The SOA module includes the Oracle WebLogic Server (WLS) module by default in order to include all of the relevant information for the environment. In addition to this basic configuration there is also an advanced mode where you can set the number of thread dumps for the collections, log files, Incidents, etc.

When would you use it? 
When creating a Service Request or otherwise working with Oracle resources on an issue, capturing environment snapshots to baseline your configuration or to diagnose an issue on your own. 

How is it related to the other tools? 
RDA is related to DFW in that it collects the last 10 Incidents from the server by default. In a similar manner, RDA is related to ODL through its collection of the diagnostic logs, and these may contain information from Selective Tracing sessions.

Examples of what it currently collects: (for details please see the links in the Resources section)
  • Diagnostic Logs (ODL)
  • Diagnostic Framework Incidents (DFW)
  • SOA MDS Deployment Descriptors
  • SOA Repository Summary Statistics
  • Thread Dumps
  • Complete Domain Configuration

RDA Resources:


DFW (Diagnostic Framework)

DFW provides the ability to collect specific information for a particular problem when that problem occurs. DFW is included with your SOA Suite installation and deployed to the domain. Let's define the components of DFW.
  • Diagnostic Dumps: Specific diagnostic collections that are defined at either the 'system' or product level. Examples would be diagnostic logs or thread dumps.
  • Incident: A collection of Diagnostic Dumps associated with a particular problem.
  • Log Conditions: An Oracle Diagnostic Logging event that DFW is configured to listen for. If the event is identified then an Incident will be created.
  • WLDF Watch: The WebLogic Diagnostic Framework or 'WLDF' is not a component of DFW; however, it can be a source of DFW Incident creation through the use of a 'Watch'.
  • WLDF Notification: A Notification is a component of WLDF and is the link between the Watch and DFW. You can configure multiple Notification types in WLDF and associate them with your Watches. 'FMWDFW-notification' is available to you out of the box to allow for DFW notification of Watch execution.
  • Rule: Defines a WLDF Watch or Log Condition for which we want to associate a set of Diagnostic Dumps. When triggered, the specified dumps will be collected and added to the Incident.
  • Rule Action: Defines the specific Diagnostic Dumps to collect for a particular rule.
  • ADR: Automatic Diagnostics Repository. Defined for every server in a domain. This is where Incidents are stored.

Now let's walk through a simple flow: 
  1. Oracle Web Services error message OWS-04086 (SOAP Fault) is generated on managed server 1
  2. DFW Log Condition for OWS-04086 evaluates to TRUE
  3. DFW creates a new Incident in the ADR for managed server 1
  4. DFW executes the specified Diagnostic Dumps and adds the output to the Incident
  5. In this case we'll grab the diagnostic log and thread dump. We might also want to collect the WSDL binding information and SOA audit trail
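
The flow above can be sketched as a small toy model. This is only an illustration of the Log Condition, Rule Action and Incident relationship; the class, method and dump names (`ToyDiagnosticFramework`, `diagnostic_log`, `thread_dump`, `managed_server_1`) are hypothetical and are not part of any DFW API.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """A collection of Diagnostic Dump outputs for one problem."""
    incident_id: int
    server: str
    message_id: str
    dumps: dict = field(default_factory=dict)


class ToyDiagnosticFramework:
    """Toy stand-in for DFW; none of these names are real DFW APIs."""

    def __init__(self):
        self.dump_functions = {}  # dump name -> callable(server) returning text
        self.rules = {}           # message id -> dump names to collect
        self.adr = {}             # server name -> list of stored Incidents
        self._next_id = 1

    def register_dump(self, name, func):
        self.dump_functions[name] = func

    def add_log_condition(self, message_id, dump_names):
        # The list of dump names plays the role of the Rule Action.
        self.rules[message_id] = list(dump_names)

    def on_log_event(self, server, message_id):
        """Create an Incident if a Log Condition matches, else return None."""
        dump_names = self.rules.get(message_id)
        if dump_names is None:
            return None
        incident = Incident(self._next_id, server, message_id)
        self._next_id += 1
        for name in dump_names:
            incident.dumps[name] = self.dump_functions[name](server)
        # Each server has its own repository of Incidents.
        self.adr.setdefault(server, []).append(incident)
        return incident


dfw = ToyDiagnosticFramework()
dfw.register_dump("diagnostic_log", lambda server: "log contents for " + server)
dfw.register_dump("thread_dump", lambda server: "thread dump for " + server)
dfw.add_log_condition("OWS-04086", ["diagnostic_log", "thread_dump"])

incident = dfw.on_log_event("managed_server_1", "OWS-04086")
ignored = dfw.on_log_event("managed_server_1", "SOME-OTHER-ID")
```

Only the message ID that has a Log Condition produces an Incident; the second event is ignored because no rule is registered for it.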

When would you use it? 
When you want to automatically collect Diagnostic Dumps at a particular time using a trigger or when you want to manually collect the information. In either case it can be readily uploaded to Oracle Support through the Service Request. 

How is it related to the other tools? 
DFW generates Incidents which are collections of Diagnostic Dumps. One of the system level Diagnostic Dumps collects the current server diagnostic log which is generated by ODL and can contain information from Selective Tracing sessions. Incidents are included in RDA collections by default and ADRCI is a tool that is used to package an Incident for upload to Oracle Support. In addition, both ODL and DMS can be used to trigger Incident creation through DFW.

The conditions and rules for generating Incidents can become quite complicated and the below resources go into more detail. A simpler approach to leveraging at least the Diagnostic Dumps is through WLST (WebLogic Scripting Tool) where there are commands to do the following:
  • Create an Incident
  • Execute a single Diagnostic Dump
  • Describe a Diagnostic Dump
  • List the available Diagnostic Dumps
The WLST option offers greater control over what is generated and when. It can be a great help when collecting information for Support. There are overlaps with RDA; however, DFW is geared towards collecting specific runtime information when an issue occurs, while RDA collects existing Incidents.

There are 3 WLDF Watches configured by default in a SOA Suite 11g domain: Stuck Threads, Unchecked Exception and Deadlock. These Watches are enabled by default and will generate Incidents in ADR. They are configured to reset automatically after 30 seconds so they have the potential to create multiple Incidents if these conditions persist. The Incidents generated by these Watches will only contain System level Diagnostic Dumps. These same System level Diagnostic Dumps will be included in any application scoped Incident as well.

Starting in 11.1.1.6, SOA Suite includes its own set of application scoped Diagnostic Dumps that can be executed from WLST or through a WLDF Watch or Log Condition. These Diagnostic Dumps can be added to an Incident such as in the earlier example using the error code OWS-04086.
  • soa.config: MDS configuration files and deployed-composites.xml
  • soa.composite: All artifacts related to the deployed composite
  • soa.wsdl: Summary of endpoints configured for the composite
  • soa.edn: EDN configuration summary if applicable
  • soa.db: Summary DB information for the SOA repository
  • soa.env: Coherence cluster configuration summary
  • soa.composite.trail: Partial audit trail information for the running composite
The current release of RDA has the option to collect the soa.wsdl and soa.composite Diagnostic Dumps. More Diagnostic Dumps for SOA Suite products are planned for future releases along with enhancements to DFW itself. 

DFW Resources:


Selective Tracing

Selective Tracing is a facility available starting in version 11.1.1.4 that allows you to increase the logging level for specific loggers and for a specific context. What this means is that you have greater capability to collect needed diagnostic log information in a production environment with reduced overhead. For example, a Selective Tracing session can be executed that only increases the log level for one composite, only one logger, limited to one server in the cluster and for a preset period of time. In an environment where dozens of composites are deployed this can dramatically reduce the volume and overhead of the logging without sacrificing relevance. 

Selective Tracing can be administered either from Enterprise Manager or through WLST. WLST provides a bit more flexibility in terms of exactly where the tracing is run. 

When would you use it? 
When there is an issue in production or another environment that lends itself to filtering by one of the available context criteria, and increasing the log level globally would result in too much overhead or irrelevant information. The information is written to the server diagnostic log and is exportable from Enterprise Manager.

How is it related to the other tools? 
Selective Tracing output is written to the server diagnostic log. This log can be collected by a system level Diagnostic Dump using DFW or through a default RDA collection. Selective Tracing also heavily leverages ODL fields to determine what to trace and to tag information that is part of a particular tracing session. 

Available Context Criteria:
  • Application Name
  • Client Address
  • Client Host
  • Composite Name
  • User Name
  • Web Service Name
  • Web Service Port
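
To make the idea concrete, here is a rough analogy using Python's standard `logging` module rather than ODL or WLST. It is not how Selective Tracing is implemented; the `SelectiveTraceFilter` class, the logger name and the use of `composite_name` as a record attribute are assumptions made for illustration. Low-level records pass only when they match the chosen context, while warnings and errors always pass.

```python
import logging


class SelectiveTraceFilter(logging.Filter):
    """Pass low-level records only for one composite (illustrative analogy)."""

    def __init__(self, composite_name, base_level=logging.WARNING):
        super().__init__()
        self.composite_name = composite_name
        self.base_level = base_level

    def filter(self, record):
        # Records at or above the normal level are always kept.
        if record.levelno >= self.base_level:
            return True
        # Lower-level records are kept only for the traced composite.
        return getattr(record, "composite_name", None) == self.composite_name


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def demo():
    logger = logging.getLogger("toy.soa.engine")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(SelectiveTraceFilter("TestProject2"))
    logger.addHandler(handler)
    try:
        logger.debug("traced detail", extra={"composite_name": "TestProject2"})
        logger.debug("untraced detail", extra={"composite_name": "OtherComposite"})
        logger.error("always logged", extra={"composite_name": "OtherComposite"})
    finally:
        logger.removeHandler(handler)
    return handler.messages


messages = demo()
```

The debug record for the other composite is dropped, which mirrors how a tracing session limited to one composite keeps log volume down when many composites are deployed.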

Selective Tracing Resources:


DMS (Dynamic Monitoring Service)

DMS exposes runtime information for monitoring. This information can be monitored in two ways:
  1. Through the DMS servlet
  2. As exposed MBeans
The servlet is deployed by default and can be accessed through http://<host>:<port>/dms/Spy (use administrative credentials to access). The landing page of the servlet shows identical columns of what are known as Noun Types. If you select a Noun Type you will see a table in the right frame that shows the attributes (Sensors) for the Noun Type and the available instances. SOA Suite has several exposed Noun Types that are available for viewing through the Spy servlet. Screenshots of the Spy servlet are available in the Knowledge Base article How to Monitor Runtime SOA Performance With the Dynamic Monitoring Service (DMS).

Every Noun instance in the runtime is exposed as an MBean instance. As such they are generally available through an MBean browser and available for monitoring through WLDF. You can configure a WLDF Watch to monitor a particular attribute and fire a notification when the threshold is exceeded. A WLDF Watch can use the out of the box DFW notification type to notify DFW to create an Incident. 

When would you use it? 
When you want to monitor a metric or set of metrics either manually or through an automated system. When you want to trigger a WLDF Watch based on a metric exposed through DMS. 

How is it related to the other tools? 
DMS metrics can be monitored with WLDF Watches which can in turn notify DFW to create an Incident. 

DMS Resources:


ODL (Oracle Diagnostic Logging)

ODL is the primary facility for most Fusion Middleware applications to log what they are doing. Whenever you change a logging level through Enterprise Manager it is ultimately exposed through ODL and written to the server diagnostic log. A notable exception to this is WebLogic Server which uses its own log format / file. 

ODL logs entries in a consistent, structured way using predefined fields and name/value pairs. Here's an example of a SOA Suite entry: 

[2012-04-25T12:49:28.083-06:00] [AdminServer] [ERROR] [] [oracle.soa.bpel.engine] [tid: [ACTIVE].ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: ] [ecid: 0963fdde7e77631c:-31a6431d:136eaa46cda:-8000-00000000000000b4,0] [errid: 41] [WEBSERVICE_PORT.name: BPELProcess2_pt] [APP: soa-infra] [composite_name: TestProject2] [J2EE_MODULE.name: fabric] [WEBSERVICE.name: bpelprocess1_client_ep] [J2EE_APP.name: soa-infra] Error occured while handling a post operation[[
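
Because the entry is structured, it is straightforward to pull apart with a script. The following is a minimal sketch, assuming the layout seen in the entry above: five positional bracketed fields, then bracketed name/value pairs, then the message text. The function names and dictionary keys are my own, and a message that itself begins with '[' would not be handled correctly.

```python
POSITIONAL = ("timestamp", "component", "level", "message_id", "module")


def split_fields(line):
    """Return (list of bracketed field contents, trailing message text)."""
    fields = []
    pos = 0
    n = len(line)
    while pos < n:
        while pos < n and line[pos] == " ":
            pos += 1
        if pos >= n or line[pos] != "[":
            break
        # Walk to the matching close bracket; fields such as tid nest brackets.
        start = pos
        depth = 0
        while pos < n:
            if line[pos] == "[":
                depth += 1
            elif line[pos] == "]":
                depth -= 1
                if depth == 0:
                    break
            pos += 1
        if depth != 0:
            # Unbalanced brackets: treat the remainder as message text.
            pos = start
            break
        fields.append(line[start + 1:pos])
        pos += 1
    return fields, line[pos:].strip()


def parse_odl_entry(line):
    """Parse one ODL text entry into a dictionary."""
    fields, message = split_fields(line)
    entry = dict(zip(POSITIONAL, fields[:len(POSITIONAL)]))
    attributes = {}
    for item in fields[len(POSITIONAL):]:
        name, _, value = item.partition(":")
        attributes[name.strip()] = value.strip()
    entry["attributes"] = attributes
    entry["message"] = message
    return entry


SAMPLE = (
    "[2012-04-25T12:49:28.083-06:00] [AdminServer] [ERROR] [] "
    "[oracle.soa.bpel.engine] "
    "[tid: [ACTIVE].ExecuteThread: '1' for queue: "
    "'weblogic.kernel.Default (self-tuning)'] [userId: ] "
    "[ecid: 0963fdde7e77631c:-31a6431d:136eaa46cda:-8000-00000000000000b4,0] "
    "[errid: 41] [WEBSERVICE_PORT.name: BPELProcess2_pt] [APP: soa-infra] "
    "[composite_name: TestProject2] [J2EE_MODULE.name: fabric] "
    "[WEBSERVICE.name: bpelprocess1_client_ep] [J2EE_APP.name: soa-infra] "
    "Error occured while handling a post operation[["
)

entry = parse_odl_entry(SAMPLE)
```

With the entry parsed, fields such as `entry["attributes"]["composite_name"]` or `entry["attributes"]["ecid"]` can be used to filter a diagnostic log down to one composite or one request.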

When would you use it?
You'll use ODL almost every time you want to identify and diagnose a problem in the environment. The entries are written to the server diagnostic log.

How is it related to the other tools?
The server diagnostic logs are collected by DFW and RDA. Selective Tracing writes its information to the diagnostic log as well. Additionally, DFW log conditions are triggered by ODL log events.

ODL Resources:


ADR (Automatic Diagnostics Repository)

ADR is not a tool in and of itself but is where DFW stores the Incidents it creates. Every server in the domain has an ADR location which can be found under /adr. This is referred to as the ADR 'Base' location. ADR also has what are known as 'Home' locations. Example:
  • You have a domain called 'myDomain' and an associated managed server called 'myServer'. Your admin server is called 'AdminServer'.
  • Your domain home directory is called 'myDomain' and it contains a 'servers' directory.
  • The 'servers' directory contains a directory for the managed server called 'myServer' and here is where you'll find the 'adr' directory which is the ADR 'Base' location for myServer.
  • To get to the ADR 'Home' locations we drill through a few levels: diag/ofm/myDomain/
  • In an 11.1.1.6 SOA Suite domain you will see 2 directories here, 'myServer' and 'soa-infra'. These are the ADR 'Home' locations.
  • 'myServer' is the 'system' ADR home and contains system level Incidents.
  • 'soa-infra' is the name that SOA Suite used to register with DFW and this ADR home contains SOA Suite related Incidents.
  • Each ADR home location contains a series of directories, one of which is called 'incident'. This is where your Incidents are stored.
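
The layout described above lends itself to a quick check of how many Incidents each ADR home holds. Here is a minimal sketch, assuming only the directory structure from the example (ADR base, then diag/ofm/ plus the domain name, then one directory per ADR home with an 'incident' directory inside). The function name is hypothetical, and the script only reads directory names.

```python
from pathlib import Path


def list_incidents(adr_base, domain_name):
    """Return {ADR home name: sorted Incident directory names} for one ADR base."""
    homes_root = Path(adr_base) / "diag" / "ofm" / domain_name
    result = {}
    if not homes_root.is_dir():
        return result
    for home in sorted(p for p in homes_root.iterdir() if p.is_dir()):
        incident_dir = home / "incident"
        if incident_dir.is_dir():
            # Each subdirectory of 'incident' holds one Incident.
            result[home.name] = sorted(
                p.name for p in incident_dir.iterdir() if p.is_dir()
            )
        else:
            result[home.name] = []
    return result


if __name__ == "__main__":
    # Path follows the example: domain home, 'servers', server name, 'adr'.
    found = list_incidents("myDomain/servers/myServer/adr", "myDomain")
    for home_name, incidents in found.items():
        print(home_name, len(incidents))
```

Run periodically, something like this gives an early warning when a Watch or Log Condition is generating Incidents faster than expected.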

When would you use it?
It's a good idea to check on these locations from time to time to see whether a lot of Incidents are being generated. They can be cleaned out by deleting the Incident directories or through the ADRCI tool. If you know that an Incident is of particular interest for an issue you're working with Oracle you can simply zip it up and provide it.

How does it relate to the other tools?
ADR is obviously very important for DFW since it's where the Incidents are stored. Incidents contain Diagnostic Dumps that may relate to diagnostic logs (ODL) and DMS metrics. The most recent 10 Incident directories are collected by RDA by default and ADRCI relies on the ADR locations to help manage the contents.


ADRCI (Automatic Diagnostics Repository Command Interpreter)

ADRCI is a command line tool for packaging and managing Incidents.

When would you use it?
When purging Incidents from an ADR Home location or when you want to package an Incident along with an offline RDA collection for upload to Oracle Support.

How does it relate to the other tools?
ADRCI contains a tool called the Incident Packaging System or IPS. This is used to package an Incident for upload to Oracle Support through a Service Request. Starting in 11.1.1.6 IPS will attempt to collect an offline RDA collection and include it with the Incident package. This will only work if Perl is available on the path, otherwise it will give a warning and package only the Incident files.

ADRCI Resources:


WLDF (WebLogic Diagnostic Framework)

WLDF is functionality available in WebLogic Server since version 9. Starting with FMW 11g a link has been added between WLDF and the pre-existing DFW: the WLDF Watch Notification. Let's take a closer look at the flow:
  1. There is a need to monitor the performance of your SOA Suite message processing
  2. A WLDF Watch is created in the WLS console that will trigger if the average message processing time exceeds 2 seconds. This metric is monitored through a DMS MBean instance.
  3. The out of the box DFW Notification (the Notification is called FMWDFW-notification) is added to the Watch. Under the covers this notification is of type JMX.
  4. The Watch is triggered when the threshold is exceeded and fires the Notification.
  5. DFW has a listener that picks up the Notification and evaluates it according to its rules, etc.
When it comes to automatic Incident creation, WLDF is a key component with capabilities that will grow over time.
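
The Watch, Notification and listener chain can be sketched in a few lines. This is a toy model only: `ToyWatch`, `dfw_listener` and the Watch name are hypothetical, real Watches are configured in the WLS console rather than in code like this, and the 2 second threshold is taken from the flow above.

```python
class ToyWatch:
    """Toy stand-in for a WLDF Watch on an average processing time metric."""

    def __init__(self, name, threshold_seconds):
        self.name = name
        self.threshold_seconds = threshold_seconds
        self.notifications = []

    def add_notification(self, callback):
        self.notifications.append(callback)

    def evaluate(self, processing_times):
        """Fire all Notifications if the average exceeds the threshold."""
        if not processing_times:
            return False
        average = sum(processing_times) / len(processing_times)
        if average <= self.threshold_seconds:
            return False
        for notify in self.notifications:
            notify(self.name, average)
        return True


created_incidents = []


def dfw_listener(watch_name, observed_value):
    # Stands in for DFW picking up the Notification and creating an Incident.
    created_incidents.append({"watch": watch_name, "value": observed_value})


watch = ToyWatch("SlowMessageProcessing", threshold_seconds=2.0)
watch.add_notification(dfw_listener)

quiet = watch.evaluate([0.5, 1.0, 1.5])  # average is below the threshold
fired = watch.evaluate([1.0, 3.0, 5.0])  # average is above the threshold
```

The Watch itself knows nothing about Incidents; it only fires whatever Notifications are attached, which is why the same Watch can feed DFW or any other Notification type.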

When would you use it?
When you want to monitor the WLS server log or an MBean metric for some condition and fire a notification when the Watch is triggered.

How does it relate to the other tools?
WLDF is used to automatically trigger Incident creation through DFW using the DFW Notification.

WLDF Resources: