OPMN and process monitoring
Oracle Process Manager (PM) is the centralized process-management mechanism in Oracle Application Server and is used to manage Oracle AS processes: it starts, stops and restarts them, and detects when they die. The Oracle AS processes that PM is configured to manage are specified in the opmn.xml file. The start and stop functions are executed by OPMN in response to user commands.
OPMN (Oracle Process Manager and Notification server) is a set of processes that manage mid-tier Application Server components such as the Oracle HTTP Server (Apache) and the OC4J containers.
OPMN consists of the Process Manager and the Notification Server.
Oracle Notification Server (ONS) is the transport mechanism for failure, recovery, startup and other related notifications between components in Oracle Application Server. It operates on a publish-subscribe model: an Oracle AS component subscribes with ONS to notifications of a certain type, and when such a notification is published, ONS delivers it to the appropriate subscribers.
Four parameters determine the behavior of the Oracle Process Manager and Notification server process in managing the iAS middle tier, which comprises the OC4J instances and the Apache HTTP server. They are:
a) restart-on-death
b) ping timeout
c) ping interval
d) reverse-ping timeout
The settings for these parameters need to be governed by the heap sizing of the OC4J container JVMs, the latencies involved with garbage collection algorithms and the response times of the HTTP server.
The way in which the parameters affect the functioning of OPMN is as follows:
- OPMN periodically pings the processes it manages, i.e., the OC4J instances and the HTTP server, and expects a response within a certain timeout period. OPMN tries this three times before it declares the process dead (due to no response), after which it kills and restarts that particular process. In this way OPMN manages the failover and availability of the processes under its control.
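As a rough illustration of where these knobs live, the fragment below sketches the per-process ping and restart configuration in opmn.xml using the 10g-style <process-type> syntax. The <ping> and restart-on-death names shown here are assumptions based on that syntax; element and attribute names differ between releases, so verify them against your own opmn.xml and the OPMN documentation before changing anything.
<ias-component id="OC4J">
  <process-type id="home" module-id="OC4J" restart-on-death="true">
    <!-- ping every 30 seconds, allow 60 seconds for a reply, retry 3 times
         before declaring the process dead and restarting it -->
    <ping interval="30" timeout="60" retry="3"/>
    <!-- how long OPMN waits for the restarted process to come back up -->
    <restart timeout="720" retry="2"/>
  </process-type>
</ias-component>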
Reasons for the Oracle HTTP Server not responding to an OPMN ping could be:
- a high load of concurrent requests that demands the attention of many HTTP server processes
- timeouts between the various modules of Apache and the servers servicing client requests
- bulk data received from any of the modules
- new connections not getting processed because the server is fully occupied with all the child processes it can spawn, with many connections remaining in the CLOSE_WAIT state
- thrashing, when NFS hiccups make files that need to be served unavailable to the server
- synchronization issues with the various mutexes that the server needs to support for the proper functioning of its modules
Reasons for the OC4J containers not responding to an OPMN ping could be:
While the container is processing servlet or EJB logic within the JVM in which it runs, new objects are continually created in its heap memory area. When the garbage collection thread runs, it looks for objects whose memory it can release back to the heap pool, based on several algorithms that depend on the kind of references held to those objects. Since the collection is "generational", objects that are still referenced survive and are promoted to an older generation, on the presumption that they will have a longer lifetime, while objects that are no longer strongly referenced are candidates for cleaning up, and the memory they occupy is released back to the global heap of the JVM.
In this way, memory is reclaimed back into the heap memory pool and made available for the creation of newer objects. The forays made by the garbage collector to reclaim heap memory are governed by several algorithms, and every such collection takes a finite amount of time during which no application processing is possible. When the collection covers the entire heap, the full GC consists of mark/sweep/compact cycles that "mark" the objects still in use, "sweep" away the rest and reclaim their memory into the corresponding generations, and "compact" the holes created when the memory is reclaimed, so as to create contiguous memory for future object creation.
These full collections consume more time, as is to be expected, and can delay the container's response to an OPMN ping cycle. During such full GC scans, OPMN can and will kill and restart the container, causing it to lose the state of the application or request it was processing at that time.
Since full GC scans (referred to as stop-the-world scans) can happen at any time during the lifetime of a request or an application, there is always the danger of OPMN killing a perfectly functioning container on the assumption that it was "hung" because it was "unresponsive".
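One way to judge whether the ping timeout is tight enough to trip over these pauses is to measure how long the full collections actually take. As a sketch, the standard HotSpot GC-logging options can be added to the OC4J instance's Java options (shown here in the same <java-option> form used later in this article, which is an assumption about your opmn.xml layout); -verbose:gc works on all JDKs of that era, while the -XX flags need JDK 1.4 or later. The pause times written to the OC4J log can then be compared against the OPMN ping timeout.
<java-option value="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"/>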
Heap memory settings for OC4J containers
Changes are made in the opmn.xml file, in the <java-option> sections for each OC4J instance: -Xms (for the starting heap size) and -Xmx (for the maximum heap size).
The recommended setting for the -Xmx value is 512MB, since experience shows that typical applications need that much memory to avoid java.lang.OutOfMemoryError exceptions. Start with an -Xms value of 128MB: with a higher initial heap, garbage collection kicks in later, so open file handles are not released by the GC as promptly, which can produce "Too many open files" errors.
<java-option value="-server -Xms128M -Xmx512M"/>
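For context, these options sit inside the OC4J instance's entry in opmn.xml. The fragment below is only a sketch of an older-style opmn.xml layout that matches the <java-option> and numProcs names used in this article; the instance name and surrounding elements are placeholders, so check your own file for the exact structure.
<oc4j instanceName="home" gid="home" numProcs="1">
  <java-option value="-server -Xms128M -Xmx512M"/>
  ...
</oc4j>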
Thread pool sizing
In the server.xml file, set the sizes as follows for optimum operation of the thread pool:
<global-thread-pool
min="40" max="40" queue="80"
keepAlive="-1"/>
This sets the min and max thread-pool sizes to the same value and the keepAlive parameter to "-1", which is recommended for production environments: idle threads are never destroyed, so threads can be reused without the overhead of creating new ones. The min, max and queue values can be left as specified here.
Redundancy and load balancing
More than one OC4J instance can be started to accommodate a higher volume of concurrent requests than a single container can handle. This is set through the "numProcs" parameter in the opmn.xml file; the parameter defaults to 1, which starts a single OC4J instance. For multiple instances, "numProcs" can be set to a higher value (2 for two instances, and so on), and PM needs to be restarted for the new value to take effect for the modules under its control. Very often the applications being run are process or memory intensive and may require adjusting the "numProcs" value to effect load balancing across multiple instances.
Reference: Metalink Note 298551.1