Sunday 15 October 2017

Issues for WebLogic


1:   OOM, native OOM, server crash, high CPU utilization, server down/Unknown
2:   404, 403, users unable to access some application or URL, application errors, application responding slowly, application not working, application not opening, not getting authenticated, blank page.
3:   Log file not rotating, high disk space usage on servers, stack overflow, thread count, SiteScope alert, error while uploading war file.
4:   User creation errors.


1. OOM

Log in to the corresponding server through PuTTY.
Check the status of the server instances.
Check the server logs and out logs for OutOfMemoryError.
Collect the access logs from the time of the OOM; if the server(s) are still in RUNNING state, it also helps to take a thread dump.
Analyze the thread dump for the cause of the OutOfMemoryError (application vs. server).
Then, depending on the server status (if not in RUNNING state), restart the server.
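The log check in the steps above can be sketched as a quick grep; LOG_DIR here is an assumed location, not the real domain layout:

```shell
#!/bin/sh
# Sketch: scan server/out logs for OutOfMemoryError occurrences, newest last.
# LOG_DIR is illustrative; point it at the real domain log directory.
LOG_DIR=${LOG_DIR:-/apps/bea/domains/mydomain/logs}

# -n keeps file:line prefixes so the hits can be correlated with timestamps.
grep -n "java.lang.OutOfMemoryError" "$LOG_DIR"/*.log "$LOG_DIR"/*.out 2>/dev/null \
    | tail -20
```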

OutOfMemory during deployment:

If the application is huge (contains more than 100 JSPs), we might encounter this problem with the default JVM settings.
The reason is that the permanent generation (MaxPermSpace) gets filled up.
This space is used by the JVM to store its internal data structures as well as class definitions; JSP-generated class definitions are also stored here.
The permanent generation is outside the Java heap and cannot expand dynamically.
So the fix is to increase it by passing an argument in the server's startup script: -XX:MaxPermSize=128m (default is 64m).
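As a sketch, the flag is typically appended to the memory arguments in the domain start script; the variable name MEM_ARGS follows common WebLogic start-script conventions, and the values here are illustrative:

```shell
# Illustrative start-script fragment (e.g. in setDomainEnv.sh or startWebLogic.sh).
# MEM_ARGS is the conventional WebLogic variable; verify against your own scripts.
MEM_ARGS="-Xms512m -Xmx512m -XX:MaxPermSize=128m"   # raise perm gen from the 64m default
export MEM_ARGS
echo "Starting with: $MEM_ARGS"
```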

2. SiteScope alerts:

Log in to the server.
Check the server status, particularly at the time of the SiteScope alert.
Check the logs (server/out) for any errors and exceptions at the time of the alert.

3. High CPU utilization:

Log in to the corresponding server through PuTTY.
Check the CPU utilization of the server instances:
ps -ef [or] top [or] prstat
AIX: topas
Make sure the instances are running as the weblogic user:
ps -ef | grep java
Check the logs for any findings regarding high utilization.
Check the queue threads.
If an instance is stuck at 100% CPU utilization and unresponsive: kill -9 <pid>
Restart the instances to bring the CPU utilization back down.
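To go one step further, the high-CPU thread itself can be matched to a thread dump. This is a Linux-only sketch; the PID defaults to the current shell purely so the sketch runs standalone:

```shell
#!/bin/sh
# Sketch (Linux): list the hottest LWPs of a Java process and print the top
# LWP id in hex, so it can be matched to the "nid=0x..." field in a thread dump.
PID=${1:-$$}    # pass the real WebLogic java PID as the first argument

# Per-thread CPU view, highest consumers first.
ps -eLo pid,lwp,pcpu,comm | awk -v p="$PID" '$1 == p' | sort -k3 -nr | head -5

TOP_LWP=$(ps -eLo pid,lwp,pcpu | awk -v p="$PID" '$1 == p' | sort -k3 -nr | head -1 | awk '{print $2}')
printf 'look for nid=0x%x in the thread dump\n' "$TOP_LWP"
```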


4. High disk space usage on servers:

Log in to the server.
Check the disk space of the mount that is consuming the most space:
df -kh
Zip or remove the oldest log files, backup war files, and access logs:
gzip or compress [or] rm -rf
Backup: mv /apps/bea/domains/gwmp_desktop/ads_web.war /apps/back_up/ads_web.war_bak
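The cleanup above can be sketched as follows; LOG_DIR and the 7-day cutoff are illustrative assumptions:

```shell
#!/bin/sh
# Sketch: show the biggest space consumers, then compress logs older than 7 days.
# LOG_DIR is illustrative; point it at the real domain log directory.
LOG_DIR=${LOG_DIR:-/apps/bea/domains/mydomain/logs}

du -sk "$LOG_DIR"/* 2>/dev/null | sort -nr | head -10
find "$LOG_DIR" -name '*.log*' ! -name '*.gz' -mtime +7 -exec gzip {} \;
```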

5. Thread count:

Check the logs for any errors and exceptions.
Check the status of the instances & connection pools.
Check the CPU usage.
Take a thread dump if possible and analyze it.
Check with other subsystems.
Check with the DB team for any database-related issues.

6. Stack overflow:

Check the server logs, out logs, and access logs from the time of the stack overflow occurrence; restart the instance if required.
If needed, adjust the thread stack size via the -Xss JVM option.

7. Log files not rotating:

Check the status of the server (restart if needed):
./startWeblogic.sh
./startManagedWeblogic.sh (or)
check through the console.
Check the disk space (if full, delete old logs and then restart the server):
du -kh <folder>
df -kh <filesystem>
(compare the available space against the capacity column in the df output)
If full: mv (archive) or rm -rf (delete).

8. Server errors:

Check the status of the servers (restart if needed):
./startWeblogic.sh
./startManagedWeblogic.sh [or]
check through the console.
Check the server logs:
/apps/bea/domain/gwmp_destop/logs
Adminserver.log
Managedserver.log
For any database errors, check the connection pool and datasource:
Services -> JDBC -> connection pool, datasource
Check the deployment descriptors:
weblogic.xml, web.xml
Based on the logs, if any configuration changes are required, make them and then restart the instances one by one if in a cluster.

9. Server Down/Unknown:

Log in to the server through PuTTY and also open the Admin Console.
Check for the instance's process from PuTTY and the instance status from the Admin Console.
If the process does not exist and the instance status is UNKNOWN, check the logs of the server instance as well as the admin logs:
admin and managed server logs;
Node Manager status.
Find the root cause from the logs and restart the required instances.

10. URL not working:

Access the URL.
Check the status of the server instances on which the application is deployed.
Then check the default queue threads (or the application-specific queue, if any) to see whether idle threads are zero. Then check the server logs and application (out) logs for errors and exceptions.
If idle threads are zero, check which application is consuming all the threads; if it is the same application you are accessing, check with the application owner.
(To resolve this, restart the corresponding instances, but first check with the app owner why the threads are being consumed.)
If there are application-related exceptions, check with the application owner or check the server logs for exceptions.
If there are DB exceptions related to the application you are accessing, check whether the corresponding connection pool and datasource are running fine.

11. Application errors:

Access the application URL.
Check the instances and their status for any errors.
Check the server logs as well as the application (out) logs.
Check the connection pool parameters and the datasource.

12. Users unable to access some application/URL:

Verify by accessing the URL yourself.
Check whether users are using the correct URL.
Check the logs of both WebLogic and the web server.
Check the server instances' status.
Test the pools.
Check the DB connectivity.
Check whether the deployment was done properly; if not, redeploy the application and check the logs for errors at the same time.
Check the connection pool user name.
Restart the instance if required.

13. Application error, responding slowly, application not working/not opening, not getting authenticated, blank page:

Check the web server and app server instance status.
Check the logs for any errors/exceptions in both the web server and the WebLogic server.
Check the queue threads, connection pool status, connections, and datasource.
Check disk space.
Check whether log4j logging is enabled.
Check whether the deployment was done properly.

14. Error while uploading war file:

Check the availability of disk space on the target filesystem.

15. Log locations:

1) Server log
WebLogic Server creates the server log file by default under:
<domain-name>/<server-name>/<server-name>.log
The location is configurable.

2) JDBC log
All SQL statements and DB-related exceptions/errors.
This file is created under <domain-name>/<server-name>/jdbc.log

3) STDout log (if the process output is redirected to STDout)

4) Domain log
All domain-level information is logged into this file.
This is a subset of the server log file.
<domain-name>/<domain-name>.log

5) Access log
All HTTP requests are recorded in this log file:
<domain-name>/<server-name>/access.log

6) Transaction log
All servers record transactions in the tlog file:
<domain-name>/<server-name>/<server-name>.tlog

16. Server Crash:

A crash implies the WebLogic java process no longer exists.
A server crash can occur only because of native code (pure Java code cannot cause the process itself to crash).
Determine all potential sources of native code used by the WebLogic server:
native I/O (performance packs);
JDBC drivers with native components (e.g. Type 2 drivers);
native libraries accessed with JNI calls;
SSL native libraries;
the JVM itself (most of the time it is the JVM).

Sometimes the JVM will produce a small log file (hs_err_pid*.log) that may contain useful information about which library the crash originated from.

Server Crash Analysis:
When a JVM crashes, a core file (a binary image of the process) is created. Run pmap and pstack against the core file to find the library that caused the crash; the offending library can be identified from existing pmap & pstack output files.

Checklist:

1) hs_err_pid*.log (look for the library that caused the crash)

2) pmap core (core file created in the JVM root dir)
pstack core

3) Use a debugger (gdb, dbx, adb) if the above two steps do not provide any information.

17. Server Hang:

A server is said to be hung when:
the process is still alive;
the server does not accept any requests because all the execute threads are busy or stuck for some reason;
no response is sent to clients;
the java weblogic.Admin PING command doesn't return a normal response.

Server Hang Analysis:
The first step is to take multiple thread dumps.
A thread dump is a snapshot of the JVM at a particular instant.
Multiple thread dumps are necessary to conclude that the threads are stuck and not progressing.

Procedure to take thread dumps:

Unix:
Open a shell window and issue the command: kill -3 <PID>
where PID is the java process ID of WebLogic. Thread dumps are logged to the STDout file.

Windows:
Press Ctrl-Break in the command window where WebLogic is running.
Thread dumps are created in the same command window.

Windows Service:
Open a command prompt and issue the command (make sure beasvc.exe is in the PATH):
c:\> beasvc -dump -svcname:service-name
Thread dumps are created in the defined log file.
While creating the service, we can provide the log option in the installservice script as:
-log:"d:\bea\domains\mydomain\myserver-stdout.txt"
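The Unix procedure above can be wrapped in a small loop so the dumps are evenly spaced. PID, count, and interval are illustrative defaults; DRY_RUN is a hypothetical switch added here (not a WebLogic feature) so the sketch can be exercised without signalling a real process:

```shell
#!/bin/sh
# Sketch: take several thread dumps a few seconds apart, so the same stuck
# threads can be spotted across snapshots.
take_dumps() {
    pid=$1
    count=${2:-3}
    interval=${3:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -n "$DRY_RUN" ]; then
            echo "kill -3 $pid"    # just show what would run
        else
            kill -3 "$pid"         # SIGQUIT: JVM appends the dump to its stdout log
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}

# Example invocation (dry run; 12345 stands in for the WebLogic java PID):
DRY_RUN=1 take_dumps 12345 3 10
```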

Before we analyze thread dumps, it is important to know the common thread states:

1) Runnable [marked as R in some VMs]:
The thread is either running currently or is ready to run the next time the OS thread scheduler schedules it.

2) Object.wait() [marked as CW in some VMs]:
The thread is waiting for some condition to be fulfilled.

3) Waiting for monitor entry [marked as MW in some VMs]:
The thread is waiting to enter a synchronized block.

Threads in the third state are the ones to watch, because there is lock contention: the thread is waiting for a lock on an object that some other thread is holding.

In WebLogic, the main worker threads are from the group weblogic.kernel.Default:
"ExecuteThread: '1' for queue: 'weblogic.kernel.Default'" ...
This is the set of threads to look at for hang/slow-performance issues.
Below is a snapshot of an idle thread waiting for work to be assigned.
On an idle system you would see a lot of threads in this state:

"ExecuteThread: '1' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x031a6308 nid=0x980 in Object.wait() [2dff000..2dffd8c]
at java.lang.Object.wait(Native Method)
- waiting on <0x112cf2c0> (a weblogic.kernel.ExecuteThread)
at java.lang.Object.wait(Object.java:429)
at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:153)
- locked <0x112cf2c0> (a weblogic.kernel.ExecuteThread)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:172)

As for thread dump analysis and conclusions, a sample thread dump can be drilled into further (e.g. the RSD thread dump from the thread-stuck issue on UAT).

Server performing slowly:
There are many reasons for a server performing slowly.
The first step is to take thread dumps and see what the threads are doing. If there is nothing wrong with the threads, there are other reasons why a server performs slowly:

Process runs OutOfMemory:
If the Java heap is full, the server process appears to be hung and does not accept any requests, because each request needs heap to allocate objects.
So if the heap is full, none of the requests get served; they all fail with java.lang.OutOfMemoryError.


OutOfMemory Analysis:
OutOfMemory can occur because of a real memory crunch or a memory leak causing the heap to fill with orphaned objects.
The first step is to enable GC logging and run the server again
(-verbose:gc and -XX:+PrintGCDetails).
The STDout file will then show the garbage collection details.
If the error is because of a memory leak, use a profiler such as Introscope or OptimizeIt to figure out the source of the leak.
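A sketch of how those flags might be wired into the start script; JAVA_OPTIONS is the conventional WebLogic hook, and the exact flags should be verified against your JVM version:

```shell
# Illustrative start-script fragment: enable GC logging for OOM analysis.
# JAVA_OPTIONS is the usual WebLogic variable; confirm against your start scripts.
JAVA_OPTIONS="$JAVA_OPTIONS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
export JAVA_OPTIONS
```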
Process size = Java heap + native memory + memory occupied by the executables and libraries.
On 32-bit operating systems, the virtual address space of a process can go up to 4 GB; this is an address-width limitation (2^32).
Out of this 4 GB, the OS kernel reserves some part for itself (typically 1-2 GB).
This is not a limitation on 64-bit platforms such as Solaris (SPARC) or Windows running on Itanium (64-bit).

OOM can also occur due to heap fragmentation. In this situation, free memory is available but we still get OutOfMemory errors.
To understand fragmentation, note the following: heap allocation must be contiguous (as per the JVM spec). If a request needs 2 MB of memory, the JVM has to provide a 2 MB contiguous memory chunk.
Over a period of time, memory allocation becomes scattered and there might not be enough contiguous memory available.
A full GC might not be able to reclaim the contiguous space.
This is called fragmentation.
For example, the -verbose:gc output might look like the following when the heap is fragmented: there is free memory available, but the JVM still throws an OOM error.
(Most of the fragmentation bugs are resolved in Sun JDK 1.4.2_xx.)

[GC 4673K->3017K(32576K), 0.0050632 secs]
[GC 5047K->3186K(32576K), 0.0028928 secs]
[GC 5232K->3296K(32576K), 0.0019779 secs]
[GC 5309K->3210K(32576K), 0.0004447 secs]
java.lang.OutOfMemoryError

Fragmentation-related issues are due to bugs in the JVM.
The best approach is to try the latest minor version of the JVM; if that does not work out, work with the vendor to get it fixed.

The following commands on Solaris provide good information:
vmstat: reports statistics about kernel threads, virtual memory, disks, traps, and CPU activity.
sar: an OS utility, the system activity reporter.

If the application uses SSL, the server performs slower than with non-SSL traffic.
SSL reduces the capacity of the server by about 33 to 50 percent, depending on the strength of encryption used in the SSL connections.

Process running out of file descriptors: the server cannot accept further requests because sockets cannot be created (each socket created consumes a file descriptor).
The following exceptions are thrown in such cases:
java.net.SocketException: Too many open files
OR
java.io.IOException: Too many open files
In this case the lsof utility helps: it shows the list of all open file descriptors. From the list of open files, we (with the application owner) can usually figure out whether it is a bug or expected behavior. If it is expected behavior, then the FD limit needs to be increased (the default is commonly 1024).
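A quick way to check FD consumption is sketched below; it assumes a Linux /proc layout, and the PID defaults to the current shell just so the sketch runs standalone:

```shell
#!/bin/sh
# Sketch (Linux): count open file descriptors for a process and show the soft limit.
PID=${1:-$$}    # pass the real WebLogic java PID as the first argument

FD_COUNT=$(ls /proc/"$PID"/fd 2>/dev/null | wc -l)
FD_LIMIT=$(ulimit -n)
echo "pid $PID: $FD_COUNT open FDs (soft limit $FD_LIMIT)"

# Where lsof is installed, 'lsof -p <PID>' shows what each descriptor points to.
```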

GC taking a long time (more than 20 secs):
This appears like a hang to end users.
In this case, we need to tune the GC parameters.
Try the other GC options available; in some cases (GC taking very long), incremental GC has been useful (-Xincgc).


WebLogic Troubleshooting: Communication from Apache to WebLogic

If there is any issue between Apache and WebLogic and the cause is not obvious, enable debug at the Apache layer. In httpd.conf, add:
Debug ALL
This creates a file called wlproxy.log under /tmp on the Apache machine. The log contains all the request/response headers exchanged between Apache and WebLogic.
Most of the plug-in issues in WLS 8.1 were centered around the attribute "KeepAliveEnabled".
For most socket-related errors, it is worth turning off "KeepAliveEnabled" and redoing the test.

Apache restart and connection checks:

APACHE_HOME\bin\Apache -t           (syntax check)
APACHE_HOME\bin\Apache -k start     (start the server)
APACHE_HOME\bin\Apache -k stop      (stop the server)
APACHE_HOME\bin\Apache -k restart   (restart the server)
APACHE_HOME\bin\Apache -l           (list compiled-in modules)
_______________________________________________________________________
Getting an error while restarting one of the WebLogic server instances:

(Server log excerpt; the severity/subsystem fields were lost in formatting, but the key messages were:)

#### <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <1189689344635> ... itms/data/ldap/ldapfiles/EmbeddedLDAP.tran (Permission denied)
#### <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <1189689344653> ... the Embedded LDAP Server. The exception thrown is java.lang.ClassCastException: com.octetstring.vde.backend.BackendRoot. This may indicate a problem with the data files for the Embedded LDAP Server. This managed server has a replica of the data contained on the Master Embedded LDAP Server in the Admin server. This replica has been marked invalid and will be refreshed on the next boot of the managed server. Retry the reboot of this server.


While restarting a WL instance on 9.2 I got the above-mentioned error, and I found that the server was getting started but then being forced to shut down again.

Solution:
Go to that server instance's directory and browse to
/local/BEA/weblogic92/domain-name/servers/server-name/data/ldap/ldapfiles

You will find the following files in that directory:

-rw-r--r--   1 weblogic weblogic   79649      Sep 13  18:48 EmbeddedLDAP.data
-rw-r--r--   1 weblogic weblogic       0          Sep 13  18:48 EmbeddedLDAP.delete
-rw-r--r--   1 weblogic weblogic     648        Sep 13  18:48 EmbeddedLDAP.index
-rw-r--r--   1 weblogic weblogic       0          Sep 13  18:48 EmbeddedLDAP.lok
-rw-r--r--   1 weblogic weblogic   80126      Sep 13  18:48 EmbeddedLDAP.tran
-rw-r--r--   1 weblogic weblogic       8          Sep 13  18:48 EmbeddedLDAP.trpos

Delete the following two files in that directory:
-rw-r--r--   1 weblogic weblogic       0      Sep 13 18:48 EmbeddedLDAP.delete
-rw-r--r--   1 weblogic weblogic       0      Sep 13 18:48 EmbeddedLDAP.lok

Now restart the instance from the bin directory; this will get your server up and running without issue.
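The fix above can be scripted as follows; LDAP_DIR is an illustrative path, not the real one, so point it at the actual servers/<name>/data/ldap/ldapfiles directory:

```shell
#!/bin/sh
# Sketch of the fix above: remove the EmbeddedLDAP lock/delete files before a restart.
LDAP_DIR=${LDAP_DIR:-/local/BEA/weblogic92/mydomain/servers/managed1/data/ldap/ldapfiles}

for f in EmbeddedLDAP.delete EmbeddedLDAP.lok; do
    if [ -f "$LDAP_DIR/$f" ]; then
        rm -f "$LDAP_DIR/$f" && echo "removed $f"
    fi
done
```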


Issue 1: JMS

EOP messaging bridges were failing frequently with the error: "(java.lang.Exception: javax.resource.ResourceException: method (Ljava/lang/String;Ljava/lang/Throwable;)V not found)". Because of this issue, messages were piling up on MQ and not being picked up by the bridge.

Soln:

Domain: eopdom1 (1 admin + 2 MS spread across 2 servers). Checked the bridge configuration (70 bridges in total). Then checked the pool parameters in jms-xa-adp.rar (120 on m1 and 20 on m2). Changed this to 150 on both servers, as each bridge needs at least 2 connections from the adapter pool, then redeployed and restarted the WebLogic instances. Also applied patch WB1E (CR326720_920.jar) to resolve the known issue with the error mentioned.

Notes:

Live is running on 9.2.0 and test is running on 9.2.3; these should be brought in sync. Also planning a quick round of WLS health checks on EOP.


Issue 2: JMS

The messaging bridge failed to connect to the source and target destinations, giving this error: "failed to get one of the adapters from JNDI (javax.naming.NameNotFoundException: Unable to resolve 'eis.jms.WLSConnectionFactoryJNDIXA'. Resolved 'eis.jms'; remaining name 'WLSConnectionFactoryJNDIXA')". This suggests that the adapter file jms-xa-adp.rar was either not targeted to the required managed server instance or that the deployment of the adapter failed with some error.

Soln:

Found that the adapter was only targeted to the managed2 server, whereas the bridge was configured to run on the managed1 server. Targeted the adapter to the managed1 server as well and restarted the instances.

Issue 3: JMS

A newly configured messaging bridge failed to become Active, and the following two error messages were seen: "Unable to connect to source destination" and "Configured QoS is not reachable".

Soln:

For "Unable to connect to source destination": the source URL had a space between the "//" and the IP; removing it allowed the bridge to connect to the source destination. For "Configured QoS is not reachable": "QoS degradation allowed" had been checked for the earlier bridges but was unchecked for this new bridge, and QoS was configured for "Exactly-once" delivery; enabling it made the messaging bridge become Active after a bounce of the WebLogic instances.

Notes:

QoS "Exactly-once" requires the messaging to be XA-enabled, i.e. the connection factory should be XA-enabled and the destinations should be configured to use the JMS XA adapter.

Issue 4: Deployer

Unable to deploy an application from the console; the console page showed the following error: "[Deployer:149150] An IOException occurred while reading input.; nested exception is: java.net.SocketException: Connection reset; nested exception is: java.net.SocketException: Connection reset".

Soln:

The only error message in the logs indicated that the application was attempting to connect to java.sun.com on port 80 over the internet, but this was disabled due to firewall restrictions; reported this to the application team. As a work-around, added a manual entry in config.xml for the application and restarted the admin and managed server instances, and the application got deployed successfully.

Notes:

One counter-argument was that the application was deploying properly on another test instance even with the same error. Though we were never able to replicate this, one theory is that, while deploying through the console, the deployer was attempting to connect to java.sun.com repeatedly and eventually timing out, whereas with the config.xml entry and a restart it attempted the connection once and moved on to the other tasks that have higher priority during restart.


Issue 5: Startup

"/wls_domains/wlmrtnept/servers/managed3_wlmrtnept/tmp/managed3_wlmrtnept.lok : java.io.IOException: No locks available"

Soln:

This can be due to an incorrect NFS setup (if an NFS filesystem is used); check that the hosts have the correct permissions on the NFS server. Also check that the NFS libraries below are installed:
yum list | grep nfs

*Note*: Red Hat Network repositories are not listed below. You must run this command as root to access RHN repositories.
nfs-utils.x86_64                           1:1.0.9-40.el5         installed
nfs-utils-lib.x86_64                       1.0.8-7.2.z2           installed

The rpc.statd and rpc.idmapd processes should also be running.


Issue 6: Cluster

For quite some time we were observing multicast packet-loss issues triggering various other problems in WebLogic, such as managed servers dropping out of the cluster and JMS messages not being delivered properly to distributed queues.

A recurring message similar to the one below appeared in the logs. Although it is only an informational message, it acts as a trigger for various other issues, so messages like this should not be neglected:

<[ACTIVE] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'> ... <1269346444069> (the body of the informational message was lost in formatting)

Soln:

We used the multicast test utility to check whether there was in fact any issue with multicasting:

java utils.MulticastTest -n <name> -a <multicast-address> -p <port>

The result showed that the multicast packets are intermittently being dropped within the vLAN causing the above issue. We then liaised with the OS experts to narrow down the issue and to see whether the multicast packets are being transmitted correctly amongst the servers. This did not help much as from the server perspective all the packets were being transmitted correctly. 

Next we involved network experts to seek their help. After thorough investigations of the network logs and various switch configurations it was concluded that this was down to the multicast address range being used and the way the local switches acknowledged that multicast range. They also suggested that in future we make use of Link Local Multicast IP Addresses for Weblogic multicasting purposes.

A note on Link Local IP can be found at: http://www.iana.org/assignments/multicast-addresses/

In short, multicast link-local addresses (actually, the link-local MAC addresses) are treated as broadcasts by the local switches, so all WebLogic servers on the same VLAN will see them. Other multicast addresses are dropped by the switches by default unless further action is taken:
Disable IGMP snooping on the VLAN or the whole switch; otherwise the switch just drops the multicast packets, because WebLogic doesn't use IGMP, so the switch never sees an IGMP join request for the multicast group (and thus never maps the MAC address to the switch port). OR
Configure static multicast MAC addresses for the relevant switch ports.

Both of the above options add network complexity and are costly to implement, test, and maintain. Link-local multicast addresses completely avoid these issues. Some previous implementations using non-link-local multicast addresses may have worked OK if the switch had IGMP snooping disabled globally or per VLAN.

5.Threads count :

 Check the logs for any  Errors and Exceptions
 Check the status of instances & connection pools
 Check the CPU usage.
 Take the thread dump if possible and Analyze the thread dump
 Check with Other Subsystems
 Check with the DB team if any Issues related to Database.
6.Stack overflow:

 Checkout the Server logs as well as Out logs and also the access logs at the time of Stack Overflow Occurrence. Restart the instance if required
 Xss=.

7. Log files not rotating:
  Check the Status of the Server
 ./startWeblogic.sh
 ./startManagedWeblogic.sh
 [0R]
 Check through console.
 Check the disk Space(if full, Delete the logs and then need to restart the Server)
 du –kh (folder)
 df –kh (filesystem)
 avail capacity
 45% 90%
 If full , mv
 Delete, rm –rf

8.Server Errors:

 Check the Status of Servers.
 ./startWeblogic.sh
 ./startManagedWeblogic.sh
 [0R]
 Check through console.
 Check the Server logs
 /apps/bea/domain/gwmp_destop/logs
 Adminserver.log
 Managedserver.log
 If any Database Errors, Check the Connection pool and Datasource.
 Services->jdbc->connectionpool,datasource
 Check out the Deployment Descriptors.
 Weblogic.xml,web.xml
 Based on the logs if any Configuration Changes Required, Make the Changes and then restart instances one by one if in Cluster.

9.Server Down/Unknown:

 Login to the Server through Putty as well as Open the Admin Console
 Check out the respective Instance Process from putty as well as the instance Status from Admin Console
 If Process does not exist and Instance Status is Unknown, then check the logs of the Server Instance as well as Admin Logs.
 Admin and managed server logs.
 Node manage status.
 Find the root Cause from the logs And Restart the required instances

10. URL not working:
Access the URL
 Check the Status of the Server instances on which this Application is deployed.
 Then Check the Default Queue threads or (Application Specified Queue if any)
 whether idle threads are zero or not. Then Server logs and Application logs (Out logs) for Errors and Exceptions.
 If idle threads are Zero, Check which Application is consuming all threads and if it is the same application which you are accessing, then check with the Application Owner.
 (To resolve the above Issue, Need to restart the Corresponding Instances, before that check
 with the App owner why they are getting consumed)
 If there is any Application Related Exceptions- Check with the Application owner or check the server logs for exceptions.
 If there are any DB Exceptions related to the application which you are accessing, Please Check the Corresponding Connection pool and Datasource whether they are running fine or not.

11.Application errors:

 Access the Application URL
 Check the instances and their status if any Errors
 Check the logs of the Server as well as Application (Out) logs
 Check out the Connection pool Parameters and Datasource

12.Users unable to access some application/URL:

 Check out by  accessing the url
 Check out whether they are using Correct URL or not
 Check the logs of both Weblogic and Webserver
 Check the Server Instances status.
 Test the pools.
 Check the DB connectivity.
 Check if the deployment is done properly or not, else redeploy the application and check for errors in the logs simultaneously.
 Check out the Connection pool user name.
 Restart the instance if required.

13.Application error, responding slowly, Application not working/not Opening, not getting authenticated,Blank page

 Check the Web server and App server instance status.
 Check the logs for any errors/exceptions both in Webserver as well as in Weblogic Server.
 Check the Queue threads, Connection pool Status, Connections and Datasource.
 Check disk space
 Check the log4j property enable.
 Check if the deployment done properly.


14.Error while uploading war file:

check out the Availability of Space

15.Log locations:

1) Server log
WebLogic server creates server log file by default under:
///.log
The location is configurable.

2) JDBC log
All SQL statements and DB related exceptions/errors.
This file is created under //jdbc.log

3)STDout log (If the process is redirected to STDout)
Domain log
All domain level information is logged into this file.
This is subset of server log file.
/.log
4) Access log
All http requests are recorded in this log file
//access.log
5) Transaction log
All servers record transaction in the tlog file
//.tlog

16.Server Crash:
 Server Crash
 This implies the weblogic java process no longer exists.
 Server crash can occur only because of native code. (Java cannot cause a process to crash)
 Determine all potential sources of native code used by the WebLogic Server.
 nativeIO.
 Type4 jdbc driver.
 Native libraries accessed with JNI calls.
 SSL native libraries.
 JVM itself. Most of the times its from JVM.

Sometimes the JVM will produce a small log file that may contain useful information as to which library the crash has originated from. (hs_err_pid*.log)

Server Crash Analysis
When a JVM is crashed, a core file(binary image of the process) is created. Run pmap and pstack against the core file to get the library that caused the crash.

Demo to figure out offending library using existing pmap & pstack out files.
Check list:

1) hs _err_pid*.log (Look for library that caused the crash)

2) pmap core (core file created in JVM root dir)
pstack core

3) Using debugger (gdb,dbx,adb) (if above two steps does not provide any information)

17.Server Hang:

A server is said to be hung when:
 Process is still alive
 Server does not accept any requests because all the execute threads busy or stuck for some reason.
 No reponse sent to clients.
 java weblogic.Admin PING command doesn’t return a normal reponse

Server Hang Analysis:
The first step is to take multiple thread dumps.
 A thread dump is a snapshot of the JVM at the particular instant.
 Multiple thread dumps are necessary to conclude that the threads are  stuck and not progressing.

Procedure to take thread dumps:
Unix:
 Open shell window and issue the command  kill -3
 where PID is java processID of weblogic. Thread dumps are
 logged on to STDout file.
Windows:
 Do ctrl-break on command window where weblogic is running.
 Thread dumps are created on the same command window.

Windows Service:
 Open a command prompt and issue the command(Make sure beasvc.exe is in the PATH)
 c:\> beasvc -dump -svcname:service-name
 Thread dumps are created in the defined log file.
 While creating service, we can provide log option in installservice script    as:
 -log:"d:\bea\domains\mydomain\myserver-stdout.txt

•             Before we analyze thread dumps, it is important to know the common thread states:
1)Runnable [marked as R in some VMs]:
This state indicates that the thread is either running currently or is ready to run the next time the OS thread scheduler schedules it.
2)Object.wait() [marked as CW in some VMs]:
Indicates that the thread is waiting for some condition to be
fulfilled.
3)Waiting for monitor entry [marked as MW in some VMs]:
Indicates that the thread is waiting to enter a synchronized block.

These threads are something to watch out because there is lock contention here. Thread is waiting for a lock on object and some other thread is holding the lock.

In the case of WebLogic, the main worker threads are from the group weblogic.kernel.Default:
"ExecuteThread: '1' for queue: 'weblogic.kernel.Default'" ...
This is the set of threads to look at for hang and slow-performance issues.
On an idle system you will see many threads in the state below: a snapshot of an idle thread waiting for some work to be assigned.

"ExecuteThread: '1' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x031a6308 nid=0x980 in Object.wait() [2dff000..2dffd8c]
at java.lang.Object.wait(Native Method)
- waiting on <0x112cf2c0> (a weblogic.kernel.ExecuteThread)
at java.lang.Object.wait(Object.java:429)
at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:153)
- locked <0x112cf2c0> (a weblogic.kernel.ExecuteThread)
at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:172)
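A minimal sketch of tallying those three states with grep; the dump text below is a trimmed, made-up sample (point the greps at the real stdout file that received the kill -3 output):

```shell
#!/bin/sh
# Tally thread states in a dump file. The here-doc is an invented sample;
# example.App is a hypothetical application class.
dump=$(mktemp)
cat >"$dump" <<'EOF'
"ExecuteThread: '1' for queue: 'weblogic.kernel.Default'" daemon prio=5 in Object.wait()
        at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:153)
"ExecuteThread: '2' for queue: 'weblogic.kernel.Default'" daemon prio=5 waiting for monitor entry
        at example.App.doWork(App.java:42)
"ExecuteThread: '3' for queue: 'weblogic.kernel.Default'" daemon prio=5 runnable
        at java.net.SocketInputStream.socketRead0(Native Method)
EOF
idle=$(grep -c 'in Object.wait()' "$dump")               # CW: waiting for work
blocked=$(grep -c 'waiting for monitor entry' "$dump")   # MW: lock contention
runnable=$(grep -c 'prio=5 runnable' "$dump")            # R: running or ready
echo "idle=$idle blocked=$blocked runnable=$runnable"
rm -f "$dump"
```

A high `blocked` count that persists across several dumps is the lock-contention signature described above.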

 For thread dump analysis and conclusions, let's take a sample thread dump and drill into it further.
Demo of RSD thread dump (thread stuck issue on UAT)

Server performing slowly
There are many reasons for a server performing slowly.
The first step is to take thread dumps and see what the threads are doing. If there is nothing wrong with the threads, there are other reasons why a server performs slowly:

Process runs OutOfMemory:
If the java heap is full, the server process appears to be hung and does not accept any requests, because each request needs heap space for allocating objects.
So if the heap is full, none of the requests get served; they all fail with java.lang.OutOfMemoryError.


OutOfMemory Analysis:
OutOfMemory can occur because of a real memory crunch or because of a memory leak filling the heap with orphaned objects.
The first step is to enable GC logging and run the server again
(-verbose:gc -XX:+PrintGCDetails).
The stdout file will then show the garbage collection details.
If the error is caused by a memory leak, use a profiler such as Introscope or OptimizeIt to figure out the source of the leak.
OutOfMemory Analysis
Process size = java heap + native memory + memory occupied by the executables and libraries.
On 32-bit operating systems, the virtual address space of a process can go up to 4 GB; this is the 32-bit addressing limit (2^32 bytes).

Out of this 4 GB, the OS kernel reserves some part for itself (typically 1 to 2 GB).
This limitation does not apply on 64-bit platforms such as Solaris (SPARC) or Windows running on Itanium.

OutOfMemory Analysis
OOM can occur due to fragmentation. In this situation, we can see free memory available but still get OutOfMemory errors.
Before we know about fragmentation, we need to know the following fact:
Heap allocation must be contiguous (as per the JVM spec): if a request needs 2 MB of memory, the JVM has to provide a 2 MB contiguous chunk.
Over a period of time, memory allocation becomes scattered and there might not be enough contiguous memory available.
A full GC might not be able to reclaim contiguous space.
This is called fragmentation.
For example, the verbose:gc output might look like the following when the heap is fragmented: there is free memory available, but the JVM still throws an OOM error.
(Most of the fragmentation bugs are resolved in Sun JDK 1.4.2_xx.)

[GC 4673K->3017K(32576K), 0.0050632 secs]
[GC 5047K->3186K(32576K), 0.0028928 secs]
[GC 5232K->3296K(32576K), 0.0019779 secs]
[GC 5309K->3210K(32576K), 0.0004447 secs]
java.lang.OutOfMemoryError
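Lines in this format can be parsed to confirm the fragmentation signature, i.e. free heap still available at the moment of the OOM; a small awk sketch over the same kind of lines:

```shell
#!/bin/sh
# Parse verbose:gc lines of the form "[GC before->after(total), secs]" and
# report the free heap after each collection. Plenty of free heap right
# before an OutOfMemoryError is the fragmentation signature shown above.
gc_report=$(awk -F'[^0-9]+' '/^\[GC/ {
    # after splitting on non-digit runs: $2=used before GC, $3=used after,
    # $4=total heap (all in K)
    printf "used=%dK total=%dK free=%dK\n", $3, $4, $4 - $3
}' <<'EOF'
[GC 4673K->3017K(32576K), 0.0050632 secs]
[GC 5047K->3186K(32576K), 0.0028928 secs]
EOF
)
echo "$gc_report"
```

Roughly 29 MB free out of 32 MB yet still an OOM error points at fragmentation (or a single allocation larger than any contiguous free block).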

OutOfMemory Analysis
Fragmentation-related issues are caused by bugs in the JVM.
The best approach is to try the latest minor version of the JVM; if that does not work, we need to work with the vendor to get it fixed.
 The following commands on solaris will provide good information:
vmstat:
The vmstat command reports statistics about kernel threads, virtual memory, disks, traps and CPU activity.
sar:
An OS utility known as the system activity reporter.
If the application uses SSL, the server performs slower than with non-SSL traffic.
SSL reduces the capacity of the server by about 33 to 50 percent, depending on the strength of encryption used in the SSL connections.

Process running out of file descriptors: the server cannot accept further requests because sockets cannot be created (each socket created consumes a file descriptor).
The following exception is thrown in such cases:
java.net.SocketException: Too many open files
OR
java.io.IOException: Too many open files
In this case the lsof utility helps: it shows the list of all open file descriptors. From the list of open files, the application owner can easily figure out whether this is a bug or expected behavior. If it is expected behavior, the number of FDs needs to be increased (the default is 1024).
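A quick sketch of checking descriptor usage against the limit; for brevity it inspects the current shell via /proc/self/fd (a Linux-specific assumption), whereas for a WebLogic JVM you would count the `lsof -p <PID> | wc -l` output as described above:

```shell
#!/bin/sh
# Compare a process's open file descriptors against its per-process limit.
# /proc/self/fd is Linux-specific; on other Unixes use lsof instead.
fds=$(ls /proc/self/fd | wc -l)
limit=$(ulimit -n)
echo "open=$fds limit=$limit"
case "$limit" in
    *[!0-9]*) ;;   # "unlimited": nothing numeric to compare against
    *) if [ "$fds" -gt $((limit * 90 / 100)) ]; then
           echo "WARNING: within 10% of the descriptor limit"
       fi ;;
esac
```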

GC taking a long time (more than 20 seconds).
This appears as a hang to end users.
In this case, we need to tune the GC parameters and try the other GC options available. In some cases (GC taking very long times), incremental GC (-Xincgc) has been useful.
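A hypothetical startup-script fragment wiring these flags together (the JAVA_OPTIONS variable name follows the usual WebLogic startup-script convention; in practice add -Xincgc only once long pauses have been confirmed from the GC log):

```shell
# Hypothetical startWebLogic.sh fragment: enable GC logging, then try
# incremental GC. Flag names match the HotSpot generation (JDK 1.4/5)
# this document targets.
JAVA_OPTIONS="$JAVA_OPTIONS -verbose:gc -XX:+PrintGCDetails -Xincgc"
export JAVA_OPTIONS
echo "$JAVA_OPTIONS"
```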


WebLogic Troubleshooting: Communication from Apache to WebLogic

If there is an issue between Apache and WebLogic and the cause is not obvious, enable debug at the Apache layer. In the httpd.conf file add:
Debug ALL
This creates a file called wlproxy.log under /tmp on the Apache machine. The log contains all the request/response headers exchanged between Apache and WebLogic.
Most of the plug-in issues in WLS 8.1 centered on the attribute "KeepAliveEnabled".
For most socket-related errors, it is worth turning off "KeepAliveEnabled" and redoing the test.

Apache Restart and Check the Connection counts:

APACHE_HOME\bin\Apache -t         Syntax check
APACHE_HOME\bin\Apache start      Start the server
APACHE_HOME\bin\Apache stop       Stop the server
APACHE_HOME\bin\Apache restart    Restart the server
APACHE_HOME\bin\Apache -l         List compiled-in modules
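The heading above also mentions checking connection counts; one way is to count ESTABLISHED sockets to the WebLogic listen port. A sketch, with port 7001 and the netstat lines assumed for the demo (pipe live `netstat -an` output in practice):

```shell
#!/bin/sh
# "Check the Connection counts": count ESTABLISHED connections from the
# Apache host to the WebLogic listen port. 7001 and the sample lines are
# assumptions for this demo.
sample='tcp  0 0 10.0.0.1:52000 10.0.0.2:7001 ESTABLISHED
tcp  0 0 10.0.0.1:52001 10.0.0.2:7001 ESTABLISHED
tcp  0 0 10.0.0.1:52002 10.0.0.2:7001 TIME_WAIT'
count=$(printf '%s\n' "$sample" | grep ':7001 ' | grep -c 'ESTABLISHED')
echo "connections to 7001: $count"
```

A large TIME_WAIT count relative to ESTABLISHED is one of the symptoms that makes toggling "KeepAliveEnabled" worth trying.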

Getting an error while restarting one of the WebLogic server instances

#### <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1189689344635> <000000> itms/data/ldap/ldapfiles/EmbeddedLDAP.tran (Permission denied)>
#### <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1189689344637> <000000> l>
#### <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1189689344653> he Embedded LDAP Server. The exception thrown is java.lang.ClassCastException: com.octetstring.vde.backend.BackendRoot. This may indicate a problem with the data files for the Embedded LDAP Server. This managed server has a replica of the data contained on the Master Embedded LDAP Server in the Admin server. This replica has been marked invalid and will be refreshed on the next boot of the managed server. Retry the reboot of this server.>
#### <> <> <> <1189689344667>


While restarting a WL instance on 9.2, I got the above-mentioned error and found that the server was starting but then being forced to shut down again.

Solution:
Go to that server instance directory and browse to the
/local/BEA/weblogic92/domain-name/servers/server-name/data/ldap/ldapfiles directory

You will find the following files in that directory:

-rw-r--r--   1 weblogic weblogic   79649      Sep 13  18:48 EmbeddedLDAP.data
-rw-r--r--   1 weblogic weblogic       0          Sep 13  18:48 EmbeddedLDAP.delete
-rw-r--r--   1 weblogic weblogic     648        Sep 13  18:48 EmbeddedLDAP.index
-rw-r--r--   1 weblogic weblogic       0          Sep 13  18:48 EmbeddedLDAP.lok
-rw-r--r--   1 weblogic weblogic   80126      Sep 13  18:48 EmbeddedLDAP.tran
-rw-r--r--   1 weblogic weblogic       8          Sep 13  18:48 EmbeddedLDAP.trpos

Delete the two files listed below from that directory:
-rw-r--r--   1 weblogic weblogic       0      Sep 13 18:48 EmbeddedLDAP.delete
-rw-r--r--   1 weblogic weblogic       0      Sep 13 18:48 EmbeddedLDAP.lok

Now restart the instance from the bin directory; this will get your server up and running without issue.
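The cleanup can be rehearsed safely first; the temp directory below stands in for the real .../data/ldap/ldapfiles directory:

```shell
#!/bin/sh
# Rehearse the EmbeddedLDAP cleanup in a throwaway directory standing in for
# <domain>/servers/server-name/data/ldap/ldapfiles. Only the .delete and
# .lok files are removed; data, index, tran and trpos must remain.
d=$(mktemp -d)
for f in data delete index lok tran trpos; do
    : > "$d/EmbeddedLDAP.$f"     # create empty stand-in files
done
rm "$d/EmbeddedLDAP.delete" "$d/EmbeddedLDAP.lok"
remaining=$(ls "$d")
echo "$remaining"
rm -rf "$d"
```

Against the real directory, run the `rm` line only with the server instance stopped.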
Issue 1: JMS 

EOP messaging bridges failing frequently with the error: "java.lang.Exception: javax.resource.ResourceException: method (Ljava/lang/String;Ljava/lang/Throwable;)V not found". Because of this issue, messages pile up on MQ and are not picked up by the bridge.
Soln:

    Domain: eopdom1 (1 admin + 2 managed servers spread across 2 hosts). Checked the bridge configuration (70 bridges in total), then checked the pool parameters in jms-xa-adp.rar (120 on m1 and 20 on m2). Changed this to 150 on both servers, as each bridge needs at least 2 connections from the adapter pool, then redeployed and restarted the WebLogic instances. Also applied patch WB1E (CR326720_920.jar) to resolve the known issue behind the error mentioned.

Notes:

Live is running on 9.2.0 and test on 9.2.3; these should be brought in sync. Also planning a quick round of WLS health checks on EOP.


Issue2: JMS 

Messaging bridge failed to connect to the source and target destinations and gave the following error: "failed to get one of the adapters from JNDI (javax.naming.NameNotFoundException: Unable to resolve 'eis.jms.WLSConnectionFactoryJNDIXA'. Resolved 'eis.jms'; remaining name 'WLSConnectionFactoryJNDIXA')". This suggests that the adapter file jms-xa-adp.rar was either not targeted to the required managed server instance or that the deployment of the adapter failed with an error.

Soln:

Found that the adapter was targeted only to the managed2 server, whereas the bridge was configured to run on managed1. Targeted the adapter to managed1 as well and restarted the instances.

Issue3: JMS 

A newly configured messaging bridge failed to become Active, and the following two error messages were seen: "Unable to connect to source destination" and "Configured QoS is not reachable".

Soln:

For "Unable to connect to source destination": the source URL had a space between the "//" and the IP address; removing it allowed the bridge to connect to the source destination. For "Configured QoS is not reachable": "QoS degradation allowed" was checked for the earlier bridges but unchecked for this new bridge, while QoS was configured for "Exactly-once" delivery; enabling it made the messaging bridge Active after a bounce of the WebLogic instances.

Notes:  

 QoS "Exactly-once" requires the messaging to be XA-enabled, i.e. the connection factory should be XA-enabled and the destinations should be configured to use the JMS XA adapter.

Issue4: Deployer

Unable to deploy an application from the console, getting the following error on the console page: "[Deployer:149150] An IOException occurred while reading input.; nested exception is: java.net.SocketException: Connection reset; nested exception is: java.net.SocketException: Connection reset".

Soln:

The only error message in the logs indicated that the application was attempting to connect to java.sun.com on port 80 over the internet, which was blocked by firewall restrictions; we reported this to the application team. As a workaround, we added a manual entry for the application in config.xml and restarted the admin and managed server instances, and the application deployed successfully.

Notes:

One counter-argument was that the application deployed properly on another test instance even with the same error. Though we were never able to replicate this, one theory is that when deploying through the console the deployer attempted to connect to java.sun.com repeatedly, eventually timing out, whereas with the config.xml entry and a restart it attempted the connection only once and moved on to other tasks with higher priority during restart.

Issue4: Startup

“/wls_domains/wlmrtnept/servers/managed3_wlmrtnept/tmp/managed3_wlmrtnept.lok : java.io.IOException: No locks available”

Soln:

This could be due to an incorrect NFS setup (if an NFS filesystem is used); check that the hosts have the correct permissions on the NFS server. Also check that the NFS libraries below are installed:

yum list | grep nfs

*Note*: Red Hat Network repositories are not listed below. You must run this command as root to access RHN repositories.
nfs-utils.x86_64                           1:1.0.9-40.el5         installed
nfs-utils-lib.x86_64                       1.0.8-7.2.z2           installed
Also rpc.statd and rpc.idmapd processes should be running.


Issue5:  Cluster

For quite some time we had been observing multicast packet loss triggering various other problems in WebLogic, such as managed servers dropping out of the cluster and JMS messages not being delivered properly to distributed queues.

A recurring message similar to the one below appears in the logs. Although it is only an informational message, it acts as a trigger for various other issues, so messages like this should not be neglected.

<[ACTIVE] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1269346444069>

Soln:

We used the multicast test utility to check whether there was in fact an issue with multicasting:

java utils.MulticastTest -n <name> -a <multicast address> -p <port>

The result showed that the multicast packets are intermittently being dropped within the vLAN causing the above issue. We then liaised with the OS experts to narrow down the issue and to see whether the multicast packets are being transmitted correctly amongst the servers. This did not help much as from the server perspective all the packets were being transmitted correctly. 

Next we involved network experts. After thorough investigation of the network logs and the various switch configurations, it was concluded that the problem came down to the multicast address range being used and the way the local switches handled that range. They also suggested that in future we use link-local multicast IP addresses for WebLogic multicasting.

A note on Link Local IP can be found at: http://www.iana.org/assignments/multicast-addresses/

In short, multicast link-local addresses (actually, the link-local MAC addresses) are treated as broadcasts by the local switches, so all WebLogic servers on the same VLAN will see them. Other multicast addresses are dropped by the switches by default unless further action is taken:
Disable IGMP snooping on the VLAN or the whole switch; otherwise the switch simply drops the multicast packets, because WebLogic does not use IGMP, so the switch never sees an IGMP join request for the multicast group (and thus never maps the MAC address to the switch port). OR
Configure static multicast MAC addresses for the relevant switch ports.

Both of the above options add network complexity and are costly to implement, test and maintain. Link-local multicast addresses avoid these issues completely. Some previous implementations using non-link-local multicast addresses may have worked if the switch had IGMP snooping disabled globally or per VLAN.


Issue6:   JDBC

Stale connections causing high CPU and high memory utilization and eventually breakdown of database.

Soln:

The hardware and software could not easily be replaced, given the cost incurred and the complexity of the application. So the challenge was to make the best use of the database: reduce the number of connections as much as possible, use the connections that are made effectively, and refine the code where possible.

The majority of connections to the database were made by the connection pools configured in WebLogic, so WebLogic was the target for refinement. During the periods when the issue occurred, the number of sessions rose rapidly from 1400 to 1800; while the database was capable of handling 1400 sessions, it could not support 1800 at all. Most of these 1800 sessions were connections created by WebLogic in response to application requests. So it was clear we needed to go back to pen and paper and tune the connection pools as much as possible.

A look at the configuration of the connection pools pointed to a major issue. The application had an admin server and 14 managed servers, with four connection pools in total. Each connection pool, whether it was required there or not, was targeted at the admin server and all the managed servers. This created many unwanted stale connections on the database that could easily have been avoided: a few connection pools were required only on the admin server, and the others only on the managed servers.

So, as the first step in tuning the WebLogic connection pools, we removed all such unwanted targetings. This paid off: the total maximum number of connections to the database came down from 2250 to 1520 (a saving of 730 connections).

Next we tuned the various parameters available for a connection pool, concentrating mainly on two: Shrink Frequency and Inactive Connection Timeout. A short description of each:

Shrink Frequency: The number of seconds (between 0 and a positive 32-bit integer) before WebLogic Server shrinks the connection pool to the original number of connections or the number of connections currently in use. (This field is relevant only if the Allow Shrinking box is checked.)

Inactive Connection Timeout: The number of inactive seconds on a reserved connection (between 0 and a positive 32-bit integer) before WebLogic Server reclaims the connection and releases it back into the connection pool.

Shrink Frequency was set to the default value of 900s. The connection pool expanded to its maximum size during peak loads, but on average a transaction took about 100 to 200s to complete. So we reduced the shrink frequency to 300s, so that the pool is shrunk every 300s and idle connections are closed.

Also, Inactive Connection Timeout was set to the default value of 0s, which meant inactive connections were never released back to the pool, causing WebLogic to spawn new connections. This was set to 300s so that inactive connections are released back to the pool and reused.

The above actions proved quite effective in terms of reducing the overall load on the database.


Issue7: Startup

While trying to start WebLogic as a Windows service, the Service Manager throws an exception: "Error 1067: the process terminated unexpectedly."

Soln:

When this happens, no information is recorded in the WebLogic logs, because the Service Manager failed to initiate the WebLogic start-up process itself. Check the Windows Event logs for more information.


