As one of Europe's leading providers of DB2 services, my company has been involved in a huge number of technical support calls and incidents with its clients. These situations range from the tragic to the downright weird, but each and every one could have been avoided (or at least mitigated) with the right processes and approaches in place.

In this column, I'd like to take you on a brief tour of some of my favourite DB2 support incidents and war stories for Linux, UNIX and Windows. All of these actually happened, but I have omitted some details in order to keep them suitably anonymous. With thanks to the many Triton consultants who contributed, and who still bear the scars of their involvement.

Issue 1: The Law of Unintended Consequences
This one started (as many do) with a call for help from one of our DB2 customers. All of the rows had been deleted from a critical table in a pre-production environment. The junior DBA had bravely admitted to accidentally connecting to the wrong system to clear down the table, but couldn't understand why he'd been able to delete the rows, as he shouldn't have had the authority to do so.

In all such situations, the priority is always to get the data back and the developers working again. This was quickly achieved with a standard recovery using a recent backup, and no problems were encountered. We then turned our attention to the task of working out how the junior DBA had been able to delete the data in the first place.
After much investigation, we discovered that the instance had been set up for federation, and that the FED_NOAUTH database manager parameter had been set to YES. With that setting in force, DB2 bypasses its normal authentication checking (it assumes authentication will happen at the federated data source), so the junior DBA's connection was accepted without ever being properly validated.

Needless to say, the FED_NOAUTH parameter was promptly set back to NO and normal authentication checking was reinstated.
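If you want to check your own exposure, here is a minimal sketch of the relevant commands (depending on your DB2 release, an instance restart may be needed for the change to take effect):

  # Check whether federated authentication bypass is enabled
  db2 get dbm cfg | grep -i FED_NOAUTH

  # If it reports YES and you don't genuinely need it, switch it off
  db2 update dbm cfg using FED_NOAUTH NO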
Many of us spend a lot of time tuning databases and applications, but how often do we think about our own performance? Time and again, we see examples of DBAs and support staff taking the long way round and using more time than necessary to perform routine tasks. One particular aspect of this is the use of the command line versus a GUI such as the DB2 Control Center.

I know plenty of people who are passionate supporters of the GUI, and indeed it can be a great way to quickly accomplish one-off activities such as an ad-hoc reorg. However, there are a number of situations where use of the command line (or better still, DB2 commands embedded in a simple script) makes far more sense. The task is then easily repeatable, actions and their results can be recorded (by redirecting the output to a text file), and a script is much easier to hand over to a colleague than pages of instructions and GUI screenshots!

A typical example is the routine process of promoting database objects between development, test and production environments. A good script can accomplish that task in a fraction of the time needed to do the same thing from the GUI, and utilities such as db2look can be used to compare database structures to ensure consistency.
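As an illustration, here is a minimal sketch of the kind of script we mean. The database names are hypothetical; db2look extracts the DDL from each environment so the structures can be compared with a simple diff:

  #!/bin/sh
  # Extract the DDL from the development and test databases
  db2look -d DEVDB -e -o devdb_ddl.sql
  db2look -d TESTDB -e -o testdb_ddl.sql

  # Compare the two structures and keep any differences for review
  diff devdb_ddl.sql testdb_ddl.sql > ddl_differences.txt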
Many people are working in environments where the SQL presented to DB2 is not written directly by a human being, but generated automatically by a tool. Typical examples include Java frameworks, 3GLs and business intelligence tools such as Cognos, which features in this example.

In this case, the customer reported a long elapsed time for a Cognos-generated query (over 30 minutes) running against a DB2 9.5 database, with the query usually failing due to the DB2 temp space being exceeded. The users repeatedly extended the temp space and re-tried the query, but with exactly the same result each time. In desperation they even ran the DB2 Index Advisor and created additional indexes, but temp space still blew before the query completed.

A support incident was opened, and analysis using a third-party query monitoring tool showed a very high query execution cost. Further examination of the actual SQL being generated by Cognos revealed a Cartesian join of two tables, each with more than 25M rows in, with the new indexes not being used at all. Going back to the Cognos definitions showed that a key database relationship had not been defined to Cognos, so it had to filter huge amounts of data locally.

The solution was simply to re-work the Cognos report definitions to add the missing database relationships. This caused Cognos to generate proper join predicates, and the reports ran to successful completion in under 2 minutes. The developers were also advised to examine both the Cognos and the generated "native" SQL in the future.
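The shape of the problem is easy to see in a simplified sketch (the table and column names here are made up, not the customer's):

  # Without the relationship defined, Cognos generated a Cartesian join:
  db2 "SELECT o.order_ref, c.cust_name FROM orders o, customers c"

  # Once the relationship was defined, a proper join predicate appeared:
  db2 "SELECT o.order_ref, c.cust_name FROM orders o JOIN customers c ON c.cust_id = o.cust_id"

With 25M rows on each side, the first form asks DB2 to consider every combination of rows, which is exactly what was flooding the temp space.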
This particular problem occurred in a production environment where DB2 for LUW 9.7 was the back-end database, but it could just as easily have been any other RDBMS. A highly-skilled offshore team was developing a new application, with a local system administration team looking after the OS support.

Triton was initially contacted as a one-off exercise, as the developers were complaining about consistently poor query performance and asking to move to another RDBMS, as DB2 was clearly not up to the job. We found a surprisingly common situation: the team of developers, highly skilled in SQL and able to formulate complex queries, were completely focused on the application code, and nobody was looking after the data. As a result, there was no backup regime in place, no RUNSTATS or other housekeeping, and the only indexes were basic ones created when the database was first designed. Not surprising, then, that the query performance had been so poor: none of the tables had ever had statistics gathered.

Further questioning revealed that the developers had no DB2-specific knowledge at all, and no desire to acquire any. As far as they were concerned, the database was a black box that they could throw SQL at and get data back.

We promptly implemented a simple set of automated housekeeping processes that resolved the performance issues and made the developers happy. However, in the process we uncovered a number of data quality issues (such as developers coding inefficient SQL to get around duplicate data that shouldn't exist). A more formal design review of the database was conducted, and we were asked to provide an ongoing remote DBA service to prevent the same issues arising in the future.
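A minimal sketch of the sort of nightly housekeeping involved, with a hypothetical database name and paths:

  #!/bin/sh
  # Refresh statistics on all tables and flag reorg candidates
  db2 connect to APPDB
  db2 "reorgchk update statistics on table all" > /db2/housekeeping/reorgchk.out
  db2 terminate

  # Take an online, compressed backup to disk
  db2 "backup database APPDB online to /db2/backups compress"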
This is a great example of how "finger trouble" can cause red faces and an application outage. The senior DBA for a large public services utility took his annual vacation, leaving a recently-trained junior DBA in charge. Responding to a request to clear down a development database, the junior DBA fired up his telnet session, dropped and re-created the relevant database and congratulated himself on a job well done ... until the phones started ringing with users complaining that they couldn't use the production application. The DBA had accidentally dropped a critical production database, thinking that he was connected to the UAT system.

To his credit, the DBA was quick to own up to the mistake, and Triton was called in to assist with the recovery. This was accomplished without further drama, but the better part of a business day had been lost by the time everything was back to normal.
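One cheap defence is to make any destructive script confirm where it is running before doing anything irreversible. A hypothetical sketch:

  #!/bin/sh
  # Force the operator to re-type the hostname before the drop proceeds
  echo "You are on host $(hostname), instance ${DB2INSTANCE}"
  printf "Re-type the hostname to confirm: "
  read answer
  if [ "$answer" != "$(hostname)" ]; then
      echo "Hostname mismatch - aborting."
      exit 1
  fi
  db2 drop database DEVDB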
We received a frantic call one evening from a customer requesting immediate assistance. DB2 had "hung" and no activity could be performed on the production database. As the OLTP system had a 24x7 online SLA, it was not surprising that senior management were waiting when we arrived on the customer site. Upon investigation, it became apparent that DB2 was looking for an active transaction log file that had gone "missing" and was nowhere to be found.

After some digging, the story began to become clearer. In a desperate attempt to create some space on the production database server, a junior sysadmin had stumbled upon the DB2 transaction log directory (which in this case housed both the active and archived logs) and thought, "Hmm, this directory could do with some clean-up." Using his judgment of the age of the files, he deleted some of them. He gained space, but unfortunately lost an active transaction log file in the process.

Despite our advice that a restore from a previous backup was the only answer to the problem, the local Oracle DBA with some DB2 skills tried various ways to "trick" DB2, such as creating a dummy log file with the same name as the missing one, not realizing that DB2 transaction log files have header information within them. After much delay, it was finally agreed to perform a restore from the most recent backup. This proved a bit of a challenge, since no backups were kept on disk, so the appropriate tape had to be located and mounted. The restore and subsequent rollforward to a consistent point in time did eventually take place, and sighs of relief could be heard in the early morning hours. Although some hours of business had been lost, jobs had been saved! And yes, even the junior sysadmin was allowed to live on, since he owned up to his mistake...
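The moral: before any manual clean-up of a log directory, check which files DB2 still considers active. A quick sketch, with a hypothetical database name:

  # Where do the logs live, and which is the first one still active?
  db2 get db cfg for PRODDB | grep -i "Path to log files"
  db2 get db cfg for PRODDB | grep -i "First active log file"

Any log file with a sequence number at or above the first active log file must never be deleted by hand.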
Here's a nightmare support scenario: a damaged database on an unsupported version of DB2, with no backups or log files available to recover the database.

An erroneous script had deleted a number of transaction log files that had a 'last modified' date of more than 45 days ago. The same script had caused other errors and a database restart was required, but the database did not start: it was looking for an old log file, which had just been deleted by the script. As the policy was to retain the backups and archive logs for 30 days, this log file had been deleted from the archive logs too.

The database was tiny: under 50GB. However, it was a very critical one, with a number of web-facing apps relying on it for key features. To make matters worse, the version of DB2 in use had passed its "End of Service" date, so DB2 Support was not willing to investigate (although they were happy to advise).

When we got involved a few hours after the incident, panic had set in. Based on the information available (saved snapshots and the db2diag.log file), we were able to conclude that there was a transaction which had started in the log file the database was looking for, and which had never completed. The rate of data change was so small that the configured set of logs could span more than 45 days, which is why a log file that old was still active.
The options available were to extract the data from the latest backup image (using tools like High Performance Unload) or to extract the data from the damaged database (using db2dart). The latter option was chosen, as this would allow us to recover the most recent data.

Without further delay, we ran db2dart on the database to check for any errors and to get the tablespace ID, table ID and the total number of pages allocated to each table. We were then able to use that information to build the db2dart command with the /DDEL option to extract the data in delimited format. db2dart with the /DDEL option is interactive (i.e., when the command is run, it prompts for the tablespace ID, table ID and the page range to extract the data). This meant that the extract couldn't be scripted, but had to be done manually for each of the 300+ tables.
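For the curious, the commands looked roughly like this (the database name is hypothetical):

  # Inspect the whole database; tablespace IDs, table IDs and page
  # counts end up in the PRODDB.RPT report file
  db2dart PRODDB /DB

  # Extract table data in delimited format; /DDEL prompts interactively
  # for the table ID, tablespace ID and page range
  db2dart PRODDB /DDEL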
Once that mind-numbing task was complete, we created a new database with the DDL that was available (thankfully, we had a db2look output taken from the production database less than a week before the incident). Finally, we loaded the extracted data into the new database and ran RUNSTATS on the tables and indexes.
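The rebuild itself was straightforward CLP work; a sketch with hypothetical names:

  # Recreate the objects from the saved db2look output
  db2 create database NEWDB
  db2 connect to NEWDB
  db2 -tvf proddb_ddl.sql

  # Load one of the delimited files produced by db2dart, then gather
  # statistics (repeated for all 300+ tables)
  db2 "load from customer.del of del insert into APP.CUSTOMER"
  db2 "runstats on table APP.CUSTOMER with distribution and detailed indexes all"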
After a few hiccups and 15 hours of db2dart, import/load, RUNSTATS and data fixes, the database was available to the application. The database was down for more than 20 hours, but it came back in one piece with almost no data loss. Quite an achievement under the circumstances!
We often have to ask customers for the DB2 diagnostic log file (db2diag.log), only to be told "it's too big to send" or "it's taking too long to open." This is because the DB2 diagnostic files have been append-only since time immemorial, their growth limited only by the maximum space available in the file system they live in. The only way to curb this growth was to rename the files, which would then force the creation of new ones. Scripts had to be written to automate this process and to delete old files after a certain period. Many customers did not have these processes in place.

With the advent of DB2 9.7, all this is now history! The new DIAGSIZE database manager parameter allows a DBA to control the maximum sizes of the DB2 diagnostic log and administration notification log files. When this parameter is set to a non-zero value (zero, the default, preserves the pre-DB2 9.7 behaviour), a series of 10 rotating diagnostic log files and rotating administration notification log files (the latter on UNIX and Linux only) is used. DB2 is also smart enough to clean up old log files from the diagnostic log directory: when the tenth file is full, the oldest file is deleted and a new file is created.
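Enabling it is a one-liner; a sketch assuming you want to cap the rotating set at 1GB in total (the value is in MB, and depending on the release an instance restart may be needed for it to take effect):

  db2 update dbm cfg using DIAGSIZE 1024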
The customer was using HADR to a remote server for resilience, running in ASYNC mode to minimize the performance impact. No cluster software was used for monitoring/failover, but the HADR state was being monitored on a regular basis using a shell script.

On the primary site, a number of disks failed, which caused some of the tablespaces to be put into "Rollforward Pending" state. Transactions accessing data in those tablespaces failed, but the others were successful. The last run of the HADR state monitoring script had indicated Peer state, and it was therefore decided to issue a TAKEOVER command on the DR site to switch roles. When the application started, some transactions failed with the same error as on the primary site. Not good!

A LIST TABLESPACES command showed a number of tablespaces in "Rollforward Pending" state. To get out of the pending state, the ROLLFORWARD command was issued with the list of affected tablespaces. The rollforward attempted to retrieve a log file that was a few thousand logs older than the current one. Not surprisingly, this log was no longer available in the archive. After a few more tries, the ROLLFORWARD approach was abandoned, the database was restored from the latest backup image and the application started.

So what had happened? We went through the db2diag.log and the notification logs. We could see that physical errors had been reported in some of the tablespaces on the DR site (the HADR standby) around one hundred days prior to the incident. This was reported in the db2diag.log, and the affected tablespaces were "excluded from the rollforward set." Based on other entries in the db2diag file, we were able to confirm that the log file requested for rollforward on the DR site was the one in use at the time the physical errors occurred there. HADR continued to apply logs for the other tablespaces and reported itself to be in "Peer" state. In reality, some of the tablespaces were simply being ignored.
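The lesson for monitoring scripts is that "Peer" state alone is not proof of a healthy standby. A sketch of the extra checks worth adding (the database name is hypothetical):

  # HADR role, state and log positions
  db2pd -db PRODDB -hadr

  # Tablespace states on the standby (db2pd works without a connection)
  db2pd -db PRODDB -tablespaces

  # Scan the diagnostic log (default UNIX location shown) for
  # tablespaces that have been dropped from log replay
  grep -i "excluded from the rollforward set" ~/sqllib/db2dump/db2diag.log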
I'll round off my top ten with one of the most basic issues we encounter, but sadly also one of the most common. It's amazing how many DB2 sites don't have simple procedures in place for contacting IBM Support in the event of a serious DB2 situation.

If you're having a major production problem and you need IBM's assistance, every minute is vital. That isn't the time to be scrabbling around trying to dig out customer numbers and URLs for the IBM Support Portal. We've even seen cases where the IBM Support website was blocked by the company's web browsing policy!

IBM is generally happy for you to open "test" PMRs to prove that the process works and to confirm you can contact them quickly and effectively if the need arises. Finally, remember that it's possible for your IBM Business Partner to open and manage your PMRs for you; we do this for many of our clients as a matter of course.

I hope you'll be able to learn something from at least some of these scenarios, and take steps to prevent similar situations occurring in your own environment.