Booklet: Operations
Section: Risk Mitigation and Control Implementation
Subsection: Event/Problem Management
An effective event/problem management process helps protect institutions
from financial, operational, and reputation risks. Management should ensure
appropriate controls are in place to identify, log, track, analyze, and
resolve problems that occur during day-to-day operations. The event/problem
management process should be communicated to, and readily available to, all
IT operations personnel. Appropriate personnel from IT operations, institution
management, internal audit, fraud and loss prevention, information security,
and the computer security incident response team should participate in the
event/problem management process. Event/problem management plans should cover
hardware, operating systems, applications, and security devices and should
address, at a minimum:
- Event/problem identification and rating of severity based on risk;
- Event/problem impact and root cause analysis;
- Documentation and tracking of the status of identified problems;
- The process for escalation;
- Event/problem resolution;
- Management reporting; and
- Contact and communication information, including:
  - Current names and/or positions of individuals who should be contacted;
  - Current phone numbers of contacts; and
  - Who should be notified (e.g., regulators, the FBI, the public relations
    group, the media, and affected business lines) and the circumstances
    under which they should be notified.
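The minimum plan elements listed above (severity rating based on risk, status tracking, escalation, and resolution) can be pictured as a single event record. The Python sketch below is one illustrative way to model them, not a prescribed format: the 1-to-3 likelihood and impact scales, the severity cut-offs, the status names, and the escalation time limits are all hypothetical values that an institution would set for itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical escalation limits: how long an unresolved event of each
# severity may stay open before it must be escalated.
ESCALATION_LIMITS = {
    "high": timedelta(hours=1),
    "medium": timedelta(hours=8),
    "low": timedelta(days=3),
}

# Hypothetical status workflow; each status lists the statuses it may move to.
TRANSITIONS = {
    "open": {"investigating", "resolved"},
    "investigating": {"resolved"},
    "resolved": set(),
}


def rate_severity(likelihood: int, impact: int) -> str:
    """Rate severity from risk = likelihood x impact, each scored 1 (low) to 3 (high)."""
    if not (1 <= likelihood <= 3 and 1 <= impact <= 3):
        raise ValueError("likelihood and impact must each be between 1 and 3")
    score = likelihood * impact
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


@dataclass
class Event:
    event_id: str
    description: str
    component: str  # hardware, operating system, application, or security device
    likelihood: int
    impact: int
    opened: datetime
    status: str = "open"
    root_cause: str = ""  # filled in once impact and root cause analysis is done
    history: list = field(default_factory=list)

    @property
    def severity(self) -> str:
        return rate_severity(self.likelihood, self.impact)

    def update_status(self, new_status: str, when: datetime, note: str = "") -> None:
        """Record a status change, rejecting moves the workflow does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot move from {self.status!r} to {new_status!r}")
        self.history.append((when, self.status, new_status, note))
        self.status = new_status

    def needs_escalation(self, now: datetime) -> bool:
        """True if the event is unresolved past the time limit for its severity."""
        if self.status == "resolved":
            return False
        return now - self.opened > ESCALATION_LIMITS[self.severity]
```

Under these assumed values, an event scored likelihood 3 and impact 2 rates "high", so if it is still open 90 minutes after it was logged, needs_escalation returns True. Keeping the status history on the record also gives management reporting a ready audit trail.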
Operations personnel plan the work for each shift in advance so that it is
completed accurately and on time. However, unusual events often occur during
production, and management should monitor and correct them. Examples of
common production events include the following:
Production Program Failure – Operations personnel should
properly log and record program failures that require immediate intervention.
They should also notify the appropriate personnel so proper change management
procedures can be initiated. Some production failures require immediate
intervention by programming staff in order to meet an important production
goal (such as month-end or cycle processing). In these cases, emergency
procedures, sometimes called “fire call” procedures (who to
call, what to report, etc.), are invoked, and the programming staff members
perform emergency repairs either at the IT operations facility or from
a remote location.
Out-of-Balance Conditions – Personnel responsible for scheduling
should document and correct all production processes that do not produce
proper run control balances. Personnel should rerun the data to check
for operator error or erroneous transactions. When totals still do not balance
after the rerun, operations personnel should log and record the event
and notify management of the need for further investigation and resolution.
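A run control balance typically compares totals computed from a job's input with the same totals computed from its output. The Python sketch below assumes three common control totals (a record count, an amount total, and a hash total of account numbers) and follows the sequence described above: run, rerun once if out of balance, then log the event for management. The transaction fields and the single-rerun rule are illustrative assumptions, not a required design.

```python
from decimal import Decimal
from typing import Callable, Iterable, List, NamedTuple


class Txn(NamedTuple):
    account: int     # account number (hypothetical field)
    amount: Decimal  # transaction amount


class ControlTotals(NamedTuple):
    record_count: int
    amount_total: Decimal
    hash_total: int  # sum of account numbers: meaningless in itself, but changes if records do


def control_totals(txns: Iterable[Txn]) -> ControlTotals:
    txns = list(txns)
    return ControlTotals(
        record_count=len(txns),
        amount_total=sum((t.amount for t in txns), Decimal("0")),
        hash_total=sum(t.account for t in txns),
    )


def run_with_balancing(
    expected: ControlTotals,
    run_job: Callable[[], List[Txn]],
    event_log: list,
) -> bool:
    """Run the job and compare totals; rerun once, then log the event if still out of balance."""
    for _ in range(2):  # original run plus one rerun
        actual = control_totals(run_job())
        if actual == expected:
            return True
    event_log.append({
        "event": "out_of_balance",
        "expected": expected,
        "actual": actual,
        "action": "notify management for investigation and resolution",
    })
    return False
```

Using Decimal rather than float for the amount total avoids rounding differences producing false out-of-balance conditions.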
Operations Tasks Performed by Different Parties than Normal –
Operations personnel customarily are cross-trained and have back-up duties
in case another employee is absent or temporarily assigned other functions.
For example, operators may act as back-up to tape librarians or production
control analysts. In these circumstances, the back-up personnel could,
intentionally or unintentionally, cause an error, fraud, or service
disruption. Where back-up employees have the potential to compromise segregation
of duties, management should establish mitigating controls.
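One way to apply the mitigating-control point above is to check each proposed back-up assignment against a list of incompatible duties before it takes effect. In the Python sketch below, the role names and conflict pairs are hypothetical examples; an institution would define its own segregation-of-duties matrix and its own mitigating controls.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical segregation-of-duties matrix: duties one person should not hold at once.
INCOMPATIBLE = {
    frozenset({"computer_operator", "tape_librarian"}),
    frozenset({"computer_operator", "production_control"}),
    frozenset({"production_control", "application_programmer"}),
}


def sod_conflicts(current_roles: Set[str], backup_role: str) -> List[List[str]]:
    """List each incompatible pair created by giving backup_role to someone holding current_roles."""
    pairs = (frozenset({role, backup_role}) for role in current_roles if role != backup_role)
    return sorted(sorted(p) for p in pairs if p in INCOMPATIBLE)


def approve_backup(
    current_roles: Set[str],
    backup_role: str,
    mitigations: Dict[Tuple[str, str], str],
) -> Tuple[bool, List[List[str]]]:
    """Approve only if every conflict has a documented mitigating control.

    mitigations maps a sorted role pair to a description of the control,
    e.g. an independent supervisor review of the back-up person's activity.
    """
    missing = [c for c in sod_conflicts(current_roles, backup_role) if tuple(c) not in mitigations]
    return (not missing, missing)
```

The sketch does not forbid a conflicting back-up assignment outright; it requires that a mitigating control be recorded first, which matches the guidance above.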
Logging Issues – Most problem-solving techniques in an
IT operations center depend on the ability to read, consolidate, and interpret
various operations logs. Consequently, an institution should protect its logs
from destruction or modification. Discovery of log tampering or manipulation
is an event that requires management resolution and the involvement of the
computer security incident response team. Operations management should periodically review
all logs for completeness and ensure they have not been deleted, modified,
overwritten, or compromised.
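Reviewing logs for completeness is easier when the log itself makes tampering evident. A common technique, offered here as an illustration rather than something the text above prescribes, is a hash chain: each entry stores a SHA-256 digest covering its own content and the previous entry's digest, so modifying, deleting, or reordering an earlier entry breaks the chain from that point on. The Python sketch below uses only the standard library; the entry fields are hypothetical. Because anyone able to rewrite the whole log could recompute the chain, the sketch assumes a copy of the latest digest is kept outside the log (for example, by the reviewer).

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # stand-in "previous digest" for the first entry


def _digest(prev: str, entry: dict) -> str:
    payload = prev + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: List[dict], entry: dict) -> None:
    """Append an entry whose digest covers its content and the previous digest."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"entry": entry, "digest": _digest(prev, entry)})


def verify_chain(log: List[dict], expected_head: Optional[str] = None) -> int:
    """Return the index of the first broken link, or -1 if the chain is intact.

    expected_head is a copy of the latest digest kept outside the log; if given,
    a log truncated at the end is reported as broken at index len(log).
    """
    prev = GENESIS
    for i, record in enumerate(log):
        if record["digest"] != _digest(prev, record["entry"]):
            return i
        prev = record["digest"]
    if expected_head is not None and prev != expected_head:
        return len(log)
    return -1
```

A periodic review would run verify_chain and treat any result other than -1 as a tampering event to refer to the incident response team.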
Database Operations – Although various security devices
protect databases, it may be possible for the operator to use system utilities
or unauthorized compilations to modify the system. In such cases, the
database may become corrupt or inaccessible. Operations management should
regularly and carefully review all logs involving database programs and
files and should report all unauthorized modifications to the computer
incident response team.
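The database log review described above amounts to reconciling logged modifications against authorized changes. The Python sketch below assumes a hypothetical log format in which each entry names the user, the database object changed, the action, and the change ticket cited (if any), plus a set of approved ticket numbers from the institution's change management process. Entries that fail either check are collected, with reasons, for referral to the incident response team.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DbLogEntry:
    timestamp: str
    user: str
    db_object: str         # table, index, or stored procedure changed (hypothetical field)
    action: str            # e.g. "ALTER", "UPDATE", "utility_patch"
    ticket: Optional[str]  # change ticket cited for the change, if any


def review_db_log(
    entries: Iterable[DbLogEntry],
    approved_tickets: Set[str],
    authorized_users: Set[str],
) -> List[Tuple[DbLogEntry, List[str]]]:
    """Return each suspicious entry with the reasons it was flagged."""
    findings = []
    for e in entries:
        reasons = []
        if e.ticket is None or e.ticket not in approved_tickets:
            reasons.append("no approved change ticket")
        if e.user not in authorized_users:
            reasons.append("user not authorized for database changes")
        if reasons:
            findings.append((e, reasons))
    return findings
```

Recording the reason alongside each finding gives the incident response team a starting point for its investigation.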
Termination of Operations Personnel – Whenever the employment
of someone with access to sensitive or confidential material is terminated
for any reason, management should revoke or change all physical and logical
access controls, including all key locks, badges, common locks, and cyber
locks. It is sound practice to ask the employee to leave at the time notice
is served. If this is not practical, management should carefully monitor
and review the employee’s activities to ensure the protection of
all data, files, and security devices. There should be written procedures
to define the responsibilities for all operations, IT management, and
human resources personnel when a termination occurs.
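Written termination procedures often take the form of a checklist that ties each access the employee held to the group responsible for removing it. The Python sketch below uses hypothetical access categories drawn from the examples above (badges, key locks, common locks, cyber locks, and logical accounts) and assumes that items issued to one person are revoked, while shared secrets the person knew, such as common lock combinations, must be changed instead. It reports whatever remains outstanding, grouped by responsible group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical: shared secrets the person knew must be changed, not just revoked.
SHARED_KINDS = {"common_lock", "shared_password"}


@dataclass
class AccessItem:
    kind: str        # e.g. "badge", "key_lock", "common_lock", "cyber_lock", "account"
    identifier: str
    owner: str       # group responsible, e.g. "facilities", "it_security" (hypothetical)
    done: bool = False

    @property
    def action(self) -> str:
        return "change" if self.kind in SHARED_KINDS else "revoke"


def pending_by_owner(items: Iterable[AccessItem]) -> Dict[str, List[str]]:
    """Group every unfinished revoke/change task by the group responsible for it."""
    pending = defaultdict(list)
    for item in items:
        if not item.done:
            pending[item.owner].append(f"{item.action} {item.kind} {item.identifier}")
    return dict(pending)
```

An empty result would indicate that every listed access has been revoked or changed; it says nothing about access that was never recorded, so the inventory itself needs to be kept current.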
Run Time Anomalies – Management, a shift supervisor, or
another independent person should review run time logs, identify any anomalies,
and review their cause and resolution. Computer operators could run programs
out of sequence or with improper inputs, either by mistake or to commit
fraud. Automated scheduling programs commonly used in large, complex
institutions significantly reduce the risk of this type of event. Unexplained
or inadequately explained anomalies should prompt a production rerun.
Event report logs for unexplained anomalies should be forwarded to the
computer incident response team for review.
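Two of the anomalies mentioned above, unusual run times and programs run out of sequence, lend themselves to simple automated checks that support the independent reviewer. The Python sketch below assumes a hypothetical history of past run times per job (in minutes) and a schedule that lists each job's required predecessors; the three-standard-deviation threshold is an illustrative choice, not a standard.

```python
from statistics import mean, stdev
from typing import Dict, List, Set, Tuple


def duration_anomalies(
    history: Dict[str, List[float]], today: Dict[str, float], k: float = 3.0
) -> Dict[str, str]:
    """Flag jobs whose run time today is more than k standard deviations from their history."""
    flagged = {}
    for job, minutes in today.items():
        past = history.get(job, [])
        if len(past) < 2:
            flagged[job] = "no baseline"
            continue
        mu, sigma = mean(past), stdev(past)
        if abs(minutes - mu) > k * max(sigma, 1e-9):
            flagged[job] = f"ran {minutes} min vs. typical {mu:.1f} min"
    return flagged


def sequence_anomalies(
    schedule: Dict[str, Set[str]], executed: List[str]
) -> List[Tuple[str, str]]:
    """Flag unscheduled jobs and jobs that ran before (or without) a required predecessor.

    schedule maps each job to its predecessor jobs; executed lists jobs in run order.
    """
    position = {job: i for i, job in enumerate(executed)}
    flagged = []
    for job in executed:
        if job not in schedule:
            flagged.append((job, "not on schedule"))
            continue
        for pred in sorted(schedule[job]):
            if pred not in position:
                flagged.append((job, f"predecessor {pred} did not run"))
            elif position[pred] > position[job]:
                flagged.append((job, f"ran before {pred}"))
    return flagged
```

Flagged items are candidates for review, not conclusions; an anomaly with a documented, adequate explanation would not need a production rerun.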
Management should train and test operations personnel on their ability
to recognize security events that require referral to the computer security
incident response team, security guards, management, or other parties.
Social engineering is a growing concern for all personnel, and in some
organizations personnel may be easy targets for hackers trying to obtain
information through trickery or deception.
Management should consider the safety of its employees as paramount when
there is a life-threatening event. Policies and procedures should reflect
this philosophy. Management should train all operations personnel
to act appropriately during significant events. Employees should also
receive training to understand event response escalation procedures.
Management should properly train operations personnel to recognize events
that could trigger implementation of the business continuity plan (BCP). Although
an event may not initially require invoking the plan, invocation may become necessary as
conditions and circumstances change. Management should train and test
institution personnel to implement and perform appropriate business continuity
procedures within the timeframes of the BCP. Operations personnel should
properly log and record any events that trigger BCP response and document
their ultimate resolutions. Refer to the IT Handbook’s “Business
Continuity Planning Booklet” for additional discussion on this topic.