
An ERP Database Conversion

We just finished a major ERP database conversion.

The conversion consisted of the following major tasks.

  1. Migrate a decade of data from 35 RDB databases into a single Oracle 10g database.
  2. Move from an ‘all-in-one’ design, where the database, batch and middleware for each customer run on one server, to a single dedicated database server for all customers with separate middleware servers.
  3. Merge duplicate person records from all databases into a single person record in the new database.
  4. Point thousands of existing batch jobs, three-tier client-server applications, business logic services and J2EE web services to the new database.
  5. Maintain continuous downstream replication to reporting databases.

Replication

JCC Logminer was used to replicate 600 or so tables from each of the 35 production RDB databases to a merged schema in the new Oracle ERP database. Oracle Streams was used to replicate from the ERP database to a reporting database. The replication was 35 databases into 1 database into 1 database.

        JCC Logminer              Streams
    RDB     ------+
                  |
    RDB     ------+---->> ERP ------>> Reporting
                  |
    RDB x35 ------+

The replication was established and maintained for over a year while the application was prepared for the conversion. Replication from the RDB sources to the ERP and reporting databases was normally maintained to within a few seconds. Significant effort went into building replication monitoring tools to detect and repair broken or lagging replication, and row count reports checked for missing data. During that year, some read-only batch jobs, reports and J2EE functionality were migrated to either the Oracle ERP database or the reporting database.
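
As a rough illustration of the row count checks (not the actual tooling), a comparison between the ERP database and the Streams-fed reporting database might look like the sketch below. The cx_Oracle usage, connection strings, credentials and table names are placeholders invented for the example; the RDB side of the comparison would of course need a different client.

    # Illustrative row count comparison between the ERP database and the
    # reporting database fed by Oracle Streams. Connection strings, schema
    # and table names are placeholders, not the production values.
    import cx_Oracle

    TABLES = ["PERSON", "ACCOUNT", "TRANSACTION_HISTORY"]   # hypothetical tables

    def row_counts(dsn, user, password, tables):
        counts = {}
        with cx_Oracle.connect(user, password, dsn) as conn:
            cur = conn.cursor()
            for table in tables:
                cur.execute("SELECT COUNT(*) FROM " + table)
                counts[table] = cur.fetchone()[0]
        return counts

    def report(source_counts, target_counts):
        for table, source in sorted(source_counts.items()):
            target = target_counts.get(table, 0)
            status = "OK" if source == target else "DIFF %+d ROWS" % (source - target)
            print("%-30s %12d %12d  %s" % (table, source, target, status))

    if __name__ == "__main__":
        erp = row_counts("erp-db:1521/ERP", "monitor", "secret", TABLES)
        rpt = row_counts("rpt-db:1521/RPT", "monitor", "secret", TABLES)
        report(erp, rpt)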

Preparation

During the year-plus that we maintained the database replication, individual screens, forms, reports, batch jobs and middleware components were ported to Oracle and tested against a full copy of the ERP database. Dual code bases were maintained for converted components. Additionally, practice runs of the person-record merge triggered significant data cleanup and introduced business logic changes that had to be coded and tested.

Testing

We set up three major testing initiatives. The first was to ensure that the various batch, client-server, middleware and J2EE applications would function with 10g as the backend database. Simple functionality tests were combined with database optimization testing: each screen, batch job or application had to function as expected and also pass a performance test. A package was built that generated an Oracle execution plan and TKProf trace. As developers and testers checked functionality, the package automatically generated Oracle query plans and SQL traces for each function and e-mailed the results back to the developer. This allowed developers and testers to determine whether the converted functionality was rationally optimized. In many cases, functionality had to be refactored to work efficiently on the new database.
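
The actual tooling was a database package, but the mechanics can be sketched roughly as below: enable SQL trace for the test session, capture the execution plan, run tkprof over the resulting trace file and mail the output to the developer. The DSN, credentials, trace directory, mail host and helper names are all assumptions made for the example, not the real implementation.

    # Rough sketch only -- the production tool was a database package, not this
    # script. It traces one converted query, captures its execution plan,
    # formats the raw trace with tkprof and mails both back to the developer.
    # The DSN, credentials, trace directory and mail host are placeholders.
    import glob
    import os
    import smtplib
    import subprocess
    from email.message import EmailMessage

    import cx_Oracle

    TEST_SQL = "SELECT * FROM person WHERE person_id = :1"   # hypothetical query

    def capture_plan_and_trace(sql, binds):
        conn = cx_Oracle.connect("tester", "secret", "erp-db:1521/ERP")
        cur = conn.cursor()
        cur.execute("ALTER SESSION SET TRACEFILE_IDENTIFIER = 'conv_test'")
        cur.execute("ALTER SESSION SET SQL_TRACE = TRUE")
        cur.execute(sql, binds)               # run the statement under test
        cur.fetchall()
        cur.execute("ALTER SESSION SET SQL_TRACE = FALSE")
        # Bind values are not required for EXPLAIN PLAN.
        cur.execute("EXPLAIN PLAN FOR " + sql)
        cur.execute("SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY)")
        plan = "\n".join(row[0] for row in cur)
        conn.close()
        return plan

    def tkprof_latest(udump="/u01/app/oracle/admin/ERP/udump"):
        trace = max(glob.glob(udump + "/*conv_test*.trc"), key=os.path.getmtime)
        out = trace + ".prf"
        subprocess.check_call(["tkprof", trace, out, "sort=prsela,exeela,fchela"])
        return open(out).read()

    def mail_results(developer, plan, tkprof_report):
        msg = EmailMessage()
        msg["Subject"] = "Execution plan and tkprof output for converted function"
        msg["From"] = "conversion-tests@example.edu"
        msg["To"] = developer
        msg.set_content(plan + "\n\n" + tkprof_report)
        with smtplib.SMTP("mailhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        plan = capture_plan_and_trace(TEST_SQL, [12345])
        mail_results("developer@example.edu", plan, tkprof_latest())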

The second testing effort was directed at schema changes required to support merged databases, and at building programs that could reliably merge the person-records (identity) of millions of customers by comparing and merging attributes such as SSN, address, etc. The merge process was tested on full database copies about a dozen times before the merge results were considered accurate.
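
The production merge rules were far more involved, but the basic attribute-comparison idea can be sketched as follows. The field names, normalization and match keys below are invented for illustration, not the rules that were actually used.

    # Simplified illustration of the person-record matching idea. The real
    # merge compared many more attributes; the field names and match rules
    # here are made up for the example.
    from collections import defaultdict

    def normalize_ssn(ssn):
        digits = "".join(c for c in (ssn or "") if c.isdigit())
        return digits if len(digits) == 9 else None

    def name_address_key(rec):
        return (rec["last_name"].strip().upper(),
                rec["first_name"].strip().upper(),
                rec["postal_code"].strip()[:5])

    def group_candidates(records):
        """Group person records (one per source database) into merge candidates."""
        by_ssn = defaultdict(list)
        unmatched = []
        for rec in records:
            ssn = normalize_ssn(rec.get("ssn"))
            if ssn:
                by_ssn[ssn].append(rec)
            else:
                unmatched.append(rec)

        # Fall back to name + ZIP for records without a usable SSN.
        by_name = defaultdict(list)
        for rec in unmatched:
            by_name[name_address_key(rec)].append(rec)

        return list(by_ssn.values()) + list(by_name.values())

    def merge_group(group):
        """Collapse a candidate group into one surviving record, keeping the
        most recently updated non-empty value for each attribute."""
        merged = {}
        for rec in sorted(group, key=lambda r: r["last_updated"]):  # later wins
            for field, value in rec.items():
                if value not in (None, ""):
                    merged[field] = value
        merged["merged_from"] = [r["source_db"] for r in group]
        return merged

A real merge also needs conflict handling and verification that a sketch like this glosses over, which is why the process was rehearsed on full database copies so many times.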

The third testing effort was directed at simulating load by running critical parts of the application at near-anticipated peak loads using load test software and agents. Historical data was used to estimate the peak load for each function. Oracle MTS (shared server) was tested, with a final configuration that used both dedicated and shared Oracle server processes: transient client connections were configured to use MTS database connections, while persistent J2EE and middleware connections were configured to use dedicated database connections.
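
For illustration, the shared-versus-dedicated split can be expressed in the Oracle Net connect descriptor. The sketch below builds both kinds of connection; the host, port, service name and credentials are placeholders, and the real client-server and J2EE tiers naturally did not use Python.

    # Illustration of routing a session to shared (MTS) or dedicated server
    # processes via the connect descriptor. All names and credentials are
    # placeholders for the example.
    import cx_Oracle

    def descriptor(server_mode, host="erp-db", port=1521, service="ERP"):
        return ("(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=%s)(PORT=%d))"
                "(CONNECT_DATA=(SERVICE_NAME=%s)(SERVER=%s)))"
                % (host, port, service, server_mode))

    # Transient, short-lived client sessions go through the shared servers ...
    shared_conn = cx_Oracle.connect("app_user", "secret", descriptor("SHARED"))

    # ... while long-lived J2EE and middleware connection pools stay dedicated.
    dedicated_conn = cx_Oracle.connect("app_user", "secret", descriptor("DEDICATED"))

The tradeoff is the usual one: shared servers keep per-session memory and process counts down for many short-lived sessions, while dedicated servers keep busy, long-lived connections from queuing behind a limited pool of shared servers.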

Early design decisions (for example, using timestamp instead of datetime) caused a fair amount of bug-hunting on the client-server-to-Oracle connections. Archaic COBOL coding standards introduced cursor issues, network latency issues and Oracle optimization issues. A significant number of Oracle client and Compuware Uniface bugs were detected and eventually either patched or worked around.

Cutover

Cutover and failback checklists were developed and tested. An XMPP chat room was used to coordinate and track the cutover progress and serve as the authoritative record of actions taken (Starting Step 77… … … Step 77 Complete.)
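
As a minimal sketch of that step-announcement pattern (not the tooling we used), each checklist step announces when it starts and when it completes, producing a timestamped record. The post() helper below is a hypothetical stand-in for the XMPP client send.

    # Minimal sketch of announcing checklist steps to the cutover chat room.
    # post() stands in for the real XMPP client send; here it just prints a
    # timestamped line, which is the shape of the record that was kept.
    import time

    def post(message):
        print("%s  %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), message))

    def run_step(number, description, action):
        post("Starting Step %d: %s ..." % (number, description))
        action()
        post("Step %d Complete." % number)

    if __name__ == "__main__":
        checklist = [
            (77, "Point middleware config at new database", lambda: None),  # placeholder
        ]
        for number, description, action in checklist:
            run_step(number, description, action)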

The actual cutover consisted of:

  1. Production shutdown
  2. Replication catch-up and data verification
  3. Database backups
  4. Schema changes to support the merged databases
  5. A lengthy person-merge process to merge identities and accounts for persons who existed in more than one database
  6. Code rollouts and configuration changes to point all application components at the new database
  7. Full database backups
  8. Sanity testing and fixes for newly uncovered bugs
  9. Re-enabling access to the new production environment
  10. Post-production testing and bug fixes

Total downtime was about 36 hours. The pre-merge preparation and post-merge testing took about a third of that time (roughly 12 hours); the person-merge and schema changes took about half (roughly 18 hours).

Post-cutover

The first business day at production loads uncovered significant performance issues in parts of the application that had not been load tested, along with a few dozen minor functionality-related bugs. Service was not significantly affected by either the performance issues or the bugs.

Pre-cutover tension and stress levels were high. Once a cutover like this goes live, there is no realistic fallback strategy. The assumption was that a failed cutover would have been an RGE (Resume Generating Event) for senior IT leadership.

We’ve declared victory. Donuts for all.

Spoke too soon. Damn.
