Thursday, March 12, 2009

An ERP Database Conversion

We just finished a major ERP database conversion.

The conversion consisted of the following major tasks.

  1. Migrate a decade of data from 35 RDB databases into a single Oracle 10G database.
  2. Move from an ‘all-in-one’ design where database, batch and middleware for a customer are all on one server to a single dedicated database server for all customers with separate middleware servers.
  3. Merge duplicate person records from all databases into a single person record in the new database.
  4. Point thousands of existing batch, three tier client-server, business logic services and J2EE web services to the new database.
  5. Maintain continuous downstream replication to reporting databases.

Replication

JCC Logminer was used to replicate 600 or so tables from each of the 35 production RDB databases to a merged schema in the new Oracle ERP database. Oracle Streams was used to replicate from the ERP database to a reporting database. The replication was 35 databases into 1 database into 1 database.

    JCC Logminer            Streams
RDB ----------+---->> ERP ------>> Reporting
|
RDB ----------+
|
RDB x35 ------+
The replication was established and maintained for over a year while the application was prepared for the conversion. Replication from the RDB source to the ERP and Reporting databases was normally maintained to within a few seconds. Significant effort was put into building replication monitoring tools to detect and repair broken or lagging replication. Row count reports checked for missing data. During that year, some read-only batch jobs, reports and J2EE functionality was migrated to either the Oracle ERP or reporting databases.

Preparation

During the year+ that we maintained the database replication, individual screens, forms, reports, batches and middleware components were ported to Oracle and tested against a full copy of the ERP database. Dual code bases were maintained for converted components. Additionally, practicing for the merging of person-records triggered significant data cleanup activities and introduced significant business logic changes that had to be coded and tested.

Testing

We set up three major testing initiatives. The first test was to ensure that the various batch, client server, middleware and J2EE applications would function with 10G as the backend database. Simple functionality tests were combined with database optimization testing. The screen, batch or applications had to both function as expected and also had to pass a performance test. A package was built that generated an Oracle execution plan and TKProf trace. As developers and testers checked functionality, the package automatically generated Oracle query plans and SQL traces for each function and e-mailed the results back to to the developer. This allowed developers and testers to determine if the converted functionality was rationally optimized. In many cases, functionality had to be re-factored to work efficiently on the new database.

The second testing effort was directed at schema changes required to support merged databases, and at building programs that could reliably merge the person-records (identity) of millions of customers by comparing and merging attributes such as SSN, address, etc. The merge process was tested on full database copies about a dozen times before the merge results were considered accurate.

The third testing effort was directed at simulating load by running critical parts of the application at near anticipated peak loads using load test software and agents. Historical data was used to estimate peak load for a particular function. Oracle MTS was tested, with a final configuration of both dedicated and shared Oracle processes. Transient client connections were configured to use MTS database connections, persistent J2EE and middleware connections were configured to use dedicated database connections.

Early design decisions (for example using timestamp instead of datetime) caused a fair amount of bug-hunting on the client server->Oracle connections. Archaic COBOL coding standards introduced cursor issues, network latency issues and Oracle optimization issues. A significant number of Oracle client and Compuware Uniface bugs were detected and eventually either patched or worked around.

Cutover

Cutover and failback checklists were developed and tested. An XMPP chat room was used to coordinate and track the cutover progress and serve as the authoritative record of actions taken (Starting Step 77… … … Step 77 Complete.)

Actual cutover consisted of

  1. Production shutdown
  2. Replication catch-up and data verification
  3. Database backups
  4. Schema changes to support merged databases
  5. A lengthy person-merge process to merge identities and accounts for persons who exist in more than one database.
  6. Code rollouts and configuration changes to point all application components at the new database
  7. Full database backups
  8. Sanity testing and bug fixes for newly uncovered bugs.
  9. Re-enable access to new production environment.
  10. Post-production testing and bug fixes.

Total down time was about 36 hours. The pre-merge preparation and post-merge testing took about 1/3 of that time. The person-merge and schema changes took half of the time.

Post-cutover

The first business day at production loads uncovered significant performance issues on parts of the application that had not been load tested and a few dozen minor functionality related bugs. Service was not significantly affected by either the performance issues or bugs.

Pre-cutover tension and stress level was high. Once a cutover like this goes live, there is no realistic fall back strategy. The assumption was that failure of this cutover would have been an RGE (Resume Generating Event) for senior IT leadership.

We’ve declared victory. Donuts for all.

Spoke too soon. Damn.

No comments:

Post a Comment