Finally, Platform Engineering for Enterprise Cloud Migration

LONDON — The role of the Fortune 500 CTO can be one of beating your head against a wall. Not necessarily at the tech native companies taking many of the top spots, but for the 71% of companies still on mainframes, frustration, turnover and budgets for failed cloud migrations run high. These behemoth enterprises are slow to compete, and the unauditability of their systems has become a huge security risk. But moving too fast could have a catastrophic effect on revenue streams.
Legacy system modernization is a billion-dollar industry with little end in sight. What holds teams back from unlocking the benefits of DevOps, open source and the cloud? How can platform engineering and generative AI enable that first use case and accelerate from there?
Forget lift and shift and go incremental, Rob Mee, CEO at Mechanical Orchard, told The New Stack. Read on to begin to understand the failures of previous migration attempts and how you might be able to make your next enterprise cloud migration your last.
The Legacy No CTO Wants to Leave
The average Fortune 500 CTO is in their fifties. Their current role may very well be their last. This leaves them torn between wanting to have that last sense of achievement and not wanting to rock the boat. And with millions of customers, avoiding disruption is key. They’ve likely been down this digital transformation road before — more than once. Their skepticism and frustration are completely valid.
Yet “if it ain’t broke, don’t fix it” doesn’t cut it anymore. Change is far too slow, making it hard to innovate at the same speed as cloud native companies. Real-time is unreachable.
“We need to do it because this business has to evolve,” Mee said, of organizations with often 30, 40 or 50-year-old monoliths with 10 million lines of mainframe code and billions of database entries, where it’s hard to recruit anyone who is trained in the technology and the legacy knowledge has long left. Add to that, Conway’s Law is in effect, meaning, for decades, the organization and its communication have grown in the shape of that complex, ginormous tech stack.
With a team averaging 15 years of technical experience, the generative AI-native, legacy modernization company Mechanical Orchard knows what it is getting into when its people walk into the CTO or CIO’s office.
“I got myself into trouble because, I said, ‘I bet I know why you failed’,” Mee said, reflecting on an early conversation with their first customer, a Fortune 500 retailer, who was batching data nightly, unable to unlock click and collect and real-time inventory updates. “They said, ‘Well, we’ve done it three times, so why don’t you enlighten us.’”
They were right to be frustrated. There is rarely a single point of failure when a big consultancy is unsuccessful in migrating an enterprise to the cloud. But Mee has uncovered three common ways of failing:
- Projects face cascading complexities.
- A systems integrator is brought in.
- They try to do it all at once.
Each of these failure modes tends to drag out over a couple of years before you can actually test if it’s even working.
“So it’s this feeling of hitting a wall — or a number of different walls. You can go in different directions and try to get out of the maze. There’s a feeling of exhaustion, but yet still high pressure and high risk,” Mee said. “And right now the sense of urgency has gone from high to acute because they’re looking at generative AI and realizing they’re missing out.”
These orgs also haven’t been able to take advantage of open source, which, he continued, “I think is the one true development productivity evolution in this century,” and “in the cloud world, you would use open source for everything.”
On top of all this, they are running out of people who can even maintain the system, so everything becomes fragile and urgent.
It Can’t All Start with a Big Bang
“You have an organization that has molded itself and vice versa the systems for decades. If you can’t have a way to merge those or to do it in stages — then that typically gets too big,” Mee said. “There’s also moving the data. Usually, people do a whole data migration strategy for this kind of thing. That becomes extraordinarily complex too. And you’re trying to support parity in this transformation. The projects end up facing a series of cascading complexities that get bigger and they [the migrations] end up falling over in a couple of years.”
Most of these more traditional organizations have migrated little to nothing because they feel they have to do it all in one go — a Big Bang lift and shift. At best, they have found a way to create new cloud native business units while still maintaining the old mainframes for the original work, leaving everything somehow running on top of some sort of proprietary software that is no longer supported or patched.
Other organizations invest in large-scale translation tools.
“You can use a compiler to transpile a million lines of COBOL into a million lines of Java very quickly, but then it might take you a couple of years to get it into production to get it to actually work,” Mee said, reflecting on a customer in the supply chain space who had tried this. “They spent two and a half years and tens of millions of dollars. And they turned it on and user acceptance tested it, and it ran at about one-tenth the speed.”
“The hardest thing for us to do is get past the fear that risk engenders in people looking at systems that process a lot of revenue, that have been frozen for a long time. They’re afraid to touch it. They’re afraid to have vendors touch it. They may have failed multiple times.” — Rob Mee, Mechanical Orchard
This organization already had a very limited number of people left who had contextual experience with that code. But once it got changed from COBOL into machine-generated Java, no one did.
“Then that architecture is going to be very much like a mainframe architecture in Java,” Mee said. “As this example shows, you turn it on and it’s like ‘Oh, darn, it doesn’t perform,’ because if you’d written it cloud native, you wouldn’t have written it this way. So it ends up that you spend two years to get it to work. Then you’re probably going to spend another couple of years tuning it and refactoring it to the point where the people who have to maintain it can understand it.”
Add to this, in his experience, several really large organizations he’s met with have lost the source code to at least some of their systems.
“Imagine having a system that’s literally been frozen for 10 years,” Mee said. “Customer says, ‘Well, we failed. Vendors failed. The system is really intractable, and that’s why we have a code freeze.’”
By going the Big Bang route, however, the risk is only deferred until the end of the project. Like the Waterfall project management that built these systems, that adds up to a much bigger cumulative risk that could blow up.
Cautiously, Go with the Data Flow
Nowadays, when Mee meets with potential clients, he doesn’t play guess-your-failure. Instead, he asks them to talk about their top five or ten technological pain points. Then the Mechanical Orchard team starts to analyze the legacy system’s behavior based on intercepting inputs and outputs, in order to reproduce the system from there in an open source, cloud native way. This doesn’t mean they don’t use the source code when available — that’s just not the starting point.
“Source code is very useful, but the important part is that we consider the running system and its data flows to really be the source of truth,” Mee explained. “With that, we can leave behind their code relatively easily,” because, in these generations-old systems, there’s a lot of code that isn’t used anymore or has never even been executed.
Different from the last company he founded, Pivotal, which focused on providing the service of upskilling customers, the Mechanical Orchard team is actually building and running these enterprise systems. He said, “We are incrementally modernizing, and never doing a Big Bang. We’re taking slices and going all the way through to production.”
Instead of that lift and shift, they kick off with a pilot, in an attempt to show progress early and often. They take their first slice of the monolith — the big pain point that is most easily isolated versus one that maybe has hundreds of integrations. They reproduce that piece in a sandboxed test environment, and then observe what it does and how data flows through it, leveraging generative AI to interrogate it. Then they reverse engineer that piece in order to build it with the same inputs and outputs, but in a cloud native way.
Older organizations are naturally risk-averse. Instead of directly breaking off pieces to migrate to the cloud, Mee’s team initially avoids production completely by leveraging a test environment to eliminate variables and the need to orchestrate between the two systems. The only thing that remains the same is the organization’s data. It’s up to Mee’s team to verify that they can produce the same outputs with the new, cloud native code.
“We’re going to show you the modern format, demonstrate that it does the same thing that the existing system does, and it does it very effectively and efficiently,” Mee said.
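The sandbox verification described above can be pictured as a replay check: inputs captured from the legacy system are fed to the new implementation, and each result is compared with what the legacy system actually produced. The Python sketch below is only an illustration of that idea, not Mechanical Orchard's tooling. The `RecordedCall` type, the `replay` helper and the sample pricing function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class RecordedCall:
    """One input/output pair captured from the legacy system (hypothetical format)."""
    inputs: dict
    legacy_output: Any


def replay(calls: Iterable[RecordedCall], new_impl: Callable[..., Any]) -> list:
    """Run new_impl on every recorded input; return the calls whose output differs."""
    mismatches = []
    for call in calls:
        actual = new_impl(**call.inputs)
        if actual != call.legacy_output:
            mismatches.append((call, actual))
    return mismatches


# Hypothetical rewritten slice: price in cents after a percentage discount.
def discounted_price_cents(price_cents: int, discount_pct: int) -> int:
    return price_cents * (100 - discount_pct) // 100


# Assumed sample of captured traffic; real recordings would come from the sandbox.
recorded = [
    RecordedCall({"price_cents": 1000, "discount_pct": 10}, 900),
    RecordedCall({"price_cents": 1999, "discount_pct": 0}, 1999),
    RecordedCall({"price_cents": 500, "discount_pct": 50}, 250),
]

mismatches = replay(recorded, discounted_price_cents)
print(f"{len(recorded) - len(mismatches)} of {len(recorded)} recorded calls match")
```

Treating the recorded outputs as the expected values mirrors the article's point that the running system and its data flows, rather than the source code, act as the source of truth.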
The next step is pushing that new, cloud native piece to production. Only by taking it out of the sandbox and into production are they able to confront the real risks of the enterprise environment. “We have to deal with security. We have to deal with networking. We have to deal with all the DBAs. We have to drive through all the organizational obstacles.”
This isn’t a faster way to the cloud — a lot of technology has to be built — but it is one with earlier results, usually within a quarter. Once they’ve taken that first piece to production, its legacy counterpart can be decommissioned. But again, it can’t be a switch. While Mechanical Orchard controls the CloudOps, the rest involves a lot of coordination with the client’s DevOps team. It involves a lot of orchestration work, Mee said, “but if you’re going to deploy incrementally and address risks early, you have to do that work to make the orchestration actually happen.”
In fact, so far they’ve found that orgs want to run what they’ve built in parallel for much longer than should be necessary, making sure that end users don’t notice a change at all and that the change hasn’t disrupted downstream systems.
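One way to picture this parallel-run phase is a wrapper that keeps serving the legacy result while also calling the new component and recording any divergence. The Python sketch below rests on assumptions of its own: `ParallelRunner`, `legacy`, `candidate` and the stock-lookup data are hypothetical, and a real setup would also have to handle networking, latency and side effects.

```python
from typing import Any, Callable


class ParallelRunner:
    """Serve the legacy result while shadow-calling the new code (illustrative only)."""

    def __init__(self, legacy: Callable[[Any], Any], candidate: Callable[[Any], Any]):
        self.legacy = legacy
        self.candidate = candidate
        self.total = 0
        self.divergences = []  # (request, legacy_result, candidate_result_or_error)

    def handle(self, request: Any) -> Any:
        self.total += 1
        expected = self.legacy(request)
        try:
            actual = self.candidate(request)
        except Exception as exc:  # a failure in the new code must not reach the caller
            actual = exc
        if actual != expected:
            self.divergences.append((request, expected, actual))
        return expected  # callers always see the legacy answer during the parallel run

    def match_rate(self) -> float:
        if self.total == 0:
            return 1.0
        return 1 - len(self.divergences) / self.total


# Hypothetical slice: stock-level lookup by SKU.
legacy_stock = {"A1": 4, "B2": 0, "C3": 12}
new_stock = {"A1": 4, "B2": 0}  # "C3" has not been loaded into the new store yet

runner = ParallelRunner(legacy_stock.__getitem__, new_stock.__getitem__)
results = [runner.handle(sku) for sku in ("A1", "B2", "C3")]
print(results, f"match rate: {runner.match_rate():.2f}")
```

Under this kind of arrangement, a team might agree in advance on how long the match rate has to hold before the new component's answer is the one returned, which gives the parallel run a defined end.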
They haven’t completed a cloud migration yet, but they have several components running in production today.
Mee said this plan of piece-by-piece migration “allows us to address risk early and then start to move faster and faster and faster.”