Using AI to improve AI

4242 is an applied research lab in Generative AI based in Amsterdam.

We study recursive self-improvement: the use of AI systems to train and improve other AI systems. We are applied in a specific sense: our work begins and ends with real use cases inside real organizations. And we build bridges: between what is possible in theory and what is useful in practice, and across the final stretch where most of that possibility is currently lost.

The last mile

The capability of frontier models is not the limiting factor for most organizations. By our estimate, today’s models are already sufficient for roughly ninety-five percent of enterprise use cases. And yet, in practice, very little of that capability turns into value.

We call this the last mile problem. As in logistics, the last mile is the part of the journey you can already see and still fail to cross. The distance between a capable model and a working system inside a company is short, but it is also where the real difficulty lives. Closing it is the reason this lab exists.

A capable model is not yet a working one

Imagine hiring the most capable person you can find and placing them inside an unfamiliar company. On their first day, they are not yet useful. They become useful only as they learn how the company actually works: its conventions, its history, the small nuances that are never written down anywhere.

The same is true of language models and the agents built on them. The underlying intelligence is sufficient. What is missing is everything that makes a general-purpose model fit a specific organization: how that organization works, the tools the model can draw on, the standards it is held to. We treat this as the central design problem: not how to make a model smarter, but how to shape an already-capable one to the particulars of the work it is meant to do.

Learning from experience

A good employee becomes more valuable over time, because they learn from experience. Most deployed models do not. Their weights are fixed at training time and do not change as they work. For as long as this remains true of the dominant architecture, learning from experience has to happen somewhere other than the weights.

We locate it in the harness, the flexible layer of context, instructions, tools, and feedback that surrounds a model. The harness is where an agent can be shaped to a company’s specific needs, and where it can keep changing as those needs change. In the current paradigm, this is what continuous learning looks like in practice: not updating the model on the fly, but continually reshaping what surrounds it.

We do not claim that reinforcement learning and fine-tuning are without value; they clearly have their place. Our thesis is narrower and more practical. For adapting an agent to a particular organization and use case, optimizing the harness is more flexible, more cost-effective, and more efficient than repeatedly updating model weights.
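To make the idea of a harness concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any real system: the names `Harness`, `render_prompt`, and `learn` are hypothetical, and the four fields simply mirror the four ingredients named above (context, instructions, tools, feedback). The point it shows is that "learning" can be an edit to the data around the model while the model itself stays fixed.

```python
from dataclasses import dataclass, field, replace
from typing import Callable


@dataclass(frozen=True)
class Harness:
    """Hypothetical container for everything that surrounds a fixed model."""

    instructions: str
    context: tuple[str, ...] = ()
    # Tool name -> callable; the callables are placeholders in this sketch.
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    feedback: tuple[str, ...] = ()

    def render_prompt(self, task: str) -> str:
        """Assemble the text the model would see for one task."""
        parts = [self.instructions]
        if self.context:
            parts.append("Context:\n" + "\n".join(self.context))
        if self.tools:
            parts.append("Tools: " + ", ".join(sorted(self.tools)))
        if self.feedback:
            lessons = "\n".join(f"- {item}" for item in self.feedback)
            parts.append("Lessons from earlier runs:\n" + lessons)
        parts.append("Task: " + task)
        return "\n\n".join(parts)


def learn(harness: Harness, lesson: str) -> Harness:
    """Return a new harness with one more lesson; no weights are touched."""
    return replace(harness, feedback=harness.feedback + (lesson,))


if __name__ == "__main__":
    base = Harness(instructions="You are a support agent.")
    improved = learn(base, "Always confirm the order ID first.")
    print(improved.render_prompt("Refund order 17"))
```

Keeping the harness immutable and returning a new copy from `learn` is a design choice made for this sketch: it keeps every earlier version available for comparison, which matters once versions are being evaluated against each other.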

The bottleneck is no longer the model

We are AI engineers, and we build agents the way most builders do. You deploy an agent, watch its traces, study the cases where it fails, run experiments to improve it, and repeat.

This works until it doesn’t. Past a certain scale, you fall behind. There are too many traces to read. New edge cases appear faster than you can address them. The dataset you rely on to catch regressions keeps growing. At that point something quietly shifts: the constraint is no longer the intelligence of the model. It is the human doing the evaluating and improving.

Using AI to build AI

If the bottleneck is human work, then the human work is what should be automated. So we rebuilt our own job. We took the loop that every agent builder runs (deploy, observe, evaluate, experiment, improve) and turned it into a system that runs on top of any agent, independent of framework, model, or use case.

We call it meta meta: a meta system whose purpose is to do the work we used to do by hand. Unlike a human team, it can spin up as many copies of itself as a problem requires, running thousands of experiments in parallel in search of the harness that works best for a given agent.

This is what recursive self-improvement looks like once it is made concrete. Not a single system bootstrapping itself in the abstract, but AI doing the systematic, patient labor of making other AI systems better.
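The experiment step of that loop can be sketched in a few lines. This is a toy illustration, not the system described above: `run_agent` is a hypothetical, deterministic stand-in for a real agent call, the candidates are plain instruction strings rather than full harnesses, and the dataset is two hand-written checks. What it shows is the shape of the search: score every candidate against the same dataset, in parallel, and keep the best.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# One evaluation case: an input plus a check on the agent's output.
Case = tuple[str, Callable[[str], bool]]


def run_agent(instructions: str, question: str) -> str:
    """Hypothetical stand-in for a real agent; deterministic toy behaviour."""
    answer = question.upper()
    if "polite" in instructions:
        answer = "Please note: " + answer
    return answer


def score(instructions: str, dataset: Sequence[Case]) -> float:
    """Fraction of cases whose check passes for this candidate."""
    passed = sum(1 for question, check in dataset
                 if check(run_agent(instructions, question)))
    return passed / len(dataset)


def search(candidates: Sequence[str], dataset: Sequence[Case],
           max_workers: int = 8) -> tuple[str, float]:
    """Score all candidates concurrently and return the best one."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda c: score(c, dataset), candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


DATASET: list[Case] = [
    ("refund status", lambda out: "REFUND" in out),
    ("hello", lambda out: out.startswith("Please note")),
]
CANDIDATES = ["Answer briefly.", "Answer briefly and be polite."]

if __name__ == "__main__":
    print(search(CANDIDATES, DATASET))
```

A thread pool is used here on the assumption that real agent calls are dominated by network waits; a search at the scale of thousands of experiments would more plausibly be distributed across machines, which this sketch does not attempt.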

Agents are living systems

Traditional software is largely deterministic. Given the same input, it returns the same output, and once it is correct it tends to stay correct. Agents are different. Language models are non-deterministic, and most useful agents are open-ended: people will use them in more ways than anyone designed for.

This means evaluation and optimization are not a phase that finishes. They are a continuous condition of running an agent at all. It helps to think of an agent less as a finished artifact and more as a living system: something that keeps adapting until it takes the shape that works for you, and then keeps adapting as the world around it moves.
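One practical consequence of non-determinism is that a single passing run says little. A minimal sketch of the alternative, repeating each case and reporting a pass rate, might look like the following. The names `pass_rate` and `make_flaky_agent` are hypothetical, and the "agent" is simulated with a seeded random number generator so the example runs without any model.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], task: str,
              check: Callable[[str], bool], trials: int = 20) -> float:
    """Run the same task several times and return the fraction that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if check(agent(task)))
    return passed / trials


def make_flaky_agent(success_probability: float,
                     seed: int) -> Callable[[str], str]:
    """Simulated agent (an assumption of this sketch, not a real model)."""
    rng = random.Random(seed)

    def agent(task: str) -> str:
        # rng.random() is in [0, 1), so probability 1.0 always succeeds.
        return "ok" if rng.random() < success_probability else "error"

    return agent


if __name__ == "__main__":
    agent = make_flaky_agent(success_probability=0.8, seed=42)
    rate = pass_rate(agent, "cancel my order", lambda out: out == "ok")
    print(f"pass rate over 20 trials: {rate:.2f}")
```

Tracking this number over time, rather than checking it once before launch, is what treating evaluation as a continuous condition amounts to in code.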

Working in the field

A lab that calls itself applied has to stay close to real systems in real organizations. So alongside our own research, we partner with a small number of companies on their hardest agent problems. We do this both to help them cross the last mile, and because those problems are where our work is tested and sharpened.

Most of what we learn comes from cases other teams couldn’t solve. If you are building agents and have reached the ceiling of what you can get working on your most demanding use cases, where the model is no longer the thing standing in your way, we would like to hear about it. The engagements we take on are few, and we are drawn to the ones where the difficulty is real.

If that describes you, write to us.

The gap between what AI can do and what it actually does for people is, we think, one of the most important applied problems in the field today. It is also a solvable one. We are building the systems that close it, and using AI to do the building.


Latest research

May 28, 2026
Beating τ-bench retail by 10%
How a three-person team outperformed the best public models on τ-bench retail, and what it taught us about agent optimization.