Can Horizontal Scaling Get Us Closer to AGI?
Author: michaellwy (@michael_lwy)
 
 

Introduction
Here’s Fortytwo in one sentence: Fortytwo is a decentralized AI network that runs swarms of small language models on consumer‑grade hardware, coordinating them into one inference engine that can out‑scale and out‑reason centralized giants.
Okay. Decentralized AI. Swarm inference. Big vision. Abstract concepts. Is this just another narrative‑driven project padded with buzzwords?
In this article, I explore the nature of LLMs, break down the Fortytwo whitepaper with added context, and lay out the facts so you can judge for yourself.
LLMs as ‘lossy compression of the internet’
To understand what problem Fortytwo is trying to solve, we first need to understand the nature of LLMs.
An LLM is a predictive machine. It absorbs vast collections of text and learns which words tend to follow which. During the training process, it distills all that data into a fixed set of numerical weights. The model then uses those weights to guess the next token whenever we prompt it with new text.
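To make "learns which words tend to follow which" concrete, here is a toy sketch in Python. It is a plain bigram counter, far simpler than a real LLM (which uses a neural network over tokens rather than a lookup table), and the tiny corpus is invented purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text: str) -> dict[str, Counter]:
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(model: dict[str, Counter], word: str):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Invented toy corpus: "the" is followed by "cat" twice and "mat" once.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat"
```

The counts play the role of the "weights": once training is done, the original text is gone and only the statistics remain.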
OpenAI co-founder Andrej Karpathy calls this process “lossy compression.” Science fiction writer Ted Chiang offers another metaphor: ChatGPT is “a blurry JPEG of the Web.”
What do they mean by that? Let’s use images as an analogy (sorry, analogy haters).
Photographers often shoot in RAW format, which stores uncompressed data for maximum editing flexibility. JPEGs, by contrast, compress the file to save space, discarding fine details while preserving the broad outline.
How does this relate to LLMs?
A model such as GPT‑4 contains many billions of parameters, yet that number is still far smaller than the total text (or tokens) it ingested. The result is a fuzzy, approximate map of human knowledge. When we ask a question, the model “decompresses” its internal summary into human language. Most of the time the answer is plausible and correct. Sometimes it is not, because the underlying representation is approximate.
Of course, models are improving day by day. You can train them on more data and build bigger models with more parameters.
Yet the lossiness seems to be fundamental: no single model can memorize the entire internet with perfect precision. Making a model larger improves fidelity, but never eliminates the blur.
This brings us back to Fortytwo’s design approach: if any one model is inevitably imperfect, why not combine many?
Fortytwo links a network of small language models (SLMs), each with its own compressed view of the world. A user query goes to a subset of these models. They answer in parallel, then score one another’s responses and vote. Different blind spots cancel out. If one node supplies a vital detail, others can recognize its value and elevate that answer. A weighted consensus selects the final output.
Fortytwo, in short, treats individual LLMs as imperfect mirrors and fuses them at inference time into a clearer composite.
By framing inference as an ensemble problem, the network aims to overcome the lossiness that no single model can avoid.
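The intuition that "different blind spots cancel out" can be illustrated with a classic majority-vote calculation. The sketch below is not Fortytwo's protocol (which uses peer ranking, not a simple majority vote), and it assumes every model errs independently with the same accuracy `p`, which real models trained on overlapping data will not fully satisfy. The function name and the 0.7 accuracy figure are illustrative assumptions.

```python
from math import comb

def majority_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct,
    when each voter is correct with probability p (use an odd n to avoid ties)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Hypothetical per-model accuracy of 0.7; compare ensembles of different sizes.
for n in (1, 5, 21):
    print(n, round(majority_accuracy(n, 0.7), 3))
```

Under these assumptions the ensemble's accuracy rises with the number of voters as long as `p` is above 0.5. The more correlated the models' errors are, the weaker this effect becomes, which is one reason diversity among nodes matters.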

Source: Fortytwo blog
Lessons from Autonomous Driving
Does Fortytwo’s design approach make sense? We can look to a related field like autonomous‑driving systems for some lessons. Martin Casado of a16z captured this succinctly:
Early on, people imagined a “god-model” autonomous-driving AI that takes in raw sensor data (cameras, LiDAR, etc.) and directly outputs steering and throttle commands. Essentially a single end-to-end neural network that drives a car. Indeed, some research prototypes have tried this end-to-end approach.

However, the industry at large moved away from relying on one monolithic model.
In self‑driving cars today, the AI is typically split into specialized subsystems: perception (detecting lanes, vehicles, pedestrians), prediction (anticipating what those agents will do next), planning (deciding the car’s path or actions), and control (executing steering or braking to follow the plan).
This is a classic divide‑and‑conquer strategy: each component tackles part of the problem and passes its result to the next.

Fortytwo’s philosophy aligns with this idea: complex problems are best solved by an organized team of specialists rather than a single know‑it‑all model.
Self-supervised models
So, how does Fortytwo actually organize these AI “specialists”?
The Fortytwo network is composed of many nodes, each of which is essentially an AI model running on a participant’s device. These nodes are designed to be self-supervised and self-contained, meaning they can both generate answers and evaluate answers. In the whitepaper, the authors describe a multi-component node architecture in which each node has:

- Primary Cognitive Module: the brain of the node, responsible for understanding queries, generating responses, and ranking responses.
- Tool Preprocessor (optional): prepares or transforms input (for instance, converting an image to text if the node has a vision plugin).
- Tool Postprocessor (optional): refines the output or calls external services.
These optional “auxiliary units” give nodes (which host LLMs) flexibility to handle various modalities or integrate tools or APIs, while the overall structure of a node remains consistent. Figure 1 illustrates the node architecture in simplified form.
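As a rough mental model of this node layout, here is a minimal Python sketch. The class name `Node`, its field names, and the stand-in callables are all hypothetical; the whitepaper does not prescribe this interface, and the ranking role of the primary module is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    """Hypothetical node: a core model plus optional pre/post tool hooks."""
    generate: Callable[[str], str]                      # primary cognitive module
    preprocess: Optional[Callable[[str], str]] = None   # optional tool preprocessor
    postprocess: Optional[Callable[[str], str]] = None  # optional tool postprocessor

    def answer(self, query: str) -> str:
        # Each optional hook runs only if the node was configured with one.
        if self.preprocess:
            query = self.preprocess(query)
        result = self.generate(query)
        if self.postprocess:
            result = self.postprocess(result)
        return result

# Stand-in callables; a real node would call a language model here.
plain = Node(generate=lambda q: f"answer to: {q}")
tooled = Node(
    generate=lambda q: f"answer to: {q}",
    preprocess=str.strip,
    postprocess=str.upper,
)
print(plain.answer("what is 6 x 7?"))
print(tooled.answer("  what is 6 x 7?  "))
```

Because the hooks are optional, a text-only node and a tool-equipped node expose the same `answer` method, which mirrors the idea that the overall structure of a node stays consistent.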
Swarm-Based Consensus: How the Best Answer is Chosen
Having multiple nodes answer a question is only useful if there’s a way to pick the best or combine answers. Fortytwo uses a swarm-based consensus mechanism to coordinate the nodes and converge on a high-quality result.
Let’s walk through a typical inference round of Fortytwo’s swarm consensus mechanism step by step (illustrated in the figure below):

Steps 1-2: Broadcast & Sub‑Swarm Formation
When a user’s query comes in, it is broadcast to all nodes in the network. Each willing node (one that is free and considers itself relevant to the query) joins the sub‑swarm for that query and generates its own answer independently.
Steps 3-6: Encrypted Answer Generation
To ensure fairness, nodes first encrypt their answers and submit the encrypted responses, so no node can peek at others’ answers and copy or adapt them. This is a commit-reveal scheme. After a short time window for answering, each node reveals its decryption key so that the answers can be read openly. At this point, the swarm has a set of proposed answers, one from each participating node, all generated without influence from the others.
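The commit-reveal idea can be sketched in a few lines of Python. Note the substitution: the article describes encrypting answers and later revealing a decryption key, whereas this sketch uses a salted SHA-256 hash commitment, a common and simpler way to get the same "lock in now, reveal later" property. The function names are hypothetical.

```python
import hashlib
import secrets

def commit(answer: str) -> tuple[str, bytes]:
    """Return (commitment, nonce). The commitment is published; the nonce stays secret."""
    nonce = secrets.token_bytes(16)  # random salt so short answers cannot be guessed
    digest = hashlib.sha256(nonce + answer.encode("utf-8")).hexdigest()
    return digest, nonce

def verify(commitment: str, answer: str, nonce: bytes) -> bool:
    """Check that a revealed (answer, nonce) pair matches the earlier commitment."""
    digest = hashlib.sha256(nonce + answer.encode("utf-8")).hexdigest()
    return secrets.compare_digest(digest, commitment)

# Commit phase: only the commitment is shared with the swarm.
commitment, nonce = commit("42")
# Reveal phase: the node publishes its answer and nonce; anyone can check them.
print(verify(commitment, "42", nonce))  # True
print(verify(commitment, "43", nonce))  # False: the answer was changed after committing
```

In either variant the point is the same: once the commitment is published, a node cannot swap in a different answer after seeing what its peers wrote.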

Steps 7-11: Randomized Peer Ranking
Now each node gets the task of evaluating some of the answers. A pseudo-random process (using a recent blockchain block hash as a randomness source) assigns each node a subset of the other nodes’ answers to evaluate. For example, each node might be asked to rank one third of the answers from best to worst (excluding its own answer). Because the subsets are random and each node only sees a portion of the answers, no single node can sway the entire vote, and collusion between nodes is difficult. The nodes submit their rankings, again encrypted to prevent influence, and then those rankings are decrypted and collected. In step 11, the rankings are aggregated: using each node’s weight (rating) as a voting weight, the protocol calculates a weighted ranking for each answer.
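One way such an assignment could work is sketched below. This is an assumption-laden illustration, not the protocol's actual algorithm: the function name, the per-node seed derivation, and the placeholder block hash `"0xabc123"` are all invented. The key property it shows is that anyone holding the same block hash can recompute the same assignment.

```python
import hashlib
import random

def assign_reviews(node_ids: list[str], block_hash: str, fraction: float = 1 / 3) -> dict[str, list[str]]:
    """Give each node a pseudo-random subset of the other nodes' answers to rank."""
    assignments = {}
    for node in sorted(node_ids):
        # Derive a per-node seed from the shared block hash (hypothetical scheme).
        seed = hashlib.sha256(f"{block_hash}:{node}".encode()).digest()
        rng = random.Random(seed)
        others = [n for n in sorted(node_ids) if n != node]  # a node never ranks itself
        k = min(len(others), max(1, round(len(others) * fraction)))
        assignments[node] = rng.sample(others, k)
    return assignments

nodes = [f"node{i}" for i in range(7)]
assignments = assign_reviews(nodes, "0xabc123")  # placeholder block hash
for node, targets in assignments.items():
    print(node, "ranks", targets)
```

Since the block hash is not known before the answers are committed, a node cannot predict in advance whose answers it will be judging.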
Steps 12-13: Consensus & Return
In the last phase, the system aggregates all the peer evaluations to decide which answer wins the consensus. Not all nodes’ opinions count equally, though. If the network knows that a node is usually very reliable, its vote can be given more weight. Fortytwo’s paper describes a weighted ranking method: each node has a weight (or reputation score), and the scores it gives to answers are multiplied by that weight and summed. The answer with the highest total weighted score is declared the best and returned to the user.
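The weighted aggregation can be sketched as follows. The scoring rule here is an assumption: each judge's ranking is converted to Borda-style points (best answer gets the most points), which are then multiplied by the judge's weight. The whitepaper's exact scoring formula may differ, and all names and numbers are illustrative.

```python
from collections import defaultdict

def weighted_consensus(
    rankings: dict[str, list[str]], weights: dict[str, float]
) -> tuple[str, dict[str, float]]:
    """rankings maps each judge to its ordered list of answer ids, best first."""
    totals: dict[str, float] = defaultdict(float)
    for judge, ordered in rankings.items():
        n = len(ordered)
        for position, answer_id in enumerate(ordered):
            points = n - 1 - position  # Borda-style (assumption): best gets n-1, worst gets 0
            totals[answer_id] += weights.get(judge, 0.0) * points
    winner = max(sorted(totals), key=lambda a: totals[a])  # ties break alphabetically
    return winner, dict(totals)

# Answers A, B, C come from nodes n1, n2, n3; nobody ranks their own answer.
rankings = {"n1": ["B", "C"], "n2": ["C", "A"], "n3": ["B", "A"]}
reputation = {"n1": 1.0, "n2": 3.0, "n3": 1.0}  # n2 has a strong track record

winner, totals = weighted_consensus(rankings, reputation)
print(winner, totals)  # C wins: n2's weight of 3.0 outweighs two votes of weight 1.0

equal = {"n1": 1.0, "n2": 1.0, "n3": 1.0}
print(weighted_consensus(rankings, equal)[0])  # with equal weights, B wins instead
```

The example shows why reputation matters: the same set of rankings produces a different winner depending on how much each judge is trusted.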
Handling Malicious Nodes and Adversarial Attacks
The swarm-based consensus mechanism is essentially a voting system in which thousands of small language models agree on a single answer. Now ask yourself: what’s the first thing a clever attacker might try? Fortytwo walks through the same thought experiment and builds countermeasures into its protocol.
1. “I’ll flood the network with fake identities.”
That is the classic Sybil attack: spin up a farm of nodes and out‑vote everyone else. Fortytwo’s safeguard here is an economic disincentive. Each node needs to pay for a “ticket” (a stake) before it can submit an answer. If its answer doesn’t win, the stake is slashed and redistributed to the winners. Simulations in the whitepaper show that a deposit as small as 1% of the reward pool flips a Sybil campaign into negative ROI.
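To see why even a small stake changes the attacker's math, here is a deliberately simplified expected-value sketch. It is not the whitepaper's simulation: it assumes a single winner takes the whole reward, a loser forfeits the whole stake, and it ignores redistribution of slashed stakes. The function names and numbers are hypothetical.

```python
def expected_profit(p_win: float, reward: float, stake: float) -> float:
    """Expected profit of one submission: win the reward with probability p_win,
    otherwise lose the stake."""
    return p_win * reward - (1 - p_win) * stake

def break_even_win_rate(reward: float, stake: float) -> float:
    """Win probability at which expected profit is exactly zero."""
    return stake / (reward + stake)

reward, stake = 100.0, 1.0  # stake set to 1% of the reward, as in the article's example
print(break_even_win_rate(reward, stake))   # 1/101, just under 1%
print(expected_profit(0.005, reward, stake))  # negative: a low-quality Sybil node loses money
print(expected_profit(0.02, reward, stake))   # positive: a node that wins often enough profits
```

In this toy model, a node whose answers win less than roughly 1% of the time loses money on every submission, so multiplying low-quality identities multiplies losses rather than influence.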
2. “I’ll collude with friends instead.”
Collusion only works if conspirators can consistently judge each other’s answers. In Fortytwo, they can’t. After all answers have been committed and revealed, a hash from the latest block pseudo-randomly assigns each node a subset of peers to rank (never itself). Because assignments change every round, colluders can’t guarantee they’ll even judge one another. Rankings are also tallied with reputation weights, so a long-term honest record counts for more than a newcomer’s opinion.
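How likely is it that a colluder gets to judge a partner at all? Under the simplifying assumption that each node's review set is a uniformly random sample of the other nodes' answers, the chance can be computed directly. The function name and the example sizes are illustrative assumptions, not figures from the whitepaper.

```python
from math import comb

def p_sees_partner(n_nodes: int, n_colluders: int, k: int) -> float:
    """Probability that one colluder's k assigned answers include
    at least one answer written by a fellow colluder."""
    others = n_nodes - 1          # answers the node could be assigned
    partners = n_colluders - 1    # fellow colluders among them
    honest = others - partners
    # Complement of "all k assigned answers come from honest nodes".
    return 1 - comb(honest, k) / comb(others, k)

# Hypothetical round: 100 nodes, 5 colluders, varying review-set sizes.
for k in (5, 10, 33):
    print(k, round(p_sees_partner(100, 5, k), 3))
```

The probability grows with the size of the review set and the size of the colluding group, which is why random assignment works together with the stake and the reputation weights rather than on its own: seeing a partner's answer occasionally is not enough to swing a weighted vote.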
3. “What about tricking the models themselves?”
One ploy malicious actors might try is the low‑frequency‑token attack: injecting rare Unicode symbols or gibberish strings that push certain models off balance. The idea is to exploit the possibility that some models oddly favor rare tokens and mistake them for smart answers. Fortytwo counters by encouraging heterogeneity. Each node can run a different architecture or be fine‑tuned on different data. A token that derails one model is meaningless to another, so the prank never reaches a majority.
4. “I’ll just butter up the judges.”
Self‑aggrandising phrases (“trust me, I’m right”) are treated as noise. Ranking prompts focus models on factual and logical quality, so vanity text tends to hurt rather than help the score.
Performance and Latency
OK, all of this sounds great. We ask a bunch of nodes to reach a conclusion collectively so the answer is more robust and accurate. But what does it mean for performance? Surely all the routing in the backend and the communication between nodes mean that inference takes a long time?
One of the impressive claims in the Fortytwo whitepaper is that this swarm approach achieves very low latency, on the order of hundreds of milliseconds, even for large models. (Note: this is added latency on top of baseline inference time or network latency.) To appreciate this, consider some other decentralized or verifiable inference approaches:
- Some systems use zero-knowledge proofs to verify that a model inference was done correctly. While secure, these methods (e.g., ZK-SNARKs for ML) can take minutes or hours to produce a proof for a single image or a single neural network inference.
- Using trusted hardware like Intel SGX (secure enclaves) can reduce latency (to a second or less), but this approach doesn’t scale well and requires trusting both the hardware manufacturer and that the enclave isn’t compromised.
- Simpler “proof-of-quality” schemes (where a lightweight model quickly checks a larger model’s output) can be faster (tens of milliseconds) but might not guarantee high accuracy.
Fortytwo reports that its swarm consensus inference can be done in 125 ms even on an LLM with 405 billion parameters (for context, 405B is larger than most current models, so they are aiming really high).
How is this possible? The speed comes from a combination of factors designed into the system:
Parallel Processing: All nodes work at the same time on answering the query. In a centralized setup, if you had a single model, you might have to run it on a request and wait for it to finish. In Fortytwo’s decentralized network, dozens or hundreds of smaller models can each tackle the question simultaneously. Essentially the initial answer generation phase doesn’t incur a network-wide delay beyond the slowest node’s response. If each node is running on its own hardware, they are all utilized concurrently.
Selective, Distributed Evaluation: As described, not every node evaluates every answer. Each does only a fraction of the work of judging answers. The workload of evaluation is spread out, avoiding having everyone check everyone.
Minimal Overhead per Evaluation: Ranking an answer is computationally cheap for an LLM. The whitepaper highlights that a node only needs to infer a few tokens (or a very small output) to produce a ranking. For instance, a node might just output a score or a single word like “Good” vs “Bad” for a given answer. Generating one token in an LLM is extremely fast compared to generating a whole paragraph.
Weighted Voting is Light: Combining the rankings with weights (essentially computing a weighted sum and finding the maximum) is a trivial computation. Its cost does not grow with model size or data size in any problematic way.
Asynchrony and Pipelining: Many steps in the process can overlap. For example, while some nodes are still finishing their answer generation, others might already start computing partial evaluations on answers that have been revealed. Or the network can begin collecting votes from nodes that finished early while waiting for the stragglers. The whitepaper notes that asynchronous operations allow tasks to overlap. This pipelining shaves off idle time, ensuring that the network isn’t sitting and waiting at each barrier unnecessarily.
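The parallel-processing point above, that answer generation takes about as long as the slowest node rather than the sum of all nodes, can be demonstrated with a small `asyncio` sketch. The node names and delays are made up, and `asyncio.sleep` stands in for real model inference.

```python
import asyncio
import time

async def node_answer(name: str, delay_s: float) -> str:
    """Stand-in for one node's inference; the sleep simulates generation time."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"

async def run_round(delays: dict[str, float]) -> tuple[list[str], float]:
    """Start every node at once and wait until all of them have answered."""
    start = time.perf_counter()
    answers = await asyncio.gather(*(node_answer(n, d) for n, d in delays.items()))
    return list(answers), time.perf_counter() - start

delays = {"node-a": 0.01, "node-b": 0.02, "node-c": 0.03}  # hypothetical latencies in seconds
answers, elapsed = asyncio.run(run_round(delays))
print(answers)
print(f"elapsed ~ {elapsed:.3f}s vs sequential sum {sum(delays.values()):.3f}s")
```

On a typical machine the elapsed time lands near the largest delay rather than the total, which is the behavior the article attributes to the swarm's answer-generation phase.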
Conclusion
Fortytwo’s network represents a bold reimagining of how we scale and distribute AI. Instead of building one model to rule them all, it builds a community of models. In this blog post, we dissected the core principles behind Fortytwo:
- treating LLMs as lossy knowledge compressors and using many of them to recover accurate results by consensus,
- learning from analogous systems like autonomous driving, which favor modular teams over monolithic “god” solutions,
- designing LLM nodes that can self-evaluate and coordinate without central control, and
- constructing a consensus protocol that allows a crowd of models to produce a single high-quality answer with speed and reliability.
This architecture offers three practical advantages over monolithic LLMs:
- Cost curve: compute scales horizontally on commodity hardware, driving per‑query cost down as adoption rises.
- Specialisation: individual contributors can plug in private, domain-tuned models that expand the network’s expertise while sharing only inference results, improving accuracy and earning rewards for their niche.
- Governance: policy decisions (what counts as high-quality and safe output, which tools are allowed) can be voted on on-chain, giving users and operators leverage instead of leaving them at the mercy of a closed API.
If these mechanics prove out, “running AI” will look less like renting time on a distant supercomputer and more like joining an open marketplace where models compete, cooperate, and continually raise the benchmark for reliable reasoning.