How we approach evaluation.
The methodology document exists to prevent trust collapse. If we publish comparisons without explaining how we arrived at them, the comparisons are not useful — they're just another set of vendor-adjacent claims. This page explains what we measure, how we measure it, and where our current limitations are.
The inputs to each comparison
Comparisons on MetaAI.io may draw on any combination of the following, depending on the benchmark and the dimension being evaluated. We do not claim a single unified measurement approach — the approach varies by what's practical and honest to assess at this stage.
- Observed behavior. We run tasks across models using consistent prompts and document the outputs. This is our most direct form of evidence.
- Editorial assessment. Some qualities, such as tone, nuance, and the usefulness of a response, don't reduce to numbers. We make qualitative calls and label them clearly as editorial.
- Public information. Pricing, announced capabilities, knowledge cutoff dates, and context window sizes are taken from official documentation and labeled accordingly.
- Early benchmark criteria. In Phase 0, we are still developing the evaluation framework. Some criteria are provisional and will evolve as we gather more signal.
- Workflow-specific interpretation. A model's aggregate performance doesn't determine its fit for a specific workflow. We're beginning to explore how model choice changes by use case.
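To make the "observed behavior" input concrete, here is a minimal sketch of running the same prompts across several models and recording every output. It is an illustration only, not a description of the tooling behind this site: `Observation`, `run_comparison`, and the stand-in model callables are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Observation:
    """One recorded output: which model, which prompt, what came back."""
    model: str
    prompt: str
    output: str


def run_comparison(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
) -> List[Observation]:
    """Send every prompt, unchanged, to every model and record the outputs."""
    observations = []
    for prompt in prompts:
        for name, generate in models.items():
            observations.append(Observation(name, prompt, generate(prompt)))
    return observations


# Hypothetical stand-ins for real model clients, used only for illustration.
models = {
    "model-a": lambda prompt: prompt.upper(),
    "model-b": lambda prompt: prompt[::-1],
}

results = run_comparison(models, ["Summarize this paragraph."])
for obs in results:
    print(f"{obs.model}: {obs.output}")
```

The point of the structure is that the prompt text is passed through untouched, so any difference between recorded outputs comes from the model rather than from the wording of the task.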
What is measured vs estimated vs editorial
Every data point on a benchmark page carries one of three labels. This is not decoration — it's the primary mechanism by which we maintain honesty about the quality of the evidence behind each claim.
Directly observed
Values derived from running the same prompt or task across models and recording outputs. Labeled as measured.
Estimated
Values inferred from public documentation, pricing pages, or third-party reports. Labeled as estimated.
Manually reviewed
Qualitative assessments made by human review, not automated scoring. Labeled as editorial.
Provisional (Phase 0)
Separately from the three labels above, some values are marked provisional where we have early signal but insufficient data to be confident. These will be updated as the project develops.
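One way to picture the labeling scheme is as a small data structure in which no value can exist without an evidence label. The sketch below is hypothetical: `EvidenceLabel` and `DataPoint` are names invented for this example, and treating "provisional" as a separate flag rather than a fourth label is an assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceLabel(Enum):
    MEASURED = "measured"    # directly observed by running the same task
    ESTIMATED = "estimated"  # inferred from public docs or third-party reports
    EDITORIAL = "editorial"  # qualitative call made by human review


@dataclass(frozen=True)
class DataPoint:
    dimension: str
    value: str
    label: EvidenceLabel       # required: a value cannot be created without one
    provisional: bool = False  # early signal, not enough data to be confident

    def display(self) -> str:
        """Render the value with its evidence label always attached."""
        tag = self.label.value
        if self.provisional:
            tag += ", provisional"
        return f"{self.dimension}: {self.value} [{tag}]"


# Made-up placeholder values for illustration; not real benchmark data.
point = DataPoint(
    dimension="context window",
    value="example value",
    label=EvidenceLabel.ESTIMATED,
    provisional=True,
)
print(point.display())  # context window: example value [estimated, provisional]
```

Making the label a required field means a value cannot be shown without stating what kind of evidence is behind it.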
What this project does not claim to be
Being explicit about limitations is not a weakness in an evaluation project. It's a requirement for the project to be trustworthy at all. The following constraints apply to everything published on this site.
- Model behavior changes with each version release. A comparison that was accurate in one month may be outdated the next.
- Outputs vary significantly by prompt phrasing, context window size, system instructions, and access tier. Our observations reflect specific conditions that may not match yours.
- Not every benchmark dimension is exhaustively tested at this stage. Where coverage is limited, we say so.
- This project is an aid to evaluation: a structured starting point for thinking through model choice. It is not a universal truth source or a replacement for testing models against your specific use case.
- MetaAI.io has no commercial relationship with any of the AI providers referenced on this site. We do not receive referral fees or incentives for any assessment.
Phase 0 is a 30-day test.
MetaAI.io is in an early testing phase. We're exploring whether there's demand for an independent comparison layer in a market where model choice is genuinely difficult and vendor-neutral signal is scarce.
If early signal from this phase is strong, we'll develop the evaluation methodology further — adding more structured measurement protocols, expanding benchmark coverage, and working directly with teams to test workflow-specific comparisons.
Nothing on this site should be read as a finished product or a permanent commitment. It's an honest test of whether this kind of project is worth building — and if so, how it should be built.
What we're not doing
- Building features that don't yet exist
- Claiming category leadership or authority
- Publishing benchmarks with false precision
- Making commercial claims about any AI provider
What we're learning
- Which dimensions teams most want compared
- What workflow-specific evaluation looks like in practice
- Whether independent comparison creates useful signal
- What methodology is actually defensible at this stage
Help shape the methodology.
If you're evaluating AI models for real workflows, your input would directly inform how we develop this project. We're looking for early design partners and are accepting access requests.