Which LLM for My Task?
There is no single "best" LLM, only the best fit for your task and constraints. A frontier model is overkill for high-volume classification, and a cheap, fast model is likely to struggle with hard agentic reasoning. Tell the picker what you're building and what you care about (quality, cost, speed, context, open weights), and it scores each model against your needs and ranks them, showing its reasoning.
- The right model depends on the task, not just leaderboard position.
- Quality, cost, speed and context window are usually in tension — weight what matters.
- Open-weight models win on control, privacy and self-hosting; frontier models on the hardest tasks.
- Use this to shortlist, then confirm with your own evaluation set.
Find your model
How the picker scores models
Each model carries a profile of capability ratings — general intelligence, coding, reasoning, speed, cost efficiency, context window, vision, tool use and creative writing — plus whether its weights are open. Your chosen task decides which of those capabilities count and how much: a coding task leans on coding, reasoning and tool use, while high-volume classification leans on speed and low cost. That produces a task-fit score.
Your priority sliders then add a second score that weights quality, cost, speed and context the way you ranked them, and any hard requirements — open weights, vision, a minimum context window — remove models that don't qualify before scoring. The two scores combine into a single ranking. Because the underlying ratings are an editable snapshot rather than benchmark truth, treat the result as a shortlist: it points you at two or three credible candidates to validate on your own evaluation set, which is the only test that reflects your real prompts.
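The scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the picker's actual code: the model names, the 0-10 ratings, the rating keys, the slider-to-rating mapping and the equal 50/50 blend are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    ratings: dict          # capability name -> rating on an assumed 0-10 scale
    open_weights: bool
    context_tokens: int


# Assumed mapping from the four priority sliders to rating keys.
PRIORITY_TO_RATING = {
    "quality": "intelligence",
    "cost": "cost_efficiency",
    "speed": "speed",
    "context": "context",
}


def weighted_mean(ratings, weights):
    """Weighted average of the ratings named in `weights`."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(ratings.get(k, 0.0) * w for k, w in weights.items()) / total


def rank_models(models, task_weights, priorities, *, need_open=False,
                need_vision=False, min_context=0, blend=0.5):
    """Drop models that fail hard filters, then blend the two scores."""
    priority_weights = {PRIORITY_TO_RATING[p]: w for p, w in priorities.items()}
    ranked = []
    for m in models:
        if need_open and not m.open_weights:
            continue
        if need_vision and m.ratings.get("vision", 0) <= 0:
            continue
        if m.context_tokens < min_context:
            continue
        task_fit = weighted_mean(m.ratings, task_weights)
        priority = weighted_mean(m.ratings, priority_weights)
        score = blend * task_fit + (1 - blend) * priority
        ranked.append((round(score, 2), m.name))
    return sorted(ranked, reverse=True)


# Hypothetical models with made-up ratings, for illustration only.
MODELS = [
    Model("frontier-x", {"intelligence": 10, "coding": 9, "reasoning": 10,
                         "tool_use": 9, "speed": 4, "cost_efficiency": 2,
                         "context": 8, "vision": 8}, False, 200_000),
    Model("small-fast", {"intelligence": 6, "coding": 5, "reasoning": 5,
                         "tool_use": 6, "speed": 10, "cost_efficiency": 10,
                         "context": 5, "vision": 0}, False, 32_000),
    Model("open-mid", {"intelligence": 7, "coding": 7, "reasoning": 7,
                       "tool_use": 6, "speed": 7, "cost_efficiency": 8,
                       "context": 6, "vision": 0}, True, 128_000),
]

# Assumed task profiles: which capabilities count, and how much.
CODING_TASK = {"coding": 3, "reasoning": 2, "tool_use": 1}
CLASSIFY_TASK = {"speed": 3, "cost_efficiency": 3, "intelligence": 1}

coding = rank_models(MODELS, CODING_TASK,
                     {"quality": 3, "cost": 1, "speed": 1, "context": 1})
classify = rank_models(MODELS, CLASSIFY_TASK,
                       {"quality": 1, "cost": 3, "speed": 3, "context": 0})
print(coding)
print(classify)
```

With these made-up ratings the coding task puts the frontier model first, while the classification task puts the small, fast model first. Passing `need_open=True` leaves only the open-weight model in the ranking, because filtering happens before any scoring.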
Frequently Asked Questions (FAQ)
Start from the task, then weigh your priorities. A coding agent values reasoning and tool use, a high-volume extractor values speed and low cost, and a long-document assistant values context window. This picker scores each model against your task and priority weights to rank them.
The most capable model is not always the right choice. Frontier models lead on the hardest reasoning and coding, but for routine classification, summarisation or chat a cheaper, faster model often matches them at a fraction of the cost. Match the model to the task rather than always reaching for the top tier.
Open-weight models suit data residency, privacy, on-premise deployment and cost control, since you can self-host and avoid per-token fees at scale. Filter to open weights here when control matters; the trade-off is the operational work of hosting and scaling them yourself.
The ratings are an editable, illustrative snapshot, not live benchmark data, because the model landscape shifts almost weekly. Treat the ranking as a starting framework and confirm against current leaderboards and, most importantly, your own task-specific evaluations.
Context window size matters when you feed long documents, large codebases or extensive retrieved context into a single call. For short prompts it is largely irrelevant. Raise the context priority or set a minimum window filter only when your task genuinely needs to hold a lot of input at once.
You should still test the shortlisted models yourself. A picker narrows the field, but the only reliable test is your own evaluation set on your real prompts. Shortlist two or three models here, then run them against representative tasks and measure quality, latency and cost before committing in production.
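A small harness for that kind of check might look like the sketch below. It assumes a hypothetical `call_model` callable that takes a prompt and returns `(answer, tokens_used)`; the price per 1,000 tokens, the exact-match grader and the sample cases are placeholders, not real figures or a recommended metric.

```python
import time
from statistics import mean


def evaluate(call_model, cases, usd_per_1k_tokens):
    """Run one candidate over (prompt, expected) pairs and summarise it.

    `call_model` is a hypothetical callable: prompt -> (answer, tokens_used).
    """
    correct, latencies, tokens = 0, [], 0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer, used = call_model(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += used
        # Exact match is a stand-in; most real tasks need a task-specific grader.
        correct += int(answer.strip().lower() == expected.strip().lower())
    return {
        "accuracy": correct / len(cases),
        "mean_latency_s": mean(latencies),
        "cost_usd": tokens / 1000 * usd_per_1k_tokens,
    }


def fake_model(prompt):
    # Stand-in for a real API call: labels anything mentioning "refund" as billing.
    label = "billing" if "refund" in prompt.lower() else "other"
    return label, 50


# Made-up evaluation cases; replace with your own representative prompts.
CASES = [
    ("I want a refund", "billing"),
    ("App crashes on login", "technical"),
    ("Refund status?", "billing"),
]

report = evaluate(fake_model, CASES, usd_per_1k_tokens=0.002)
print(report)
```

Running the same `CASES` through each shortlisted model gives comparable accuracy, latency and cost figures side by side.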
Treat near-ties as a shortlist rather than a single winner. Break the tie on factors the score cannot fully capture, such as provider reliability, region availability, rate limits, ecosystem and tooling fit, then validate the finalists on your own evaluation.
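One way to make "near-tie" concrete is to keep every model whose score falls within a small margin of the top score. The sketch below assumes the ranking is a list of `(score, name)` pairs; the margin of 0.25 and the sample scores are arbitrary illustrations.

```python
def shortlist(ranked, margin=0.25):
    """Return every (score, name) pair within `margin` of the top score."""
    if not ranked:
        return []
    ordered = sorted(ranked, reverse=True)
    top_score = ordered[0][0]
    return [(s, n) for s, n in ordered if top_score - s <= margin]


# Hypothetical scores: the first two are effectively tied.
finalists = shortlist([(8.1, "model-a"), (8.0, "model-b"), (6.5, "model-c")])
print(finalists)
```

The finalists can then be compared on the factors the score does not capture before being validated on your own evaluation.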
Often the best design routes routine steps to a small, cheap, fast model and hard reasoning to a frontier model. Run the picker once per task type in your system, then route accordingly to balance quality against cost across the whole workload.
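A routing table of this kind can be very small. In the sketch below the step types, model names and per-1,000-token prices are all hypothetical, and sending unknown step types to the stronger model is an assumed, conservative default rather than a rule.

```python
# Hypothetical mapping from step type to the model chosen for it.
ROUTES = {
    "classify": "small-fast",
    "extract": "small-fast",
    "plan": "frontier-x",
    "debug": "frontier-x",
}
DEFAULT_MODEL = "frontier-x"  # assumed fallback for unrecognised step types

# Made-up prices in USD per 1,000 tokens, for illustration only.
USD_PER_1K = {"small-fast": 0.0005, "frontier-x": 0.01}


def route(step_type):
    """Pick the model for one step, falling back to the default."""
    return ROUTES.get(step_type, DEFAULT_MODEL)


def workload_cost(steps, choose=route):
    """Sum the cost of (step_type, tokens) pairs under a routing function."""
    return sum(tokens / 1000 * USD_PER_1K[choose(kind)] for kind, tokens in steps)


steps = [("classify", 2000), ("extract", 3000), ("plan", 1000)]
routed = workload_cost(steps)
all_frontier = workload_cost(steps, choose=lambda kind: "frontier-x")
print(f"routed: ${routed:.4f}, all frontier: ${all_frontier:.4f}")
```

With these placeholder prices the routed workload costs a fraction of sending every step to the frontier model; the real saving depends on your own prices and on how many steps are truly routine.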
The picker blends a task-fit score, drawn from the capabilities your chosen task emphasises, with a priority score from your quality, cost, speed and context weights, after removing any models that fail your hard filters. The combined score drives the ranking.
Your task, priorities and filters are saved only in your browser using local storage so the picker remembers them next time. Nothing is sent to a server, and the reset button restores the defaults and clears your saved selections.