Public · Methodology v1.2

How Apps Almanac scores work

Every score in Apps Almanac is a deterministic composite of measurable signals from public Shopify App Store listings. No LLM is in the loop, so the same inputs always produce the same score. This page explains, in plain English and at a high level, what each score measures and which signals drive it.

Data comes from public listing information only; Apps Almanac does not use the Shopify API and is not affiliated with, endorsed by, or sponsored by Shopify Inc. See /data-sources.

Published 2026-05-24 · methodology v1.2 (last reviewed 2026-05-27).

See the method applied to live data: top categories by opportunity, meta-category leaderboard, and the weekly State of the Shopify App Market.

See also: methodology changelog · errata · data freshness · roadmap.

Proprietary methodology. The scoring system described below (the choice of signals and how they are weighted, normalized, and combined) is the proprietary work of Apps Almanac and is protected as our intellectual property. We describe it here directionally so you can trust and interpret the output. This is an explanation, not a recipe: it does not license, and is not intended to enable, reproduction of the engine. Reproducing, reverse-engineering, or republishing the methodology to build a competing scoring system is prohibited, and we will pursue all available legal remedies against misappropriation.

How to read the scores

Here's what each label on screen means for your view.

Opportunity · Vulnerability (low → high)

0.0-0.5: Negligible
0.5-1.5: Very low
1.5-2.5: Low
2.5-3.2: Below avg
3.2-4.0: Moderate
4.0-4.8: Notable
4.8-5.6: Above avg
5.6-6.5: High
6.5-7.5: Very high
7.5-10: Top
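
The band table above can be read as a simple lookup. The following is a minimal sketch, not the engine's own code: the function name `band_label` is hypothetical, and the handling of a score that lands exactly on a boundary (placed in the higher band here) is an assumption, since the table does not say which side wins.

```python
from bisect import bisect_right

# Upper edges and labels copied from the band table above.
BAND_EDGES = [0.5, 1.5, 2.5, 3.2, 4.0, 4.8, 5.6, 6.5, 7.5]
BAND_LABELS = [
    "Negligible", "Very low", "Low", "Below avg", "Moderate",
    "Notable", "Above avg", "High", "Very high", "Top",
]


def band_label(score: float) -> str:
    """Return the band label for a score on the 0-10 scale.

    Assumption: a score exactly on a boundary (for example 4.0) falls in
    the higher band. The published table does not specify this.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0 and 10")
    # bisect_right counts how many upper edges are <= score,
    # which is the index of the band the score belongs to.
    return BAND_LABELS[bisect_right(BAND_EDGES, score)]


print(band_label(3.5))  # Moderate
print(band_label(7.9))  # Top
```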

Fit - how safe an app is to rely on

Solid: Well-rated, stable, and responsively supported - a dependable option.
Reliable: Dependable, with only minor caveats.
Caution: Some risk signals: read the complaints before relying on it.
Risk: Active decay, neglect, or a complaint surge: approach carefully.
Unrated: Too few reviews to judge reliably yet.

Vulnerability (per app)

How exposed an app is: pain, neglect, and decay in its reviews. The same signal reads differently depending on who's looking: a builder sees a target to out-build, the app's owner sees a fix-list, and a merchant sees a reliability risk (shown as a fit band, not a vulnerability score). The score is merchant-pain-led: the three pain-signal components together carry the majority of the score. Apps with very few reviews (under ≈50) are hard-floored to 0 so we don't surface unproven micro-apps.

Platform-owned apps (those published by Shopify, Meta, Microsoft, Google and similar large platforms) are excluded from vulnerability leaderboards. This exclusion is specific to the disruption/vulnerability surface. Platform-owned apps are included in fit scoring and merchant shortlists because a merchant choosing tools cares about whether an app works, not who built it.

The components that feed the score are listed below in order of weight. The realized influence on any one app depends on its own signals: the pain components dominate at the top of the board, while the structural anchors (free plan, proven demand) carry more weight for apps without a strong rating signal.

Vulnerability is an absolute score, not normalized against the rest of the corpus. Each component is measured against a fixed scale (for example, how far a rating has decayed, or how stale a listing is) rather than ranked against other apps, so one app's vulnerability never moves just because a different app changed. This is the opposite of the per-category Opportunity score below, which is normalized across all real categories. Don't assume both engines normalize the same way.
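
To make the absolute-versus-normalized distinction concrete, here is a small sketch under stated assumptions. Both function names are hypothetical, `full_scale_drop` (the drop in stars that maps to the top of the scale) is an invented parameter, and a percentile rank stands in for a generic corpus-relative measure. Neither function is the engine's actual formula.

```python
def decay_on_fixed_scale(all_time_avg: float, recent_avg: float,
                         full_scale_drop: float = 1.0) -> float:
    """Score rating decay on a fixed 0..1 scale.

    full_scale_drop is a hypothetical constant: the drop, in stars,
    that maps to 1.0. Only this app's own ratings are used.
    """
    drop = max(0.0, all_time_avg - recent_avg)
    return min(1.0, drop / full_scale_drop)


def percentile_rank(value: float, corpus: list[float]) -> float:
    """Corpus-relative alternative: share of corpus values <= value."""
    return sum(v <= value for v in corpus) / len(corpus)


# App A's rating fell from 4.5 to 4.0.
fixed = decay_on_fixed_scale(4.5, 4.0)

# Rating drops across a three-app corpus, before and after a
# different app decays further. App A's own drop (0.5) is unchanged.
rank_before = percentile_rank(0.5, [0.5, 0.2, 0.1])
rank_after = percentile_rank(0.5, [0.5, 0.9, 0.1])

print(fixed)                    # same value in both situations
print(rank_before, rank_after)  # the relative rank moved
```

The fixed-scale value depends only on App A's own ratings, while the rank shifts when another app changes, which is the behavior the paragraph above rules out for Vulnerability.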

Rating decay: how far the recent rating has fallen below the all-time average. The single most influential component.
Negative-review velocity: the recent rate of incoming 1-2★ reviews. Captures an accelerating complaint stream.
Low recent rating: a depressed rolling rating over recent months. Catches sustained dissatisfaction even when the all-time average hasn't fully caught up yet.
Support neglect: the share of older negative reviews left without a developer reply.
Proven demand: whether the app sits in a healthy review-volume band. Anchors the score so we don't surface zero-revenue apps.
Staleness: how long it's been since the listing was last updated.
Free plan: presence of a free tier slightly raises vulnerability (lower switching cost for a new entrant).
Rating polarization: how divided the user base is. An app whose reviews split between many 5★ and many 1★ (rather than clustering around its average) has a polarized base, which is a clearer opening for a challenger than a uniformly mediocre one with the same headline rating.
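
Two of the components above, plus the small-app floor, can be sketched from raw review records. This is an illustration only. The `Review` fields, the 30-day cutoff for what counts as an "older" review, and the use of a scaled standard deviation as a polarization proxy are all assumptions, and no weights are shown because they are not published.

```python
from dataclasses import dataclass
from statistics import pstdev

MIN_REVIEWS = 50       # approximate floor stated in the text
REPLY_GRACE_DAYS = 30  # hypothetical cutoff for an "older" review


@dataclass
class Review:
    stars: int       # 1..5
    age_days: int    # days since the review was posted
    has_reply: bool  # True if the developer replied


def support_neglect(reviews: list[Review]) -> float:
    """Share of older 1-2 star reviews with no developer reply."""
    older_negative = [r for r in reviews
                      if r.stars <= 2 and r.age_days >= REPLY_GRACE_DAYS]
    if not older_negative:
        return 0.0
    unanswered = sum(not r.has_reply for r in older_negative)
    return unanswered / len(older_negative)


def polarization(stars: list[int]) -> float:
    """Spread of star ratings, scaled to 0..1.

    Assumed proxy: population standard deviation divided by 2.0, the
    largest value possible on a 1-5 scale (half at 1 star, half at 5).
    """
    if not stars:
        return 0.0
    return pstdev(stars) / 2.0


def apply_review_floor(score: float, review_count: int) -> float:
    """Hard-floor the score to 0 for apps below the review minimum."""
    return 0.0 if review_count < MIN_REVIEWS else score
```

For example, `[3, 3, 3, 3]` and `[1, 1, 5, 5]` both average 3★, but the first has no spread while the second sits at the maximum, which matches the "uniformly mediocre" versus "polarized" contrast in the last list item.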

Opportunity (per category)

How much room there is to win in a category. Six components:

Complaint ratio: the share of category reviews rated 1-2★. A high ratio means users are actively unhappy. The most heavily weighted component.
Review volume: total review activity in the category; high volume signals proven demand.
Developer neglect: the share of negative reviews left without a developer reply, aggregated across the category.
Price ceiling: the typical paid-plan price in the category; a higher ceiling means more revenue headroom for a challenger.
Competition density: demand per competitor; lots of demand spread across few apps points to more whitespace.
Average rating (inverted): a lower category-average rating means users are hurting.

Three intent presets ("gap", "beat", "validate") emphasize these components differently to expose alternative perspectives on the same data. Components are robustly normalized across all real categories before they're combined, so one extreme category can't flatten the rest of the scale.
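
The paragraph above names two ideas: robust normalization and preset-specific weighting. The sketch below illustrates both under loud assumptions. The page does not say which robust method is used, so median/IQR scaling with clipping is a stand-in. The weights in `EXAMPLE_PRESETS` are invented numbers that only mirror the direction of emphasis described in the glossary, and all function and key names are hypothetical.

```python
from statistics import quantiles


def robust_unit_scale(values: list[float], clip: float = 2.0) -> list[float]:
    """Scale values to 0..1 using the median and interquartile range.

    Assumption: median/IQR scaling with clipping stands in for the
    unspecified robust normalization. Needs at least two values.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid dividing by zero when values tie
    scaled = []
    for v in values:
        z = max(-clip, min(clip, (v - med) / iqr))  # outliers stop at the clip
        scaled.append((z + clip) / (2 * clip))
    return scaled


def blend(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of already-normalized components."""
    total = sum(weights.values())
    return sum(weights[name] * components[name] for name in weights) / total


# Hypothetical weights for illustration only; the real ones are not published.
EXAMPLE_PRESETS = {
    "gap": {"app_count_gap": 3, "pain_density": 3, "rating_gap": 1,
            "review_volume": 1, "platform_saturation_inv": 1},
    "beat": {"app_count_gap": 1, "pain_density": 3, "rating_gap": 3,
             "review_volume": 1, "platform_saturation_inv": 1},
    "validate": {"app_count_gap": 1, "pain_density": 1, "rating_gap": 1,
                 "review_volume": 3, "platform_saturation_inv": 3},
}

print(robust_unit_scale([1, 2, 3, 4, 100]))
# [0.25, 0.375, 0.5, 0.625, 1.0]
```

With plain min-max scaling the first four values would be squeezed below 0.04 by the outlier at 100; here they keep a usable spread, which is the property the text describes.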

Shape (per meta-category)

A bucket-level score over the 38 hand-curated meta-categories. Five components, in order of influence:

Mean vulnerability: the average vulnerability of apps in the meta-category.
Pain density: the share of apps showing clear merchant-pain evidence.
Price floor: the typical entry-level paid plan across the meta-category; a higher floor means a stronger monetization precedent.
Demand: total review activity in the meta-category.
Low density: fewer apps competing in the bucket scores higher.
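
The first two Shape components are plain aggregates over the apps in a bucket. A minimal sketch, assuming a hypothetical `AppSummary` record that already carries each app's vulnerability score and pain-evidence flag:

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class AppSummary:
    vulnerability: float     # per-app vulnerability score
    has_pain_evidence: bool  # True if the app shows clear merchant pain


def mean_vulnerability(apps: list[AppSummary]) -> float:
    """Average vulnerability of the apps in one meta-category."""
    if not apps:
        raise ValueError("meta-category has no apps")
    return fmean([a.vulnerability for a in apps])


def pain_density(apps: list[AppSummary]) -> float:
    """Share of apps in the meta-category with pain evidence."""
    if not apps:
        raise ValueError("meta-category has no apps")
    return sum(a.has_pain_evidence for a in apps) / len(apps)
```

How these aggregates are normalized and weighted into the final Shape score is not described on this page, so the sketch stops at the raw values.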

Confidence and sample size

We always show the sample size behind a score (apps, reviews). When the sample is small (fewer than ≈30 reviews behind a category), we surface a low confidence chip so the score is read with appropriate caution.
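
The low-confidence rule above amounts to a threshold check on the review count. A minimal sketch: the function names and the caption layout are hypothetical, and the threshold is the approximate figure given in the text.

```python
from typing import Optional

LOW_CONFIDENCE_REVIEWS = 30  # approximate threshold from the text


def confidence_chip(review_count: int) -> Optional[str]:
    """Return the chip text for a small sample, or None otherwise."""
    if review_count < 0:
        raise ValueError("review_count cannot be negative")
    if review_count < LOW_CONFIDENCE_REVIEWS:
        return "low confidence"
    return None


def score_caption(score: float, app_count: int, review_count: int) -> str:
    """Format a score with its sample size (layout is an assumption)."""
    caption = f"{score:.1f} ({app_count} apps, {review_count} reviews)"
    chip = confidence_chip(review_count)
    return f"{caption} [{chip}]" if chip else caption


print(score_caption(5.3, 4, 12))
# 5.3 (4 apps, 12 reviews) [low confidence]
```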

Data refreshes weekly. See the freshness page for refresh cadence and what's in scope.

Glossary

Opportunity: Composite 0..1, how much room there is to build a winner here. Weighted blend of six components: app count gap, rating gap, review-volume gap, pain density, price ceiling, and platform-saturation inversion.
Vulnerability: Composite 0..1, how vulnerable an incumbent app is (a challenger's target, an incumbent's fix-list). Higher means rating decay, neg-review velocity, support neglect, staleness, or pricing thrash. Apps with <50 reviews are hard-floored to 0.
Shape: Meta-category bucket score, how compelling the bucket is overall. 5 components: pain density, app count gap, review concentration, price ceiling, and platform-saturation inversion.
Fit: How safe an app is to rely on, shown as a band (Solid / Reliable / Caution / Risk). A blend of rating health, stability, support responsiveness, active maintenance, and momentum. Apps with too few reviews to judge read 'Unrated'. Shown banded, never as a raw number.
Preset scores: Three flavors of opportunity, namely gap (whitespace), beat (weak incumbents), and validate (proven demand). Same components, different weights. We surface only the active preset by default; advanced mode shows all three.
Pain density: Share of reviews containing complaint keywords. Counts hits against the curated complaint phrase set, divided by total review count for the category.
Velocity trend: Δ in 30-day review pace vs the 180-day baseline. Positive → app is heating. Negative → cooling. Hard-floored for apps with <50 total reviews.
App complexity: S / A / B / C / D / E, how complex it is to ship a credible competitor. It's never just a UI tweak: even a simple app needs data and logic. S: light UI + config. A: simple CRUD + a database. B: significant logic. C: hard integrations. D: heavy ML / data. E: requires Shopify-side platform changes.
Pain evidence: True if at least 5% of reviews trigger our complaint regex.
Platform developer: First-party Shopify-owned (or partner) app. Excluded from vulnerability leaderboards.
Reply lag (days): Median days from review-posted to developer-reply-posted.
Support debt ratio: Share of recent reviews still unanswered. Higher = more neglect.
Abandoned: ≥180 days since lastmod or last developer reply.
Gap intent: Maximize whitespace, with heavy weight on app count gap and pain density.
Beat intent: Maximize weak-incumbent share, with heavy weight on rating gap and pain density.
Validate intent: Maximize proven demand, with heavy weight on review-volume and platform-saturation inversion.
Delisting: An app that was listed at an earlier point but is no longer live in the store. Counted by comparing the set of live listings between two points in the record. Apps leave for many reasons: voluntary removal, acquisition, or store policy.
Freemium: An app with a free tier plus paid upgrade plans, as presented on its listing. Distinct from fully free (no paid plan) and paid (no free tier). Based on how pricing is shown on the public listing, not on actual revenue.
Saturation: How crowded a category is, with many apps competing for the same need. Read off the rise-and-saturation chart: a steeply climbing curve is still expanding; a flattening one is filling up, so a new entrant faces more competition.
Review velocity: How fast a category is accumulating new reviews lately, a proxy for merchant attention. Rising velocity means merchants are actively adopting and reviewing apps in that category right now (heating); falling velocity means attention is cooling.
Average rating: Mean of the public star ratings across all apps in scope, at each point in time. Aggregate only: a single store-wide or category-wide figure, never tied to a named app on this public page.
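
Several glossary entries above are defined precisely enough to sketch. The following is an illustration under stated assumptions. `COMPLAINT_RE` is an invented stand-in for the curated complaint pattern. The velocity trend is computed as a difference in reviews per day, although the glossary does not say whether Δ is a difference or a ratio. `is_abandoned` takes a single last-activity date because the glossary's "lastmod or last developer reply" leaves open how the two dates combine.

```python
import re
from datetime import date
from statistics import median
from typing import Optional

# Hypothetical pattern; the real curated phrase set is not published.
COMPLAINT_RE = re.compile(r"\b(broken|refund|no response)\b", re.IGNORECASE)
MIN_REVIEWS = 50  # floor stated in the glossary


def pain_evidence(review_texts: list[str], threshold: float = 0.05) -> bool:
    """True if at least `threshold` of reviews match the complaint pattern."""
    if not review_texts:
        return False
    hits = sum(bool(COMPLAINT_RE.search(t)) for t in review_texts)
    return hits / len(review_texts) >= threshold


def reply_lag_days(pairs: list[tuple[date, date]]) -> Optional[float]:
    """Median days from review posted to developer reply posted."""
    if not pairs:
        return None
    return median((replied - posted).days for posted, replied in pairs)


def velocity_trend(last_30: int, last_180: int, total_reviews: int) -> float:
    """Reviews per day over 30 days minus the 180-day baseline pace.

    Assumption: Δ is a difference. Floored to 0 for small apps.
    """
    if total_reviews < MIN_REVIEWS:
        return 0.0
    return last_30 / 30 - last_180 / 180


def is_abandoned(last_activity: date, today: date,
                 limit_days: int = 180) -> bool:
    """True if the last recorded activity is at least limit_days old."""
    return (today - last_activity).days >= limit_days
```

A positive `velocity_trend` corresponds to "heating" and a negative one to "cooling" in the glossary's terms.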

What we don't do

We don't run an LLM inside the scoring engine. Briefings are deterministic text you paste into your own AI assistant.
We don't surface a "helpful" count for reviews: the public App Store doesn't expose one we can read reliably.
We don't show country names or other noise as if they were product categories: those are filtered out before anything is scored or displayed.