AI crawler robots.txt directives
#16 · Variable · Web Quality · weighted · AI Readiness · weight 1.3% · impl implemented · method v1.2.0
Web Quality factor
This factor is part of Web Quality — the weighted 0..100 score that sits above Web Standards. Its weight depends on what kind of site is being measured. Web Standards items take priority; this factor only enters the score once Web Standards passes.
- Base weight: 0.4, applied to every site type unless overridden below
- Why this weight: having explicit AI-crawler directives (allow OR disallow) is the citizenship signal; the site has thought about it.
Per-site-type overrides
| Site type | Weight | Δ vs base |
|---|---|---|
| Blog | 0.3 | -0.1 |
| Corporate / B2B | 0.5 | +0.1 |
| News / Publisher | 1.0 | +0.6 |
| Personal site | 0.2 | -0.2 |
| SaaS / Product | 0.6 | +0.2 |
| Media / Streaming | 0.9 | +0.5 |
Site types not listed inherit the base weight.
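As a sketch only, the lookup above might resolve in code like this; the site-type keys and function name are illustrative assumptions, not the connector's actual API:

```ts
// Illustrative weight resolution for this factor.
// Keys mirror the override table above; names are assumed, not real API.
const BASE_WEIGHT = 0.4;

const WEIGHT_OVERRIDES: Record<string, number> = {
  blog: 0.3,
  corporate_b2b: 0.5,
  news_publisher: 1.0,
  personal: 0.2,
  saas_product: 0.6,
  media_streaming: 0.9,
};

// Site types not listed inherit the base weight.
function factorWeight(siteType: string): number {
  return WEIGHT_OVERRIDES[siteType] ?? BASE_WEIGHT;
}
```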
What this means for your business
Your site can quietly tell ChatGPT, Claude, and Google's AI to stay out — or to come in. If you're blocking them by accident, you're invisible when customers ask AI for a recommendation in your category.
Plain title: Whether you're letting AI assistants read your site
What we measure
Major AI crawlers respect robots.txt. Blocking them entirely hides you from AI answers; leaving them unaddressed is acceptable but means you have no explicit policy. Sites that explicitly ALLOW AI crawlers are more discoverable in AI search.
How to improve your score
Decide your policy. To be discoverable in AI search, don't block (or explicitly allow) `GPTBot`, `ClaudeBot`, `PerplexityBot`, `Google-Extended`, and `CCBot`. To opt out, add a `User-agent` group with `Disallow: /` for each crawler, as in the sketch below.
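A minimal robots.txt sketch of both policies; either form counts as an explicit directive for this factor, and the remaining crawler tokens follow the same pattern:

```
# Opt out: block an AI crawler explicitly
User-agent: GPTBot
Disallow: /

# Opt in: explicitly allow another
User-agent: ClaudeBot
Allow: /
```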
Implementation
stale · v1 · seeded — no connector publish yet · source: freshcoat-discovery/src/connectors/legacy-audit.ts:scoreAiCrawlerDirectives
Detection method
Reads `robots_ai_blocked_count` from the audit endpoint's robots.txt parse. The rubric is INVERTED relative to a naive "blocking is bad" reading: any explicit AI-crawler directive (allow OR disallow) is the citizenship signal, because the operator has thought about it. No explicit directives = warn. A sketch of the count follows the source list below.
Detection sources
- Audit endpoint robots.txt parser
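A minimal sketch of that count, assuming the parser receives raw robots.txt text; the token list and helper name are illustrative, and the real logic lives in `scoreAiCrawlerDirectives`:

```ts
// Count User-agent groups that name a known AI crawler.
// Token list and function name are assumptions, not the connector's code.
const AI_CRAWLER_TOKENS = [
  "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot",
];

function countExplicitAiDirectives(robotsTxt: string): number {
  const seen = new Set<string>();
  for (const line of robotsTxt.split(/\r?\n/)) {
    const match = line.match(/^\s*user-agent\s*:\s*(\S+)/i);
    if (!match) continue;
    const token = match[1].toLowerCase();
    if (AI_CRAWLER_TOKENS.some((t) => t.toLowerCase() === token)) {
      seen.add(token); // an allow group counts the same as a disallow group
    }
  }
  return seen.size;
}
```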
Scoring bands · soft ladder
| Score | Condition |
|---|---|
| 100 | ≥1 explicit AI-crawler directive, allow or disallow (GPTBot, ClaudeBot, PerplexityBot, etc.) |
| 80 | No explicit AI directives (the silent majority — implicit allow) |
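Mapped to this ladder, the band logic reduces to roughly the sketch below; the return shape is an assumption, and the notes strings follow the evidence-key dictionary in the next section:

```ts
// Soft-ladder mapping from explicit-directive count to band score.
// Result shape is assumed; notes strings match the evidence keys below.
function bandForDirectiveCount(count: number): { score: number; notes: string } {
  if (count >= 1) {
    return { score: 100, notes: `${count}_explicit_ai_directives` };
  }
  return { score: 80, notes: "no_explicit_ai_directives" };
}
```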
Evidence-key dictionary
What each notes string the connector emits means; these surface in the per-domain dossier's evidence column.
- `no_explicit_ai_directives`: robots.txt has no specific `User-agent` rules for known AI crawlers.
- `N_explicit_ai_directives`: robots.txt explicitly handles N AI crawlers (allow or disallow).
Applicability
Variable tier. Editorial choice — blocking GPTBot is a legitimate IP stance (NYT, WSJ, Reuters do it). The signal is 'has thought about it,' not 'permits everyone'.
Changelog
- 2026-04-29 · seed · Initial seed from MethodologyRegistry bootstrap.
Facts
When this applies
AI crawler directives live inside robots.txt, so this factor only applies when the platform lets site owners edit that file.
- Marked n/a when the detected platform doesn't support `canControlRobotsTxt` (hosted builders that don't expose robots.txt editing), as sketched below.
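A hedged sketch of that gate, assuming a capability flag shaped like the `canControlRobotsTxt` mentioned above:

```ts
// Applicability gate: skip the factor when the detected platform
// doesn't let site owners edit robots.txt. The shape is assumed.
interface PlatformCapabilities {
  canControlRobotsTxt: boolean;
}

function isApplicable(platform: PlatformCapabilities): boolean {
  return platform.canControlRobotsTxt; // false → factor marked n/a
}
```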
Scoring
Scoring formulas are versioned with the methodology. The current method (v1.2.0) maps raw measurements to pass, warn, fail. Factor weights determine how much each contributes to the composite — see the methodology index for the full table.
Cited by these standards
Standards in the Standards Library whose satisfiedBy requirement tree references this factor. Each link goes to the standard's full entry — methodology, scope, and the other factors it relies on.
Version history
| Version | Change | Date |
|---|---|---|
| v1.2.0 | Factor introduced. Status: live. Scoring impl: implemented. | 2026-04-25 |