This is a static-scan check (robots.txt + llms.txt + schema.org + headers). Live engine probes across ChatGPT, Claude, Gemini, and Perplexity arrive in a future build — currently in queue. Real visibility lives in category and comparison queries, which we measure with a 100-prompt stratified set on the Audit tier.
Same scan, free, no signup. Results in ~5 seconds at your own permanent canaifind.com/r/{slug} URL.
AI crawler robots.txt audit
§1 of 4

| Crawler | Purpose | robots.txt status |
|---|---|---|
| GPTBot | Training crawler for future OpenAI models. | ? Unknown (fetch blocked) |
| OAI-SearchBot | ChatGPT Search index. Disallowing makes you invisible to ChatGPT Search. | ? Unknown (fetch blocked) |
| ChatGPT-User | User-initiated retrieval. Ignores robots.txt by design. | — Ignores robots.txt |
| ClaudeBot | Training crawler for Anthropic models. | ? Unknown (fetch blocked) |
| Claude-User | Retrieves pages when a Claude user asks about them. Respects robots.txt (unlike OpenAI's ChatGPT-User). | ? Unknown (fetch blocked) |
| Claude-SearchBot | Search index for Claude. Disallowing reduces Claude search quality. | ? Unknown (fetch blocked) |
| claude-code | Claude Code CLI / IDE retrieval. Documentation-targeted. | ? Unknown (fetch blocked) |
| PerplexityBot | Perplexity indexing. Disallowing removes you from Perplexity retrieval. | ? Unknown (fetch blocked) |
| Perplexity-User | User-initiated retrieval. Ignores robots.txt by design. | — Ignores robots.txt |
| Google-Extended | Training opt-out for Gemini / Bard. Disallowing opts you out of Google AI training. | ? Unknown (fetch blocked) |
| GoogleOther | Catch-all for non-Search Google crawlers. | ? Unknown (fetch blocked) |
| Meta-ExternalAgent | Meta AI crawler. Disallowing opts you out of Meta AI training/retrieval. | ? Unknown (fetch blocked) |
| Applebot-Extended | Apple Intelligence training opt-out (separate from Applebot Search). | ? Unknown (fetch blocked) |
| Bytespider | ByteDance / TikTok AI crawler. | ? Unknown (fetch blocked) |
| CCBot | Common Crawl. Heavily used as a training-corpus source by every major model. | ? Unknown (fetch blocked) |
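The per-crawler check in the table above can be approximated offline with Python's standard `urllib.robotparser`. This is a minimal sketch, not the scanner's actual implementation: the robots.txt content, the `example.com` URL, and the `audit_robots` helper are all hypothetical, since the real adidas.com file could not be fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; NOT adidas.com's real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /checkout/
"""

# A subset of the crawler tokens listed in the table above.
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]


def audit_robots(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, per user-agent token, whether robots.txt allows fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return {agent: parser.can_fetch(agent, url) for agent in agents}


results = audit_robots(SAMPLE_ROBOTS_TXT, "https://example.com/", AI_CRAWLERS)
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Crawlers without their own `User-agent` group fall back to the `*` group. Agents the table marks as ignoring robots.txt (ChatGPT-User, Perplexity-User) are left out because a rule check says nothing about their behavior.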
Structured data & discovery files
§2 of 4

| Artifact | Description | Status | Note |
|---|---|---|---|
| llms.txt | A markdown index of the site's most important pages, served at /llms.txt. Anthropic Claude Desktop and Claude.ai fetch this. IDE tooling (Cursor, Claude Code, GitHub Copilot, Cline, Aider) routinely retrieves it. Google has explicitly confirmed it does NOT support it (Gary Illyes, July 2025). OpenAI is unconfirmed. | ✗ Missing | Anthropic Claude respects this; Google has confirmed it does not; OpenAI is unconfirmed. |
| llms-full.txt | Optional full-content companion to llms.txt. Useful for agents with large context windows that prefer a single fetch over crawling. Doesn't replace llms.txt; both can coexist. | ✗ Missing | Optional full-content companion file. |

| Artifact | Description | Status | Note |
|---|---|---|---|
| schema.org Organization | The brand-identity anchor LLMs use to disambiguate the site. Without it, profile links on LinkedIn, Wikidata, Crunchbase, GitHub etc. aren't bound to the homepage's entity in the AI's knowledge graph. The sameAs array is the load-bearing field. | ✗ Missing | Entity anchor for the sameAs graph. |
| schema.org FAQPage | Pages with FAQPage JSON-LD show a 2.7× citation rate vs pages without it (41% vs 15% in the Relixir 2025 study). The JSON-LD must mirror visible Q&A content on the page; Google penalises mismatch. Single highest-leverage fix in the audit. | ✗ Missing | 2.7× citation rate vs without (Relixir 2025); highest-leverage single fix. |
| schema.org Article | For journalistic/editorial pages. Declares author, datePublished, dateModified, and section to AI engines, which preferentially cite recent, dated, authored content in answer-engine results. | ✗ Missing | For editorial pages. |
| schema.org HowTo | For step-by-step procedural content. AI engines preferentially cite HowTo markup when answering procedural queries ("how do I X"). Maps directly to retrieval intent. | ✗ Missing | For tutorials. |
| schema.org SoftwareApplication | For product/app pages. Maps to vendor-evaluation queries ("best X for Y"). Effectively required for B2B SaaS visibility in AI citations: 89% of B2B buyers now use AI for vendor research (Averi 2026). | ✗ Missing | For product pages. |
| Person (author entity) | Author entity on bylines, linked to the Article entity. An E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) signal: AI engines weight content authored by named, credentialed people higher than anonymous content. | ✗ Missing | E-E-A-T signal on bylines. |
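To make the Organization and FAQPage rows above concrete, here is a small sketch that builds both JSON-LD objects and wraps them in a `<script type="application/ld+json">` tag. The helper names, the "Example Co" entity, and every URL are placeholders invented for illustration; real markup should use the site's own entity data and mirror the Q&A text actually visible on the page.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> dict:
    """Organization entity; sameAs binds external profiles to the homepage."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,
    }


def faq_jsonld(qa_pairs: list[tuple[str, str]]) -> dict:
    """FAQPage entity built from (question, answer) pairs shown on the page."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }


def as_script_tag(data: dict) -> str:
    """Serialize to a JSON-LD script tag, escaping '<' so the payload
    cannot close the surrounding <script> element early."""
    payload = json.dumps(data, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Placeholder values, not taken from the scanned site.
org = organization_jsonld(
    "Example Co",
    "https://example.com/",
    ["https://www.linkedin.com/company/example-co", "https://github.com/example-co"],
)
faq = faq_jsonld([("Do you ship internationally?", "Yes, to most countries.")])
print(as_script_tag(org))
print(as_script_tag(faq))
```

The same pattern extends to Article, HowTo, SoftwareApplication, and Person by changing `@type` and the properties supplied.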
HTTP headers
§3 of 4

| Header | Description | Value |
|---|---|---|
| X-Robots-Tag | Page-level crawler directives. The proposed `noai` / `noimageai` values request that AI crawlers skip the page (a publisher hint, honored by some and ignored by others). Not enforced; combine with explicit robots.txt rules for layered defense. | — not set |
| Cache-Control | Tells crawlers how aggressively to cache the response. Aggressive `no-store` or `private, no-cache` directives can hurt retrieval freshness signals: AI engines may distrust the page or refresh it less often. For public pages, prefer `public, max-age=300` or longer. | max-age=0, no-cache, no-store |
| Link: canonical | Canonical URL declared as an HTTP response header (RFC 8288 + RFC 6596). Processed earlier in the retrieval pipeline than the HTML `<link rel="canonical">` tag, so it works for crawlers that don't fully render HTML. | — not set |
| Content-Type | Declares the response format. `text/html; charset=utf-8` is the standard for pages. If the site supports Markdown negotiation, the same URL can serve `text/markdown` when `Accept: text/markdown` is sent. | text/html |
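The Cache-Control row above can be checked mechanically. The sketch below splits a header value into directives and reports which of the ones this report calls aggressive are present. The function names and the `AGGRESSIVE` set are assumptions made for illustration, and the parser is deliberately simplified (it does not handle commas inside quoted directive arguments).

```python
from typing import Optional

# Directives this report treats as aggressive for public pages (an assumption
# drawn from the table above, not an official classification).
AGGRESSIVE = {"no-store", "no-cache", "private"}


def parse_cache_control(value: str) -> dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument-or-None}."""
    directives: dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


def aggressive_directives(value: str) -> list[str]:
    """Return the aggressive directives found in the header, sorted by name."""
    return sorted(AGGRESSIVE & parse_cache_control(value).keys())


observed = "max-age=0, no-cache, no-store"   # value reported in the table above
suggested = "public, max-age=300"            # value the report recommends

print(aggressive_directives(observed))   # ['no-cache', 'no-store']
print(aggressive_directives(suggested))  # []
```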

| Agent-content probe | Description | Status | Note |
|---|---|---|---|
| Markdown negotiation | Emerging Cloudflare-led standard. When an agent sends `Accept: text/markdown`, the site returns a markdown rendering of the page instead of HTML. Sibling of llms.txt. Reduces token cost for downstream AI usage and lets agents consume the site without rendering. | ✗ Returns HTML | No text/markdown response when Accept: text/markdown is sent. |
| Agent-discovery Link rels | RFC 8288 Link-header relations that let agents find machine-readable surfaces of the site without scraping HTML. We look for api-catalog (RFC 9727), service-desc (OpenAPI), describedby (RDF / JSON-LD), and agent-card (emerging convention). | ✗ None | No api-catalog / service-desc / describedby / agent-card rels. |
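As an illustration of the Link-rel check described above, this sketch extracts relation types from an HTTP `Link` header and keeps the four agent-discovery rels the report looks for. The header value and URLs are invented, `link_rels` is a hypothetical helper, and the regex handles only the common `<uri>; rel="..."` shape rather than the full RFC 8288 grammar.

```python
import re

# The four relation types named in the table above.
AGENT_RELS = {"api-catalog", "service-desc", "describedby", "agent-card"}


def link_rels(header: str) -> dict[str, str]:
    """Map each rel value in a Link header to its first target URI."""
    found: dict[str, str] = {}
    # Each link-value looks like: <target>; param=value; param="value"
    for target, params in re.findall(r"<([^>]*)>([^<]*)", header):
        match = re.search(r'\brel\s*=\s*(?:"([^"]*)"|([^;,\s]+))', params, re.IGNORECASE)
        if not match:
            continue
        value = match.group(1) if match.group(1) is not None else match.group(2)
        for rel in value.split():  # rel may hold several space-separated types
            found.setdefault(rel.lower(), target)
    return found


# Hypothetical header for illustration; the scanned site returned none of these.
sample_header = (
    '</.well-known/api-catalog>; rel="api-catalog", '
    '<https://example.com/openapi.json>; rel="service-desc"; type="application/json", '
    '<https://example.com/>; rel="canonical"'
)

discovered = {rel: uri for rel, uri in link_rels(sample_header).items() if rel in AGENT_RELS}
print(discovered)
```

The same parse also surfaces `rel="canonical"`, so one pass over the header can cover both the canonical check and the agent-discovery check.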
Top findings
§4 of 4

- 1 (Med) robots.txt is behind bot protection.
  adidas.com/robots.txt returned 401/403 from Akamai bot mitigation. AI retrieval crawlers that do not execute JavaScript may be blocked from this file too, in which case they cannot read the crawler rules. That we cannot see robots.txt is itself a signal worth knowing.
- 2 (Med) Could not fetch the homepage.
  We tried https://adidas.com/ and got HTTP 403 from Akamai bot mitigation. AI retrieval crawlers that don't execute JavaScript (OAI-SearchBot, Claude-SearchBot, PerplexityBot, GoogleOther) likely face the same block. That adidas.com's homepage is invisible to non-browser clients is itself a finding worth knowing.
- 3 (Med) Cache-Control may prevent retrieval-layer caching.
  Aggressive `no-store` / `private, no-cache` directives signal to retrieval crawlers that the response should not be cached or reused. For public pages you want cited, prefer `Cache-Control: public, max-age=300` or similar.
- 4 (Tip) No Link: rel="canonical" HTTP header.
  Most CMSs handle canonicalization via `<link rel="canonical">` in HTML. Adding the HTTP header version as well covers retrieval crawlers that don't fully render HTML. Optional.
- 5 (Tip) Could not probe Markdown negotiation.
  The homepage returned 401/403 to our `Accept: text/markdown` probe, typically a sign of bot protection. We cannot determine whether the site supports Markdown for Agents.
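The Markdown negotiation probe in finding 5 can be sketched as a request carrying `Accept: text/markdown` plus a classifier for the response. This is an assumed reconstruction, not the scanner's code: `build_probe_request`, `classify_probe`, and the three result labels are hypothetical names, and the request is only constructed here, never sent.

```python
from urllib.request import Request


def build_probe_request(url: str) -> Request:
    """Build (but do not send) a GET request that asks for a Markdown rendering."""
    return Request(url, headers={"Accept": "text/markdown"}, method="GET")


def classify_probe(status: int, content_type: str) -> str:
    """Classify a probe response from its status code and Content-Type header."""
    if status in (401, 403):
        return "blocked"  # bot protection: support cannot be determined
    media_type = content_type.split(";")[0].strip().lower()
    if status == 200 and media_type == "text/markdown":
        return "markdown"
    return "not-markdown"


request = build_probe_request("https://example.com/")  # placeholder URL
print(request.get_header("Accept"))

print(classify_probe(403, "text/html"))                     # outcome reported in finding 5
print(classify_probe(200, "text/markdown; charset=utf-8"))  # site negotiates Markdown
print(classify_probe(200, "text/html; charset=utf-8"))      # site returns HTML
```

Treating 401/403 as its own outcome keeps a blocked probe from being misreported as lack of Markdown support.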
This report has a permanent URL: canaifind.com/r/y2N8iDqB.