Report · instagram.com · checked 2026-05-21 23:17 UTC · methodology v0.1 (preview) · canaifind.com/r/qJiAxqMQ
No. AI engines are likely blocked from indexing instagram.com.

This is a static-scan check (robots.txt + llms.txt + schema.org + headers). Live engine probes across ChatGPT, Claude, Gemini, and Perplexity arrive in a future build — currently in queue. Real visibility lives in category and comparison queries, which we measure with a 100-prompt stratified set on the Audit tier.


AI crawler robots.txt audit

§1 of 4
⚠ Conflict detected
instagram.com disallows GPTBot, and ChatGPT search is also disallowed via fall-through.

GPTBot is OpenAI's *training* crawler — disallowing it opts the site out of future training data. ChatGPT's *live retrieval* runs through OAI-SearchBot (which instagram.com has also disallowed via the fall-through `User-agent: *` rule) and ChatGPT-User (which ignores robots.txt by design). To keep ChatGPT able to cite instagram.com, add an explicit `User-agent: OAI-SearchBot` block with `Allow: /`.

⚠ Conflict detected
instagram.com disallows ClaudeBot, and Claude search is also disallowed via fall-through.

ClaudeBot is Anthropic's training crawler — disallowing it opts the site out of training. Claude's live retrieval uses Claude-User and Claude-SearchBot. instagram.com has disallowed Claude-SearchBot via the fall-through `User-agent: *` rule, removing the content from Claude's search index. Add an explicit `User-agent: Claude-SearchBot` block with `Allow: /` to restore Claude search visibility.
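
The fall-through behaviour described in the two conflicts above can be reproduced with Python's standard `urllib.robotparser`. This is a minimal sketch: the `BEFORE` and `AFTER` rule sets are simplified, hypothetical stand-ins (not instagram.com's actual robots.txt), and the `can_fetch` helper is introduced here only for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules: training crawlers are named explicitly, and every other
# agent (including the search crawlers) falls through to `User-agent: *`.
BEFORE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /
"""

# The suggested fix: explicit allow blocks for the search crawlers only.
AFTER = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

""" + BEFORE

if __name__ == "__main__":
    for agent in ("GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot"):
        print(agent, "before:", can_fetch(BEFORE, agent), "after:", can_fetch(AFTER, agent))
```

Under these assumed rules the training crawlers stay blocked while the two search crawlers become fetchable, which is the separation the conflict notes recommend.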


No Content Signals in robots.txt. Content Signals (IETF draft, contentsignals.org) are a declarative way to state AI-usage preferences: `Content-Signal: search=yes, ai-input=yes, ai-train=no`. Compliance is voluntary today, but the signals are cheap to publish and align with the direction the standards are heading.
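
To make the `Content-Signal` line concrete, here is a rough sketch of how a scanner might read it out of a robots.txt body. The function name `parse_content_signals` and the yes/no-to-boolean mapping are assumptions for illustration; the draft itself defines the authoritative syntax.

```python
def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect Content-Signal key/value pairs from robots.txt text.

    Assumes the simple `key=yes|no` comma-separated form quoted in the report.
    """
    signals: dict[str, bool] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, val = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = val.strip().lower() == "yes"
    return signals


if __name__ == "__main__":
    sample = "User-agent: *\nContent-Signal: search=yes, ai-input=yes, ai-train=no\nAllow: /\n"
    print(parse_content_signals(sample))
```

An empty result corresponds to the "No Content Signals" finding reported here.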

| Vendor | Crawler | Purpose | Status |
| --- | --- | --- | --- |
| OpenAI | GPTBot | Training crawler for future OpenAI models. | ✗ Disallowed |
| OpenAI | OAI-SearchBot | ChatGPT Search index. Disallowing makes you invisible to ChatGPT Search. | ✗ Disallowed (fall-through) |
| OpenAI | ChatGPT-User | User-initiated retrieval. Ignores robots.txt by design. | Ignores robots.txt |
| Anthropic | ClaudeBot | Training crawler for Anthropic models. | ✗ Disallowed |
| Anthropic | Claude-User | Retrieves pages when a Claude user asks about them. Respects robots.txt (unlike OpenAI's ChatGPT-User). | ✗ Disallowed (fall-through) |
| Anthropic | Claude-SearchBot | Search index for Claude. Disallowing reduces Claude search quality. | ✗ Disallowed (fall-through) |
| Anthropic | claude-code | Claude Code CLI / IDE retrieval. Documentation-targeted. | ✗ Disallowed (fall-through) |
| Perplexity | PerplexityBot | Perplexity indexing. Disallowing removes you from Perplexity retrieval. | ✗ Disallowed |
| Perplexity | Perplexity-User | User-initiated retrieval. Ignores robots.txt by design. | Ignores robots.txt |
| Google | Google-Extended | Training opt-out for Gemini / Bard. Disallowing opts you out of Google AI training. | ✗ Disallowed |
| Google | GoogleOther | Catch-all for non-Search Google crawlers. | ✗ Disallowed (fall-through) |
| Meta | Meta-ExternalAgent | Meta AI crawler. Disallowing opts you out of Meta AI training/retrieval. | ✗ Disallowed (fall-through) |
| Apple | Applebot-Extended | Apple Intelligence training opt-out (separate from Applebot Search). | ✗ Disallowed |
| ByteDance | Bytespider | ByteDance / TikTok AI crawler. | ✗ Disallowed (fall-through) |
| Common Crawl | CCBot | Common Crawl. Heavily used as a training-corpus source by every major model. | ✗ Disallowed (fall-through) |

Structured data & discovery files

§2 of 4
| Artifact | Description | Status | Note |
| --- | --- | --- | --- |
| llms.txt | A markdown index of the site's most important pages, served at /llms.txt. Anthropic Claude Desktop and Claude.ai fetch this. IDE tooling (Cursor, Claude Code, GitHub Copilot, Cline, Aider) routinely retrieves it. Google has explicitly confirmed it does NOT support it (Gary Illyes, July 2025). OpenAI is unconfirmed. | Present | Anthropic Claude respects this; Google has confirmed it does not; OpenAI is unconfirmed. |
| llms-full.txt | Optional full-content companion to llms.txt. Useful for agents with large context windows that prefer a single fetch over crawling. Doesn't replace llms.txt; both can coexist. | ✓ Present | Optional full-content companion file. |

| Artifact | Description | Status | Note |
| --- | --- | --- | --- |
| schema.org Organization | The brand-identity anchor LLMs use to disambiguate the site. Without it, profile links on LinkedIn, Wikidata, Crunchbase, GitHub etc. aren't bound to the homepage's entity in the AI's knowledge graph. The sameAs array is the load-bearing field. | ✗ Missing | Entity anchor for the sameAs graph. |
| schema.org FAQPage | Pages with FAQPage JSON-LD show 2.7× citation rate vs without (41% vs 15% in the Relixir 2025 study). The JSON-LD must mirror visible Q&A content on the page; Google penalises mismatch. Single highest-leverage fix in the audit. | ✗ Missing | 2.7× citation rate vs without (Relixir 2025); highest-leverage single fix. |
| schema.org Article | For journalistic/editorial pages. Declares author, datePublished, dateModified, and section to AI engines. They preferentially cite recent, dated, authored content in answer-engine results. | ✗ Missing | For editorial pages. |
| schema.org HowTo | For step-by-step procedural content. AI engines preferentially cite HowTo markup when answering procedural queries ("how do I X"). Maps directly to retrieval intent. | ✗ Missing | For tutorials. |
| schema.org SoftwareApplication | For product/app pages. Maps to vendor-evaluation queries ("best X for Y"). Effectively required for B2B SaaS visibility in AI citations: 89% of B2B buyers now use AI for vendor research (Averi 2026). | ✗ Missing | For product pages. |
| Person (author entity) | Author entity on bylines, linked to the Article entity. E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) signal: AI engines weight content authored by named, credentialed people higher than anonymous content. | ✗ Missing | E-E-A-T signal on bylines. |
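
As an illustration of the Organization entity and its `sameAs` array, the sketch below builds a JSON-LD `<script>` snippet with the standard `json` module. The helper name `organization_jsonld` and all values passed to it (name, URL, profile links) are hypothetical placeholders, not data taken from this scan.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Render a minimal schema.org Organization entity as a JSON-LD script tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs binds external profiles to this homepage entity.
        "sameAs": list(same_as),
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(
        organization_jsonld(
            "Example Co",  # placeholder brand
            "https://example.com/",
            ["https://www.linkedin.com/company/example", "https://github.com/example"],
        )
    )
```

The output would be placed in the homepage `<head>`; other entity types in the table follow the same pattern with a different `@type` and fields.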

HTTP headers

§3 of 4
| Header | Description | Value |
| --- | --- | --- |
| X-Robots-Tag | Page-level crawler directives. The proposed `noai` / `noimageai` values request AI crawlers skip the page (publisher hint, honored by some, ignored by others). Not enforced; combine with explicit robots.txt rules for layered defense. | not set |
| Cache-Control | Tells crawlers how aggressively to cache the response. Aggressive `no-store` or `private no-cache` directives hurt retrieval freshness signaling: AI engines may distrust the page or refresh it less often. For public pages, prefer `public, max-age=300` or longer. | private, no-cache, no-store, must-revalidate |
| Link: canonical | Canonical URL declared as an HTTP response header (RFC 8288 + RFC 6596). Processed earlier in the retrieval pipeline than the HTML `<link rel="canonical">` tag, so it works for crawlers that don't fully render HTML. | not set |
| Content-Type | Declares the response format. `text/html; charset=utf-8` is the standard for pages. If the site supports Markdown negotiation, the same URL can serve `text/markdown` when `Accept: text/markdown` is sent. | text/html; charset="utf-8" |
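
The Cache-Control check can be approximated in a few lines. This is a sketch under the assumption that `private`, `no-cache`, and `no-store` are the directives treated as blockers; the function name `cache_control_blockers` is made up for the example and is not the scanner's actual logic.

```python
def cache_control_blockers(header: str) -> list[str]:
    """Return the directives in a Cache-Control value that discourage shared caching."""
    directives = {
        part.strip().split("=", 1)[0].lower()
        for part in header.split(",")
        if part.strip()
    }
    # Assumed blocker set, based on the directives this report calls out.
    return sorted(directives & {"private", "no-cache", "no-store"})


if __name__ == "__main__":
    print(cache_control_blockers("private, no-cache, no-store, must-revalidate"))
    print(cache_control_blockers("public, max-age=300"))
```

The first call uses the value observed in this scan; the second uses the value the report suggests for public pages.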
| Agent-content probe | Description | Status | Note |
| --- | --- | --- | --- |
| Markdown negotiation | Emerging Cloudflare-led standard. When an agent sends Accept: text/markdown, the site returns a markdown rendering of the page instead of HTML. Sibling of llms.txt. Reduces token cost for downstream AI usage and lets agents consume the site without rendering. | ✗ Returns HTML | No text/markdown response when Accept: text/markdown is sent. |
| Agent-discovery Link rels | RFC 8288 Link-header relations that let agents find machine-readable surfaces of the site without scraping HTML. We look for api-catalog (RFC 9727), service-desc (OpenAPI), describedby (RDF / JSON-LD), and agent-card (emerging convention). | ✗ None | No api-catalog / service-desc / describedby / agent-card rels. |
| Endpoint Context Protocol | Community spec (endpointcontextprotocol.io) for agent-vs-browser content negotiation. Two signals: a /.well-known/ecp.json manifest listing available representations, and a Vary header listing Sec-Fetch-Dest so caches handle the negotiation correctly. Pre-standards but emerging. | ✗ Not detected | No /.well-known/ecp.json; Vary does not list Sec-Fetch-Dest. |
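
Markdown negotiation comes down to inspecting the `Accept` header and choosing a representation. The sketch below shows one possible server-side decision, assuming a simplified q-value comparison between `text/markdown` and `text/html`; the names `prefers_markdown` and `negotiate` are hypothetical and no particular framework is implied.

```python
def prefers_markdown(accept: str) -> bool:
    """True if the Accept header ranks text/markdown at least as high as text/html."""
    weights: dict[str, float] = {}
    for item in accept.split(","):
        media, *params = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in params:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat malformed q-values as "not acceptable"
        weights[media.lower()] = q
    markdown_q = weights.get("text/markdown", 0.0)
    return markdown_q > 0 and markdown_q >= weights.get("text/html", 0.0)


def negotiate(accept: str, html: str, markdown: str) -> tuple[dict[str, str], str]:
    """Pick a body and response headers; Vary: Accept keeps caches correct."""
    if prefers_markdown(accept):
        return {"Content-Type": "text/markdown; charset=utf-8", "Vary": "Accept"}, markdown
    return {"Content-Type": "text/html; charset=utf-8", "Vary": "Accept"}, html


if __name__ == "__main__":
    headers, body = negotiate("text/markdown", "<h1>Hi</h1>", "# Hi")
    print(headers["Content-Type"], body)
```

A browser-style header such as `text/html,application/xhtml+xml,*/*;q=0.8` never names `text/markdown`, so it keeps receiving HTML.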

Top findings

§4 of 4
╴ Fix everything in one paste

Single prompt covering all 10 actionable findings, ordered by severity. Paste into Claude Code, Cursor, or any AI dev tool — the agent walks through each fix in sequence, groups changes by file, and reports what it touched.

  1. instagram.com disallows GPTBot, and ChatGPT search is also disallowed via fall-through.

    GPTBot is OpenAI's *training* crawler; disallowing it opts the site out of future training data. ChatGPT's *live retrieval* runs through OAI-SearchBot (which instagram.com has also disallowed via the fall-through `User-agent: *` rule) and ChatGPT-User (which ignores robots.txt by design). To keep ChatGPT able to cite instagram.com, add an explicit `User-agent: OAI-SearchBot` block with `Allow: /`.

    Severity: High
  2. instagram.com disallows ClaudeBot, and Claude search is also disallowed via fall-through.

    ClaudeBot is Anthropic's training crawler; disallowing it opts the site out of training. Claude's live retrieval uses Claude-User and Claude-SearchBot. instagram.com has disallowed Claude-SearchBot via the fall-through `User-agent: *` rule, removing the content from Claude's search index. Add an explicit `User-agent: Claude-SearchBot` block with `Allow: /` to restore Claude search visibility.

    Severity: High
  3. No JSON-LD schema.org markup on the homepage.

    Pages with 15+ recognized entities show 4.8× higher AI Overview selection probability. The baseline is shipping at least an Organization entity on the homepage.

    Severity: High
  4. llms.txt present but spec-noncompliant.

    Issues: missing H1 title (first line should be `# <Brand>`); missing blockquote summary (the spec recommends `> One-sentence summary` immediately after the title); no H2 sections found (the spec expects at least one, e.g. `## Documentation`). The agents that fetch llms.txt are tolerant, but the canonical structure (H1 title + blockquote summary + H2 sections with markdown link lists) is what tooling expects.

    Severity: Medium
  5. Cache-Control may prevent retrieval-layer caching.

    Aggressive no-store / private no-cache directives tell retrieval crawlers not to trust the response. For public pages you want cited, prefer `Cache-Control: public, max-age=300` or similar.

    Severity: Medium
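
The three llms.txt structure checks named in finding 4 can be sketched as a small linter. This is an illustrative approximation of the H1 / blockquote / H2 structure described in that finding, not the scanner's actual implementation; the function name `llms_txt_issues` and the sample file contents are hypothetical.

```python
def llms_txt_issues(text: str) -> list[str]:
    """Report which of the three expected llms.txt structural elements are missing."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    issues: list[str] = []

    has_title = bool(lines) and lines[0].startswith("# ")
    if not has_title:
        issues.append("missing H1 title")

    # The summary is expected right after the title (or first, if no title).
    summary_index = 1 if has_title else 0
    if len(lines) <= summary_index or not lines[summary_index].startswith(">"):
        issues.append("missing blockquote summary")

    if not any(ln.startswith("## ") for ln in lines):
        issues.append("no H2 sections")

    return issues


if __name__ == "__main__":
    good = "# Example Co\n\n> One-sentence summary.\n\n## Documentation\n\n- [Docs](https://example.com/docs)\n"
    print(llms_txt_issues(good))
    print(llms_txt_issues("Example Co\nPlain text with no structure.\n"))
```

A file that passes all three checks returns an empty list; a plain-text file trips all of them, matching the issues listed in finding 4.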