Can AI find instagram.com?

Report · instagram.com · checked 2026-05-21 23:17 UTC · methodology v0.1 (preview) · canaifind.com/r/qJiAxqMQ

No. AI engines are likely blocked from indexing instagram.com.

This is a static scan of robots.txt, llms.txt, schema.org markup, and HTTP headers. Live engine probes across ChatGPT, Claude, Gemini, and Perplexity are queued for a future build. Real visibility depends on category and comparison queries, which the Audit tier measures with a 100-prompt stratified set.

AI crawler robots.txt audit

§1 of 4

⚠ Conflict detected

instagram.com disallows GPTBot, and ChatGPT search is also disallowed via fall-through.

GPTBot is OpenAI's *training* crawler — disallowing it opts the site out of future training data. ChatGPT's *live retrieval* runs through OAI-SearchBot (which instagram.com has also disallowed via the fall-through `User-agent: *` rule) and ChatGPT-User (which ignores robots.txt by design). To keep ChatGPT able to cite instagram.com, add an explicit `User-agent: OAI-SearchBot` block with `Allow: /`.
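To see why the fall-through rule catches OAI-SearchBot and how an explicit block changes the outcome, here is a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies are simplified, hypothetical stand-ins (not instagram.com's actual file), and `example.com` and the helper `is_allowed` are placeholders introduced for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified robots.txt: the training crawler is named explicitly,
# and every other agent falls through to the wildcard group.
BEFORE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

# The same file with the explicit search-crawler group the finding recommends.
AFTER = """\
User-agent: OAI-SearchBot
Allow: /

""" + BEFORE


def is_allowed(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for label, body in (("before", BEFORE), ("after", AFTER)):
        for agent in ("GPTBot", "OAI-SearchBot"):
            print(label, agent, is_allowed(body, agent))
```

In this sketch the explicit group re-enables only OAI-SearchBot; GPTBot stays disallowed because its own group still applies.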

⚠ Conflict detected

instagram.com disallows ClaudeBot, and Claude search is also disallowed via fall-through.

ClaudeBot is Anthropic's training crawler — disallowing it opts the site out of training. Claude's live retrieval uses Claude-User and Claude-SearchBot. instagram.com has disallowed Claude-SearchBot via the fall-through `User-agent: *` rule, removing the content from Claude's search index. Add an explicit `User-agent: Claude-SearchBot` block with `Allow: /` to restore Claude search visibility.

No Content Signals in robots.txt. Content Signals (IETF draft, contentsignals.org) are a declarative way to state AI-usage preferences, for example `Content-Signal: search=yes, ai-input=yes, ai-train=no`. Compliance is voluntary today, but the signals are cheap to publish and align with the direction the standards are heading.
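As a rough illustration of how a consumer might read such a line, the sketch below parses a `Content-Signal` directive into a dictionary. The helper name `parse_content_signal` is hypothetical, and the parsing rules (comma-separated `key=value` pairs with yes/no values) are inferred from the example directive rather than taken from the draft text.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse a `Content-Signal:` line into {signal_name: allowed}."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError(f"not a Content-Signal line: {line!r}")
    signals: dict[str, bool] = {}
    for pair in value.split(","):
        key, sep, setting = pair.partition("=")
        if not sep:
            continue  # skip empty or malformed fragments
        # Assumption: anything other than "yes" is treated as a refusal.
        signals[key.strip().lower()] = setting.strip().lower() == "yes"
    return signals
```

Treating every non-`yes` value as a refusal is a conservative choice made for this sketch, not something the source specifies.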

OpenAI

GPTBot	Training crawler for future OpenAI models.	✗ Disallowed
OAI-SearchBot	ChatGPT Search index. Disallowing makes you invisible to ChatGPT Search.	✗ Disallowed (fall-through)
ChatGPT-User	User-initiated retrieval. Ignores robots.txt by design.	— Ignores robots.txt

Anthropic

ClaudeBot	Training crawler for Anthropic models.	✗ Disallowed
Claude-User	Retrieves pages when a Claude user asks about them. Respects robots.txt (unlike OpenAI's ChatGPT-User).	✗ Disallowed (fall-through)
Claude-SearchBot	Search index for Claude. Disallowing reduces Claude search quality.	✗ Disallowed (fall-through)
claude-code	Claude Code CLI / IDE retrieval. Documentation-targeted.	✗ Disallowed (fall-through)

Perplexity

PerplexityBot	Perplexity indexing. Disallowing removes you from Perplexity retrieval.	✗ Disallowed
Perplexity-User	User-initiated retrieval. Ignores robots.txt by design.	— Ignores robots.txt

Google

Google-Extended	Training opt-out for Gemini / Bard. Disallowing opts you out of Google AI training.	✗ Disallowed
GoogleOther	Catch-all for non-Search Google crawlers.	✗ Disallowed (fall-through)
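The tables distinguish crawlers that are disallowed by their own group from those disallowed only through the wildcard group. How the scan derives these labels is not stated; the sketch below is one plausible way to reproduce them with `urllib.robotparser`, and the helper names `named_agents` and `crawler_status` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def named_agents(robots_txt: str) -> set[str]:
    """Collect lowercased agent names that have their own User-agent line."""
    names = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent" and value.strip() != "*":
            names.add(value.strip().lower())
    return names


def crawler_status(robots_txt: str, agent: str, url: str = "https://example.com/") -> str:
    """Label an agent as allowed, disallowed, or disallowed (fall-through)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(agent, url):
        return "allowed"
    if agent.lower() in named_agents(robots_txt):
        return "disallowed"
    return "disallowed (fall-through)"
```

The sketch checks a single URL, so it cannot report partial disallows (for example, a group that blocks only some paths).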

Structured data & discovery files

§2 of 4

Artifact	Status	Note
llms.txt	✓ Present	Anthropic Claude respects this; Google has confirmed it does not; OpenAI is unconfirmed.
llms-full.txt	✓ Present	Optional full-content companion file.
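Presence alone does not guarantee a well-formed file. The llms.txt convention expects an H1 title, a blockquote summary right after it, and at least one H2 section with link lists. A minimal sketch of a checker for those three points follows; the function name and issue strings are hypothetical, and it does not cover the rest of the spec.

```python
def llms_txt_issues(text: str) -> list[str]:
    """Return structural problems relative to the H1 / blockquote / H2 layout."""
    # Ignore blank lines so the title and summary can be separated by spacing.
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    issues = []
    if not lines or not lines[0].startswith("# "):
        issues.append("missing H1 title")
    # The summary is expected directly after the title (second non-blank line).
    if len(lines) < 2 or not lines[1].startswith(">"):
        issues.append("missing blockquote summary")
    if not any(ln.startswith("## ") for ln in lines):
        issues.append("no H2 sections")
    return issues
```

An empty list means the file passes these three checks only.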

Artifact	Status	Note
schema.org Organization	✗ Missing	Entity anchor for the sameAs graph.
schema.org FAQPage	✗ Missing	2.7× citation rate vs without (Relixir 2025) — highest-leverage single fix.
schema.org Article	✗ Missing	For editorial pages.
schema.org HowTo	✗ Missing	For tutorials.
schema.org SoftwareApplication	✗ Missing	For product pages.
Person (author entity)	✗ Missing	E-E-A-T signal on bylines.
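For the missing Organization entity, a homepage would typically carry a JSON-LD script tag. The sketch below builds one in Python; the function name is hypothetical, the values passed in are placeholders, and only a handful of commonly used schema.org properties (`name`, `url`, `sameAs`) are shown.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Build a JSON-LD script tag describing a schema.org Organization."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,  # profile URLs that identify the same entity
    }
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"
```

A call such as `organization_jsonld("Example Co", "https://example.com", ["https://example.org/example-co"])` returns markup that can be placed in the page `<head>`.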

HTTP headers

§3 of 4

Header	Value
X-Robots-Tag	— not set
Cache-Control	private, no-cache, no-store, must-revalidate
Link: canonical	— not set
Content-Type	text/html; charset="utf-8"

Agent-content probe	Status	Note
Markdown negotiation	✗ Returns HTML	No text/markdown response when Accept: text/markdown is sent.
Agent-discovery Link rels	✗ None	No api-catalog / service-desc / describedby / agent-card rels.
Endpoint Context Protocol	✗ Not detected	No /.well-known/ecp.json; Vary does not list Sec-Fetch-Dest.
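The Markdown negotiation row can be approximated with a plain HTTP request. The sketch below uses Python's standard `urllib.request`; the function names are hypothetical, and how the scan itself performs this probe is not stated in the report.

```python
import urllib.request


def markdown_request(url: str) -> urllib.request.Request:
    """Build a GET request that asks the server for text/markdown."""
    return urllib.request.Request(url, headers={"Accept": "text/markdown"})


def is_markdown(content_type: str) -> bool:
    """True if a Content-Type header value names the text/markdown media type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/markdown"


def probe(url: str, timeout: float = 10.0) -> bool:
    """Fetch `url` and report whether the server answered with Markdown."""
    with urllib.request.urlopen(markdown_request(url), timeout=timeout) as resp:
        return is_markdown(resp.headers.get("Content-Type", ""))
```

Under this check, the `text/html; charset="utf-8"` value recorded in the header table counts as an HTML response.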

Top findings

§4 of 4

1. instagram.com disallows GPTBot, and ChatGPT search is also disallowed via fall-through. (High)
GPTBot is OpenAI's *training* crawler; disallowing it opts the site out of future training data. ChatGPT's *live retrieval* runs through OAI-SearchBot (which instagram.com has also disallowed via the fall-through `User-agent: *` rule) and ChatGPT-User (which ignores robots.txt by design). To keep ChatGPT able to cite instagram.com, add an explicit `User-agent: OAI-SearchBot` block with `Allow: /`.

2. instagram.com disallows ClaudeBot, and Claude search is also disallowed via fall-through. (High)
ClaudeBot is Anthropic's training crawler; disallowing it opts the site out of training. Claude's live retrieval uses Claude-User and Claude-SearchBot. instagram.com has disallowed Claude-SearchBot via the fall-through `User-agent: *` rule, removing the content from Claude's search index. Add an explicit `User-agent: Claude-SearchBot` block with `Allow: /` to restore Claude search visibility.

3. No JSON-LD schema.org markup on the homepage. (High)
Pages with 15+ recognized entities show 4.8× higher AI Overview selection probability. The baseline is shipping at least an Organization entity on the homepage.

4. llms.txt present but spec-noncompliant. (Med)
Issues: missing H1 title (the first line should be `# <Brand>`); missing blockquote summary (the spec recommends `> One-sentence summary` immediately after the title); no H2 sections found (the spec expects at least one, e.g. `## Documentation`). Agents that fetch llms.txt are tolerant, but tooling expects the canonical structure: an H1 title, a blockquote summary, and H2 sections with markdown link lists.

5. Cache-Control may prevent retrieval-layer caching. (Med)
Aggressive `no-store` / `private, no-cache` directives tell retrieval crawlers not to trust the response. For public pages you want cited, prefer `Cache-Control: public, max-age=300` or similar.
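The Cache-Control finding can be checked mechanically. The sketch below splits a header value into directives and flags `no-store` and `private`, the two that stop shared caches from storing a response. The function names are hypothetical, and whether a given retrieval crawler acts on these directives is an assumption, not something the sketch verifies.

```python
def cache_directives(header: str) -> set[str]:
    """Split a Cache-Control value into lowercased directive names."""
    return {
        part.split("=", 1)[0].strip().lower()
        for part in header.split(",")
        if part.strip()
    }


def blocks_shared_caching(header: str) -> bool:
    """True if the header contains `no-store` or `private`."""
    return bool(cache_directives(header) & {"no-store", "private"})
```

Applied to the value in the header table (`private, no-cache, no-store, must-revalidate`), the check flags the response; `public, max-age=300` passes.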
