Can AI find openai.com?

Report · openai.com · checked 2026-05-21 16:01 UTC · methodology v0.1 (preview) · canaifind.com/r/hJH6624s

Partial: some fundamentals in place; high-leverage gaps identified.

This is a static scan (robots.txt, llms.txt, schema.org, and HTTP headers). Live engine probes across ChatGPT, Claude, Gemini, and Perplexity are queued for a future build. Real visibility lives in category and comparison queries, which we measure with a 100-prompt stratified set on the Audit tier.


AI crawler robots.txt audit

§1 of 4

All AI crawlers are allowed. openai.com's robots.txt allows every AI crawler we track. This is the default-permissive configuration, which is fine for being discovered, ingested for training, and cited. To opt out of training (for legal, competitive, or strategic reasons), add explicit Disallow rules for training-purpose crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended) while leaving search-index crawlers allowed.
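
The training opt-out described above can be checked locally before publishing. The following is a minimal sketch, assuming a hypothetical robots.txt (not openai.com's actual file) and Python's standard `urllib.robotparser`; the `example.com` URL and the `allowed_agents` helper are illustrative, and real crawlers may match user-agent tokens differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block training-purpose crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str],
                   url: str = "https://example.com/") -> dict[str, bool]:
    """Return, per user agent, whether the given robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT, ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

With this file, the training crawlers are disallowed while the search-index crawlers fall through to the permissive `*` group.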

No Content Signals in robots.txt. Content Signals (IETF draft, contentsignals.org) are a declarative way to state AI-usage preferences, for example `Content-Signal: search=yes, ai-input=yes, ai-train=no`. Compliance is voluntary today, but the signals are cheap to publish and align with the direction the standards are heading.
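
To illustrate how a `Content-Signal` line might be read, here is a rough sketch of a parser. It is a naive reading of the `key=yes|no` form quoted above, not the normative grammar from the draft: it ignores per-user-agent grouping, and the `parse_content_signals` name and sample file are our own assumptions.

```python
def parse_content_signals(robots_txt: str) -> dict[str, bool]:
    """Collect Content-Signal key=yes/no pairs from a robots.txt body.

    Simplification: signals from every group are merged into one dict,
    so later lines overwrite earlier ones.
    """
    signals: dict[str, bool] = {}
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, setting = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = setting.strip().lower() == "yes"
    return signals


# Hypothetical robots.txt using the example signal from the report.
SAMPLE = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
"""

print(parse_content_signals(SAMPLE))
```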

OpenAI

GPTBot	Training crawler for future OpenAI models.	✓ Allowed
OAI-SearchBot	ChatGPT Search index. Disallowing makes you invisible to ChatGPT Search.	✓ Allowed
ChatGPT-User	User-initiated retrieval. Ignores robots.txt by design.	— Ignores robots.txt

Anthropic

ClaudeBot	Training crawler for Anthropic models.	✓ Allowed
Claude-User	Retrieves pages when a Claude user asks about them. Respects robots.txt (unlike OpenAI's ChatGPT-User).	✓ Allowed
Claude-SearchBot	Search index for Claude. Disallowing reduces Claude search quality.	✓ Allowed
claude-code	Claude Code CLI / IDE retrieval. Documentation-targeted.	✓ Allowed

Perplexity

PerplexityBot	Perplexity indexing. Disallowing removes you from Perplexity retrieval.	✓ Allowed
Perplexity-User	User-initiated retrieval. Ignores robots.txt by design.	— Ignores robots.txt

Google

Google-Extended	Training opt-out token for Gemini (formerly Bard). Disallowing opts you out of Google AI training.	✓ Allowed
GoogleOther	Catch-all for non-Search Google crawlers.	✓ Allowed

Structured data & discovery files

§2 of 4

Artifact	Status	Note
llms.txt	✗ Missing	Anthropic Claude respects this; Google has confirmed it does not; OpenAI is unconfirmed.
llms-full.txt	✗ Missing	Optional full-content companion file.

Artifact	Status	Note
schema.org Organization	✗ Missing	Entity anchor for the sameAs graph.
schema.org FAQPage	✗ Missing	2.7× citation rate compared with pages lacking it (Relixir 2025); highest-leverage single fix.
schema.org Article	✗ Missing	For editorial pages.
schema.org HowTo	✗ Missing	For tutorials.
schema.org SoftwareApplication	✗ Missing	For product pages.
Person (author entity)	✗ Missing	E-E-A-T signal on bylines.
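
For the missing schema.org types in the table above, JSON-LD is one common way to publish them. The sketch below builds Organization and FAQPage objects and wraps them in a script tag. The organization name, URLs, and question text are placeholders, and the helper names are hypothetical; only the schema.org type and property names (`Organization`, `sameAs`, `FAQPage`, `mainEntity`, `Question`, `acceptedAnswer`, `Answer`) come from the vocabulary itself.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> dict:
    """Entity anchor: sameAs links the site to its other profiles."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,
    }


def faq_jsonld(qa_pairs: list[tuple[str, str]]) -> dict:
    """FAQPage with one Question/Answer entry per (question, answer) pair."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }


def as_script_tag(data: dict) -> str:
    """Serialize to a JSON-LD script tag; escape '</' so text cannot close the tag."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values for illustration only.
org = organization_jsonld(
    "Example Corp", "https://example.com", ["https://github.com/example"]
)
faq = faq_jsonld([("What does Example Corp do?", "It builds example software.")])
print(as_script_tag(org))
```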

HTTP headers

§3 of 4

Header	Value
X-Robots-Tag	— not set
Cache-Control	— not set
Link: canonical	— not set
Content-Type	text/html; charset=UTF-8

Agent-content probe	Status	Note
Markdown negotiation	✗ Returns HTML	No text/markdown response when Accept: text/markdown is sent.
Agent-discovery Link rels	✗ None	No api-catalog / service-desc / describedby / agent-card rels.
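
A Markdown negotiation probe like the one in the table above can be approximated with the standard library. This is a sketch under our own assumptions, not the scanner's actual implementation: the `example-probe/0.1` user agent, the function names, and the three result labels are all hypothetical. It separates a blocked probe (401/403) from a genuine non-Markdown response, since the two mean different things.

```python
import urllib.error
import urllib.request


def build_probe(url: str) -> urllib.request.Request:
    """Request that asks for Markdown via content negotiation."""
    return urllib.request.Request(
        url,
        headers={"Accept": "text/markdown", "User-Agent": "example-probe/0.1"},
    )


def classify(status: int, content_type: str) -> str:
    """Map a response to 'blocked', 'markdown', or 'no-markdown'."""
    if status in (401, 403):
        return "blocked"  # bot protection: Markdown support is unknown
    media_type = content_type.split(";", 1)[0].strip().lower()
    if status == 200 and media_type == "text/markdown":
        return "markdown"
    return "no-markdown"


def probe(url: str, timeout: float = 10.0) -> str:
    """Send the probe and classify the result (performs a network request)."""
    try:
        with urllib.request.urlopen(build_probe(url), timeout=timeout) as resp:
            return classify(resp.status, resp.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as err:
        content_type = err.headers.get("Content-Type", "") if err.headers else ""
        return classify(err.code, content_type)
```

Calling `probe("https://example.com/")` would perform the request; `classify` can be exercised offline with a status code and a Content-Type value.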

Top findings

§4 of 4


1	Could not fetch the homepage.	Med
We requested https://openai.com/ and received HTTP 403, which typically indicates a Cloudflare, PerimeterX, or Akamai JS challenge or bot rule. AI retrieval crawlers that don't execute JavaScript (OAI-SearchBot, Claude-SearchBot, PerplexityBot, GoogleOther) likely face the same block. That openai.com's homepage is invisible to non-browser clients is itself a finding.

2	No Content Signals in robots.txt.	Tip
Content Signals (IETF draft, contentsignals.org) are a declarative way to state AI-usage preferences, for example `Content-Signal: search=yes, ai-input=yes, ai-train=no`. Compliance is voluntary today, but the signals are cheap to publish and align with the direction the standards are heading.

3	No Link: rel="canonical" HTTP header.	Tip
Most CMSs handle canonicalization via `<link rel="canonical">` in HTML. The HTTP header version is also read by retrieval crawlers that don't fully parse HTML, so adding it as well is an optional extra.

4	No Cache-Control header.	Tip
Most CDNs default to reasonable cache behavior. An explicit `Cache-Control: public, max-age=300` (or longer for stable content) makes the intent clear and helps retrieval crawlers judge freshness.

5	Could not probe Markdown negotiation.	Tip
The homepage returned 401/403 to our `Accept: text/markdown` probe, which typically indicates bot protection. We cannot determine whether the site supports Markdown for Agents.
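
The canonical and Cache-Control findings above both come down to two response headers. A minimal sketch of building them follows, assuming a hypothetical `retrieval_headers` helper and a placeholder URL; how the headers are attached depends on the server or CDN in use, which this sketch does not cover.

```python
def retrieval_headers(canonical_url: str, max_age: int = 300) -> dict[str, str]:
    """Build the optional Link (canonical) and Cache-Control response headers."""
    if max_age < 0:
        raise ValueError("max_age must be non-negative")
    return {
        # Link header syntax: target in angle brackets, then the relation type.
        "Link": f'<{canonical_url}>; rel="canonical"',
        "Cache-Control": f"public, max-age={max_age}",
    }


# Placeholder URL for illustration only.
headers = retrieval_headers("https://example.com/pricing")
for name, value in headers.items():
    print(f"{name}: {value}")
```

A longer `max_age` suits stable content; the 300-second default mirrors the value suggested in the finding.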
