How Perplexity Crawls and Indexes Your Website
Your content deserves to be cited. Here's how to make it happen.

The rise of generative AI search has flipped traditional SEO on its head. Users are no longer just skimming ten blue links on a search results page; they're getting direct, cited answers from models like ChatGPT, Gemini, and Perplexity.
And unlike Google, which pulls from a near-infinite index of URLs, Perplexity operates with different rules. It uses a smaller, curated set of sources. Authority still matters, but freshness, structure, and clarity now decide the tie-breakers.
So if you want your content to show up as a citation in a Perplexity answer or be surfaced via its AI-powered shopping tools, you'll need to optimize differently.
Here's the playbook.
Crawling and Indexing 101 (Perplexity Edition)
Before your content can be cited, summarized, or surfaced in Perplexity, it has to be discovered, and that starts with crawling.
Crawling is how PerplexityBot (the platform's automated agent) finds pages on your site. It follows links, reads your robots.txt file, and uses public signals like sitemaps to map your site's structure.
Indexing comes next. Once a page is crawled, Perplexity decides whether to store and display its content, either in summaries, citations, or shopping results.
Unlike Google, Perplexity doesn't index everything it finds. It uses a curated index, meaning only clear, authoritative, and accessible content makes the cut. But the principles are familiar.
Key requirements:
- ✅ Don't block PerplexityBot in robots.txt
- ✅ Keep key pages publicly accessible: no login gates or paywalls
- ✅ Ensure every indexable page returns a 200 status code
- 🚫 Avoid redirect chains or infinite loops
- ⚠️ Don't rely on JavaScript to load core content (PerplexityBot does not render JS)
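The checklist above can be turned into a quick pre-flight audit. Here is a minimal sketch in Python; the classification logic and the hop threshold are illustrative assumptions, not documented Perplexity limits:

```python
def audit_url(status_chain, max_hops=2):
    """Classify one URL's fetch from a crawler's point of view.

    status_chain is the ordered list of HTTP status codes seen while
    following redirects; a healthy indexable page ends in a single 200.
    The max_hops threshold is illustrative, not a documented limit.
    """
    redirects = sum(1 for code in status_chain if code in (301, 302, 307, 308))
    if status_chain[-1] != 200:
        return "error"           # final response is not indexable
    if redirects > max_hops:
        return "redirect-chain"  # too many hops before reaching content
    return "ok"

print(audit_url([200]))                 # → ok
print(audit_url([301, 301, 301, 200]))  # → redirect-chain
print(audit_url([301, 404]))            # → error
```

Run this against your crawl export (for example, a Screaming Frog or curl trace) to flag pages that fail the checklist before a crawler does.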
While there's no "Perplexity Search Console," you can verify crawling activity by monitoring server logs for:
User-agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
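A minimal log filter for that user agent might look like this; the sample entries below are illustrative combined-format log lines, not real traffic:

```python
import re

# Match the PerplexityBot token anywhere in a combined-format access-log line.
BOT_RE = re.compile(r"PerplexityBot/\d+\.\d+")

def perplexitybot_lines(log_lines):
    """Return only the log lines produced by PerplexityBot visits."""
    return [line for line in log_lines if BOT_RE.search(line)]

sample = [
    '203.0.113.7 - - [04/May/2025:10:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; '
    'PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '198.51.100.9 - - [04/May/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(len(perplexitybot_lines(sample)))  # → 1
```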
And remember: even if PerplexityBot sees your site, it might skip over it if the content feels too shallow, unstructured, or stale. Make your pages easy to understand and clearly valuable from the start.
Understand how Perplexity discovers and uses your content
Perplexity relies on two main agents for crawling and indexing:
- PerplexityBot: This is the platform's crawler. Perplexity's documentation says PerplexityBot honors robots.txt (changes propagate in roughly 24 hours), although some researchers have reported edge-case misses, so monitor your logs to verify compliance. If you block it, Perplexity will still show your site's URL and title (like a bare citation), but it won't display any full-text content.
- Perplexity-User: This is a user-triggered agent, invoked when someone uses the Copilot tool or Perplexity Pro features to explore a specific URL. These visits don't follow robots.txt and act more like real-time browsing.
Perplexity is not a model-training platform. Your content isn't used to train foundation models. It's simply indexed, cited, and summarized in response to real-time queries.
Configure your site to be crawlable
To be indexed and cited, your site needs to be open to PerplexityBot. That means:
User-agent: PerplexityBot
Allow: /
If you want to block it:
User-agent: PerplexityBot
Disallow: /
And if you're unsure whether your site is being crawled, you can check for visits from:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
In addition, make sure your sitemap is submitted and accessible. While Perplexity does not currently offer a dedicated sitemap submission tool like Google Search Console, having a clean and comprehensive sitemap.xml file at the root of your domain helps other aggregators and secondary crawlers discover your most important content.
Remember: blocking PerplexityBot won't prevent the platform from referencing your title or domain in its answers; it just won't include any content.
Master Your Robots.txt File and XML Sitemap
PerplexityBot follows the Robots Exclusion Protocol (robots.txt) and respects all standard directives. That makes your robots.txt file a critical control layer for what the platform can and can't access.
Use robots.txt to:
- Block low-value or sensitive pages (e.g., /admin, /cart, /internal-search)
- Specify which bots are allowed (e.g., User-agent: PerplexityBot)
- Declare the location of your sitemap
Example:
# robots.txt - place at https://www.example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /search/
User-agent: PerplexityBot
Allow: /
Sitemap: https://www.example.com/sitemap.xml
⚠️ Note: Blocking a page via robots.txt does not guarantee it won't be cited. If another site links to that page, Perplexity may still display a bare citation (just the title and URL), even if it can't access the full content.
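Before deploying a robots.txt like the example above, you can sanity-check its rules locally with Python's standard-library parser:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/

User-agent: PerplexityBot
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# PerplexityBot matches its own group, which allows everything.
print(rp.can_fetch("PerplexityBot", "https://www.example.com/blog/post"))  # → True
# A bot with no dedicated group falls back to the * rules.
print(rp.can_fetch("SomeOtherBot", "https://www.example.com/admin/x"))     # → False
```

Note the grouping behavior this demonstrates: once a crawler matches a named `User-agent` group, the `*` rules no longer apply to it, which is easy to get wrong when mixing allow and disallow blocks.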
Your XML sitemap acts as a roadmap for crawlers, including Perplexity and any secondary aggregators it may reference.
A good sitemap should include:
- All important, canonical URLs
- Only 200-status pages (no redirects or errors)
- Accurate <lastmod> dates for freshness signaling
- Exclusion of noindex or disallowed pages
Even though Perplexity doesn't currently offer a sitemap submission tool (like Google Search Console), referencing your sitemap in robots.txt increases discoverability, especially via partner crawlers or shared infrastructure.
Here's a tip: use separate sitemap files (or a sitemap index) if you have more than 50,000 URLs or if the uncompressed file exceeds 50 MB.
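A conforming sitemap entry is simple enough to generate from a URL list. A minimal sketch with the standard library (the URL and date are placeholders):

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: (loc, lastmod) pairs for canonical, 200-status URLs only."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # freshness signal
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap([("https://www.example.com/guide", "2025-05-04")])
print(sitemap_xml)
```

Keeping `<lastmod>` accurate matters more here than in classic SEO: freshness is one of Perplexity's tie-breakers.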
Prevent Indexing with Meta Robots Tags
Allowing PerplexityBot to crawl your site doesn't mean you're forced to have every page indexed or cited.
PerplexityBot respects meta robots directives embedded in your HTML, just like search engine crawlers do. That means you can allow crawling (via robots.txt) while selectively excluding certain pages from summaries, citations, and internal indexing using <meta> tags.
Add the following inside the <head> section of your HTML to exclude a page:
<meta name="robots" content="noindex">
This ensures PerplexityBot crawls the page, sees the directive, and skips indexing it.
Combine with other options as needed:
<meta name="robots" content="noindex, nofollow">
Available directives include:
- noindex: Don't include this page in summaries or internal search.
- nofollow: Don't follow the links on this page.
- nosnippet: Don't display any preview text or media.
- noarchive: Prevent caching of this page.
- unavailable_after: Set an expiration date for visibility.
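A crawler reads these directives from the served HTML, which you can mimic with Python's standard-library parser to verify a page is emitting what you intended:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> tags, as a crawler would."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.directives += [d.strip() for d in attr.get("content", "").split(",")]

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')
print(parser.directives)  # → ['noindex', 'nofollow']
```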
For non-HTML files like PDFs or videos, apply an X-Robots-Tag in the HTTP response header:
X-Robots-Tag: noindex
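How you attach that header depends on your web server. Assuming nginx, a sketch (the file pattern is illustrative):

```nginx
# nginx: serve X-Robots-Tag: noindex on every PDF response
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex";
}
```

On Apache, the equivalent is a `Header set X-Robots-Tag "noindex"` directive inside a `<FilesMatch "\.pdf$">` block (requires mod_headers).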
Do not block noindex pages in robots.txt. If you do, PerplexityBot can't reach the page to read the meta tag, meaning it may still show up as a bare citation (just title + URL) if linked from other sources.
Why it matters:
- PerplexityBot does not use your content for model training, but it does index and summarize publicly accessible pages.
- Using noindex lets you opt certain pages out of those summaries without limiting crawl-based discovery across the rest of your site.
This is especially useful for:
- Landing pages, paywalled content, or internal-facing tools
- Drafts or experiment pages you don't want cited
- Avoiding fragmented or duplicate page summaries in responses
Crawling and indexing are two separate layers of control. With Perplexity, you can (and should) fine-tune both.
Use clear, authoritative content formats
Perplexity doesn't rely on traditional keyword matching. It uses natural language understanding to extract relevant, verifiable answers from your site.
That means:
- Answer real questions: Write content in Q&A format. Think FAQs, how-tos, comparisons.
- Make your structure scannable: Use H2s, bullet points, bolded headers, and short paragraphs.
- Cite sources: Link out to reputable domains. Perplexity favors well-cited content.
- Avoid keyword stuffing: Prioritize clarity and context over density.
Perplexity's summarizer gives extra weight to lists, tables, and FAQs that already include outbound citations; those citations sometimes get echoed verbatim in Perplexity's answer card.
Perplexity also shows a preference for structured content formats like:
- Product comparison tables
- Summarized lists of pros/cons
- Step-by-step guides
These layouts make it easier for the AI to pull relevant snippets with confidence.
Submit your site manually with Perplexity Pages
One major difference between Perplexity and Google? You can submit your content directly via Perplexity Pages, a feature within Perplexity Pro.
Think of Pages as Perplexity's built-in blog platform. When you publish a Page, it instantly lives inside Perplexity's index and lets you embed citations back to your own site. Treat it as a canonical summary: clear H2s, tight paragraphs, and links pointing to the deeper resources you want PerplexityBot to crawl next.
It works like a streamlined version of Google Search Console:
- Create a Perplexity Pro account
- Navigate to the "Pages" tab
- Publish a Page that summarizes your research and links back to your site. The Page itself is instantly included in Perplexity's internal index, and your outbound links give your domain another citation surface.
- Structure your pages with concise, factual, and well-organized sections
This feature is still early-stage, but early adoption matters. Brands already using Perplexity Pages are seeing top-ranked visibility in answer summaries.
Keep your content updated and visibly trusted
Perplexity surfaces content based on freshness and trustworthiness. It prefers:
- Pages updated regularly
- Domains with strong E-E-A-T (Experience, Expertise, Authority, Trust) signals
- Content that appears on or links to platforms like Yahoo, MarketWatch, Reddit, or Wikipedia
What matters most?
- Date stamps: Make it obvious when your content was last updated
- Outbound links to authoritative sources
- Active participation in trusted ecosystems (like being cited on Reddit, included on Amazon, or covered in industry blogs)
Perplexity isn't just scraping any site; it's citing a carefully curated set. Your job is to position your content in that set.
Treat Perplexity SEO as a hybrid strategy
Getting cited in Perplexity means playing both sides:
- You still need strong Google SEO fundamentals: indexing, sitemap, page speed, and mobile optimization. Perplexity surfaces content through real-time crawling, often through secondary or partner data.
- If Google doesn't rank or trust your content, it's far less likely to appear in Perplexity. Perplexity often references top-ranking Google pages, reinforcing the value of traditional SEO.
- You also need GEO (Generative Engine Optimization): a new layer that optimizes content for LLM-driven summary, citation, and retrieval.
This means:
- Creating llms.txt or llms-full.txt to highlight your best resources
- Using schema markup to reinforce meaning (FAQ, Article, Product)
- Writing like you're the answer, because you might be
What you should know about llms.txt for Perplexity
llms.txt is a proposed file format (a Markdown file at yourdomain.com/llms.txt) meant to help LLMs parse structured content (guides, docs, etc.).
In theory, it offers a human- and machine-readable TOC for AI assistants.
However:
- No LLM provider, including Perplexity, has officially confirmed support.
- Crawlers and assistants like GPTBot, Claude, and Gemini do not reference it as a ranking or citation signal.
- Google's John Mueller compared it to the deprecated keywords meta tag: something site owners want to believe works, but can't verify.
That said, llms.txt is easy to create and doesnât conflict with other files. If you're experimenting:
# Title
Brief description of the site.
## Section Name
- [Link Title](https://link_url): Optional description
- [Link Title](https://link_url/subpath): Optional description
## Another Section
- [Link Title](https://link_url): Optional description
Think of it as speculative. There's no harm in having it, but don't expect benefits today. Treat it like future-proofing, not a core SEO tactic.
Use Schema Markup to Reinforce Context
Perplexity relies on structure to extract the right snippets. One of the easiest ways to give it confidence in your content's meaning is through schema markup.
Schema markup adds machine-readable metadata to your HTML, describing what each page is about: whether it's a blog post, product, FAQ, or guide. This helps Perplexity (and secondary crawlers it may consult) correctly interpret your content and cite it with more precision.
Use one of two methods:
- JSON-LD: A single <script type="application/ld+json"> block in your HTML <head>
- Microdata: Inline attributes inside HTML elements
Hereâs an example using JSON-LD for a blog post:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "How to Ensure Perplexity Can Crawl and Index Your Website",
  "author": {
    "@type": "Organization",
    "name": "Daydream"
  },
  "datePublished": "2025-05-04",
  "image": "https://example.com/images/perplexity-seo-guide.jpg",
  "publisher": {
    "@type": "Organization",
    "name": "Daydream",
    "logo": {
      "@type": "ImageObject",
      "url": "https://example.com/logo.png"
    }
  },
  "description": "A detailed guide to improving your site's visibility in Perplexity AI, including crawlability, structure, and citation readiness."
}
</script>
For FAQs, use the FAQPage type with individual Question and Answer pairs. This increases the likelihood that Perplexity will extract direct, verifiable Q&A snippets.
Example:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Does Perplexity use my content for training?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "No. Perplexity indexes and summarizes content for retrieval, but it does not use your site to train its foundational models."
    }
  }]
}
</script>
Use Googleâs Rich Results Test or Schema.orgâs validator to ensure your markup is error-free, even if your primary goal is Perplexity visibility.
Structured data wonât guarantee citation, but it:
- Improves machine interpretability
- Boosts snippet accuracy
- Clarifies relationships (e.g., author, publish date, product specs)
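External validators aside, a quick local smoke test catches the most common failure: malformed JSON inside the script tag. A minimal sketch (the required-key set is a simplifying assumption, not the full Schema.org spec):

```python
import json

REQUIRED = {"@context", "@type"}

def check_jsonld(block):
    """Parse a JSON-LD payload and report its type plus any missing core keys."""
    data = json.loads(block)  # raises ValueError if the JSON is malformed
    missing = REQUIRED - data.keys()
    return data["@type"], sorted(missing)

payload = """{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": []
}"""
print(check_jsonld(payload))  # → ('FAQPage', [])
```

Wiring this into a CI step keeps a stray trailing comma from silently invalidating every structured-data block on the site.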
Adoption isn't universal yet, but early movers gain an edge as LLMs increasingly consult structured data.
Perplexity doesn't just list your link. It quotes you. That quote needs to make sense without a click.
Prioritize server-side rendering
A critical but often overlooked factor: PerplexityBot, like most AI crawlers, does not render JavaScript. That means if your site relies on client-side rendering (CSR), key content might be invisible.
To ensure Perplexity sees your content:
- Use server-side rendering (SSR), static site generation (SSG), or incremental static regeneration (ISR) for critical pages
- Ensure essential content (text, metadata, links) is included in the initial HTML response
- Use CSR only for enhancements (e.g., interactivity, counters, widgets)
PerplexityBot behaves similarly to other major LLM crawlers like GPTBot and ClaudeBot: it fetches JavaScript files but does not execute them. In other words, it can't "see" content loaded dynamically via client-side scripts.
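You can approximate that crawler's view by inspecting only the initial HTML response, with no script execution. A sketch using illustrative markup:

```python
# What a non-rendering crawler receives: the raw HTML, scripts unexecuted.
CSR_PAGE = """
<html><body>
  <div id="app"></div>              <!-- filled in by JavaScript at runtime -->
  <script src="/bundle.js"></script>
</body></html>
"""

SSR_PAGE = """
<html><body>
  <article><h1>Pricing guide</h1><p>Plans start at $10/month.</p></article>
</body></html>
"""

for name, html in [("CSR", CSR_PAGE), ("SSR", SSR_PAGE)]:
    visible = "Pricing guide" in html  # is the core content in the initial HTML?
    print(name, visible)
```

The practical test is the same with real pages: fetch your URL with curl (or view-source) and confirm the text you want cited appears in the raw response.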
Monitor domain visibility with overlap tracking
Here are a few key takeaways on which domains show up most often in Perplexity citations:
- Perplexity favors sites like Reddit, MarketWatch, and Yahoo more than Google does
- In eCommerce, Amazon is massively prioritized
- In healthcare, Mayo Clinic and NIH dominate
- In B2B tech, sites like TechTarget, IBM, and Cloudflare perform best
So if your site isn't in those spheres yet, start with:
- Syndicating content on those platforms
- Earning mentions or backlinks from them
- Structuring your content like theirs
Visibility is now a multi-layered game
In a world where AI-driven tools like Perplexity are becoming primary discovery engines, your content strategy needs to do more than rank.
It needs to:
- Be crawled (open robots.txt)
- Be structured (clean headers, Q&A format)
- Be trusted (well-cited, up-to-date)
- Be submitted (via Perplexity Pro Pages)
- Be adaptive (ready for GEO and AI citation patterns)
We're no longer optimizing for blue links; we're optimizing to be the answer.
At daydream, we help brands rethink content from the ground up for generative discovery. Want to ensure Perplexity sees (and cites) you? Let's talk.