FREE
+50 XP
How Search Engines Find Pages
🤖
Googlebot: the search crawler
"Hi! I'm Googlebot. Every day I crawl billions of pages across the web. Want to know how I find your site and decide what makes it into search?"
💡 Crawling: the process of a search robot following links and downloading page content. Indexing: analyzing those pages and storing them in the search database.
How Does Googlebot Find Pages?
1. Links: follows links from already known pages
2. Sitemap: reads the XML sitemap
3. Submit URL: manual submission via Search Console
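The sitemap route is just an XML file listing the URLs you want crawled. A minimal sketch, where `example.com` and the dates are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want the bot to discover -->
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/shoes</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```

Upload the file to your site root and submit its URL in Search Console.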
Three Conditions for Page Indexing
| Condition | How to check |
|---|---|
| ✅ The page has inbound links | Ahrefs / GSC → Internal links report |
| ✅ No Disallow in robots.txt | Check robots.txt in GSC |
| ✅ No noindex tag on the page | Look for `<meta name="robots" content="noindex">` in the page source |
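The third check can be automated. A minimal sketch that scans a page's HTML for a robots meta tag containing `noindex` (a simplified regex check; a production audit would use a real HTML parser):

```python
import re

def has_noindex(html: str) -> bool:
    """Return True if the page has a <meta name="robots"> tag with 'noindex'."""
    # Look at each meta tag individually so attribute order doesn't matter.
    for tag in re.findall(r"<meta[^>]+>", html, flags=re.IGNORECASE):
        if re.search(r'name=["\']robots["\']', tag, re.IGNORECASE) and \
           re.search(r'content=["\'][^"\']*noindex', tag, re.IGNORECASE):
            return True
    return False

# has_noindex('<meta name="robots" content="noindex, nofollow">') -> True
# has_noindex('<meta name="robots" content="index, follow">')     -> False
```

Note that Google also honors an `X-Robots-Tag: noindex` HTTP header, which this HTML-only check would miss.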
Crawl Budget
Google cannot crawl your site without limit. Each site is allocated a crawl budget: the number of pages the bot will visit per session. For large sites, it's critical to spend that budget on the right pages.
How to avoid wasting crawl budget:
✅ Block filter/sort pages in robots.txt
✅ Fix 404 pages with 301 redirects
✅ Exclude duplicates via canonical tags
✅ Avoid infinite URLs with parameters
✅ Submit an XML sitemap in GSC
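The first tip, blocking filter and sort pages, can look like this in robots.txt (a sketch; the `sort` and `filter` parameter names are placeholders for whatever your site actually uses, and `example.com` is hypothetical):

```
User-agent: *
# Keep the bot out of parameter-generated filter/sort pages
Disallow: /*?sort=
Disallow: /*?filter=

Sitemap: https://example.com/sitemap.xml
```

Google supports the `*` wildcard in Disallow rules, so one line can cover every URL carrying that parameter.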
🧑‍💻
Alex audits a client's site
"Look โ you have 50,000 filter pages in the index. Googlebot burns its entire budget on them and never reaches the important product pages. That's why you have no traffic."
🎯 Remember: crawling ≠ indexing. The bot can visit a page and not add it to the index (if the content is weak or there's a noindex tag). Track both metrics in Google Search Console.
🎮 Test yourself: select the conditions required for a page to be indexed!
Lesson Task
Test your knowledge and earn +20 XP