We Built a Technical SEO Audit Tool. Then We Pointed It at Ourselves.

💡 TL;DR (Too Long; Didn't Read)

Key takeaways in 60 seconds:

The Tool: We built a 32-check, 5-category technical SEO auditor with a TypeScript runner and a regex-based HTML parser. Zero external parsing dependencies. Full audit completes in ~600ms.

The Stream: The web version uses Server-Sent Events (SSE) to stream check results in real-time — no WebSocket, no Socket.io, no polling. The browser's native EventSource API handles everything.

The Trap: When your server fetches user-provided URLs, you've built an open proxy. We spent more time on SSRF protection (6 validation layers including DNS rebinding defense) than on the entire scoring engine.

The Dogfood: First run against our own tool page scored 65 / 100 — REGULAR. Infrastructure (hreflang, sitemap) was perfect. Page-level SEO (schema, structure) was catastrophic — 0% on JSON-LD because it simply didn't exist.

The Thesis: Infrastructure SEO is stateless and declarative — when it breaks, you notice. Page-level SEO is embedded in React components and invisible by default. That's why you need automated checks.

The Architecture

There's a specific kind of silence that happens when you run your own diagnostic tool against your own production site and the terminal prints SCORE: 65 / 100 — REGULAR.

Not "good." Not "excellent." Regular. The kind of grade that means "technically passing, but nobody's impressed."

This is the story of building that tool — a 32-check, 5-category technical SEO auditor — and what it found when we turned it inward. The architecture decisions, the security landmines, and the gap between "we have a sitemap" and "our SEO actually works."

Why Existing Tools Fell Short

If you've ever deployed a Next.js site with internationalization and then opened Google Search Console to find your Hindi locale indexed while your actual Portuguese content sits at position 42, you understand the frustration.

Lighthouse measures performance and accessibility; its SEO audits validate hreflang syntax on the single page you give it but won't tell you whether your locales are wired together correctly. PageSpeed Insights cares about Core Web Vitals but won't tell you that your tool page is missing an <h1> because the React component renders the title as a styled <div>. Google Rich Results Test validates JSON-LD one page at a time, manually. Screaming Frog does everything, for £199/year, and requires a desktop app that runs a full crawl when all you need is "did my last deploy break the hreflang?"

Verified Source: Screaming Frog Official Pricing

Screaming Frog SEO Spider licence costs £199/year per user.

We needed something specific: a tool that a developer runs after git push to verify that technical SEO didn't regress. Not a crawler. Not a monitoring platform. A diagnostic snapshot.

So we built one.

The Stack: Four Files That Do Everything

The entire engine is four TypeScript files with clean separation of concerns:

File	LOC	Responsibility
`runner.ts`	563	Core audit engine — 32 checks across 5 categories, progress callbacks
`parser.ts`	240	Regex-based HTML/XML parser — extracts title, canonical, hreflang, OG, H1/H2, JSON-LD, images, links
`fetcher.ts`	103	HTTP client with timeout (8s), body size limit (2MB), caching
`url-validator.ts`	111	SSRF protection — 6 validation layers including DNS resolution

~1,000 lines of TypeScript total. No test framework. No external parsing library. No browser binaries.

Each check is a typed function inside the runner that receives a fetcher and a parser, runs its logic, and returns a CheckResult:

typescript

// Inside runner.ts, simplified from the actual implementation.
// The function name and signature here are illustrative.
function checkCanonical(html: string): CheckResult {
  const canonical = parser.canonical(html);
  if (!canonical) {
    return { pass: false, details: 'No <link rel="canonical"> found' };
  }
  return { pass: true, details: `Canonical: ${canonical}` };
}

Adding a new check means writing a function and registering it in the runner's category map. No configuration files, no indirection layers — just TypeScript functions calling TypeScript functions.

The 32 Checks

Five categories, each testing a different dimension of technical SEO:

Category	Checks	What it catches
`hreflang`	6	Missing canonical, orphaned locales, duplicate hreflang tags
`sitemap`	7	Broken sitemap, missing robots.txt, no lastmod
`metadata`	8	Long titles, missing H1, no OG image
`schema`	5	Missing JSON-LD, invalid types, incomplete markup
`structure`	6	No H2 hierarchy, missing internal links, thin content

Each check has a weight (3–10 points). Pass gets the full weight; fail gets zero. Category score is the percentage of earned points. Global score is the arithmetic mean of all five categories.

This is deliberately simple. No partial credit, no weighted categories, no machine learning confidence scores. When a developer sees MTA-006: FAIL — No H1 tag found, there's exactly one thing to do about it.
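The scoring rules above are small enough to sketch in full. This is a minimal illustration, not the runner's actual code: the `ScoredCheck` shape and both function names are hypothetical, and leaving skipped checks out of the denominator is an assumption the article does not confirm.

```typescript
// Hypothetical shapes: the article names CheckResult but does not show its fields.
type Status = 'pass' | 'fail' | 'skip';

interface ScoredCheck {
  id: string;     // e.g. "MTA-006"
  weight: number; // 3 to 10 points per the article
  status: Status;
}

// Category score: earned points as a percentage of possible points.
// Assumption: skipped checks are left out of the denominator.
function categoryScore(checks: ScoredCheck[]): number {
  const counted = checks.filter((c) => c.status !== 'skip');
  const possible = counted.reduce((sum, c) => sum + c.weight, 0);
  if (possible === 0) return 0;
  const earned = counted
    .filter((c) => c.status === 'pass')
    .reduce((sum, c) => sum + c.weight, 0);
  return Math.round((earned / possible) * 100);
}

// Global score: unweighted arithmetic mean of the category scores.
function globalScore(categoryScores: number[]): number {
  if (categoryScores.length === 0) return 0;
  const total = categoryScores.reduce((sum, s) => sum + s, 0);
  return Math.round(total / categoryScores.length);
}

// The five category scores from the dogfooding run: (100 + 100 + 69 + 0 + 55) / 5 = 64.8
console.log(globalScore([100, 100, 69, 0, 55])); // 65
```

Because categories are not weighted, a single category at 0% drags the global score down by a full 20 points no matter how few checks it contains.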

Why Regex, Not a DOM Parser

The parser is 240 lines of targeted regex extraction. No Cheerio. No jsdom. No Playwright. Just patterns that match the specific HTML structures that matter for SEO:

typescript

// parser.ts — extracts exactly what SEO checks need
export function parseHtml(html: string) {
  const title = html.match(/<title[^>]*>([^<]*)<\/title>/i)?.[1]?.trim();
  const canonical = html.match(
    /<link[^>]+rel=["']canonical["'][^>]+href=["']([^"']+)["']/i
  )?.[1];
  const h1s = [...html.matchAll(/<h1[^>]*>([\s\S]*?)<\/h1>/gi)]
    .map(m => m[1].replace(/<[^>]+>/g, '').trim());
  // ... 15 more extractors
}

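This works because SEO signals live in predictable, well-structured HTML patterns. A canonical tag is a <link rel="canonical" href="...">. An hreflang tag is a <link rel="alternate" hreflang="..." href="...">. JSON-LD is always inside <script type="application/ld+json">. You don't need a full DOM tree to find these; you need pattern matching. Framework-generated markup emits these tags in a consistent shape, but HTML allows attributes in any order, so a pattern like the canonical one above only matches when rel comes before href.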

The trade-off is real: regex can't handle deeply nested or malformed HTML. But we're parsing the SSR output, the HTML that the server sends on first request. This is what Googlebot sees before it executes any JavaScript. If a React component renders hreflang tags client-side after hydration, Googlebot might not see them, and neither will our tool. That's a feature, not a bug.

The payoff: zero external dependencies in the parser, no node_modules bloat, and parsing completes in single-digit milliseconds.
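One limit of the canonical pattern shown earlier is that it expects rel before href. A possible way to stay regex-only while tolerating either order is to match each <link> tag first and then read its attributes individually. The sketch below is an assumption about how that could look; `parseAttributes` and `findCanonical` are hypothetical names, not functions from parser.ts.

```typescript
// Parse the attributes of one tag's source text into a map keyed by lowercase name.
function parseAttributes(tag: string): Map<string, string> {
  const attrs = new Map<string, string>();
  // name = "double-quoted" | 'single-quoted' | unquoted
  const attrPattern = /([a-zA-Z][\w:-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))/g;
  for (const m of tag.matchAll(attrPattern)) {
    attrs.set(m[1].toLowerCase(), m[2] ?? m[3] ?? m[4] ?? '');
  }
  return attrs;
}

// Find the canonical URL regardless of whether rel or href comes first.
function findCanonical(html: string): string | undefined {
  for (const m of html.matchAll(/<link\b[^>]*>/gi)) {
    const attrs = parseAttributes(m[0]);
    const rel = (attrs.get('rel') ?? '').toLowerCase().split(/\s+/);
    if (rel.includes('canonical')) return attrs.get('href');
  }
  return undefined;
}

console.log(findCanonical('<link href="https://example.com/a" rel="canonical">'));
```

This still shares the regex approach's blind spots: a literal ">" inside an attribute value would cut the tag match short.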

The Fetcher: Concurrency Without Complexity

The HTTP layer handles three things that matter for a diagnostic tool: caching, concurrency control, and TTFB measurement.

javascript

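class Fetcher {
  constructor({ concurrency = 5, timeoutMs = 10000 } = {}) {
    this.concurrency = concurrency; // max requests in flight
    this.timeoutMs = timeoutMs;     // per-request timeout
    this.cache = new Map();         // "METHOD:url" -> pending or settled response
    this.activeRequests = 0;
    this.queue = [];
  }

  async fetch(url, opts = {}) {
    const cacheKey = `${opts.method || 'GET'}:${url}`;
    if (this.cache.has(cacheKey)) return this.cache.get(cacheKey);
    // _enqueue (not shown) waits for a free slot, then performs the request.
    const pending = this._enqueue(url, opts);
    this.cache.set(cacheKey, pending); // concurrent callers share one request
    return pending;
  }
}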

The queue is a manual semaphore — no external library. When activeRequests < concurrency, the task runs immediately. Otherwise it's pushed to the queue and dequeued when a slot opens. This keeps localhost from being hammered during development and keeps production Firebase Hosting from throttling us.

Caching is keyed on method + URL. The same page fetched for hreflang checks and metadata checks hits the network once. The full audit completes in 606 milliseconds against a live site. That's not a typo.
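The slot-handoff behaviour of the queue can be sketched as a small counting semaphore with a promise cache. This is an illustration under stated assumptions rather than the article's Fetcher: `LimitedRunner`, `run`, and `schedule` are hypothetical names, and the task is any async function instead of a real HTTP request.

```typescript
// A minimal counting semaphore plus a promise cache. All names are illustrative.
class LimitedRunner<T> {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly cache = new Map<string, Promise<T>>();
  private readonly concurrency: number;

  constructor(concurrency: number) {
    this.concurrency = concurrency;
  }

  // Run `task` under the concurrency limit, reusing the result for a repeated key.
  run(key: string, task: () => Promise<T>): Promise<T> {
    const cached = this.cache.get(key);
    if (cached) return cached;
    const result = this.schedule(task);
    this.cache.set(key, result); // cache the promise so concurrent callers share it
    return result;
  }

  private async schedule(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.concurrency) {
      // No free slot: wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next();   // pass the slot directly to the next waiter
      else this.active--; // nobody waiting: free the slot
    }
  }
}
```

Handing the slot straight to the next waiter keeps the active count accurate without a decrement followed by an increment. Note that this sketch caches rejected promises too, so a failed request stays failed for the lifetime of the runner.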

Shipping to the Web: SSE Over REST

The CLI version prints results as they happen. For the web version, we needed the same experience: the user sees each check completing in real-time, not a spinner followed by a wall of data.

Server-Sent Events (SSE) are the natural fit. The protocol is simple — an HTTP response with Content-Type: text/event-stream that the server keeps open, sending named events:

event: progress
data: {"phase":"hreflang","check":3,"total":6,"id":"HRF-003","status":"pass"}

event: progress
data: {"phase":"metadata","check":6,"total":8,"id":"MTA-006","status":"fail","details":"No H1 tag found"}

event: category_done
data: {"id":"metadata","score":69,"label":"REGULAR"}
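For readers who want the wire format pinned down, the sample events above can be described with a few types and a one-line serializer. The payload shapes are inferred from the samples, so the real types may carry more fields; `CheckProgress`, `CategoryDone`, and `toSseFrame` are hypothetical names.

```typescript
// Payload shapes inferred from the sample events; the real types may differ.
interface CheckProgress {
  phase: string;
  check: number;
  total: number;
  id: string;
  status: 'pass' | 'fail' | 'skip';
  details?: string;
}

interface CategoryDone {
  id: string;
  score: number;
  label: string;
}

type AuditEvent =
  | { event: 'progress'; data: CheckProgress }
  | { event: 'category_done'; data: CategoryDone };

// Serialize one event as an SSE frame: an "event:" line, a single "data:" line,
// and a blank line that terminates the frame.
function toSseFrame(e: AuditEvent): string {
  return `event: ${e.event}\ndata: ${JSON.stringify(e.data)}\n\n`;
}

console.log(
  toSseFrame({ event: 'category_done', data: { id: 'metadata', score: 69, label: 'REGULAR' } })
);
```

JSON.stringify without an indent argument never emits a raw newline, which is what lets the payload fit on a single data line.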

The Next.js Route Handler streams these events as each check completes:

typescript

// app/api/tools/seo-health-check/route.ts
import { NextRequest } from 'next/server';
// `audit` is the runner's entry point; its import is omitted here.

export async function GET(req: NextRequest) {
  const url = req.nextUrl.searchParams.get('url');
  if (!url) return new Response('Missing url parameter', { status: 400 });
  // ... validation, rate limiting ...

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      const send = (event: string, data: unknown) => {
        controller.enqueue(
          encoder.encode(`event: ${event}\ndata: ${JSON.stringify(data)}\n\n`)
        );
      };

      send('connected', { message: 'Audit started' });

      try {
        const report = await audit(url, {
          onCheckComplete: (p) => send('progress', p),
          onCategoryDone: (c) => send('category_done', c),
        });
        send('complete', report);
      } finally {
        // Always end the stream, even if the audit throws.
        controller.close();
      }
    },
  });

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
    },
  });
}

On the client, a React hook wraps the native EventSource API:

typescript

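import { useCallback, useState } from 'react';

function useAudit() {
  // `unknown` stands in for the runner's payload types, which are not shown here.
  const [progress, setProgress] = useState<unknown[]>([]);
  const [report, setReport] = useState<unknown>(null);

  const startAudit = useCallback((url: string) => {
    setProgress([]);
    setReport(null);
    const es = new EventSource(
      `/api/tools/seo-health-check?url=${encodeURIComponent(url)}`
    );

    es.addEventListener('progress', (e) => {
      setProgress((prev) => [...prev, JSON.parse(e.data)]);
    });

    es.addEventListener('complete', (e) => {
      setReport(JSON.parse(e.data));
      es.close();
    });

    // Without this, EventSource reconnects after a dropped connection,
    // and each reconnect would start a fresh audit.
    es.onerror = () => es.close();
  }, []);

  return { progress, report, startAudit };
}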

No WebSocket. No Socket.io. No polling. The browser's built-in EventSource reconnects automatically when a connection drops, but for a one-shot, sub-second audit a reconnect would just start a new audit, so we've never relied on it.

Verified Source: MDN Web Docs — Server-Sent Events

SSE, originally published as a W3C Recommendation and now maintained in the WHATWG HTML Living Standard, lets servers push data to web clients over HTTP.

Why Not WebSocket?

SSE is unidirectional: server → client. That's all we need. The client sends the URL once (in the query string), and the server streams results back. WebSocket adds bidirectional capability we'd never use, at the cost of a more complex protocol, manual heartbeats, and the need to handle the upgrade handshake.

SSE also works through HTTP proxies and CDNs without special configuration. When you deploy on Firebase Hosting with Cloudflare in front, that matters.

Security: Your Server Is Now a Proxy

Here's the part most tutorials skip.

When your server fetches URLs that users provide, you've built an open proxy. Without protection, someone can submit http://169.254.169.254/latest/meta-data/iam/security-credentials/ and your server will happily fetch your cloud provider's instance metadata, including temporary IAM credentials.

This is Server-Side Request Forgery (SSRF), and it's the reason we spent more time on URL validation than on the entire scoring engine.

Six Layers of Validation

typescript

import { promises as dns } from 'node:dns';
import { isIP } from 'node:net';
// BLOCKED_HOSTS and isPrivateIP are defined elsewhere in url-validator.ts.

export async function validateUrl(input: string) {
  // 1. Parse: reject malformed URLs
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return { valid: false, error: 'Malformed URL' };
  }

  // 2. Protocol: HTTP(S) only
  if (!['http:', 'https:'].includes(url.protocol))
    return { valid: false, error: 'Only HTTP and HTTPS allowed' };

  // 3. Port: 80/443 only (url.port is '' when the scheme's default port is used)
  if (url.port && !['80', '443'].includes(url.port))
    return { valid: false, error: 'Only ports 80 and 443 allowed' };

  // 4. No IP literals: force DNS resolution.
  //    url.hostname keeps the brackets on IPv6 literals, so strip them first.
  const host = url.hostname.replace(/^\[|\]$/g, '');
  if (isIP(host))
    return { valid: false, error: 'IP addresses not allowed' };

  // 5. Blocked hosts: cloud metadata endpoints
  if (BLOCKED_HOSTS.includes(host.toLowerCase()))
    return { valid: false, error: 'Hostname blocked' };

  // 6. DNS resolution: verify every resolved IPv4 address is public
  let addresses: string[];
  try {
    addresses = await dns.resolve4(host);
  } catch {
    return { valid: false, error: 'DNS resolution failed' };
  }
  for (const ip of addresses) {
    if (isPrivateIP(ip))
      return { valid: false, error: 'Resolves to private IP' };
  }

  return { valid: true, url };
}

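Layer 6 is the critical one. An attacker can register a domain that resolves to 169.254.169.254, bypassing hostname checks; validating the resolved addresses catches that. To hold up against DNS rebinding, where the record changes between the check and the request, the fetcher also has to connect to the address that was validated rather than resolve the hostname a second time.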

Verified Source: OWASP — Server-Side Request Forgery Prevention

OWASP recommends validating resolved IP addresses after DNS lookup, not just hostnames, to defend against DNS rebinding.
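The validator calls an isPrivateIP helper that the article does not show. One way such a check could be written for IPv4 is sketched below. The range list is a reasonable starting set rather than an exhaustive one, `isPrivateIPv4` and `ipv4ToInt` are hypothetical names, and treating unparseable input as private is a design assumption (fail closed).

```typescript
// Convert dotted-quad IPv4 text to an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// [network, prefix length] pairs for ranges a public-URL fetcher should refuse.
const BLOCKED_V4_RANGES: Array<[string, number]> = [
  ['0.0.0.0', 8],      // "this network"
  ['10.0.0.0', 8],     // RFC 1918 private
  ['100.64.0.0', 10],  // carrier-grade NAT
  ['127.0.0.0', 8],    // loopback
  ['169.254.0.0', 16], // link-local, includes cloud metadata endpoints
  ['172.16.0.0', 12],  // RFC 1918 private
  ['192.168.0.0', 16], // RFC 1918 private
];

function isPrivateIPv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // fail closed on anything unparseable
  return BLOCKED_V4_RANGES.some(([network, prefix]) => {
    const base = ipv4ToInt(network);
    if (base === null) return false;
    // Plain arithmetic avoids the signed 32-bit pitfalls of bitwise operators.
    const size = 2 ** (32 - prefix);
    return Math.floor(addr / size) === Math.floor(base / size);
  });
}

console.log(isPrivateIPv4('169.254.169.254')); // true
console.log(isPrivateIPv4('8.8.8.8'));         // false
```

A production version would also need an IPv6 counterpart (loopback, link-local, unique-local, and IPv4-mapped addresses) if the fetcher can connect over IPv6.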

We also enforce:

Rate limiting: 3 audits per IP per hour, 20 global per hour, 2 concurrent max.
Timeouts: 8s per fetch, 45s total.
Body limits: 2MB per response (a sitemap shouldn't be larger).
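The rate limits listed above could be enforced with a small in-memory sliding window. The sketch below is an assumption about one possible shape: `AuditLimiter` is a hypothetical class, the concurrency cap is treated as global rather than per IP, and the clock is passed in so the logic is easy to test.

```typescript
const WINDOW_MS = 60 * 60 * 1000; // one hour
const MAX_PER_IP = 3;
const MAX_GLOBAL = 20;
const MAX_CONCURRENT = 2; // assumption: a global cap, not a per-IP one

class AuditLimiter {
  private perIp = new Map<string, number[]>(); // ip -> start timestamps in window
  private global: number[] = [];               // all start timestamps in window
  private running = 0;

  // Returns true and records the audit if `ip` may start one at time `now` (ms).
  tryStart(ip: string, now: number): boolean {
    const cutoff = now - WINDOW_MS;
    this.global = this.global.filter((t) => t > cutoff);
    const mine = (this.perIp.get(ip) ?? []).filter((t) => t > cutoff);
    if (
      this.running >= MAX_CONCURRENT ||
      this.global.length >= MAX_GLOBAL ||
      mine.length >= MAX_PER_IP
    ) {
      this.perIp.set(ip, mine);
      return false;
    }
    mine.push(now);
    this.perIp.set(ip, mine);
    this.global.push(now);
    this.running++;
    return true;
  }

  // Call when an audit finishes (success or failure) to release its slot.
  finish(): void {
    if (this.running > 0) this.running--;
  }
}
```

State held in process memory like this only limits a single server instance; a deployment that runs several instances would need a shared store to enforce the same numbers.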

The Dogfooding: Score 65

The moment of truth. We pointed the tool at its own page — gsstk.gem98.com/pt-BR/tools/seo-tools/seo-health-check — and got this:

  ▸ hreflang    100%  ██████████████████  EXCELLENT
  ▸ sitemap     100%  ██████████████████  EXCELLENT
  ▸ metadata     69%  ████████████░░░░░░  REGULAR
  ▸ schema        0%  ░░░░░░░░░░░░░░░░░░  CRITICAL
  ▸ structure    55%  ██████████░░░░░░░░  REGULAR

  SCORE: 65 / 100  REGULAR
  21 passed · 6 failed · 5 skipped

Two categories perfect. One at zero. The story of why is more instructive than the numbers.

What We Got Right: The Infrastructure Layer

Hreflang and sitemap scored 100%. But getting there wasn't the smooth process the score suggests.

Eight months earlier, we'd had a sitemap that returned binary data because Firebase Hosting was serving the XML with the wrong Content-Type header. Before that, we'd had a sitemap.xml that was truncated — a token limit in our generation script silently cut the file at ~200 URLs, leaving hundreds of pages undiscoverable. And Cloudflare's AI Crawl Control was blocking bots we wanted to let through, including some SEO crawlers.

The robots.txt went through its own evolution. An early version forgot to Disallow the ghost locales — /hi/, /en-GB/, /fr/ — which meant Googlebot was dutifully crawling machine-translated content that diluted our authority across 15 locales when we only had real content in two.

These weren't catastrophic failures. They were the kind of slow, invisible erosion that happens when infrastructure is set up once and never validated again. The sitemap existed. The robots.txt existed. They just didn't work correctly. It took building an automated checker to catch the subtle breakage — and even now, the only reason they're at 100% is that we fixed each issue as the tool flagged it over several iterations.

The lesson: having a sitemap is not the same as having a correct sitemap. The infrastructure checks pass today because automation caught what manual review missed, repeatedly, over months.

What We Got Wrong: Everything Else

The metadata category scored 69%. The failures were surgical but embarrassing:

MTA-003: Title has 61 characters (max recommended: 60). One character over. The tool page title was "SEO Health Check — Verificador de SEO Técnico | gsstk" — 61 characters. We'd carefully crafted it for keyword density and forgot to count.

MTA-006: No H1 tag found. The tool page renders its title inside a React component that uses a styled <div> instead of a semantic <h1>. Visually identical. Semantically invisible to Googlebot. This is the kind of bug that only exists in the gap between "looks right" and "is right" — the exact gap that a regex-based parser exposes because it sees the SSR output, not the styled page.

MTA-008: No og:image. We'd built the entire sharing infrastructure — OG title, OG description — and forgot the image. Every tool page on gsstk has this same gap. A global fix in the tool layout component would resolve it for every tool at once.

The schema category scored 0%. Zero. Not because the JSON-LD was wrong — because it didn't exist at all. No <script type="application/ld+json"> anywhere on the page. We'd planned WebApplication schemas in the PRD, discussed the exact fields (applicationCategory: "DeveloperApplication", offers.price: "0"), written the components — and never shipped them. The checks for valid JSON, schema.org context, and @type validation all returned skip because there was nothing to validate.

The structure category (55%) revealed that the tool page was functionally a single-page app with no static content for Googlebot to index. No H2 headings. No internal links to other tools or blog posts. No content hierarchy at all. The React component renders a form, waits for user input, and shows results — none of which exists in the SSR output. To Google, the page is practically empty.

The Architecture of Blindness

The score of 65 tells a story about two different kinds of technical debt.

The infrastructure layer (hreflang, sitemap, robots.txt) scored 100% because it's stateless and declarative. You write a config, generate files, deploy. When it breaks, it breaks visibly — Google Search Console sends you angry emails about sitemap errors. And when you fix it, the fix is permanent.

The page-level SEO (metadata, schema, structure) scored 41% on average because it's embedded in components and invisible by default. Nobody opens a React component and checks whether the H1 is a real <h1> or a <div className="text-3xl font-bold">. Nobody views source after every deploy to verify the JSON-LD rendered. Nobody counts internal links in the SSR output.

This is why the tool exists. Not to replace Google Search Console or Lighthouse, but to catch the class of bugs that live in the gap between what the developer sees in the browser and what the search engine sees in the HTML.

The Six Fixes

Every failure from the audit maps to a concrete, small change:

Check	Fix	Effort
MTA-003	Shorten title by 1 character	1 minute
MTA-006	Change `<div>` to `<h1>` in tool header component	1 minute
MTA-008	Add `og:image` to tool layout metadata	5 minutes
SCH-001	Add `WebApplication` JSON-LD to tool pages	30 minutes
STR-002	Add static H2 sections below the tool form	1 hour
STR-004	Add internal links to related tools and blog posts	1 hour

Total estimated effort: ~3 hours. Projected score after: ≥ 90/100.

The point isn't that these are hard problems. The point is that nobody noticed them for months. The page worked fine. Users could audit URLs. The React components were polished. But the SSR output — the version that matters for SEO — was a skeleton.

What's Next

The gap between what the tool checks today and what it should check is mostly in the schema category. The current version checks "does JSON-LD exist?" but not "are the required fields present?" or "is the @type a valid schema.org type?" When someone ships a WebApplication schema without a name field, the tool gives them 100% on schema. That's misleading. Expanding to full structural validation is the next priority.
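A structural check of the kind described here might look like the following sketch. It is not the tool's implementation: `validateJsonLd` is a hypothetical function, and the required-field list for WebApplication is illustrative, not taken from schema.org or from Google's documentation.

```typescript
// Illustrative required-field lists; not taken from schema.org or the article's tool.
const REQUIRED_FIELDS: Record<string, string[]> = {
  WebApplication: ['name', 'applicationCategory', 'offers'],
};

interface SchemaIssue {
  type: string;
  missing: string[];
}

// Parse one JSON-LD block and report missing required fields for known types.
function validateJsonLd(raw: string): SchemaIssue[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [{ type: '(invalid JSON)', missing: [] }];
  }
  const nodes: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
  const issues: SchemaIssue[] = [];
  for (const node of nodes) {
    if (typeof node !== 'object' || node === null) continue;
    const record = node as Record<string, unknown>;
    const type = record['@type'];
    if (typeof type !== 'string') {
      issues.push({ type: '(missing @type)', missing: ['@type'] });
      continue;
    }
    const required = REQUIRED_FIELDS[type] ?? [];
    const missing = required.filter((field) => !(field in record));
    if (missing.length > 0) issues.push({ type, missing });
  }
  return issues;
}

// A WebApplication block without a name, the case called out above.
console.log(
  validateJsonLd(
    '{"@context":"https://schema.org","@type":"WebApplication","applicationCategory":"DeveloperApplication","offers":{"price":"0"}}'
  )
);
```

The sketch only handles a string @type at the top level; JSON-LD also allows an array of types and nested @graph structures, which a fuller validator would need to walk.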

We're also missing cross-page checks. Verifying that hreflang tags are reciprocal (page A links to page B and page B links back to page A) requires fetching additional pages — which means more latency and more SSRF surface. There's a design tension between completeness and the "sub-second, one URL" promise.
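Once the alternate pages have been fetched and parsed, the reciprocity check itself is a pure comparison. The sketch below assumes the hreflang annotations are already collected into a map, which sidesteps the latency and SSRF questions raised above; `HreflangMap` and `findMissingReturnLinks` are hypothetical names.

```typescript
// page URL -> (hreflang code -> alternate URL), as extracted from each fetched page.
type HreflangMap = Map<string, Map<string, string>>;

interface MissingReturnLink {
  from: string;
  to: string;
}

// Report every A -> B annotation where B was fetched but does not link back to A.
function findMissingReturnLinks(pages: HreflangMap): MissingReturnLink[] {
  const missing: MissingReturnLink[] = [];
  for (const [from, alternates] of pages) {
    for (const to of new Set(alternates.values())) {
      if (to === from) continue; // a self-reference needs no return link
      const back = pages.get(to);
      if (!back) continue;       // target not fetched: cannot judge
      const linksBack = [...back.values()].includes(from);
      if (!linksBack) missing.push({ from, to });
    }
  }
  return missing;
}

const en = 'https://example.com/en/tool';
const pt = 'https://example.com/pt-BR/tool';
const pages: HreflangMap = new Map([
  [en, new Map([['en', en], ['pt-BR', pt]])],
  [pt, new Map([['pt-BR', pt]])], // forgot to link back to the English page
]);
console.log(findMissingReturnLinks(pages));
```

Comparing URLs as exact strings is a simplification; trailing slashes or differing query strings would need normalizing before the comparison is trustworthy.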

Google Search Console: Too Early to Tell

We shipped the page-level SEO fixes on the same day we shipped this article. Google's crawl cycle typically takes days to weeks to reflect structural changes like new JSON-LD or reorganized metadata. Publishing "before and after" numbers today would mean fabricating data to fit a narrative — and we don't do that here.

What we can say: the infrastructure layer (sitemap, hreflang, robots.txt) has been stable and correct for weeks, after months of iterative fixes caught by the tool. The page-level fixes (schema, structure, metadata) went live today.

We'll revisit this in 60–90 days with real numbers. If the data tells a boring story, we'll publish the boring story.