Decent Partner: XML to Database feed parsing

Purpose

This document explains the conceptual process of parsing XML sitemaps from WordPress (or a similar CMS) and storing the content URLs they list in a database. The goal is to provide a clear framework for differentiating between sitemap indexes and URL sitemaps, and to outline how a system can navigate from the top-level feed down to actual content URLs.

Core Concepts

  • Sitemap Index: An XML file whose root element is <sitemapindex>. It contains multiple <sitemap> entries, each pointing to another sitemap file.
  • URL Sitemap: An XML file whose root element is <urlset>. It contains multiple <url> entries, each pointing directly to a content page.
  • Recursive Navigation: The process of starting at the index, following each <loc> link, and continuing until only <url> entries remain.
  • Database Storage: The system stores the discovered URLs, along with metadata such as <lastmod>, rather than storing every intermediate sitemap file.
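
As an illustration of the first two concepts, the root element alone is enough to tell the two file types apart. The following is a minimal sketch using Python's standard library; the function names and the example.com sample documents are assumptions made for illustration, not part of the actual system.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def sitemap_kind(xml_text: str) -> str:
    """Return 'index' for a <sitemapindex> root and 'urlset' for a <urlset> root."""
    root = ET.fromstring(xml_text)
    name = local_name(root.tag)
    if name == "sitemapindex":
        return "index"
    if name == "urlset":
        return "urlset"
    raise ValueError(f"not a sitemap: root element is <{name}>")


# Hypothetical sample documents (example.com is a placeholder domain).
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/wp-sitemap-posts-post-1.xml</loc></sitemap>
</sitemapindex>"""

URLSET_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/hello-world/</loc></url>
</urlset>"""

print(sitemap_kind(INDEX_XML))
print(sitemap_kind(URLSET_XML))
```

Comparing only the local name keeps the check working whether or not the file declares the sitemap namespace.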

Process Overview

  1. Entry Point: Begin with the top-level sitemap (commonly /wp-sitemap.xml).
  2. Identify Type: Inspect the root element.
    • If <sitemapindex> → treat as a directory of other sitemaps.
    • If <urlset> → treat as a list of content URLs.
  3. Branching:
    • For <sitemap> entries, follow each <loc> link to another XML file.
    • For <url> entries, collect the <loc> values as actual content URLs.
  4. Recursion: Continue following <sitemap> entries until only <url> entries remain.
  5. Storage: Insert the collected URLs and metadata into the database. The top-level sitemap itself can be stored as a reference, but the key data is the list of content URLs.
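
The steps above can be sketched as a small recursive function. This is an illustrative sketch, not the system's actual implementation: `collect_urls`, the injected `fetch` callable, and the in-memory `PAGES` dictionary are assumptions, and the code assumes each file declares the standard sitemap namespace. In practice `fetch` would perform an HTTP request.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set, Tuple

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[Tuple[str, Optional[str]]]:
    """Walk from a sitemap down to (content URL, lastmod) pairs."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    kind = root.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
    results: List[Tuple[str, Optional[str]]] = []

    if kind == "sitemapindex":
        # Directory of other sitemaps: follow each <loc>.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                results.extend(collect_urls(loc.text.strip(), fetch, seen))
    elif kind == "urlset":
        # List of content URLs: collect <loc> and optional <lastmod>.
        for url in root.findall("sm:url", NS):
            loc = url.findtext("sm:loc", namespaces=NS)
            lastmod = url.findtext("sm:lastmod", namespaces=NS)
            if loc:
                results.append((loc.strip(), lastmod.strip() if lastmod else None))
    return results


# Hypothetical in-memory site standing in for real HTTP responses.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
PAGES = {
    "https://example.com/wp-sitemap.xml": f"""<sitemapindex {XMLNS}>
      <sitemap><loc>https://example.com/wp-sitemap-posts-post-1.xml</loc></sitemap>
      <sitemap><loc>https://example.com/wp-sitemap-posts-page-1.xml</loc></sitemap>
    </sitemapindex>""",
    "https://example.com/wp-sitemap-posts-post-1.xml": f"""<urlset {XMLNS}>
      <url><loc>https://example.com/hello-world/</loc>
           <lastmod>2024-01-15T10:00:00+00:00</lastmod></url>
    </urlset>""",
    "https://example.com/wp-sitemap-posts-page-1.xml": f"""<urlset {XMLNS}>
      <url><loc>https://example.com/about/</loc></url>
    </urlset>""",
}

for url, lastmod in collect_urls("https://example.com/wp-sitemap.xml", PAGES.__getitem__):
    print(url, lastmod)
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access, and the `seen` set prevents an endless loop if a malformed feed points back at itself.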

Conceptual Flow

  wp-sitemap.xml
     ├── wp-sitemap-posts-post-1.xml
     │       ├── URL 1
     │       ├── URL 2
     │       └── ...
     ├── wp-sitemap-posts-page-1.xml
     │       ├── URL A
     │       ├── URL B
     │       └── ...
     └── wp-sitemap-taxonomies-category-1.xml
             ├── Category URL X
             └── Category URL Y
    

Key Differentiation

The simplest way to differentiate between sitemap types is by checking which entries appear inside the root element:

  • <sitemap> → leads to further sitemaps.
  • <url> → leads directly to content pages.

Best Practices

  • Always treat the top-level sitemap as the authoritative entry point.
  • Do not rely solely on file naming conventions (e.g., -post-1.xml), as they vary between sites and plugins.
  • Use the XML schema itself (<sitemapindex> vs. <urlset>) to determine behavior.
  • Store URLs with metadata such as <lastmod> for freshness tracking.
  • Refresh periodically by re-parsing the index sitemap, ensuring new content is discovered.
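
To illustrate the storage and refresh practices, the sketch below keeps one row per content URL and overwrites it on each re-parse, so a newer <lastmod> replaces the old one. The table name, column names, and helper functions are hypothetical, and SQLite stands in for whatever database the system actually uses.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the URL store. The schema here is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS content_urls (
               url       TEXT PRIMARY KEY,
               lastmod   TEXT,
               last_seen TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def upsert_urls(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[str, Optional[str]]],
) -> None:
    """Insert new URLs and replace existing rows with the latest lastmod."""
    conn.executemany(
        "INSERT OR REPLACE INTO content_urls (url, lastmod) VALUES (?, ?)",
        rows,
    )
    conn.commit()


conn = open_store()
# First crawl discovers two pages.
upsert_urls(conn, [("https://example.com/a/", "2024-01-01"),
                   ("https://example.com/b/", None)])
# A later refresh sees an updated lastmod for the first page.
upsert_urls(conn, [("https://example.com/a/", "2024-02-01")])

for row in conn.execute("SELECT url, lastmod FROM content_urls ORDER BY url"):
    print(row)
```

Using the URL as the primary key means repeated refreshes never create duplicates, and `last_seen` is reset whenever a row is rewritten, which makes it possible to spot URLs that have dropped out of the sitemap.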

Analogy

Think of the sitemap index as a table of contents. Each child sitemap is a chapter, and each <url> entry is a page. You don’t need to store every chapter file separately; you just need to know how to navigate from the table of contents down to the pages.

Conclusion

By differentiating between <sitemap> and <url> elements, a system can reliably traverse from the top-level sitemap index down to the actual content URLs. Because only the content URLs and their metadata are stored, the database stays compact, and the traversal follows the same sitemap protocol that search engines use.

Appendix: Proof of the Pudding in Domain Maps

The effectiveness of our sitemap ingestion and keyword mapping process can be demonstrated directly in the Domain Map.

How the Domain Map Functions

  • Entry Point: Each partner domain in the system has a link to its Domain Map.
  • Keyword Mapping: The Domain Map displays all extracted keywords associated with that domain.
  • Page Associations: Each keyword is linked to the specific pages discovered through the sitemap crawl.
  • Navigation: Users can click a keyword to see the list of pages where it appears, confirming the keyword-to-page relationship.

Why This Is Proof

The Domain Map is the visible outcome of the entire pipeline:

  1. Sitemap index is parsed and child sitemaps expanded.
  2. Content URLs are collected and stored.
  3. Keywords are extracted from each page.
  4. Keywords are mapped back to their originating pages.
  5. The Domain Map presents this mapping in a user-facing view.
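
Step 4 amounts to building an inverted index from keywords to pages. The sketch below assumes keyword extraction has already produced a list of keywords per page; how the real system extracts keywords is not described here, so `build_keyword_map` and the sample data are purely illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_keyword_map(page_keywords: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Invert a page -> keywords mapping into keyword -> sorted list of pages."""
    keyword_map = defaultdict(set)
    for url, keywords in page_keywords.items():
        for keyword in keywords:
            # Normalize so "Grammar" and "grammar " land on the same entry.
            keyword_map[keyword.strip().lower()].add(url)
    return {kw: sorted(urls) for kw, urls in sorted(keyword_map.items())}


# Hypothetical extraction results for two pages found by the sitemap crawl.
page_keywords = {
    "https://example.com/hello-world/": ["Grammar", "verbs"],
    "https://example.com/about/": ["grammar"],
}

for keyword, pages in build_keyword_map(page_keywords).items():
    print(keyword, pages)
```

Each key in the result corresponds to one clickable keyword in the Domain Map, and its list holds the pages a user would see after clicking it.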

Validation

By visiting the Domain Map linked to a partner domain, one can verify:

  • That the sitemap ingestion process correctly discovered all relevant pages.
  • That keyword extraction is functioning as intended.
  • That the keyword-to-page mapping is accurate and complete.

Conclusion

The Domain Map acts as the “proof of the pudding”: a direct, navigable representation of how sitemaps, content URLs, and keyword indexing converge into a coherent partner domain listing. It is both a diagnostic tool and a demonstration of system integrity.
