Decent Partner: XML to Database feed parsing
Purpose
This document explains the conceptual process of handling XML sitemaps from WordPress (or similar CMS systems) and storing their contents in a database. The goal is to provide a clear framework for differentiating between sitemap indexes and URL sitemaps, and to outline how a system can navigate from the top-level feed down to actual content URLs.
Core Concepts
- Sitemap Index: An XML file whose root element is <sitemapindex>. It contains multiple <sitemap> entries, each pointing to another sitemap file.
- URL Sitemap: An XML file whose root element is <urlset>. It contains multiple <url> entries, each pointing directly to a content page.
- Recursive Navigation: The process of starting at the index, following each <loc> link, and continuing until only <url> entries remain.
- Database Storage: The system stores the discovered URLs, along with metadata such as <lastmod>, rather than storing every intermediate sitemap file.
Process Overview
- Entry Point: Begin with the top-level sitemap (commonly /wp-sitemap.xml).
- Identify Type: Inspect the root element.
  - If <sitemapindex> → treat as a directory of other sitemaps.
  - If <urlset> → treat as a list of content URLs.
- Branching:
  - For <sitemap> entries, follow each <loc> link to another XML file.
  - For <url> entries, collect the <loc> values as actual content URLs.
- Recursion: Continue following <sitemap> entries until only <url> entries remain.
- Storage: Insert the collected URLs and metadata into the database. The top-level sitemap itself can be stored as a reference, but the key data is the list of content URLs.
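The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the system's actual implementation: the function names and the `fetch` callable (which takes a URL and returns XML text or bytes) are assumptions made for the example, so the traversal logic stays independent of any particular HTTP client.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}urlset'."""
    return tag.rsplit("}", 1)[-1]


def child_text(element, name):
    """Return the stripped text of the first direct child with this local name."""
    for child in element:
        if local_name(child.tag) == name and child.text:
            return child.text.strip()
    return None


def collect_urls(sitemap_url, fetch, seen=None):
    """Walk from sitemap_url down to content URLs.

    fetch is a hypothetical callable: URL -> XML text or bytes.
    Returns a list of (loc, lastmod) tuples; lastmod may be None.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    kind = local_name(root.tag)
    results = []
    if kind == "sitemapindex":
        # Directory of other sitemaps: follow each <loc> recursively.
        for entry in root:
            if local_name(entry.tag) != "sitemap":
                continue
            loc = child_text(entry, "loc")
            if loc:
                results.extend(collect_urls(loc, fetch, seen))
    elif kind == "urlset":
        # List of content URLs: collect <loc> and optional <lastmod>.
        for entry in root:
            if local_name(entry.tag) != "url":
                continue
            loc = child_text(entry, "loc")
            if loc:
                results.append((loc, child_text(entry, "lastmod")))
    else:
        raise ValueError(f"unexpected root element: {kind}")
    return results
```

In a real crawl, `fetch` could wrap something like `urllib.request.urlopen(url).read()`; for testing, a dictionary lookup over canned XML strings is enough. The `seen` set is a defensive addition so that a misconfigured site cannot send the crawler in circles.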
Conceptual Flow
wp-sitemap.xml
├── wp-sitemap-posts-post-1.xml
│   ├── URL 1
│   ├── URL 2
│   └── ...
├── wp-sitemap-posts-page-1.xml
│   ├── URL A
│   ├── URL B
│   └── ...
└── wp-sitemap-taxonomies-category-1.xml
    ├── Category URL X
    └── Category URL Y
Key Differentiation
The simplest way to differentiate between sitemap types is by checking the elements inside:
- <sitemap> → leads to further sitemaps.
- <url> → leads directly to content pages.
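Since <sitemap> entries only appear under a <sitemapindex> root and <url> entries only under a <urlset> root, one possible shortcut is to look at the root element alone, without parsing the whole file. The sketch below does this with the standard library's incremental parser; the function name `sitemap_kind` and its return labels are hypothetical, chosen only for this example.

```python
import io
import xml.etree.ElementTree as ET


def sitemap_kind(xml_bytes):
    """Return 'index', 'urlset', or 'unknown' based on the root element only."""
    labels = {"sitemapindex": "index", "urlset": "urlset"}
    # The first 'start' event is always the root element, so we can stop there.
    for _event, element in ET.iterparse(io.BytesIO(xml_bytes), events=("start",)):
        name = element.tag.rsplit("}", 1)[-1]  # drop '{namespace}' if present
        return labels.get(name, "unknown")
    return "unknown"
```

Stripping the namespace matters here because sitemap files normally declare the sitemaps.org namespace, which makes the parsed tag look like `{http://www.sitemaps.org/schemas/sitemap/0.9}urlset` instead of plain `urlset`.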
Best Practices
- Always treat the top-level sitemap as the authoritative entry point.
- Do not rely solely on file naming conventions (e.g., -post-1.xml), as custom structures may vary.
- Use the XML schema itself (<sitemapindex> vs. <urlset>) to determine behavior.
- Store URLs with metadata such as <lastmod> for freshness tracking.
- Refresh periodically by re-parsing the index sitemap, ensuring new content is discovered.
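To make the storage and refresh practices concrete, here is a small sketch using SQLite from the Python standard library. The table name `content_urls`, its columns, and the function `store_urls` are assumptions for illustration; the actual database and schema used by the system are not described in this document.

```python
import sqlite3

# Hypothetical schema: one row per content URL, keyed on the <loc> value.
SCHEMA = """
CREATE TABLE IF NOT EXISTS content_urls (
    loc     TEXT PRIMARY KEY,
    lastmod TEXT
)
"""


def store_urls(conn, entries):
    """Store (loc, lastmod) pairs and report which ones are new or changed.

    Running this again after a later crawl updates lastmod in place,
    so the returned list shows what needs to be re-processed.
    """
    conn.execute(SCHEMA)
    changed = []
    for loc, lastmod in entries:
        row = conn.execute(
            "SELECT lastmod FROM content_urls WHERE loc = ?", (loc,)
        ).fetchone()
        if row is None:
            # First time this URL has been seen.
            conn.execute(
                "INSERT INTO content_urls (loc, lastmod) VALUES (?, ?)",
                (loc, lastmod),
            )
            changed.append(loc)
        elif row[0] != lastmod:
            # Known URL whose <lastmod> differs from the stored value.
            conn.execute(
                "UPDATE content_urls SET lastmod = ? WHERE loc = ?",
                (lastmod, loc),
            )
            changed.append(loc)
    conn.commit()
    return changed
```

Keeping <lastmod> as the text value found in the sitemap avoids guessing at date formats; the comparison only needs to detect that the value changed between crawls.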
Analogy
Think of the sitemap index as a table of contents. Each child sitemap is a chapter, and each <url> entry is a page. You don’t need to store every chapter file separately; you just need to know how to navigate from the table of contents down to the pages.
Conclusion
By differentiating between <sitemap> and <url> elements, a system can reliably traverse from the top-level sitemap index down to the actual content URLs. This process ensures efficient storage in the database and keeps the system aligned with search engine standards for sitemap handling.
Appendix: Proof of the Pudding in Domain Maps
The effectiveness of our sitemap ingestion and keyword mapping process can be demonstrated directly in the Domain Map.
How the Domain Map Functions
- Entry Point: Each partner domain in the system has a link to its Domain Map.
- Keyword Mapping: The Domain Map displays all extracted keywords associated with that domain.
- Page Associations: Each keyword is linked to the specific pages discovered through the sitemap crawl.
- Navigation: Users can click a keyword to see the list of pages where it appears, confirming the keyword-to-page relationship.
Why This Is Proof
The Domain Map is the visible outcome of the entire pipeline:
- The sitemap index is parsed and its child sitemaps are expanded.
- Content URLs are collected and stored.
- Keywords are extracted from each page.
- Keywords are mapped back to their originating pages.
- The Domain Map presents this mapping in a user-facing view.
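The keyword-to-page mapping in this pipeline can be pictured as an inverted index. The sketch below is purely illustrative: this document does not describe how keywords are actually extracted, so the rule used here (lowercase alphabetic words of four or more letters) and the function name `build_keyword_map` are stand-in assumptions.

```python
import re
from collections import defaultdict


def build_keyword_map(pages):
    """Map each keyword to the pages it appears on.

    pages: dict of URL -> page text.
    Returns a dict of keyword -> sorted list of URLs.
    Keyword rule is a placeholder: lowercase words of 4+ letters.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z]{4,}", text.lower()):
            index[word].add(url)
    return {keyword: sorted(urls) for keyword, urls in index.items()}
```

Looking up a keyword in the returned dictionary gives the list of pages to show when a user clicks that keyword.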
Validation
By visiting the Domain Map linked to a partner domain, one can verify:
- That the sitemap ingestion process correctly discovered all relevant pages.
- That keyword extraction is functioning as intended.
- That the keyword-to-page mapping is accurate and complete.
Conclusion
The Domain Map acts as the “proof of the pudding”: a direct, navigable representation of how sitemaps, content URLs, and keyword indexing converge into a coherent partner domain listing. It is both a diagnostic tool and a demonstration of system integrity.