LLMs Extract Text from Malformed Schema Rather Than Parsing Structured Data

An experiment showed Tuesday that large language models can extract information from deliberately malformed JSON-LD schema, suggesting they process the markup as plain text rather than parsing it as structured data.

The finding challenges a common misconception among SEO and GEO practitioners, who often treat an LLM’s ability to retrieve schema data as evidence that it parsed the markup properly and who may, as a result, misrepresent schema’s value to clients, according to industry analysis.

Mark Williams-Cook, a director at Candour, conducted the experiment and grounded his analysis in how LLM training pipelines work. He said many experts incorrectly attribute an LLM’s ability to retrieve data to schema parsing.

Schema.org structured data is designed to reduce ambiguity for machines, including search engines such as Google, Microsoft, Yahoo and Yandex, by providing explicit, machine-readable information that feeds into knowledge graphs.

However, LLMs likely do not parse schema during their training phases because pre-training pipelines are engineered to strip HTML, boilerplate content and script tags, where JSON-LD schema typically resides, to extract clean prose.

Tools commonly used to build large training datasets, such as trafilatura, which is used in FineWeb, are designed to ignore script tags. This makes it improbable that schema markup would be incorporated into an LLM’s core understanding during its initial training, Williams-Cook noted.
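To illustrate the kind of cleaning described above, here is a minimal sketch using only Python's standard library. It is not trafilatura's implementation or API. The `ProseExtractor` class, the `extract_prose` function and the sample page are hypothetical stand-ins that mimic one behavior: dropping everything inside script tags while keeping visible prose.

```python
from html.parser import HTMLParser


class ProseExtractor(HTMLParser):
    """Hypothetical extractor: keeps visible text, skips <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # greater than 0 while inside a skipped element
        self.chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep non-empty text that sits outside skipped elements.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_prose(html):
    """Return the page's visible text as a single string."""
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the JSON-LD lives in a script tag, the prose in the body.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Widgets Ltd"}
</script>
</head><body>
<h1>About us</h1>
<p>We sell widgets.</p>
</body></html>
"""

clean_text = extract_prose(PAGE)
print(clean_text)
```

In this sketch the organization name appears only in the JSON-LD block, so it never reaches the extracted text. Under that assumption, a model trained on the cleaned output would not see the schema at all.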

The experiment involved feeding LLMs deliberately broken JSON-LD schema. The models were still able to extract relevant data, indicating they did not validate or understand the structural integrity of the schema itself.
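The difference between parsing and text extraction can be sketched in a few lines of Python. The broken markup below is a made-up example, not the schema used in the experiment, and the regular expression is only an analogy for surface-level pattern matching, not a description of how a language model works internally. `strict_parse` and `text_lookup` are hypothetical helper names.

```python
import json
import re

# Hypothetical malformed JSON-LD: a comma is missing after the "@type" value
# and the closing brace is absent, so the document is not valid JSON.
BROKEN_JSON_LD = '''{
  "@context": "https://schema.org",
  "@type": "Organization"
  "name": "Example Widgets Ltd",
  "telephone": "+44 1234 567890"
'''


def strict_parse(raw):
    """Parse as real JSON; return None if the structure is invalid."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def text_lookup(raw, key):
    """Find a quoted key and its quoted value by pattern, ignoring structure."""
    pattern = r'"%s"\s*:\s*"([^"]*)"' % re.escape(key)
    match = re.search(pattern, raw)
    return match.group(1) if match else None


parsed = strict_parse(BROKEN_JSON_LD)        # structure-aware: fails
name = text_lookup(BROKEN_JSON_LD, "name")   # text-level: still succeeds

print("Parsed as structured data:", parsed)
print("Recovered by text matching:", name)
```

A structure-aware parser rejects the whole document, while the text-level lookup still returns the name. Retrieving a value from schema therefore does not, on its own, show that the schema was parsed.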

Williams-Cook’s analysis, also discussed by Search Engine Roundtable, aligns with insights from experts like Andrej Karpathy, who has detailed the extensive data cleaning processes involved in training large AI models.

The implication for the SEO and GEO communities is that structured data remains important for traditional search engines and knowledge graphs, but LLMs appear to interact with it only through text extraction rather than semantic parsing.

Source: Search Engine Journal

Written by Angela Palumbo

Angela Palumbo, Senior Editor at Rabbit Rank since 2023, holds a bachelor's in communications. She focuses on fact-checking and simplifying complex topics while also leading strategy for the news department.