
An experiment showed Tuesday that large language models can extract information from deliberately malformed JSON-LD schema, which suggests they read it as plain text rather than parse it as structured data.
The finding challenges a common misconception among SEO and GEO practitioners, who often take an LLM’s ability to retrieve schema data as evidence that it parses the markup properly and who may misrepresent the value of schema to clients as a result, according to industry analysis.
Mark Williams-Cook, a director at Candour, conducted the experiment and based his analysis on how LLM training pipelines work. He said many experts incorrectly attribute LLM data retrieval to schema parsing.
Schema.org structured data is designed to reduce ambiguity for machines, such as the search engines run by Google, Microsoft, Yahoo and Yandex, by providing explicit, machine-readable information that feeds into knowledge graphs.
However, LLMs likely do not parse schema during their training phases because pre-training pipelines are engineered to strip HTML, boilerplate content and script tags, where JSON-LD schema typically resides, to extract clean prose.
Text-extraction tools commonly used to build large training datasets, such as ‘trafilatura’, which is used in the ‘FineWeb’ pipeline, are designed to ignore script tags. This makes it improbable that schema markup is incorporated into an LLM’s core understanding during its initial training, Williams-Cook noted.
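The stripping step described above can be illustrated with a minimal sketch. It uses only Python’s standard-library HTMLParser and is not trafilatura’s actual implementation; the class name, the function name and the sample page are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class ProseExtractor(HTMLParser):
    """Collect visible text and skip everything inside <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_prose(html):
    """Return the page's visible prose (hypothetical helper)."""
    parser = ProseExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page: the JSON-LD lives in a script tag in the head.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Widgets Ltd"}
</script>
</head><body>
<h1>About us</h1>
<p>We sell widgets.</p>
</body></html>
"""

cleaned = extract_prose(SAMPLE_PAGE)
print(cleaned)
```

In this sketch the organization name appears only in the JSON-LD block, so it never reaches the cleaned text. If a real pipeline behaves similarly, the schema would not reach the training corpus at all.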
The experiment involved feeding LLMs deliberately broken JSON-LD schema. The models were still able to extract relevant data, indicating they did not validate or understand the structural integrity of the schema itself.
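The difference between parsing schema and reading it as text can be sketched as follows. The article does not reproduce the broken schema used in the experiment, so the malformed snippet below is a hypothetical stand-in, and the regular expression is only a crude analogy for text-level pattern matching, not a description of how an LLM works internally.

```python
import json
import re

# Hypothetical malformed JSON-LD: a comma is missing after "Organization"
# and there is a trailing comma before the closing brace.
BROKEN_SCHEMA = (
    '{"@context": "https://schema.org", "@type": "Organization" '
    '"name": "Example Widgets Ltd", "telephone": "+44 0000 000000",}'
)


def parse_strictly(raw):
    """Parse as JSON the way a validating consumer would; None on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


def read_as_text(raw, key):
    """Pull the string value that follows a key, ignoring overall structure."""
    pattern = r'"' + re.escape(key) + r'"\s*:\s*"([^"]*)"'
    match = re.search(pattern, raw)
    return match.group(1) if match else None


strict_result = parse_strictly(BROKEN_SCHEMA)
text_result = read_as_text(BROKEN_SCHEMA, "name")
print(strict_result, text_result)
```

A strict parser rejects the snippet outright, while the text-level reader still recovers the name. That is the pattern the experiment reports: data was retrieved even though the structure was invalid.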
Williams-Cook’s analysis, also discussed by Search Engine Roundtable, aligns with insights from experts like Andrej Karpathy, who has detailed the extensive data cleaning processes involved in training large AI models.
The implication for the SEO and GEO communities is that while structured data remains important for traditional search engines and knowledge graphs, its direct interaction with LLMs appears to be limited to text extraction rather than semantic parsing.
Source: Search Engine Journal
Written by Angela Palumbo
Angela Palumbo, Senior Editor at Rabbit Rank since 2023, holds a bachelor's in communications. She focuses on fact-checking and simplifying complex topics while also leading strategy for the news department.