Understanding what metadata is and how it helps to better organize your knowledge base


Well-behaved documents[1] contain meaningful metadata for automatic classification!

Metadata is data about data[2] that describes the content or characteristics of an electronic or physical object. The author of a scientific paper, the publisher of a book, the copyright of a photograph, the scene of a painting, the theme of a song and the architect of a pyramid built in 2500 BC are all valid metadata, or attributes, of the item concerned. They serve to classify the item in your private search engine or to share it with colleagues or even with the world at large.

This latter point deserves special attention, because different users may require different pieces of data. Art critics, veterinarians, physicists and lawyers use different jargon, yet all regularly download content from other domains to integrate into their own knowledge base, preferably without manual intervention and based solely on embedded metadata such as title, author, subject, description and keywords, provided the documents are well-behaved.

There are dozens of metadata standards[3] belonging to dozens of fraternities, each with its own vocabulary[4], yet none covers all the needs of all users. A student may need data for generating citations, while an engineer may want to search a document collection by technical criteria.

The digi-libris pragmatic solution to this dilemma:

An individually extensible and universally applicable metadata set.

It builds on the widely used Dublin Core standard[5] (minus refinements) plus an unlimited number of customizable attribute/value pairs for the data. Consider it an alternative Dublin Core application profile (DCAP) for individuals who may or may not have to rely on a single standard issued by an institution. One does not exclude the other! For the exchange of metadata with third parties, it relies on Adobe's widely accepted XMP[6] technology. This is the format already implemented in PDF documents, and it includes a placeholder for arbitrary or custom variables (pdfx).
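To make the idea concrete, here is a minimal Python sketch of such a metadata set: the 15 original Dublin Core elements plus free attribute/value pairs, written out as a simplified RDF fragment in the style of XMP. The class name DocumentMetadata and its methods are hypothetical and are not the digi-libris implementation. The output is also deliberately simplified: real XMP wraps the RDF in an x:xmpmeta packet and stores several Dublin Core properties (title, creator, subject) in rdf:Alt, rdf:Seq or rdf:Bag containers, which this sketch omits.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape

# The 15 elements of the original Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Namespace URIs used by XMP for RDF, Dublin Core and custom PDF properties.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
PDFX_NS = "http://ns.adobe.com/pdfx/1.3/"


@dataclass
class DocumentMetadata:
    """Hypothetical record: Dublin Core values plus custom attribute/value pairs."""
    core: dict = field(default_factory=dict)
    custom: dict = field(default_factory=dict)

    def set_core(self, element, value):
        # Only the 15 basic elements are accepted (no refinements).
        if element not in DC_ELEMENTS:
            raise KeyError(f"{element!r} is not a Dublin Core element")
        self.core[element] = str(value)

    def set_custom(self, name, value):
        # Assumption: custom names are restricted to identifier-like strings
        # so that they can be used directly as XML element names.
        if not name.isidentifier():
            raise ValueError(f"{name!r} cannot be used as an XML element name")
        self.custom[name] = str(value)

    def to_rdf(self):
        """Serialize to a simplified RDF fragment (not a complete XMP packet)."""
        lines = [
            f'<rdf:RDF xmlns:rdf="{RDF_NS}">',
            f'  <rdf:Description rdf:about="" xmlns:dc="{DC_NS}" xmlns:pdfx="{PDFX_NS}">',
        ]
        for name, value in self.core.items():
            lines.append(f"    <dc:{name}>{escape(value)}</dc:{name}>")
        for name, value in self.custom.items():
            lines.append(f"    <pdfx:{name}>{escape(value)}</pdfx:{name}>")
        lines.append("  </rdf:Description>")
        lines.append("</rdf:RDF>")
        return "\n".join(lines)


if __name__ == "__main__":
    record = DocumentMetadata()
    record.set_core("title", "Bridge Design Notes")
    record.set_core("creator", "A. Engineer")
    record.set_custom("LoadClass", "A & B")  # an engineer's own criterion
    print(record.to_rdf())
```

Keeping the core and custom values in separate dictionaries mirrors the split described above: the Dublin Core part stays interchangeable with third parties, while the custom pairs can grow freely without touching it.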

[1] A well-behaved document is an electronic document that is both user-friendly and search-friendly: it is easy to read and to peruse. It has bookmarks, an interactive table of contents and useful embedded metadata, so that it can be classified with little or no manual intervention.

[2] Metadata is "data about data". The term is ambiguous, as it is used for two fundamentally different concepts: structural metadata and descriptive metadata. The latter is about individual instances of application data, the data content, which is what a consumer of content is interested in.

[3] See http://en.wikipedia.org/wiki/Metadata_standards#Available_metadata_standards

[4] See also nomenclature, namespaces or naming systems, e.g. http://en.wikipedia.org/wiki/Namespace#Naming_system

[5] Dublin Core (http://dublincore.org/metadata-basics/) originally proposed 15 metadata elements for simple and generic resource descriptions. It has since extended these to 55 terms, 15 of which duplicate the original elements, which can lead to confusion. Unfortunately, this list is a bit one-sided, which is why we have included the possibility of adding custom attribute/value pairs.

[6] Adobe's Extensible Metadata Platform (XMP) is a technology that allows you to embed metadata into the file itself. With XMP, desktop applications and back-end publishing systems gain a common method for capturing, sharing and leveraging this valuable metadata. While typically found in PDF documents, XMP is built on a set of public specifications. You may download them from Adobe's XMP pages: