What are vocabularies?

Vocabularies are made up of a set of related 'terms' which when used can express some idea or concept. They allow producers to teach consumers richer ways to interpret and handle data.

Vocabularies range from simple to complex. A simple vocabulary might tell a consumer which property to use as an entity's title when displaying it in a form, whereas a more complex vocabulary might tell a consumer how to convert an OData person entity into a vCard entry.

Here are some simple examples:

  • This property should be used as the Title of this entity
  • This property has a range of acceptable values (e.g. 1 to 100)
  • This entity can be converted into a vCard
  • This entity is a foaf:Person
  • This navigation property is essentially a 'foaf:Knows [a person]' relationship
  • This property is a georss:Point
  • Etc

Vocabularies are not a new concept unique to OData. They are used extensively in the linked data and RDF worlds to great effect, and we should be able to re-use many of these existing vocabularies in OData.

Why does OData need vocabularies?

OData is being used in many different verticals now. Each vertical brings its own specific set of requirements and challenges. While some problems are general enough that solving them inside OData adds value to the OData eco-system as a whole, most don't meet that bar.

It seems clear then that we need a mechanism that allows Producers to share more information that 'smarter' Consumers MAY understand enough to enable a higher fidelity experience.

In fact, some consumers are already trying to provide a higher fidelity experience. For example, Sesame can render the results of OData queries on a map. Sesame does this by looking for specifically named properties, which it 'guesses' represent the entity's location. While this is powerful, it would be much better if it weren't a 'guess': the Producer could use a well-known vocabulary to tell Consumers which property is the entity's location.

Goals

As with any new feature, we need to agree on a set of goals before we can come up with the right design. To get us started I propose this set of goals:

  • Ability to re-use or reference common micro-formats and vocabularies.
  • Ability to annotate OData metadata using the terms from a particular vocabulary.
    • Both internally (inside the CSDL file returned from $metadata)
    • And externally (allowing for third-parties to 'enrich' existing OData services they don't own).
    • No matter how the annotation is made, consumers should be able to consume the annotations in much the same way.
  • Ability to annotate OData data too? Although this one is beyond the scope of this post.
  • Consumers that don't understand a particular vocabulary should still be able to work with services that reference that vocabulary. The goal should be to enrich the eco-system for those who 'optionally' understand the vocabulary.
  • We should be able to reference terms from a vocabulary in CSDL, OData Atom and OData JSON.

It is important to note that our goal stops short of specifying how to define the vocabulary itself, how to capture the semantics of the vocabulary, or how to enforce the vocabulary. Those concerns lie solely with vocabulary writers, and with the producers and consumers that profess to understand the vocabulary. Staying out of this business allows OData to reference many existing vocabularies and micro-formats, without being unnecessarily restrictive about how those vocabularies are defined or the types of semantics they might imply.

Exploration

Today if you ask for an OData service's metadata (~/service/$metadata) you get back an EDMX document that contains a CSDL schema. Here is an example.

CSDL already supports annotations, which we could use to refer to a vocabulary and its terms. For example this EntityType definition includes both a structural annotation (validation:Constraint) and a simple attribute annotation (display:Title):

<EntityType Name="Person" display:Title="Firstname Lastname">
  <Key>
    <PropertyRef Name="ID" />
  </Key>
  <Property Name="ID" Type="Edm.Int32" Nullable="false" />
  <Property Name="Firstname" Type="Edm.String" Nullable="true" />
  <Property Name="Lastname" Type="Edm.String" Nullable="true" />
  <Property Name="Email" Type="Edm.String" Nullable="true">
    <validation:Constraint>
      <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
      <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
    </validation:Constraint>
  </Property>
  <Property Name="Age" Type="Edm.Int32" />
</EntityType>

For this to be valid XML the display and validation namespaces would have to be declared somewhere, something like this:

<Schema
    xmlns:display="http://odata.org/vocabularies/display"
    xmlns:validation="http://odata.org/vocabularies/validation">

Here the URL of each namespace declaration identifies the vocabulary globally.
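To make the two annotation styles concrete, here is a rough sketch of how a consumer might pick them out of the CSDL using Python's standard ElementTree parser. The Namespace="Sample" value is made up, the default EDM namespace is left out to keep the lookups short, and the helper names are my own. The regex is written with the dot escaped and nothing after the `$` so that it can be exercised directly.

```python
import xml.etree.ElementTree as ET

DISPLAY = "http://odata.org/vocabularies/display"
VALIDATION = "http://odata.org/vocabularies/validation"

# Assumptions: Namespace="Sample" is made up, and the default EDM namespace
# is left out so the plain element names can be looked up directly.
CSDL = r"""
<Schema Namespace="Sample"
        xmlns:display="http://odata.org/vocabularies/display"
        xmlns:validation="http://odata.org/vocabularies/validation">
  <EntityType Name="Person" display:Title="Firstname Lastname">
    <Key><PropertyRef Name="ID" /></Key>
    <Property Name="ID" Type="Edm.Int32" Nullable="false" />
    <Property Name="Email" Type="Edm.String" Nullable="true">
      <validation:Constraint>
        <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
        <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
      </validation:Constraint>
    </Property>
  </EntityType>
</Schema>
"""

def attribute_annotations(element):
    """Return {(vocabulary_url, term): value} for namespaced attributes."""
    found = {}
    for name, value in element.attrib.items():
        if name.startswith("{"):  # ElementTree spells these "{namespace}local"
            vocabulary, term = name[1:].split("}", 1)
            found[(vocabulary, term)] = value
    return found

def element_annotations(element):
    """Return the child elements that belong to some vocabulary namespace."""
    return [child for child in element if child.tag.startswith("{")]

schema = ET.fromstring(CSDL.strip())
person = schema.find("EntityType[@Name='Person']")
email = person.find("Property[@Name='Email']")

print(attribute_annotations(person))
constraint = element_annotations(email)[0]
print(constraint.findtext(f"{{{VALIDATION}}}Regex"))
```

Note that the consumer never needs the prefix (display, validation), only the namespace URL, which is what makes the URL the real identity of the vocabulary.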

While this allows for completely arbitrary annotations and is extremely expressive, it has a number of down-sides:

  1. Structural annotations (i.e. XML elements) support the full power of XML. While power is good, it comes at a price, and here the price is figuring out how to represent the same thing in, say, JSON. We could come up with a proposal to make CSDL and OData Atom/JSON completely isomorphic, but is that worth the effort? Probably not.
  2. There is no way to refer to something, like say a property, so that you can annotate it externally, which is one of our goals.
  3. If we allow for annotations inline in the data (and let's not forget metadata would just be data in an addressable metadata service) it would change the shape of the resulting JSON structure. For example the JavaScript expression to access the age property of an entity would need to change from something like object.Age to something like object.Age.Value so that object.Age can hold onto all the 'inline annotations'. This is clearly unacceptable if we want existing 'naive' consumers to continue to work.
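The shape problem in (3) is easy to reproduce. The sketch below uses Python rather than JavaScript, and the "Value" and "Annotations" member names and the sample entity are purely illustrative assumptions, not a proposed wire format.

```python
import json

# Payload as a naive consumer sees it today.
plain = json.loads('{"Firstname": "Ada", "Age": 36}')

# Hypothetical payload with annotations inlined next to the value.
# The "Value" and "Annotations" member names are illustrative assumptions.
inlined = json.loads(
    '{"Firstname": "Ada",'
    ' "Age": {"Value": 36, "Annotations": {"validation:Range": "1 to 100"}}}'
)

def naive_age_next_year(entity):
    """A consumer written against the original shape: Age is a number."""
    return entity["Age"] + 1

def breaks_naive_consumer(entity):
    """True if the naive consumer fails on this payload."""
    try:
        naive_age_next_year(entity)
    except TypeError:  # Age is now an object, not a number
        return True
    return False

print(naive_age_next_year(plain))      # 37
print(breaks_naive_consumer(inlined))  # True
```

The consumer did nothing wrong here; the payload simply stopped having the shape it was written against.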

OData values:

If we address these issues in turn, one option for (1) is to restrict the XML available when using a vocabulary to the XML we already know how to convert into JSON, i.e. OData values. For example, take something like this:

<Property Name="Email" Type="Edm.String" Nullable="true">
  <validation:Constraint>
    <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
    <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
  </validation:Constraint>
</Property>

The annotation is pretty simple, and could be modeled as a ComplexType pretty easily:

<ComplexType Name="Constraint">
  <Property Name="Regex" Type="Edm.String" Nullable="false" />
  <Property Name="ErrorMessage" Type="Edm.String" Nullable="true" />
</ComplexType>

In fact if you execute an OData request that just retrieves an instance of this complex type the response would look like this:

<Constraint
    p1:type="Namespace.Constraint"
    xmlns:p1="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
    xmlns="http://schemas.microsoft.com/ado/2007/08/dataservices">
  <Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</Regex>
  <ErrorMessage>Please enter a valid EmailAddress</ErrorMessage>
</Constraint>

And this is almost identical to our original annotation, the only differences being around the XML namespaces.

This means it is not too much of a stretch to say that if your annotation can be modeled as a ComplexType (which of course allows nested complex type properties and multi-value properties too), then the annotation is simply an OData value.
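One way to see why "the annotation is simply an OData value" matters: once the annotation is restricted to that shape, turning it into a JSON-friendly structure is mechanical. The following is only a sketch under that assumption. The to_value helper is my own, it handles nested elements but not multi-value properties, and it treats every leaf as a string.

```python
import json
import xml.etree.ElementTree as ET

ANNOTATION = r"""
<validation:Constraint xmlns:validation="http://odata.org/vocabularies/validation">
  <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
  <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
</validation:Constraint>
"""

def local_name(tag):
    """Strip the "{namespace}" part that ElementTree puts in front of names."""
    return tag.rsplit("}", 1)[-1]

def to_value(element):
    """Leaf elements become strings; elements with children become dicts."""
    children = list(element)
    if not children:
        return element.text or ""
    return {local_name(child.tag): to_value(child) for child in children}

constraint = to_value(ET.fromstring(ANNOTATION.strip()))
print(json.dumps(constraint, indent=2))
```

The result has exactly the two members of the Constraint ComplexType, Regex and ErrorMessage, which is the point: nothing XML-specific survives the conversion.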

This is very nice because it means when we do addressable metadata you can in theory write a query like this to retrieve the annotations for a specific property:

~/$metadata.svc/Properties('Namespace.Type.PropertyName')/Annotations

ISSUE: Actually this introduces a problem. Since each annotation instance would have a different 'type', we would need to do one of the following: support ComplexType inheritance (so we can define the annotation as an EntityType with a Value property of type AnnotationValue, with instances of Annotations invariably having Values derived from the base AnnotationValue type), mark Annotation as an OpenType, or provide a way to specify a property without specifying the type.

Of course today annotations are allowed that can't be modeled as ComplexTypes, so we would need to be able to distinguish those. Perhaps the easiest way is like this:

<Property Name="Email" Type="Edm.String" Nullable="true">
  <validation:Constraint m:Type="validation:Constraint">
    <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
    <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
  </validation:Constraint>
</Property>

Here the m:Type attribute indicates that the annotation is an OData value. This tells servers and clients that they can, if needed, convert between CSDL, Atom and JSON formats using the above rules.

By adopting the OData Atom format for annotations we can use a few more OData-isms to get clearer about the structure of the Annotation:

  • By default each element in the annotation represents a string but you can use m:Type="***"  to change the type to something like Edm.Int32.
    e.g.    <validation:ErrorSeverity m:Type="Edm.Int32">1</validation:ErrorSeverity>
  • We can use m:IsNull="true" to tell the difference between an empty string and null.
    e.g.   <validation:ErrorMessage m:IsNull="true" />
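A consumer could honor these two attributes with very little code. This sketch follows the m:Type and m:IsNull spelling used in this post, and assumes the m prefix is bound to the dataservices metadata namespace shown earlier (the post never declares it). The Hint term is hypothetical, added only to show an empty string next to a null, and the converter table covers just two types.

```python
import xml.etree.ElementTree as ET

# Assumption: "m" is bound to the dataservices metadata namespace.
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"

SNIPPET = """
<validation:Constraint
    xmlns:validation="http://odata.org/vocabularies/validation"
    xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <validation:ErrorSeverity m:Type="Edm.Int32">1</validation:ErrorSeverity>
  <validation:ErrorMessage m:IsNull="true" />
  <validation:Hint></validation:Hint>
</validation:Constraint>
"""

# Only the types this sketch needs; a real consumer would cover more of EDM.
CONVERTERS = {"Edm.String": str, "Edm.Int32": int}

def read_value(element):
    """Apply m:IsNull first, then m:Type (defaulting to Edm.String)."""
    if element.get(f"{{{M}}}IsNull") == "true":
        return None
    edm_type = element.get(f"{{{M}}}Type", "Edm.String")
    return CONVERTERS[edm_type](element.text or "")

root = ET.fromstring(SNIPPET.strip())
values = {child.tag.rsplit("}", 1)[-1]: read_value(child) for child in root}
print(values)  # {'ErrorSeverity': 1, 'ErrorMessage': None, 'Hint': ''}
```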

This looks good. It supports both constrained (OData values) and unconstrained annotations, and is consistent with the existing annotation support in OData.

Out of line & External Annotations

Now if we turn our attention back to concern (2), this example implicitly refers to its parent; however we need to allow vocabularies to refer to something explicitly. For metadata the most obvious solution is to leverage addressable metadata, which allows you to refer to individual pieces of the metadata.

For example if this URL is the address of the metadata for the Email property: http://server/service/$metadata/Properties('Namespace.Person.Email')

Then this 'free floating' element is 'annotating' the Email property using the 'http://odata.org/vocabularies/constraints' vocabulary:

<Annotation AppliesTo="http://server/service/$metadata/Properties('Namespace.Person.Email')"
    xmlns:validation="http://odata.org/vocabularies/constraints">
  <validation:Constraint m:Type="validation:Constraint">
    <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
    <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
  </validation:Constraint>
</Annotation>

Annotation by reference also neatly sidesteps issue (3), i.e. the object annotated is left structurally unchanged, which means we could use a similar approach to annotate data without breaking code (like a JavaScript path) that relies on a particular structure.
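As a sketch of what consuming a 'free floating' annotation could look like, the code below indexes annotations by their AppliesTo URL and uses the Constraint to validate a value, while the entity itself keeps its original shape. The helper names and the sample entity are hypothetical. The m:Type attribute is left off so the fragment parses without declaring the m prefix, and the regex is written with the dot escaped and nothing after the `$`.

```python
import re
import xml.etree.ElementTree as ET

EMAIL_URL = "http://server/service/$metadata/Properties('Namespace.Person.Email')"
VOCAB = "http://odata.org/vocabularies/constraints"

# m:Type is omitted here so the fragment is well-formed on its own.
EXTERNAL = r"""
<Annotation AppliesTo="http://server/service/$metadata/Properties('Namespace.Person.Email')"
            xmlns:validation="http://odata.org/vocabularies/constraints">
  <validation:Constraint>
    <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
    <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
  </validation:Constraint>
</Annotation>
"""

def index_by_target(annotation_documents):
    """Map each AppliesTo URL to the annotation elements that target it."""
    index = {}
    for text in annotation_documents:
        annotation = ET.fromstring(text.strip())
        index.setdefault(annotation.get("AppliesTo"), []).extend(annotation)
    return index

def email_error(email, index):
    """Return the vocabulary's error message, or None if the value passes."""
    for term in index.get(EMAIL_URL, []):
        if term.tag == f"{{{VOCAB}}}Constraint":
            pattern = term.findtext(f"{{{VOCAB}}}Regex")
            if not re.match(pattern, email):
                return term.findtext(f"{{{VOCAB}}}ErrorMessage")
    return None

index = index_by_target([EXTERNAL])
person = {"ID": 1, "Email": "ADA@EXAMPLE.COM"}  # the entity shape is untouched
print(email_error(person["Email"], index))      # None
print(email_error("not-an-email", index))       # Please enter a valid EmailAddress
```

A consumer that knows nothing about the vocabulary simply never builds the index, and person["Email"] keeps working exactly as before.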

Another nice side-effect of this design is that you can use it 'inside' the CSDL too, simply by removing the address of the metadata service from the AppliesTo url. Since we are in the CSDL we can use 'relative addressing':

<Annotation AppliesTo="Properties('Namespace.Person.Email')"
    xmlns:validation="http://odata.org/vocabularies/constraints">
  <validation:Constraint m:Type="validation:Constraint">
    <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
    <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
  </validation:Constraint>
</Annotation>

Indeed if you have a separate file with many annotations for a particular model, you could group a series of annotations together like this:

<Annotations AppliesTo="http://server/service/$metadata/">
  <Annotation AppliesTo="Properties('Namespace.Person.Email')"
      xmlns:validation="http://odata.org/vocabularies/constraints">
    <validation:Constraint m:Type="validation:Constraint">
      <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
      <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
    </validation:Constraint>
  </Annotation>
  <Annotation AppliesTo="Properties('Namespace.Customer.Email')"
      xmlns:validation="http://odata.org/vocabularies/constraints">
    <validation:Constraint m:Type="validation:Constraint">
      <validation:Regex>^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$</validation:Regex>
      <validation:ErrorMessage>Please enter a valid EmailAddress</validation:ErrorMessage>
    </validation:Constraint>
  </Annotation>
</Annotations>

Here the Annotations/@AppliesTo attribute indicates the shared root url for all the annotations, and could be any url that points to a model, be that http, file or whatever.
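Resolving the grouped form is ordinary relative URL resolution. A minimal sketch, assuming a consumer that just wants the absolute target of every annotation in a group (the annotation bodies are elided to keep it short, and absolute_targets is a name I made up):

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

GROUP = """
<Annotations AppliesTo="http://server/service/$metadata/">
  <Annotation AppliesTo="Properties('Namespace.Person.Email')" />
  <Annotation AppliesTo="Properties('Namespace.Customer.Email')" />
</Annotations>
"""

def absolute_targets(xml_text):
    """Resolve each Annotation/@AppliesTo against Annotations/@AppliesTo."""
    group = ET.fromstring(xml_text.strip())
    root_url = group.get("AppliesTo")
    return [
        urljoin(root_url, annotation.get("AppliesTo"))
        for annotation in group.findall("Annotation")
    ]

for target in absolute_targets(GROUP):
    print(target)
```

Because urljoin leaves an already absolute URL alone, a group could in principle mix relative and absolute AppliesTo values and still resolve correctly.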

Vocabulary definitions and semantics

It is important to note that while we are proposing how to 'bind' or 'apply' a vocabulary, we are *not* proposing how to:

  • Define the terms in a vocabulary (e.g. Regex, ErrorMessage)
  • Define the meaning or semantics associated with the terms (e.g. Regex should be applied to instance values, if the regex doesn't match an error/exception should be raised with the ErrorMessage).

Clearly, however, to interoperate two parties must agree on the vocabulary terms available and their meaning. We are not, however, dictating how that understanding develops. It could be done in many different ways, for example using a hallway conversation, a Word or PDF document, a diagram, or perhaps even an XSD or EDM model.

Who creates vocabularies?

The short answer is 'anyone'.

The more nuanced answer is that there are many candidate vocabularies, from georss to vCard to Display to Validations. Over time people and companies from the OData ecosystem will start promoting vocabularies they have an interest in, and as is always the case the most useful will flourish.

Where are we?

I think that this proposal is a good place to gather feedback and think about the scenarios it enables. Using this approach you can imagine a world where:

  1. A Producer exposes a Data Service without using any terms from a useful vocabulary.
  2. Someone creates an 'annotation' file that attaches terms from the useful vocabulary to the service, which then enables 'smart consumers' to interact with the Data Service with higher-fidelity.
  3. The Producer learns of this 'annotation' file and embeds the annotations simply by converting all the 'AppliesTo' urls (that are currently absolute) into relative urls.
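Step 3 is close to a one-liner. A rough sketch, where to_relative is a hypothetical helper and the service root is the example URL used throughout this post:

```python
def to_relative(applies_to, service_metadata_root):
    """Strip the service's own $metadata root from an absolute AppliesTo url."""
    root = service_metadata_root.rstrip("/") + "/"
    if applies_to.startswith(root):
        return applies_to[len(root):]
    return applies_to  # targets some other service, so leave it absolute

ROOT = "http://server/service/$metadata/"
external = "http://server/service/$metadata/Properties('Namespace.Person.Email')"

print(to_relative(external, ROOT))  # Properties('Namespace.Person.Email')
```

Annotations that target a different service are left untouched, so the Producer only embeds what actually applies to its own model.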

You can also imagine a world where consumers like Tableau, PowerPivot and Sesame allow users to build up their own annotation files in response to gestures*.

*Gestures - you can think of the various mouse clicks, drags, key presses performed by a user as gestures. So for example right clicking on a column and picking from a list of well-known vocabularies could be interpreted as binding the selected vocabulary to the corresponding property definition. These 'interpretations' could easily be grouped and stored in an external annotation file.

Summary

I hope that you, like me, can see the potential power and expressiveness that vocabularies can bring to OData. Vocabularies will allow the OData eco-system to connect more deeply with the web and allow for ever richer and more immersive data consumption scenarios.

I really want to hear your feedback on the approach proposed above. The proposal is more exploratory than anything. It's definitely not set in stone, so tell us what you think.

Some specific questions:

  • Do you agree that if we have a reference based annotation model, there is no need to support an inline model?
  • What do you think of the idea of restricting annotation to the OData type system?
  • Do you like the symmetry between in service (i.e. inside $metadata) and external annotations?
  • Do we need to define how you attach vocabularies to data too? For example do you have scenarios where each instance has different annotations?
  • Are there any particularly cool scenarios you think this enables?
  • What vocabularies would you like to see pushed?

Thanks for reading!
- Alex