External data representation and marshalling

An external data representation is an agreed standard for representing data structures and primitive values. Information stored in a running program is represented as data structures, whereas the information in messages consists of sequences of bytes. Irrespective of the form of communication used, data structures must be converted to a sequence of bytes before transmission and rebuilt on arrival. The individual primitive data items transmitted in messages can be values of many different types, and not all computers store primitive values in the same order; for example, some architectures store the bytes of an integer in big-endian order and others in little-endian order. The representation therefore differs between architectures.

Marshalling is the process of taking a collection of data items and assembling them into a form suitable for transmission in a message. Unmarshalling is the process of disassembling them on arrival to produce an equivalent collection of data items at the destination. Marshalling therefore consists of translating structured data items and primitive values into an external data representation, while unmarshalling consists of generating primitive values from their external data representation and rebuilding the data structures.
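
As a rough illustration of these two steps, the sketch below flattens a small record into bytes and rebuilds it on the other side. The record fields (name, place, year), the function names and the length-prefixed layout are assumptions made for this example; they are not the format of any particular middleware.

```python
import struct

def marshal_person(name, place, year):
    """Flatten (name, place, year) into bytes in an agreed external format."""
    name_bytes = name.encode("utf-8")
    place_bytes = place.encode("utf-8")
    # "!" selects network (big-endian) byte order with no padding.
    # Each string is sent as a 4-byte length followed by its bytes.
    return (
        struct.pack("!I", len(name_bytes)) + name_bytes
        + struct.pack("!I", len(place_bytes)) + place_bytes
        + struct.pack("!I", year)
    )

def unmarshal_person(data):
    """Rebuild the (name, place, year) tuple from the flattened bytes."""
    offset = 0

    def read_string():
        nonlocal offset
        (length,) = struct.unpack_from("!I", data, offset)
        offset += 4
        text = data[offset:offset + length].decode("utf-8")
        offset += length
        return text

    name = read_string()
    place = read_string()
    (year,) = struct.unpack_from("!I", data, offset)
    return name, place, year

message = marshal_person("Smith", "London", 1984)
rebuilt = unmarshal_person(message)
```

Because both sides agree on the field order and on big-endian integers, the receiver can rebuild the record without knowing anything about the sender's architecture.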

One of the following methods can be used to enable any two computers to exchange binary data values:

  • The values are converted to an agreed external format before transmission and converted to the local form on receipt; if the two computers are known to be of the same type, the conversion to external format can be omitted.
  • The values are transmitted in the sender’s format, together with an indication of the format used, and the recipient converts the values if necessary.

Note, however, that the bytes themselves are never altered during transmission. To support RMI or RPC, any data type that can be passed as an argument or returned as a result must be able to be flattened, and the individual primitive data values must be represented in an agreed format. An agreed standard for the representation of data structures and primitive values is known as an external data representation.

CORBA’s common data representation (CDR) is an external data representation that can represent all of the data types that can be used as arguments and return values in remote invocations in CORBA. These consist of primitive types together with a range of composite types. Each argument or result in a remote invocation is represented by a sequence of bytes in the invocation or result message. CDR can be used by a variety of programming languages.

For primitive types, CDR defines a representation for both big-endian and little-endian orderings. Values are transmitted in the sender’s ordering, which is specified in each message. The recipient translates the values if it requires a different ordering.
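
The idea of sending values in the sender’s ordering can be sketched as follows. The one-byte flag in front of the value and the function names are assumptions made for this example; the layout is deliberately simplified and is not the real CORBA message header.

```python
import struct

def send_value(value, little_endian):
    """Encode a 32-bit integer in the sender's byte order, preceded by a flag."""
    flag = 1 if little_endian else 0          # hypothetical one-byte order flag
    fmt = "<i" if little_endian else ">i"     # "<" little-endian, ">" big-endian
    return bytes([flag]) + struct.pack(fmt, value)

def receive_value(message):
    """Read the flag, then decode the integer using the order it announces."""
    fmt = "<i" if message[0] == 1 else ">i"
    (value,) = struct.unpack_from(fmt, message, 1)
    return value

from_little = send_value(1984, little_endian=True)
from_big = send_value(1984, little_endian=False)
```

The two messages contain different bytes, yet the receiver recovers the same integer from both because it translates according to the flag rather than assuming its own ordering.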

Marshalling operations in CORBA can be generated automatically from the specification of the types of the data items to be transmitted in a message. The types of the data structures and of the basic data items are described in CORBA IDL, which provides a notation for describing the types of the arguments and results of RMI methods. The CORBA interface compiler generates appropriate marshalling and unmarshalling operations for the arguments and results of remote methods from the definitions of the types of their parameters and results.
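
The following toy sketch hints at how marshalling code can be derived from a type description rather than written by hand. It is only a stand-in for what an interface compiler does: the type names in FIELD_FORMATS and the make_marshaller function are hypothetical, and the sketch ignores the alignment rules of real CDR.

```python
import struct

# Hypothetical mapping from type names in an interface description
# to struct format characters.
FIELD_FORMATS = {"long": "i", "double": "d", "boolean": "?"}

def make_marshaller(field_types):
    """Build marshal/unmarshal functions from a list of field type names."""
    # "!" means big-endian with no padding between fields.
    packer = struct.Struct("!" + "".join(FIELD_FORMATS[t] for t in field_types))
    return packer.pack, packer.unpack

# Generated once from the declared parameter types of an imagined remote method.
marshal, unmarshal = make_marshaller(["long", "double", "boolean"])

encoded = marshal(1984, 2.5, True)
decoded = unmarshal(encoded)
```

The application code only supplies the values; the order and encoding of the fields come entirely from the type description.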

Java’s object serialization is concerned with the flattening and external data representation of any single object or tree of objects that may need to be transmitted in a message or stored on disk. It is for use only by Java.

In Java RMI, both objects and primitive data values may be passed as arguments and results of method invocations. An object is an instance of a class. Stating that a class implements the Serializable interface, which is provided in the java.io package, has the effect of allowing its instances to be serialized.

In Java, serialization refers to the activity of flattening an object or a connected set of objects into a serial form that is suitable for storing on disk or transmitting in a message. Deserialization consists of restoring the state of an object or a set of objects from their serialized form. It is assumed that the process that does the deserialization has no prior knowledge of the types of the objects in the serialized form. Therefore, some information about the class of each object is included in the serialized form. This information enables the recipient to load the appropriate class when an object is deserialized.

Serialization and deserialization of the arguments and results of remote invocations are generally carried out automatically by the middleware, without any participation by the application programmer. If necessary, programmers with special requirements may write their own versions of the methods that read and write objects. Another way in which a programmer may modify the effects of serialization is by declaring variables that should not be serialized as transient.

XML is a markup language for general use on the web. The term markup language refers to a textual encoding that represents both a text and details of its structure or appearance. XML was designed for writing structured documents for the web. It is now also used by clients and servers in web services to represent the data sent in the messages they exchange.

XML data items are tagged with markup strings. The tags are used to describe the logical structure of the data and to associate attribute-value pairs with logical structures.
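
A small example may make tags and attribute-value pairs more concrete. The person document below, with its element names and id attribute, is made up for illustration; it is parsed here with Python’s standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A made-up document: <person> is the logical structure, id="..." is an
# attribute-value pair attached to it, and the nested tags label each item.
document = """
<person id="123456789">
    <name>Smith</name>
    <place>London</place>
    <year>1984</year>
</person>
"""

root = ET.fromstring(document)
person_id = root.attrib["id"]          # value of the id attribute
name = root.find("name").text          # text inside <name>...</name>
year = int(root.find("year").text)     # XML carries text; convert explicitly
```

Note that the year arrives as text and has to be converted to an integer by the application, since the document itself stores every value textually.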

XML is used to enable clients to communicate with web services and to define the interfaces and other properties of web services. XML is also used in many other ways, including in archiving and retrieval systems. Although an XML archive may be larger than a binary one, it has the advantage of being readable on any computer.

XML is extensible in the sense that users can define their own tags. If an XML document is intended to be used by more than one application, then the names of the tags must be agreed between them.

Some external data representations, such as CORBA CDR, don’t need to be self-describing, because it is assumed that the client and server exchanging a message have prior knowledge of the order and the types of the information it contains. XML, in contrast, was intended to be used by multiple applications for different purposes. The provision of tags, together with the use of namespaces to define the meaning of the tags, has made this possible. The use of tags also enables applications to select just those parts of a document they need to process, so they are not affected by the addition of information relevant to other applications.
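
The sketch below illustrates how an application that selects elements by tag keeps working when a document gains extra information meant for other applications. The namespace URIs, the element names and the read_name function are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI that gives the tags their agreed meaning.
NS = {"p": "http://example.org/person"}

def read_name(xml_text):
    """Pick out only the <name> element in the person namespace."""
    root = ET.fromstring(xml_text)
    return root.find("p:name", NS).text

original_doc = (
    '<person xmlns="http://example.org/person">'
    "<name>Smith</name>"
    "</person>"
)

# A later version adds elements relevant to other applications.
extended_doc = (
    '<person xmlns="http://example.org/person" xmlns:h="http://example.org/hr">'
    "<name>Smith</name>"
    "<h:salary>1000</h:salary>"
    "<place>London</place>"
    "</person>"
)

name_from_original = read_name(original_doc)
name_from_extended = read_name(extended_doc)
```

The reader function never mentions the added elements, so it returns the same result for both versions of the document.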

XML documents, being textual, can be read by humans. In practice, most XML documents are generated and read by XML processing software, but the ability to read XML can be useful when things go wrong. The use of text also makes XML independent of any particular platform. However, the use of a textual rather than a binary representation, together with the use of tags, makes messages larger, so they require longer processing and transmission times, as well as more space to store. Messages encoded in CORBA CDR are therefore more efficient than those in the SOAP XML format.

In the first two cases (CORBA CDR and Java serialization), the marshalling and unmarshalling activities are intended to be carried out by a middleware layer without any involvement on the part of the application programmer. Even in the case of XML, which is textual and therefore more accessible to hand-encoding, software for marshalling and unmarshalling is available for all commonly used platforms and programming environments. Because marshalling requires the consideration of all the finest details of the representation of the primitive components of composite objects, the process is likely to be error-prone if carried out by hand. Compactness is another issue that can be addressed in the design of automatically generated marshalling procedures.

In the first two approaches, the primitive data types are marshalled into a binary form. In the third approach (XML), the primitive data types are represented textually. The textual representation of a data value will generally be longer than the equivalent binary representation. The HTTP protocol is another example of the textual approach.
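
The size difference can be checked with a quick experiment. The value and the year tag below are arbitrary choices for this example.

```python
import struct

value = 1234567890

# Binary form: a fixed 4-byte big-endian integer.
binary_form = struct.pack("!i", value)

# Textual form: one byte per decimal digit.
text_form = str(value).encode("ascii")

# Textual form wrapped in XML-style tags, as it would appear in a document.
tagged_form = b"<year>" + text_form + b"</year>"

sizes = (len(binary_form), len(text_form), len(tagged_form))
```

For this value the binary form is the shortest, the plain decimal text is longer, and the tagged text is longer still.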

Another issue with regard to the design of marshalling methods is whether the marshalled data should include information concerning the type of its contents. For example, CORBA’s representation includes just the values of the objects transmitted, and nothing about their types. On the other hand, both Java serialization and XML do include type information, but in different ways. Java puts all of the required type information into the serialized form, whereas XML documents may refer to externally defined sets of names (with types) called namespaces.

Although we are interested in the use of an external data representation for the arguments and results of RMIs and RPCs, it does have a more general use for representing data structures, objects or structured documents in a form suitable for transmission in messages or storing in files.

Two other techniques for external data representation are worthy of mention. Google uses an approach called protocol buffers to capture representations of both stored and transmitted data. There is also considerable interest in JSON (JavaScript Object Notation) as an approach to external data representation. Protocol buffers and JSON represent a step towards more lightweight approaches to data representation (when compared, for example, to XML).
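
To give a feel for how lightweight JSON is, the sketch below encodes a small record and decodes it again using Python’s standard json module. The record itself is made up for illustration.

```python
import json

# A made-up record, represented as an ordinary dictionary.
person = {"name": "Smith", "place": "London", "year": 1984}

encoded = json.dumps(person)    # marshal: dictionary -> JSON text
decoded = json.loads(encoded)   # unmarshal: JSON text -> dictionary
```

Like XML, the encoded form is text and carries the field names with the data, but without closing tags or namespaces.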

Final year Software Engineering Student at University of Kelaniya.