Data Intensive App #4 | Encoding And Evolution

4.0 Introduction

  • Applications inevitably change over time. (Features are added or modified as new products are launched)
  • A change to an app’s features also requires a change to the data that it stores.
  • This means that old and new versions of the code, and old and new data formats, may all coexist in the system at the same time.
    • Backward Compatibility : Newer code can read data written by older code => Easy
    • Forward Compatibility : Older code can read data written by newer code => can be trickier => requires older code to ignore additions made by newer code
  • In this chapter,
    • We will look at several formats for encoding data
    • We will then discuss how those formats are used for data storage and for communication




4.1 Formats for Encoding Data

  • Programs usually work with data in at least two different representations :

    • In-memory data : kept in objects, structs, lists, arrays, hash tables, trees, and so on.
    • Encoded data : When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (e.g. JSON)
  • Thus, we need some kind of translation between the two representations

    • Encoding (Serialization or Marshalling) : translation from the in-memory representation to a byte sequence
      • Language Specific Formats
      • JSON, XML
      • Binary Variants
    • Decoding (Parsing, Deserialization, Unmarshalling) : translation from a byte sequence to the in-memory representation
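
As a minimal illustration of this round trip, here is a sketch in Python, using JSON as the byte-sequence format (the names are illustrative):

    import json

    # In-memory representation: objects, lists, dicts, pointers into memory
    person = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]}

    # Encoding (serialization / marshalling): in-memory -> self-contained bytes
    data = json.dumps(person).encode("utf-8")

    # Decoding (parsing / deserialization / unmarshalling): bytes -> in-memory
    restored = json.loads(data.decode("utf-8"))
    assert restored == person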

4.1.1 Language Specific Formats

  • Many programming languages come with built-in support for encoding in-memory objects into byte sequences (java.io.Serializable, Ruby’s Marshal, Python’s pickle)

  • These are very convenient. However, they also have a number of deep problems :

    1. The encoding is often tied to a particular programming language, and reading the data in another language is very difficult.

    2. In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes => Security Problem (see the sketch after this list)

      Pickle docs : The pickle module is not secure. It’s possible to construct malicious pickle data which will execute arbitrary code during unpickling. https://docs.python.org/3/library/pickle.html

    3. Versioning data is often an afterthought in these libraries : as they are intended for quick and easy encoding of data => they neglect the inconvenient problems of forward and backward compatibility

    4. Efficiency is also often an afterthought (CPU time taken to encode or decode, and the size of the encoded structure)
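
To make problem 2 concrete, here is a minimal sketch of why unpickling untrusted data is dangerous: pickle lets an object dictate how it is reconstructed (via __reduce__), so a crafted byte sequence can run an arbitrary command during decoding:

    import pickle

    class Malicious:
        # __reduce__ tells pickle how to rebuild this object:
        # here, "rebuilding" means calling os.system("echo pwned")
        def __reduce__(self):
            import os
            return (os.system, ("echo pwned",))

    payload = pickle.dumps(Malicious())

    # The victim merely decodes the data -- and the command executes
    pickle.loads(payload)  # prints "pwned": arbitrary code execution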

So far we have looked at language-specific formats; next we will look at JSON and XML, formats that can be read and written by many languages.

4.1.2 Textual Formats : JSON, XML and CSV

  • standardized encodings that can be written and read by many programming languages
  • Widely known, widely supported, and almost as widely disliked
    • XML : is often criticized for being too verbose and unnecessarily complicated
    • JSON : its popularity is mainly due to its built-in support in web browsers
    • CSV : is another popular language-independent format, albeit less powerful
  • JSON, XML, and CSV are textual formats and human-readable, but they also have some subtle problems :
    • Ambiguity around the encoding of numbers : XML and CSV cannot distinguish between a number and a string; JSON distinguishes strings from numbers, but it doesn’t distinguish integers from floating-point numbers => a source of problems when dealing with large numbers (see the sketch after this list)
    • JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding) : people get around this limitation by encoding the binary data as text using Base64 => this works, but it’s somewhat hacky and increases the data size by 33%
    • There is optional schema support for both XML[11] and JSON[12]. These schema languages are quite powerful and thus quite complicated to learn and implement.
    • CSV does not have any schema, so it’s up to the application to define the meaning of each row and column.
  • Despite these flaws, JSON, XML, and CSV are good enough for many purposes.
  • The difficulty of getting different organizations to agree on anything outweighs most other concerns => it often doesn’t matter how pretty or efficient the format is.
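
Both of the pitfalls above are easy to demonstrate. A sketch in Python (the precision issue bites in any parser that decodes every number as an IEEE 754 double, as JavaScript’s does):

    import base64

    # Integers above 2**53 cannot be represented exactly as a double,
    # which is how JavaScript (and many JSON parsers) decode all numbers
    big_id = 2**53 + 1                 # e.g. a tweet ID: 9007199254740993
    assert float(big_id) != big_id     # the value is silently corrupted

    # Binary data must be smuggled through JSON/XML as Base64 text,
    # which emits 4 output bytes for every 3 input bytes (+33%)
    blob = bytes(range(256)) * 3       # 768 bytes of raw binary
    print(len(base64.b64encode(blob)) / len(blob))   # 1.333...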

4.1.3 Binary Variants : Thrift, Protocol Buffers and Avro

  • For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format => it can be more compact or faster to parse

  • Apache Thrift and Protocol Buffers (Google) are binary encoding libraries based on the same principle :

    1. Define Schema : Both Thrift and Protocol Buffers require a schema for any data that is encoded.
    2. schema => code generation tool => produces classes that implement the schema in various programming languages
    3. call this generated code to encode or decode records of the schema (a sketch of this step follows at the end of this subsection)
    // In Thrift, you would describe the schema in the Thrift interface definition language (IDL)
    struct Person {
      1: required string       userName,
      2: optional i64          favoriteNumber,
      3: optional list<string> interests
    }
    
  • Binary variants can handle schema changes (schema evolution) while keeping backward and forward compatibility : in Thrift and Protocol Buffers, each field is identified by its tag number, so new optional fields can be added under fresh tags, and old code simply skips tags it doesn’t recognize

  • Avro : is another binary encoding format that is interestingly different from Protocol Buffers and Thrift => started as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop’s use cases [21]

  • The Merits of Schemas : although textual data formats such as JSON, XML, and CSV are widespread, binary encodings based on schemas are also a viable option
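
To make step 3 concrete, here is a hedged sketch of what calling the generated code looks like in Python with Protocol Buffers. It assumes a person.proto equivalent to the Thrift schema above, compiled with `protoc --python_out=.` into a (hypothetical) person_pb2 module:

    import person_pb2  # generated by: protoc --python_out=. person.proto

    # Encode: populate the generated class, then serialize to compact binary
    person = person_pb2.Person(user_name="Martin", favorite_number=1337)
    person.interests.append("hacking")
    data = person.SerializeToString()   # dense bytes, much smaller than JSON

    # Decode: fields are identified by tag number, not by name, so a reader
    # simply skips tags it doesn't know -- the basis of forward compatibility
    restored = person_pb2.Person()
    restored.ParseFromString(data)
    assert restored.user_name == "Martin"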




4.2 Modes of Dataflow (p. 128)

  • We said that whenever you want to send some data to another process with which you don’t share memory, you need to encode it as a sequence of bytes.
  • We then discussed a variety of different encodings for doing this.
  • In the rest of this chapter, we will explore some of the most common ways that data flows between processes :
    • Via database
    • Via service calls
    • Via asynchronous message passing

4.2.1 Dataflow Through Databases

  • In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it

  • Backward Compatibility : storing something in the DB => it should be readable in the future

  • Forward Compatibility : It’s common for several different processes to be accessing a DB at the same time

    • A value in the DB may be written by a newer version of the code, and subsequently read by an older version of the code
    • Be careful when older code updates data written by a newer version of the application : fields the older code doesn’t know about can be lost (Figure 4-7; see the sketch below)
  • Different values written at different times

    • An application can be replaced entirely with a new version fairly easily, but the data in the DB cannot : data outlives code
    • Rewriting (migrating) existing data into a new schema is possible, but it is expensive, so most DBs instead support simple schema changes, such as adding a new column whose default value is null
    • In the end, schema evolution makes the entire database appear as if it were encoded with a single schema, whether the data was written five years ago or today
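
The Figure 4-7 pitfall deserves a sketch: if old code reads a record written by new code, keeps only the fields it knows about, and writes the whole record back, the newer field is silently destroyed. (Illustrative Python; the field names are made up.)

    # Record written by NEW code, which added a "photo_url" field
    record = {"user_name": "Martin", "favorite_number": 1337,
              "photo_url": "https://example.com/me.jpg"}

    # OLD code: copies only the fields it knows into its model object...
    model = {k: record[k] for k in ("user_name", "favorite_number")}
    model["favorite_number"] = 1338     # ...makes its update...

    # ...and writes the whole record back: "photo_url" is now gone
    record = model
    print(record)   # {'user_name': 'Martin', 'favorite_number': 1338}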

4.2.2 Dataflow Through Services : REST and RPC

  • In some ways, services are similar to DBs : they typically allow clients to submit and query data. However, the two differ in :

    • DBs allow arbitrary queries using a query language,
    • Services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic of the service
  • A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable.

    • => we should expect old and new version of servers and clients to be running at the same time
    • => so the data encoding used by servers and clients must be compatible across versions of the service API
  • Web service : when HTTP is used as the underlying protocol for talking to the service, it is called a web service. Two popular approaches are REST and SOAP :

    • REST : emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, etc. => has been gaining popularity
    • SOAP : aims to be independent of HTTP and avoids using most HTTP features; comes with a sprawling and complex multitude of related standards => has fallen out of favor in most companies.
  • The problems with RPCs (Remote Procedure Calls)

    • Technologies for making API requests over a network : web services, EJB, RMI, DCOM

    • All of these are based on the idea of RPC

    • Although RPC seems convenient at first, the approach is fundamentally flawed :

      1. A local function call is predictable. A network request is unpredictable : the request or response may be lost due to a network problem (see the sketch after this list)
      2. A local function call normally takes about the same time to execute, but the latency of a network request is wildly variable
      3. You can efficiently pass references (pointers) to objects in local memory to a local function, but for a network request, all those parameters need to be encoded into a sequence of bytes
      4. The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another
    • Despite all these problems, RPC isn’t going away => various RPC frameworks have been built on top of all the encodings mentioned in this chapter.
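
Problem 1 is the nastiest in practice: when a network request times out, you cannot tell whether the remote action actually happened. A sketch using the third-party requests library (the URL and payload are hypothetical):

    import requests

    try:
        # Unlike a local call, this can fail or hang for reasons that have
        # nothing to do with the function being invoked
        resp = requests.post("https://api.example.com/charge",
                             json={"amount": 42}, timeout=1.0)
    except requests.exceptions.Timeout:
        # Did the charge go through? We don't know: the request may have been
        # lost, or only the response was. Blind retries risk doing the action
        # twice, which is why idempotence matters for RPC-style APIs.
        pass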

  • Data encoding and evolution for RPC

    • For evolvability, it is important that RPC clients and servers can be changed and deployed independently.
    • The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
      • Thrift, gRPC (Protocol Buffers), and Avro RPC can be evolved according to the compatibility rules of the respective encoding format
      • In SOAP, requests and responses are specified with XML schemas.
      • RESTful APIs most commonly use JSON (without a formally specified schema). Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility.
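
For a RESTful JSON API, compatibility is therefore largely a discipline in the reading code: ignore fields you don’t recognize, and supply defaults for fields that are missing. A sketch (the field names are made up):

    def parse_user(response_json: dict) -> dict:
        """Tolerant reader: works against both older and newer servers."""
        return {
            "user_name": response_json["user_name"],  # present since v1
            # Added in a later version of the API: default it when an
            # older server omits it (backward/forward compatibility)
            "favorite_number": response_json.get("favorite_number", 0),
            # Any extra fields sent by a NEWER server are simply ignored
        }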

4.2.3 Message-Passing Dataflow

  • We will briefly look at asynchronous message-passing systems

    • similar to RPC in that a client’s request is delivered to another process with low latency.
    • the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called a message queue)
  • Using a message broker has several advantages compared to direct RPC:

    1. can act as a buffer if the recipient is unavailable or overloaded

    2. can automatically redeliver messages to a process that has crashed => prevents messages from being lost

    3. avoids the sender needing to know the IP address and port number of the recipient

    4. allows one message to be sent to several recipients

    5. logically decouples the sender from the recipient (the sender just publishes messages and doesn’t care who consumes them).

  • Message Brokers : RabbitMQ, ActiveMQ, HornetQ, Apache Kafka => compared in more detail in Chapter 11
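
A minimal sketch of the sender’s side of this pattern, using the pika client for RabbitMQ (the broker address and queue name are assumptions):

    import pika

    # The sender knows only the broker and a queue name -- not the
    # recipient's IP address/port, nor whether the recipient is running
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="tasks", durable=True)  # broker buffers messages

    # Publish and forget: one-way and asynchronous, unlike an RPC call
    channel.basic_publish(exchange="", routing_key="tasks", body=b"encoded message")
    conn.close()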




4.3 Summary

  • We discussed several data encoding formats and their compatibility properties :

    • Programming language specific encodings
    • Textual formats like JSON, XML, and CSV
    • Binary schema-driven formats like Thrift, Protocol Buffers, and Avro
  • We also discussed several modes of data flow

    • Databases : the process writing to the database encodes the data and the process reading from the database decodes it
    • RPC and REST APIs : the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
    • Asynchronous message passing : processes communicate by sending each other messages that are encoded by the sender and decoded by the recipient
  • We can conclude that, with a bit of care, backward/forward compatibility and rolling upgrades are quite achievable.