4.0 Introduction
- Applications inevitably change over time (features are added or modified as new products are launched)
- A change to an app's features usually also requires a change to the data that it stores
- This means that old and new versions of the code, and old and new data formats, may potentially all coexist in the system at the same time
- Backward Compatibility : newer code can read data written by older code => easy
- Forward Compatibility : older code can read data written by newer code => can be trickier => requires older code to ignore additions made by newer code
- In this chapter,
- We will look at several formats for encoding data
- We will then discuss how those formats are used for data storage and for communication
4.1 Formats for Encoding Data
- Programs usually work with data in at least two different representations :
- In-memory data : kept in objects, structs, lists, arrays, hash tables, trees, and so on
- Encoded data : When you want to write data to a file or send it over the network, you have to encode it as some kind of self-contained sequence of bytes (e.g. JSON)
- Thus, we need some kind of translation between the two representations :
- Encoding (serialization or marshalling) : translation from the in-memory representation to a byte sequence
- Decoding (parsing, deserialization, unmarshalling) : translation from a byte sequence to the in-memory representation
- Below we look at three families of encoding formats : language-specific formats, textual formats (JSON, XML, CSV), and binary variants (Thrift, Protocol Buffers, Avro)
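A minimal sketch of this translation using Python's standard json module (the record fields are illustrative):

```python
import json

# In-memory representation: a plain dict with strings, numbers, lists, etc.
person = {"userName": "Martin", "favoriteNumber": 42, "interests": ["hacking"]}

# Encoding (serialization): in-memory object -> self-contained byte sequence
encoded = json.dumps(person).encode("utf-8")
assert isinstance(encoded, bytes)  # safe to write to a file or send over a socket

# Decoding (deserialization): byte sequence -> in-memory object
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == person
```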
4.1.1 Language Specific Formats
- Many programming languages come with built-in support for encoding in-memory objects into byte sequences (java.io.Serializable in Java, Marshal in Ruby, pickle in Python)
- These are very convenient. However, they also have a number of deep problems :
- The encoding is often tied to a particular programming language, and reading the data in another language is very difficult
- In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes => security problem
  - Pickle docs : "The pickle module is not secure. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling." (https://docs.python.org/3/library/pickle.html)
- Versioning data is often an afterthought in these libraries : as they are intended for quick and easy encoding of data, they neglect the inconvenient problems of forward and backward compatibility
- Efficiency is also often an afterthought (CPU time taken to encode or decode, and the size of the encoded structure)
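A short sketch of what the convenience looks like with Python's built-in pickle module (record contents illustrative); note the security caveat quoted above:

```python
import pickle

person = {"userName": "Martin", "interests": ["daydreaming", "hacking"]}

# Very convenient: arbitrary object graphs serialized with one call...
blob = pickle.dumps(person)

# ...but the bytes are Python-specific, versioning is an afterthought,
# and unpickling untrusted data can execute arbitrary code.
restored = pickle.loads(blob)  # only safe because we produced blob ourselves
assert restored == person
```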
- So far we have looked at language-specific formats; next, we turn to JSON and XML, formats that can be used across many languages
4.1.2 Textual Formats : JSON, XML, and CSV
- standardized encodings that can be written and read by many programming languages
- Widely known, widely supported, and almost as widely disliked
- XML : is often criticized for being too verbose and unnecessarily complicated
- JSON : its popularity is mainly due to its built-in support in web browsers
- CSV : is another popular language-independent format, albeit less powerful
- JSON, XML, and CSV are textual formats (human-readable), but they also have some subtle problems :
- Ambiguity around the encoding of numbers : XML and CSV cannot distinguish between a number and a string; JSON distinguishes strings from numbers, but it doesn't distinguish integers from floating-point numbers => a source of problems with large numbers, since integers above 2^53 cannot be exactly represented by an IEEE 754 double and get rounded when parsed by languages that use floats (such as JavaScript)
- JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don't support binary strings (sequences of bytes without a character encoding) : people get around this limitation by encoding the binary data as text using Base64 => this works, but it's somewhat hacky and increases the data size by 33% (both problems are demonstrated in the sketch after this list)
- There is optional schema support for both XML[11] and JSON[12]. These schema languages are quite powerful and thus quite complicated to learn and implement.
- CSV does not have any schema, so it’s up to the application to define the meaning of each row and column.
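A small demonstration of the two problems above, using only the Python standard library: integers above 2^53 have no exact IEEE 754 double representation (which is how JavaScript parses every JSON number), and Base64 inflates binary data by roughly a third:

```python
import base64

# Numbers: 2**53 + 1 cannot be represented as a double, so a JavaScript
# JSON parser would silently round it (Python's int is exact, so we can see it)
big = 2**53 + 1
assert float(big) != big  # the nearest double is 2**53, not 2**53 + 1

# Binary strings: Base64 encodes every 3 bytes as 4 characters => ~33% larger
binary = bytes(range(256)) * 4          # 1024 bytes of arbitrary binary data
as_text = base64.b64encode(binary)      # now safe to embed in JSON/XML
print(len(as_text) / len(binary))       # ~1.33
```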
- Despite these flaws, JSON, XML, and CSV are good enough for many purposes.
- The difficulty of getting different organizations to agree on anything outweighs most other concerns => it often doesn't matter how pretty or efficient the format is
4.1.3 Binary Variants : Thrift, Protocol Buffers, and Avro
- For data that is used only internally within your organization, there is less pressure to use a lowest-common-denominator encoding format => formats that are more compact or faster to parse become attractive
- Apache Thrift and Protocol Buffers (from Google) are binary encoding libraries based on the same principle :
- Define Schema : Both Thrift and Protocol Buffers require a schema for any data that is encoded.
- schema => code generation tool => produces classes that implement the schema in various programming languages
- call this generated code to encode or decode records of the schema
- e.g. in Thrift, you would describe the schema in the Thrift interface definition language (IDL) :

```
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
```
- Binary variants can handle schema changes (schema evolution) while keeping backward and forward compatibility : in Thrift and Protocol Buffers, every encoded field carries a tag number, so a decoder can simply skip tags it doesn't recognize (a hand-rolled sketch of this idea follows)
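To make the idea concrete, here is a hand-rolled sketch modeled on the Protocol Buffers wire format (not the real protobuf library; only two wire types are handled, and the field names and bytes are illustrative). Because every field is prefixed with a tag number, old code can skip fields added by newer code without understanding them:

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode(buf, known_fields):
    """Decode tag/value pairs, keeping known fields and skipping the rest."""
    record, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:            # varint (e.g. i64)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited (e.g. string)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number in known_fields:
            record[known_fields[field_number]] = value
        # unknown tag numbers (added by newer code) are silently skipped
    return record

# Old code knows only tags 1 (userName) and 2 (favoriteNumber); a newer
# writer also emitted tag 4, which the old reader skips over unharmed.
old_schema = {1: "userName", 2: "favoriteNumber"}
encoded = (bytes([0x0A, 0x06]) + b"Martin"   # field 1, length-delimited
           + bytes([0x10, 0x2A])             # field 2, varint 42
           + bytes([0x22, 0x02]) + b"hi")    # field 4, unknown to old code
print(decode(encoded, old_schema))  # {'userName': b'Martin', 'favoriteNumber': 42}
```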
- Avro : another binary encoding format, interestingly different from Protocol Buffers and Thrift => started as a subproject of Hadoop, as a result of Thrift not being a good fit for Hadoop's use cases [21]
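A hedged sketch of Avro's approach using the third-party fastavro library (assumed installed via `pip install fastavro`; schemas and records are illustrative). Avro has no tag numbers; compatibility comes from resolving the writer's schema against the reader's schema, filling in declared defaults for fields the writer didn't know about:

```python
import io
from fastavro import schemaless_writer, schemaless_reader

writer_schema = {  # schema the (older) writing application used
    "type": "record", "name": "Person",
    "fields": [{"name": "userName", "type": "string"}],
}
reader_schema = {  # newer schema with an added field and a default value
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": "long", "default": 0},
    ],
}

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)
# Backward compatibility: new reader code reads old data by resolving
# the writer's schema against its own reader schema.
print(schemaless_reader(buf, writer_schema, reader_schema))
# {'userName': 'Martin', 'favoriteNumber': 0}
```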
- The Merits of Schemas : although textual data formats such as JSON, XML, and CSV are widespread, binary encodings based on schemas are also a viable option
4.2 Modes of Dataflow
- We said that whenever you want to send some data to another process with which you don’t share memory, you need to encode it as a sequence of bytes.
- We then discussed a variety of different encodings for doing this.
- In the rest of this chapter, we will explore some of the most common ways in which data flows between processes :
- Via database
- Via service calls
- Via asynchronous message passing
4.2.1 Dataflow Through Databases
- In a database, the process that writes to the database encodes the data, and the process that reads from the database decodes it
- **Backward Compatibility** : storing something in the DB => should be readable in the future
- **Forward Compatibility** : it's common for several different processes to be accessing a DB at the same time
- A value in the DB may be written by a newer version of the code, and subsequently read by an older version of the code
- Be careful not to update data written by a newer version of the application : Figure 4-7
- Different values within the database were written at different times ("data outlives code")
  - The application can be entirely replaced with a new version quickly, but the data in the database cannot
  - Rewriting (migrating) existing data into a new schema is possible, but expensive, so most databases instead support simple schema changes, such as adding a new column whose default value is null
  - Schema evolution thus allows the entire database to appear as if it were encoded with a single schema, whether a given record was written five years ago or just now (a sketch of the Figure 4-7 pitfall follows)
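A minimal sketch of the Figure 4-7 pitfall and its remedy, using JSON rows and illustrative field names: when older code does a read-modify-write, it should carry fields it doesn't know about through intact rather than dropping them:

```python
import json

KNOWN_FIELDS = {"userName", "favoriteNumber"}  # all this (older) code knows

def update_favorite_number(raw_row: str, new_number: int) -> str:
    """Read-modify-write that preserves fields written by newer code."""
    record = json.loads(raw_row)
    # Keep unknown keys (e.g. a photoURL added by a newer app version)
    extras = {k: v for k, v in record.items() if k not in KNOWN_FIELDS}
    updated = {"userName": record["userName"],
               "favoriteNumber": new_number,
               **extras}  # newer fields survive the rewrite
    return json.dumps(updated)

row = '{"userName": "Martin", "favoriteNumber": 42, "photoURL": "..."}'
print(update_favorite_number(row, 7))  # photoURL is still there
```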
4.2.2 Dataflow Through Services : REST and RPC
- In some ways, services are similar to databases : they typically allow clients to submit and query data. However, the two differ :
  - Databases allow arbitrary queries using a query language
  - Services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic of the service
- A key design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable
- => we should expect old and new version of servers and clients to be running at the same time
- => so the data encoding used by servers and clients must be compatible across versions of the service API
- Web service : when HTTP is used as the underlying protocol for talking to the service, it is called a web service. Two popular approaches are REST and SOAP :
  - REST : emphasizes simple data formats, using URLs for identifying resources and using HTTP features for cache control, authentication, etc. => has been gaining popularity
  - SOAP : aims to be independent from HTTP and avoids using most HTTP features; comes with a sprawling and complex multitude of related standards => has fallen out of favor in most companies
- The problems with RPCs (Remote Procedure Calls) :
- Technologies for making API requests over a network (web services, EJB, Java RMI, DCOM) are all based on the idea of an RPC
- Although RPC seems convenient at first, the approach is fundamentally flawed (a sketch of the consequences follows this list) :
- A local function call is predictable. A network request is unpredictable : the request or response may be lost due to a network problem.
- It normally takes about the same time to execute a local function, but the latency of a network request is wildly variable
- You can efficiently pass references (pointers) to objects in local memory to a local function, but for a network request, all those parameters need to be encoded into a sequence of bytes
- The client and the service may be implemented in different programming languages, so the RPC framework must translate datatypes from one language into another
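A hedged sketch of what the caller must cope with, using the third-party requests library (assumed installed; the endpoint URL is illustrative): unlike a local call, the client must set timeouts and decide whether retrying is safe, since it cannot tell whether a timed-out request was actually executed:

```python
import requests

def get_person(user_id: str):
    """A remote 'function call' that can fail in ways a local call cannot."""
    for attempt in range(3):                # bounded retries
        try:
            resp = requests.get(
                f"https://api.example.com/person/{user_id}",  # illustrative
                timeout=1.0,    # network latency is wildly variable
            )
            resp.raise_for_status()
            return resp.json()  # parameters and results travel as bytes
        except requests.RequestException:
            continue            # request lost? response lost? we can't tell
    return None                 # an outcome a local function call never has
```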
- Despite all these problems, RPC isn't going away => various RPC frameworks have been built on top of all the encodings mentioned in this chapter
- Data encoding and evolution for RPC :
- For evolvability, it is important that RPC clients and servers can be changed and deployed independently.
- The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
- Thrift, gRPC (Protocol Buffers), and Avro RPC can be evolved according to the compatibility rules of the respective encoding format
- In SOAP, requests and responses are specified with XML schemas.
- RESTful APIs most commonly use JSON (without a formally specified schema). Adding optional request parameters and adding new fields to response objects are usually considered changes that maintain compatibility (sketched below)
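A minimal sketch of that convention with plain JSON handling (the handler name and field names are illustrative): a parameter added in a newer API version gets a default, so requests from older clients still decode correctly:

```python
import json

def handle_create_person(body: bytes) -> dict:
    request = json.loads(body)
    user_name = request["userName"]          # required since v1 of the API
    locale = request.get("locale", "en-US")  # added in v2; optional w/ default
    return {"userName": user_name, "locale": locale}

print(handle_create_person(b'{"userName": "Martin"}'))               # old client
print(handle_create_person(b'{"userName": "M", "locale": "ko-KR"}')) # new client
```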
4.2.3 Message-Passing Dataflow
- We will briefly look at asynchronous message-passing systems :
- similar to RPC in that a client’s request is delivered to another process with low latency.
- the message is not sent via a direct network connection, but goes via an intermediary called a message broker (also called message queue)
- Using a message broker has several advantages compared to direct RPC :
  - It can act as a buffer if the recipient is unavailable or overloaded
  - It can automatically redeliver messages to a process that has crashed => prevents messages from being lost
  - It avoids the sender needing to know the IP address and port number of the recipient
  - It allows one message to be sent to several recipients
  - It logically decouples the sender from the recipient (the sender just publishes messages and doesn't care who consumes them)
- Message brokers : RabbitMQ, ActiveMQ, HornetQ, Apache Kafka => compared in more detail in Chapter 11
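A hedged sketch of this dataflow with Apache Kafka via the third-party kafka-python client (assumed installed, with a broker assumed at localhost:9092; the topic name and record are illustrative). The sender only names a topic; it neither knows nor cares which process consumes the message, or when:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Sender: encode the message and publish it to a topic on the broker
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda rec: json.dumps(rec).encode("utf-8"),
)
producer.send("person-updates", {"userName": "Martin", "favoriteNumber": 42})
producer.flush()

# Recipient (possibly a different process, started later): consume and decode
consumer = KafkaConsumer(
    "person-updates",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:   # blocks until a message is delivered
    print(message.value)   # {'userName': 'Martin', 'favoriteNumber': 42}
    break
```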
4.3 Summary
- We discussed several data encoding formats and their compatibility properties :
- Programming language specific encodings
- Textual formats like JSON, XML, and CSV
- Binary schema-driven formats like Thrift, Protocol Buffers, and Avro
- We also discussed several modes of dataflow :
- Databases : the process writing to the database encodes the data and the process reading from the database decodes it
- RPC and REST APIs : the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response
- Asynchronous message passing : nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient
- We can conclude that, with a bit of care, backward/forward compatibility and rolling upgrades are quite achievable