The Future of Data Systems
- In this final chapter, we will shift our perspective toward the future and discuss how things should be, proposing some ideas and approaches that may fundamentally improve the ways we design and build applications.
- The goal of this book was outlined in Chapter 1 : to explore how to create applications and systems that are reliable, scalable, and maintainable.
- In this chapter we will bring all of these ideas together, and build on them to envisage the future.
12.1 Data Integration
- A recurring theme has been that for any given problem, there are several solutions, all of which have different pros, cons, and trade-offs.
- So you inevitably end up having to cobble together several different pieces of software in order to provide your application’s functionality.
12.1.1 Combining Specialized Tools by Deriving Data
- Derived data (log-based) vs distributed transactions :
- In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.
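As a rough illustration of the log-based approach (a sketch only, not any particular system's API), the same ordered change log can be consumed independently to maintain several derived views:

```python
from collections import defaultdict

# The system of record: an append-only, totally ordered log of change events.
event_log = [
    {"offset": 0, "op": "put", "key": "user:1", "value": "Alice"},
    {"offset": 1, "op": "put", "key": "user:2", "value": "Bob"},
    {"offset": 2, "op": "delete", "key": "user:1", "value": None},
]

def apply_to_kv_store(store, event):
    """Derived view 1: a key-value store."""
    if event["op"] == "put":
        store[event["key"]] = event["value"]
    else:
        store.pop(event["key"], None)

def apply_to_index(index, event):
    """Derived view 2: an inverted index from value to keys."""
    for value in list(index):                  # drop any stale entry for this key
        index[value].discard(event["key"])
        if not index[value]:
            del index[value]
    if event["op"] == "put":
        index[event["value"]].add(event["key"])

kv_store = {}
index = defaultdict(set)

# Each consumer applies the same events in the same order, so the derived
# views stay consistent with each other without a distributed transaction.
for event in event_log:
    apply_to_kv_store(kv_store, event)
    apply_to_index(index, event)

print(kv_store)      # {'user:2': 'Bob'}
print(dict(index))   # {'Bob': {'user:2'}}
```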
The limits of total ordering :
- With small systems, constructing a totally ordered event log is entirely feasible.
- However, as systems are scaled toward bigger and more complex workloads, limitations begin to emerge : (p.446, 168, 131, 170)
- Ordering events to capture causality :
- Unfortunately, there does not seem to be a simple answer to this problem
- Perhaps, over time, patterns for application development will emerge that allow causal dependencies to be captured efficiently, and derived state to be maintained correctly, without forcing all events to go through the bottleneck of total order broadcast.
12.1.2 Batch and Stream Processing
- The goal of data integration is to make sure that data ends up in the right form in all the right places
- Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs.
- Batch and stream processors are the tools for achieving this goal.
- The main fundamental difference is that stream processors operate on unbounded datasets, whereas batch processing inputs are of a known, finite size; but these distinctions are beginning to blur.
- Spark performs stream processing on top of a batch processing engine by breaking the stream into microbatches, whereas Apache Flink performs batch processing on top of a stream processing engine
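A toy illustration of the micro-batching idea (plain Python, not the actual Spark or Flink APIs): the same deterministic batch function can be reused on an unbounded stream by chopping it into small bounded batches:

```python
import itertools

def batch_word_count(lines):
    """A deterministic batch job: count words in a bounded input."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def micro_batches(stream, batch_size=3):
    """Chop an unbounded iterator into small bounded batches."""
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch

# A finite stand-in for an unbounded stream of log lines.
stream = iter(["GET /home", "GET /cart", "POST /cart", "GET /home",
               "GET /about", "POST /order", "GET /home"])

# The same batch function processes each micro-batch as it "arrives".
for batch in micro_batches(stream):
    print(batch_word_count(batch))
```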
- Maintaining derived state :
- maintaining derived data => batch processing (deterministic) => well-defined inputs and outputs => good for fault tolerance and reasoning about the dataflows
- Reprocessing data for application evolution :
- batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset
- gradual evolution of the application
- The lambda architecture : how to combine batch process & stream processing
- batch processing is used to reprocess historical data,
- stream processing is used to process recent updates,
- incoming data should be recorded by appending immutable events to an always-growing dataset, similarly to event sourcing (p.457)
- the stream processor consumes the events and quickly produces an approximate update to the view;
- the batch processor later consumes the same set of events and produces a corrected version of the derived view.
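A highly simplified sketch of the lambda pattern described above, with made-up names: events are appended as immutable facts, a speed layer keeps an approximate view incrementally, and a batch layer recomputes a corrected view from the full history:

```python
events = []              # always-growing, immutable event log
speed_view = {}          # approximate view, updated per event
batch_view = {}          # corrected view, recomputed from scratch

def record_event(event):
    events.append(event)
    # Speed layer: incremental, possibly approximate update.
    user = event["user"]
    speed_view[user] = speed_view.get(user, 0) + 1

def run_batch_job():
    # Batch layer: deterministic reprocessing of the whole event history.
    view = {}
    for event in events:
        view[event["user"]] = view.get(event["user"], 0) + 1
    return view

for user in ["alice", "bob", "alice"]:
    record_event({"type": "page_view", "user": user})

batch_view = run_batch_job()
print(speed_view)   # {'alice': 2, 'bob': 1}
print(batch_view)   # same here; in practice the batch view corrects any drift
```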
- Unifying batch and stream processing : More recent work has enabled the benefits of the lambda architecture to be enjoyed without its downsides, by allowing both batch computations (reprocessing historical data) and stream computations (processing events as they arrive) to be implemented in the same system
12.2 Unbundling Databases
- At a most abstract level, databases and operating systems both perform the same functions: they store some data, and they allow you to process and query that data.
- Unix and relational databases have approached the information management problem with very different philosophies.
- Unix is simpler in the sense that it is a fairly thin wrapper around hardware resources
- relational databases are simpler in the sense that a short declarative query can draw on a lot of powerful infrastructure
- In this section I will attempt to reconcile the two philosophies, in the hope that we can combine the best of both worlds.
12.2.1 Composing Data Storage Technologies
- We have discussed various features provided by databases and how they work.
- The meta-database of everything :
- the dataflow across an entire organization starts looking like one huge database
- These days, those functions are provided by various different pieces of software, running on different machines, administered by different teams (instead of a single integrated database product).
- Where will these developments take us in the future?
- Federated databases (unifying reads) : provide a unified query interface to a wide variety of underlying storage engines and processing methods. (PostgreSQL’s foreign data wrapper)
- Unbundled databases (unifying writes) : like unbundling a database’s index-maintenance features so that writes can be synchronized across disparate technologies
- Making unbundling work : keeping the writes to several storage systems in sync
- traditional approach : distributed transaction => the lack of a standardized transaction protocol makes integration much harder.
- log-based integration : An ordered log of events with idempotent consumers => loose coupling between the various components,
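A minimal sketch of an idempotent log consumer (illustrative only, not a real messaging client): by checkpointing the last applied offset, redelivered events have no additional effect, so reprocessing after a crash is safe:

```python
class DerivedStore:
    def __init__(self):
        self.data = {}
        self.last_applied_offset = -1   # a durable checkpoint in a real system

    def apply(self, event):
        # Idempotence: skip anything at or below the checkpoint.
        if event["offset"] <= self.last_applied_offset:
            return
        self.data[event["key"]] = event["value"]
        self.last_applied_offset = event["offset"]

log = [
    {"offset": 0, "key": "a", "value": 1},
    {"offset": 1, "key": "b", "value": 2},
]

store = DerivedStore()
for event in log:
    store.apply(event)

# Simulate redelivery of the whole log (e.g., the consumer restarted from offset 0):
for event in log:
    store.apply(event)

print(store.data, store.last_applied_offset)   # {'a': 1, 'b': 2} 1
```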
12.2.2 Designing Applications Around Dataflow
- In this section, I will explore some ways of building applications around the ideas of unbundled databases and dataflow
- Application code as a derivation function :
- When the function that creates a derived dataset is not a standard cookie-cutter function (like ML, feature extraction) => custom code is required => where many databases struggle.
- Separation of application code and state :
- databases as deployment environments for application code => poorly suited
- it makes sense to have some parts of a system that specialize in durable data storage, and other parts that specialize in running application code.
- YARN, Docker, Kubernetes
- Dataflow: Interplay between state changes and application code :
- Stable message ordering and fault-tolerant message processing are quite stringent demands, but they are much less expensive and more operationally robust than distributed transactions.
- Instead of treating a database as a passive variable that is manipulated by the application, we think much more about the interplay and collaboration between state, state changes, and code that processes them.
- Unbundling the database means taking this idea and applying it to the creation of derived datasets outside of the primary database: caches, full-text search indexes, machine learning, or analytics systems. We can use stream processing and messaging systems for this purpose.
- Stream processors and services
- The currently trendy style of application development involves breaking down functionality into a set of services that communicate via synchronous network requests such as REST APIs : Microservices
- Not only is the dataflow approach faster, but it is also more robust to the failure of another service.
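To make the contrast concrete, here is a hypothetical exchange-rate example (all names invented): in the dataflow style, the purchase code subscribes to a stream of rate updates and reads a local copy, so it keeps working even when the rate service is unreachable:

```python
# Microservice style: every purchase makes a synchronous call to a rate
# service, so the purchase stalls or fails when that service is down.
def query_rate_service(currency_pair):
    raise TimeoutError("rate service unavailable")   # simulate an outage

# Dataflow style: the purchase service subscribes to a stream of rate updates
# and keeps its own local copy, so lookups are local reads.
class LocalRates:
    def __init__(self):
        self.rates = {}

    def on_rate_update(self, event):            # called for each stream event
        self.rates[event["pair"]] = event["rate"]

    def convert(self, amount, pair):
        return amount * self.rates[pair]

local = LocalRates()
local.on_rate_update({"pair": ("USD", "EUR"), "rate": 0.92})

try:
    query_rate_service(("USD", "EUR"))
except TimeoutError as e:
    print("synchronous call failed:", e)

# The purchase path keeps working during the outage; the local rate may be
# slightly stale, but it is fast and available.
print(local.convert(100, ("USD", "EUR")))       # 92.0
```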
12.2.3 Observing Derived State
- The derived dataset is the place where the write path and the read path meet,
- It represents a trade-off between the amount of work that needs to be done at write time and the amount that needs to be done at read time.
- Materialized views and caching (fig12-1)
- ex1) If you didn’t have an index, a search query would have to scan over all documents (like grep), which would get very expensive => No index means less work on the write path (no index to update), but a lot more work on the read path.
- ex2) precomputing the search results for all possible queries => less work to do on the read path
- Usually) precompute the search results for only a fixed set of the most common queries, (cache, materialized view)
- Viewed like this, the role of caches, indexes, and materialized views is simple: they shift the boundary between the read path and the write path
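A small sketch of shifting that boundary (illustrative code, not a real search engine): with no index, all the work happens on the read path; maintaining an inverted index moves most of it to the write path:

```python
documents = {}          # doc_id -> text
index = {}              # word -> set of doc_ids (the precomputed, derived state)

def write(doc_id, text):
    documents[doc_id] = text
    for word in text.split():           # extra work on the write path
        index.setdefault(word, set()).add(doc_id)

def search_without_index(word):
    # All the work happens at read time: scan every document (like grep).
    return {d for d, text in documents.items() if word in text.split()}

def search_with_index(word):
    # Most of the work already happened at write time.
    return index.get(word, set())

write(1, "the future of data systems")
write(2, "data integration with logs")
print(search_without_index("data"))   # {1, 2}
print(search_with_index("data"))      # {1, 2}
```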
- Stateful, offline-capable clients (could drop this if it doesn’t connect with the later part)
- Let’s look at this idea in a different context
- changing capabilities have led to a renewed interest in offline-first applications
- we can think of the on-device state as a cache of state on the server.
- Pushing state changes to clients
- Typical web page : The state on the device is a stale cache that is not updated unless you explicitly poll for changes.
- Recent protocols : the server can actively push messages to the browser as long as it remains connected. (subscribe)
- The same technique works for individual users, where each device is a small subscriber to a small stream of events
- End-to-end event streams
- Recent tools for developing stateful clients and user interfaces, (Facebook’s toolchain of React, Flux, and Redux) already manage internal client-side state by subscribing to a stream of events representing user input or responses from a server, structured similarly to event sourcing
- In order to extend the write path all the way to the end user, we would need to fundamentally rethink the way we build many of these systems
- I think that the advantages of more responsive user interfaces and better offline support would make it worth the effort. If you are designing data systems, I hope that you will keep in mind the option of subscribing to changes, not just querying the current state.
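A toy example of that option (hypothetical classes, not a real framework such as React/Flux/Redux): clients subscribe to a store and have every state change pushed to them, instead of polling for the current value:

```python
class ObservableStore:
    def __init__(self):
        self.state = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def write(self, key, value):
        self.state[key] = value
        change = {"key": key, "value": value}
        for notify in self.subscribers:     # push the change down the read path
            notify(change)

# A client keeps its own local cache up to date from the pushed changes,
# similar to an offline-capable device holding a cache of server state.
client_cache = {}
store = ObservableStore()
store.subscribe(lambda change: client_cache.update({change["key"]: change["value"]}))

store.write("unread_messages", 3)
store.write("unread_messages", 4)
print(client_cache)     # {'unread_messages': 4}
```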
12.3 Aiming for Correctness
- In this section I will suggest some ways of thinking about correctness in the context of dataflow architectures.
12.3.1 The End-to-End Argument for Databases
- Let’s look at a more subtle example of data corruption that can occur.
- Exactly-once execution of an operation
- If something goes wrong while processing a message, you can either give up or try again.
- If you try again => there is the risk that the message ends up being processed twice (charge a customer twice) => idempotence (p.478) => the final effect is the same as if no faults had occurred, even if the operation actually was retried due to some fault
- Duplicate suppression
- The same pattern of duplicates occurs in many other places besides stream processing.
- Operation identifiers
- To make the operation idempotent, it is not sufficient to rely just on a transaction mechanism
- you need to consider the end-to-end flow of the request.
- For example, you could generate a unique identifier for an operation (such as a UUID)
- if the client requests twice => both requests have the same operation ID
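A small sketch of this pattern using SQLite (table and column names are made up): the client-generated operation ID is recorded under a uniqueness constraint in the same transaction as the operation’s effect, so a retry is detected and suppressed:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (request_id TEXT PRIMARY KEY, amount INTEGER)")
db.execute("CREATE TABLE account (balance INTEGER)")
db.execute("INSERT INTO account (balance) VALUES (100)")

def charge(request_id, amount):
    try:
        with db:  # one transaction: record the request ID and apply the effect
            db.execute("INSERT INTO requests (request_id, amount) VALUES (?, ?)",
                       (request_id, amount))
            db.execute("UPDATE account SET balance = balance - ?", (amount,))
        return "charged"
    except sqlite3.IntegrityError:
        return "duplicate suppressed"   # same operation ID seen before

op_id = str(uuid.uuid4())         # generated once, end to end, by the client
print(charge(op_id, 30))          # charged
print(charge(op_id, 30))          # duplicate suppressed (e.g., a retry)
print(db.execute("SELECT balance FROM account").fetchone())  # (70,)
```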
- The end-to-end argument
- This scenario of suppressing duplicate transactions is just one example of a more general principle called the end-to-end argument
- Applying end-to-end thinking in data systems
- It would be really nice to wrap up the remaining high-level fault-tolerance machinery in an abstraction so that application code needn’t worry about it
- But we have not yet found the right abstraction
- I think it is worth exploring fault-tolerance abstractions that make it easy to provide application-specific end-to-end correctness properties,
12.3.2 Enforcing Constraints
- Techniques that enforce uniqueness can often be used for these kinds of constraints as well
- Uniqueness constraints (user name, email address, book…)
- Account balance never goes negative,
- Uniqueness constraints require consensus : if there are several concurrent requests with the same value, the system somehow needs to decide which one of the conflicting operations is accepted, and reject the others as violations of the constraint (Chapter 9)
- Uniqueness in log-based messaging : In the unbundled database approach with log-based messaging, we can use a very similar approach to enforce uniqueness constraints : Total Order Broadcast. (p.348)
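A toy illustration of that approach (not a real message broker): requests that claim the same username are routed to the same log partition, and a sequential consumer processes them in log order, so the first claim deterministically wins:

```python
from collections import defaultdict

NUM_PARTITIONS = 4
partitions = defaultdict(list)     # partition id -> ordered log of requests

def submit_claim(username, client):
    partition = hash(username) % NUM_PARTITIONS
    partitions[partition].append({"username": username, "client": client})

# Conflicting claims for the same username always land in the same partition.
submit_claim("alice", "client-1")
submit_claim("alice", "client-2")
submit_claim("bob", "client-3")

taken = set()
decisions = []
for partition in sorted(partitions):
    for claim in partitions[partition]:        # sequential, in log order
        if claim["username"] in taken:
            decisions.append((claim["client"], "rejected"))
        else:
            taken.add(claim["username"])
            decisions.append((claim["client"], "accepted"))

# client-1 accepted, client-2 rejected, client-3 accepted (list order may vary)
print(decisions)
```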
12.3.3 Timeliness and Integrity
- consistency : conflates two different requirements that are worth considering separately
- Timeliness : means ensuring that users observe the system in an up-to-date state
- Integrity : means absence of corruption; i.e., no data loss, and no contradictory or false data
- In most applications, integrity (e.g., an error in a sum) is much more important than timeliness (e.g., a transaction not yet appearing) : credit card statement example
- Loosely interpreted constraints :
- many real applications can actually get away with much weaker notions of uniqueness (overbook)
- In many business contexts, it is actually acceptable to temporarily violate a constraint and fix it up later by apologizing
- These applications do require integrity: you would not want to lose a reservation, or have money disappear due to mismatched credits and debits.
- Coordination-avoiding data systems
- You cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your needs—the sweet spot where there are neither too many inconsistencies nor too many availability problems.
12.3.4 Trust, but Verify
- All of our discussion of correctness, integrity, and fault-tolerance has been under the assumption that certain things might go wrong, but other things won’t.
- But we might also assume that data written to disk is not lost after fsync, that data in memory is not corrupted, and that the multiplication instruction of our CPU always returns the correct result.
- Such faults are still very rare on modern hardware. I just want to point out that they are not beyond the realm of possibility, and so they deserve some attention.
- Maintaining integrity in the face of software bugs : Despite considerable efforts in careful design, testing, and review, bugs still creep in.
- Don’t just blindly trust what they promise : we should at least have a way of finding out if data has been corrupted so that we can fix it and try to track down the source of the error; don’t just blindly trust that it is all working.
- A culture of verification :
- Systems like HDFS and S3 still have to assume that disks work correctly most of the time—which is a reasonable assumption, but not the same as assuming that they always work correctly.
- self-validating or self-auditing systems : read back files, compare them to other replicas, and move files from one disk to another, in order to mitigate the risk of silent corruption
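A minimal sketch of such self-auditing (hypothetical file layout): store a checksum when writing, then periodically read the data back and verify it, rather than assuming the disk always works:

```python
import hashlib
import os
import tempfile

def write_with_checksum(path, data: bytes):
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(hashlib.sha256(data).hexdigest())

def audit(path) -> bool:
    """Read the file back and compare against the stored checksum."""
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return actual == expected

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "segment-0001")
write_with_checksum(path, b"some important records")
print(audit(path))      # True

# Simulate silent corruption on disk, then detect it on the next audit pass.
with open(path, "r+b") as f:
    f.write(b"X")
print(audit(path))      # False
```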
- Tools for auditable data systems :
- distributed ledger technologies : Bitcoin, Ethereum, Ripple,…
- I could imagine integrity-checking and auditing algorithms, like those of certificate transparency and distributed ledgers, becoming more widely used in data systems in general.
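A toy version of the underlying idea (a teaching sketch, not a production ledger): chaining each log entry to a hash of the previous one makes later tampering with history detectable on audit:

```python
import hashlib
import json

def entry_hash(payload) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def append(log, record):
    prev = log[-1]["hash"] if log else "0" * 64
    entry = {"record": record, "prev_hash": prev}
    entry["hash"] = entry_hash({"record": record, "prev_hash": prev})
    log.append(entry)

def verify(log) -> bool:
    prev = "0" * 64
    for entry in log:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != entry_hash({"record": entry["record"],
                                        "prev_hash": entry["prev_hash"]}):
            return False
        prev = entry["hash"]
    return True

log = []
append(log, {"event": "credit", "account": "a", "amount": 50})
append(log, {"event": "debit", "account": "a", "amount": 20})
print(verify(log))                 # True

log[0]["record"]["amount"] = 500   # tamper with history
print(verify(log))                 # False
```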
12.4 Doing the Right Thing
- Throughout this book we have examined a wide range of different architectures for data systems,
- Finally, let’s take a step back and examine some ethical aspects of building data-intensive applications.
12.4.1 Predictive Analytics
- Algorithmic prison : As algorithmic decision-making becomes more widespread, someone who has been labeled as risky by some algorithm may suffer a large number of those “no” decisions (jobs, insurance coverage, rental, ..)
- Bias and discrimination : If there is a systematic bias in the input to an algorithm, the system will most likely learn and amplify that bias in its output
- Responsibility and accountability : When a self-driving car causes an accident, who is responsible?
- Feedback loops : When services become good at predicting what content users want to see, they may end up showing people only opinions they already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization can breed.
12.4.2 Privacy and Tracking
- Besides the problems of predictive analytics, there are ethical problems with data collection itself => surveillance, privacy, etc.
- As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect.