Data flow - what functional programming and Unix philosophy can teach us about data streaming

What does stream processing have in common with functional programming and Unix? At first glance, it seems that those things are entirely unrelated. However, when we look closer at the principles that govern functional programming and writing software for Unix, we see some similarities.

At the beginning of this article, I describe the desired properties of a data streaming workflow. Later, I write about the mindset of functional programming and the Unix rules of software. In the end, I show how to apply all of those ideas to implementing data streaming jobs that are easy to maintain.

Benefits of stream workflows

In the paper “Kafka, Samza and the Unix Philosophy of Distributed Data,” Martin Kleppmann and Jay Kreps describe the benefits of stream and batch workflows. Those benefits closely resemble the results we get when we write code in a functional style.

First of all, they write about the ability to have multiple consumers of every step of the workflow. In the case of stream processing, several jobs can read the same input without affecting each other. Similarly, languages designed with functional programming in mind - like Scala - give us data structures that are immutable by default, which makes sharing data across threads straightforward because the input data never changes.
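
Here is a minimal Scala sketch of that idea; the PageView type and the two derived views are made up for illustration. Because the list is immutable, any number of consumers can read it without affecting each other.

```scala
// Immutable input data: nobody can modify it, so sharing it is safe.
case class PageView(userId: String, url: String)

val events: List[PageView] = List(
  PageView("alice", "/home"),
  PageView("bob", "/pricing")
)

// Two independent "consumers" derive different views of the same input;
// neither affects the other because `events` never changes.
val visitedUrls: List[String] = events.map(_.url)
val viewsPerUser: Map[String, Int] =
  events.groupBy(_.userId).map { case (user, views) => user -> views.size }
```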

Additionally, the authors of the paper describe advantages such as well-defined team interfaces and loose coupling. They say that the ability to define the structure of the input and output parameters lets teams of programmers work independently from each other. The only coupling between the tasks in our backlog and the code we write is the input and output types.

Likewise, languages such as Scala make it possible to define the business domain as data types in the code. We can use static typing to our advantage by implementing the domain knowledge as data types. I talked about this use of Scala in my talk at the LambdaDays conference.
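
A minimal sketch of what I mean, with a hypothetical order domain; the type names are mine, not from any real project:

```scala
// The domain knowledge lives in the types: an order id is not just a String,
// and a payment method can only be one of the listed cases.
sealed trait PaymentMethod
case object Card extends PaymentMethod
case object BankTransfer extends PaymentMethod

final case class OrderId(value: String) extends AnyVal
final case class Order(id: OrderId, amountCents: Long, payment: PaymentMethod)

// The compiler rejects an OrderId where a plain String is expected and
// forces every match on PaymentMethod to handle all of its cases.
def describe(order: Order): String = order.payment match {
  case Card         => s"Order ${order.id.value} paid by card"
  case BankTransfer => s"Order ${order.id.value} paid by bank transfer"
}
```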

Last but not least, the stream processing workflows allow failure recovery by re-running the failed steps. Data immutability and pure functions allow us to achieve the same results in functional programming by writing code that is 100% deterministic and depends only on the input data.

The mindset of functional programming

A few years ago, Jessica Kerr gave a talk, “Functional principles for object-oriented development.” In the talk, she described what she calls the mindset of functional programming. According to Jessica Kerr, the concept is all about “letting go of the knowledge when our code will execute, or even how many times, or by whom, or on what thread.”

Functional programming is about focusing on what is relevant to the problem and expressing it as a series of data transformations. Jessica Kerr describes pure functions as “data-in data-out” functions. A data transformation and a “data-in data-out” function are conceptually the same thing.

Pure functions don’t let you access the global state. They don’t let you modify the global state either. You can’t even modify the input of those functions. The only thing allowed is to use the input parameters to return some output.
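
A tiny Scala example of the difference; both functions are made up for illustration:

```scala
// Impure: reads and mutates state that lives outside the function.
var runningTotal: Long = 0L
def addToTotal(amountCents: Long): Unit = { runningTotal += amountCents }

// Pure: everything it needs comes in as parameters,
// everything it produces goes out as the return value.
def addCents(totalCents: Long, amountCents: Long): Long =
  totalCents + amountCents
```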

At first glance, this looks limiting. There are tons of things we are not allowed to do even though modifying the global state is the sole reason why we run the code in the first place!

So what is wrong? Nothing. The concept is all about pushing the side effects - global state modification, writing to databases, calling APIs, and so on - to the edges of your code. We should build the whole program as a series of functions in which only the first and the last function in the chain are not pure. Doesn’t that resemble stream processing?
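
A minimal sketch of that shape, assuming a file-based input and output just for illustration; readLines and writeLines are hypothetical stand-ins for whatever the real edges are:

```scala
import scala.io.Source
import java.io.PrintWriter

def readLines(path: String): List[String] = {
  val source = Source.fromFile(path)
  try source.getLines().toList finally source.close() // impure edge: file IO
}

def writeLines(path: String, lines: List[String]): Unit = {
  val writer = new PrintWriter(path)
  try lines.foreach(line => writer.println(line)) finally writer.close() // impure edge: file IO
}

// The core of the program is a chain of pure functions that only transform data.
val normalize: String => String = _.trim.toLowerCase
val isValid: String => Boolean = _.nonEmpty
val pipeline: List[String] => List[String] = _.map(normalize).filter(isValid)

def run(in: String, out: String): Unit =
  writeLines(out, pipeline(readLines(in))) // side effects only at the edges
```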

Pure functions give us a tremendous advantage because when we express the core of the business logic - our most crucial code - as a series of pure functions, we get code that is 100% testable and easy to understand.

When the business knowledge gets encoded as specific types, every function creates an isolated bounded context that deals with only one aspect of the problem. When all of those pure functions are composed to create a single flow of control, we get application code that reads like a lesson in the business domain.

Unix philosophy

The last paragraph sounds idealistic, but we have been doing it for decades. After all, the command line software available in Unix was built upon the Unix Philosophy. The programmers who wrote those tools tried to make each program do one thing well and expected that the output of their program would be used as the input of another program. That is why doing something useful in a Linux or macOS shell requires chaining calls to many specialized programs.

The Unix Rule of Modularity tells us to write simple parts connected by interfaces. Doesn’t that resemble writing simple functions with strictly defined input and output types? The Unix Rule of Composition instructs us to design programs to be connected to other programs - just like function composition, or splitting complicated stream processing code into a few simpler stages. The Unix Rule of Separation goes back to Jessica Kerr’s functional mindset: it tells us to separate policy from mechanism and interfaces from engines.
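
In Scala, andThen plays the role of the Unix pipe. The word-counting functions below are only an illustration of the analogy:

```scala
// Each function does one thing; andThen chains them like a shell pipe.
val splitWords: String => List[String] = _.split("\\s+").toList
val countWords: List[String] => Map[String, Int] =
  _.groupBy(identity).map { case (word, occurrences) => word -> occurrences.size }
val topWord: Map[String, Int] => Option[String] =
  counts => if (counts.isEmpty) None else Some(counts.maxBy(_._2)._1)

// Roughly the shape of: cat file | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -1
val mostFrequentWord: String => Option[String] =
  splitWords andThen countWords andThen topWord
```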

Functional stream processing

What would stream processing code look like if we applied the principles of functional programming to it?

First, we would have no side effects in the intermediate steps of the pipeline. That means no API calls and no database lookups are allowed; we can use only the messages from the input stream.

Because of that, we should push IO to the boundaries of the stream. We can do it by making a copy of the external data in a separate stream joined with the processed messages. Of course, this requires refreshing the data snapshot periodically, but without the snapshot we would have to make those API calls while processing every message anyway.
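
Here is a simplified sketch of the idea, with plain Scala collections standing in for real streams; the Click, UserSnapshot, and loadUserSnapshot names are hypothetical:

```scala
final case class Click(userId: String, url: String)
final case class UserSnapshot(userId: String, country: String)
final case class EnrichedClick(userId: String, url: String, country: String)

// Impure edge: refreshed periodically, for example by a separate job that
// re-publishes the reference data to its own stream.
def loadUserSnapshot(): Map[String, UserSnapshot] =
  Map("alice" -> UserSnapshot("alice", "PL"))

// Pure core: the join is an ordinary function of the message and the snapshot.
def enrich(click: Click, users: Map[String, UserSnapshot]): Option[EnrichedClick] =
  users.get(click.userId).map(u => EnrichedClick(click.userId, click.url, u.country))

def process(clicks: Iterator[Click]): Iterator[EnrichedClick] = {
  val snapshot = loadUserSnapshot()                    // taken once, not once per message
  clicks.flatMap(c => enrich(c, snapshot).iterator)    // no API calls inside the loop
}
```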

The intermediate functions of the stream processing code should be pure. I think it is the right approach because the initial and terminal steps of a stream processing pipeline tend to deal with technical details (writing data to databases, sending emails, calling APIs), while the intermediate steps implement the business logic.

As we already know, using pure functions to implement business logic gives us 100% deterministic testability.
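
For example, assuming a made-up discount rule, a pure pricing function can be tested with nothing but plain assertions:

```scala
// Depends only on its inputs, so the result is the same on every run.
def applyDiscount(amountCents: Long, loyaltyYears: Int): Long =
  if (loyaltyYears >= 5) amountCents * 90 / 100 else amountCents

// No mocks, no database, no stream infrastructure; any test framework would do.
def testApplyDiscount(): Unit = {
  assert(applyDiscount(10000, loyaltyYears = 7) == 9000)
  assert(applyDiscount(10000, loyaltyYears = 1) == 10000)
}
```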

In addition to that, we must remember that errors are data too! Programmers who have spent a lot of time writing object-oriented code tend to deal with errors by throwing exceptions. Throwing an exception breaks the flow of control and should be avoided.

In languages such as Scala, we have specialized types like Try and Either, which allow us to pass errors as data. Using a specialized type makes the possible problems explicit and allows us to redirect the errors to a separate stream and let a specialized function deal with them.
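
A minimal sketch of the idea: validation returns an Either instead of throwing, and partitionMap splits the results into an error branch and a success branch. The order types are illustrative:

```scala
final case class RawOrder(id: String, amountCents: Long)
final case class InvalidOrder(raw: RawOrder, reason: String)
final case class ValidOrder(id: String, amountCents: Long)

// The possible failure is visible in the return type instead of hidden in a throw.
def validate(raw: RawOrder): Either[InvalidOrder, ValidOrder] =
  if (raw.amountCents <= 0) Left(InvalidOrder(raw, "non-positive amount"))
  else Right(ValidOrder(raw.id, raw.amountCents))

val input = List(RawOrder("a", 1200), RawOrder("b", -5))

// partitionMap sends the Lefts to the "error stream" and the Rights to the main flow.
val (errors, valid) = input.partitionMap(validate)
// errors: List[InvalidOrder], valid: List[ValidOrder]
```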

The most important lesson we can learn from functional programming is function composition. We must not write a colossal message processing function that sits in the middle of the stream processing code.

Instead, we should split it into a series of simpler functions. In my opinion, the right way to implement stream processing is to write a tiny function that does only one step of the data processing and passes the data on to another stream.
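
Sketching that idea with made-up order-processing steps: each tiny function below could publish its result to its own intermediate stream, and the whole stage is just their composition:

```scala
final case class Parsed(id: String, amountCents: Long, country: String)
final case class Priced(id: String, grossCents: Long)
final case class Receipt(id: String, text: String)

// Each step does one thing and hands its output to the next one.
val parse: String => Option[Parsed] = line => line.split(',') match {
  case Array(id, amount, country) => amount.toLongOption.map(Parsed(id, _, country))
  case _                          => None
}
val price: Parsed => Priced =
  p => Priced(p.id, p.amountCents + p.amountCents * 23 / 100) // illustrative tax rate
val render: Priced => Receipt =
  p => Receipt(p.id, s"Order ${p.id}: ${p.grossCents} cents")

// The whole stage is just the composition of the small steps.
def processLine(line: String): Option[Receipt] =
  parse(line).map(price andThen render)
```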

Conclusion

Of course, this approach is not a silver bullet that solves all of our problems. On the one hand, we can parallelize work by letting many people work on the intermediate functions at the same time.

When we implement the business logic as a series of functions, we split the processing into fine-grained steps. As a result, not only is it easier to modify the code, but we also get more useful data about its performance.

When we replace one big function with stream composition, stream processing instrumentation will show us which of the intermediate functions is slow. Because of that, we can scale up that tiny part of the process by adding more partitions to the intermediate stream containing slowly processed messages.

When we use the suggested implementation, we can monitor, debug, and test every single step separately. Because of that, the code adheres to the Unix Rule of Transparency, which says that we should design for visibility to make inspection and debugging easier.

On the other hand, the solution I suggested leads to a situation in which we have many streams with intermediate messages. Such implementations need to be thoroughly documented. Otherwise, the onboarding of new programmers is going to turn into a nightmare.

In addition to wasted training time, we may also waste a lot of processing time doing serialization and deserialization of messages.

As usual, it is all about making tradeoffs. I think we can’t go wrong by following the functional programming principles, because it is much easier to combine two functions into one when data serialization becomes a bottleneck than to split an existing, oversized function into smaller steps.

Sources

Martin Kleppmann and Jay Kreps, “Kafka, Samza and the Unix Philosophy of Distributed Data”

Jessica Kerr, “Functional principles for object-oriented development”
