Data should not be represented as data classes

In Java we usually represent maps as data classes, but doing this incurs some hidden costs.


As a fan of functional programming, and Clojure in particular, I've been thinking about how to articulate why I feel that languages like Java are so much harder to work with. This time I'll try to highlight some issues with data classes, which are a common way of working with data in the Java world.

As a disclaimer, I'll add that most programming languages allow for different styles of programming, so we're not "doomed" to work this way even in Java. But a language is more than just its features; it also includes norms, standards and design philosophies that shape the way we think.

With that out of the way...

The problem with data classes

In Java, "objects and classes" is the hammer that turns every problem into a nail. For convenience, we typically use classes to represent JSON-objects as opposed to using maps (a closer equivalent). This has some significant drawbacks that we typically don't consider.

On the bright side, it's a decent way of coercing data to fit the application's view of the world, and JSON-libraries such as Jackson also double as data validators. We also benefit from code completion, and refactoring becomes slightly easier.

On the other hand, by turning JSON-objects into classes we also need to decide what to do with fields we don't care about. In other words, our Java-applications usually want "closed maps" where all fields must be known (or ignored).

We also cannot modify the data as easily, for example by adding or removing fields, or by merging different pieces of data together (using the Car and Boat classes from the example below, imagine we wanted to merge a Car and a Boat into a CarBoat). It is also harder to write reusable functions that work with many types of data.
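To make the merging point concrete, here is a minimal sketch of what adding, removing and merging fields looks like when the data is kept as plain maps instead of classes. The `MapMergeSketch` class, the `merge` helper and the sample values are made up for illustration; they are not part of the data model used in the rest of this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MapMergeSketch {

  // Combines two map-shaped records into one new map.
  // If both maps share a key, the value from `second` wins.
  static Map<String, Object> merge(Map<String, Object> first, Map<String, Object> second) {
    Map<String, Object> merged = new LinkedHashMap<>(first);
    merged.putAll(second);
    return merged;
  }

  public static void main(String[] args) {
    // Hypothetical sample values, chosen only for this sketch.
    Map<String, Object> car = Map.of("brand", "Volvo", "wheels", 4);
    Map<String, Object> boat = Map.of("name", "Boaty", "crew", 10);

    Map<String, Object> carBoat = merge(car, boat);
    carBoat.put("amphibious", true); // adding a field needs no new class
    carBoat.remove("wheels");        // and neither does removing one

    System.out.println(carBoat);
  }
}
```

With data classes, the same operation would typically require a new CarBoat class plus code that copies every field across by hand.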

Let's try to illustrate the problem with an example.

Example

The "domain"

We will create a Boat- and Car-class to serve as our data model. Since we want to illustrate how we can reuse code across our Boats and Cars we also create some interfaces.

public interface ThingWithColor {
  public String getColor();
}
public interface ThingWithWeight {
  public double getWeight();
}

public class Car implements ThingWithColor, ThingWithWeight {
  private String brand;
  private String color;
  private double weight;

  public Car(String brand, String color, double weight) {
    this.brand = brand;
    this.color = color;
    this.weight = weight;
  }

  public String getColor() { return color; } 
  public double getWeight() { return weight; }
}

public class Boat implements ThingWithColor, ThingWithWeight {
  public String name;
  public String color;
  public double weight;

  // No-arg constructor, used by Jackson when decoding JSON into a Boat.
  public Boat() {}

  public Boat(String name, String color, double weight) {
    this.name = name;
    this.color = color;
    this.weight = weight;
  }

  public String getColor() { return color; }
  public double getWeight() { return weight; }
}

Code reuse

Next, we create a printColors-function, which works for both Boats and Cars thanks to our interface:

private static <T extends ThingWithColor> void printColors(List<T> things) {
  System.out.println(things.stream()
    .map(ThingWithColor::getColor)
    .collect(Collectors.toList()));
}

But sadly, objects that we don't control (e.g. from libraries) won't implement this interface, so our function will only support our own data classes. Oh well!

Also, for any non-trivial problem it seems we'd be creating a fair amount of interface-bloat with this approach. In this example we have two interfaces for the two fields. But if a function happens to require a combination of fields, do we then make an interface for every combination? Or maybe we use casting to get around the problem and hope for the best?

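// Accepts a list of anything; the casts below fail at runtime with a
// ClassCastException if an element lacks one of the interfaces.
private static void printColorsAndWeights(List<?> things) {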
  System.out.println(things.stream()
    .map(x -> List.of(((ThingWithColor) x).getColor(), ((ThingWithWeight) x).getWeight()))
    .collect(Collectors.toList()));
}

I think the example above illustrates why I feel like Java-functions (or methods, I guess) end up highly tuned to specific classes.
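For completeness, Java's generics do allow a type parameter to be bounded by several interfaces at once (`T extends A & B`), which avoids both the cast and a dedicated interface per combination. The sketch below assumes cut-down copies of the interfaces and of Boat, nested in a hypothetical `IntersectionBoundSketch` class so that it compiles on its own; `describeColorsAndWeights` is likewise a name invented for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

final class IntersectionBoundSketch {

  interface ThingWithColor { String getColor(); }
  interface ThingWithWeight { double getWeight(); }

  // Stand-in for the Boat class above, reduced to the two fields used here.
  static final class Boat implements ThingWithColor, ThingWithWeight {
    private final String color;
    private final double weight;

    Boat(String color, double weight) {
      this.color = color;
      this.weight = weight;
    }

    public String getColor() { return color; }
    public double getWeight() { return weight; }
  }

  // T must implement both interfaces, so no cast is needed in the lambda.
  static <T extends ThingWithColor & ThingWithWeight> List<String> describeColorsAndWeights(
      List<T> things) {
    return things.stream()
        .map(x -> x.getColor() + " " + x.getWeight())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Boat> boats = List.of(new Boat("red", 200000.0));
    System.out.println(describeColorsAndWeights(boats)); // [red 200000.0]
  }
}
```

This is type-safe, but it doesn't change the underlying problem: every class still has to declare each interface up front, so objects from libraries remain out of reach.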

Data loss, or death by specificity?

Though the "schema validation" we get from decoding JSON via the Jackson library is neat, extra fields count as a failed validation/coercion by default:

import com.fasterxml.jackson.databind.ObjectMapper;

var boatsFromJson = new ObjectMapper().readValue("""
  [{"name": "Boaty", "color": "red", "weight": 200000.0, "crew": 10},
   {"name": "Princess", "color": "orange", "weight": 2000000.0, "crew": 20}]
""", Boat[].class);
// UnrecognizedPropertyException: Unrecognized field "crew"

It's a hassle to have to update our data model whenever data entering the system has some extra stuff we don't care about, so we can fix the exception by ignoring unknown properties:

import com.fasterxml.jackson.annotation.*;

@JsonIgnoreProperties(ignoreUnknown = true)
public class Boat implements ThingWithColor, ThingWithWeight { ... }

var boatsFromJson = new ObjectMapper().readValue("""
  [{"name": "Boaty", "color": "red", "weight": 200000.0, "crew": 10},
   {"name": "Princess", "color": "orange", "weight": 2000000.0, "crew": 20}]
""", Boat[].class);
System.out.println(new ObjectMapper().writeValueAsString(boatsFromJson));
// [{"name":"Boaty","color":"red","weight":200000.0},{"name":"Princess","color":"orange","weight":2000000.0}]

Oops, now our application ate the crew!

As far as I can tell we have two options: we can either accept that data is lost because we decided it doesn't exist (which is fine in many cases), or we accept that any additions to our data model might result in pointless maintenance work in parts of the code that don't even make use of the new information anyway.

Sidenote: the term "death by specificity" here refers to Rich Hickey's talk "Clojure Made Simple", which also makes the case that new additions to our domain do not always need an accompanying "data class" (49:06).

Conclusion? Don't use J... err, classes

There are better ways of dealing with this, but in Java we don't have a habit of using Maps to represent data that was a map in the first place. This is understandable because Maps in Java aren't very elegant to work with compared to languages like Clojure or JavaScript.
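For reference, here is roughly what the map-based style looks like in Java itself. This is a sketch under a few assumptions: the maps are built by hand as stand-ins for decoded JSON (to keep the example free of library dependencies), and `MapsAsDataSketch` and `colors` are names invented for this illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class MapsAsDataSketch {

  // Works for any map with a "color" key, whatever else the map carries.
  // A map without that key simply yields null.
  static List<Object> colors(List<Map<String, Object>> things) {
    return things.stream()
        .map(thing -> thing.get("color"))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Hand-written stand-ins for the two boats from the JSON example above.
    List<Map<String, Object>> boats = List.of(
        Map.of("name", "Boaty", "color", "red", "weight", 200000.0, "crew", 10),
        Map.of("name", "Princess", "color", "orange", "weight", 2000000.0, "crew", 20));

    System.out.println(colors(boats));            // [red, orange]
    System.out.println(boats.get(0).get("crew")); // 10, the crew is still there
  }
}
```

It works, and the crew survives, but every value comes back as an Object, so anything beyond printing needs casts or checks. That is a large part of why this style feels clumsy in Java.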

The following example shows the advantage of treating JSON-objects as maps. All of our Clojure-code is untyped, so we would need to add some schema validation separately to reach Java's level of soundness. That is easy to do, though, and now we can also let the data we don't care (or know) about flow through the application.

(require '[clojure.data.json :as json])

; Works on any sequence of maps that have a "color" key
(defn print-colors [things]
  (println (map (fn [x] (x "color")) things)))

(let [boats (json/read-str "[{\"name\": \"Boaty\", \"color\": \"red\", \"weight\": 200000.0, \"crew\": 10}, {\"name\": \"Princess\", \"color\": \"orange\", \"weight\": 2000000.0, \"crew\": 20}]")]
  (print-colors boats)
  (println (json/write-str boats)))
; => (red orange)
; The crew was not eaten:
; => [{"name":"Boaty","color":"red","weight":200000.0,"crew":10},{"name":"Princess","color":"orange","weight":2000000.0,"crew":20}]

And though I'm a Clojurist at heart, I'll admit that TypeScript also has a good story here. The following function's type definition has no knowledge of our cars and boats:

function printColors(listOfColorfulThings: { color: string }[]) {
    console.log(listOfColorfulThings.map(x => x.color))
}