bmfg's comments

bmfg · on Jan 10, 2012

Really enjoyed the first chapter. Quick question: how do you deal with filtering out duplicate records (e.g. blog posts/comments) when saving to the batch layer?

nathanmarz · on Jan 10, 2012

Chapter 2 talks about forming a data model for the master dataset. The core idea is that each record should be a "fact" that stands on its own as something true at a moment in time. When you write your batch computations, you should make them work on any set of valid facts. There's nothing wrong with stating the same fact twice, since logically "A and A" is the same as "A". So if you formulate batch computations to work on any valid set of facts, it doesn't matter whether facts are duplicated.
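
To make the "A and A" point concrete, here is a minimal Python sketch. It is not from the book: `CommentFact` and `comments_per_post` are hypothetical names, and collapsing the input into a set before aggregating is just one assumed way to make a computation insensitive to duplicates.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


# Hypothetical fact type: an immutable record that is true at a moment in time.
# frozen=True makes instances hashable, so identical facts compare equal.
@dataclass(frozen=True)
class CommentFact:
    author: str
    post_id: str
    text: str
    timestamp: int


def comments_per_post(facts: Iterable[CommentFact]) -> Counter:
    """Count comments per post, treating the input as a set of facts."""
    # Collapsing to a set means a fact stored twice counts once.
    unique_facts = set(facts)
    return Counter(fact.post_id for fact in unique_facts)
```

With this shape, running the computation over a dataset that contains the same fact several times gives the same answer as running it over the deduplicated dataset, so nothing special has to happen at write time.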