bmfg's comments

Really enjoyed the first chapter. Quick question: how do you deal with filtering out duplicate records (e.g., blog posts or comments) when saving to the batch layer?


Chapter 2 talks about forming a data model for the master dataset. The core idea is that each record should be a "fact" that stands on its own as something true at a moment in time. When you write your batch computations, you should make them work on any set of valid facts. There's nothing wrong with stating the same fact twice, since logically "A and A" is the same as "A". So if batch computations are formulated to work on any valid set of facts, it doesn't matter whether facts are duplicated.
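
To make that concrete, here is a rough sketch in Python of a batch computation that gives the same answer whether or not facts are duplicated. It isn't taken from the book; the `Fact` fields and the `posts_per_author` function are hypothetical names made up for illustration. The assumption is that a fact carries enough identifying information (here a post id and a timestamp) that two copies of it compare equal.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical fact type: "this author published this post at this time".
# frozen=True makes instances hashable, so equal facts collapse in a set.
@dataclass(frozen=True)
class Fact:
    author: str
    post_id: str
    timestamp: int  # epoch seconds at which the fact was true


def posts_per_author(facts):
    """Count posts per author over any collection of valid facts."""
    # Treat the input as a set of facts: "A and A" is the same as "A".
    distinct = set(facts)
    return Counter(fact.author for fact in distinct)


facts = [
    Fact("alice", "p1", 100),
    Fact("bob", "p2", 105),
    Fact("alice", "p3", 110),
]

# The same dataset with two records written a second time.
with_duplicates = facts + [facts[0], facts[2]]

same_result = posts_per_author(facts) == posts_per_author(with_duplicates)
```

A naive count over the raw list would be thrown off by the repeated records. Treating the input as a set of facts inside the computation is what makes filtering on write unnecessary in this sketch.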

