We are working on a project where we need to accept a series of business events related to a common set of users. These events are produced by various external systems. We need to ingest them, detect anomalies (based on shared referential data and format), detect potential fraud (e.g. the same user appearing in different places at the same time), and try to rebuild each user's actions across the various systems in order to detect any suspicious behavior.
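To make the fraud rule concrete, here is a minimal, technology-agnostic sketch of the "same user in different places at the same time" check. The event fields (user_id, location, timestamp) and the 5-minute tolerance are assumptions for illustration, not our actual schema:

```python
from collections import defaultdict
from datetime import timedelta

# Assumed event shape: {'user_id': ..., 'location': ..., 'timestamp': datetime}
WINDOW = timedelta(minutes=5)  # assumed tolerance for "at the same time"

def detect_impossible_presence(events):
    """Flag users seen in two different locations within WINDOW of each other."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user_id"]].append(event)

    suspicious = []
    for user_id, user_events in by_user.items():
        user_events.sort(key=lambda e: e["timestamp"])
        for i, current in enumerate(user_events):
            for later in user_events[i + 1:]:
                if later["timestamp"] - current["timestamp"] > WINDOW:
                    break  # events are sorted, no later event can be closer
                if later["location"] != current["location"]:
                    suspicious.append((user_id, current, later))
    return suspicious
```

In practice, the tolerance and what counts as a "different place" would come from the shared referential data.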

Here is a quick diagram illustrating the current solution:

[Diagram: our data pipeline]

For now, we receive 40 GB of data at the end of each day from these external systems, and we have to analyze it as quickly as possible. In the future, the data will arrive in a more real-time fashion (via an API or small files every 15 minutes, for example).

Our first architectural proposal was to build a sort of data pipeline in which, at each step, the interested services pull the data. It is a custom ETL: we ingest all the data into an Oracle DB, and from there several services pull batches of data to analyze, clean, detect fraud, and so on.
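As an illustration of the "services pulling batches" step, here is a minimal sketch of one such service reading a batch from the Oracle DB with the python-oracledb driver. The connection settings, table and column names (ingested_events, status, etc.) are assumptions, not our real schema:

```python
import oracledb  # python-oracledb driver

# Assumed connection settings, table and column names.
DSN = "dbhost:1521/ORCLPDB1"
BATCH_SIZE = 10_000

def pull_unprocessed_batch():
    """Pull one batch of newly ingested events for a downstream service."""
    with oracledb.connect(user="etl_user", password="***", dsn=DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT event_id, user_id, payload, ingested_at
                FROM ingested_events
                WHERE status = 'NEW' AND ROWNUM <= :batch_size
                """,
                batch_size=BATCH_SIZE,
            )
            return cur.fetchall()
```

Each service would then mark the rows it has processed so the next batch does not pick them up again.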

But as I do some research on this topic, I am wondering whether Kafka Streams, or a solution based on Apache Spark/Flink, would be more appropriate. Could you confirm (or not) whether, in your opinion, we should switch to a solution based on Kafka, Spark, or Flink? And why? On the other hand, a solution based on Kafka or Spark might require a bigger infrastructure, whereas our current batch-oriented solution may need fewer servers. Don't you think?
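For comparison, here is a rough sketch of what the same fraud rule could look like as a daily PySpark batch job; the input path and column names are assumptions, and this is only meant to give an idea of the shape of a Spark-based option, not a recommendation:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-fraud-check").getOrCreate()

# Assumed input path and column names (user_id, location, timestamp).
events = spark.read.parquet("/data/ingested/current_day/")

# Same "one user in several places at the same time" rule:
# more than one distinct location per user within a 5-minute window.
suspicious = (
    events
    .groupBy("user_id", F.window("timestamp", "5 minutes"))
    .agg(F.countDistinct("location").alias("distinct_locations"))
    .filter(F.col("distinct_locations") > 1)
)

suspicious.write.mode("overwrite").parquet("/data/suspicious/current_day/")
```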