Trying to apply a MapReduce concept to synchronization
I am trying to figure out whether the MapReduce concept could be adapted to a synchronization process (mostly on the server side), i.e. whether we can find independent, parallelizable work units. My first thought is yes. Below is a first high-level draft, which I will keep working on and refining.
- Coarse level: each data section (or database) could be an independent work unit
- This could be problematic for interdependent sections, such as a container (group, folder, etc.) referencing a new item in another data section. Possible solutions: parallelize only the independent sections, or use an optimistic approach and let the reduce step finalize any pending identifier mapping
- Pretty coarse level, but easy to implement
- Fine level: each data section item could be an independent work unit
- This would provide an easy way to parallelize the core of the synchronization process (duplicate detection, merging, field-level conflicts, etc.)
- An improvement would be to put n items in each work unit, with n calculated dynamically from resource availability and consumption (memory usage, number of threads, network latency, etc.), which have opposite effects on n
- The reduce part aggregates the intermediate results of the synchronization process to produce the result of the overall sync process
- It would not be easy to find a solution that works for any type of synchronization protocol, such as OMA DS (a.k.a. SyncML) or FeedSync
- Perform a double-stage MapReduce with both coarse and fine levels
The MapReduce concept could easily be applied at the server instance level, which would result in parallel threads, but also at the cluster level, which is the real intent of MapReduce:
- At the server instance level, the changes needed to integrate a MapReduce programming model should be easy: wrap the core synchronization process
- At the cluster level, the changes are more dramatic, as they impact the intrinsic architecture and the deployment
- We would have specialized nodes (processes or server instances) such as map, synchronization, reduce, etc.
- The protocol and session handling would be built around the MapReduce architecture
A synchronization process always involves data and data storage, either through direct access to persistent storage (database, etc.) or through a higher-level service managing the data storage access:
- The specifics of the data storage (latency, atomicity, transactions, etc.) would also directly impact the granularity of the map step
- In synchronization, performance almost always depends on data storage performance, which therefore has a direct impact on the design of the MapReduce integration
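To make the draft a bit more concrete, here is a minimal Python sketch of the server instance level variant: one group of work units per data section (coarse), n items per work unit (fine), run on a thread pool, with a reduce step that finalizes cross-section references. Every name in it (`SyncResult`, `chunk`, `sync_unit`, `reduce_results`, `synchronize`) is hypothetical, the map step is only a placeholder for the real synchronization core, and it assumes client identifiers are unique across sections. It ignores protocol specifics (OMA DS, FeedSync) and data storage entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SyncResult:
    """Intermediate result of one work unit (hypothetical structure)."""
    merged: list = field(default_factory=list)      # synchronized items
    id_mapping: dict = field(default_factory=dict)  # client id -> server id


def chunk(items, n):
    """Split a list into work units of at most n items (fine level)."""
    return [items[i:i + n] for i in range(0, len(items), n)]


def sync_unit(section, items):
    """Map step: synchronize one work unit.

    Placeholder for the real core (duplicate detection, merging,
    field-level conflicts); here it only assigns a server identifier.
    """
    result = SyncResult()
    for item in items:
        server_id = f"{section}:{item['id']}"
        result.id_mapping[item["id"]] = server_id
        result.merged.append({**item, "id": server_id})
    return result


def reduce_results(results):
    """Reduce step: aggregate intermediate results, then resolve
    cross-section references with the complete identifier mapping
    (the optimistic approach)."""
    total = SyncResult()
    for r in results:
        total.merged.extend(r.merged)
        total.id_mapping.update(r.id_mapping)
    for item in total.merged:
        parent = item.get("parent")
        if parent in total.id_mapping:
            item["parent"] = total.id_mapping[parent]
    return total


def synchronize(sections, n=2, workers=4):
    """Double stage: sections (coarse) split into units of n items (fine)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(sync_unit, name, unit)
            for name, items in sections.items()
            for unit in chunk(items, n)
        ]
        # Collect in submission order so the result is deterministic.
        return reduce_results([f.result() for f in futures])


if __name__ == "__main__":
    sections = {
        "contacts": [{"id": "c1", "parent": "g1"}, {"id": "c2"}, {"id": "c3"}],
        "groups": [{"id": "g1"}],
    }
    for item in synchronize(sections).merged:
        print(item)
```

Here n is a fixed parameter; computing it dynamically from resource availability and consumption, and moving from threads to cluster nodes, are exactly the parts that still need to be worked out.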
And now let the fun begin: working out all the details and, most importantly, validating the feasibility.