Jeff Merlet

Architecture, Scalability, Web, Mobility, Synchronization 

A Comparison of Approaches to Large-Scale Data Analysis - MapReduce vs. DBMS Benchmarks

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity.

Here is the link to the article (PDF).

Filed under  //   Benchmark   Database   MapReduce   Performance  

Trying to apply a MapReduce concept to synchronization

I am trying to figure out whether the MapReduce concept could be adapted to a synchronization process (mostly on the server side), i.e. whether we can find independent, parallelizable work units. My first thought is yes. Below is my first high-level draft, which I will keep working on and refining.

  • Coarse level: each data section (or database) could be an independent work unit
    • This could be problematic for interdependent sections, such as a container (group, folder, etc.) referencing a new item in another data section. Possible solutions: parallelize only the independent sections, or take an optimistic approach and let the reduce step finalize any identifier mappings
    • Pretty coarse, but easy to implement
  • Fine level: each data section item could be an independent work unit
    • Would provide an easy way to parallelize the core of the synchronization process (duplicate detection, merging, field-level conflicts, etc.)
    • An improvement would be to put n items in each work unit, where n is calculated dynamically from resource availability and consumption (memory usage, number of threads, network latency, etc.), which pull in opposite directions
  • The reduce part aggregates the intermediate results of the synchronization process to produce the result of the overall sync
    • It would not be easy to find a solution that works for every synchronization protocol, such as OMA DS (a.k.a. SyncML) or FeedSync
  • Perform a two-stage MapReduce using both the coarse and fine levels
  • The MapReduce concept could easily be applied at the server instance level, which would result in parallel threads, but also at the cluster level, which is the real intent of MapReduce
    • At the server instance level, integrating a MapReduce programming model should be easy: wrap the core synchronization process
    • At the cluster level, the changes are more dramatic, as they impact the intrinsic architecture and the deployment
    • We would have specialized nodes (processes or server instances) such as map, synchronization, reduce, etc.
  • The protocol and session handling would sit around the MapReduce architecture
  • A synchronization process always involves data and data storage, either through direct access to persistent storage (database, etc.) or through a higher-level service managing access to the data storage
    • The specifics of the data storage (latency, atomicity, transactions, etc.) would also directly impact the granularity of the map step
    • In synchronization, performance almost always depends on the performance of the data storage, so it has a direct impact on how MapReduce is integrated
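
To make the fine-level idea above a bit more concrete, here is a minimal Python sketch at the server instance level (parallel threads). Everything in it is an assumption made for illustration: the names `SyncResult`, `batch_size`, `map_batch`, `reduce_results` and `synchronize` are hypothetical, items are modelled as `(value, version)` pairs, and the batch size heuristic only looks at the worker count instead of real memory or latency measurements. It is not tied to OMA DS or FeedSync, and the map step only reads the server state, so the data storage concerns listed above are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SyncResult:
    """Intermediate or final outcome of a sync (hypothetical structure)."""
    added: list = field(default_factory=list)
    updated: list = field(default_factory=list)
    conflicts: list = field(default_factory=list)


def batch_size(item_count, workers, max_batch=100):
    """Pick n, the number of items per work unit.

    Assumption: a stand-in for a real heuristic based on memory usage,
    thread count, network latency, etc. Here n only spreads the items
    evenly over the workers, capped at max_batch.
    """
    if item_count == 0:
        return 1
    per_worker = -(-item_count // workers)  # ceiling division
    return max(1, min(max_batch, per_worker))


def make_batches(items, n):
    """Split the item list into work units of at most n items."""
    return [items[i:i + n] for i in range(0, len(items), n)]


def map_batch(batch, server_store):
    """Map step: compare one work unit with the server state (read-only)."""
    result = SyncResult()
    for item_id, (value, base_version) in batch:
        server = server_store.get(item_id)
        if server is None:
            result.added.append(item_id)       # unknown to the server
        elif server[1] != base_version:
            result.conflicts.append(item_id)   # server changed in the meantime
        elif server[0] != value:
            result.updated.append(item_id)     # plain client-side update
    return result


def reduce_results(partials):
    """Reduce step: aggregate the intermediate results into one result."""
    total = SyncResult()
    for partial in partials:
        total.added.extend(partial.added)
        total.updated.extend(partial.updated)
        total.conflicts.extend(partial.conflicts)
    return total


def synchronize(client_changes, server_store, workers=4):
    """Run map over the work units in parallel threads, then reduce."""
    items = sorted(client_changes.items())
    batches = make_batches(items, batch_size(len(items), workers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns the results in the same order as the batches
        partials = list(pool.map(lambda b: map_batch(b, server_store), batches))
    return reduce_results(partials)
```

Calling `synchronize(client_changes, server_store)` with two dictionaries that map item identifiers to `(value, version)` pairs returns a single `SyncResult`. A coarse-level stage could wrap this by calling `synchronize` once per data section and reducing those results again, which is where the identifier mapping between sections would have to be finalized.
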
And now let the fun begin: working out all the details and, most importantly, validating the feasibility.

Filed under  //   architecture   MapReduce   synchronization  
