|
Map/reduce in MongoDB is useful for batch manipulation of data and aggregation operations. It is similar in spirit to using something like Hadoop with all input coming from a collection and output going to a collection. Often, in a situation where you would have used GROUP BY in SQL, map/reduce is the right tool in MongoDB.
Overview
map/reduce is invoked via a database command. The database creates a temporary collection to hold output of the operation. The collection is cleaned up when the client connection closes, or when explicitly dropped. Alternatively, one can specify a permanent output collection name. map and reduce functions are written in JavaScript and execute on the server. Command syntax: db.runCommand(
{ mapreduce : <collection>,
map : <mapfunction>,
reduce : <reducefunction>
[, query : <query filter object>]
[, sort : <sort the query. useful for optimization>]
[, limit : <number of objects to return from collection>]
[, out : <output-collection name>]
[, keeptemp: <true|false>]
[, finalize : <finalizefunction>]
[, scope : <object where fields go into javascript global scope >]
[, verbose : true]
}
);
Result: { result : <collection_name>,
counts : {
input : <number of objects scanned>,
emit : <number of times emit was called>,
output : <number of items in output collection>
} ,
timeMillis : <job_time>,
ok : <1_if_ok>,
[, err : <errmsg_if_error>]
}
A command helper is available in the MongoDB shell : db.collection.mapReduce(mapfunction,reducefunction[,options]); map, reduce, and finalize functions are written in JavaScript. Map FunctionThe map function references the variable this to inspect the current object under consideration. A map function must call emit(key,value) at least once, but may be invoked any number of times, as may be appropriate. function map(void) -> void Reduce FunctionThe reduce function receives a key and an array of values. To use, reduce the received values, and return a result. function reduce(key, value_array) -> value The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function: for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)
If you need to perform an operation only once, use a finalize function.
Finalize FunctionA finalize function may be run after reduction. Such a function is optional and is not necessary for many map/reduce cases. The finalize function takes a key and a value, and returns a finalized value. function finalize(key, value) -> final_value Sharded EnvironmentsIn sharded environments, data processing of map/reduce operations runs in parallel on all shards. ExamplesShell Example 1The following example assumes we have an events collection with objects of the form: { time : <time>, user_id : <userid>, type : <type>, ... }
We then use MapReduce to extract all users who have had at least one event of type "sale": > m = function() { emit(this.user_id, 1); }
> r = function(k,vals) { return 1; }
> res = db.events.mapReduce(m, r, { query : {type:'sale'} });
> db[res.result].find().limit(2)
{ "_id" : 8321073716060 , "value" : 1 }
{ "_id" : 7921232311289 , "value" : 1 }
If we also wanted to output the number of times the user had experienced the event in question, we could modify the reduce function like so: > r = function(k,vals) {
... var sum=0;
... for(var i in vals) sum += vals[i];
... return sum;
... }
Note, here, that we cannot simply return vals.length, as the reduce may be called multiple times. Shell Example 2$ ./mongo
> db.things.insert( { _id : 1, tags : ['dog', 'cat'] } );
> db.things.insert( { _id : 2, tags : ['cat'] } );
> db.things.insert( { _id : 3, tags : ['mouse', 'cat', 'dog'] } );
> db.things.insert( { _id : 4, tags : [] } );
> // map function
> m = function(){
... this.tags.forEach(
... function(z){
... emit( z , { count : 1 } );
... }
... );
...};
> // reduce function
> r = function( key , values ){
... var total = 0;
... for ( var i=0; i<values.length; i++ )
... total += values[i].count;
... return { count : total };
...};
> res = db.things.mapReduce(m,r);
> res
{"timeMillis.emit" : 9 , "result" : "mr.things.1254430454.3" ,
"numObjects" : 4 , "timeMillis" : 9 , "errmsg" : "" , "ok" : 0}
> db[res.result].find()
{"_id" : "cat" , "value" : {"count" : 3}}
{"_id" : "dog" , "value" : {"count" : 2}}
{"_id" : "mouse" , "value" : {"count" : 1}}
> db[res.result].drop()
More Examples
Note on Permanent CollectionsEven when a permanent collection name is specified, a temporary collection name will be used during processing. At map/reduce completion, the temporary collection will be renamed to the permanent name atomically. Thus, one can perform a map/reduce job periodically with the same target collection name without worrying about a temporary state of incomplete data. This is very useful when generating statistical output collections on a regular basis. ParallelismAs of right now, MapReduce jobs on a single mongod process are single threaded. This is due to a design limitation in current JavaScript engines. We are looking into alternatives to solve this issue, but for now if you want to parallelize your MapReduce jobs, you will need to either use sharding or do the aggregation client-side in your code. PresentationsMap/reduce, geospatial indexing, and other cool features - Kristina Chodorow at MongoSF (April 2010) See Also |

PLEASE POST QUESTIONS IN THE USER GROUPS FORUM. Post non-question comments and helpful hints here.
blog comments powered by Disqus