Should I start out with sharded or with a non-sharded MongoDB environment?We suggest starting unsharded for simplicity and quick startup unless your initial data set will not fit on single servers. Upgrading to sharding from unsharded is easy and seamless, so there is not a lot of advantage to setting up sharding before your data set is large. Whether with or without sharding, you should be using replication (replica sets) early on for high availability and disaster recovery. How does sharding work with replication?Each shard is a logical collection of partitioned data. The shard could consist of a single server or a cluster of replicas. We recommmend using a replica set for each shard. Where do unsharded collections go if sharding is enabled for a database?In the current version of MongoDB, unsharded data goes to the "primary" for the database specified (query config.databases to see details). Future versions will parcel out unsharded collections to different shards (that is, a collection could be on any shard, but will be on only a single shard if unsharded). When will data be on more than one shard?MongoDB sharding is range based. So all the objects in a collection get put into a chunk. Only when there is more than 1 chunk is there an option for multiple shards to get data. Right now, the default chunk size is 64mb, so you need at least 64mb for a migration to occur. What happens if I try to update a document on a chunk that is being migrated?The update will go through immediately on the old shard, and then the change will be replicated to the new shard before ownership transfers. What if a shard is down or slow and I do a query?If a shard is down, the query will return an error unless the "Partial" query options is set. If a shard is responding slowly, mongos will wait for it. You won't get partial results by default. How do queries distribute across shards?There are a few different cases to consider, depending on the query keys and the sort keys. Suppose 3 distinct attributes, X, Y, and Z, where X is the shard key. A query that keys on X and sorts on X will translate straightforwardly to a series of queries against successive shards in X-order. This is faster than querying all shards in parallel because mongos can determine which shards contain the relevant chunks without waiting for all shards to return results. A query that keys on X and sorts on Y will execute in parallel on the appropriate shards, and perform a merge sort keyed on Y of the documents found. A query that keys on Y must run on all shards: if the query sorts by X, the query will serialize over shards in X-order; if the query sorts by Z, the query will parallelize over shards and perform a merge sort keyed on Z of the documents found. How do queries involving sorting work?Each shard pre-sorts its results and the mongos does a merge before sending to the client. See the How queries work with sharding PDF for more details. Now that I sharded my collection, how do I <...> (e.g. drop it)?Even if chunked, your data is still part of a collection and so all the collection commands apply. If I don't shard on _id how is it kept unique?If you don't use _id as the shard key then it is your responsibility to keep the _id unique. If you have duplicate _id values in your collection bad things will happen (as mstearn says). Best practice on a collection not sharded by _id is to use an identifier that will always be unique, such as a BSON ObjectID, for the _id field. Why is all my data on one server?Be sure you declare a shard key for your large collections. Until that is done they will not partition. MongoDB sharding breaks data into chunks. By default, the default for these chunks is 64MB (in older versions, the default was 200MB). Sharding will keep chunks balanced across shards. You need many chunks to trigger balancing, typically 2gb of data or so. db.printShardingStatus() will tell you how many chunks you have, typically need 10 to start balancing. Can I remove old files in the moveChunk directory?Yes, these files are made as backups during normal shard balancing operations. Once the operations are done then they can be deleted. The cleanup process is currently manual so please do take care of this to free up space. How many connections does each mongos need?In a sharded configuration mongos will have 1 incoming connection from the client but may need 1 outgoing connection to each shard (possibly times the number of nodes if the shard is backed by a replicaset). This means that the possible number of open connections that a mongos server requires could be (1 + (N*M) * C) where N = number of shards, M = number of replicaset nodes, and C = number of client connections. Why does mongos never seem to give up connections?mongos uses a set of connection pools to communicate to each shard (or shard replicaset node). These pools of connections do not currently constrict when the number of clients decreases. This will lead to a possibly large number of connections being kept if you have even used the mongos instance before, even if it is currently not being used. How can I see the connections used by mongos?Run this command on each mongos instance: db._adminCommand("connPoolStats");
What is writebacklisten in my logs and currentOp()?Writeback listeners are part of the internal communications between shards and config dbs. If you are seeing these in the currentOp or in the slow logs on the server this is part of the normal operation. In particular, the writeback listener is performing long operations, so it can appear in the slow logs even during normal operation. If a moveChunk fails do I need to cleanup the partially moved docs?No, chunk moves are consistent and deterministic; the move will retry and when completed the data will only be on the new shard. Can I move/rename my config servers, or go from one to three?Yes, see Changing Config Servers When do the mongos servers pickup config server changes?The mongos servers have a cache of the config db for sharding metadata (like chunk placement on shards). Periodically, and during specific events, the cache is updated. There is not way to control this behavior from the client. I changed my replicaset configuration, how can I apply these quickly on my mongos servers?The mongos will pick these changes up over time, but it will be faster if you issue a flushRouterConfig command to each mongos directly. What does setting maxConns do on mongos?This limits the number of connections accepted by mongos. If you have a client (driver/application) which creates lots of connections but doesn't close them, letting them timeout, then it might make sense to set this value slightly higher than the maximum number of connections being created by the client, or the connection pool max size. Doing this will keep connection spikes from being sent down to the shards which could cause much worse problems, and memory allocation. |

PLEASE POST QUESTIONS IN THE FORUMS: http://groups.google.com/group/mongodb-user. Post tips and clarifications here.
blog comments powered by Disqus