From 6a86b6a868f572e3cff7349bff1dbe618864fc29 Mon Sep 17 00:00:00 2001
From: Christoph Schubert <18868819211@163.com>
Date: Thu, 30 Mar 2017 09:20:19 +0800
Subject: [PATCH 01/26] Added zh folder

---
 zh/01-intro.md      |   64 ++
 zh/02-concepts.md   |  449 ++++++++++++
 zh/03-developers.md | 1022 +++++++++++++++++++++++++++
 zh/04-operators.md  | 1606 +++++++++++++++++++++++++++++++++++++++++++
 zh/05-app-guide.md  |  658 ++++++++++++++++++
 zh/06-notes.md      |  184 +++++
 zh/prelude.pmd      |   12 +
 7 files changed, 3995 insertions(+)
 create mode 100644 zh/01-intro.md
 create mode 100644 zh/02-concepts.md
 create mode 100644 zh/03-developers.md
 create mode 100644 zh/04-operators.md
 create mode 100644 zh/05-app-guide.md
 create mode 100644 zh/06-notes.md
 create mode 100644 zh/prelude.pmd

diff --git a/zh/01-intro.md b/zh/01-intro.md
new file mode 100644
index 0000000..93a8ebb
--- /dev/null
+++ b/zh/01-intro.md
@@ -0,0 +1,64 @@
# Introduction

## Downtime Roulette

![Gambling With Uptime](../assets/decor/roulette.png)

Picture a roulette wheel in a casino, where any given number has a 1-in-37 chance of being hit. Now imagine you could place a bet that a given number will *not* be hit (the odds are roughly 97.3% in your favor), and that winning pays out 10 times your wager. Would you take that bet? I'd reach for my wallet so fast my thumb would start a fire in my pocket.

Now imagine you could make the same wager again, but you only win if the wheel comes up in your favor 100 spins in a row; otherwise you lose. Would you still play? Winning a single spin may be easy, but over many trials the odds are not in your favor.

People make these kinds of bets with data all of the time. A single server has a good chance of staying available. But when you run a cluster of thousands of servers, or handle billions of requests, the chance of any one of them breaking down becomes the rule.

A one-in-a-million disaster is commonplace in the face of a billion opportunities.

## What is Riak

Riak is an open-source, distributed key/value database built for high availability, fault tolerance, and near-linear scalability. In short, Riak has remarkably high uptime and grows with you.

As the modern world is stitched together by increasingly complex connections, a major shift is underway in information management. The web and the growing number of connected devices have spurred an explosion of data collection and access unseen in the history of the world. The amount of data being stored and managed continues to grow at a staggering rate while, at the same time, more people than ever need fast and reliable access to it. This trend is known as *Big Data*.

<h3>Always Bet on Riak</h3>

Riak shines in handling high *volume* (you can read and write data whenever it's needed), high *velocity* (it easily responds to growth), and a wide *variety* of data (you can store any type of data as a value).

Based on the *Amazon Dynamo* design, Riak was built as a solution to real Big Data problems. Dynamo is a highly available design, meaning that it responds to requests quickly at very large scales, even if your application is storing and serving terabytes of data a day. Riak had been used in production before being open sourced in 2009. It's currently used by Github, Comcast, Voxer, Disqus, and others, with the larger systems storing hundreds of terabytes of data and handling several gigabytes per node daily.

Riak is written in the Erlang programming language. Erlang was chosen for its strong support of concurrency, solid distributed communication, hot code loading, and fault tolerance. It runs on a virtual machine, so running Riak requires an Erlang installation.

So should you use Riak? A good rule of thumb for potential users is to ask yourself whether every moment of downtime will cost you in some way (money, users, and so on). Not all systems require that much uptime, and if yours doesn't, Riak may not be for you.

## About This Book

This is not an "install and follow along" guide. It is a "read and comprehend" guide. Don't feel compelled to have Riak, or even a computer, handy when starting this book. You may feel like installing it at some point, and if so, instructions can be found in the [Riak docs](http://docs.basho.com).

In my opinion, the most important part of this book is the [concepts chapter](#concepts). If you already have some background it may start slow, but stick with it. After laying the theoretical groundwork, we'll help [developers](#developers) use Riak by learning how to query it and tinker with some settings. Finally, we'll go over the basic details that [operators](#operators) should know, such as how to set up a Riak cluster, configure some values, use optional tools, and more.

## New in 2.0

Riak 2.0 represents a major shift in the capabilities and focus of Riak as a data store. Riak has always been primarily focused on operational simplicity, and that has not changed. But when it came to design decisions, operations have historically taken precedence over the needs of developers. That is changing. With the release of 2.0, we've added some features that developers have long wanted to see. Namely, the following:

* __Strong Consistency__. Riak is still eventually consistent, but now you have a choice. Riak is now the easiest database to manage that can smoothly tune each bucket along the spectrum between AP and CP.
* __Better Search__. The makers of Riak have improved search by leveraging the power of the Solr search engine. You now get all of the queryability of distributed Solr, without the hassle of manual indexing.
* __Datatypes__. Riak has historically provided storage flexibility by allowing the storage of any binary object. This is still the case, but you now have the option of storing distributed maps, sets, counters, and flags that automatically converge in the face of conflicts.
* __Security__. A long-standing request whose day has finally come: native group/user access controls.
* __Bucket types__. You can now support unlimited custom bucket properties, without the overhead of the old gossip protocol.
* __Ring resizing__. Finally! Where in the past you were limited to a fixed ring size, you now have the option to dynamically increase or decrease the number of vnodes in your cluster.
* __Other improvements__. We've also made many other improvements, like simplified configuration management (no more fumbling between `app.config` and `vm.args`), reduced sibling explosions (via a new logical clock called DVV), improved internal metadata sharing (reducing gossip chatter), better AAE, and more.

This book also includes a new chapter written by John Daily, to help guide developers in writing productive applications with Riak. We hope you enjoy the new, improved, *Not Quite So Little Riak Book*.

diff --git a/zh/02-concepts.md b/zh/02-concepts.md
new file mode 100644
index 0000000..cdfad78
--- /dev/null
+++ b/zh/02-concepts.md
@@ -0,0 +1,449 @@
# Concepts

Believe me, dear reader, when I suggest that thinking in a distributed fashion is awkward. When I first encountered Riak, I was not prepared for some of its more preternatural concepts. Our brains just aren't hardwired to think in a distributed, asynchronous manner. Richard Dawkins coined the term *Middle World*---the serial, rote land humans encounter every day, which exists between the extremes of the very small strangeness of quarks and the vastness of outer space.

We don't consider these extremes clearly because we don't encounter them on a daily basis, just like distributed computations and storage. So we create models and tools to bring the physical act of scattered parallel resources in line with our more ordinary synchronous terms. While Riak takes great pains to simplify the hard parts, it does not pretend that they don't exist. Just like you can never hope to program at an expert level without any knowledge of memory or CPU management, so too can you never safely develop a highly available cluster without a firm grasp of a few underlying concepts.

## The Landscape

The existence of databases like Riak is the culmination of two basic trends: accessible technology spurring different data requirements, and gaps in the data management market.

First, as we've seen steady improvements in technology along with reductions in cost, vast amounts of computing power and storage are now within the grasp of nearly anyone. Along with our increasingly interconnected world caused by the web and shrinking, cheaper computers (like smartphones), this has catalyzed an exponential growth of data, and a demand for more predictability and speed by savvier users. In other words, more data is being created on the front-end, while more data is being managed on the backend.
+ +Second, relational database management systems (RDBMS) have become focused over the years for a standard set of use-cases, like business intelligence. They were also technically tuned for squeezing performance out of single larger servers, like optimizing disk access, even while cheap commodity (and virtualized) servers made horizontal growth increasingly attractive. As cracks in relational implementations became apparent, custom implementations arose in response to specific problems not originally envisioned by the relational DBs. + +These new databases are collected under the moniker *NoSQL*, and Riak is of its ilk. + +

<h3>Database Models</h3>

Modern databases can be loosely grouped by the ways they represent data. Although I'm presenting 5 major types (the last 4 are considered NoSQL models), these lines are often blurred---you can use some key/value stores as a document store, and you can use a relational database to just store key/value data.

 1. **Relational**. Traditional databases usually use SQL to model and query data.
    They are useful for data which can be stored in a highly structured schema, yet
    require flexible querying. Scaling a relational database (RDBMS) traditionally
    occurs by adding more powerful hardware (vertical growth).

    Examples: *PostgreSQL*, *MySQL*, *Oracle*
 2. **Graph**. These exist for highly interconnected data. They excel in
    modeling complex relationships between nodes, and many implementations can
    handle multiple billions of nodes and relationships (or edges and vertices). I tend to include *triplestores* and *object DBs* as specialized variants.

    Examples: *Neo4j*, *Graphbase*, *InfiniteGraph*
 3. **Document**. Document datastores model hierarchical values called documents,
    represented in formats such as JSON or XML, and do not enforce a document schema.
    They generally support distributing across multiple servers (horizontal growth).

    Examples: *CouchDB*, *MongoDB*, *Couchbase*
 4. **Columnar**. Popularized by [Google's BigTable](http://research.google.com/archive/bigtable.html),
    this form of database exists to scale across multiple servers, and groups similar data into
    column families. Column values can be individually versioned and managed, though families
    are defined in advance, not unlike RDBMS schemas.

    Examples: *HBase*, *Cassandra*, *BigTable*
 5. **Key/Value**. Key/Value, or KV stores, are conceptually like hashtables,
    where values are stored and accessed by an immutable key. They range from
    single-server varieties like *Memcached* used for high-speed caching, to
    multi-datacenter distributed systems like *Riak Enterprise*.

    Examples: *Riak*, *Redis*, *Voldemort*

## Riak Components

Riak is a Key/Value (KV) database, built from the ground up to safely distribute data across a cluster of physical servers, called nodes. A Riak cluster is also known as a ring (we'll cover why later).

Riak functions similarly to a very large hash space. Depending on your background, you may call it a hashtable, a map, a dictionary, or an object. But the idea is the same: you store a value with an immutable key, and retrieve it later.

<h3>Key and Value</h3>

+ +![A Key is an Address](../assets/decor/addresses.png) + +Key/value is the most basic construct in all of computerdom. You can think of a key like a home address, such as Bob's house with the unique key 5124, while the value would be maybe Bob (and his stuff). + +```javascript +hashtable["5124"] = "Bob" +``` + +Retrieving Bob is as easy as going to his house. + +```javascript +bob = hashtable["5124"] +``` + +Let's say that poor old Bob dies, and Claire moves into this house. The address remains the same, but the contents have changed. + +```javascript +hashtable["5124"] = "Claire" +``` + +Successive requests for `5124` will now return `Claire`. + +

<h3>Buckets</h3>

+ + + +Addresses in Riakville are more than a house number, but also a street. There could be another 5124 on another street, so the way we can ensure a unique address is by requiring both, as in *5124 Main Street*. + +*Buckets* in Riak are analogous to street names: they provide logical [namespaces](http://en.wikipedia.org/wiki/Namespace) so that identical keys in different buckets will not conflict. + +For example, while Alice may live at *5122 Main Street*, there may be a gas station at *5122 Bagshot Row*. + +```javascript +main["5122"] = "Alice" +bagshot["5122"] = "Gas" +``` + +Certainly you could have just named your keys `main_5122` and `bagshot_5122`, but buckets allow for cleaner key naming, and have other benefits, such as custom properties. For example, to add new Riak Search 2.0 indexes to a bucket, you might tell Riak to index all values under a bucket like this: + +```javascript +main.props = {"search_index":"homes"} +``` + +Buckets are so useful in Riak that all keys must belong to a bucket. There is no global namespace. The true definition of a unique key in Riak is actually `bucket/key`. + +

<h3>Bucket Types</h3>

Starting in Riak 2.0, there now exists a level above buckets, called bucket types. Bucket types are groups of buckets with a similar set of properties. So for the example above, it would be like a bucket of keys:

```javascript
places["main"]["5122"] = "Alice"
places["bagshot"]["5122"] = "Gas"
```

The benefit here is that a group of distinct buckets can share properties.

```javascript
places.props = {"search_index":"anyplace"}
```

This has practical implications. Previously, you were limited in how many custom bucket properties Riak could support, because any slight change from the default would have to be propagated to every other node in the cluster (via the gossip protocol). If you had ten thousand custom buckets, that's ten thousand values that were routinely sent amongst every member. Quickly, your system could be overloaded with that chatter, called a *gossip storm*.

With the addition of bucket types, and the improved communication mechanism that accompanies it, there's no limit to your bucket count. It also makes managing multiple buckets easier: since every bucket of a type inherits the common properties, you can make across-the-board changes trivially.

Due to its versatility (and downright necessity in some cases) and improved performance, Basho recommends using bucket types whenever possible from this point into the future.

For convenience, we call a *type/bucket/key + value* pair an *object*, sparing ourselves the verbosity of "X key in the Y bucket with the Z type, and its value".

## Replication and Partitions

Distributing data across several nodes is how Riak is able to remain highly available, tolerating outages and network partitions. Riak combines two styles of distribution to achieve this: [replication](http://en.wikipedia.org/wiki/Replication) and [partitions](http://en.wikipedia.org/wiki/Partition).

<h3>Replication</h3>

+ +**Replication** is the act of duplicating data across multiple servers. Riak replicates by default. + +The obvious benefit of replication is that if one node goes down, nodes that contain replicated data remain available to serve requests. In other words, the system remains *available*. + +For example, imagine you have a list of country keys, whose values are those countries' capitals. If all you do is replicate that data to 2 servers, you would have 2 duplicate databases. + +![Replication](../assets/replication.svg) + +The downside with replication is that you are multiplying the amount of storage required for every duplicate. There is also some network overhead with this approach, since values must also be routed to all replicated nodes on write. But there is a more insidious problem with this approach, which I will cover shortly. + + +

<h3>Partitions</h3>

+ +A **partition** is how we divide a set of keys onto separate physical servers. Rather than duplicate values, we pick one server to exclusively host a range of keys, and the other servers to host remaining non-overlapping ranges. + +With partitioning, our total capacity can increase without any big expensive hardware, just lots of cheap commodity servers. If we decided to partition our database into 1000 parts across 1000 nodes, we have (hypothetically) reduced the amount of work any particular server must do to 1/1000th. + +For example, if we partition our countries into 2 servers, we might put all countries beginning with letters A-N into Node A, and O-Z into Node B. + +![Partitions](../assets/partitions.svg) + +There is a bit of overhead to the partition approach. Some service must keep track of what range of values live on which node. A requesting application must know that the key `Spain` will be routed to Node B, not Node A. + +There's also another downside. Unlike replication, simple partitioning of data actually *decreases* uptime. If one node goes down, that entire partition of data is unavailable. This is why Riak uses both replication and partitioning. + +
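The bookkeeping described above can be sketched in a few lines of Node.js. This is purely illustrative (the node names and `nodeFor` function are made up, and real systems hash keys rather than compare first letters, as we'll see shortly), but it shows the kind of routing knowledge a client must carry:

```javascript
// Naive range partitioning: the client must know which node owns
// which key range in order to route each request.
function nodeFor(key) {
  return key[0].toUpperCase() <= 'N' ? 'nodeA' : 'nodeB';
}

console.log(nodeFor('Spain'));  // => 'nodeB'
console.log(nodeFor('France')); // => 'nodeA'
```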

<h3>Replication+Partitions</h3>

+ +Since partitions allow us to increase capacity, and replication improves availability, Riak combines them. We partition data across multiple nodes, as well as replicate that data into multiple nodes. + +Where our previous example partitioned data into 2 nodes, we can replicate each of those partitions into 2 more nodes, for a total of 4. + +Our server count has increased, but so has our capacity and reliability. If you're designing a horizontally scalable system by partitioning data, you must deal with replicating those partitions. + +The Riak team suggests a minimum of 5 nodes for a Riak cluster, and replicating to 3 nodes (this setting is called `n_val`, for the number of *nodes* on which to replicate each object). + +![Replication Partitions](../assets/replpart.svg) + + + +

<h3>The Ring</h3>

Riak applies *consistent hashing* to map objects along the edge of a circle (the ring).

Riak partitions are not mapped alphabetically (as we used in the examples above); instead, a partition marks a range of key hashes (the SHA-1 function applied to a key). The maximum hash value is 2^160, and the hash space is divided into some number of partitions---64 partitions by default (the Riak config setting is `ring_creation_size`).

Let's walk through what all that means. If you have the key `favorite`, applying the SHA-1 algorithm would return `7501 7a36 ec07 fd4c 377a 0d2a 0114 00ab 193e 61db` in hexadecimal. With 64 partitions, each has 1/64 of the `2^160` possible values, making the first partition range from 0 to `2^154-1`, the second range `2^154` to `2*2^154-1`, and so on, up to the last partition, `63*2^154` to `2^160-1`.

We won't do all of the math, but trust me when I say `favorite` falls within the range of partition 3.

If we visualize our 64 partitions as a ring, `favorite` falls here.

![Riak Ring](../assets/ring0.svg)

"Didn't he say that Riak suggests a minimum of 5 nodes? How can we put 64 partitions on 5 nodes?" We just give each node more than one partition, each of which is managed by a *vnode*, or *virtual node*.

We count around the ring of vnodes in order, assigning each vnode to the next available node, until all vnodes are accounted for. So partition/vnode 1 would be owned by Node A, vnode 2 owned by Node B, up to vnode 5 owned by Node E. Then we continue by giving Node A vnode 6, Node B vnode 7, and so on, until our vnodes have been exhausted, leaving us this list.

* A = [1,6,11,16,21,26,31,36,41,46,51,56,61]
* B = [2,7,12,17,22,27,32,37,42,47,52,57,62]
* C = [3,8,13,18,23,28,33,38,43,48,53,58,63]
* D = [4,9,14,19,24,29,34,39,44,49,54,59,64]
* E = [5,10,15,20,25,30,35,40,45,50,55,60]

So far we've partitioned the ring, but what about replication? When we write a new value to Riak, it will replicate the result to some number of nodes, defined by a setting called `n_val`. In our 5 node cluster it defaults to 3.

So when we write our `favorite` object to vnode 3, it will be replicated to vnodes 4 and 5. This places the object in physical nodes C, D, and E. Once the write is complete, even if node C crashes, the value is still available on 2 other nodes. This is the secret of Riak's high availability.

We can visualize the Ring with its vnodes, managing nodes, and where `favorite` will go.

![Riak Ring](../assets/ring1.svg)

The Ring is more than just a circular array of hash partitions. It's also a system of metadata that gets copied to every node. Each node is aware of every other node in the cluster, which nodes own which vnodes, and other system data.

Armed with this information, requests for data can target any node. It will horizontally access data from the proper nodes, and return the result.
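To make the ring arithmetic above concrete, here's a small Node.js sketch (an illustration, not Riak's internal code): SHA-1 places a key somewhere on the 2^160 space, and shifting away the low 154 bits leaves the 0-based index of the partition (out of the 64 defaults) its hash falls into.

```javascript
// Sketch only: mimic the hash-to-partition math described above.
const crypto = require('crypto');

function partitionIndex(key) {
  // SHA-1 maps the key onto the 2^160 hash ring
  const hex = crypto.createHash('sha1').update(key).digest('hex');
  // Dividing by 2^154 (a 154-bit right shift) picks one of 64 partitions
  return BigInt('0x' + hex) >> 154n;
}

console.log(partitionIndex('favorite')); // a partition index between 0 and 63
```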
## Practical Tradeoffs

So far we've covered the good parts of partitioning and replication: highly available when responding to requests, and inexpensive capacity scaling on commodity hardware. With the clear benefits of horizontal scaling, why is it not more common?

<h3>CAP Theorem</h3>

Classic RDBMS databases are *write consistent*. Once a write is confirmed, successive reads are guaranteed to return the newest value. If I save the value `cold pizza` to my key `favorite`, every future read will consistently return `cold pizza` until I change it.

But when values are distributed, *consistency* might not be guaranteed. In the middle of an object's replication, two servers could have different results. When we update `favorite` to `cold pizza` on one node, another node might contain the older value `pizza`, because of a network connectivity problem. If you request the value of `favorite` on either side of a network partition, two different results could possibly be returned---the database is inconsistent.

If consistency should not be compromised in a distributed database, we can choose to sacrifice *availability* instead. We may, for instance, decide to lock the entire database during a write, and simply refuse to serve requests until that value has been replicated to all relevant nodes. Clients have to wait while their results can be brought into a consistent state (ensuring all replicas will return the same value) or fail if the nodes have trouble communicating. For many high-traffic read/write use-cases, like an online shopping cart where even minor delays will cause people to just shop elsewhere, this is not an acceptable sacrifice.

This tradeoff is known as Brewer's CAP theorem. CAP loosely states that you can have a C (consistent), A (available), or P (partition-tolerant) system, but you can only choose 2. Assuming your system is distributed, you're going to be partition-tolerant, meaning that your network can tolerate packet loss. If a network partition occurs between nodes, your servers still run. So your only real choices are CP or AP. Riak 2.0 supports both modes.

<h3>Strong Consistency</h3>

Since version 2.0, Riak supports strong consistency (SC), as well as high availability (HA). "Waitaminute!" I hear you say, "doesn't that break the CAP theorem?" Not the way Riak does it. Riak supports setting a bucket type property as strongly consistent. Any bucket of that type is now SC, meaning that a request is either successfully replicated to a majority of partitions, or it fails (if you want to sound fancy at parties, just say "Riak SC uses a variant of the vertical Paxos leader election algorithm").

This, naturally, comes at a cost. As we know from the CAP theorem, if too many nodes are down, the write will fail. You'll have to repair your node or network, and try the write again. In short, you've lost high availability. If you don't absolutely need strong consistency, consider staying with the high availability default, and tuning it to your needs as we'll see in the next section.

<h3>Tunable Availability with N/R/W</h3>

+ +A question the CAP theorem demands you answer with a distributed system is: do I give up strong consistency, or give up ensured availability? If a request comes in, do I lock out requests until I can enforce consistency across the nodes? Or do I serve requests at all costs, with the caveat that the database may become inconsistent? + +Riak's solution is based on Amazon Dynamo's novel approach of a *tunable* AP system. It takes advantage of the fact that, though the CAP theorem is true, you can choose what kind of tradeoffs you're willing to make. Riak is highly available to serve requests, with the ability to tune its level of availability---nearing, but never quite reaching, strong consistency. If you want strong consistency, you'll need to create a special SC bucket type, which we'll see in a later chapter. + + + +Riak allows you to choose how many nodes you want to replicate an object to, and how many nodes must be written to or read from per request. These values are settings labeled `n_val` (the number of nodes to replicate to), `r` (the number of nodes read from before returning), and `w` (the number of nodes written to before considered successful). + +A thought experiment may help clarify things. + +![NRW](../assets/nrw.svg) + +

<h3>N</h3>

+ +With our 5 node cluster, having an `n_val=3` means values will eventually replicate to 3 nodes, as we've discussed above. This is the *N value*. You can set other values (R,W) to equal the `n_val` number with the shorthand `all`. + +

<h3>W</h3>

+ +But you may not wish to wait for all nodes to be written to before returning. You can choose to wait for all 3 to finish writing (`w=3` or `w=all`), which means my values are more likely to be consistent. Or you could choose to wait for only 1 complete write (`w=1`), and allow the remaining 2 nodes to write asynchronously, which returns a response quicker but increases the odds of reading an inconsistent value in the short term. This is the *W value*. + +In other words, setting `w=all` would help ensure your system was more likely to be consistent, at the expense of waiting longer, with a chance that your write would fail if fewer than 3 nodes were available (meaning, over half of your total servers are down). + +A failed write, however, is not necessarily a true failure. The client will receive an error message, but the write will typically still have succeeded on some number of nodes smaller than the *W* value, and will typically eventually be propagated to all of the nodes that should have it. + +

<h3>R</h3>

+ +Reading involves similar tradeoffs. To ensure you have the most recent value, you can read from all 3 nodes containing objects (`r=all`). Even if only 1 of 3 nodes has the most recent value, we can compare all nodes against each other and choose the latest one, thus ensuring some consistency. Remember when I mentioned that RDBMS databases were *write consistent*? This is close to *read consistency*. Just like `w=all`, however, the read will fail unless 3 nodes are available to be read. Finally, if you only want to quickly read any value, `r=1` has low latency, and is likely consistent if `w=all`. + +In general terms, the N/R/W values are Riak's way of allowing you to trade lower consistency for more availability. + +
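One way to see why `r=1` pairs well with `w=all`: a write set of `w` nodes and a read set of `r` nodes drawn from the same `n` replicas must overlap whenever `r + w` exceeds `n` (a rule we'll revisit in the next chapter). A quick Node.js sketch, purely illustrative:

```javascript
// With n replicas, any w-node write set and r-node read set must
// share at least r + w - n nodes when r + w > n.
const n = 3;
const w = 3, r = 1;      // w=all, r=1 from the example above
console.log(r + w - n);  // => 1: every read hits a fully written node
```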

<h3>Logical Clock</h3>

If you've followed thus far, I only have one more conceptual wrench to throw at you. I wrote earlier that with `r=all`, we can "compare all nodes against each other and choose the latest one." But how do we know which is the latest value? This is where logical clocks like *vector clocks* (aka *vclocks*) come into play.

Vector clocks measure a sequence of events, just like a normal clock. But since we can't reasonably keep the clocks on dozens, or hundreds, or thousands of servers in sync (without really exotic hardware, like geosynchronized atomic clocks, or quantum entanglement), we instead keep a running history of updates, and look for logical, rather than temporal, causality.

Let's use our `favorite` example again, but this time we have 3 people trying to come to a consensus on their favorite food: Aaron, Britney, and Carrie. These people are called *actors*, i.e. the things responsible for the updates. We'll track the value each actor has chosen along with the relevant vector clock.

(To illustrate vector clocks in action, we're cheating a bit. Riak doesn't track vector clocks via the client that initiated the request, but rather, via the server that coordinates the write request; nonetheless, the concept is the same. We'll cheat further by disregarding the timestamp that is stored with vector clocks.)

When Aaron sets the `favorite` object to `pizza`, a vector clock could contain his name and the number of updates he's performed.

```yaml
bucket: food
key: favorite

vclock: {Aaron: 1}
value: pizza
```

Britney now comes along, and reads `favorite`, but decides to update `pizza` to `cold pizza`. When using vclocks, she must provide the vclock returned from the request she wants to update. This is how Riak can help ensure you're updating a previous value, and not merely overwriting with your own.

```yaml
bucket: food
key: favorite

vclock: {Aaron: 1, Britney: 1}
value: cold pizza
```

At the same time as Britney, Carrie decides that pizza was a terrible choice, and tries to change the value to `lasagna`.

```yaml
bucket: food
key: favorite

vclock: {Aaron: 1, Carrie: 1}
value: lasagna
```

This presents a problem, because there are now two vector clocks in play that diverge from `{Aaron: 1}`. By default, Riak will store both values.

Later in the day Britney checks again, but this time she gets the two conflicting values (aka *siblings*, which we'll discuss in more detail in the next chapter), with two vclocks.

```yaml
bucket: food
key: favorite

vclock: {Aaron: 1, Britney: 1}
value: cold pizza
---
vclock: {Aaron: 1, Carrie: 1}
value: lasagna
```

It's clear that a decision must be made. Perhaps Britney knows that Aaron's original request was for `pizza`, and thus two people generally agreed on `pizza`, so she resolves the conflict by choosing that value and providing a new vclock.

```yaml
bucket: food
key: favorite

vclock: {Aaron: 1, Carrie: 1, Britney: 2}
value: pizza
```

Now we are back to the simple case, where requesting the value of `favorite` will just return the agreed upon `pizza`.

If you're a programmer, you may notice that this is not unlike a version control system, like **git**, where conflicting branches may require manual merging into one.
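The comparison at the heart of all this can be sketched in a few lines of Node.js (illustrative only; Riak's real implementation lives in Erlang and tracks more detail than this toy `descends` function):

```javascript
// One clock "descends" from another if it has seen at least as many
// updates from every actor. If neither descends from the other, the
// writes were concurrent, and both values are kept as siblings.
function descends(a, b) {
  return Object.keys(b).every(actor => (a[actor] || 0) >= b[actor]);
}

const britney = { Aaron: 1, Britney: 1 };
const carrie = { Aaron: 1, Carrie: 1 };

console.log(descends(britney, carrie)); // => false
console.log(descends(carrie, britney)); // => false: store both values
```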

<h3>Datatypes</h3>

New in Riak 2.0 is the concept of datatypes. In the preceding logical clock example, we were responsible for resolving the conflicting values. This is because in the normal case, Riak has no idea what objects you're giving it. That is to say, Riak values are *opaque*. This is actually a powerful construct, since it allows you to store any type of value you want, from plain text, to semi-structured data like XML or JSON, to binary objects like images.

When you decide to use datatypes, you've given Riak some information about the type of object you want to store. With this information, Riak can figure out how to resolve conflicts automatically for you, based on some pre-defined behavior.

Let's try another example. Let's imagine a shopping cart in an online retailer. You can imagine a shopping cart like a set of items. So each key in our cart contains a *set* of values.

Let's say you log into the retailer's website on your laptop with your username *ponies4evr*, and choose the Season 2 DVD of *My Little Pony: Friendship is Magic*. This time, the logical clock will act more like Riak's, where the node that coordinates the request will be the actor.

```yaml
type: set
bucket: cart
key: ponies4evr

vclock: {Node_A: 1}
value: ["MYPFIM-S2-DVD"]
```

Once the DVD was added to the cart bucket, your laptop runs out of batteries. So you take out your trusty smartphone, and log into the retailer's mobile app. You decide to also add the *Bloodsport III* DVD. Little did you know, a temporary network partition caused your write to redirect to another node. This partition had no knowledge of your other purchase.

```yaml
type: set
bucket: cart
key: ponies4evr

vclock: {Node_B: 1}
value: ["BS-III-DVD"]
```

Happily, the network hiccup was temporary, and thus the cluster heals itself. Under normal circumstances, since the logical clocks did not descend from one another, you'd end up with siblings like this:

```yaml
type: set
bucket: cart
key: ponies4evr

vclock: {Node_A: 1}
value: ["MYPFIM-S2-DVD"]
---
vclock: {Node_B: 1}
value: ["BS-III-DVD"]
```

But since the bucket was designed to hold a *set*, Riak knows how to automatically resolve this conflict. In the case of conflicting sets, it performs a set union. So when you go to check out of the cart, the system returns this instead:

```yaml
type: set
bucket: cart
key: ponies4evr

vclock: [{Node_A: 1}, {Node_B: 1}]
value: ["MYPFIM-S2-DVD", "BS-III-DVD"]
```

Datatypes will never return conflicts. This is an important claim to make, because as a developer, you get all of the benefits of dealing with a simple value, with all of the benefits of a distributed, available system. You don't have to think about handling conflicts. It would be like a version control system (*git*, *svn*, etc.) where you never had to merge code---the VCS simply *knew* what you wanted.

How this all works is beyond the scope of this document. Under the covers it's implemented by something called [CRDTs](http://docs.basho.com/riak/2.0.0/theory/concepts/crdts/) (Conflict-free Replicated Data Types). What's important to note is that Riak supports four datatypes: *map*, *set*, *counter*, *flag* (a boolean value). Best of all, maps can nest arbitrarily, so you can create a map whose values are sets, counters, or even other maps. It also supports plain string values called *register*s.

We'll see how to use datatypes in the next chapter.
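Before moving on, the convergence rule for conflicting sets is simple enough to sketch in Node.js. Note this is a toy: Riak's actual CRDT sets also track removals, which this plain union ignores.

```javascript
// Two sibling set values, one from each side of the partition...
const fromNodeA = ['MYPFIM-S2-DVD'];
const fromNodeB = ['BS-III-DVD'];

// ...converge by set union, so neither update is lost.
const merged = [...new Set([...fromNodeA, ...fromNodeB])];
console.log(merged); // => ['MYPFIM-S2-DVD', 'BS-III-DVD']
```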

<h3>Riak and ACID</h3>

Unlike single node databases like Neo4j or PostgreSQL, Riak does not support *ACID* transactions. Locking across multiple servers would kill write availability, and equally concerning, increase latency. While ACID transactions promise *Atomicity*, *Consistency*, *Isolation*, and *Durability*---Riak and other NoSQL databases follow *BASE*, or *Basically Available*, *Soft state*, *Eventually consistent*.

The BASE acronym was meant as shorthand for the goals of non-ACID-transactional databases like Riak. It is an acceptance that distribution is never perfect (basically available), all data is in flux (soft state), and that strong consistency is untenable (eventually consistent) if you want high availability.

Look closely at promises of distributed transactions---they're often couched in some diminishing adjective or caveat like *row transactions*, or *per node transactions*, which basically mean *not transactional* in terms you would normally use to define it. I'm not claiming it's impossible, but it's certainly worth due consideration.

As your server count grows---especially as you introduce multiple datacenters---the odds of partitions and node failures drastically increase. My best advice is to design for it upfront.

## Wrapup

Riak is designed to bestow a range of real-world benefits, but equally, to handle the fallout of wielding such power. Consistent hashing and vnodes are an elegant solution to horizontally scaling across servers. N/R/W allows you to dance with the CAP theorem by fine-tuning against its constraints. And vector clocks allow another step closer to consistency by allowing you to manage conflicts that will occur at high load.

We'll cover other technical concepts as needed, including the gossip protocol, hinted handoff, and read-repair.

Next we'll review Riak from the user (developer) perspective. We'll check out lookups, take advantage of write hooks, and examine alternative query options like secondary indexing, search, and MapReduce.

diff --git a/zh/03-developers.md b/zh/03-developers.md
new file mode 100644
index 0000000..dcba1ad
--- /dev/null
+++ b/zh/03-developers.md
@@ -0,0 +1,1022 @@
# Developers

_We're going to hold off on the details of installing Riak at the moment. If you'd like to follow along, it's easy enough to get started by following the [install documentation](http://docs.basho.com/riak/latest/) on the website (http://docs.basho.com). If not, this is a perfect section to read while you sit on a train without an Internet connection._

Developing with a Riak database is quite easy to do, once you understand some of the finer points. It is a key/value store, in the technical sense (you associate values with keys, and retrieve them using the same keys) but it offers so much more. You can embed write hooks to fire before or after a write, or index data for quick retrieval. Riak has Solr search, and lets you run MapReduce functions to extract and aggregate data across a huge cluster in relatively short timespans. We'll show some configurable bucket-specific settings.

## Lookup

Since Riak is a KV database, the most basic commands are setting and getting values. We'll use the HTTP interface, via curl, but we could just as easily use Erlang, Ruby, Java, or any other supported language.

The basic structure of a Riak request is setting a value, reading it, and maybe eventually deleting it. The actions are related to HTTP methods (PUT, GET, POST, DELETE).
```bash
PUT /types/<type>/buckets/<bucket>/keys/<key>
GET /types/<type>/buckets/<bucket>/keys/<key>
DELETE /types/<type>/buckets/<bucket>/keys/<key>
```

For the examples in this chapter, let's call an environment variable `$RIAK` that points to our access node's URL.

```bash
export RIAK=http://localhost:8098
```

<h3>PUT</h3>

The simplest write command in Riak is putting a value. It requires a key, value, and a bucket. In curl, all HTTP methods are prefixed with `-X`. Putting the value `pizza` into the key `favorite` under the `food` bucket and `items` bucket type is done like this:

```bash
curl -XPUT "$RIAK/types/items/buckets/food/keys/favorite" \
  -H "Content-Type:text/plain" \
  -d "pizza"
```

I threw a few curveballs in there. The `-d` flag denotes the next string will be the value. We've kept things simple with the string `pizza`, declaring it as text with the preceding line `-H 'Content-Type:text/plain'`. This defines the HTTP MIME type of this value as plain text. We could have set any value at all, be it XML or JSON---even an image or a video. Riak does not care at all what data is uploaded, so long as the object size doesn't get much larger than 4MB (a soft limit, but one that is unwise to exceed).

<h3>GET</h3>

+ +The next command reads the value `pizza` under the type/bucket/key `items`/`food`/`favorite`. + +```bash +curl -XGET "$RIAK/types/items/buckets/food/keys/favorite" +pizza +``` + +This is the simplest form of read, responding with only the value. Riak contains much more information, which you can access if you read the entire response, including the HTTP header. + +In `curl` you can access a full response by way of the `-i` flag. Let's perform the above query again, adding that flag (`-XGET` is the default curl method, so we can leave it off). + +```bash +curl -i "$RIAK/types/items/buckets/food/keys/favorite" +HTTP/1.1 200 OK +X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmdwZTImMfKcN3h1Um+LAA= +Vary: Accept-Encoding +Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted... +Last-Modified: Wed, 10 Oct 2012 18:56:23 GMT +ETag: "1yHn7L0XMEoMVXRGp4gOom" +Date: Thu, 11 Oct 2012 23:57:29 GMT +Content-Type: text/plain +Content-Length: 5 + +pizza +``` + +The anatomy of HTTP is a bit beyond this little book, but let's look at a few parts worth noting. + +
<h4>Status Codes</h4>
+ +The first line gives the HTTP version 1.1 response code `200 OK`. You may be familiar with the common website code `404 Not Found`. There are many kinds of [HTTP status codes](http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html), and the Riak HTTP interface stays true to their intent: **1xx Informational**, **2xx Success**, **3xx Further Action**, **4xx Client Error**, **5xx Server Error** + +Different actions can return different response/error codes. Complete lists can be found in the [official API docs](http://docs.basho.com/riak/latest/references/apis/). + +
<h4>Timings</h4>
+ +A block of headers represents different timings for the object or the request. + +* **Last-Modified** - The last time this object was modified (created or updated). +* **ETag** - An *[entity tag](http://en.wikipedia.org/wiki/HTTP_ETag)* which can be used for cache validation by a client. +* **Date** - The time of the request. +* **X-Riak-Vclock** - A logical clock which we'll cover in more detail later. + +
<h4>Content</h4>
+ +These describe the HTTP body of the message (in Riak's terms, the *value*). + +* **Content-Type** - The type of value, such as `text/xml`. +* **Content-Length** - The length, in bytes, of the message body. + +Some other headers like `Link` will be covered later in this chapter. + + +

<h3>POST</h3>

+ +Similar to PUT, POST will save a value. But with POST a key is optional. All it requires is a bucket name (and should include a type), and it will generate a key for you. + +Let's add a JSON value to represent a person under the `json`/`people` type/bucket. The response header is where a POST will return the key it generated for you. + +```bash +curl -i -XPOST "$RIAK/types/json/buckets/people/keys" \ + -H "Content-Type:application/json" \ + -d '{"name":"aaron"}' +HTTP/1.1 201 Created +Vary: Accept-Encoding +Server: MochiWeb/1.1 WebMachine/1.9.2 (someone had painted... +Location: /riak/people/DNQGJY0KtcHMirkidasA066yj5V +Date: Wed, 10 Oct 2012 17:55:22 GMT +Content-Type: application/json +Content-Length: 0 +``` + +You can extract this key from the `Location` value. Other than not being pretty, this key is treated the same as if you defined your own key via PUT. + +
<h4>Body</h4>
+ +You may note that no body was returned with the response. For any kind of write, you can add the `returnbody=true` parameter to force a value to return, along with value-related headers like `X-Riak-Vclock` and `ETag`. + +```bash +curl -i -XPOST "$RIAK/types/json/buckets/people/keys?returnbody=true" \ + -H "Content-Type:application/json" \ + -d '{"name":"billy"}' +HTTP/1.1 201 Created +X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmdwZTImMfKkD3z10m+LAA= +Vary: Accept-Encoding +Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted... +Location: /riak/people/DnetI8GHiBK2yBFOEcj1EhHprss +Last-Modified: Tue, 23 Oct 2012 04:30:35 GMT +ETag: "7DsE7SEqAtY12d8T1HMkWZ" +Date: Tue, 23 Oct 2012 04:30:35 GMT +Content-Type: application/json +Content-Length: 16 + +{"name":"billy"} +``` + +This is true for PUTs and POSTs. + +

<h3>DELETE</h3>

+ +The final basic operation is deleting keys, which is similar to getting a value, but sending the DELETE method to the `type`/`bucket`/`key`. + +```bash +curl -XDELETE "$RIAK/types/json/buckets/people/keys/DNQGJY0KtcHMirkidasA066yj5V" +``` + +A deleted object in Riak is internally marked as deleted, by writing a marker known as a *tombstone*. Unless configured otherwise, another process called a *reaper* will later finish deleting the marked objects. + +This detail isn't normally important, except to understand two things: + +1. In Riak, a *delete* is actually a *read* and a *write*, and should be considered as such when calculating read/write ratios. +2. Checking for the existence of a key is not enough to know if an object exists. You might be reading a key after it has been deleted, so you should check for tombstone metadata. + +

<h3>Lists</h3>

+ +Riak provides two kinds of lists. The first lists all *buckets* in your cluster, while the second lists all *keys* under a specific bucket. Both of these actions are called in the same way, and come in two varieties. + +The following will give us all of our buckets as a JSON object. + +```bash +curl "$RIAK/types/default/buckets?buckets=true" + +{"buckets":["food"]} +``` + +And this will give us all of our keys under the `food` bucket. + +```bash +curl "$RIAK/types/default/buckets/food/keys?keys=true" +{ + ... + "keys": [ + "favorite" + ] +} +``` + +If we had very many keys, clearly this might take a while. So Riak also provides the ability to stream your list of keys. `keys=stream` will keep the connection open, returning results in chunks of arrays. When it has exhausted its list, it will close the connection. You can see the details through curl in verbose (`-v`) mode (much of that response has been stripped out below). + +```bash +curl -v "$RIAK/types/default/buckets/food/keys?keys=stream" +... + +* Connection #0 to host localhost left intact +... +{"keys":["favorite"]} +{"keys":[]} +* Closing connection #0 +``` + + + +You should note that list actions should *not* be used in production (they're really expensive operations). But they are useful for development, investigations, or for running occasional analytics at off-peak hours. + +## Conditional requests + +It is possible to use conditional requests with Riak, but these are +fragile due to the nature of its availability/eventual consistency +model. + +### GET + +When retrieving values from Riak via HTTP, a last-modified timestamp +and an [ETag](https://en.wikipedia.org/wiki/HTTP_ETag) are +included. These may be used for future `GET` requests; if the value +has not changed, a `304 Not Modified` status will be returned. + +For example, let's assume you receive the following headers. + +```bash +Last-Modified: Thu, 17 Jul 2014 21:01:16 GMT +ETag: "3VhRP0vnXbk5NjZllr0dDE" +``` + +Note that the quotes are part of the ETag. + +If the ETag is used via the `If-None-Match` header in the next request: + +```bash +curl -i "$RIAK/types/default/buckets/food/keys/dinner" \ + -H 'If-None-Match: "3VhRP0vnXbk5NjZllr0dDE"' +HTTP/1.1 304 Not Modified +Vary: Accept-Encoding +Server: MochiWeb/1.1 WebMachine/1.10.5 (jokes are better explained) +ETag: "3VhRP0vnXbk5NjZllr0dDE" +Date: Mon, 28 Jul 2014 19:48:13 GMT +``` + +Similarly, the last-modified timestamp may be used with `If-Modified-Since`: + +```bash +curl -i "$RIAK/types/default/buckets/food/keys/dinner" \ + -H 'If-Modified-Since: Thu, 17 Jul 2014 21:01:16 GMT' +HTTP/1.1 304 Not Modified +Vary: Accept-Encoding +Server: MochiWeb/1.1 WebMachine/1.10.5 (jokes are better explained) +ETag: "3VhRP0vnXbk5NjZllr0dDE" +Date: Mon, 28 Jul 2014 19:51:39 GMT +``` + +### PUT & DELETE + +When adding, updating, or removing content, the HTTP headers +`If-None-Match`, `If-Match`, `If-Modified-Since`, and +`If-Unmodified-Since` can be used to specify ETags and timestamps. + +If the specified condition cannot be met, a `412 Precondition Failed` +status will be the result. + + +## Bucket Types/Buckets + +Although we've been using bucket types and buckets as namespaces up to now, they are capable of more. + +Different use-cases will dictate whether a bucket is heavily written to, or largely read from. You may use one bucket to store logs, one bucket could store session data, while another may store shopping cart data. Sometimes low latency is important, while other times it's high durability. 
And sometimes we just want buckets to react differently when a write occurs. + +

<h3>Quorum</h3>

+ +The basis of Riak's availability and tolerance is that it can read from, or write to, multiple nodes. Riak allows you to adjust these N/R/W values (which we covered under [Concepts](#practical-tradeoffs)) on a per-bucket basis. + +

<h3>N/R/W</h3>

+ +N is the number of total nodes that a value should be replicated to, defaulting to 3. But we can set this `n_val` to less than the total number of nodes. + +Any bucket property, including `n_val`, can be set by sending a `props` value as a JSON object to the bucket URL. Let's set the `n_val` to 5 nodes, meaning that objects written to `cart` will be replicated to 5 nodes. + +```bash +curl -i -XPUT "$RIAK/types/default/buckets/cart/props" \ + -H "Content-Type: application/json" \ + -d '{"props":{"n_val":5}}' +``` + +You can take a peek at the bucket's properties by issuing a GET to the bucket. + +*Note: Riak returns unformatted JSON. If you have a command-line tool like jsonpp (or json_pp) installed, you can pipe the output there for easier reading. The results below are a subset of all the `props` values.* + +```bash +curl "$RIAK/types/default/buckets/cart/props" | jsonpp +{ + "props": { + ... + "dw": "quorum", + "n_val": 5, + "name": "cart", + "postcommit": [], + "pr": 0, + "precommit": [], + "pw": 0, + "r": "quorum", + "rw": "quorum", + "w": "quorum", + ... + } +} +``` + +As you can see, `n_val` is 5. That's expected. But you may also have noticed that the cart `props` returned both `r` and `w` as `quorum`, rather than a number. So what is a *quorum*? + +
<h4>Symbolic Values</h4>
A *quorum* is one more than half of all the total replicated nodes (`floor(N/2) + 1`). This figure is important, since if more than half of all nodes are written to, and more than half of all nodes are read from, then you will get the most recent value (under normal circumstances).

Here's an example with the above `n_val` of 5 ({A,B,C,D,E}). Your `w` is a quorum (which is `3`, or `floor(5/2)+1`), so a PUT may respond successfully after writing to {A,B,C} ({D,E} will eventually be replicated to). Immediately after, a read quorum may GET values from {C,D,E}. Even if D and E have older values, you have pulled a value from node C, meaning you will receive the most recent value.

What's important is that your reads and writes *overlap*. As long as `r+w > n`, in the absence of *sloppy quorum* (below), you'll be able to get the newest values. In other words, you'll have a reasonable level of consistency.

A `quorum` is an excellent default, since you're reading and writing from a balance of nodes. But if you have specific requirements, like a log that is often written to, but rarely read, you might find it makes more sense to wait for a successful write from a single node, but read from all of them. This still affords you an overlap between reads and writes.

```bash
curl -i -XPUT "$RIAK/types/default/buckets/logs/props" \
  -H "Content-Type: application/json" \
  -d '{"props":{"w":"one","r":"all"}}'
```

* `all` - All replicas must reply, which is the same as setting `r` or `w` equal to `n_val`
* `one` - Setting `r` or `w` equal to `1`
* `quorum` - A majority of the replicas must respond, that is, "half plus one".
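For reference, the quorum arithmetic above is trivial to check (a Node.js one-liner, shown only to make the formula concrete):

```javascript
// One more than half of the replicas: floor(n/2) + 1.
const quorum = n => Math.floor(n / 2) + 1;

console.log(quorum(5)); // => 3, so w=3 and r=3 overlap: 3 + 3 > 5
console.log(quorum(3)); // => 2
```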

<h3>Sloppy Quorum</h3>

In a perfect world, a strict quorum would be sufficient for most write requests. However, at any moment a node could go down, or the network could partition, or squirrels could get caught in the tubes, triggering the unavailability of a required node. Rather than fail the request, as a strict quorum would, Riak defaults to what's known as a *sloppy quorum*, meaning that if any primary (expected) node is unavailable, the next available node in the ring will accept requests.

Think about it like this. Say you're out drinking with your friend. You order 2 drinks (W=2), but before they arrive, she leaves temporarily. If you were a strict quorum, you could merely refuse both drinks, since the required people (N=2) are unavailable. But you'd rather be a sloppy drunk... erm, I mean sloppy *quorum*. Rather than deny the drink, you take both, one accepted *on her behalf* (you also get to pay).

![A Sloppy Quorum](../assets/decor/drinks.png)

When she returns, you slide her drink over. This is known as *hinted handoff*, which we'll look at again in the next chapter. For now it's sufficient to note that there's a difference between the default sloppy quorum (W), and requiring a strict quorum of primary nodes (PW).
<h4>More than R's and W's</h4>
+ +Some other values you may have noticed in the bucket's `props` object are `pw`, `pr`, and `dw`. + +`pr` and `pw` ensure that many *primary* nodes are available before a read or write. Riak will read or write from backup nodes if one is unavailable, because of network partition or some other server outage. This `p` prefix will ensure that only the primary nodes are used, *primary* meaning the vnode which matches the bucket plus N successive vnodes. + +(We mentioned above that `r+w > n` provides a reasonable level of consistency, violated when sloppy quorums are involved. `pr+pw > n` allows for a much stronger assertion of consistency, although there are always scenarios involving conflicting writes or significant disk failures where that too may not be enough.) + +Finally `dw` represents the minimal *durable* writes necessary for success. For a normal `w` write to count a write as successful, a vnode need only promise a write has started, with no guarantee that write has been written to disk, aka, is durable. The `dw` setting means the backend service (for example Bitcask) has agreed to write the value. Although a high `dw` value is slower than a high `w` value, there are cases where this extra enforcement is good to have, such as dealing with financial data. + +
<h4>Per Request</h4>
+ +It's worth noting that these values (except for `n_val`) can be overridden *per request*. + +Consider a scenario in which you have data that you find very important (say, credit card checkout), and want to help ensure it will be written to every relevant node's disk before success. You could add `?dw=all` to the end of your write. + +```bash +curl -i -XPUT "$RIAK/types/default/buckets/cart/keys/cart1?dw=all" \ + -H "Content-Type: application/json" \ + -d '{"paid":true}' +``` + +If any of the nodes currently responsible for the data cannot complete the request (i.e., hand off the data to the storage backend), the client will receive a failure message. This doesn't mean that the write failed, necessarily: if two of three primary vnodes successfully wrote the value, it should be available for future requests. Thus trading availability for consistency by forcing a high `dw` or `pw` value can result in unexpected behavior. + +

<h3>Hooks</h3>

Another utility of buckets is their ability to enforce behaviors on writes by way of hooks. You can attach functions to run either before, or after, a value is committed to a bucket.

Precommit hooks are functions that run before a write is called. A precommit hook has the ability to cancel a write altogether if the incoming data is considered bad in some way. A simple precommit hook is to check if a value exists at all.

I put my custom Erlang code files under the riak installation `./custom/my_validators.erl`.

```erlang
-module(my_validators).
-export([value_exists/1]).

%% Object size must be greater than 0 bytes
value_exists(RiakObject) ->
  Value = riak_object:get_value(RiakObject),
  case erlang:byte_size(Value) of
    0 -> {fail, "A value sized greater than 0 is required"};
    _ -> RiakObject
  end.
```

Then compile the file.

```bash
erlc my_validators.erl
```

Install the file by informing the Riak installation of your new code with an `advanced.config` file that lives alongside `riak.conf` in each node, then rolling restart each node.

```erlang
[{riak_kv,
  [{add_paths, ["./custom"]}]
}].
```

Then you need to set the Erlang module (`my_validators`) and function (`value_exists`) as a JSON value in the bucket's precommit array: `{"mod":"my_validators","fun":"value_exists"}`.

```bash
curl -i -XPUT "$RIAK/types/default/buckets/cart/props" \
  -H "Content-Type:application/json" \
  -d '{"props":{"precommit":[{"mod":"my_validators","fun":"value_exists"}]}}'
```

If you try and post to the `cart` bucket without a value, you should expect a failure.

```bash
curl -XPOST "$RIAK/types/default/buckets/cart/keys" \
  -H "Content-Type:application/json"
A value sized greater than 0 is required
```

You can also write precommit functions in JavaScript, though Erlang code will execute faster.

Post-commits are similar in form and function, albeit executed after the write has been performed. Key differences:

* The only language supported is Erlang.
* The function's return value is ignored, thus it cannot cause a failure message to be sent to the client.

## Datatypes

A new feature in Riak 2.0 is datatypes. Rather than the opaque values of days past, these new additions allow a user to define the type of values that are accepted under a given bucket type. In addition to the automatic conflict resolution benefits listed in the previous chapter, you also interact with datatypes in a different way.

In normal Riak operations, as we've seen, you put a value with a given key into a type/bucket object. If you wanted to store a map, say, as a JSON object representing a person, you would put the entire object with every field/value as an operation.

```bash
curl -XPOST "$RIAK/types/json/buckets/people/keys/joe" \
  -H "Content-Type:application/json" \
  -d '{"name_register":"Joe", "pets_set":["cat"]}'
```

But if you wanted to add a `fish` as a pet, you'd have to replace the entire object.

```bash
curl -XPOST "$RIAK/types/json/buckets/people/keys/joe" \
  -H "Content-Type:application/json" \
  -d '{"name_register":"Joe", "pets_set":["cat", "fish"]}'
```

As we saw in the previous chapter, this runs the risk of conflicting, thus creating a sibling.

```
{"name_register":"Joe", "pets_set":["cat"]}
{"name_register":"Joe", "pets_set":["cat", "fish"]}
```

But if we used a map, we'd instead issue only updates to create a map.
So, assume that the bucket type `map` is of a map datatype (we'll see how operators can assign datatypes to bucket types in the next chapter). This command will insert a map object with two fields (`name_register` and `pets_set`).

```bash
curl -XPOST "$RIAK/types/map/buckets/people/keys/joe" \
  -H "Content-Type:application/json" \
  -d '{
    "update": {
      "name_register": "Joe",
      "pets_set": {
        "add_all": ["cat"]
      }
    }
  }'
```

Next, we want to update the `pets_set` contained within `joe`'s map. Rather than set Joe's name and his pet cat, we only need to inform the object of the change. Namely, that we want to add a `fish` to his `pets_set`.

```bash
curl -XPOST "$RIAK/types/map/buckets/people/keys/joe" \
  -H "Content-Type:application/json" \
  -d '{
    "update": {
      "pets_set": {
        "add": "fish"
      }
    }
  }'
```

This has a few benefits. Firstly, we don't need to send duplicate data. Second, it doesn't matter what order the two requests happen in, the outcome will be the same. Third, because the operations are CmRDTs, there is no possibility of a datatype returning siblings, making your client code that much easier.

As we've noted before, there are four Riak datatypes: *map*, *set*, *counter*, *flag*. The object type is set as a bucket type property. However, when populating a map, as we've seen, you must suffix the field name with the datatype that you wish to store: \*\_map, \*\_set, \*\_counter, \*\_flag. For plain string values, there's a special \*\_register datatype suffix.

You can read more about [datatypes in the docs](http://docs.basho.com/riak/latest/dev/using/data-types).

## Entropy

Entropy is a byproduct of eventual consistency. In other words: although eventual consistency says a write will replicate to other nodes in time, there can be a bit of delay during which all nodes do not contain the same value.

That difference is *entropy*, and so Riak has created several *anti-entropy* strategies (abbreviated as *AE*). We've already talked about how an R/W quorum can deal with differing values when write/read requests overlap at least one node. Riak can repair entropy, or allow you the option to do so yourself.

Riak has two basic strategies to address conflicting writes.

<h3>Last Write Wins</h3>

+ +The most basic, and least reliable, strategy for curing entropy is called *last write wins*. It's the simple idea that the last write based on a node's system clock will overwrite an older one. This is currently the default behavior in Riak (by virtue of the `allow_mult` property defaulting to `false`). You can also set the `last_write_wins` property to `true`, which improves performance by never retaining vector clock history. + +Realistically, this exists for speed and simplicity, when you really don't care about true order of operations, or the possibility of losing data. Since it's impossible to keep server clocks truly in sync (without the proverbial geosynchronized atomic clocks), this is a best guess as to what "last" means, to the nearest millisecond. + +
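As a sketch (in Node.js, purely illustrative, with made-up field names), last write wins reduces conflict resolution to a timestamp comparison, which is exactly why clock skew between servers can silently discard a concurrent update:

```javascript
// Whichever sibling carries the later wall-clock timestamp survives;
// the other write is dropped with no record of the conflict.
function lastWriteWins(a, b) {
  return a.timestamp >= b.timestamp ? a : b;
}

const v1 = { value: 'cold pizza', timestamp: 1351728808 };
const v2 = { value: 'lasagna', timestamp: 1351728807 }; // a skewed clock?
console.log(lastWriteWins(v1, v2).value); // => 'cold pizza'
```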

<h3>Vector Clocks</h3>

+ +As we saw under [Concepts](#practical-tradeoffs), *vector clocks* are Riak's way of tracking a true sequence of events of an object. Let's take a look at using vector clocks to allow for a more sophisticated conflict resolution approach than simply retaining the last-written value. + +

<h3>Siblings</h3>

*Siblings* occur when you have conflicting values, with no clear way for Riak to know which value is correct. As of Riak 2.0, as long as you use a custom (not `default`) bucket type that isn't a datatype, conflicting writes should create siblings. This is a good thing, since it ensures no data is ever lost.

In the case where you forgo a custom bucket type, Riak will try to resolve these conflicts itself if the `allow_mult` parameter is configured to `false`. You should generally have your buckets retain siblings to be resolved by the client, by ensuring `allow_mult` is `true`.

```bash
curl -i -XPUT "$RIAK/types/default/buckets/cart/props" \
  -H "Content-Type:application/json" \
  -d '{"props":{"allow_mult":true}}'
```

Siblings arise in a couple of cases.

1. A client writes a value using a stale (or missing) vector clock.
2. Two clients write at the same time with the same vector clock value.

We used the second scenario to manufacture a conflict in the previous chapter when we introduced the concept of vector clocks, and we'll do so again here.

<h4>Creating an Example Conflict</h4>

Imagine we create a shopping cart for a single refrigerator, but several people in a household are able to order food for it. Because losing orders would result in an unhappy household, Riak is using a custom bucket type `shopping`, which keeps the default `allow_mult=true`.

First, Casey (a vegan) places 10 orders of kale in the cart.

Casey writes `[{"item":"kale","count":10}]`.

```bash
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
  -H "Content-Type:application/json" \
  -d '[{"item":"kale","count":10}]'
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA=
Vary: Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted...
Last-Modified: Thu, 01 Nov 2012 00:13:28 GMT
ETag: "2IGTrV8g1NXEfkPZ45WfAP"
Date: Thu, 01 Nov 2012 00:13:28 GMT
Content-Type: application/json
Content-Length: 28

[{"item":"kale","count":10}]
```

Note the opaque vector clock (via the `X-Riak-Vclock` header) returned by Riak. That same value will be returned with any read request issued for that key until another write occurs.

His roommate Mark reads the order and adds milk. In order to allow Riak to track the update history properly, Mark includes the most recent vector clock with his PUT.

Mark writes `[{"item":"kale","count":10},{"item":"milk","count":1}]`.

```bash
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
  -H "Content-Type:application/json" \
  -H "X-Riak-Vclock:a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA=" \
  -d '[{"item":"kale","count":10},{"item":"milk","count":1}]'
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTIlMfKcMaK7RRfFgA=
Vary: Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted...
Last-Modified: Thu, 01 Nov 2012 00:14:04 GMT
ETag: "62NRijQH3mRYPRybFneZaY"
Date: Thu, 01 Nov 2012 00:14:04 GMT
Content-Type: application/json
Content-Length: 54

[{"item":"kale","count":10},{"item":"milk","count":1}]
```

If you look closely, you'll notice that the vector clock changed with the second write request.

* a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA= (after the write by Casey)
* a85hYGBgzGDKBVIcypz/fgaUHjmTwZTIlMfKcMaK7RRfFgA= (after the write by Mark)

Now let's consider a third roommate, Andy, who loves almonds. Before Mark updated the shared cart with milk, Andy had retrieved Casey's kale order, and he now appends almonds. As with Mark, Andy's update includes the vector clock as it existed after Casey's original write.

Andy writes `[{"item":"kale","count":10},{"item":"almonds","count":12}]`.

```bash
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
  -H "Content-Type:application/json" \
  -H "X-Riak-Vclock:a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RRfFgA=" \
  -d '[{"item":"kale","count":10},{"item":"almonds","count":12}]'
HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTInMfKoG7LdoovCwA=
Vary: Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (someone had painted...
Last-Modified: Thu, 01 Nov 2012 00:24:07 GMT
ETag: "54Nx22W9M7JUKJnLBrRehj"
Date: Thu, 01 Nov 2012 00:24:07 GMT
Content-Type: multipart/mixed; boundary=Ql3O0enxVdaMF3YlXFOdmO5bvrs
Content-Length: 491


--Ql3O0enxVdaMF3YlXFOdmO5bvrs
Content-Type: application/json
Etag: 62NRijQH3mRYPRybFneZaY
Last-Modified: Thu, 01 Nov 2012 00:14:04 GMT

[{"item":"kale","count":10},{"item":"milk","count":1}]
--Ql3O0enxVdaMF3YlXFOdmO5bvrs
Content-Type: application/json
Etag: 7kfvPXisoVBfC43IiPKYNb
Last-Modified: Thu, 01 Nov 2012 00:24:07 GMT

[{"item":"kale","count":10},{"item":"almonds","count":12}]
--Ql3O0enxVdaMF3YlXFOdmO5bvrs--
```

Whoa! What's all that?

Since there was a conflict between what Mark and Andy set the fridge value to be, Riak kept both values.

<h4>VTag</h4>

Since we're using the HTTP client, Riak returned a `300 Multiple Choices` code with a `multipart/mixed` MIME type. It's up to you to parse the results (or you can request a specific value by its Etag, also called a Vtag).

Issuing a plain get on the `shopping/fridge/97207` key will also return the vtags of all siblings.

```
curl "$RIAK/types/shopping/buckets/fridge/keys/97207"
Siblings:
62NRijQH3mRYPRybFneZaY
7kfvPXisoVBfC43IiPKYNb
```

What can you do with these tags? You can request the value of a specific sibling by its `vtag`. To get the first sibling in the list (Mark's milk):

```bash
curl "$RIAK/types/shopping/buckets/fridge/keys/97207?vtag=62NRijQH3mRYPRybFneZaY"
[{"item":"kale","count":10},{"item":"milk","count":1}]
```

If you want to retrieve all sibling data, tell Riak that you'll accept the multipart message by adding `-H "Accept:multipart/mixed"`.

```bash
curl "$RIAK/types/shopping/buckets/fridge/keys/97207" \
  -H "Accept:multipart/mixed"
```

<h4>Resolving Conflicts</h4>

When we have conflicting writes, we want to resolve them. Since that problem is typically *use-case specific*, Riak defers it to us, and our application must decide how to proceed.

For our example, let's merge the values into a single result set, taking the larger *count* if the *item* is the same. When done, write the new results back to Riak with the vclock of the multipart object, so Riak knows you're resolving the conflict, and you'll get back a new vector clock.

Successive reads will then receive a single (merged) result.

```bash
curl -i -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207?returnbody=true" \
  -H "Content-Type:application/json" \
  -H "X-Riak-Vclock:a85hYGBgzGDKBVIcypz/fgaUHjmTwZTInMfKoG7LdoovCwA=" \
  -d '[{"item":"kale","count":10},{"item":"milk","count":1},{"item":"almonds","count":12}]'
```
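If you script this cycle, the only fussy part is carrying the vector clock from the read to the write. Here's a minimal sketch of that plumbing using standard shell tools (`grep`, `sed`, and `tr` are assumed); the merged JSON is written by hand here, standing in for whatever merge logic your application applies.

```bash
# Grab the current vclock for the key (a HEAD request returns the same
# headers as a GET), then write the merged value back with it so Riak
# knows this write resolves the conflict.
VCLOCK=$(curl -s -I "$RIAK/types/shopping/buckets/fridge/keys/97207" \
  | grep -i '^X-Riak-Vclock' | sed 's/^[^:]*: *//' | tr -d '\r')

curl -XPUT "$RIAK/types/shopping/buckets/fridge/keys/97207" \
  -H "Content-Type:application/json" \
  -H "X-Riak-Vclock:$VCLOCK" \
  -d '[{"item":"kale","count":10},{"item":"milk","count":1},{"item":"almonds","count":12}]'
```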

<h4>Last write wins vs. siblings</h4>

+ +Your data and your business needs will dictate which approach to conflict resolution is appropriate. You don't need to choose one strategy globally; instead, feel free to take advantage of Riak's buckets to specify which data uses siblings and which blindly retains the last value written. + +A quick recap of the two configuration values you'll want to set: + +* `allow_mult` defaults to `false`, which means that the last write wins. +* Setting `allow_mult` to `true` instructs Riak to retain conflicting writes as siblings. +* `last_write_wins` defaults to `false`, which (perhaps counter-intuitively) still can mean that the behavior is last write wins: `allow_mult` is the key parameter for the behavioral toggle. +* Setting `last_write_wins` to true will optimize writes by assuming that previous vector clocks have no inherent value. +* Setting both `allow_mult` and `last_write_wins` to `true` is unsupported and will result in undefined behavior. + +
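Both knobs are plain bucket properties, so they're set the same way we set `allow_mult` earlier. As a sketch, here's how you might configure a hypothetical `page_cache` bucket where blind overwrites are perfectly acceptable:

```bash
curl -i -XPUT "$RIAK/types/default/buckets/page_cache/props" \
  -H "Content-Type:application/json" \
  -d '{"props":{"allow_mult":false,"last_write_wins":true}}'
```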

<h3>Read Repair</h3>

+ +When a successful read happens, but not all replicas agree upon the value, this triggers a *read repair*. This means that Riak will update the replicas with the most recent value. This can happen either when an object is not found (the vnode has no copy) or a vnode contains an older value (older means that it is an ancestor of the newest vector clock). Unlike `last_write_wins` or manual conflict resolution, read repair is (obviously, I hope, by the name) triggered by a read, rather than a write. + +If your nodes get out of sync (for example, if you increase the `n_val` on a bucket), you can force read repair by performing a read operation for all of that bucket's keys. They may return with `not found` the first time, but later reads will pull the newest values. + +
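That bulk-read trick is easy to script. A rough sketch over HTTP, assuming the `jq` utility is available for JSON parsing (and remembering that listing keys is expensive, so treat this as a maintenance operation):

```bash
# List every key in the bucket, then read each one back; stale or
# missing replicas get repaired as a side effect of the reads.
for key in $(curl -s "$RIAK/types/default/buckets/cart/keys?keys=true" \
             | jq -r '.keys[]'); do
  curl -s -o /dev/null "$RIAK/types/default/buckets/cart/keys/$key"
done
```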

<h3>Active Anti-Entropy (AAE)</h3>

Although resolving conflicting data during get requests via read repair is sufficient for most needs, data which is never read can eventually be lost as nodes fail and are replaced.

Riak supports active anti-entropy (AAE) to proactively identify and repair inconsistent data. This feature is also helpful for recovering data loss in the event of disk corruption or administrative error.

The overhead for this functionality is minimized by maintaining sophisticated hash trees ("Merkle trees") which make it easy to compare data sets between vnodes, but if desired the feature can be disabled.

## Querying

So far we've only dealt with key-value lookups. The truth is, key-value is a pretty powerful mechanism that spans a spectrum of use-cases. However, sometimes we need to look up data by value, rather than by key. Sometimes we need to perform some calculations, or aggregations, or search.

<h3>Secondary Indexing (2i)</h3>

A *secondary index* (2i) is a data structure that lowers the cost of
finding non-key values. Like many other databases, Riak has the
ability to index data. However, since Riak has no real knowledge of
the data it stores (they're just binary values), it indexes values by
way of metadata headers, with a naming pattern that marks each index
as holding either integers or binary values.

If your installation is configured to use 2i (shown in the next chapter),
simply writing a value to Riak with an index header will index it,
provided the header is prefixed by `X-Riak-Index-` and suffixed by `_int` for an
integer, or `_bin` for a string.

```bash
curl -i -XPUT $RIAK/types/shopping/buckets/people/keys/casey \
  -H "Content-Type:application/json" \
  -H "X-Riak-Index-age_int:31" \
  -H "X-Riak-Index-fridge_bin:97207" \
  -d '{"work":"rodeo clown"}'
```

Querying can be done in two forms: exact match and range. Add a couple more people and we'll see what we get: `mark` is `32` and `andy` is `35`; they both share fridge `97207`.

Which people own `97207`? It's a quick lookup to receive the
keys that have matching index values.

```bash
curl "$RIAK/types/shopping/buckets/people/index/fridge_bin/97207"
{"keys":["mark","casey","andy"]}
```

With those keys it's a simple lookup to get the bodies.

The other query option is an inclusive ranged match. This finds all
people aged `32` and under, by searching between `0` and `32`.

```bash
curl "$RIAK/types/shopping/buckets/people/index/age_int/0/32"
{"keys":["mark","casey"]}
```

That's about it. It's a basic form of 2i, with a decent array of utility.
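One practical addition: on large result sets you can page through 2i matches rather than pulling every key at once, using the `max_results` parameter (available since Riak 1.4). A sketch:

```bash
# First page of two results; the response also carries a continuation
# token, which you pass back as ?continuation=... for the next page.
curl "$RIAK/types/shopping/buckets/people/index/age_int/0/100?max_results=2"
```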

<h3>MapReduce</h3>

MapReduce is a method of aggregating large amounts of data by separating the
processing into two phases, map and reduce, which are themselves executed
in parts. Map will be executed per object to convert/extract some value,
then those mapped values will be reduced into some aggregate result. What
do we gain from this structure? It's predicated on the idea that it's cheaper
to move the algorithms to where the data lives than to transfer massive
amounts of data to a single server to run a calculation.

This method, popularized by Google, can be seen in a wide array of NoSQL
databases. In Riak, you execute a MapReduce job on a single node, which
then propagates to the other nodes. The results are mapped and reduced,
then further reduced down to the calling node and returned.

![MapReduce Returning Name Char Count](../assets/mapreduce.svg)

Let's assume we have a bucket for log values that stores messages
prefixed by either INFO or ERROR. We want to count the number of INFO
logs that contain the word "cart".

```bash
LOGS=$RIAK/types/default/buckets/logs/keys
curl -XPOST $LOGS -d "INFO: New user added"
curl -XPOST $LOGS -d "INFO: Kale added to shopping cart"
curl -XPOST $LOGS -d "INFO: Milk added to shopping cart"
curl -XPOST $LOGS -d "ERROR: shopping cart cancelled"
```

MapReduce jobs can be either Erlang or JavaScript code. This time we'll go the
easy route and write JavaScript. You execute MapReduce by posting JSON to the
`/mapred` path: a map function that emits `1` for each matching log entry, and
a reduce function that sums those ones (here the result is `[2]`, for the two
INFO entries that mention a cart).

```bash
curl -XPOST "$RIAK/mapred" \
  -H "Content-Type: application/json" \
  -d @- \
<<EOF
{
  "inputs": "logs",
  "query": [{
    "map": {
      "language": "javascript",
      "source": "function(riakObject, keydata, arg) {
        var m = riakObject.values[0].data.match(/^INFO.*cart/);
        return [(m ? 1 : 0)];
      }"
    },
    "reduce": {
      "language": "javascript",
      "source": "function(values, arg) {
        return [values.reduce(
          function(total, v) { return total + v; }, 0)];
      }"
    }
  }]
}
EOF
```

<h4>MR + 2i</h4>

Another option when using MapReduce is to combine it with secondary indexes.
You can pipe the results of a 2i query into a MapReduce job: simply specify the
index you wish to use, and either a `key` for an index lookup, or `start` and
`end` values for a ranged query.

```json
  ...
  "inputs":{
    "bucket":"people",
    "index": "age_int",
    "start": 18,
    "end": 32
  },
  ...
```

MapReduce in Riak is a powerful way of pulling data out of an
otherwise straight key/value store. But we have one more method of finding
data in Riak.
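Before we get to it, one aside: if you'd rather not embed JavaScript, the same job can reference a compiled Erlang function instead. A sketch of the job spec, assuming a hypothetical `mr_logs` module exporting `map_cart_info/3`, compiled and added to Riak's code path just like the commit hook code earlier:

```bash
# Map-only job; the mapped 1s and 0s are returned without a reduce step.
curl -XPOST "$RIAK/mapred" \
  -H "Content-Type: application/json" \
  -d '{"inputs":"logs",
       "query":[{"map":{"language":"erlang",
                        "module":"mr_logs",
                        "function":"map_cart_info"}}]}'
```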

<h3>Search 2.0</h3>

Search 2.0 is a complete, from-scratch reimagining of distributed search
in Riak. It's an extension to Riak that lets you perform searches to find
values in a Riak cluster. Unlike the original Riak Search, Search 2.0
leverages distributed Solr to perform the inverted indexing and management of
retrieving matching values.

Before using Search 2.0, you'll have to have it installed and a bucket set
up with an index (these details can be found in the next chapter).

The simplest example is a full-text search. Here we add `ryan` to the
`people` bucket (with a default index).

```bash
curl -XPUT "$RIAK/types/default/buckets/people/keys/ryan" \
  -H "Content-Type:text/plain" \
  -d "Ryan Zezeski"
```

To execute a search, request `/solr/<index>/select` along with any distributed
[Solr parameters](http://wiki.apache.org/solr/CommonQueryParameters). Here we
query for documents that contain a word starting with `zez`, request the
results in json format (`wt=json`), and only return the Riak key
(`fl=_yz_rk`).

```bash
curl "$RIAK/solr/people/select?wt=json&omitHeader=true&fl=_yz_rk&q=zez*"
{
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 1.0,
    "docs": [
      {
        "_yz_rk": "ryan"
      }
    ]
  }
}
```

With the matching `_yz_rk` keys, you can retrieve the bodies with a simple
Riak lookup.

Search 2.0 supports Solr 4.0, which includes filter queries, ranges, page scores,
start values and rows (the last two are useful for pagination). You can also
receive snippets of matching
[highlighted text](http://wiki.apache.org/solr/HighlightingParameters)
(`hl`,`hl.fl`), which is useful for building a search engine (and something
we use for [search.basho.com](http://search.basho.com)). You can perform
facet searches, stats, geolocation, bounding shapes, or any other search
possible with distributed Solr.
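For instance, here's a sketch of paginating through the `people` index with the standard `start` and `rows` parameters; any other Solr query parameter passes through the same way.

```bash
# Second page of results, ten documents per page
curl "$RIAK/solr/people/select?wt=json&q=*:*&start=10&rows=10"
```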

<h4>Tagging</h4>

Another useful feature of Search 2.0 is the tagging of values. Tagging
values gives additional context to a Riak value. The current implementation
requires all tagged values begin with `X-Riak-Meta` and be listed under
a special header named `X-Riak-Meta-yz-tags`.

```bash
curl -XPUT "$RIAK/types/default/buckets/people/keys/dave" \
  -H "Content-Type:text/plain" \
  -H "X-Riak-Meta-yz-tags: X-Riak-Meta-nickname_s" \
  -H "X-Riak-Meta-nickname_s:dizzy" \
  -d "Dave Smith"
```

To search by the `nickname_s` tag, just prefix the query with the tag name
followed by a colon.

```bash
curl "$RIAK/solr/people/select?wt=json&omitHeader=true&q=nickname_s:dizzy"
{
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 1.4054651,
    "docs": [
      {
        "nickname_s": "dizzy",
        "id": "dave_25",
        "_yz_ed": "20121102T215100 dave m7psMIomLMu/+dtWx51Kluvvrb8=",
        "_yz_fpn": "23",
        "_yz_node": "dev1@127.0.0.1",
        "_yz_pn": "25",
        "_yz_rk": "dave",
        "_version_": 1417562617478643712
      }
    ]
  }
}
```

Notice that the `docs` returned also contain `"nickname_s":"dizzy"` as a
value. All tagged values will be returned on matching results.

<h4>Datatypes</h4>

One of the more powerful combinations in Riak 2.0 is datatypes plus Search.
If you set both a datatype and a search index in a bucket type's properties,
values you set are indexed as you'd expect. Map fields are indexed as their
given types, sets are multi-field strings, counters are indexed as integers,
and flags are boolean. Nested maps are also indexed, separated by dots, and
queryable in that manner.

For example, remember Joe, from the datatype section? Let's assume that
this `people` bucket is indexed. And let's also add another pet.

```bash
curl -XPUT "$RIAK/types/map/buckets/people/keys/joe" \
  -H "Content-Type:application/json" \
  -d '{"update": {"pets_set": {"add":"dog"}}}'
```

Then let's search for `pets_set:dog`, filtering so that only the
type/bucket/key fields are returned.

```bash
curl "$RIAK/solr/people/select?wt=json&omitHeader=true&fl=_yz_rt,_yz_rb,_yz_rk&q=pets_set:dog"
{
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 1.0,
    "docs": [
      {
        "_yz_rt": "map",
        "_yz_rb": "people",
        "_yz_rk": "joe"
      }
    ]
  }
}
```

Bravo. You've now found the object you wanted. Thanks to Solr's customizable
schema, you can even store the field you want to return, if it's really that
important to save a second lookup.

This provides the best of both worlds. You can update and query values without
fear of conflicts, and can query Riak based on field values. It doesn't require
much imagination to see that this combination effectively turns Riak into
a scalable, stable, highly available document datastore. Throw strong consistency
into the mix (which we'll do in the next chapter) and you can store and query
pretty much anything in Riak, in any way.

If you're wondering to yourself, "What exactly does Mongo provide, again?", well,
I didn't ask it. You did. But that is a great question...

Well, moving on.


## Wrap-up

Riak is a distributed data store with several additions to improve upon the
standard key-value lookups, like specifying replication values. Since values
in Riak are opaque, many of these methods either require custom code to
extract and give meaning to values, such as *MapReduce*, or allow for
header metadata to provide an added descriptive dimension to the object,
such as *secondary indexes* or *search*.

Next, we'll peek further under the hood and show you how to set up and manage
a cluster of your own.
diff --git a/zh/04-operators.md b/zh/04-operators.md
new file mode 100644
index 0000000..5333bb6
--- /dev/null
+++ b/zh/04-operators.md
@@ -0,0 +1,1606 @@
# Operators

In some ways, Riak is downright mundane in its role as the easiest
NoSQL database to operate. Want more servers? Add them. A network
cable is cut at 2am? Deal with it after a few more hours of
sleep. Understanding this integral part of your application stack is
still important, however, despite Riak's reliability.

We've covered the core concepts of Riak, and I've provided a taste of
how to use it, but there is more to the database than that. There are
details you should know if you plan on operating a Riak cluster of
your own.

## Clusters

Up to this point you've conceptually read about "clusters" and the "Ring" in
nebulous summations. What exactly do we mean, and what are the practical
implications of these details for Riak developers and operators?

A *cluster* in Riak is a managed collection of nodes that share a common Ring.

<h3>The Ring</h3>

*The Ring* in Riak is actually a two-fold concept.

Firstly, the Ring represents the consistent hash partitions (the partitions
managed by vnodes). This partition range is treated as circular, from 0 to
2^160-1 back to 0 again. (If you're wondering, yes this means that we are
limited to 2^160 nodes, which is a limit of about 1.46 quindecillion, or
`1.46 x 10^48`, nodes per cluster. For comparison, there are only `1.92 x 10^49`
[silicon atoms on Earth](http://education.jlab.org/qa/mathatom_05.html).)

When we consider replication, the N value defines how many nodes an object is
replicated to. Riak makes a best attempt at spreading those replicas across as many
nodes as it can, so it copies to the next N adjacent partitions, starting with the
primary partition and counting around the Ring; if it reaches the last
partition, it loops back around to the first one.

Secondly, the Ring is also used as a shorthand for describing the state of the
circular hash ring I just mentioned. This Ring (aka *Ring State*) is a
data structure that gets passed around between nodes, so each knows the state
of the entire cluster. Which node manages which vnodes? If a node gets a
request for an object managed by other nodes, it consults the Ring and forwards
the request to the proper nodes. It's a local copy of a contract that all of
the nodes agree to follow.

Obviously, this contract needs to stay in sync between all of the nodes. If a node is permanently taken
offline or a new one added, the other nodes need to readjust, balancing the partitions around the cluster,
then updating the Ring with this new structure. This Ring state gets passed between the nodes by means of
a *gossip protocol*.

<h3>Gossip and CMD</h3>

Riak has two methods of keeping nodes current on the state of the Ring. The first, and oldest, is the *gossip protocol*. If a node's state in the cluster is altered, information is propagated to other nodes. Periodically, nodes will also send their status to a random peer for added consistency.

A newer method of information exchange in Riak is *cluster metadata* (CMD), which uses a more sophisticated method (plum-tree, DVV consistent state) to pass large amounts of metadata between nodes. The superiority of CMD is one of the benefits of using bucket types in Riak 2.0, discussed below.

In both cases, propagating changes in the Ring is an asynchronous operation, and can take a couple of minutes depending on Ring size.

<h3>How Replication Uses the Ring</h3>

Even if you are not a programmer, it's worth taking a look at this Ring example. It's also worth
remembering that partitions are managed by vnodes, and in conversation the two terms are sometimes
used interchangeably, though I'll try to be more precise here.

Let's start with Riak configured to have 8 partitions, which are set via `ring_size`
in the `etc/riak.conf` file (we'll dig deeper into this file later).

```bash
## Number of partitions in the cluster (only valid when first
## creating the cluster). Must be a power of 2, minimum 8 and maximum
## 1024.
##
## Default: 64
##
## Acceptable values:
##   - an integer
ring_size = 8
```

In this example, I have a total of 4 Riak nodes running on `riak@AAA.cluster`,
`riak@BBB.cluster`, `riak@CCC.cluster`, and `riak@DDD.cluster`, each with two partitions (and thus vnodes).

Riak has the amazing, and dangerous, `attach` command that attaches an Erlang console to a live Riak
node, with access to all of the Riak modules.

The `riak_core_ring:chash(Ring)` function extracts the total count of partitions (8), with an array
of numbers representing the start of each partition (some fraction of the 2^160 range) and the node
name that represents a particular Riak server in the cluster.

```bash
$ bin/riak attach
(riak@AAA.cluster)1> {ok,Ring} = riak_core_ring_manager:get_my_ring().
(riak@AAA.cluster)2> riak_core_ring:chash(Ring).
{8,
 [{0,'riak@AAA.cluster'},
  {182687704666362864775460604089535377456991567872, 'riak@BBB.cluster'},
  {365375409332725729550921208179070754913983135744, 'riak@CCC.cluster'},
  {548063113999088594326381812268606132370974703616, 'riak@DDD.cluster'},
  {730750818665451459101842416358141509827966271488, 'riak@AAA.cluster'},
  {913438523331814323877303020447676887284957839360, 'riak@BBB.cluster'},
  {1096126227998177188652763624537212264741949407232, 'riak@CCC.cluster'},
  {1278813932664540053428224228626747642198940975104, 'riak@DDD.cluster'}]}
```

To discover which partition the bucket/key `food/favorite` object would be stored in, for example,
we execute `riak_core_util:chash_key( {<<"food">>, <<"favorite">>} )` and get a wacky 160 bit Erlang
number we named `DocIdx` (document index).

Just to illustrate that the Erlang binary value is a real number, the next line converts it to a more
readable format, similar to the ring partition numbers.

```bash
(riak@AAA.cluster)3> DocIdx =
(riak@AAA.cluster)3> riak_core_util:chash_key({<<"food">>,<<"favorite">>}).
<<80,250,1,193,88,87,95,235,103,144,152,2,21,102,201,9,156,102,128,3>>

(riak@AAA.cluster)4> <<I:160/integer>> = DocIdx. I.
462294600869748304160752958594990128818752487427
```

With this `DocIdx` number, we can order the partitions, starting with the first number greater than
`DocIdx`. The remaining partitions are in numerical order, until we reach zero, then
we loop around and continue to exhaust the list.

```bash
(riak@AAA.cluster)5> Preflist = riak_core_ring:preflist(DocIdx, Ring).
[{548063113999088594326381812268606132370974703616, 'riak@DDD.cluster'},
 {730750818665451459101842416358141509827966271488, 'riak@AAA.cluster'},
 {913438523331814323877303020447676887284957839360, 'riak@BBB.cluster'},
 {1096126227998177188652763624537212264741949407232, 'riak@CCC.cluster'},
 {1278813932664540053428224228626747642198940975104, 'riak@DDD.cluster'},
 {0,'riak@AAA.cluster'},
 {182687704666362864775460604089535377456991567872, 'riak@BBB.cluster'},
 {365375409332725729550921208179070754913983135744, 'riak@CCC.cluster'}]
```

So what does all this have to do with replication? With the above list, we simply replicate a write
down the list N times. If we set N=3, then the `food/favorite` object will be written to
the `riak@DDD.cluster` node's partition `5480631...` (I truncated the number here),
`riak@AAA.cluster` partition `7307508...`, and `riak@BBB.cluster` partition `9134385...`.

If something has happened to one of those nodes, like a network split
(confusingly also called a partition---the "P" in "CAP"), the remaining
active nodes in the list become candidates to hold the data.

So if the node coordinating the write could not reach node
`riak@AAA.cluster` to write to partition `7307508...`, it would then attempt
to write that partition `7307508...` to `riak@CCC.cluster` as a fallback
(the next node in the preflist after the 3 primaries).

The way that the Ring is structured allows Riak to ensure data is always
written to the appropriate number of physical nodes, even in cases where one
or more physical nodes are unavailable. It does this by simply trying the next
available node in the preflist.

<h3>Hinted Handoff</h3>

When a node goes down, data is replicated to a backup node. This is
not permanent; Riak will periodically examine whether each vnode
resides on the correct physical node and hand vnodes off to the proper
node when possible.

As long as the primary node remains unreachable, the temporary node will continue
to accept write and read requests on behalf of its incapacitated brethren.

Hinted handoff not only helps Riak achieve high availability, it also facilitates
data migration when physical nodes are added to or removed from the Ring.


## Managing a Cluster

Now that we have a grasp of the general concepts of Riak, how users query it,
and how Riak manages replication, it's time to build a cluster. It's so easy to
do, in fact, I didn't bother discussing it for most of this book.

<h3>Install</h3>

+ +The Riak docs have all of the information you need to [install](http://docs.basho.com/riak/latest/tutorials/installation/) it per operating system. The general sequence is: + +1. Install Erlang +2. Get Riak from a package manager (a la `apt-get` or Homebrew), or build from source (the results end up under `rel/riak`, with the binaries under `bin`). +3. Run `riak start` + +Install Riak on four or five nodes---five being the recommended safe minimum for production. Fewer nodes are OK during software development and testing. + +

<h3>Command Line</h3>

Most Riak operations can be performed through the command line. We'll concern ourselves with two commands: `riak` and `riak-admin`.

<h4>riak</h4>

Simply typing the `riak` command will give a usage list. If you want more information, you can try `riak help`.

```bash
Usage: riak <command>
where <command> is one of the following:
  { help | start | stop | restart | ping | console | attach
    attach-direct | ertspath | chkconfig | escript | version | getpid
    top [-interval N] [-sort { reductions | memory | msg_q }] [-lines N] } |
    config { generate | effective | describe VARIABLE } [-l debug]

Run 'riak help' for more detailed information.
```

Most of these commands are self explanatory, once you know what they mean. `start` and `stop` are simple enough. `restart` means to stop the running node and restart it inside of the same Erlang VM (virtual machine), while `reboot` will take down the Erlang VM and restart everything.

You can print the current running `version`. `ping` will return `pong` if the server is in good shape; otherwise you'll get the *just-similar-enough-to-be-annoying* response `pang` (with an *a*), or a simple `Node X not responding to pings` if it's not running at all.

`chkconfig` is useful if you want to ensure your `etc/riak.conf` is not broken
(that is to say, it's parsable). I mentioned `attach` briefly above, when
we looked into the details of the Ring---it attaches a console to the local
running Riak server so you can execute Riak's Erlang code. `escript` is similar
to `attach`, except you pass in a script file of commands you wish to run automatically.
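The `config` subcommands deserve a special mention when you're debugging settings. A quick sketch (output elided): `describe` prints the documentation for a single variable, while `effective` shows what the node actually computed from `riak.conf` plus defaults.

```bash
$ riak config describe ring_size
$ riak config effective | grep ring_size
```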

<h4>riak-admin</h4>

The `riak-admin` command is the meat of Riak operations, the tool you'll use most often. This is where you'll join nodes to the Ring, diagnose issues, check status, and trigger backups.

```bash
Usage: riak-admin { cluster | join | leave | backup | restore | test |
                    reip | js-reload | erl-reload | wait-for-service |
                    ringready | transfers | force-remove | down |
                    cluster-info | member-status | ring-status | vnode-status |
                    aae-status | diag | status | transfer-limit | reformat-indexes |
                    top [-interval N] [-sort reductions|memory|msg_q] [-lines N] |
                    downgrade-objects | security | bucket-type | repair-2i |
                    search | services | ensemble-status }
```

For more information on commands, you can try `man riak-admin`.

A few of these commands are deprecated, and many don't make sense without a
cluster, but some we can look at now.

`status` outputs a list of information about this cluster. It's mostly the same information you can get from getting `/stats` via HTTP, although the coverage of information is not exact (for example, `riak-admin status` returns `disk`, and `/stats` returns some computed values like `gossip_received`).

```bash
$ riak-admin status
1-minute stats for 'riak@AAA.cluster'
-------------------------------------------
vnode_gets : 0
vnode_gets_total : 2
vnode_puts : 0
vnode_puts_total : 1
vnode_index_reads : 0
vnode_index_reads_total : 0
vnode_index_writes : 0
vnode_index_writes_total : 0
vnode_index_writes_postings : 0
vnode_index_writes_postings_total : 0
vnode_index_deletes : 0
...
```

New JavaScript or Erlang files (as we created in the [developers](#developers) chapter) are not usable by the nodes until they are informed about them by the `js-reload` or `erl-reload` command.

`riak-admin` also provides a little `test` command, so you can perform a read/write cycle
to a node, which I find useful for testing a client's ability to connect, and the node's
ability to write.

Finally, `top` is an analysis command checking the Erlang details of a particular node in
real time. Different processes have different process ids (Pids), use varying amounts of memory,
queue up so many messages at a time (MsgQ), and so on. This is useful for advanced diagnostics,
and is especially useful if you know Erlang or need help from other users, the Riak team, or
Basho.

![Top](../assets/top.png)

<h3>Making a Cluster</h3>

With several solitary nodes running---assuming they are networked and are able to communicate to
each other---launching a cluster is the simplest part.

Executing the `cluster` command will output a descriptive set of commands.

```bash
$ riak-admin cluster
The following commands stage changes to cluster membership. These commands
do not take effect immediately. After staging a set of changes, the staged
plan must be committed to take effect:

  join <node>                    Join node to the cluster containing <node>
  leave                          Have this node leave the cluster and shutdown
  leave <node>                   Have <node> leave the cluster and shutdown

  force-remove <node>            Remove <node> from the cluster without
                                 first handing off data. Designed for
                                 crashed, unrecoverable nodes

  replace <node1> <node2>        Have <node1> transfer all data to <node2>,
                                 and then leave the cluster and shutdown

  force-replace <node1> <node2>  Reassign all partitions owned by <node1>
                                 to <node2> without first handing off data,
                                 and remove <node1> from the cluster.

Staging commands:
  plan                           Display the staged changes to the cluster
  commit                         Commit the staged changes
  clear                          Clear the staged changes
```

To create a new cluster, you must `join` another node (any will do). Taking a
node out of the cluster uses `leave` or `force-remove`, while swapping out
an old node for a new one uses `replace` or `force-replace`.

I should mention here that using `leave` is the nice way of taking a node
out of commission. However, you don't always get that choice. If a server
happens to explode (or simply smoke ominously), you don't need its approval
to remove it from the cluster, but can instead mark it as `down`.

But before we worry about removing nodes, let's add some first. Run the `join`
command from each new node, pointing it at any node in the existing cluster.

```bash
$ riak-admin cluster join riak@AAA.cluster
Success: staged join request for 'riak@BBB.cluster' to 'riak@AAA.cluster'
$ riak-admin cluster join riak@AAA.cluster
Success: staged join request for 'riak@CCC.cluster' to 'riak@AAA.cluster'
```

Once all changes are staged, you must review the cluster `plan`. It will give you
all of the details of the nodes that are joining the cluster, and what it
will look like after each step or *transition*, including the `member-status`,
and how the `transfers` plan to handoff partitions.

Below is a simple plan, but there are cases when Riak requires multiple
transitions to enact all of your requested actions, such as adding and removing
nodes in one stage.
```bash
$ riak-admin cluster plan
=============================== Staged Changes ==============
Action         Nodes(s)
-------------------------------------------------------------
join           'riak@BBB.cluster'
join           'riak@CCC.cluster'
-------------------------------------------------------------


NOTE: Applying these changes will result in 1 cluster transition

#############################################################
                After cluster transition 1/1
#############################################################

================================= Membership ================
Status     Ring    Pending    Node
-------------------------------------------------------------
valid     100.0%     34.4%    'riak@AAA.cluster'
valid       0.0%     32.8%    'riak@BBB.cluster'
valid       0.0%     32.8%    'riak@CCC.cluster'
-------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct nodes

Transfers resulting from cluster changes: 42
  21 transfers from 'riak@AAA.cluster' to 'riak@CCC.cluster'
  21 transfers from 'riak@AAA.cluster' to 'riak@BBB.cluster'
```

Making changes to cluster membership can be fairly resource intensive,
so Riak defaults to only performing 2 transfers at a time. You can
choose to alter this `transfer-limit` using `riak-admin`, but bear in
mind that the higher the number, the more normal operations will be
impeded.

At this point, if you find a mistake in the plan, you have the chance to `clear` it and try
again. When you are ready, `commit` the cluster to enact the plan.

```bash
$ riak-admin cluster commit
Cluster changes committed
```

Without any data, adding a node to a cluster is a quick operation. However, with large amounts of
data to be transferred to a new node, it can take quite a while before the new node is ready to use.
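Removing a node follows the same staged pattern. A sketch of retiring a node gracefully, handing its data off before shutdown:

```bash
$ riak-admin cluster leave riak@CCC.cluster
$ riak-admin cluster plan
$ riak-admin cluster commit
```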

<h3>Status Options</h3>

+ +To check on a launching node's progress, you can run the `wait-for-service` command. It will +output the status of the service and stop when it's finally up. In this example, we check +the `riak_kv` service. + +```bash +$ riak-admin wait-for-service riak_kv riak@CCC.cluster +riak_kv is not up: [] +riak_kv is not up: [] +riak_kv is up +``` + +You can get a list of available services with the `services` command. + +You can also see if the whole ring is ready to go with `ringready`. If the nodes do not agree +on the state of the ring, it will output `FALSE`, otherwise `TRUE`. + +```bash +$ riak-admin ringready +TRUE All nodes agree on the ring ['riak@AAA.cluster','riak@BBB.cluster', + 'riak@CCC.cluster'] +``` + +For a more complete view of the status of the nodes in the ring, you can check out `member-status`. + +```bash +$ riak-admin member-status +================================= Membership ================ +Status Ring Pending Node +------------------------------------------------------------- +valid 34.4% -- 'riak@AAA.cluster' +valid 32.8% -- 'riak@BBB.cluster' +valid 32.8% -- 'riak@CCC.cluster' +------------------------------------------------------------- +Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0 +``` + +And for more details of any current handoffs or unreachable nodes, try `ring-status`. It +also lists some information from `ringready` and `transfers`. Below I turned off the C +node to show what it might look like. + +```bash +$ riak-admin ring-status +================================== Claimant ================= +Claimant: 'riak@AAA.cluster' +Status: up +Ring Ready: true + +============================== Ownership Handoff ============ +Owner: dev1 at 127.0.0.1 +Next Owner: dev2 at 127.0.0.1 + +Index: 182687704666362864775460604089535377456991567872 + Waiting on: [] + Complete: [riak_kv_vnode,riak_pipe_vnode] +... + +============================== Unreachable Nodes ============ +The following nodes are unreachable: ['riak@CCC.cluster'] + +WARNING: The cluster state will not converge until all nodes +are up. Once the above nodes come back online, convergence +will continue. If the outages are long-term or permanent, you +can either mark the nodes as down (riak-admin down NODE) or +forcibly remove the nodes from the cluster (riak-admin +force-remove NODE) to allow the remaining nodes to settle. +``` + +If all of the above information options about your nodes weren't enough, you can +list the status of each vnode per node, via `vnode-status`. It'll show each +vnode by its partition number, give any status information, and a count of each +vnode's keys. Finally, you'll get to see each vnode's backend type---something I'll +cover in the next section. + +```bash +$ riak-admin vnode-status +Vnode status information +------------------------------------------- + +VNode: 0 +Backend: riak_kv_bitcask_backend +Status: +[{key_count,0},{status,[]}] + +VNode: 91343852333181432387730302044767688728495783936 +Backend: riak_kv_bitcask_backend +Status: +[{key_count,0},{status,[]}] + +VNode: 182687704666362864775460604089535377456991567872 +Backend: riak_kv_bitcask_backend +Status: +[{key_count,0},{status,[]}] + +VNode: 274031556999544297163190906134303066185487351808 +Backend: riak_kv_bitcask_backend +Status: +[{key_count,0},{status,[]}] + +VNode: 365375409332725729550921208179070754913983135744 +Backend: riak_kv_bitcask_backend +Status: +[{key_count,0},{status,[]}] +... 
```

Some commands we did not cover are either deprecated in favor of their `cluster`
equivalents (`join`, `leave`, `force-remove`, `replace`, `force-replace`), or
flagged for future removal, like `reip` (use `cluster replace` instead).

I know this was a lot to digest, and probably pretty dry. Walking through command
line tools usually is. There are plenty of details behind many of the `riak-admin`
commands, too numerous to cover in such a short book. I encourage you to toy around
with them on your own installation.


## New in Riak 2.0

Riak has been a project since 2009. And in that time, it has undergone a few evolutions, largely technical improvements, such as more reliability and data safety mechanisms like active anti-entropy.

Riak 2.0 was not a rewrite, but rather a huge shift in how developers who use Riak interact with it. While Basho continued to make backend improvements (such as better cluster metadata) and simplified using existing options (`repair-2i` is now a `riak-admin` command, rather than code you must execute), the biggest changes are immediately obvious to developers. Many of those improvements also make administration easier for operators. So here are a few highlights of the new 2.0 interface options.

<h3>Bucket Types</h3>

A centerpiece of the new Riak 2.0 features is the addition of a higher-level bucket configuration namespace called *bucket types*. We discussed the general idea of bucket types in the previous chapters, but one major departure from standard buckets is that they are created via the command-line. This means that operators with server access can manage the default properties that all buckets of a given bucket type inherit.

Bucket types have a set of tools for creating, managing, and activating them.

```bash
$ riak-admin bucket-type
Usage: riak-admin bucket-type <command>

The following commands can be used to manage bucket types for the cluster:

  list                  List all bucket types and their activation status
  status <type>         Display the status and properties of a type
  activate <type>       Activate a type
  create <type> <json>  Create or modify a type before activation
  update <type> <json>  Update a type after activation
```

It's rather straightforward to `create` a bucket type. The JSON string accepted after the bucket type name can be any valid bucket properties. Any bucket that uses this type will inherit those properties. For example, say that you wanted to create a bucket type named `unsafe`, whose `n_val` was always 1 (rather than the default 3).

```bash
$ riak-admin bucket-type create unsafe '{"props":{"n_val":1}}'
```

Once you create the bucket type, it's a good idea to check the `status` and ensure the properties are what you meant to set.

```bash
$ riak-admin bucket-type status unsafe
```

A bucket type is not active until you propagate it through the system by calling the `activate` command.

```bash
$ riak-admin bucket-type activate unsafe
```

If something is wrong with the type's properties, you can always `update` it.

```bash
$ riak-admin bucket-type update unsafe '{"props":{"n_val":1}}'
```

You can update a bucket type after it's activated. All of the changes that you make to the type will be inherited by every bucket under that type.

Of course, you can always get a `list` of the current bucket types in the system. The list will also say whether the bucket type is activated or not.

Other than that, there's nothing interesting about bucket types from an operations point of view, per se. Sure, there are some cool internal mechanisms at work, such as propagated metadata via a path laid out by a plum-tree, causally tracked by dotted version vectors. But that's only code plumbing. What's most interesting about bucket types are the new features you can take advantage of: datatypes, strong consistency, and search.
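Before we move on to those, one quick sanity check worth knowing: because buckets inherit their type's properties, you can confirm what a bucket will actually use by fetching its props over HTTP. A sketch against a hypothetical bucket name (any bucket under the type behaves the same; expect `"n_val":1` in the response):

```bash
curl "$RIAK/types/unsafe/buckets/any_bucket/props"
```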

<h3>Datatypes</h3>

Datatypes are useful for engineers, since they no longer have to consider the complexity of manual conflict merges that can occur in fault situations. They can also put less stress on the system, since larger objects need only communicate their changes, rather than reinsert the full object.

Riak 2.0 supports four datatypes: *map*, *set*, *counter*, and *flag*. You create a bucket type with a single datatype. It's not required, but it's often good form to name the bucket type after the datatype you're setting.

```bash
$ riak-admin bucket-type create maps '{"props":{"datatype":"map"}}'
$ riak-admin bucket-type create sets '{"props":{"datatype":"set"}}'
$ riak-admin bucket-type create counters '{"props":{"datatype":"counter"}}'
$ riak-admin bucket-type create flags '{"props":{"datatype":"flag"}}'
```

Once a bucket type is created with the given datatype, you need only activate it. Developers can then use this datatype like we saw in the previous chapter, but hopefully this example makes clear the suggestion of naming bucket types after their datatype.

```bash
curl -XPUT "$RIAK/types/counters/buckets/visitors/keys/index_page" \
  -H "Content-Type:application/json" \
  -d 1
```
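Reading the datatype back is an ordinary GET; the value comes wrapped in a small JSON envelope rather than as an opaque blob. It should return something shaped roughly like this:

```bash
curl "$RIAK/types/counters/buckets/visitors/keys/index_page"
{"type":"counter","value":1}
```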

<h3>Strong Consistency</h3>

Strong consistency (SC) is the opposite of everything that Riak stands for. Where Riak is all about high availability in the face of network or server errors, strong consistency is about safety over liveness. Either the network and servers are working perfectly, or the reads and writes fail. So why on earth would we ever want to provide SC and give up HA? Because you asked for it. Really.

There are some very good use-cases for strong consistency. For example, when a user is completing a purchase, you might want to ensure that the system is always in a consistent state, or fail the purchase. Communicating that a purchase was made when it in fact was not is a bad user experience. The opposite is even worse.

While Riak will continue to be primarily an HA system, there are cases where SC is useful, and developers should be allowed to choose without having to install an entirely new database. So all you need to do is activate it in `riak.conf`.

```bash
strong_consistency = on
```

One thing to note: although we generally recommend that you have five nodes in a Riak cluster, it's not a hard requirement. Strong consistency, however, requires three nodes. It will not operate with fewer.

Once the SC system is active, you'll lean on bucket types again. Only buckets that live under a bucket type set up for strong consistency will be strongly consistent. This means that you can have some buckets HA and others SC, in the same database. Let's call our SC bucket type `strong`.

```bash
$ riak-admin bucket-type create strong '{"props":{"consistent":true}}'
$ riak-admin bucket-type activate strong
```

That's all the operator should need to do. The developers can use the `strong` bucket similarly to other buckets.

```bash
curl -XPUT "$RIAK/types/strong/buckets/purchases/keys/jane" \
  -d '{"action":"buy"}'
```

Jane's purchases will either succeed or fail. They will not be eventually consistent. If a write fails, of course, she can try again.

What if your system is having problems with strong consistency? Basho has provided a command to interrogate the current status of the subsystem responsible for SC, named *ensemble*. You can check it out by running `ensemble-status`.

```bash
$ riak-admin ensemble-status
```

It will give you the best information it has as to the state of the system. For example, if you didn't enable `strong_consistency` in every node's `riak.conf`, you might see this.

```bash
============================== Consensus System ===============================
Enabled:     false
Active:      false
Ring Ready:  true
Validation:  strong (trusted majority required)
Metadata:    best-effort replication (asynchronous)

Note: The consensus subsystem is not enabled.

================================== Ensembles ==================================
There are no active ensembles.
```

In the common case when all is working, you should see output similar to the following:

```bash
============================== Consensus System ===============================
Enabled:     true
Active:      true
Ring Ready:  true
Validation:  strong (trusted majority required)
Metadata:    best-effort replication (asynchronous)

================================== Ensembles ==================================
 Ensemble     Quorum        Nodes      Leader
-------------------------------------------------------------------------------
   root       4 / 4         4 / 4      riak@riak1
    2         3 / 3         3 / 3      riak@riak2
    3         3 / 3         3 / 3      riak@riak4
    4         3 / 3         3 / 3      riak@riak1
    5         3 / 3         3 / 3      riak@riak2
    6         3 / 3         3 / 3      riak@riak2
    7         3 / 3         3 / 3      riak@riak4
    8         3 / 3         3 / 3      riak@riak4
```

This output tells you that the consensus system is both enabled and active, and it lists details about all known consensus groups (ensembles).

There is plenty more information about the details of strong consistency in the online docs.
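One practical consequence for application code: since SC chooses failure over inconsistency, a failed write is an expected outcome, so clients typically wrap SC writes in a bounded retry. A sketch (`curl -f` exits nonzero on HTTP errors):

```bash
for attempt in 1 2 3; do
  curl -sf -XPUT "$RIAK/types/strong/buckets/purchases/keys/jane" \
    -H "Content-Type:application/json" \
    -d '{"action":"buy"}' && break
  sleep 1  # back off before retrying
done
```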

<h3>Search 2.0</h3>

From an operations standpoint, search is deceptively simple. Functionally, there isn't much you should need to do with search, other than activate it in `riak.conf`.

```bash
search = on
```

However, looks are deceiving. Under the covers, Riak Search 2.0 actually runs the search index software called Solr. Solr runs as a Java service. All of the code required to convert an object that you insert into a document that Solr can recognize (by a module called an *Extractor*) is Erlang, and so is the code which keeps the Riak objects and Solr indexes in sync through faults (via AAE), as well as all of the interfaces, security, stats, and query distribution. But since Solr is Java, we have to manage the JVM.

If you don't have much experience running Java code, let me distill most problems for you: you need more memory. Solr is a memory hog, easily requiring a minimum of 2 GiB of RAM dedicated only to the Solr service itself. This is in addition to the 4 GiB of RAM minimum that Basho recommends per node. So, according to math, you need a minimum of 6 GiB of RAM to run Riak Search. But we're not quite through yet.

The most important settings in Riak Search are the JVM options. These options are passed into the JVM command-line when the Solr service is started, and most of the options chosen are excellent defaults. I recommend not getting too hung up on tweaking those, with one notable exception.

```bash
## The options to pass to the Solr JVM. Non-standard options,
## i.e. -XX, may not be portable across JVM implementations.
## E.g. -XX:+UseCompressedStrings
##
## Default: -d64 -Xms1g -Xmx1g -XX:+UseStringCache -XX:+UseCompressedOops
##
## Acceptable values:
##   - text
search.solr.jvm_options = -d64 -Xms1g -Xmx1g -XX:+UseStringCache -XX:+UseCompressedOops
```

In the default setting, Riak gives 1 GiB of RAM to the Solr JVM heap. This is fine for small clusters with small, lightly used indexes. You may want to bump those heap values up---the two args of note are: `-Xms1g` (minimum size 1 gigabyte) and `-Xmx1g` (maximum size 1 gigabyte). Push those to 2 or 4 (or even higher) and you should be fine.

In the interest of completeness, Riak also communicates with Solr internally through a port, which you can configure (along with an optional JMX port). You should never need to connect to this port yourself.

```bash
## The port number which Solr binds to.
## NOTE: Binds on every interface.
##
## Default: 8093
##
## Acceptable values:
##   - an integer
search.solr.port = 8093

## The port number which Solr JMX binds to.
## NOTE: Binds on every interface.
##
## Default: 8985
##
## Acceptable values:
##   - an integer
search.solr.jmx_port = 8985
```

There's generally no great reason to alter these defaults, but they're there if you need them.

I should also note that, thanks to fancy bucket types, you can associate a bucket type with a search index. You associate buckets (or types) with indexes by adding a `search_index` property with the name of a Solr index. Like so, assuming that you've created a Solr index named `my_index`:

```bash
$ riak-admin bucket-type create indexed '{"props":{"search_index":"my_index"}}'
$ riak-admin bucket-type activate indexed
```

Now, any object that a developer writes under that bucket type will be indexed.

There's a lot more to search than we can possibly cover here without making it a book in its own right. You may want to check out the following documentation in docs.basho.com for more details.
+ +* [Riak Search Settings](http://docs.basho.com/riak/latest/ops/advanced/configs/search/) +* [Using Search](http://docs.basho.com/riak/latest/dev/using/search/) +* [Search Details](http://docs.basho.com/riak/latest/dev/advanced/search/) +* [Search Schema](http://docs.basho.com/riak/latest/dev/advanced/search-schema/) +* [Upgrading Search from 1.x to 2.x](http://docs.basho.com/riak/latest/ops/advanced/upgrading-search-2/) + +
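For reference, creating the `my_index` index assumed above is itself a single HTTP call (it picks up the default schema unless you specify one), after which the bucket type association shown earlier will work:

```bash
curl -XPUT "$RIAK/search/index/my_index"
```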

<h3>Security</h3>

Riak has lived quite well in the first five years of its life without security. So why did Basho add it now? With the kind of security you get through a firewall, you can only get coarse-grained access control. Someone can either access the system or not, with a few restrictions, depending on how cleverly you write your firewall rules.

With the addition of Security, Riak now supports authentication (identifying a user) and authorization (restricting user access to a subset of commands) of users and groups. Access can also be restricted to a known set of sources. The security design was inspired by the full-featured rules in PostgreSQL.

Before you decide to enable security, you should consider this checklist in advance.

1. If you use security, you must upgrade to Riak Search 2.0. The old Search will not work (neither will the deprecated Link Walking). Check any Erlang MapReduce code for invocations of Riak modules other than `riak_kv_mapreduce`. Enabling security will prevent those from succeeding unless those modules are available via `add_paths`.
2. Make sure that your application code is using the most recent drivers.
3. Define users and (optionally) groups, and their sources.
4. Grant the necessary permissions to each user/group.

With that out of the way, you can `enable` security with a command-line option (you can `disable` security as well). You can optionally check the `status` of security at any time.

```bash
$ riak-admin security enable
$ riak-admin security status
Enabled
```

Adding users is as easy as the `add-user` command. A username is required, and can be followed by any key/value pairs. `password` and `groups` are special cases, but everything is free form. You can alter existing users as well. Users can belong to any number of groups, and inherit a union of all group settings.

```bash
$ riak-admin security add-group mascots type=mascot
$ riak-admin security add-user bashoman password=Test1234
$ riak-admin security alter-user bashoman groups=mascots
```

You can see the list of all users via `print-users`, or all groups via `print-groups`.

```bash
$ riak-admin security print-users
+----------+----------+----------------------+---------------------+
| username |  groups  |       password       |       options       |
+----------+----------+----------------------+---------------------+
| bashoman | mascots  |983e8ae1421574b8733824| [{"type","mascot"}] |
+----------+----------+----------------------+---------------------+
```

Creating users and groups is nice and all, but the real reason for doing this is so we can distinguish authorization between different users and groups. You `grant` or `revoke` permissions for users and groups by way of the command line, of course. You can grant/revoke a permission on anything (`any`), on a certain bucket type, or on a specific bucket.

```bash
$ riak-admin security grant riak_kv.get on any to all
$ riak-admin security grant riak_kv.delete on any to admin
$ riak-admin security grant search.query on index people to bashoman
$ riak-admin security revoke riak_kv.delete on any to bad_admin
```

There are many kinds of permissions, one for every major operation or set of operations in Riak. It's worth noting that you can't add search permissions without search enabled.
* __riak\_kv.get__ --- Retrieve objects
* __riak\_kv.put__ --- Create or update objects
* __riak\_kv.delete__ --- Delete objects
* __riak\_kv.index__ --- Index objects using secondary indexes (2i)
* __riak\_kv.list\_keys__ --- List all of the keys in a bucket
* __riak\_kv.list\_buckets__ --- List all buckets
* __riak\_kv.mapreduce__ --- Run MapReduce jobs
* __riak\_core.get\_bucket__ --- Retrieve the props associated with a bucket
* __riak\_core.set\_bucket__ --- Modify the props associated with a bucket
* __riak\_core.get\_bucket\_type__ --- Retrieve the set of props associated with a bucket type
* __riak\_core.set\_bucket\_type__ --- Modify the set of props associated with a bucket type
* __search.admin__ --- The ability to perform search admin-related tasks, like creating and deleting indexes
* __search.query__ --- The ability to query an index

Finally, with our group and user created, and given access to a subset of permissions, we have one more major item to deal with. We want to be able to filter connections from specific sources.

```bash
$ riak-admin security add-source all|<users> <CIDR> <source> [