Revision History
Revision 0.5.4 | 2015/04/05
This document is under re-construction - beware!
This document’s goal is to describe the software architecture of the Hibari key-value database, discuss parts of its implementation, and to document the Hibari client APIs.
At a minimum, application developers need to know how to use the various Hibari client APIs. Hibari is a key-value database, and key-value databases almost by definition have small APIs. Learning the basics isn’t too difficult. To be really effective, however, application developers also need to have a good overall understanding of how Hibari works.
For developers interested in working on Hibari itself, the source code is the ultimate documentation. But like most software developed in an industrial setting, Hibari grew at times quite quickly and at times sat on the shelf, waiting for customer demand. Anyone who works with the software, including Hibari’s original developers, needs documentation to understand not only how something works but also why certain choices were made … and what parts need more work.
Discussion of Hibari’s implementation
To avoid a lot of cut-and-paste text, many of the operational details that a developer should know or must know about Hibari are found in the Hibari System Administrator’s Guide. This document assumes that a developer has already skimmed the System Administrator’s Guide and is willing to jump over to that guide when necessary.
Copyright © 2005-2014 Hibari developers. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
It would be wonderful to say that Hibari sprang from someone’s forehead, fully formed and adult, like the goddess Athena’s birth from the forehead of Zeus. Software development is usually a bit more organic and unplanned than that. Hibari is no exception.
Once upon a time, in a galaxy far, far away… Hibari started as a small, focused replacement for Mnesia, the database bundled with Erlang/OTP. Cloudian, Inc. (formerly Gemini Mobile Technologies) was bidding on a project that required an extremely high throughput database, with high availability and data durability guarantees for a workload with a very low read/write ratio (i.e. very write-intensive). The amount of money dedicated to hardware was fixed. Mnesia could do the job very well, except for the throughput. Cloudian needed something both faster and simpler. The skeleton of Hibari was written in haste, in case Cloudian got the contract.
Fortunately, Cloudian lost the bid for the contract. Hibari sat on the shelf for a while, then was picked up and developed as a main memory database, like Mnesia. Then requirements changed. Then a project was canceled, and Hibari was set aside. Then it was picked up and set aside again. Each time, requirements changed.
Hindsight is perfect. This section attempts to give the reader, namely developers who are maintaining Hibari or adding new features, some background on why the code is structured the way it is. The APIs for some modules are straightforward to use, and others are not so clear. Some modules were written together in a brief period of time, and others evolved slowly over several years.
As described in the Hibari Sysadmin Guide, "Bricks outside of chain replication" section, a Hibari logical brick can be used with or without chain replication. When used without chain replication, each brick is a standalone data storage entity. Any replication of data across logical bricks must be done by the client, typically by a “quorum replication” technique.
The source modules that implement the logical brick are divided into two groups:
The following modules maintain the write-ahead logs on disk. See the Hibari Sysadmin Guide, "Write-Ahead Logs" section for a description of the two types of write-ahead log and how they interact with each other.
gmt_hlog.erl
gmt_hlog_common.erl
gmt_hlog_local.erl
The following modules handle Hibari client requests and manage the in-core binary trees used for key management:
brick_ets.erl
brick_server.erl
The chain replication algorithm is implemented in brick_server.erl. That module also contains both server and client code, which helps explain why it is the largest source module in the Hibari application.
The consistent hashing algorithm is implemented in the brick_simple.erl module. The code in this module has two roles:
The brick_simple registered process that runs on each brick node. This server receives updates from the Admin Server when there are changes to chain membership.
Hibari clients fall into two categories: those that use consistent hashing and those that do not.
The brick_simple.erl module implements the client API that uses consistent hashing.
The brick_server.erl module implements the low-level client API that is not aware of consistent hashing.
The brick_squorum.erl module is a partial implementation of a “quorum replication” method for managing data consistency across multiple logical bricks. This module is used only by the Admin Server and is tailored to the Admin Server’s use. It should not be used by other quorum replication-based applications.
See Hibari Sysadmin Guide, "The Admin Server Application" section for a description of the various services provided by the Admin Server application.
brick_admin.erl provides the major external API to most of the Admin Server’s functions as well as basic table management functions.
brick_bp.erl implements the “brick pinger” processes. Each pinger process is responsible for monitoring the health of a single Hibari logical brick.
brick_chainmon.erl implements the “chain monitor” processes. Each chain monitor is responsible for monitoring the status of a single Hibari chain and for reconfiguring the chain safely as its member bricks crash and restart.
brick_migmon.erl implements the server process that is responsible for monitoring data migrations that take place whenever chains are added, deleted, or reweighted.
brick_sb.erl implements the “scoreboard” process, which provides historical data about each brick and chain state transition.
The section called “Admin Server modules” indirectly outlines many of the processes that, when grouped together, form the Hibari Admin Server application:
Acting together, these processes maintain data consistency for all bricks in a Hibari cluster. However, individual processes can crash and restart at unpredictable times. The Admin Server must be able to recover correctly from a failure of any number of its helper processes, regardless of timing.
The Admin Server implementation was written for correctness first and for speed/efficiency only when necessary. It has been used in production environments with a rough total of 2,000 logical brick pinger and chain monitor processes. At this scale, the implementation shows signs of stress under the worst-case scenario of "restart the Admin Server and all logical bricks on all physical bricks simultaneously", but it works nonetheless.
The architecture decision to have one Admin Server process, one scoreboard process, one pinger process per brick, and one health monitor per chain is quite intentional. If only one process at a time performs a given task, then there cannot be race conditions.
The modules in this section appear in alphabetical order. For an overview of their use by functional category, see Section 2.1, “Major subsystems”.
This module provides the Erlang/OTP "application" behavior for the Hibari application. Such modules are usually quite small. The reason why brick.erl doesn’t fit the small pattern is that it has some extra logic to shut down logical bricks in a particular order when application shutdown has been requested.
Please see Hibari Sysadmin Guide, "The Admin Server Application" section for a description of the Admin Server’s various functions.
The Admin Server has a "schema", though perhaps that’s a poor choice of name. The schema defines:
The {TableName, Key} → chain mapping
The schema, together with the “scoreboard” operational history, is stored in the “bootstrap bricks”. See Hibari Sysadmin Guide, "Admin Server’s Private State: the Bootstrap Bricks" section and Hibari Sysadmin Guide, "Bricks outside of chain replication" section for more details.
Functions for creating a new schema and defining the names of the bootstrap bricks that will store the schema are in this module. So are assorted functions for querying the schema, such as brick_admin:get_tables/0 and brick_admin:get_table_chain_list/2. Changes to chain length and chain addition/deletion/reweighting are also available here.
When a chain health monitor process makes a major state transition, it will notify the Admin Server of the change. The Admin Server will then broadcast, or "spam", status notifications to all server and client nodes. The data structure spammed is a #g_hash_r record, or simply a "global hash record". This record is maintained by the Admin Server per table. Each table has its own unique global hash record.
Among other things, the global hash contains the bricks and their roles in each chain. In this manner, a gdss client can know which chain a key belongs to and which brick in that chain is acting as head or tail, and thus can talk directly to the correct brick for a given operation.
The Admin Server increments a minor revision number in the global hash for each update. Bricks and clients can then compare this revision number to what they already have to ensure that they don’t revert to using an older global hash.
The brick_admin:fast_sync() family of functions is an attempt to help Hibari cluster administrators perform bulk copies of data from in-service bricks to bricks that have zero data. For example, consider a physical brick that has crashed due to data loss caused by a hard disk failure. The crashed brick can be put back into service after fixing the disk problem, but the brick’s data has been lost. Chain replication can create new replicas of the lost data, but the chain replication implementation can create a lot of disk I/O on the upstream brick for long periods of time, e.g. several days for terabytes of data. The fast_sync() function attempts to minimize the amount of random disk I/O by copying keys & values in file+offset sorted order.
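The core idea can be shown with a small sketch: given a list of keys and their on-disk locations, sort the work by {FileNumber, Offset} rather than by key, so that the copy reads the log sequence files mostly sequentially. The KeyLocations variable and its list shape are assumptions for illustration only; they are not the actual fast_sync() data structures.

%% KeyLocations :: [{Key :: binary(), {FileNumber :: integer(), Offset :: integer()}}]
sort_by_disk_location(KeyLocations) ->
    %% Erlang's default term ordering on {FileNumber, Offset} tuples yields
    %% exactly the file+offset order described above.
    lists:sort(fun({_K1, Loc1}, {_K2, Loc2}) -> Loc1 =< Loc2 end, KeyLocations).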
Global hash updates are spammed to the brick_simple servers on all nodes.
The Admin Server registers brick_admin_event_h.erl as an event handler with the partition detector application. Its handle_event() function takes action for two kinds of events, including events related to the net_kernel services.
This is the supervisor for Admin Server-related processes. If the Admin Server application is not running, this supervisor will have no children to monitor.
A careful reader will notice that the preceding paragraph contains a contradiction: the Admin Server’s processes are supervised by another application’s supervisor.
The simple answer is: the Admin Server is not a 100% OTP-compliant application.
The complicated answer is: it is complicated. There’s so much day-to-day developer activity that relies on the Admin Server that it’s really inconvenient to package the Admin Server as a 100% OTP-compliant application. It’s much, much more convenient to have its source code and its processes mixed in with the rest of the Hibari server code and processes.
So, the Admin Server application is OTP-compliant enough to be managed by the OTP application controller. But it is conveniently managed, source-wise and process-wise, within the Hibari/gdss application as a whole.
This module implements the “brick pinger” server: a "gen_fsm" process that is responsible for polling the health of a single logical brick. The Admin Server starts a "brick pinger" process for each logical brick in each chain in every table. See the Hibari Sysadmin Guide, "Brick Lifecycle Finite State Machine" section for a description of the state machine implemented by this module.
A 1-second periodic timer is used to check the status of the brick that this FSM monitors. Any major changes in status will be sent as a proplist to the "scoreboard" proc (see the section called “brick_chainmon.erl”).
Health polling is based on two attributes: brick repair state and brick repair time. Due to the polling nature of this interface (a weakness, see the "NOTE" section above), it’s possible that a logical brick in a certain repair state X could crash, restart, and reenter state X before the "pinger" polls again. The logical brick’s start time is checked to make certain that the "pinger" is talking to the same PID over time.
The brick_monitor_simple() function creates a monitor to the remote logical brick and then waits for a {'DOWN', ...} message from that monitor.
This is the direct supervisor for all logical bricks that run on this node. The brick_shepherd.erl module provides the interface used to request that this supervisor start & stop a logical brick.
This module implements the “chain monitor” server: a "gen_fsm" process that is responsible for monitoring the health of all logical bricks within a single chain. The Admin Server starts a "chain monitor" for each chain in every table. See the Hibari Sysadmin Guide, "Chain Lifecycle Finite State Machine" section for a description of the state machine implemented by this module.
See the section called “brick_bp.erl” for an overview of the polling method used by the brick health "pinger" processes, the "chain monitor" processes, and the "scoreboard", and for that method’s known limitations.
In addition to monitoring health, the chain monitor proc is responsible for taking actions required to repair the chain.
Chain member status is tricky to calculate correctly. Any chain monitor process may crash at any time (e.g. due to bugs), or the entire machine hosting the monitor may crash. When the monitor has restarted, it doesn’t know how many times chain members may have changed state.
Each monitor queries the "scoreboard" to get the current status of each chain member. (Remember: the scoreboard’s info may be slightly out-of-date!) Each current status is compared with the monitor’s in-memory history of the status from the last check. (All bricks start in the unknown state.) If there’s a difference, then suitable action is taken.
Any change in brick or chain status is also reported to the scoreboard.
Each brick’s scoreboard status is converted to an internal status:
unknown
disk_error
pre_init
repairing
repair_overload
ok
There are times when the chain monitor detects a chain that has zero running bricks. It must then examine the operational history of all bricks in the chain and determine which is the best brick to start first. This must be the last brick to crash. The chain monitor will wait forever for that "best brick" to start. If it is impossible to start (for example, the machine was destroyed by fire, or data was lost due to a disk failure), then a human may use the force_best_first_brick() function to give the chain monitor permission to start another brick as the chain’s first brick.
There are times when a chain monitor restarts and discovers that some or all bricks in the chain are running. However, the new chain monitor cannot know exactly what roles each brick was in without polling each one … and because a chain transition may have been interrupted by a chain monitor crash, it is quite tricky to make correct decisions about what running bricks are OK and which ones are not.
The uncertainty of each brick’s exact status could be addressed by more logging of intermediate states to the Admin Server’s private state (i.e. stored in the bootstrap_copy* bricks). But each write to that private state has an overhead that, when multiplied by thousands of bricks and hundreds of chains, is quite significant.
When a chain monitor starts, it attempts to calculate the chain’s status. If the chain was healthy before the monitor crash, the chain will be deemed healthy. If the chain was degraded, then it’s likely that the chain will be "whittled down" to a single brick and then reconstructed using the chain repair protocol.
The process_brickstatus_diffs() function is a long, long bear of a function. Refactoring it would be useful, even necessary if the polling-based mechanism were changed (as suggested elsewhere). But it encodes a lot of hard-won knowledge about how to maintain data consistency in the face of really weird, hard-to-find and hard-to-fix bugs, knowledge accumulated over more than two years of testing and production use.
Read the code and the comments before embarking on any refactoring journey.
This module implements the callbacks for the OTP application cluster_info, which is bundled with Hibari. Functions such as cluster_info:dump_all_connected/1 can be used to write a huge amount of diagnostic information about the cluster.
At startup time, both the gdss and gdss_client applications register a callback with the cluster_info application.
This is the start/stop module for the gdss_client application. If the gdss application is not running on a node where a Hibari client application runs, then the lighter-weight gdss_client application must be running there.
This process implements a "gen_server" that monitor the condition of
each client node that is monitored by the Admin Server. See the
brick_admin:run_client_monitor_procs()
function for the definition
of the funs used when the client monitor sense that a client node has
stopped or started.
This is the gdss application supervisor that is responsible for supervising all processes related to logical bricks, including the registered server behind the brick_simple.erl API.
In an ideal world, brick_ets.erl would be a flexible plug-in module implementing the data store for a Hibari logical brick. With regard to the management of in-memory data structures (i.e. ETS ordered_set tables, which are implemented as balanced binary trees), this module is mostly self-contained. For disk-based persistence, it uses the write-ahead log modules (gmt_hlog.erl, gmt_hlog_common.erl, and gmt_hlog_local.erl) to handle disk I/O-related activity.
In the real world, brick_ets.erl is a mongrel, a mix of various tasks: some memory related, some disk related, and some other stuff.
A "pluggable storage system" was not part of Hibari’s original design.
The original design called for two kinds of bricks, with two separate
implementations: one RAM-based and one disk-based. The RAM-based one
was written first: it was the original brick_ets.erl
module, to be
used as part of a quorum-style replication system. The disk-based
module was delayed.
Then Hibari development stopped. Then it restarted, but this time it needed to meet larger-than-RAM storage requirements and to use chain replication (instead of quorum replication). The decision was made to split the brick_ets.erl module into several pieces:
brick_ets.erl would maintain the RAM-based data structures.
gmt_hlog.erl would maintain the disk-based write-ahead log.
brick_server.erl would maintain the chain replication & repair logic.
A big legacy of the original, everything-in-brick_ets.erl implementation is the set of functions with names prefixed by "bcb_" ("BCB" = "Brick CallBack"). These functions are required by brick_server.erl for various purposes that also need access to the ETS tables managed by brick_ets.erl.
Originally, the principal process for a logical brick was a "gen_server" behavior process that was implemented by brick_ets.erl. When the brick_ets.erl module was split apart, the choice was made to do the following:
Use brick_server.erl as the implementation module for the main logical brick process, with its own #state record which is completely independent of the brick_ets.erl #state record.
Rather than run brick_ets.erl as a separate process, have the brick_server.erl code manage the behavior callbacks and #state of brick_ets.erl.
This choice complicates the handling of the two #state records; see the section called “"gen_server" nested inside a "gen_server", Matroshka-style”.
The #state record used by brick_ets.erl has members in several major categories; examples include:
do_logging and bigdata_dir
n_add and syncsum_count
logging_op_serial and log
check_pid
ctab and shadowtab
dirty_tab and wait_on_dirty_q
#state.ctab, the contents table. Except for changes made during a checkpoint, all data about a key lives in this table as a "store tuple" (see below).
#state.dirty_tab, the dirty table. If a key has been updated but not yet flushed to disk, the key appears here. Necessary for any update, inside or outside of a micro-transaction, where race conditions are possible. See the section called “The dirty keys table”.
#state.etab, the expiry table. If a key has a non-zero expiry time associated with it (an integer in UNIX time_t form), then the expiry time appears in this table.
#state.mdtab, the brick private metadata table. Used for private state management during data migration and other tasks.
#state.shadowtab, the shadow table. During checkpoints, the #state.ctab table is frozen while the checkpoint process dumps its contents. All updates made while the checkpoint is running (insert or delete) are stored in this table. When the checkpoint is finished, the contents of this table are applied to the contents table, and then the shadow table is deleted.
A "store tuple" is the internal representation of a key’s metadata.
It uses a variable-sized tuple to try to save some memory, avoiding
storing common values. This is the tuple that is stored in the
#state.ctab
table.
The store tuple’s six possible elements are the key, the timestamp, the value, the value length, the expiration time, and the flags; shorter variants of the tuple omit elements whose values are the common defaults (such as a zero expiration time or an empty flag list).
Types used:
Key = binary()
TStamp = integer()
Value = binary() | {integer(), integer()}
Value = binary() when the value blob is stored in RAM
Value = {FileNumber::integer(), Offset::integer()} when the value blob is stored on disk, where FileNumber and Offset give the starting location for the write-ahead log "hunk" that stores the actual value blob
ValueLen = integer()
ExpTime = integer()
Flags = list()
The time required for full initialization of a logical brick is not predictable. We cannot know in advance how much metadata must be read from the brick’s private write-ahead log, nor do we know how much time it will take to read that data.
The OTP "supervisor" behavior places a limit on how long a worker process’s init() function can take, and while a supervisor is starting a worker process, it is blocked from starting/restarting/stopping other workers. Therefore, it’s very important that the logical brick initialization function execute in a short amount of time.
The second-to-last statement in brick_ets:init/1 is this:
self() ! do_init_second_half,
Then the handle_info/3 callback that receives this message can take as much time as is necessary to read & process the updates in the private write-ahead log.
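Here is a minimal, self-contained sketch of that pattern. The module, state values, and the slow_second_half_init/0 helper are illustrative stand-ins, not the actual brick_ets code; only the self() ! do_init_second_half trick is taken from the text above.

-module(deferred_init_sketch).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% Return immediately so the supervisor is not blocked ...
    self() ! do_init_second_half,
    {ok, not_yet_initialized}.

handle_info(do_init_second_half, not_yet_initialized) ->
    %% ... and do the slow work (e.g. scanning a private write-ahead log) here.
    {noreply, slow_second_half_init()};
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

slow_second_half_init() ->
    timer:sleep(1000),    % stand-in for reading & replaying the write-ahead log
    ready.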
Here are some examples of why the dirty keys table, #state.dirty_tab, is used (not an exhaustive list):
To guarantee that only one of several racing add operations succeeds. If key K does not exist, and if multiple clients race to add(Table, K, Value), then only one client should succeed.
To enforce the {testset,CurrentTimeStamp} flag.
To enforce conditional operations (e.g. replace) and/or exclusive flags (e.g. {testset,CurrentTimeStamp} and key_must_exist).
A Hibari micro-transaction is designed to avoid holding locks by forcing the client to send the entire micro-transaction in a single message. The server brick’s gen_server should immediately be able to enforce the above properties, correct? Yes and no, unfortunately.
Because each logical brick is implemented as a single "gen_server" process (we ignore the helper processes enumerated in the section called “Processes created by brick_ets.erl”), all messages processed by the brick are automatically serialized. That serialization property makes it much easier to implement immediate commit/abort decisions. However, there’s a small problem: disk I/O is slow. Here is one example of a race condition that is caused by slow disk I/O:
Key K does not exist.
Client 1 sends add(Table, K, Value1) to brick B.
Brick B processes the add op. The key K does not exist, so the operation is permitted, and the update is appended to the write-ahead log (but is not yet flushed to disk).
Client 2 sends add(Table, K, Value2) to brick B.
Brick B processes the second add op. The key K still does not appear to exist, so the operation is permitted. Although key K was added by client 1’s operation, that operation’s write-ahead log entry has not yet been flushed safely to disk, so Brick B cannot yet guarantee that the key exists.
Brick B replies ok to client 1.
Brick B replies ok to client 2. This reply violates the principle of strong consistency and is therefore incorrect.
Any key that is updated by an operation that is waiting for its write-ahead log entry to be flushed to disk will have an entry in the #state.dirty_tab ETS table. When the fsync(2) system call is finished, the key will be removed from #state.dirty_tab, and the #state.ctab (or the "shadow table", if a checkpoint is in progress) will be updated to make the key’s update visible.
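A tiny sketch of the existence check implied above: before an add can be judged, the brick must consult both the contents table and the dirty table. The function and argument names are illustrative; the real code works on the #state record fields and, when the key is dirty, parks the new operation (presumably on #state.wait_on_dirty_q) instead of answering immediately.

%% Returns true if Key is either visible in the contents table or is
%% waiting in the dirty table for its write-ahead log entry to be flushed.
key_exists_or_is_dirty(Key, CTab, DirtyTab) ->
    ets:member(CTab, Key) orelse ets:member(DirtyTab, Key).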
If a Hibari client calls do when the first op in the DoList is the atom txn, then the DoList will be evaluated as a micro-transaction. The check is done by do_do2/3, with micro-transaction preconditions checked by do_txnlist/3.
If do_txnlist/3 detects that a micro-transaction precondition has been violated, e.g. an add of a key that already exists, then an error accumulator is built to inform the client of which items in the DoList failed. Note that the txn op is removed from the DoList before do_txnlist/3 starts.
If do_txnlist/3 finds no errors in the micro-transaction, then the DoList is executed by the same function that processes non-micro-transaction lists, do_dolist/4.
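A minimal sketch of issuing a micro-transaction from a client. The atom txn as the first DoList element comes from the text above, and the 4-argument make_replace/4 form follows the SSF example later in this document; the exact shape of the failure return is an assumption.

DoList = [txn,
          brick_server:make_replace(<<"key1">>, <<"new val 1">>, 0, []),
          brick_server:make_replace(<<"key2">>, <<"new val 2">>, 0, [])],
case brick_simple:do(tab1, DoList) of
    [ok, ok] ->
        %% Both replaces were applied atomically.
        ok;
    TxnFailure ->
        %% A precondition was violated; the error accumulator names the
        %% DoList items that caused the failure (shape assumed here).
        {error, TxnFailure}
end.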
The management of the write-ahead log while maintaining strong consistency was a more difficult problem than I had first realized. To preserve strong consistency, the order of all updates must be preserved when writing log entries to the write-ahead log. Updates come from two sources: operations sent by Hibari clients, and chain replication messages handled by the brick_server.erl module (see the section called “brick_server.erl”).
The brick maintains a monotonically increasing counter, #state.logging_op_serial, to assign a serial number to each update. Each update is written in increasing serial number order. After an update is written, the brick will request an fsync(2) system call on the log. The write-ahead log manager will initiate the call (if no fsync(2) call is currently in progress) or queue the request for a later time (because an fsync(2) system call is already in progress). Because the brick does not know when the fsync(2) system call will finish, the brick stores the operation and its serial number in a queue called #state.logging_op_q.
The write-ahead log manager will notify the brick when an fsync(2) system call is finished, telling the brick the largest flushed serial number N. The brick will then remove from the #state.logging_op_q all pending requests that have serial numbers less than or equal to N. Processing of those pending requests is then resumed.
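A simplified sketch of that bookkeeping. The field names logging_op_serial and logging_op_q come from the text; the record definition and the request_fsync/1 and resume_op/1 helpers below are stand-ins for illustration.

-record(state, {logging_op_serial = 1,
                logging_op_q      = queue:new()}).

%% Assign the next serial number to an update, ask the write-ahead log
%% manager for an fsync, and park the op until the fsync completes.
queue_logging_op(Op, #state{logging_op_serial = Serial, logging_op_q = Q} = S) ->
    ok = request_fsync(Serial),
    S#state{logging_op_serial = Serial + 1,
            logging_op_q      = queue:in({Serial, Op}, Q)}.

%% Called when the log manager reports that all entries up to serial N are
%% on disk: resume every queued op with a serial number =< N.
fsync_finished(N, #state{logging_op_q = Q} = S) ->
    {Ready, Rest} = lists:splitwith(fun({Serial, _Op}) -> Serial =< N end,
                                    queue:to_list(Q)),
    lists:foreach(fun({_Serial, Op}) -> resume_op(Op) end, Ready),
    S#state{logging_op_q = queue:from_list(Rest)}.

request_fsync(_Serial) -> ok.   % stand-in for the write-ahead log manager call
resume_op(_Op) -> ok.           % stand-in for finishing the op and replying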
As described above, the "syncpid"'s job is pretty simple:
fsync(2)
call. (Each request is tagged
with a log sequence number.)
fsync(2)
call via the gmt_hlog_local.erl
API.
fsync(2)
call.
The tricky part is step #2, specifically: when should "now and then" be? There are a couple of easy answers to the question:
Initiate an fsync(2) call whenever a single request arrives in step #1. Block all other fsync(2) requests until this one finishes.
Collect requests in step #1 for a fixed amount of time, e.g. 100 milliseconds, then start fsync(2).
The current implementation, in collect_sync_requests/3, uses a variable amount of time in step #1 by waiting a maximum of 5 milliseconds after the last fsync request before moving to step #2. The method is virtuous by being simple and by being "good enough" for both very low and very high load conditions.
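The policy boils down to a small receive loop. The real collect_sync_requests/3 takes more arguments; this stripped-down sketch only shows the "wait up to 5 milliseconds after the most recent request" behavior.

%% Gather fsync requests until 5 ms pass with no new request, then return
%% the batch (oldest first) so the caller can issue a single fsync(2).
collect_sync_requests(Acc) ->
    receive
        {sync_request, From, LogSerial} ->
            collect_sync_requests([{From, LogSerial} | Acc])
    after 5 ->
        lists:reverse(Acc)
    end.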
As explained in the section called “The "store tuple"”, when a value blob is stored on disk, its store tuple representation is {FileNumber::integer(), Offset::integer()}. These two integers are used to find the value blob’s storage location on disk. See the section called “gmt_hlog.erl” for API details.
A brick’s behavior for value storage is defined by the value of #state.bigdata_dir:
If it is undefined, then values are stored in RAM, i.e. as an Erlang binary within the store tuple.
If it is not undefined, then values are stored on disk.
When a key is set by a set/add/replace operation in a table that stores value blobs on disk, there are actually two hunks written to the brick’s write-ahead log:
The hunk containing the value blob itself.
The hunk containing the key’s metadata, including the {FileNumber,Offset} tuple for the location of the value blob hunk stored in step #1.
To retrieve a hunk, the gmt_hlog API is used, passing the FileNumber and Offset as arguments. The library function then:
Converts FileNumber to a full file path for the log sequence file.
Opens the file and seeks to Offset, where the hunk header begins.
Reads the hunk blob, which immediately follows the hunk header.
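The retrieval path can be sketched roughly as follows. The file-naming convention and the idea that the caller already knows the header and blob lengths are assumptions made only to keep the sketch short; in the real system this is all hidden behind the gmt_hlog API.

%% FileNumber and Offset come from the store tuple; HeaderLen and BlobLen
%% are assumed to be known to the caller (in reality they come from the
%% hunk header itself).
read_value_blob(LogDir, FileNumber, Offset, HeaderLen, BlobLen) ->
    Path = filename:join(LogDir, integer_to_list(FileNumber) ++ ".HLOG"),
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try
        %% The blob immediately follows the hunk header at Offset.
        {ok, Blob} = file:pread(Fd, Offset + HeaderLen, BlobLen),
        Blob
    after
        ok = file:close(Fd)
    end.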
The serial nature of message handling by the "gen_server" behavior is almost always a good thing. However, dealing with disk I/O is one of the few times when it would be really nice to have "gen_server" handle multiple messages in parallel.
It’s certainly possible to have "gen_server" handle multiple messages in parallel, but it’s the developer’s responsibility to juggle the asynchronous replies to clients. This can get very tricky very quickly. However, a logical brick must do this kind of juggling to minimize latency across all Hibari client requests.
The Erlang virtual machine does not expose an API to the OS’s mmap(2) and mincore(2) system calls, so it is impossible for a logical brick to know which parts of a file are in the page cache and which are not. Without that knowledge, the brick cannot predict how long it will take to open or read a log sequence file to retrieve a value blob.
To make the unpredictable disk I/O pattern into something almost 100% predictable, Hibari bricks borrow a trick from the Squid HTTP caching proxy server and the Flash HTTP server. To avoid having computation threads blocked by disk I/O, both servers use a pool of OS processes or Pthreads whose sole job is to perform disk I/O.
The logical brick uses the same basic strategy, which I’ve called "squid/flash priming" or simply "priming", as in "priming a pump". A brick will spawn a short-lived Erlang process to read the log hunk, notify the main gen_server that the I/O is finished, and then the main gen_server process can open & read the file with virtual certainty that it will not be blocked by disk I/O.
Primer processes are used when reading data for a Hibari client get or get_many request as well as when reading value blobs during brick repair operations. Each use has a separate throttle configuration attribute.
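A sketch of a primer process: it opens the log sequence file, reads the region containing the hunk (pulling those pages into the OS page cache as a side effect), and then tells the brick's gen_server that the read can be re-issued without blocking. The message shape and argument list are assumptions.

spawn_primer(BrickPid, Path, Offset, Len, Tag) ->
    spawn(fun() ->
                  {ok, Fd} = file:open(Path, [read, raw, binary]),
                  _ = file:pread(Fd, Offset, Len),  % side effect: warm the page cache
                  ok = file:close(Fd),
                  BrickPid ! {primer_done, Tag}     % the brick's re-read is now cheap
          end).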
There is a throttle mechanism, controlled by a central.conf attribute, to keep too many squid/flash primer processes from executing simultaneously.
There is plenty of opportunity for refactoring here. The current implementation has been "good and fast enough", but there’s almost certainly room for optimization, especially if very large blobs (greater than approximately 4MB) are routinely used.
There are two magic values that an Erlang client can use in a set/add/replace operation in place of the usual binary or iolist value blob:
?VALUE_REMAINS_CONSTANT, which allows a set/add/replace operation to keep the value blob the same while changing other attributes of the key, such as expiration time or flags.
?VALUE_SWITCHAROO
These are some areas containing logic that most likely should be moved elsewhere:
The implementation of the contents table and the shadow table, #state.ctab and #state.shadowtab respectively, causes some problems. The biggest problem is that when a checkpoint is in progress and the #state.shadowtab exists, the "does the key exist?" decision must consult both tables in a sane manner.
The get_many operation is most affected by the necessity to look in both tables. The get_many_shadow() function implements the tricky logic that’s required to combine the contents of both the contents and shadow tables into a consistent set of results.
By default, MD5 checksums are generated for all data written to all write-ahead logs, and those MD5 checksums are checked for all data read from write-ahead logs.
If the file "disable-md5" exists in the Hibari server data directory, then data will be written to write-ahead logs without MD5 checksums, and data read from write-ahead logs will not have MD5 checksums verified.
If the file "use-md5-bif" exists in the Hibari server data directory,
then the erlang:md5/1
function will be used to create MD5
checksums. By default, the crypto:md5/1
is used to create MD5
checksums.
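A sketch of how the two marker files could drive the choice of checksum function. The data-directory argument and the helper name are illustrative; note that crypto:md5/1 from older OTP releases corresponds to crypto:hash(md5, Bin) in current ones.

checksum_fun(DataDir) ->
    DisableMd5 = filelib:is_regular(filename:join(DataDir, "disable-md5")),
    UseMd5Bif  = filelib:is_regular(filename:join(DataDir, "use-md5-bif")),
    if DisableMd5 -> fun(_Bin) -> no_checksum end;          % checksums disabled
       UseMd5Bif  -> fun erlang:md5/1;                      % use the BIF
       true       -> fun(Bin) -> crypto:hash(md5, Bin) end  % default: crypto app
    end.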
If an MD5 checksum error is detected, the easy thing to do is to crash the brick. In practice, this approach causes some additional problems that we would rather avoid. The logic is in bigdata_dir_get_val():
Mark the sequence file as bad. The assumption is that the entire log sequence file is bad. This assumption may or may not be true, but the default is to be conservative, since we don’t know whether there are other checksum errors within the file.
The "scavenger" procedure is used to reclaim disk space in the "common log" that is no longer used by a local logical brick. It is essentially a copying garbage collector:
By default, the scavenger is run once every 24 hours at 03:00. |
Each write-ahead log is divided into log sequence files. Each log sequence file contains a sequence of "hunks". Each hunk falls into one of two categories: live (still used by a logical brick) or no longer live.
The goal of the scavenger is to reclaim disk space. The only way to reclaim disk space is to delete files. If the scavenger finds a log sequence file with 0% live hunks, that file can be deleted immediately. However, it is quite rare to find a log sequence file that has 0% live hunks. For all other log sequence files, a different strategy is used:
Read the brick_skip_live_percentage_greater_than attribute from central.conf. Call it SkipPercent.
For each log sequence file, calculate the percentage of live hunks. Call it LivePercent.
For each log sequence file where LivePercent is less than SkipPercent, copy the live hunks elsewhere so that the old log sequence file can eventually be deleted.
The "sync pid" is a helper process whose job is to make the fsync(2) OS system call. Once an fsync(2) call has finished, the "sync pid" will send a message to the logical brick’s "gen_server" process to tell it which log items (identified by log serial number) have been flushed to disk and can be sent downstream and/or to the client.
When the size of a brick’s write-ahead log exceeds the brick_check_checkpoint_max_mb configuration variable, a checkpoint process is spawned to perform the checkpoint task. See Hibari Sysadmin Guide, "Checkpoints" section for an overview of the checkpoint procedure.
The consistent hashing layer is the top-level of the layered abstraction of a Hibari storage cluster, as discussed in Hibari Sysadmin Guide, "Hibari Architecture" section.
The #hash_r record encapsulates two things: which hashing method to use and the configuration data that the method needs.
The #g_hash_r record is the "global hash record" for a table. It is the record that is "spammed" to all brick servers and clients (see the description of global hash spamming in the section called “Global Hash spamming”).
The #g_hash_r record contains the #hash_r records for the current hash configuration and the new hash configuration.
Usually, the current hash config is the same as the new hash config. However, when chains are added/removed/reweighted and a data migration takes place, the new hash configuration is used to determine which keys stay in their current chain and which need to be moved to a new chain.
In theory, all Hibari clients have access to an up-to-date copy of each table’s #g_hash_r record, via their node-local brick_simple server. In practice, due to message passing latencies, not all clients have a correct global hash 100% of the time. The Admin Server also sends #g_hash_r updates to all server bricks. If a client is using an old global hash, the servers (using the same consistent hash calculations) can forward the request to the correct brick.
The only method that should be used for new Hibari tables is the chash method. The other three, naive, var_prefix, and fixed_prefix, are deprecated and will be removed at some point. The chash method supports all three schemes and also provides migration-related features that the three deprecated schemes alone cannot.
See the EDoc entry for chash_init/3 for full details on all the valid properties that can be passed in the 3rd-argument proplist. The two properties that are mandatory are prefix_method and new_chainweights. We strongly advise that you also include the old_float_map property; the float map can be extracted from a #g_hash_r or a #hash_r record using the chash_extract_new_float_map/1 function.
Do not use the |
The chain changing example in ??? shows how to verify that the #hash_r that you’ve created will result in the key distribution across chains that you desire. See also the example of using brick_simple:chash_migration_pre_check/2 in ???.
Older versions of Hibari made substantial use of timer:send_interval/2 for sending periodic timer messages. The timer module’s implementation can be too inefficient when over 1,000 separate timer interval requests are made. The brick_itimer module provides a more CPU-efficient implementation for heavily used timer intervals, for example 1 second.
The latency jitter in delivering these shared timer messages is intentional. Strict real-time accuracy for sending these periodic messages is not required. If stricter delivery timings are required, do not use this module.
Hibari servers use asynchronous message passing in two major areas:
These asynchronous messages can arrive more quickly than a brick can handle them. Perhaps its CPU is overloaded, or perhaps disk I/O rates are so high that the hardware cannot provide adequate service times. In either case, the number of messages in a logical brick "gen_server" process mailbox can grow too large.
It is vital for good performance that a brick’s mailbox stay small. Large mailboxes can interfere with other messaging, for example, synchronous calls to the brick’s write-ahead log process. If the mailbox grows too large, the VM will spend 100% of a CPU core performing selective receive operations on the mailbox. If the mailbox continues to grow without limit, the entire VM can crash by consuming all virtual memory available to the OS.
If a brick’s mailbox gets too big, then some kind of "pressure" mechanism is required to slow down message producers. In cases of mailbox overload during brick repair, repair operations by the upstream brick (with the "official tail" role) must slow down. In normal chain operations, the head brick must slow down its rate of updates.
The throttling mechanism implemented by brick_mboxmon.erl is straightforward. Every 500 milliseconds, the mailbox size of each brick on the local node is polled via erlang:process_info/2.
If the mailbox size exceeds the brick_mbox_repair_high_water attribute in central.conf, and if the brick is under repair, then the throttle mechanism is activated. Repair is resumed after brick_mbox_repair_overload_resume_interval seconds.
If the mailbox size exceeds the brick_mbox_high_water attribute in central.conf, then the throttle mechanism is activated. When the overloaded brick’s mailbox size falls under brick_mbox_low_water, the brick is no longer considered overloaded.
The application log files on the overloaded brick, the head brick, and (in repair cases) the Admin Server will contain messages stating when and why the brick was considered overloaded and when the overload condition ended.
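The watermark logic reduces to a small amount of code. This sketch shows only the high/low-water hysteresis for a single brick process; the attribute values would come from central.conf, the 500-millisecond polling loop is omitted, and a dead brick process would need the error handling that the real monitor has.

%% Returns the new "overloaded" flag for BrickPid.
check_mailbox(BrickPid, HighWater, LowWater, Overloaded) ->
    {message_queue_len, Len} = erlang:process_info(BrickPid, message_queue_len),
    case Overloaded of
        false when Len > HighWater -> true;    % activate the throttle
        true  when Len < LowWater  -> false;   % overload condition has ended
        _                          -> Overloaded
    end.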
During times when a table’s chain configuration is changed (e.g. chains added, removed, or reweighted), a "data migration" takes place. Some keys are copied, or "migrated", from one chain to another. The brick_migmon.erl module is responsible for monitoring the overall status of this data migration period.
Data migrations are tied closely to a table’s global hash record. Each Hibari table has its own global hash record (#g_hash_r record). It is therefore possible to have multiple migrations running simultaneously for multiple tables.
However, each data migration for a global hash is assigned a "cookie", which is an opaque Erlang term. In the event that some or all processes in the Admin Server crash, this cookie is used to distinguish between resuming an in-progress migration and starting a new migration. If migrations are numbered X, X+1, X+2, etc., then migration X+1 for table Y will not be permitted to start until migration X for table Y has finished.
Earlier versions of Hibari had difficulty with brick "pinger" processes getting timeouts when trying to check a logical brick’s health. Under extremely high workloads, the first-come, first-served nature of a "gen_server"'s message handling was not sufficient.
The solution is to create a separate "pingee" process for each logical brick. As the main brick "gen_server" process makes major state transitions, those transitions are transmitted to the "pingee" process. The "pinger" processes actually communicate with the "pingee" process for health inquiries and not with the main brick process. Because the "pingee" is only used for state transition and health check messages, its mailbox is almost always empty, and its process almost always idle. This combination helps make responses to health checks much quicker.
This module implements the Admin Server’s "scoreboard" process. The scoreboard reflects the health status of each logical brick and chain.
To accelerate read-only query performance, all scoreboard status is maintained in RAM. To make the scoreboard resilient against crashes, each status change is synchronously written to the Admin Server’s private bootstrap_copy* bricks. See Section 2.2, “Admin Server notes: crash-recovery design” for more info on crash-recovery design. For more information about the Admin Server’s bootstrap bricks, see Hibari Sysadmin Guide, "Admin Server’s Private State: the Bootstrap Bricks" section and Hibari Sysadmin Guide, "Bricks outside of chain replication" section.
Each status change event that is sent to the scoreboard includes a proplist that can contain additional information about the event. At this time, there is no requirement of mandatory properties in that proplist. Though mandatory properties may be introduced later, the main purpose is merely to provide a human developer/systems administrator with some extra information about the event.
The scoreboard and its surrounding (partial) supervision tree is depicted above by "dotted" lines. The chainmon_XXX and pinger_YYY processes manually establish a link with the brick_sb process. If a "down event" for the brick_sb process is received, the chainmon_XXX and pinger_YYY processes sleep for 1 second before exiting abnormally. The brick_mon_sup supervisor will then automatically restart the chainmon_XXX and pinger_YYY processes.
During initialization of the scoreboard, the brick_sb loads the “scoreboard” operational history into RAM from the bootstrap bricks via the brick_admin server. After initialization, the brick_sb manages the state in RAM and synchronously writes the full state directly to the bootstrap bricks after receiving new chain and/or brick status reports. To help improve performance, the implementation of the brick_sb is optimized to process all pending status reports as a batch and to then save the full state to the bootstrap bricks as a single write operation.
The scoreboard’s maximum length of the list of historical events is hard-coded at 100. This should be changed to be configurable (e.g. via central.conf).
The brick_server.erl module implements the storage-agnostic aspects of a logical brick. It isn’t fully insulated from matters related to ETS and the disk write-ahead log, but it’s close. See the beginning of the section called “brick_ets.erl” for more details of how brick_ets.erl started and how brick_server.erl came along later.
See also the section called “"gen_server" nested inside a "gen_server", Matroshka-style”.
After the split of brick_ets.erl and brick_server.erl, the "gen_server" callback functions of both were preserved, even though both are used by a single process. As far as OTP is concerned, brick_server.erl is the callback module used for the main logical brick gen_server process. If a call, cast, or message isn’t handled by brick_server's callback function, then the message is handled by brick_ets's callback function.
The wrinkle in this otherwise flawless scheme is that brick_server.erl has to maintain the #state record that brick_ets.erl uses, keeping it completely separate from brick_server.erl's record of the same name.
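The dispatch can be pictured with a sketch like the one below. The record, field, and helper names are illustrative, not the real brick_server code; the point is simply that brick_server tries its own clauses first and otherwise calls brick_ets's callback with the brick_ets #state that it keeps inside its own state.

-record(state, {ets_state}).   % simplified stand-in for brick_server's real #state

handle_call(Request, From, #state{ets_state = EtsState} = State) ->
    case handled_by_brick_server(Request) of
        true ->
            handle_chain_call(Request, From, State);
        false ->
            %% Fall through to brick_ets's gen_server callback, then fold the
            %% updated brick_ets #state back into our own #state.
            case brick_ets:handle_call(Request, From, EtsState) of
                {reply, Reply, NewEtsState} ->
                    {reply, Reply, State#state{ets_state = NewEtsState}};
                {noreply, NewEtsState} ->
                    {noreply, State#state{ets_state = NewEtsState}}
            end
    end.

handled_by_brick_server({chain_role_change, _}) -> true;   % made-up request tag
handled_by_brick_server(_)                      -> false.

handle_chain_call(_Request, _From, State) ->
    {reply, ok, State}.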
See ??? for examples that use the do() client interface.
All of the basic do operations are encoded into a small number of tuples. See make_op2(), make_op5(), and make_op6(). Note that if you want to manage your own timestamps, rather than use ones based on the OS system clock, you must use the make_op6() function yourself. A more convenient API doesn’t yet exist because there’s been no need for it … but it would be easy & quick to write.
By default, the various operations within a do call’s DoList do not have any transaction semantics. If a DoList contains 5 operations, then there is almost no difference between sending that DoList to a server and sending 5 different do operations, each containing a single op. However, there are two good reasons why an application developer might want to combine those 5 ops into a single do call:
To reduce messaging overhead: one round trip to the server instead of five.
To take advantage of the brick’s guarantee that no other ops (sent by another client) can be interleaved with the execution of those 5 ops.
There are two types of flags that can be sent with a do op:
Flags bundled in the make_op* function call, which affect only that particular operation. See encode_op_flags/1 for valid op flag names.
Flags that affect all ops in the do list. See do/5 for valid DoFlags names.
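A sketch of the two flag levels. The make_get/2 helper, the witness flag, and the brick_simple:do/4 arity shown here are assumptions for illustration only; the authoritative names live in encode_op_flags/1 and in the EDoc for do/5.

%% Per-op flag: applies only to this get.
Op = brick_server:make_get(<<"some key">>, [witness]),
%% DoFlags: apply to every op in the list (here, skip the role check).
brick_simple:do(tab1, [Op], [ignore_role], 5000).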
The Admin Server maintains the table-specific settings for disk logging and log flush parameters for all bricks in the table. By default, both disk logging and log flushing are enabled. Both features are actually implemented by brick_ets.erl, but the APIs to change those parameters at runtime are handled through brick_server.erl. See set_do_logging/2 and set_do_sync.
The role management functions were designed for use by the Admin Server to manage the brick lifecycle state machine. See the comment "Chain admin & related API" in the -export statements at the top of the file.
See Hibari Sysadmin Guide, "Brick Lifecycle Finite State Machine" section for a description of a brick’s lifecycle within chain replication.
Management of the #chain_r record is tricky whenever a role is changed. There are some attributes of the record that should be reset to a default value and other attributes that must be preserved. Examples of the latter are log serial numbers used by the brick as well as serial numbers ack’ed downstream & upstream. Some role transitions must also be reflected in the brick_pingee helper process (see the section called “brick_pingee.erl”). The result looks more complex than you’d first believe is necessary, but there isn’t much excess code to remove: much of that complexity is necessary.
As described in the section called “brick_hash.erl”, the brick_hash.erl module is used for consistent hashing calculations by Hibari clients. However, because a client may be acting on old/stale data, each Hibari brick also uses brick_hash to verify that each operation it receives should be executed locally.
If a client sends a do call to the wrong brick, as calculated by the brick’s global hash, then that brick will forward the do call to what it believes is the correct brick. However, due to asynchronous message passing, scheduling latencies, network latencies, etc., the brick itself may have an old/stale version of the global hash. In such a case, the do will be forwarded to the wrong brick. However, because each brick will forward the do to where it believes the do should go, eventually the do call will arrive at its proper location. In the event that the forwarding takes too long, the client will see a timeout.
The forwarding mechanism is limited by a couple of factors:
Each time the do call is forwarded, the SentAt time (which is originally set by the client node) is reset to the current wall-clock time. This reset is done to prevent enforcement of the brick_do_op_too_old_timeout configuration attribute; see Hibari Sysadmin Guide, "NTP configuration of all Hibari server and client nodes" section.
See Hibari Sysadmin Guide, "Chain Lifecycle Finite State Machine" section for background information.
Both chain replication protocol messages and client replies are sent "downstream", i.e. they are sent to the next brick in the chain.
The reply term for a do call is piggy-backed onto the chain replication message that is sent down the chain. When a chain replication protocol message reaches a brick that has the "official tail" role, that brick sends the reply to the client.
If the do operation has the ignore_role property in its DoFlags property list, then the reply is sent directly to the client (instead of the default behavior of being sent downstream with the chain replication message).
The chain replication protocol messages are:
{ch_log_replay, UpstreamBrick, Serial, Thisdo_Mods, From, Reply}
The UpstreamBrick and Serial terms are used to verify that the message comes from the correct brick and that messages have not been sent out-of-order. The Thisdo_Mods term contains the write-ahead log terms associated with this update (i.e. insert and delete commands), and From and Reply are used to send the Reply term to the client.
{ch_serial_ack, Serial, BrickName, Node, Props}
Upstream bricks note the Serial number in these messages and purge from their in-memory buffers all log replay requests with serial numbers less than or equal to Serial: these replay requests are no longer required to recover from the failure of a middle brick.
See the section called “Log flushing and the sync pid and the logging_op_q” for background on the write-ahead log serial number’s use and management.
There is a lot of code in chain_send_downstream_iff_empty_log_q/6 and elsewhere to make certain that log events are sent downstream without violating ordering constraints. I confess it isn’t pretty, but all of the bugs that I know of have been wrung out. If there are still bugs (and I suspect but cannot prove that there is one more), the paranoid sanity checking done by the downstream/receiving brick will cause a crash if Serial numbers are received out-of-order. It isn’t a pretty way to recover from such an error, but it’s the safest reaction that I know of.
The chain repair protocol has been implemented twice. The first protocol was a brute-force, "as simple as possible" affair. The upstream brick would send a series of {ch_repair, Serial, RepairList} messages to the repairing brick. RepairList contained a list of store tuples (see the section called “The "store tuple"”) for all keys in the table. Note that these store tuples always contain the full value blob.
The first protocol was deprecated after it became clear that "as simple as possible" had too much overhead. When in-RAM storage of value blobs was the only option, then it was cheap to fetch the value blobs and send them across the network to the repairing brick. But when "bigdata" storage was introduced (see the section called “Value blob storage on disk: bigdata_dir”), then the disk I/O, network bandwidth, and total latency became far too high to be practical.
The second chain repair protocol uses two rounds of messages to avoid the I/O problems caused by the first protocol:
The following messages are exchanged:
{ch_repair_diff_round1, Serial, RepairList}
The upstream brick sends the keys and timestamps (but not the value blobs) for a range of keys in RepairList.
{ch_repair_diff_round1_ack, Serial, BrickName, Node, Unknown, Ds}
This is the downstream brick’s response to the {ch_repair_diff_round1, ...} message. The Unknown term is a list of keys that are missing from the downstream brick (completely missing or with a timestamp mismatch). The Ds term informs the upstream brick of how many keys were deleted by the downstream brick.
If the brick’s value blobs are stored on disk, then an asynchronous "priming" mechanism is used by the upstream brick to force those blobs into RAM before sending the 2nd round.
{ch_repair_diff_round2, Serial, RepairList, Ds}
The upstream brick sends a RepairList containing all store tuples (with value blobs) for all keys requested in round 1. The Ds term is not used.
{ch_repair_ack, Serial, BrickName, Node, Inserted, Deleted}
The downstream brick acknowledges round 2; Inserted and Deleted count the number of keys that were inserted into and deleted from the repairing brick, respectively.
{ch_repair_finished, Brick, Node, Checkpoint_p, NumKeys}
When repair is finished, the repairing brick moves from the repairing state to the ok state. The brick "pinger" process will notice this state transition and trigger further chain role changes.
While these repair protocol messages are exchanged, the upstream brick will continue to send {ch_log_replay,...} chain replication messages as updates occur. On the upstream brick, the {ch_repair_diff_round1, Serial, RepairList} message is created without interference from client updates; any keys not in RepairList are immediately deleted by the downstream brick before it replies with the {ch_repair_diff_round1_ack,...} message. Therefore, any possible race conditions involving client updates of keys within the RepairList range of keys are resolved correctly by the log serial mechanism, because only one of two races can happen inside the upstream brick:
The {ch_log_replay,...} message is sent before the upstream brick creates and sends the round 1 repair message. Any keys updated in the log replay message are guaranteed to be included in the round 1 repair message.
The {ch_log_replay,...} message is sent after the upstream brick creates and sends the round 1 repair message. Any keys updated in the log replay message are guaranteed not to be included in the round 1 repair message. Replay of the replay message can happen safely at any time (as long as serial number ordering is preserved).
The data migration mechanism is used to move keys from one chain to another when chains are added, removed, or reweighted. The task of moving keys while maintaining strong consistency is a delicate business. The protocol, which uses two rounds (or phases), described below is used to move keys safely between chains.
During migration, the head of each chain maintains a "sweep key pointer". This pointer moves through the keys, first to last (in lexicographic sorting order).
The sweep key advances through the head brick’s keys, advancing by a maximum of the max_keys_per_iter configuration attribute in central.conf or a maximum of 64MB of blob values (hardcoded for now in get_sweep_tuples/4). These keys are what the code calls the "sweep zone": those keys between the sweep key’s current value and the sweep key’s value from the last successful iteration of the migration protocol.
The head of the old chain sends a {ch_log_replay,...} message with a {plog_sweep, phase1_sweep_info, #sweepcheckp_r} modification inside. Each brick writes the #sweepcheckp_r record into its private brick metadata: the head brick before sending the message, all other bricks when receiving the message. The private metadata is used to recover vital state in case the head brick crashes. NOTE: this migration sweep metadata is called a sweep checkpoint and is not related to a brick key checkpoint.
{sweep_phase1_done, LastKey}
Sent back to the head brick after the {plog_sweep, phase1_sweep_info, ...} message has been processed by the chain.
{ch_sweep_from_other, ChainHeadPid, ChainName, Thisdo_Mods, LastKey}
Sent by the head of the sweeping chain to the head of a chain that is receiving migrating keys. Thisdo_Mods contains the list of store tuples that are moving to the receiving brick; this list is sent down the chain using the usual chain replication protocol. LastKey specifies the sweep key location for the ChainName chain. One of these messages will be sent to each chain that stores keys within the sweep zone; call the number of such chains X.
{sweep_phase2_done, Key, PropList}
When the updates from a {ch_sweep_from_other,...} message have reached the tail of the new chain, this message is sent to the head of the old chain to acknowledge that the migrated keys are now fully replicated on the new chain. When the head brick receives these acknowledgements from all X chains, then the second phase of migration is finished.
In between round 1 and round 2 of a migration sweep iteration, the same value blob "priming" technique is used to prevent disk I/O from blocking the brick’s gen_server process. See sweep_move_or_keep/3 and spawn_val_prime_worker_for_sweep/3.
"SSF" stands for "Server-Side Fun". An SSF is an Erlang fun that is
created by a Hibari client and executed on a Hibari server brick. The
SSF has the ability to rewrite the DoList
of operations in the do
call based on the ability to examine the brick’s internal state.
In the end, the SSF cannot do anything that cannot be done with multiple queries to a brick. For example, here is a simple two-query scheme to update a simple counter value in a race-safe manner:
Incrementing a counter.
{ok, TS, OldValBin} = brick_simple:get(TableName, Key),
OldVal = binary_to_term(OldValBin),
ok = brick_simple:replace(TableName, Key, term_to_binary(OldVal + 1), [{testset, TS}]),
%% Use OldVal in code below this point.
An SSF would use a very similar bit of logic and would create the same
replace
operation.
Incrementing a counter with an SSF.
F = fun(Key, _DoOp, _DoFlags, S) ->
        [{_Key, TS, Val, _Exp, _KeyFlags}] = brick_server:ssf_peek(Key, true, S),
        OldVal = binary_to_term(Val),
        {ok, [brick_server:make_replace(Key, term_to_binary(OldVal + 1), 0, [{testset, TS}]),
              {current_val, OldVal}]}
    end,
Op = brick_server:make_ssf(Key, F),
[ok, {current_val, OldVal}] = brick_server:do(TableName, [Op]),
%% Use OldVal in code below this point.
The SSF fun creates a list of do primitive operations, in this case two items: a replace operation to update the key, and a {current_val, OldVal} term that is passed back to the caller of the do call.
Here is an example that uses the SSF above. It assumes that the shell
variable F
has been bound to the fun above. A cut-and-paste of the
code above will work well, assuming that F
is not already bound to a
shell variable and that the final "end," is replaced with "end.".
(hibari_dev@bb3)13> brick_simple:set(tab1, "c1", term_to_binary(0)).
ok
(hibari_dev@bb3)14> Op = brick_server:make_ssf("c1", F).
{ssf,<<"c1">>,[#Fun<erl_eval.4.105156089>]}
(hibari_dev@bb3)15> brick_simple:do(tab1, [Op]).
[ok,{current_val,0}]
(hibari_dev@bb3)16> brick_simple:do(tab1, [Op]).
[ok,{current_val,1}]
An SSF-based do call can fail like any other do call. For example, here is a case where the timestamp of the key has been modified in a race with another client. Note that the 2nd element of the return term, {current_val,2}, is still returned even though the replace operation failed with a timestamp error.

(hibari_dev@bb3)21> brick_simple:do(tab1, [Op2]).
[{ts_error,1272654442441669},{current_val,2}]
The API of the SSF is a work-in-progress. It is used by one internal Cloudian project but otherwise does not have strong backward-compatibility requirements.
The implementation of SSFs (server-side funs) on the server side of the world is an experiment using a general framework for modifying client do operations on the server before they are executed. This experiment is implemented by the combination of: the brick_preprocess_method configuration attribute in central.conf, the handle_call_do/3 and preprocess_fold/4 functions, and the #state.do_list_preprocess list.
The current implementation allows the output of the SSF to replace the
{ssf, Key, Fun} tuple inside the do
call’s DoList
list of
operations. It does not permit the arbitrary modification of the
entire DoList
list, nor does it permit examination of other
operations in the DoList
. The reasoning for the limitation is that
all DoList
items ultimately come from a single client in a single
do
operation; if the client wanted to reorder things arbitrarily,
the client has the power to do that before sending the do
call.
When an MD5 checksum error is detected by brick_ets.erl
or one of
the write-ahead log modules, dealing with the error is a bit complex:
brick_ets.erl requires changes.
The common_log_sequence_file_is_bad/3
function is used by
gmt_hlog_common.erl
to notify each brick when an MD5 checksum error
is found.
The checkpoint operation is implemented by brick_ets.erl
, but the
API options are documented in the EDoc for
brick_server:checkpoint/3
.
See the section called “ETS tables” and the section called “Processes created by brick_ets.erl” for more information about checkpointing.
The implementation of the scavenger is spread across two modules. As a general rule, the low-level key scanning and hunk
copying is done by functions in brick_ets.erl
, while the
higher-level coordination is implemented in gmt_hlog_common.erl
.
One property worth noting here is the destructive
option. If this
property is false, then the scavenger will read keys but not do
anything to relocate the keys. When used in combination with
{skip_live_percentage_greater_than,100}
, the scavenger can verify
the MD5 checksums of all value blobs by reading all value blob hunks
in sorted, log file sequence order.
The scavenger can be halted manually using
gmt_hlog_common:stop_scavenger_commonlog/0
and resumed using
gmt_hlog_common:resume_scavenger_commonlog/2
.
One of the design ideas behind the scavenger is that it should try to
avoid random I/O as much as possible. The scavenger uses the same
write-ahead log as everyone else, and since the write-ahead log always
uses sequential file writes and does a decent job of batching multiple
writes with a fewer number of fsync(2)
calls, write I/O is mostly
sequential. However, the scavenger’s read I/O should also be as
sequential as possible … so the scavenger goes through a lot of
effort to read all hunks from a single log file at the same time and
in sequential offset order.
The sequential read property introduced at
the section called “Scavenger and code reuse” can be useful in other areas also.
For example, the "fast sync" utility in brick_admin.erl
.
The utility is intended for use in cases where a very large brick has had a catastrophic failure, and all data on that brick is lost. The chain replication algorithm will perform repair based on key lexicographic sort order. The key sorting order is not correlated with storage location within the common log. The result is disk read I/O patterns that are mostly random, meaning it could take days to fully repair a brick that had lost many terabytes of data.
The "fast sync" utility uses the scavenger’s infrastructure to sort live keys into log sequence and offset order (instead of lexicographic order). Ideally, this changes the read I/O pattern on the repairing brick to be more sequential than random. Once the "fast sync" utility is finished, then usual chain replication is used to fix discrepancies caused by updates made while the "fast sync" was running.
The brick shepherd "gen_server" is the public interface for
adding/removing logical bricks to/from the brick_brick_sup
supervisor.
For testing purposes, the functions add_do_not_restart_brick/2 and delete_do_not_restart_brick/2 allow developers and test scripts to crash a brick and prevent its prompt restart.
The brick_simple.erl
module provides three sets of services for
Hibari clients and administrators.
A server process registered as brick_simple runs as part of both the gdss and gdss_client OTP applications.
Each client API call, e.g. add(), get_many(), needs to query the brick_simple server as a prerequisite for the consistent hashing calculation and the {TableName, Key} → chain mapping. The Erlang process dictionary is (ab)used to improve performance by reducing the size of replies sent by brick_simple.
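As a reminder of what these client calls look like in practice, here is a small sketch in the style of the earlier brick_simple examples. The exact return shape of get_many/3 noted in the comment is an assumption for illustration, not a documented contract.

%% Both calls first consult the brick_simple server for the
%% {TableName, Key} -> chain mapping before contacting a brick.
ok = brick_simple:add(tab1, <<"new-key">>, <<"new-value">>),
%% Scan from the first key, returning at most 10 keys; the
%% {ok, {KeyTuples, MoreKeysRemain}} shape is assumed here.
{ok, {KeyTuples, _MoreKeysRemain}} = brick_simple:get_many(tab1, <<>>, 10),
%% Use KeyTuples in code below this point.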
This is a simplified quorum-based replication client for Hibari server bricks. See the section called “brick_admin.erl” for background and Hibari Sysadmin Guide, "The Admin Server Application" section for background info.
The intent of this module is to provide naive, simple, yet robust storage for use by a cluster admin/manager server. Such a server has limited requirements for persistent data, but it doesn’t make sense to store that data on a local disk. For availability, that data should be spread across multiple bricks. However, there’s a chicken-and-egg problem for a cluster manager: if the cluster must be running in order to serve data, how do you start the cluster?
The answer (for now) is to have any machine capable of running an administrator to have a statically configured list of bricks that store cluster manager data. A very simple quorum technique is used for robust storage.
No transaction support is provided, since we assume that the manager will use some other mechanism for preventing multiple managers from running simultaneously.
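To make the idea concrete, here is a minimal sketch of a majority-quorum write. It illustrates the concept only; the module name, function names, and the brick_set/3 helper are all invented for this example and are not the brick_squorum API.

-module(quorum_sketch).
-export([write_quorum/3]).

%% Succeed only if a strict majority of the statically configured
%% bootstrap bricks acknowledge the write.
write_quorum(Bricks, Key, Val) ->
    Results = [try brick_set(B, Key, Val) catch _:_ -> error end
               || B <- Bricks],
    Oks = [R || R <- Results, R =:= ok],
    case 2 * length(Oks) > length(Bricks) of
        true  -> ok;
        false -> {error, quorum_not_met}
    end.

%% Placeholder: a real client would send a set operation to brick B here.
brick_set(_Brick, _Key, _Val) ->
    ok.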
The Admin Server uses the Schema.local
file as a hint for breaking
the chicken-and-egg problem of finding the bootstrap bricks. The
bricks listed in Schema.local
are consulted to try to find a valid
#schema_r
record. Once a copy of that record is found, then the
real list of bootstrap bricks in #schema_r.schema_bricklist
is used
for all subsequent quorum calculations.
The brick_ticket.erl
module implements a subset of functionality of
another Cloudian application. It provides ticket-based workload
limiting services, sometimes called "admission-based rate limiting".
The goal is to limit the number of times X can happen per unit of time, e.g. executions of a function per minute or a maximum number of bytes read per second.
Hibari bricks use this ticket-based system to provide shared
throttling mechanisms for checkpoint and scavenger operations. Via
central.conf
, both activities are limited to a certain amount of
write bandwidth per second.
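Here is a toy sketch of the admission idea: a fixed budget of tickets is granted per one-second window, and work that cannot get a ticket must wait for the next window. The module, function names, and state shape are invented for illustration and are unrelated to the real brick_ticket API.

-module(ticket_sketch).
-export([new/1, get_ticket/1]).

%% State: {MaxPerSecond, CurrentWindow, TicketsUsedInWindow}
new(MaxPerSecond) ->
    {MaxPerSecond, erlang:monotonic_time(second), 0}.

%% Grant a ticket only if the current one-second window still has budget.
get_ticket({Max, Window, Used} = State) ->
    Now = erlang:monotonic_time(second),
    if
        Now =/= Window -> {granted, {Max, Now, 1}};          %% new window
        Used < Max     -> {granted, {Max, Window, Used + 1}};
        true           -> {denied, State}                    %% caller must wait
    end.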
See Hibari Sysadmin Guide, "Write-Ahead Logs" section for background info.
The Erlang/OTP disk_log
library does not support random access into
one of its log files. Hibari’s on-disk value blob storage requires
very efficient random access to the write-ahead log. Therefore,
disk_log
wasn’t sufficient.
I came close to using a bridge to Berkeley DB in order to use only the Berkeley DB logging subsystem. But I discovered some tough problems around the edges of that subsystem, and I abandoned the idea. I don’t recall exactly what the issue(s) were, but they were related to a not-so-clean separation between the logging subsystem and another DB subsystem when the log was opened.
The basic concept for this module is stolen from the Berkeley DB logging subsystem. DB’s logging subsystem allows you to append a hunk to a log. In return, you get an "LSN", which is a file number and byte offset for where that hunk of data is stored. Given the LSN, it’s a trivial matter (with very low overhead) to retrieve any hunk in any desired access pattern.
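The sketch below shows that append-then-retrieve pattern using plain files. It is a toy model of the LSN idea only, assuming one hunk per append; it is not how gmt_hlog.erl is implemented, and the file naming merely imitates the .HLOG names shown later.

-module(lsn_sketch).
-export([append_hunk/2, read_hunk/2]).

%% Append a binary hunk to log file FileNum; return its LSN, {FileNum, Offset}.
append_hunk(FileNum, Hunk) when is_binary(Hunk) ->
    {ok, Fd} = file:open(log_path(FileNum), [read, write, raw, binary]),
    {ok, Offset} = file:position(Fd, eof),
    ok = file:write(Fd, Hunk),
    ok = file:close(Fd),
    {FileNum, Offset}.

%% Random access: fetch Size bytes at the LSN with a single pread call.
read_hunk({FileNum, Offset}, Size) ->
    {ok, Fd} = file:open(log_path(FileNum), [read, raw, binary]),
    Result = file:pread(Fd, Offset, Size),
    ok = file:close(Fd),
    Result.

log_path(FileNum) ->
    lists:flatten(io_lib:format("~12..0w.HLOG", [FileNum])).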
I added a couple of features that Berkeley DB’s logging subsystem does not have. For example, a hunk can be given an extra level of addressing: instead of being identified only by {FileNum,Offset}, it would be identified by {FileNum,Offset,HunkNumber}. I don’t believe this feature is actually used by Hibari, but the CPU and storage overheads to support it are low enough to be ignored.
The gmt_hlog.erl
module started as the sole implementation module of
a write-ahead log. Later, it was split into three separate modules:
gmt_hlog.erl
gmt_hlog_common.erl
gmt_hlog_local.erl
Many of the gmt_hlog_local exported API functions are just 2-line wrappers around calls to gmt_hlog functions, because there is nothing for the proxy to do in those cases.
Throughout these modules, a hunk’s storage location is expressed as a {SeqNum, Offset} pair.
First, let’s have a short review of Hibari brick disk storage properties and goals. See the Hibari Sysadmin Guide, "Write-Ahead Logs" section and also the "Brick Initialization" section for background info.
All key updates are written to the write-ahead log.
The key metadata hunks record the {FileNum,Offset} storage location of disk-based value blob hunks, which are stored separately from the key metadata hunks.
As a consequence of these properties, it is quite difficult to fully support Items #4 and #5 simultaneously.
The implementation of gmt_hlog.erl tries to resolve the conflict between I/O efficiency and short startup times by storing files in two different areas:

Short term storage (called shortterm in the code). All key metadata hunks are always stored in short term storage. To support item #4, value blob hunks are also written to short term storage in order to amortize the expense of fsync(2) system calls. Short term files live in the "s" subdirectory, where "s" is an abbreviation for "short".

Long term storage (called longterm in the code). All long-lived value blob hunks, i.e. those hunks which have existed long enough to exist after a checkpoint operation, are stored in long term storage.
Top-level listing of the common log’s gmt_hlog
file store.
% ls -l hlog.commonLogServer
total 24
drwxrwxr-x 9 hibari root    4096 2010-06-09 13:56 1/
drwxrwxr-x 9 hibari root    4096 2010-06-09 13:56 2/
drwxrwxr-x 9 hibari root    4096 2010-06-09 13:56 3/
-rw-rw-r-- 1 hibari root      14 2010-06-09 13:59 flush
drwxrwxr-x 2 hibari root    4096 2010-06-09 13:56 register/
drwxrwxr-x 2 hibari root    4096 2010-06-09 13:59 s/

% ls -l hlog.commonLogServer/s
total 16760
-rw-rw-r-- 1 hibari root 2701321 2010-06-09 13:58 000000000014.HLOG
-rw-rw-r-- 1 hibari root 2719724 2010-06-09 13:58 000000000015.HLOG
-rw-rw-r-- 1 hibari root 2765506 2010-06-09 13:59 000000000016.HLOG
-rw-rw-r-- 1 hibari root 2679379 2010-06-09 13:59 000000000017.HLOG
-rw-rw-r-- 1 hibari root 3008375 2010-06-09 13:59 000000000018.HLOG
-rw-rw-r-- 1 hibari root 2785571 2010-06-09 13:59 000000000019.HLOG
-rw-rw-r-- 1 hibari root  451182 2010-06-09 13:59 000000000020.HLOG
-rw-rw-r-- 1 hibari root      45 2010-06-09 13:56 Config

% ls -l hlog.commonLogServer/1
total 28
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:56 1/
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:56 2/
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:59 3/
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:57 4/
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:56 5/
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:59 6/
drwxrwxr-x 2 hibari root 4096 2010-06-09 13:57 7/

% ls -l hlog.commonLogServer/1/7:
total 2688
-rw-rw-r-- 1 hibari root 2745369 2010-06-09 13:56 -000000000006.HLOG
The application can request that a hunk be written into either area by the write-ahead log. As examples:
An entire log file can be moved from short term storage to long term storage by renaming it to its long term name, i.e. from a positive log sequence number to a negative log sequence number. This renaming operation is done at the end of a checkpoint operation by bricks that store value blobs on disk: see the "Brick checkpoint processing steps" list in Hibari Sysadmin Guide, "Brick Checkpoint Operations" section.
Log files cannot move from long term to short term storage.
Given the separation of various log hunk types into short term and long term storage, we have a couple of new properties:
The number of log sequence files in short term storage will be quite small, usually well under 1,000 files total and typically under 100 files.
The number of log sequence files in long term storage can be huge, perhaps millions of files.
In theory, the constants to define the number of subdirectories
at each level are completely flexible. In reality, use of the
There’s a fundamental tension between writing hunks to disk as
efficiently as possible and flushing them to stable storage via the
fsync(2)
system call. The latency overhead of fsync(2)
on
Winchester disks is extremely high. Also, Linux-based systems can
block writes to a file descriptor when an fsync(2)
operation is in
progress.
To mitigate the effects of fsync(2)
calls, a couple of strategies
are used. First, file:sync/1
calls are performed asynchronously
when feasible. Second, writes are buffered by the log’s "gen_server"
process while a file:sync/1
call is in progress. The "gen_server"
will promise to write the hunk at a given {FileNum,Offset}
location
but won’t actually write the data there until the fsync(2)
is
finished.
The "gen_server"'s write buffering opens up a can of worms: nasty race conditions. Races with write requests, fsync requests, in-progress fsync calls, and "advance sequence number" requests are possible. This module has been extensively reviewed and tested with both unit tests and QuickCheck to eliminate those race conditions. We believe that all bugs have been eliminated, but as with many things in the software world, we don’t really know with 100% certainty.
See Hibari Sysadmin Guide, "OS Readahead Configuration" section for background.
The key metadata for disk-based value blob storage contains both the
storage location of the value blob hunk and the size of the blob.
The blob’s size is usually passed through the gmt_hlog
API when
reading the blob. First the hunk header must be read and then the
value blob can be read. A function like read_hunk_summary/5
uses
the blob size argument to help combine (further down in the call
chain) the two reads into a single file:pread()
call via the
my_pread/4
function.
The my_pread/4 function is a simple application-level readahead mechanism. The two major types of read operations both have elements of readahead to them: in each case, the extra bytes are fetched as part of a single file:pread() call.
See also the discussion at xref:squid-flash-priming.
This module is responsible for combining write & fsync requests from multiple bricks, via their gmt_hlog_local private write-ahead log
servers, into a single write-ahead log. The writes and fsyncs must
provide a durable storage service (to prevent data loss), and
performance must be good enough (despite the slow speed of fsync(2)
operations on Winchester disk drives).
The "short term/long term" storage areas are described in
the section called “gmt_hlog.erl”. Together with gmt_hlog_local.erl
, this module
creates a different separation of hunk types:
Key metadata hunks, type = ?LOGTYPE_METADATA. Hunks of type ?LOGTYPE_METADATA are copied asynchronously out of the common log and into their pre-chosen {FileNum,Offset} location in the brick’s private write-ahead log by the do_sync_writeback/1 function. When a logical brick restarts, all common → private log writebacks are performed synchronously via full_writeback/1 before the brick is allowed to restart.
Value blob hunks, type = ?LOGTYPE_BLOB, and hunks of type ?LOGTYPE_BAD_SEQUENCE.
Another part of the lazy writeback process is to manage the transition of files from the common log’s “short term” storage area to “long term” storage. See the section called “Short term vs. long term log storage” for background; see Hibari Sysadmin Guide, "High I/O rate devices (e.g. SSD) may be used" section for a discussion of using high-speed non-volatile storage such as solid-state memory disks.
If SSD or similar disk is used, then a file system for that disk
should be mounted under the common log’s short term directory:
.../var/data/hlog.commonLogServer/s
. The
do_bigblob_hunk_writeback/3
function, together with
clean_old_seqnums/2
, will take care of copying the ?LOGTYPE_BLOB
hunks from the short term file system to the long term file system,
which (we assume) will use traditional, cheap Winchester-type disk drives.
See also: the section called “Scavenger and code reuse”.
When the scavenger copies a ?LOGTYPE_BLOB
hunk to its new storage
location, the brick that owns that value blob must be notified of its
new location. A brick that isn’t running & in ok
state cannot (by definition) handle such notifications. If not all logical bricks are running, the scavenger should be aborted.
The register_local_brick/2
function is used by a brick’s startup to
assist the common log keep track of all logical bricks that may have
records stored in the common log. If a registered brick is not
running when the scavenger starts, the scavenger will be aborted.
Brick failures during the scavenger’s run are handled separately.
gmt_hlog
"gen-server" process to handle I/O to and from the
"common log".
brick_dirty_buffer_wait
time interval has passed.
gdss
application in certain MD5 checksum
failure scenarios.
The gmt_hlog_local.erl
module’s purpose is to present the same
client API as gmt_hlog.erl
while providing the "common log"/"private
log" split that is described in the
Hibari Sysadmin
Guide, "Write-ahead logs in the Hibari application" section.
This module uses atoms for mapping to hunk metadata types:
metadata
→ ?LOGTYPE_METADATA
bigblob
→ ?LOGTYPE_BLOB
The most important thing this module does is maintain the
{FileNum,Offset}
storage locations for the private log. This task
is done with some knowledge of the internal workings of
gmt_hlog.erl
:
1. Use gmt_hlog:create_hunk/3 to create a fully-serialized hunk.
2. Wrap the serialized hunk in an {eee,...} tuple and write it to the common log.
3. Return to the caller the {FileNum,Offset} maintained for the private log, not the storage location given by the common log.
4. Later, the common log’s writeback process finds the {eee,...} tuples for this brick and copies the serialized hunks to the exact storage locations specified by step #3.
Step #4 in the outline above has a huge race vulnerability
window. Any test that tries to write a hunk using
This is the callback module for the inets
application’s HTTP server
for the Admin Server’s HTTP status server at http://localhost:23080/
(default URL).
Sooner or later, a Hibari developer will need to do some debugging. It may be a learning exercise, trying to figure out what existing code is doing. Or perhaps it’s new code that needs debugging. In either case, there are several sets of tools available to Erlang developers.
The OTP "debugger" application is useful because it provides a GUI that many developers find comfortable, particularly the "breakpoint" feature. However, "debugger" app doesn’t always work very well with Hibari, especially on the server side.
Therefore, we recommend that you use tracing-based tools for debugging Hibari. These tools do not require a GUI, so it’s possible to use them for diagnosing remote systems where a GUI may be impossible to support. Also, the tracing tools give much finer control over what process (or processes) will be traced.
The Erlang/OTP runtime system provides an extremely powerful set of tracing tools. See the Erlang/OTP documentation in the "Tools" section for several applications that are built on top of Erlang’s tracing primitives: "dbg" (in the "runtime_tools" subsection), "observer", "inviso", and others.
Cloudian has had very positive experience with the "redbug" application. "Redbug" is bundled with the commercial packaging of Hibari and is used frequently by both customer operations staff and Cloudian field engineers to diagnose problems with both Hibari client applications and the Hibari server.
The Hibari source code has been annotated with over 400 tracepoints
using macros based on the gmt_elog.erl
and gmt_elog_policy.erl
modules. These tracepoints give the developer (and even field support
staff) more options for tracing events through Hibari’s code.
Trace data can be collected by two methods:
DTrace (or SystemTap), introduced in Hibari v0.3.0. Requirements: an Erlang VM built with the --with-dynamic-trace option, and a Unix/Linux platform with DTrace or SystemTap support.
For more details about DTrace, please read this 61-page paper. It’s not Erlang VM specific but will show you a practical example of how we can use DTrace in a production system.
The dbg module in Erlang/OTP.
The gmt_elog
tracepoints are designed to be extremely lightweight.
While they can be disabled completely at compile-time, their overhead
is so low that they can remain in production code and be enabled only
when needed for debugging.
For example, on an x86 laptop with the CPU frequency fixed at 1.33GHz,
a microbenchmark that called a gmt_elog_policy
tracepoint when
system tracing was disabled (i.e. normal system state) executed
88999000 trace calls in 12.925467 seconds, averaging 6.886 million
calls/second.
There are two major types of tracepoints that annotate the Hibari
code. The macros for both types are defined in the brick.hrl
header
file. The underlying mechanism is provided by macros in the
gmt_elog.hrl
header file.
The ?E_
* macros, e.g. ?E_INFO/2
, ?E_ERROR/2
. These macros are
similar in spirit to the C library’s syslog(3)
function: free-form
message text (with formatting by io_lib:format/2
) with a severity
level.
These macros actually perform two functions: generate an
application log event via
Basho Lager
and optionally generate a gmt_elog
trace message. When examining
large traces, it proved extremely inconvenient to try to merge
application log messages into the flow of trace output … so now the
macros perform that merging task automatically.
For trace events that do not merit an application log entry using Lager, the ?DBG_* macros provide a convenient way to specify a filtering category (see below) together with either io_lib:format/2 style formatted text or an arbitrary Erlang term (as in the ?DBG_ETSx example below).
The definition of ?E_INFO/2
macro in brick.hrl
.
-define(E_INFO(Fmt, Args), ?ELOG_INFO(?CAT_GENERAL, Fmt, Args)).
The definition of the ?ELOG_INFO macro in gmt_elog.hrl (the 1-arity form is shown here).

-define(ELOG_INFO(Msg),
        begin
            lager:info(Msg),
            gmt_elog_policy:dtrace(info, undefined, ?MODULE, ?LINE, Msg, [])
        end).
The ?DBG_* macros use the following filtering categories. The categories use integers with C-style bit masks, which allows a single trace message to belong to multiple categories by bit-wise OR’ing the categories together.
?DBG_
* categories from brick.hrl
.
%% Any component
-define(CAT_GENERAL, (1 bsl 0)). % General: init, terminate, ...
%% brick_ets, mostly, except where there's cross-over purpose.
-define(CAT_OP,      (1 bsl 1)). % Op processing
-define(CAT_ETS,     (1 bsl 2)). % ETS table
-define(CAT_TLOG,    (1 bsl 3)). % Txn log
-define(CAT_REPAIR,  (1 bsl 4)). % Repair
%% brick_server, mostly, except where there's cross-over purpose.
-define(CAT_CHAIN,   (1 bsl 5)). % Chain-related
-define(CAT_MIGRATE, (1 bsl 6)). % Migration-related
-define(CAT_HASH,    (1 bsl 7)). % Hash-related
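Combining and testing categories works like any other integer bit mask. The values below are copied from the defines above; the surrounding expressions are only an illustration.

%% Combine two categories with bor; test membership with band.
Mask = (1 bsl 1) bor (1 bsl 2),          %% ?CAT_OP bor ?CAT_ETS
true  = (Mask band (1 bsl 2)) =/= 0,     %% the ETS bit is set
false = (Mask band (1 bsl 5)) =/= 0,     %% the chain bit is not set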
Here are a couple of examples of using these macros.
Sample usage of ?E_INFO
and ?DBG_ETSx/2
.
?E_INFO("~s: got unknown message ~P\n", [?MODULE, Msg, 20]). %% Example of ?DBG_ETSx() ?DBG_ETSx({inserted, S#state.tab, Key, size(Val)}).
In the end, the result of both these macros is a call to gmt_elog_policy:dtrace/6. When dtrace_support in sys.config is set to false, or when DTrace/SystemTap support is not available in the Erlang VM, this function does nothing, which keeps its overhead as low as possible.
When dtrace_support
is set to true
and DTrace/SystemTap support is
available, this function will trigger the "user" trace probe
in the dyntrace NIF module to send the trace message to DTrace
consumers.
If you want to use the dbg module instead of DTrace/SystemTap, setting dtrace_support to false is recommended to avoid unnecessary calls to the user trace probe. The Erlang tracing mechanism does not rely on the probe; it records calls to gmt_elog_policy:dtrace/6 together with their arguments.
The spec of gmt_elog_policy:dtrace/6
function.
-type log_level() :: 'emergency' | 'alert' | 'critical' | 'error'
                   | 'warning' | 'notice' | 'info' | 'debug' | 'trace'.
-spec dtrace(log_level(), integer() | undefined, module(), integer(),
             string(), [term()]) -> true | false | error | badarg.
The arguments for the dtrace/6 function are usually generated by macros for convenience. The arguments are:

1. The severity level, similar in spirit to the syslog(3)-like priority numbers (which are defined in gmt_applog.hrl, e.g. ?LOG_EMERG_PRI is level 0, ?LOG_INFO_PRI is level 6).
2. The trace category as an integer bit mask, e.g. ?CAT_ETS → (1 bsl 2), or undefined.
3. The module name (atom()), normally supplied as ?MODULE.
4. The line number (integer()), normally supplied as ?LINE.
5. An io_lib:format/2 formatting string (string()).
6. An io_lib:format/2 formatting argument list.
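In practice these arguments are supplied by the macros, but a direct call looks like the one below; the concrete argument values are only an illustration.

%% info severity, the ?CAT_ETS category (1 bsl 2 = 4), module, line,
%% and an io_lib:format/2 string plus its argument list.
gmt_elog_policy:dtrace(info, 4, brick_ets, 123,
                       "inserted ~p keys into ~p", [42, tab1]),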
You will use the dtrace or systemtap command to collect trace data, not only from the ?DBG_* macros but also from the Erlang VM and the Unix/Linux kernel. You can perform simple tasks by giving arguments directly to these commands, or write a D or SystemTap script for more complex tasks. Note that you’ll need root privileges to run these commands and scripts.
As of Erlang/OTP R15B, we can collect a good deal of information about the Erlang VM; the efile_drv.c file I/O driver, for example, is 100% instrumented.
You can also collect Unix/Linux kernel information, such as system call and disk I/O activity.
Basic DTrace/SystemTap recipes will be shipped with Hibari starting v0.3.1 or v0.3.2. But until then, please refer to the following resources:
The examples in the libs/runtime_tools-x.y.z/examples/ directory of your Erlang/OTP installation.
The DTrace Book: "DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X, and FreeBSD" by Brendan Gregg and Jim Mauro, Prentice Hall 2011, 1152 pages.
The gmt_elog:help/0
function provides a quick reference for how to
use the gmt_elog
tracing functions.
Example of gmt_elog:help/0
usage.
(hibari_dev@bb3)56> gmt_elog:help().
gmt_elog:start_tracing().
gmt_elog:start_tracing("/path/to/trace-file").
gmt_elog:add_match_spec().                     ... to trace everything
gmt_elog:add_match_spec(dbg:fun2ms(fun([_, target_category, _, _, _, _]) ->
                                       return_trace() end)).
gmt_elog:add_match_spec(dbg:fun2ms(fun([Pri, _, _, _, _, _]) when Prio > 0 ->
                                       return_trace() end)).
gmt_elog:add_match_spec(dbg:fun2ms(fun([_, target_category, target_module, _, _, _]) ->
                                       return_trace();
                                      ([_, _, other_module, 88, _, _]) ->
                                       return_trace() end)).
gmt_elog:del_match_spec().
gmt_elog:stop_tracing().
gmt_elog:format_file("/path/to/trace-file").
gmt_elog:format_file("/path/to/trace-file", "/path/to/output").
gmt_elog:format_file("/path/to/trace-file", "/path/to/output", MatchStr|"").
ok
When we enable tracing, we’re actually tracing calls to the
gmt_elog_policy:dtrace/6
function. The Erlang VM’s tracing mechanism
will tell us the module name, function name, and arguments of any
function call that meets its match specification. As with any tracing
match specification, we can be as general (using '_'
wildcards) or
as specific as we wish in filtering trace events.
For the examples of add_match_spec shown above, here is an explanation of each match specification; the list being matched contains the six arguments of gmt_elog_policy:dtrace/6.
Trace anything where the Category arg == target_category. Remember that the Hibari app uses integers, not atoms, for the Category arg, so use an integer here, e.g. 8 for (1 bsl 3), for the ?CAT_TLOG category.

Trace anything where the Prio argument > 0, i.e. everything that isn’t emergency priority, ?LOG_EMERG_PRI.

Trace anything where either of the following conditions is true:

Category == target_category and Module == target_module

Module == other_module and Line == 88
Outline of a typical tracing session.
> gmt_elog:start_tracing("/tmp/trace-file1").
> gmt_elog:add_match_spec(dbg:fun2ms(fun([_, Cat, _, _, _, _]) when Cat == 1 ->
                                         return_trace() end)).
> %% ... trigger some activity that you wish to trace
> %% ... when you are finished, resume with:
> gmt_elog:del_match_spec().
> gmt_elog:stop_tracing().
> gmt_elog:format_file("/tmp/trace-file1", "/tmp/trace-file1.out").
If you have a low volume of trace messages, it may be convenient to
use gmt_elog:start_tracing/0
to send those messages to the shell
window instead of to a file.
See the EDoc for the gmt_elog module for detailed information on how to use all the tracing functions in that module. See also the documentation for the dbg:fun2ms/1 function for an overview of the (usually) simpler method of creating a match specification.
The subsections here are an assortment of items that were inconvenient to add in-line elsewhere in this guide. Like Erlang’s "let it crash" philosophy, it’s a bit tidier to let something else handle exceptions. This section is the "something else".
Although it is possible to run the Hibari app multiple times simultaneously on the same physical machine (or guest OS instance, if you’re using virtualization software such as VMware or Xen), it isn’t always easy. There are two reasons why multiple Hibari instances might conflict on the same machine:
Erlang node names take the form VM_Name@Host_Name
, using the "short"
host naming convention, via "erl -sname VM_Name@Host_Name". It is not
possible to run two Erlang virtual machines with the same "VM_Name".
Example of Erlang node name conflict: duplicate_name
.
{error_logger,{{2010,4,14},{19,41,28}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}}, [{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2}, {net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
If run in developer’s interactive mode,
(e.g. ???), the node name is defined in
gdss/src/Makefile
. Edit the NODENAME1
variable in Makefile
, then
re-run your make
command.
If run as a daemon,
the Erlang node name is taken from the central.conf
file, the application_nodename
attribute. Running multiple
instances of Hibari using the same installation directory will use the
same central.conf
file. The work-around is to install Hibari
multiple times, one for each Erlang VM instance. Each installation
will require a different installation prefix,
e.g. /usr/local/hibari1
and /usr/local/hibari2
. Then the
application_nodename
attribute in each central.conf
file can
contain a unique value.
The multiple-installation path technique above is not sufficient for running multiple Hibari daemons/virtual machines on the same box. See also the section called “Multiple Hibari instances have TCP port conflicts”.
TCP port conflict: eaddrinuse
(with Admin Server HTTP Web server on TCP port 23080).
=GMT ERR REPORT==== 14-Apr-2010::19:43:19 === std_error: "Failed initiating web server: \nundefined\n{{listen,eaddrinuse},\n {child,undefined,\n {httpd_acceptor,any,23080},\n....
The error message above may be hidden among other app log messages, some related to startup and others to the premature shutdown.
Using a utility like grep
in the brick.log
file (interactive mode)
or .../var/log/gdss-app.log
file (daemon mode) might be more
helpful. Look for the pattern eaddrinuse
.
Port conflicts can exist for a number of TCP ports used by Hibari:

The Admin Server’s HTTP status server is configured by an Erlang/OTP configuration file (and not central.conf). In interactive mode, the file is ./root/conf/admin.conf, relative to your current working dir (i.e. the gdss/src subdirectory); in daemon mode, it is /path/to/hibari-top/gdss/1.0.0/etc/root/conf/admin.conf.

The other ports are configured by the central.conf file. In interactive mode, the file is ../priv/central.conf; in daemon mode, it is /path/to/hibari-top/gdss/1.0.0/etc/central.conf. The relevant attributes include cli_port, brick_s3_tcp_port, gdss_ebf_tcp_port, and so on.
brick_bp.erl
Note.
brick_chainmon.erl
Note.
brick_ets.erl
Note.
#state
record in brick_ets.erl
Note.
brick_pingee.erl
Note.
brick_sb.erl
Note.
logging_op_q
mess shared by brick_ets.erl
and brick_server.erl
Note.
brick_simple.erl
Note.
brick_squorum.erl
Note.
gmt_hlog.erl
Note.
gmt_hlog.erl
Note.
write_back_to_local_log/8
function
Note.
mod_admin.erl
Note.
brick_admin:get_table_chain_list/{1,2} gets the current operational chain list, not the healthy chain list.
See brick_server:start_scavenger/3 for the API options proplist for the scavenger. However, that function should be removed, because almost all scavenger code from brick_server.erl has been moved over to gmt_hlog_common.erl. This inconsistency should be fixed.