LIO Configuration File Options
Description
The LIO and L-Store command suite is based on a highly configurable, pluggable architecture. This section describes the configuration file options
and also how they interact with the command line options.
The configuration file uses the INI file format with a few modifications:
- Include file support using the “%include /file/to/include” directive
- Include search path using the “%include_path /path/to_add” directive. The paths are searched in the
order they are added.
- Repeated sections. The same section can be listed multiple times. The application can choose to use only the first occurrence
or all repeated sections.
- Support for auto scaling of integers using b, Ki, Mi, Gi, and Ti for base-2 values and K, M, G, and T for base-10.
For example 1Ki = 1024 and 1K = 1000.
Comments can appear on any line. They start with a “#” and continue until the end of the line. It is possible to escape special characters
by preceding them with a backslash (“\”).
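As a minimal sketch of these syntax features, the fragment below shows a hypothetical [example] section. The key = value form, the values, and the escaped “#” are illustrative assumptions rather than options taken from a real configuration:

    %include /etc/lio/common.cfg      # Pull in another configuration file
    %include_path /etc/lio/conf.d     # Add a directory to the include search path

    [example]            # A comment runs from the "#" to the end of the line
    buffer_size = 10Mi   # Auto-scaled base-2 value: 10*1024*1024
    rate_limit = 1K      # Auto-scaled base-10 value: 1000
    label = build\#7     # Backslash escapes the special character so it is not treated as a comment

    [example]            # Repeated section; an application may use the first occurrence or all of them
    buffer_size = 20Mi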
Passing custom command-specific default options
On startup all commands call the same routine to initialize the LIO environment. This routine parses the common options and also prepends
additional options retrieved from the user’s environment. The simplest method is to store these options in the LIO_OPTIONS environment
variable. But each command can also have its own specific default options. Commands of the form “lio_*command*” use the environment
variable LIO_OPTIONS_COMMAND, where COMMAND is substituted with the actual command. For example lio_cp looks for LIO_OPTIONS_CP.
All other commands use the full command name. For example arc_create looks for the environment variable LIO_OPTIONS_ARC_CREATE.
Which configuration file to load?
If a specific configuration file is given on the command line with the “-c” option, it is always used. If no configuration
file is provided then the system cycles through the default locations looking for a valid configuration file. The search sequence is:
- Look for lio.cfg in the current directory
- Look in the user’s home directory - ${HOME}/.lio/lio.cfg
- Check /etc/lio/lio.cfg
Which section and user to load?
A common command line option is the specification of the LIO config to use as denoted by the “-lc user@section” notation. This translates to
loading the credentials from the INI file section labeled [user] with the LIO configuration being loaded from [section]. The default is to
use the section lio and the user key contained within that section for the credentials. This section is mainly a collection of other sections
to load defining the core components. These other sections contain the configuration specifics. Below is a list of options:
- timeout
- Global timeout for any command in seconds.
- max_attr_size
- Maximum size of any attribute to send or receive.
- tpc_cpu
- Thread pool size tied to the number of CPU cores. It defaults to the number of physical cores.
- tpc_unlimited
- Thread pool size for unbounded numbers of threads. Make sure this value is large enough to avoid deadlock.
Tasks executing in this thread pool can submit other tasks to the same thread pool and wait for their completion.
Default value is 10000.
- mq
- Section defining which Message Queue context to load. Default is mq_context.
- ds
- Section defining which Data Service to load. Default is ds_ibp.
- rs
- Section defining which Resource Service to load. Default is rs_simple.
- os
- Section defining which Object Service to load. Default is os_file.
- user
- Section defining which user credentials to load. Default is guest.
- cache
- Section defining which type of Global Data caching mechanism to load. Default is cache_amp.
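A minimal sketch of the [lio] section is shown below. The referenced section names simply follow the stated defaults, and the numeric values other than tpc_unlimited are illustrative:

    [lio]
    timeout = 120            # Global command timeout in seconds (illustrative)
    max_attr_size = 10Mi     # Maximum attribute size to send or receive (illustrative)
    tpc_cpu = 8              # Normally left to default to the number of physical cores
    tpc_unlimited = 10000    # Default value
    mq = mq_context          # Message Queue context section
    ds = ds_ibp              # Data Service section
    rs = rs_simple           # Resource Service section
    os = os_file             # Object Service section
    user = guest             # Credentials section
    cache = cache_amp        # Global data cache section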
Object path specification and unified namespace
We adopt an scp-style syntax, “user@host:/path/to/file”, when specifying L-Store file names. In reality this specifies the section and user
of the config file to load. If you want to use the default user or host just leave the field blank. For example “bob@vu:/path/to/file”,
“bob@:/path/to/file”, “@vu:/path/to/file”, and “@:/path/to/file” are all valid path specifications. If the computer has the corresponding LFS mount
this notation can be completely dropped because the LIO command determines if the file is located on an LFS mount and gets the information
directly.
How to extend LIO?
To extend LIO just add additional [plugin] sections as needed. These are all loaded before any other initialization is performed. The following
options are supported:
- section
- Section name for the service used for locating the plugin
- name
- Service name. The (section,name) tuple is used for locating any plugin or service.
- library
- Shared library to load containing the needed symbol
- symbol
- Symbol to load in the shared library. There is no fixed symbol signature. It is up to the application using the symbol to
understand how to properly use it.
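As a sketch, a plugin registration might look like the following; the section name, service name, library path, and symbol are all hypothetical:

    [plugin]
    section = cache_myalg               # Hypothetical section name used to locate the plugin
    name = myalg                        # The (section,name) tuple identifies the service
    library = /usr/lib/lio/libmyalg.so  # Hypothetical shared library containing the symbol
    symbol = myalg_cache_create         # Hypothetical symbol; its signature is up to the application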
Cache Plugins
There are several global data caching plugins that can be loaded. This is only used for caching data blocks. It is not used for
caching metadata.
cache_amp
The caching module is loosely based on IBM’s “AMP: Adaptive Multi-stream Prefetching in a Shared Cache”.
- max_bytes
- Total size of data cache.
- max_streams
- Maximum number of data streams to track.
- dirty_fraction
- When this fraction of the cache becomes dirty it triggers a flush to backing store.
- dirty_max_wait
- Max wait time before flushing data.
- default_page_size
- Default page size. Most of the time the segment service explicitly specifies this value.
- async_prefetch_threshold
- How far ahead to set a data prefetch trigger.
- min_prefetch_size
- Minimum number of bytes to prefetch on any request.
- max_fetch_fraction
- Don’t use more than this fraction of the cache for prefetching.
- max_fetch_size
- Maximum amount of data to prefetch in a given call.
- write_temp_overflow_fraction
- Overflow allowance on the total cache size. This gives the caching layer some wiggle room to perform copy-on-writes.
It’s used when pages are being flushed to disk and the application wants to overwrite the data. In this case
the page is duplicated, allowing the write to go through while the flush completes. After the flush completes
the old page is removed.
- ppages
- Maximum number of partial pages. This is a small number, typically about 64. The purpose of these pages is
to act as a pre-cache in front of the actual cache to handle partial page writes. The idea is that the application may
have broken a full page write into multiple smaller partial page writes. So instead of forcing a page read
for the initial partial write, the partial page data is stored in a special page designed to accumulate the partial
writes. Hopefully by the time the data needs to be flushed the full page has been written. Otherwise the
original page is loaded and merged with the partial page.
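A sketch of a [cache_amp] section follows. The values, and the units assumed for the time and size options, are illustrative; only ppages reflects the typical value mentioned above:

    [cache_amp]
    max_bytes = 1Gi                      # Total data cache size
    max_streams = 500                    # Number of data streams to track
    dirty_fraction = 0.1                 # Trigger a flush when this fraction of the cache is dirty
    dirty_max_wait = 60                  # Max wait before flushing data (assumed seconds)
    default_page_size = 64Ki             # Usually overridden by the segment service
    async_prefetch_threshold = 256Ki     # How far ahead to set the prefetch trigger
    min_prefetch_size = 128Ki            # Minimum bytes prefetched per request
    max_fetch_fraction = 0.25            # Cap on the cache fraction used for prefetching
    max_fetch_size = 16Mi                # Maximum data prefetched in a single call
    write_temp_overflow_fraction = 0.05  # Copy-on-write overflow allowance
    ppages = 64                          # Partial pages acting as a pre-cache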
cache_lru
Implements the classic Least Recently Used page replacement algorithm for data cache.
- max_bytes
- Total size of data cache.
- dirty_fraction
- When this fraction of the cache becomes dirty it triggers a flush to backing store.
- dirty_max_wait
- Max wait time before flushing data.
- default_page_size
- Default page size. Most of the time the segment service explicitly specifies this value.
- max_fetch_fraction
- Don’t use more than this fraction of the cache for prefetching.
- write_temp_overflow_fraction
- Overflow allowance on the total cache size. This gives the caching layer some wiggle room to perform copy-on-writes.
It’s used when pages are being flushed to disk and the application wants to overwrite the data. In this case
the page is duplicated, allowing the write to go through while the flush completes. After the flush completes
the old page is removed.
- ppages
- Maximum number of partial pages. This is a small number, typically about 64. The purpose of these pages is
to act as a pre-cache in front of the actual cache to handle partial page writes. The idea is that the application may
have broken a full page write into multiple smaller partial page writes. So instead of forcing a page read
for the initial partial write, the partial page data is stored in a special page designed to accumulate the partial
writes. Hopefully by the time the data needs to be flushed the full page has been written. Otherwise the
original page is loaded and merged with the partial page.
cache_round_robin
This caching method implements a segmented cache with files being assigned in a round robin fashion to each independent cache.
As such this caching method does very little itself, with all the heavy lifting done by the selected child cache.
- n_cache
- The number of independent caches to create. NOTE: Be aware that the total amount of data used for cache is this number
times the cache size in the child cache.
- child
- Section to load defining the child cache.
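A sketch of a round-robin cache delegating to an LRU child is shown below; the values are illustrative. Note that the total cache memory is n_cache times the child’s max_bytes:

    [cache_round_robin]
    n_cache = 4           # Four independent caches; total cache = 4 x the child's max_bytes
    child = cache_lru     # Section defining the child cache

    [cache_lru]
    max_bytes = 512Mi     # Per-child cache size (illustrative)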
User
Loads the credentials for a user. Right now this just uses a very simplistic implementation.
Data Service
The data service is responsible for block level data movement and storage. Currently IBP is the only implementation supported.
ds_ibp
IBP-based data service implementation. The section contains both DS_IBP-specific options and lower-level IBP options. This section only
discusses the DS_IBP-specific options. For the IBP options refer to the appropriate section.
- duration
- Default allocation expiration in seconds. A good value is a couple of days. This gives the lio_warmer time to run and
extend the duration further.
- chksum_type
- Type of checksum to use for allocation creation. The default uses whatever the default mode is on the depot.
Valid values are NONE, MD5, SHA1, SHA256, SHA512.
- chksum_blocksize
- Size of each checksum block. Default is 64Ki.
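A sketch of a [ds_ibp] section; the duration follows the “couple of days” guidance above, the checksum type is an arbitrary choice from the valid values, and 64Ki is the stated default block size:

    [ds_ibp]
    duration = 172800        # Allocation expiration in seconds (2 days)
    chksum_type = SHA256     # One of NONE, MD5, SHA1, SHA256, SHA512
    chksum_blocksize = 64Ki  # Checksum block size (default)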
Message Queue
The Message Queue, or MQ for short, controls how metadata flows between clients and servers. There is currently only a single implementation, based
on ZeroMQ.
- min_conn
- Minimum number of connections to make to a server.
- max_conn
- Maximum number of connections allowed to a single server.
- min_threads
- Minimum number of message execution threads
- max_threads
- Maximum number of message execution threads. If this number is too small messages will backlog and can potentially cause a deadlock.
- backlog_trigger
- Number of messages backlogged for sending before spawning a new connection.
- heartbeat_dt
- Client/Host heartbeat interval in seconds.
- heartbeat_failure
- If no valid heartbeat is received in this time frame (secs) then the connection is considered dead and all pending commands are failed.
- min_ops_per_sec
- This is used in conjunction with the backlog_trigger to determine when an additional connection should be created.
This variable sets the threshold for the application’s message sending rate before considering spawning an additional connection.
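A sketch of an MQ section, using the default section name mq_context; all values are illustrative:

    [mq_context]
    min_conn = 1             # Minimum connections to a server
    max_conn = 4             # Maximum connections to a single server
    min_threads = 2          # Minimum message execution threads
    max_threads = 20         # Too small a value backlogs messages and risks deadlock
    backlog_trigger = 100    # Backlogged messages before spawning a new connection
    heartbeat_dt = 5         # Heartbeat interval in seconds
    heartbeat_failure = 60   # Seconds without a heartbeat before the connection is considered dead
    min_ops_per_sec = 100    # Send-rate threshold used together with backlog_trigger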
Object Service
The Object Service (OS) provides the traditional file system semantics for manipulating files and directories. It also provides support for arbitrary
metadata to be associated with any object.
os_file
The implementation uses a traditional disk file system to store all metadata and as a result does not work between nodes.
It also has limitations on scalability which can be somewhat overcome by using SSDs to store data.
- base_path
- Base directory for storing object metadata.
- lock_table_size
- Size of the fixed lock table for controlling object updates. The object name is hashed and the hash is taken modulo the table size.
This lock is then used to ensure atomic object updates. A typical table size is 1000.
- max_copy
- This controls the maximum size of an attribute when doing attribute copies.
- hardlink_dir_size
- Hard links are created in a special directory location and assigned random names. This controls how many subdirectories the
random names are distributed across.
- authz
- Authorization framework to load
- authn
- Authentication framework to load
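A sketch of an [os_file] section; the base path and the authn/authz section names are hypothetical, the lock table size follows the typical value mentioned above, and the remaining values are illustrative:

    [os_file]
    base_path = /var/lio/osfile   # Hypothetical directory for object metadata
    lock_table_size = 1000        # Typical fixed lock table size
    max_copy = 10Mi               # Maximum attribute size for attribute copies (illustrative)
    hardlink_dir_size = 256       # Subdirectories used for hard link names (illustrative)
    authz = authz_section         # Hypothetical authorization framework section
    authn = authn_section         # Hypothetical authentication framework section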
os_remote_client
This is the Remote OS driver designed to run on the client. It overcomes the multi-client issues with os_file and is designed to be run in
conjunction with the os_remote_server plugin.
- authn
- Authentication framework to load
- timeout
- Generic timeout for any command.
- heartbeat
- Interval between sending heartbeat messages to the remote OS server.
- remote_address
- Address of the remote OS server
- max_stream
- Maximum packet size to send in any message. Larger messages are broken into multiple packets and streamed.
- stream_timeout
- How often to send stream messages for ongoing communication streams.
- spin_interval
- Controls how often to send ping messages for sensitive operations.
For a few commands, mainly object removal, it is advantageous to aggressively send ping messages between
the client and server to detect if the client has died. For example, issuing a “lio_rm” and pressing Ctrl-C
after realizing your mistake. Once the remote OS server has received the command it will diligently continue removing
files even if the client has died. To overcome this the client sends a ping message every spin_interval seconds,
and if the server hasn’t received a ping message in spin_fail seconds it will automatically abort the operation.
As stated earlier this is used for just a few commands.
- spin_fail
- Abort the command if no ping messages have occurred in the last spin_fail seconds. See spin_interval for more information.
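A sketch of an [os_remote_client] section; the remote address and the authentication section are hypothetical, and the timing and size values are illustrative:

    [os_remote_client]
    authn = authn_section                        # Hypothetical authentication framework section
    timeout = 60                                 # Generic command timeout (illustrative)
    heartbeat = 60                               # Interval between heartbeats to the remote OS server
    remote_address = tcp://os.example.org:6711   # Hypothetical remote OS server address
    max_stream = 10Mi                            # Larger messages are broken into packets and streamed
    stream_timeout = 60                          # How often to send stream messages
    spin_interval = 1                            # Ping interval for sensitive operations such as removal
    spin_fail = 30                               # Abort if no ping is received for this many seconds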
os_remote_server
This is the server side to allow clients to share a single os_file instance.
- address
- Address to bind to for client connections.
- ongoing_interval
- How often to check for dead clients and release any resources they may hold, for example open files.
- max_stream
- Maximum packet size to send in any message. Larger messages are broken into multiple packets and streamed.
- os_local
- This is the section that defines the local OS to run which handles the actual object service requests.
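A sketch of the matching [os_remote_server] section, delegating to a local os_file instance; the bind address is hypothetical and the values are illustrative:

    [os_remote_server]
    address = tcp://*:6711    # Hypothetical bind address for client connections
    ongoing_interval = 60     # How often to check for dead clients (illustrative)
    max_stream = 10Mi         # Larger messages are broken into packets and streamed
    os_local = os_file        # Section defining the local OS that services the requests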
Resource Service
The Resource Service is responsible for mapping resource requests to physical resources and monitoring the health of physical resources.
There are a couple of resource services implemented with the core being rs_simple which does most of the heavy lifting.
Each resource is defined in a [rid] section. The RS consumes all [rid] sections and generates the resource list. Each [rid] section
has two mandatory fields, rid_key and ds_key, with all others being optional. Key/value pairs other than those listed below
are classified as resource attributes and can be used to partition resources and create resource classes.
- rid_key
- Unique identifier for this resource. This is an immutable identifier and should be unique for all resources.
- ds_key
- This is the data service key and is used to access the allocation. For IBP it has the format “host:port/RID”.
- status
- Current state of the resource:
- 0 - Resource is up and available for use.
- 1 - Ignore this resource. Easy way to disable a resource in the config file without having to completely remove it
- 2 - Resource doesn’t have enough free space.
- 3 - Can’t connect to the RID.
- space_used
- Used space on the resource. In most cases this is obtained directly from the resource.
- space_free
- Free space on the resource. In most cases this is obtained directly from the resource.
- space_total
- Total amount of space on the resource. In most cases this is obtained directly from the resource.
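A sketch of a [rid] section follows; the identifier, host, and port are hypothetical, and the last key shows an arbitrary attribute that could be used to build resource classes:

    [rid]
    rid_key = 42                           # Hypothetical unique, immutable resource identifier
    ds_key = depot1.example.org:6714/42    # Hypothetical IBP "host:port/RID" data service key
    status = 0                             # 0 = up, 1 = ignore, 2 = not enough free space, 3 = unreachable
    rack = r1                              # Hypothetical attribute for partitioning resources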
rs_simple
This is the main RS implementation and handles data requests. The remote client and server RS implementations rely on this service to do the actual work.
- fname
- File containing resources for monitoring.
- dynamic_mapping
- If 1 then data requests are automatically remapped if the resource moves to a different physical machine or is accessible via a
different interface than what was originally used on creation. Additionally any changes to the fname file cause it to be reloaded.
- check_interval
- Interval between resource health and space checks in seconds.
- check_timeout
- Maximum amount of time to wait for a resource query before timing out. Set this value to 0 to disable checks.
- min_free
- Minimum amount of free space a resource must have to be considered for use.
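A sketch of an [rs_simple] section; the file name and values are illustrative:

    [rs_simple]
    fname = /etc/lio/rid.cfg    # Hypothetical file containing the [rid] entries to monitor
    dynamic_mapping = 1         # Remap requests if a resource moves; reload fname on changes
    check_interval = 60         # Seconds between health and space checks (illustrative)
    check_timeout = 30          # Resource query timeout; 0 would disable checks (illustrative)
    min_free = 10Gi             # Minimum free space for a resource to be considered (illustrative)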
rs_remote_client
This is designed to communicate with the rs_remote_server to get the resource list and status changes.
- child_fname
- This is the file that the child RS, as defined by rs_local, monitors for changes.
- remote_address
- Remote RS server address
- dynamic_mapping
- If 1 then data requests are automatically remapped if the resource moves to a different physical machine or is accessible via a
different interface than what was originally used on creation.
- check_interval
- This controls how long to wait for RS changes to occur on the remote RS server. A good value is an hour (3600 secs). Any changes
on the remote server are immediately pushed back down. This interval just controls how often we send an update request, not how
often we get an update.
- rs_local
- This is the section that defines the local RS to run which handles the actual resource requests. The configuration file is retrieved from the remote RS server
and stored in the child_fname which the local RS should be configured to monitor. The local RS should also have the health and space
checks disabled since these are performed by the remote RS server.
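A sketch of an [rs_remote_client] section paired with a local child RS; the addresses, file names, and child section name are hypothetical. The child monitors child_fname and has its checks disabled, as recommended above:

    [rs_remote_client]
    child_fname = /var/lio/rid_child.cfg          # File the child RS monitors; populated from the remote server
    remote_address = tcp://rs.example.org:6713    # Hypothetical remote RS server address
    dynamic_mapping = 1
    check_interval = 3600                         # Hourly update requests; changes are still pushed immediately
    rs_local = rs_simple_child                    # Hypothetical section defining the local (child) RS

    [rs_simple_child]
    fname = /var/lio/rid_child.cfg                # Monitor the file written by the remote client
    check_timeout = 0                             # Disable health/space checks; the remote server performs them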
rs_remote_server
This provides a centralized Resource Service which all clients connect to. The actual resource service monitoring is performed by the rs_local.
Any changes it detects are immediately pushed out to all clients.
- address
- Address to bind to for client connections.
- rs_local
- This is the section that defines the local RS to run which handles the actual resource requests. The local RS should have the health and space checks
enabled in order for resource changes to get propagated to clients.
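A sketch of an [rs_remote_server] section; the bind address is hypothetical. The local RS keeps its health and space checks enabled so changes propagate to clients:

    [rs_remote_server]
    address = tcp://*:6713    # Hypothetical bind address for client connections
    rs_local = rs_simple      # Local RS that performs the actual monitoring, with checks enabled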