Data Specification Reference
This page documents the data.yml format used by Floability.
Matrix Example and Impact
From example/matrix-multiplication/data/data.yml:
default_profile: default
data_profiles:
default:
data:
- source: "backpack://data/matrix_A.npy"
target_location: "data/matrix_A.npy"
- source: "backpack://data/matrix_B.npy"
target_location: "data/matrix_B.npy"
Impact:
- Before your notebook runs, Floability resolves these files and stages them in the workflow environment.
- Your code can read predictable local paths (for example
data/matrix_A.npy) without source-specific logic. - The same workflow can switch to another profile later (for example S3 or Pelican) without notebook changes.
Top-Level Structure
A data spec supports both modern and legacy top-level profile keys:
schema_version: 1.0 # optional
default_profile: default # optional
# Preferred key
data_profiles:
default:
policy: ... # optional
data: ... # required per profile
# Legacy key (still supported)
profiles:
default:
policy: ...
data: ...
Notes:
data_profilesis preferred in new files.profilesis kept for backward compatibility.- Floability uses
default_profilewhen present; otherwise it falls back to the first profile.
Profile Schema
Each profile has:
policy(optional): operation and verification behavior.data(required): list of data items.
Policy Keys
Supported policy keys and defaults:
run_operation:fetch|check|verify(defaultfetch)verification_type:size_only|strict(defaultsize_only)retry_attempts: integer (default0)timeout: integer seconds (default30)size_tolerance_bytes: integer (default10)
Example:
policy:
run_operation: fetch
verification_type: strict
retry_attempts: 2
timeout: 60
size_tolerance_bytes: 64
Data Item Schema
Each item in data: must include:
sourceorsourcestarget_location(or legacytarget_path)
Supported item keys:
name(optional)source(string)sources(list of source entries)source_type(optional, inferred if omitted)source_object_type(optional; for object/directory semantics where supported)target_location(preferred)target_path(legacy alias)target_prefix(optional; absolute staging prefix override)expected_size(optional)checksum(optional; typicallysha256:<hex>)content_type(optional; reserved)
Single-source Item
- name: sample_csv
source: backpack://data/samples/sample.csv
target_location: data/samples/sample.csv
expected_size: 43210
checksum: sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
Multi-source Fallback Item
When sources is provided, entries are attempted in order until one succeeds.
- name: file_a
sources:
- source: pelican://server.example.org/path/file_a.bin
- source: s3://my-bucket/file_a.bin
- source: backpack://data/file_a.bin
target_location: data/file_a.bin
For each sources[] entry:
sourceis required.source_typeis optional.source_object_typeis optional.
Source Types
Floability supports these source types in current implementation:
backpackfshttps3pelicanmulti(forsourcesaggregation)
Inference behavior:
backpack://...is treated asbackpack.http://...andhttps://...are treated ashttp.s3://...is treated ass3.osdf://...is currently handled via Pelican logic.
Minimal Working Spec
data_profiles:
local_data:
data:
- source: backpack://data/sample.csv
target_location: data/sample.csv
Complete Example
schema_version: 1.0
default_profile: backpack-data
data_profiles:
backpack-data:
policy:
retry_attempts: 0
timeout: 30
size_tolerance_bytes: 10
run_operation: fetch
verification_type: strict
data:
- name: sample_csv
source_type: backpack
source: data/samples/sample.csv
expected_size: 43210
checksum: sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
target_location: data/samples/sample.csv
pelican-data:
policy:
retry_attempts: 2
timeout: 60
size_tolerance_bytes: 64
run_operation: verify
verification_type: size_only
data:
- name: sample_csv
sources:
- source_type: pelican
source: pelican://server.example.org:443/datasets/samples/sample.csv
- source_type: backpack
source: data/samples/sample.csv
expected_size: 43210
checksum: sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
target_location: data/samples/sample.csv
CLI Integration
floability data
floability data --mode check --data-spec <path-to-data.yml>
Supported options:
--mode check|fetch|verify(defaultcheck)--data-spec--backpack--check-details--verbose--force-fetch--data-profile--data-cache-mode off|symlink|hardlink|copy(defaultoff)--data-cache-dir--force-data-cache--fingerprint-mode meta|sample|strict(defaultmeta)--base-dir
floability run / floability execute
These commands also consume data-spec options (including --data-spec, --data-profile, and cache flags). In these flows, operation defaults to profile run_operation with fallback to fetch.
Caching Behavior
Cache controls:
--data-cache-mode:off,symlink,hardlink,copy--data-cache-dir: explicit cache location--force-data-cache: rebuild cache entries
Default cache base when not overridden:
<base-dir>/floability-data-cache
Loader Validation Rules
The loader validates:
- Profile exists and has a non-empty
datalist. - Every item has
sourceor validsources. - Every item has
target_locationortarget_path. - Every
sources[]entry definessource.