Tips & Best Practices for the Study File Format API

This page lists some tips and best practices for using the Study File Format (SFF) API.

Use Programmatic Consumption

Best Practice: Use code to access data from the SFF API, rather than trying to read the files manually.

The SFF is designed for programmatic access. The manifest file in the SFF package describes the CSV files and the metadata for each column, and each Study Data section has its own block in the manifest.

The Study Design section is organized separately within the manifest file. Item Definition data is included in the Clinical Data section because each Item is a column in the CSV file. Treat the manifest as the source of truth for SFF metadata and schema.
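As a minimal sketch of treating the manifest as the schema source, the following Python builds a per-file column lookup from a manifest. The manifest layout and every key name used here ("clinical_data", "filename", "columns", "name", "type") are assumptions for illustration only, so check them against the actual manifest in your package.

```python
import json

# Hypothetical manifest fragment. The key names ("clinical_data", "filename",
# "columns", "name", "type") are assumptions, not the documented SFF schema.
MANIFEST_JSON = """
{
  "clinical_data": [
    {
      "filename": "form_A.csv",
      "columns": [
        {"name": "Row ID", "type": "string"},
        {"name": "item_1", "type": "integer"}
      ]
    }
  ]
}
"""


def columns_by_file(manifest, section):
    """Map each CSV filename in a section to its column metadata, keyed by column name."""
    schema = {}
    for block in manifest.get(section, []):
        schema[block["filename"]] = {col["name"]: col for col in block["columns"]}
    return schema


manifest = json.loads(MANIFEST_JSON)
schema = columns_by_file(manifest, "clinical_data")
print(schema["form_A.csv"]["item_1"]["type"])  # integer
```

Downstream code can then consult this lookup for column types instead of inferring them from the CSV contents.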

Column Headers

SFF column headers are designed to be machine-readable. JSON attributes are not guaranteed to appear in a specific order, so parse the columns programmatically and look attributes up by name rather than by position. The only exception is the itemgroups array in the Clinical Data section, which maintains the order in which each Item Definition appears within the item group. This prevents using the same Item Definition across multiple item group definitions.
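The difference between unordered attributes and the ordered itemgroups array can be sketched as follows. Only the "itemgroups" array name comes from the text above. The inner structure ("name", "items") and the column attributes are hypothetical.

```python
import json

# Two hypothetical column descriptions with the same attributes in a different order.
first = json.loads('{"name": "item_1", "type": "integer"}')
second = json.loads('{"type": "integer", "name": "item_1"}')

# Key-based access gives the same answer regardless of attribute order.
same_by_key = first["name"] == second["name"] and first["type"] == second["type"]

# Hypothetical Clinical Data block. The "name" and "items" keys are assumptions.
block = json.loads('{"itemgroups": [{"name": "ig_A", "items": ["item_2", "item_1"]}]}')


def item_order(clinical_block):
    """Return (item group, item) pairs in the order the itemgroups array lists them."""
    return [
        (group["name"], item)
        for group in clinical_block["itemgroups"]
        for item in group["items"]
    ]


ordered = item_order(block)
print(same_by_key)
print(ordered)
```

Here the array order is kept exactly as listed, while the column attributes are only ever read by key.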

Efficiently Track Changes

Best Practice: Use the Row ID column to track changes and check the Deletes file for deleted rows.

Each file in the SFF includes a Row ID column, which serves as a unique identifier for each row. Use this column to track any row-level changes.

Treat each change as an upsert: a row that appears in an incremental SFF file has been either added or modified.
The Deletes CSV file lists any rows that were deleted in the latest increment of changes.
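The upsert-then-delete pattern can be sketched in Python as below. This is an illustration under stated assumptions: the exact header spelling "Row ID", the layout of the Deletes file (a single Row ID column here), and the choice to apply deletes after upserts are not confirmed by this page.

```python
import csv
import io

ROW_ID = "Row ID"  # header spelling is an assumption


def apply_increment(store, changed_csv, deletes_csv):
    """Upsert every row from an incremental file, then drop rows listed in Deletes."""
    for row in csv.DictReader(io.StringIO(changed_csv)):
        store[row[ROW_ID]] = row  # insert a new row or overwrite an existing one
    for row in csv.DictReader(io.StringIO(deletes_csv)):
        store.pop(row[ROW_ID], None)  # ignore IDs that were never stored
    return store


# Hypothetical downstream state and increment contents.
store = {
    "r-001": {"Row ID": "r-001", "value": "1"},
    "r-002": {"Row ID": "r-002", "value": "2"},
}
changed = "Row ID,value\nr-001,10\nr-003,3\n"
deletes = "Row ID\nr-002\n"

apply_increment(store, changed, deletes)
print(sorted(store))  # ['r-001', 'r-003']
```

Keying the downstream store on the Row ID makes each increment safe to apply without first checking whether a row already exists.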

Mapping Reference Data

Best Practice: Use the Labels file to link object names to their labels via the Type column, and use the Override Labels file to identify display overrides in object relationships.

The full SFF package includes two CSV files: Labels and Override Labels. These files provide additional data for Labels and Display Override Labels configured in Veeva EDC.

Labels File

In the Labels CSV file, use the Type column to identify the category of the label and map it to the original Name of the object.

For example, a Type of “Event” refers to an event definition label. The event definition name appears in the Event Name column of the SFF files, so you can use the Labels file to map each Event Definition Name to its Event Definition Label.
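A lookup along these lines could be built as follows. The Type and Event Name columns come from the description above. The "Name" and "Label" header spellings and the sample values are assumptions.

```python
import csv
import io

# Hypothetical Labels file content. "Name" and "Label" headers are assumptions.
LABELS_CSV = (
    "Type,Name,Label\n"
    "Event,screening_visit,Screening Visit\n"
    "Form,form_A,Demographics\n"
)


def build_label_index(labels_csv):
    """Index labels by (Type, Name) so the same name under different types stays distinct."""
    index = {}
    for row in csv.DictReader(io.StringIO(labels_csv)):
        index[(row["Type"], row["Name"])] = row["Label"]
    return index


labels = build_label_index(LABELS_CSV)

# A row from an SFF data file, reduced to the column that matters here.
data_row = {"Event Name": "screening_visit"}
event_label = labels.get(("Event", data_row["Event Name"]))
print(event_label)  # Screening Visit
```

Including the Type in the key avoids collisions when, for example, a form and an event share the same definition name.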

Override Labels File

Use the Override Labels file to identify the Display Override Label, if configured in EDC Studio for a specific object relationship.

There are five types of override labels, each linked to a study definition object:

Event Group
Event
Form
Item Group
Item

The Source Definition column maps source definition names to their corresponding target definition names in the Target Definition column.

For example:

The Source Definition value is “form_A” and the Source Label Type is “form”. The Target Definition name is “ig_A”, and the Target Label Type is “itemgroup”.

This means that the Item Group Definition named “ig_A” belongs to the form definition named “form_A” and has a Display Override Label set, whose value is present in the Target Override Label column.
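The example above can be expressed as a lookup keyed on the source and target pair. The column names follow the ones mentioned in this section, although their exact header spelling is an assumption. The override label value and the fallback to a default label are also illustrative assumptions.

```python
import csv
import io

# Hypothetical Override Labels content mirroring the form_A / ig_A example.
OVERRIDES_CSV = (
    "Source Definition,Source Label Type,Target Definition,"
    "Target Label Type,Target Override Label\n"
    "form_A,form,ig_A,itemgroup,Vitals at Screening\n"
)


def build_override_index(overrides_csv):
    """Index override labels by (source type, source name, target type, target name)."""
    index = {}
    for row in csv.DictReader(io.StringIO(overrides_csv)):
        key = (
            row["Source Label Type"],
            row["Source Definition"],
            row["Target Label Type"],
            row["Target Definition"],
        )
        index[key] = row["Target Override Label"]
    return index


def display_label(overrides, key, default_label):
    """Return the override for this relationship if one exists, else the default label."""
    return overrides.get(key, default_label)


overrides = build_override_index(OVERRIDES_CSV)
in_form_a = display_label(overrides, ("form", "form_A", "itemgroup", "ig_A"), "Vitals")
in_form_b = display_label(overrides, ("form", "form_B", "itemgroup", "ig_A"), "Vitals")
print(in_form_a, "|", in_form_b)  # Vitals at Screening | Vitals
```

Because the override applies to a specific relationship, the same item group can show a different label under another form.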

Leverage Created Dates

Best Practice: Process SFF packages in the order provided by the API’s created_date.

Filenames include a published time for each defined increment: every 15 minutes for incremental extractions and every 24 hours for full extractions. The published time can differ from the created_date returned by the API. For example, the created_date may be around 00:30 UTC while the published time shows 00:45, because the package captures the interval between 00:30 and 00:45. Consume SFF ZIP packages in the order of the created_date returned by the API. This matters most for incremental SFF consumption, where the created_date tells you the order in which to apply changes so that downstream systems stay in sync. Learn more about filenames.
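Ordering packages by created_date might look like the sketch below. The response shape, the "name" field, and the ISO 8601 timestamp format are assumptions, so adjust the parsing to whatever the API actually returns.

```python
from datetime import datetime

# Hypothetical package listing, deliberately out of order.
packages = [
    {"name": "incremental_c", "created_date": "2024-01-01T00:30:00+00:00"},
    {"name": "incremental_a", "created_date": "2024-01-01T00:00:00+00:00"},
    {"name": "incremental_b", "created_date": "2024-01-01T00:15:00+00:00"},
]


def in_apply_order(package_list):
    """Sort packages oldest first by created_date, not by the time in the filename."""
    return sorted(package_list, key=lambda p: datetime.fromisoformat(p["created_date"]))


apply_order = [p["name"] for p in in_apply_order(packages)]
print(apply_order)  # ['incremental_a', 'incremental_b', 'incremental_c']
```

Parsing the value into a datetime before sorting avoids relying on string comparison, which can break if timestamps use different offsets.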

Retrieve the Full SFF Package for Setup and Refresh

Best Practice: Retrieve the full SFF package once at setup and again if you need a complete refresh.

Retrieve the full SFF package the first time you enable SFF. If incremental SFF is also enabled, you only need to retrieve the full package initially, unless you require third-party data, which is only available in the full package. If your data becomes out of sync and you need a full refresh, you can retrieve the full package again.

Recommended File Ingestion Methods

We recommend the following methods for ingesting SFF packages in these scenarios: initial enablement, study design changes, system resets, and consuming third-party data.

Initial Enablement

For initial SFF enablement, retrieve the full package at 12:00 PM UTC. It includes data as of 6:00 AM UTC. If you’re using incremental updates, download the incremental package published at 12:15 PM UTC after the initial SFF enablement. Incremental packages are generated and available from 12:15 PM UTC onwards for ongoing data updates.

Design Change & Full Required

When you change an EDC study design and ingest it into CDB, the API sets full_required to “True” in both full and incremental extraction responses, indicating that a full extraction is needed. This ensures all data reflects the latest study design.

After you consume the next full extraction where full_required is set to “True”, the following incremental extraction includes all of the changes since the created_date of that full extraction. This created_date is around 6:00 AM UTC, and the package is published at approximately 12:00 PM UTC.
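A consumer can check this flag before deciding what to ingest next. The sketch below assumes the response is available as a dictionary and accepts full_required as either a boolean or the string “True”, since this page does not state which JSON type the API uses. The returned step names are hypothetical.

```python
def is_full_required(response):
    """Interpret full_required whether it arrives as a boolean or as a string."""
    value = response.get("full_required", False)
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def next_step(response):
    """Pick the next ingestion step from a single extraction response."""
    if is_full_required(response):
        return "consume_next_full"
    return "continue_incremental"


print(next_step({"full_required": "True"}))  # consume_next_full
print(next_step({"full_required": False}))   # continue_incremental
```

After the full extraction is consumed, incremental processing would resume from the created_date of that full extraction, as described above.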

Learn more about SFF and study design changes

System Resets

If the target system runs into issues and needs to reset, consume the most recently published full extraction, as identified by its created_date. Then consume the incremental extraction generated after it, again based on the created_date.
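One way to build a reset plan is sketched below. The "type" and "name" fields, the timestamp format, and the choice to include incrementals whose created_date equals the full extraction's created_date are assumptions.

```python
from datetime import datetime


def created(package):
    """Parse a package's created_date (assumed to be ISO 8601 with an offset)."""
    return datetime.fromisoformat(package["created_date"])


def reset_plan(packages):
    """Return the latest full extraction followed by later incrementals, oldest first."""
    fulls = [p for p in packages if p["type"] == "full"]
    if not fulls:
        return []
    latest_full = max(fulls, key=created)
    later = [
        p for p in packages
        if p["type"] == "incremental" and created(p) >= created(latest_full)
    ]
    return [latest_full] + sorted(later, key=created)


# Hypothetical package listing.
packages = [
    {"name": "full_day1", "type": "full", "created_date": "2024-01-01T06:00:00+00:00"},
    {"name": "incr_old", "type": "incremental", "created_date": "2024-01-02T05:45:00+00:00"},
    {"name": "incr_new_2", "type": "incremental", "created_date": "2024-01-02T06:15:00+00:00"},
    {"name": "full_day2", "type": "full", "created_date": "2024-01-02T06:00:00+00:00"},
    {"name": "incr_new_1", "type": "incremental", "created_date": "2024-01-02T06:00:00+00:00"},
]

plan = [p["name"] for p in reset_plan(packages)]
print(plan)  # ['full_day2', 'incr_new_1', 'incr_new_2']
```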

Consuming Third Party Data

Because third party or non-Veeva EDC data is only available in the full package, we recommend the following to reduce the amount of reprocessing your system must do:

Consume the full package, but ignore the EDC source data when consuming clinical form data. You can identify the EDC source in the Clinical Data section or block of the manifest file.
Operational or reference data files with EDC data might have changed between the full extraction’s creation and publication. Therefore, you must reapply any incremental updates starting from the full extraction’s created_date until the full extraction is published. This includes all system files: Queries, Query Messages, Local Lab Codelists, Local Lab Units, and Deletes.

As of this release, operational and reference files are aggregated across sources and do not have an identifier to indicate which source each record came from.
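The two recommendations above can be sketched as a pair of filters. The "source" key in the manifest blocks, the "EDC" source value, the package fields, and the treatment of the publication time as an exclusive upper bound are all assumptions for illustration.

```python
from datetime import datetime


def non_edc_blocks(clinical_blocks):
    """Keep only Clinical Data blocks whose source is not EDC."""
    return [b for b in clinical_blocks if b["source"].strip().lower() != "edc"]


def incrementals_to_reapply(incrementals, full_created, full_published):
    """Select incrementals created from the full's created_date up to its publication."""
    start = datetime.fromisoformat(full_created)
    end = datetime.fromisoformat(full_published)
    selected = [
        p for p in incrementals
        if start <= datetime.fromisoformat(p["created_date"]) < end
    ]
    return sorted(selected, key=lambda p: datetime.fromisoformat(p["created_date"]))


# Hypothetical manifest blocks and incremental listing.
blocks = [
    {"filename": "form_A.csv", "source": "EDC"},
    {"filename": "lab_vendor.csv", "source": "vendor_x"},
]
incrementals = [
    {"name": "incr_0545", "created_date": "2024-01-01T05:45:00+00:00"},
    {"name": "incr_1145", "created_date": "2024-01-01T11:45:00+00:00"},
    {"name": "incr_0600", "created_date": "2024-01-01T06:00:00+00:00"},
    {"name": "incr_1200", "created_date": "2024-01-01T12:00:00+00:00"},
]

third_party_files = [b["filename"] for b in non_edc_blocks(blocks)]
reapply = [
    p["name"]
    for p in incrementals_to_reapply(
        incrementals, "2024-01-01T06:00:00+00:00", "2024-01-01T12:00:00+00:00"
    )
]
print(third_party_files)  # ['lab_vendor.csv']
print(reapply)            # ['incr_0600', 'incr_1145']
```

Because operational and reference files carry no source identifier, this filter applies only to clinical form data. Those system files still need the incremental updates reapplied in full.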