This document discusses the process(es) for handling CSV data that is submitted to NACC by a study or collaborating organization.
The data in these CSV files is processed for one or both of these purposes:
with the processing involving these steps
These are discussed in more detail below.
sequenceDiagram
actor source as Data<br/>Submitter
participant ingest as Ingest<br/>Project
participant lookup as Identifier<br/>Lookup
participant csplitter as Center<br/>Splitter
participant distribution as Distribution<br/>Project
participant ssplitter as Subject<br/>Splitter
participant subject as Subject<br/>Acquisition
participant importer as File<br/>Importer
source ->> ingest: upload
ingest ->>+ lookup: lookup
lookup ->>- ingest: update
alt file has center ID column
ingest ->> csplitter: split
loop each center
csplitter ->> distribution: distribute
alt file has subject ID column
distribution ->> ssplitter: split
loop each subject
ssplitter ->> subject: distribute
subject ->> importer: import
end
end
end
end
As the first step, a CSV containing data from all centers is uploaded to the study ingest. The study ingest is typically a study-specific ingest project in the group for the organization providing the data.
An example is NCRAD, which provides several data streams. There is a group for NCRAD, with a separate ingest project for each data stream. This is the process described above.
On upload, ID transformations may be required to ensure the CSV has the NACCID and ADCID columns:
If this diagram is not rendered properly, view this document in the repository.
flowchart LR
subgraph study/ingest
A@{ shape: rect, label: "CSV" } -- Identifier Lookup --> B@{ shape: rect, label: "CSV\nw/ADCID" }
end
sequenceDiagram
actor source as Data<br/>Submitter
participant ingest as Ingest<br/>Project
participant lookup as Identifier<br/>Lookup
source ->> ingest: upload
ingest ->>+ lookup: lookup
lookup ->>- ingest: update
identifier-lookup: used to ensure both ADCID and NACCID are available for splitting purposes
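The lookup step can be pictured with a minimal sketch. It is not the identifier-lookup gear itself: the key column `submitter_id`, the lowercase output column names `adcid` and `naccid`, and the in-memory `id_map` are all assumptions made for illustration. The sketch only shows the idea of adding the two ID columns and setting aside rows that cannot be matched.

```python
import csv
import io


def add_identifiers(csv_text, id_map, id_column="submitter_id"):
    """Return (updated CSV text, list of rows that could not be matched).

    id_map is a hypothetical mapping from the submitter's ID to an
    (ADCID, NACCID) pair; the real lookup source is not described here.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or [])
    for name in ("adcid", "naccid"):
        if name not in fieldnames:
            fieldnames.append(name)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()

    unmatched = []
    for row in reader:
        ids = id_map.get(row[id_column])
        if ids is None:
            # Keep unmatched rows aside so they can be reported.
            unmatched.append(row)
            continue
        row["adcid"], row["naccid"] = ids
        writer.writerow(row)
    return out.getvalue(), unmatched
```

Rows without a match are returned separately rather than written with empty IDs, since later steps split on these columns.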
The primary variation is where data is ingested. Ordinarily, data is uploaded to an ingest project in Flywheel; in other cases, data is ingested into AWS S3. In that scenario, transformation processes may occur in AWS before the data is transferred to Flywheel.
An example of this is SCAN, where data is transferred into the S3 bucket and split by center, and each resulting file is written to a center-specific project in Flywheel. This uses a different set of gears up to the point where the files are saved in Flywheel.
The next step is to split rows of the CSV by ADCID and write center-specific rows to a new CSV in a project in the group corresponding to the ADCID.
flowchart LR
subgraph study/ingest
B@{ shape: rect, label: "CSV\nw/ADCID" }
end
subgraph center/distribution
B -- Split by ADCID --> C@{ shape: processes, label: "center\nCSVs" }
end
sequenceDiagram
participant ingest as Ingest<br/>Project
participant csplitter as Center<br/>Splitter
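participant distribution as Distribution<br/>Project
participant ssplitter as Subject<br/>Splitter
participant subject as Subject<br/>Acquisition
participant importer as File<br/>Importer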
ingest ->> csplitter: split
alt file has center ID column
loop each center
csplitter ->> distribution: distribute
distribution ->> ssplitter: split
activate ssplitter
alt file has subject ID column
loop each subject
ssplitter ->> subject: distribute
subject ->> importer: import
end
end
deactivate ssplitter
end
end
csv-center-splitter: uses the ADCID to split the CSV and save the corresponding rows to a CSV file in a project of the center group
The CSV center splitter supports batching the splitting process to avoid scenarios where a large number of downstream jobs are created.
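As a rough illustration of splitting by center and batching, the sketch below groups rows by a center column and then yields the center IDs in fixed-size batches. The column name `adcid`, the function names, and the batching strategy are assumptions; the document does not describe how csv-center-splitter implements batching.

```python
from collections import defaultdict
from itertools import islice


def split_by_center(rows, center_column="adcid"):
    """Group row dictionaries by the value of the center column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[center_column]].append(row)
    return dict(groups)


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

A caller could write the per-center CSVs for one batch, wait for the downstream jobs triggered by those files to finish, and only then move to the next batch, which bounds the number of jobs in flight.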
The final step is to split the rows of the center-specific CSV by NACCID, creating a JSON file attached to each subject. The form-importer is then run to load the JSON into the file's custom information.
flowchart LR
subgraph center/distribution
C@{ shape: processes, label: "center\nCSVs" } -- Split by NACCID --> D@{ shape: processes, label: "participant\nJSON" }
D -- import --> E@{ shape: processes, label: "file\nmetadata" }
end
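The split by NACCID can be sketched as follows. This is an illustration only: the column name `naccid` and the choice to produce one JSON document per CSV row are assumptions, and attaching the result to a subject in Flywheel is left out.

```python
import csv
import io
import json


def rows_to_subject_json(csv_text, subject_column="naccid"):
    """Map each subject ID to a list of JSON strings, one per CSV row."""
    documents = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject_id = row[subject_column]
        # sort_keys gives a stable serialization for comparison and testing.
        documents.setdefault(subject_id, []).append(
            json.dumps(row, sort_keys=True)
        )
    return documents
```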
The form-importer allows specifying a prefix for importing values into the custom information.
The convention is that for form data the prefix is form.json, while for other files the prefix is raw.
Since custom information for a file is denoted with the prefix file.info, you may see these prefixes as file.info.form.json and file.info.raw.
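To make the prefix convention concrete, here is a small sketch of how a dotted prefix could map to nested keys in a file's custom information. The helper `nest_under_prefix` is hypothetical and is not part of form-importer; it assumes each dot-separated segment of the prefix becomes one level of nesting.

```python
def nest_under_prefix(values, prefix):
    """Wrap values in nested dictionaries, one level per prefix segment."""
    nested = values
    for key in reversed(prefix.split(".")):
        nested = {key: nested}
    return nested
```

Under that assumption, `nest_under_prefix({"score": 5}, "form.json")` gives `{"form": {"json": {"score": 5}}}`, so the value would be addressed as file.info.form.json.score, while the prefix raw places it at file.info.raw.score.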