Rucio Usage
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How can I use Rucio?
Objectives
Become familiar with aspects of Rucio
Use Rucio tags to find specific types of files
Learn how to download or stream files for further use
Getting Started
We can access and run the Rucio client from within eic-shell. From the directory where you have eic-shell:
./eic-shell
rucio whoami
This should print out some information -
email : eicprod@jlab.org
account : eicread
account_type : GROUP
...
We can also check the arguments we can supply to rucio, as well as usage info with:
rucio -h
To use Rucio further, we will need to briefly look at how Rucio organises data.
Datasets and DIDs
Typically, we want to analyse data contained within specific files. Files can be grouped together into datasets, which can themselves be grouped into containers. All three refer to “data”. As such, the term “data identifier” or DID is used in Rucio. A DID is just the name of a single file, dataset or container.
In Rucio, all DIDs follow a naming scheme which is composed of two strings - a scope and a name, formatted as:
scope:name
For ePIC, the scope is always epic, meaning that all of our DIDs look like:
epic:name
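As a minimal sketch of the scope:name convention, the snippet below splits a DID string into its two parts. The helper name split_did is hypothetical and is not part of the Rucio client.

```python
def split_did(did: str) -> tuple[str, str]:
    """Split a DID of the form 'scope:name' into (scope, name)."""
    # Split on the first colon only, so any later colons stay in the name.
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"not a scope:name DID: {did!r}")
    return scope, name


did = "epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+"
scope, name = split_did(did)
print(scope)
print(name)
```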
The name describes the dataset in question and includes details such as the software release used to create the file, the electron and ion beam energies etc.
As an example, consider the DID for the dataset:
epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+
The name here - /RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+, tells us many things about the contents of this dataset. Let’s break this down, examining the component enclosed within each pair of /---/ -
RECO - This tells us that the DID contains reconstructed output file information
26.02.0 - This tells us that the 26.02.0 software release was used, the February 2026 release (version 0).
epic_craterlake - This tells us that the epic_craterlake detector configuration was used in the simulation
EXCLUSIVE - The DID is for a dataset of exclusive physics events
DEMP - This is the specific exclusive process simulated in the dataset, Deeply Exclusive Meson Production, DEMP
DEMPgen-1.2.4 - The simulation in this dataset is based upon input generated by the DEMPgen-1.2.4 event generator
10x130 - The files in this dataset were simulated with 10 GeV electrons on 130 GeV ions. In this case the ions are protons; if the ion species is not specified, it is likely to be an ep simulation.
- The AAxBBB format, where AA is the electron energy and BBB is the ion energy, is typical for quoting beam energies in EIC files. This may be written as AAonBBB too.
q2_10_20 - The files in this dataset are from a simulation input containing events in the Q2 range of 10 to 20 GeV^2
pi+ - Pi+ are generated in this output. This is specific to this DEMP reaction and signifies that it is Deeply Exclusive Pion Production
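The AAxBBB and AAonBBB beam energy labels can be decoded with a short regular expression. This is only a sketch: the helper name parse_beam_energies is hypothetical, and it assumes both energies are written as whole numbers of GeV.

```python
import re

# Accepts "AAxBBB" or "AAonBBB", e.g. "10x130" or "10on130".
# Assumption: both energies are integers in GeV.
_BEAM_RE = re.compile(r"^(\d+)(?:x|on)(\d+)$")


def parse_beam_energies(label: str) -> tuple[int, int]:
    """Return (electron_energy, ion_energy) in GeV from a beam label."""
    match = _BEAM_RE.match(label)
    if match is None:
        raise ValueError(f"not a beam energy label: {label!r}")
    return int(match.group(1)), int(match.group(2))


electron_gev, ion_gev = parse_beam_energies("10x130")
print(electron_gev, ion_gev)
```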
Warning - Not a filepath!
The name of our DID here looks a lot like a filepath; however, it is a flat object and does not have any hierarchy, as we will see in the next section.
Other names may not necessarily contain all of the same information, but as a bare minimum, are likely to tell us something about the physics process simulated and beam conditions, as well as which software release was used. This is reflected in the metadata tags assigned as we will see later.
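To make the breakdown of the example name concrete, the sketch below attaches a label to each /-separated component. The labels in FIELD_LABELS are hypothetical names chosen for illustration, and the positional mapping is only an assumption that holds for names laid out like the example; other names may carry fewer or different components.

```python
# Hypothetical labels, in the order the components appear in the example name.
FIELD_LABELS = [
    "data_tier",
    "software_release",
    "detector_config",
    "physics_category",
    "process",
    "generator",
    "beam_energies",
    "q2_range",
    "particle",
]


def describe_name(name: str) -> dict[str, str]:
    """Pair each component of a DID name with a label, by position."""
    parts = name.strip("/").split("/")
    # zip stops at the shorter sequence, so shorter names are simply
    # labelled as far as they go.
    return dict(zip(FIELD_LABELS, parts))


name = "/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+"
for label, value in describe_name(name).items():
    print(f"{label:18s} {value}")
```

Splitting on / here is purely a convenience for reading the string; it does not imply that Rucio stores the name as a hierarchy.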
Finding DIDs
Now that we know what a DID looks like, how can we find the DID corresponding to the file or dataset that we’re interested in?
Well, we can list DIDs using:
rucio did list scope:name
To begin though, let’s try:
rucio did list --help
As we can see, we can apply filters to our list request if we want. We can also apply wildcards, but we need to be a bit careful due to the warning above. Whilst our DIDs look like Unix file paths, they are not. The structure is flat. We can quickly see this if we try:
rucio did list epic:/RECO/\*
We get an enormous number of DIDs returned! This is every reconstruction related DID available to access right now.
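The effect of a flat namespace can be illustrated locally with Python's fnmatch module, where * also matches across / characters. This is only an analogy under that assumption; it is not how Rucio implements its matching. The two names are the dataset names used in this lesson.

```python
from fnmatch import fnmatchcase

names = [
    "/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+",
    "/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+",
]

# Flat matching: "*" is free to match "/" as well, so every name that
# starts with "/RECO/" is returned, however many "/" follow.
flat_matches = [n for n in names if fnmatchcase(n, "/RECO/*")]
print(len(flat_matches))

# What a directory-style listing of "/RECO/" would show instead:
# only the next path component of each name.
one_level = {n.split("/")[2] for n in names}
print(one_level)
```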
Warning - Check the campaign date!
If you encounter any issues when processing the DID listed earlier, it may be due to the software release version. Remember that campaigns older than ~6 months will not be instantly accessible. Try switching to a more recent campaign version.
Working backwards from the full DID we had earlier, we could include the software release, detector configuration, process and generator to narrow down the list of DIDs:
rucio did list epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/\*
We now have a more manageable list and we can see that we have some different beam energy configurations and Q2 ranges available. Note that we can list just the DID itself by adding the --short option to our did list call.
We could also make slightly better use of wildcards in this command too:
rucio did list epic:*RECO*26.02.0*DEMP*
But as above, this does require some knowledge of what our DID looks like to begin with.
Pin for later:
If we used --short as suggested to just get a list of DIDs, we could pipe this output to a file. Each line would be the full DID for an item which we could potentially make use of.
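A saved list like this is straightforward to read back in Python. The sketch below assumes the file holds one full DID per line and nothing else; the file name dids.txt and the helper read_did_list are hypothetical, and a temporary file stands in for real rucio output.

```python
import tempfile
from pathlib import Path


def read_did_list(path) -> list[str]:
    """Read one DID per line from a text file, skipping blank lines."""
    dids = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            dids.append(line)
    return dids


# Stand-in for a file produced by redirecting the --short listing.
with tempfile.TemporaryDirectory() as tmp:
    listing = Path(tmp) / "dids.txt"
    listing.write_text(
        "epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+\n"
        "\n"
        "epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+\n"
    )
    dids = read_did_list(listing)

print(len(dids), "DIDs read")
```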
As we can see, the DIDs we have in our list now are all datasets. We can check the contents of these datasets too. Let’s pick one of our DIDs and examine the content. We can do this via:
rucio did content list scope:name
e.g.
rucio did content list epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+
Again, a lot of output. We can see, though, that this DID is for a dataset containing a large number of files. We can again use --short to just get this as a list:
rucio did content list --short epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+
We can check where a specific file in our dataset is stored too:
rucio replica list file --protocols root --pfns --rses isopenaccess epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root
list file comment:
Despite the slightly misleading command name above, we can actually just provide a dataset DID here too. If we do so, we will get the location of all files in the dataset in one command, e.g.:
rucio replica list file --protocols root --pfns --rses isopenaccess epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+
You could pipe this to a file for later usage. However, note that multiple replicas may exist for a given file, and all of them would be printed by this command as is. You can check if multiple copies exist via:
rucio rule list --did scope:name
e.g.
rucio rule list --did epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+
This will list where the DID is stored and how many copies exist.
The root://dtn-eic.jlab.org at the start of the output tells us that this particular file is stored on JLab servers. As mentioned at the outset, Rucio works easily across multiple sites; however, the methods which we might use to stream files do not. As such, being able to check where our files are stored is a useful feature.
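If we have many file locations to check, the host can be pulled out of each one with Python's standard urllib.parse module. This is a minimal sketch; the helper name storage_host is hypothetical, and the location string is the one for our example file.

```python
from urllib.parse import urlsplit


def storage_host(pfn: str) -> str:
    """Return the host name from a root:// file location."""
    parts = urlsplit(pfn)
    if parts.scheme != "root" or parts.hostname is None:
        raise ValueError(f"not a root:// location: {pfn!r}")
    return parts.hostname


pfn = (
    "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/"
    "epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/"
    "DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
)
print(storage_host(pfn))
print(urlsplit(pfn).port)
```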
So, we can find DIDs, check what they are and what they contain. To get to this point though, we needed some prior knowledge of what the DID looked like, which isn’t necessarily that helpful for finding something. A much easier approach to finding what we need is to use the metadata tags that are assigned to all DIDs from March 2026 onwards.
Metadata Tags
The following tags are available as of March 2026:
- software_release
  - Software release used in the simulation. Written as a container version tag/simulation campaign naming:
    - vYY.MM.v
    - E.g. v25.06.2 -> June 2025 software container, version 2
- requester_pwg
  - Defines the physics working group (PWG) that the simulated data relates to, options are:
    - edt (exclusive, diffractive and tagging)
    - inclusive
    - jets_hf
    - semi_inclusive
    - ew_bsm
    - other
  - Can be one or more
- q2_min
  - Minimum Q2 value (GeV^2) in the simulation file, entered as a number.
  - Optional tag - Not all simulated files use this
- q2_max
  - Maximum Q2 value (GeV^2) in the simulation file, entered as a number.
  - Optional tag - Not all simulated files use this
- electron_beam_energy
  - Electron beam energy in GeV
- ion_beam_energy
  - Ion/nucleus beam energy in GeV
- is_background_mixed
  - True/false depending upon whether sample includes any background mixing
- ion_species
  - Ion species in the simulation, defaults to p, proton, if not specified
  - Typed as formatted in files, e.g. Au197 for gold, He3 for helium 3 etc. Cu63, H2, Ru96 and p are some other options
- generator
  - MC event generator used to generate the simulated data
  - E.g. Pythia8, Herwig etc
  - Entered as all lower case
    - E.g. dempgen not DEMPgen
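The vYY.MM.v release format described for software_release can be decoded with a small helper. This is a sketch only: the function name parse_software_release is hypothetical, the year is assumed to be 2000 + YY, and the leading v is treated as optional purely for convenience, since the release appears without it inside DID names.

```python
import re

# "vYY.MM.v": two-digit year, two-digit month, then a version number.
_RELEASE_RE = re.compile(r"^v?(\d{2})\.(\d{2})\.(\d+)$")


def parse_software_release(tag: str) -> tuple[int, int, int]:
    """Return (year, month, version) for a release string."""
    match = _RELEASE_RE.match(tag)
    if match is None:
        raise ValueError(f"not a vYY.MM.v release string: {tag!r}")
    year = 2000 + int(match.group(1))  # assumption: all releases are 20YY
    month = int(match.group(2))
    version = int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {tag!r}")
    return year, month, version


print(parse_software_release("v25.06.2"))
```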
As noted on some items in this list, some tags are optional and may not be applied to all datasets. However, the following tags are required for all datasets:
- software_release
- physics_process
- electron_beam_energy
- ion_beam_energy
- is_background_mixed
- ion_species
- generator
Note that as mentioned for the generator, tags are entered in lower case, with the exception of ion species.
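As a sketch of how the required tags could be checked in a script, the snippet below compares a plain dictionary of tags against the required list. The dictionary contents are illustrative values only, the helper missing_required_tags is hypothetical, and how the metadata is retrieved from Rucio is deliberately left out.

```python
# Required tags, copied from the list in this section.
REQUIRED_TAGS = (
    "software_release",
    "physics_process",
    "electron_beam_energy",
    "ion_beam_energy",
    "is_background_mixed",
    "ion_species",
    "generator",
)


def missing_required_tags(metadata: dict) -> list[str]:
    """Return the required tags that are absent from a metadata dict."""
    return [tag for tag in REQUIRED_TAGS if tag not in metadata]


# Illustrative values only, not taken from a real dataset.
example_metadata = {
    "software_release": "v26.02.0",
    "electron_beam_energy": 10,
    "ion_beam_energy": 250,
    "ion_species": "p",
    "generator": "dempgen",
}
print(missing_required_tags(example_metadata))
```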
We can use these tags to filter through the available datasets and identify those of interest to us. For example:
rucio did list --filter 'TAG==*' 'scope:*'
So, as an example, we could list all DIDs with electron beam energies of 10 GeV via:
rucio did list --filter 'electron_beam_energy==10' 'epic:*'
We can also combine tags and filter on several at once, e.g:
rucio did list --filter 'electron_beam_energy==10, ion_beam_energy==250' 'epic:*'
which will return only datasets with 10x250 collisions (10 GeV electrons on 250 GeV ions using the standard ePIC conventions). We can keep adding filters in this manner as we like to really narrow down the DIDs we return with our query.
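When filtering on several tags, it can be convenient to build the filter string from a dictionary. The sketch below only assembles and prints the command line, mirroring the syntax shown above; it does not run rucio. The helper names build_filter and did_list_command are hypothetical.

```python
import shlex


def build_filter(tags: dict) -> str:
    """Join tag==value pairs in the same form as the command above."""
    return ", ".join(f"{key}=={value}" for key, value in tags.items())


def did_list_command(tags: dict, scope: str = "epic") -> list[str]:
    """Assemble the argument list for a filtered 'rucio did list' call."""
    return ["rucio", "did", "list", "--filter", build_filter(tags), f"{scope}:*"]


cmd = did_list_command({"electron_beam_energy": 10, "ion_beam_energy": 250})
# shlex.join adds the quoting needed to paste the command into a shell.
print(shlex.join(cmd))
```

Keeping the arguments as a list rather than one string means they could later be handed to subprocess.run without any extra shell quoting.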
Exercise:
Using tags, find the DIDs of the latest:
- DEMP events in the Q2 range of 3 to 10 for 10 GeV electrons on 250 GeV protons
- Print the full DID and check the number of files in the dataset
Hint - Check the example name we looked at when introducing DIDs in a previous section.
Using DIDs - Downloading or Processing Files
So far we’ve seen how we can find DIDs and check some basic info such as what type of data they point to and where that data is stored. We generally want to do a bit more than that though. Typically we want to find data to use it in some way. For our simulation data, this is usually to analyse it!
We can download any DID, whether it is a container, dataset or file, straightforwardly:
rucio download scope:name
So to download our file from earlier, we just do:
rucio download epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root
By default it will download to our current directory with its original name. In this case, that’s unfortunate because as we noticed earlier, this looks a lot like a UNIX file path. As such, we now have a large number of nested directories to go through before we get to our file!
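Rather than stepping through the nested directories by hand, a recursive search can locate the downloaded file. The sketch below builds a temporary stand-in directory tree, so the exact layout shown is an assumption for illustration and not a statement of what rucio download produces; the helper name find_root_files is hypothetical.

```python
import tempfile
from pathlib import Path


def find_root_files(download_dir) -> list[Path]:
    """Recursively collect every .root file below download_dir."""
    return sorted(Path(download_dir).rglob("*.root"))


file_name = "DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in nested tree; the real layout may differ.
    nested = Path(tmp, "RECO", "26.02.0", "epic_craterlake", "EXCLUSIVE",
                  "DEMP", "DEMPgen-1.2.4", "10x250", "q2_3_10", "pi+")
    nested.mkdir(parents=True)
    (nested / file_name).touch()

    found = find_root_files(tmp)
    depth = len(found[0].relative_to(tmp).parts) - 1

print(found[0].name)
print(depth, "directories deep")
```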
Warning - Do you need the whole dataset?
Think very carefully before downloading a DID. What is it? If it’s a full dataset, do you really need all of the data?
You will rarely need a local copy of a full dataset. It’s usually best to download only a small subset of files to test and run.
We can stream files from a full dataset rather than downloading them as we’ll see in a moment.
It might actually be easier to use xrootd to grab our file, as it’s a bit more intuitive. We do need the file location we found earlier for this though:
xrdcp root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root ./
Here ./ is our current directory. This time, we’ll just get the actual file! We could also specify a new name in place of ./ if desired.
Our full path, including location, as used here, will also be useful if we want to stream our files directly in a script. In ROOT, we could just do:
auto f = TFile::Open("root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root")
or using python and uproot:
import uproot
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
root_file = uproot.open(file_path)
or directly with PyROOT:
import ROOT
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
file = ROOT.TFile.Open(file_path, "READ")
Testing File Streaming
We can quickly check the three methods above work.
To test the ROOT approach, we can make a macro called Test.C and add:
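void Test(){
  // Stream the file over XRootD rather than downloading it first
  auto f = TFile::Open("root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root");
  if (!f || f->IsZombie()) { // stop early if the file could not be opened
    std::cerr << "Could not open file" << std::endl;
    return;
  }
  auto tree = f->Get<TTree>("events");
  Long64_t nEntries = tree->GetEntries(); // read the number of entries in the tree
  std::cout << nEntries << " events in tree" << std::endl;
}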
If we run this script with root Test.C, it should stream our file and print the number of entries (the number of events) in the file.
Similarly in Python, we can make Test.py and use Uproot:
import uproot
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
root_file = uproot.open(file_path)
print(root_file['events'].num_entries, "events in this tree.")
or directly using PyROOT:
import ROOT
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
file = ROOT.TFile.Open(file_path, "READ")
print((file.Get("events")).GetEntries(), "events in this tree.")
All three approaches should yield the same result.
Key Points
Rucio works with datasets and Data Identifiers (DIDs)
ePIC DIDs may look or be formatted like a nested filepath, but they are flat
Tags can be used to quickly sort and find data of interest
Once you find the file location with Rucio, you can use xrootd to download or stream it too