Elegant way to refer to files in data science project
I'm maintaining an economics data project based on the DataDriven Cookiecutter, which I feel is a great template.
Separating your data folders from your code seems like an advantage to me: it lets you treat your work as a directed flow of transformations (a 'DAG'), starting with immutable initial data and moving through interim to final results.
Initially, I reviewed pkg_resources, but decided against it (long syntax, and I fell short of understanding how to create a package) in favour of my own helper functions/classes that navigate through the directory tree.
Essentially, the helpers do two things:
1. Persist the project root folder and some other paths in constants:
from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Returns root folder for repository.

    Current file is assumed to be at:
    <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()

DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')
This is similar to what you do with DATA_DIR. A possible weak point is that here I manually hardcode the relative location of the helper file in relation to the project root. If the helper file is moved, this needs to be adjusted. But hey, this is the same way it is done in Django.
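For reference, the settings.py generated by Django's startproject pins the project root in the same way (recent Django versions use pathlib; older ones build the same path with os.path):

# excerpt from a settings.py generated by Django's startproject
from pathlib import Path

# settings.py is assumed to sit at <repo_root>/<project>/settings.py,
# so two parent hops reach the repository root
BASE_DIR = Path(__file__).resolve().parent.parent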
2. Allow access to specific data in raw, interim and processed folders.
This can be a simple function that returns a full path for a filename in a given folder, for example:
def interim(filename):
    """Return path for *filename* in the 'data/interim' folder."""
    return str(ROOT / 'data' / 'interim' / filename)
In my project I have year-month subfolders for the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that resolve specific paths, like:
from .helper import ProcessedCSV, InterimCSV

# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()

# later in code
path = ProcessedCSV(2018, 4).path(freq='q')
The code for the helper is here. Additionally, the classes create subfolders if they are not present (I want this for unit tests in a temp directory), and there are methods for checking that files exist and for reading their contents.
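The real classes live in the linked helper module; purely to illustrate the idea (the layout, file name and method bodies below are my guesses, not the actual code), an InterimCSV-like class could look roughly like this:

from pathlib import Path

ROOT = Path(__file__).parents[3]  # project root, as in the constants above

class InterimCSV:
    """Reference to data/interim/<year>/<month>/tab.csv (illustrative sketch)."""
    subfolder = 'interim'
    filename = 'tab.csv'

    def __init__(self, year, month):
        self.year = year
        self.month = month

    def folder(self):
        folder = ROOT / 'data' / self.subfolder / str(self.year) / f'{self.month:02d}'
        folder.mkdir(parents=True, exist_ok=True)  # handy for unit tests in a temp dir
        return folder

    def path(self):
        return self.folder() / self.filename

    def exists(self):
        return self.path().exists()

    def text(self):
        return self.path().read_text()

ProcessedCSV would be similar, with path() taking a freq argument to pick the file for a given frequency.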
In your example, you can easily have the root directory fixed in setting.py, but I think you can go a step further with abstracting your data. Currently data_sample() mixes file access and data transformations, which is not a great sign, and it also uses a global name, another bad sign for a function. I suggest you consider the following:
import os
import random

import pandas as pd

# keep this in setting.py
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)
As long as you are not committing lots of data, and you make clear the difference between snapshots of the uncontrolled outside world and your own derived data, (code + raw) == state. It is sometimes useful to keep raw append-only-ish and to think about symlinking steps like raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01, or some similar pattern, to establish a "use latest" input structure.
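A rough sketch of that pattern (the helper name and folder layout are illustrative, not from the original project): copy each incoming file into the append-only store with a timestamp suffix, then re-point a symlink under raw/ at the latest snapshot.

import datetime
import shutil
from pathlib import Path

def snapshot_and_link(src, raw_dir, appendonly_dir):
    """Store *src* as an append-only, timestamped snapshot and make
    raw/<name> a symlink to it ("use latest")."""
    src = Path(src)
    appendonly_dir = Path(appendonly_dir)
    appendonly_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime('%Y%m%dT%H%M%S')
    snapshot = appendonly_dir / f'{src.name}.{stamp}'
    shutil.copy2(src, snapshot)
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    link = raw_dir / src.name
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(snapshot)  # on Windows, creating symlinks may need extra privileges
    return link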
Try to clearly separate config settings (my_project/__init__.py, config.py, settings.py or whatever) that might need to change depending on the environment (imagine swapping out the filesystem for a blob store, say). setup.py usually sits at the top level, my_project/setup.py, and anything related to runnable code (not docs; not sure about examples) goes in my_project/my_project. Define one _mydir = os.path.dirname(os.path.realpath(__file__)) in one place (config.py) and rely on that to avoid refactoring pain.
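A minimal config.py in that spirit (the directory names below are placeholders for whatever your layout actually is) might be:

# config.py - the only module that knows where it lives on disk
import os

_mydir = os.path.dirname(os.path.realpath(__file__))

# everything else is derived from _mydir, so moving the project
# (or swapping the local fs for a blob store behind these names)
# only ever touches this one file
DATA_DIR = os.path.join(_mydir, os.pardir, 'data')
RAW_DIR = os.path.join(DATA_DIR, 'raw')
PROCESSED_DIR = os.path.join(DATA_DIR, 'processed')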