Constraint database
I found a sophisticated solution to achieve more than what I thought; talking about checking data consistency. Apparently this is what we would call test-driven data analysis
So now with this implementation we are bound to Python, and Pandas, but fortunately, not only. We can even check data consistency in MySQL, PostgreSQL ... tables.
The plus I did not think about, is that we can infer rules based on sample data. This could be helpful for setting rules.
This is why there is tdda.constraints.verify_df
and the tdda.constraints.discover_df
.
As far as I read about, It does not propose a solution for checking (a weaker) consistency on last (n) files. Something I thought about that we could call batch files consistency, that only ensures a rule satisfaction for some set of runs (last n runs) and not all data. It only acts on single files, it needs a higher level wiring to be able to condition (n) files that arrive successively.
For more: https://tdda.readthedocs.io/en/latest/constraints.html#module-tdda.constraints
assertCSVFilesCorrect
Checks a set of files in a directory, same is possible for Pandas dataframes, etc.
From the official documentation:
The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relation databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records.
ps: I am still open to other solutions, let me know as I imagine this is a use case for any ETL solution.
I also open a bounty to further enrich responses.
You can also look into SQL transactions
. A transaction consists of one or more statements, which are asked to be executed by a single user or an application. They can read or even modify data in a database.
START TRANSACTION
Do DB stuff, check if constraints are violated
COMMIT
You can specify certain constrains and use ROLLBACK
if one of these constraints is violated. The rollback can can be explicitly coded by the developer but can be
thrown from the system as well. (e.g. when an error appeared that is not handled explicitly by the developer, or when executing a trigger). Transactions may not be in the way of each other. They have to be executed in „isolated“ manner. several concurrent transactions
must produce the same results in the data as those same transactions executed sequentially, in some (unspecified) order. Since all modern DBMS guarantee ACID properties when it comes to transactions, the execution of transactions is reliable, so the state of your database shouldn't have any inconsistencies in it.
Not sure if this is what you mean, but maybe it helps.