About
Main Components
1. Node
What
A single logical unit of computation. This could include generating data, applying transformations, training models, or generating results.
Each node extends the BaseNode class from the fastpipeline package. The following diagram shows the relevant methods and attributes of the BaseNode class:
classDiagram
BaseNode <|-- Node
BaseNode : +config: Dict
BaseNode: +__init__( config: Dict )
BaseNode: +hash( ) -> str
BaseNode: +run( input: Dict ) -> Dict
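As a rough illustration of the diagram, a node might look like the sketch below. The BaseNode defined here is only a stand-in so the example runs on its own; it is not the fastpipeline implementation, and the hash method is left out. ScaleNode, the factor config key, and the values input key are hypothetical names.

```python
from typing import Dict, Optional


class BaseNode:
    """Stand-in for fastpipeline's BaseNode (shape assumed from the diagram)."""

    def __init__(self, config: Optional[Dict] = None):
        self.config = config or {}

    def run(self, input: Dict) -> Dict:
        raise NotImplementedError


class ScaleNode(BaseNode):
    """Hypothetical node: multiplies every value by a configured factor."""

    def run(self, input: Dict) -> Dict:
        factor = self.config["factor"]
        # Input and output are plain dicts, as in the diagram.
        return {"values": [v * factor for v in input["values"]]}


node = ScaleNode({"factor": 2})
result = node.run({"values": [1, 2, 3]})
print(result)
```

Keeping each node to a config dict plus a single run method is what makes it possible to identify a run from the config, the class source, and the input alone.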
Why
- This structure forces the user to organize experiments into logical chunks, each responsible for a specific task
- It allows us to uniquely identify a node run based on:
  - config
  - source code for the node class
  - input to the run function
- Since we can identify runs uniquely, it allows us to reuse existing results from previous runs
- Unlike sklearn pipelines, the input and output of the run method here are dictionaries, allowing data other than arrays or dataframes
2. Pipeline
Runs a series of nodes consecutively on an input, passing the output of each node to the next.
graph LR
subgraph pipeline:
Node1-->Node2-->Node3
end
Input:::input-->Node1
Node3-->Output:::output
Node1 <--> disk[Disk Serialization]:::disk
Node2 <--> disk[Disk Serialization]:::disk
Node3 <--> disk[Disk Serialization]:::disk
classDef input fill:#f96;
classDef output fill:#99ff99;
classDef disk fill:#fcf787;
It is responsible for:
- Collecting all the node objects
- Calling the run method for each of the node objects
- Identifying, based on the node hash and input hash, whether a run on the same data has happened previously and, if so, reusing the saved outputs
- Storing all the intermediate results (if not already saved): config, output and source code
- Logging for each stage
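The responsibilities above can be sketched as a simple loop. This is a minimal illustration, not fastpipeline's code: it keeps results in an in-memory dict instead of on disk, uses a plain node_id string in place of the node hash, and assumes the input dicts are JSON-serializable. The names run_pipeline, double, and add_one are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

# In-memory stand-in for the on-disk store (assumption for this sketch).
_cache: Dict[Tuple[str, str], Dict] = {}


def _md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def run_pipeline(nodes: List[Tuple[str, Callable[[Dict], Dict]]], data: Dict) -> Dict:
    """Run each (node_id, run_fn) pair in order, reusing saved outputs."""
    for node_id, run_fn in nodes:
        # A run is identified by the node and a hash of its input.
        key = (node_id, _md5(json.dumps(data, sort_keys=True)))
        if key in _cache:
            print(f"{node_id}: reusing saved output")
        else:
            print(f"{node_id}: running")
            _cache[key] = run_fn(data)
        # The output of this node becomes the input of the next one.
        data = _cache[key]
    return data


calls = []


def double(d: Dict) -> Dict:
    calls.append("double")
    return {"x": d["x"] * 2}


def add_one(d: Dict) -> Dict:
    calls.append("add_one")
    return {"x": d["x"] + 1}


nodes = [("double", double), ("add_one", add_one)]
first = run_pipeline(nodes, {"x": 3})
second = run_pipeline(nodes, {"x": 3})  # every stage is served from the cache
```

On the second call neither function is invoked again, because both the node identifier and the input hash match an earlier run.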
The entire process (Pseudocode)
For each node object within a pipeline:
- Convert the config dict into a string and calculate its hash (currently MD5)
graph LR
A["{'alpha': 0.03, 'gamma': 0.01 }"]-- hash -->B["8d13c57118d69de715250ab3c084c66e"]
- Get the source code for the object's class and calculate its hash
- Compute the node_hash: the hash of the concatenation of the previous two hash values
graph LR
A[config hash] --> C{Concatenate and hash}
B[source code hash] --> C
C --> D[node_hash]
- Create a folder named [node_class]_[node_hash], for example ClassifierNode_8d13c57118d69de715250ab3c084c681
- Store the config as JSON, the source code for the node class as a Python file, and the node object as a pickle in the created folder
- Convert the input dict passed to the run method into a string and compute the input_hash
- Create the folder [node_class]_[node_hash]/input_[input_hash] if it does not already exist
- Check whether a result from a previous run exists; if not, call the run method on the input dict
- Convert the output dict obtained from the run method into a string and compute the output_hash
- Store the output as [node_class]_[node_hash]/input_[input_hash]/result_[output_hash].pkl
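The steps above could be sketched roughly as follows. This is an illustration under stated assumptions rather than fastpipeline's actual code: dicts are stringified as JSON with sorted keys (the original does not say how), the file names config.json and source.py are invented, the source code is passed in as a string, and pickling the node object itself is omitted. The helper names md5_of, dict_hash, and cached_run are hypothetical.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Dict


def md5_of(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dict_hash(d: Dict) -> str:
    # Assumption: a dict is stringified as JSON with sorted keys.
    return md5_of(json.dumps(d, sort_keys=True, default=str))


def cached_run(root: Path, class_name: str, source: str, config: Dict,
               run_fn: Callable[[Dict], Dict], data: Dict) -> Dict:
    # node_hash = hash of (config hash + source code hash)
    node_hash = md5_of(dict_hash(config) + md5_of(source))
    node_dir = root / f"{class_name}_{node_hash}"
    node_dir.mkdir(parents=True, exist_ok=True)
    (node_dir / "config.json").write_text(json.dumps(config, sort_keys=True))
    (node_dir / "source.py").write_text(source)

    input_dir = node_dir / f"input_{dict_hash(data)}"
    input_dir.mkdir(exist_ok=True)

    # Reuse a previous result if one exists for this node and input.
    existing = sorted(input_dir.glob("result_*.pkl"))
    if existing:
        return pickle.loads(existing[0].read_bytes())

    output = run_fn(data)
    result_path = input_dir / f"result_{dict_hash(output)}.pkl"
    result_path.write_bytes(pickle.dumps(output))
    return output


# Hypothetical node logic and source text, for illustration only.
# For a class defined in a file, inspect.getsource(cls) can supply the source.
source = "def run(d): return {'score': d['x'] * 2}"
calls = []


def run(d: Dict) -> Dict:
    calls.append(1)
    return {"score": d["x"] * 2}


root = Path(tempfile.mkdtemp())
config = {"alpha": 0.03, "gamma": 0.01}
out1 = cached_run(root, "ClassifierNode", source, config, run, {"x": 5})
out2 = cached_run(root, "ClassifierNode", source, config, run, {"x": 5})
```

Because the folder name depends on both the config and the source code, changing either one produces a new node folder, so stale results are never reused for a modified node.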