Mixins
Mixins to handle common situations
AWSMixin
Bases: object
AWS helper functions.
get_aws_attributes(name)
Look up credentials.
Credentials are stored in siteconf on an Enrich deployment site. This function looks up the credentials, opens the S3 connections, and cleans the paths.
Args: name (str): Name for credentials
Returns: dict: A dictionary with credentials, S3 handle, bucket, and path
Source code in enrichsdk/core/mixins.py
CheckpointMixin
Bases: object
checkpoint(df, filename, output='pq', metafile=None, extra_metadata={}, state=None, **kwargs)
Checkpoint a dataframe and collect all stats.
Args:
df (object): Dataframe
filename (str): Output filename. Probably unresolved
metafile (str): Metadata filename. If not specified, filename + '.metadata.json' is used
extra_metadata (dict): Any additional information to be logged
state (object): Enrich pipeline state
kwargs (dict): Any extra parameters to be passed to the pandas
Returns:
dict: Metadata
Source code in enrichsdk/core/mixins.py
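The default metadata filename rule above can be illustrated with a minimal, standard-library sketch. It is not the enrichsdk implementation: `checkpoint_sketch` is a hypothetical name, a list of dicts stands in for the dataframe, CSV stands in for the real output format, and the stats collected (row and column counts) are assumptions.

```python
import csv
import json
import os
import tempfile


def checkpoint_sketch(rows, filename, metafile=None, extra_metadata=None):
    """Write rows to a CSV file and a JSON metadata file next to it."""
    # Documented default: filename + '.metadata.json'
    if metafile is None:
        metafile = filename + ".metadata.json"

    columns = list(rows[0].keys()) if rows else []
    with open(filename, "w", newline="") as fd:
        writer = csv.DictWriter(fd, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    # Assumption: the stats recorded here are illustrative only
    metadata = {
        "filename": filename,
        "rows": len(rows),
        "columns": columns,
    }
    # Merge caller-supplied details without mutating a shared default
    metadata.update(extra_metadata or {})

    with open(metafile, "w") as fd:
        json.dump(metadata, fd, indent=2)
    return metadata


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "sales.csv")
meta = checkpoint_sketch(
    [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}],
    target,
    extra_metadata={"stage": "cleaned"},
)
```

The sketch takes `extra_metadata=None` rather than `{}` so that no mutable default is shared between calls; whether the real method guards against this is not stated in the documentation.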
DatasetSpecMixin
DoodleMixin
Bases: object
Access doodle to update metadata
update_doodle(dataset, filename, action='read', state=None)
Update Doodle server metadata with access information
Source code in enrichsdk/core/mixins.py
EmailMixin
Bases: object
Send email
send_email_helper(sender, receivers, subject, body, attachment_prefix='', attachment_suffix='', frames={}, reply_to=[], bcc=[], cred={})
Send email to a destination.
Source code in enrichsdk/core/mixins.py
FilesMixin
Bases: object
Helper functions for file operations
file_preprocessed_get(metadata, name)
Look up the preprocessed metadata for a given name.
Args:
metadata (dict): Loaded metadata
name (str): Label for the dataframe/filename to look up
Returns:
tuple: A tuple of (loaded dataframe, details)
Source code in enrichsdk/core/mixins.py
file_preprocessed_read(root, load=True, create=True)
Load the metadata of preprocessed files.
Args:
root (str): Directory of preprocessed files
load (bool): If True, load the preprocessed files into dataframes
create (bool): If True, create the root directory
Returns:
dict: Metadata dictionary
Source code in enrichsdk/core/mixins.py
file_preprocessed_update(metadata, name, filename, df, params, context)
Update the preprocessed metadata with new information.
Args:
metadata (dict): Loaded metadata
name (str): Label for the dataframe/filename being updated
filename (str): Expected filename on the disk
df (object): Dataframe that must be stored
params (dict): Params to be used while saving the dataframe
context (dict): Any additional information
Source code in enrichsdk/core/mixins.py
file_preprocessed_write(metadata)
Write the preprocessed files.
Args: metadata (dict): Loaded metadata with updates
Source code in enrichsdk/core/mixins.py
file_split_apply(label, files, splitfunc, applyfunc)
Split a list into partitions based on a name obtained from a splitting function, then apply a function to each partition.
Args:
label (str): Name to be used for documentation
files (list): List of strings/files/names
splitfunc (def): func name -> partition-name
applyfunc (def): func list -> result
Returns:
dict: A dictionary with an entry for each partition provided by splitfunc
Source code in enrichsdk/core/mixins.py
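The split-then-apply behaviour described above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: `split_apply_sketch` is a hypothetical standalone function, and `label` is accepted but unused because the documentation only says it is used for documentation.

```python
from collections import defaultdict


def split_apply_sketch(label, files, splitfunc, applyfunc):
    """Group names by splitfunc(name), then run applyfunc on each group."""
    partitions = defaultdict(list)
    for name in files:
        # splitfunc maps a single name to its partition name
        partitions[splitfunc(name)].append(name)
    # applyfunc receives the full list of names in one partition
    return {part: applyfunc(names) for part, names in partitions.items()}


files = ["2021-01-a.csv", "2021-01-b.csv", "2021-02-a.csv"]
result = split_apply_sketch(
    "monthly",
    files,
    splitfunc=lambda name: name[:7],  # partition by the YYYY-MM prefix
    applyfunc=len,                    # count the files in each partition
)
```

Here `result` holds one entry per month prefix, with the number of files in each.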
GCSMixin
Bases: CloudMixin
GCS Blockstore functions usable by any module
gcs_init_handle(cred=None)
Open a GCS connection. Takes credentials as an explicit argument, or picks them up from the 'gcs_cred' attribute of self.
Args:
cred (object): Credentials (a dictionary with 'keyfile')
Returns:
object: GCSFS handle
Source code in enrichsdk/core/mixins.py
gcs_list_directories(path, gcs=None)
List the directories and non-directories matched by a glob (path).
Args:
path (str): Glob of files
gcs (object): gcsfs instance obtained from get_gcs_handle. If not specified, the 'gcs' object attribute is used
Returns:
tuple: List of directories and non-directories. The first value is the list of directories and the second is the list of non-directories
Source code in enrichsdk/core/mixins.py
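To make the shape of the returned tuple concrete, here is a small sketch that applies the same split to the local filesystem. It is a stand-in only: `list_directories_sketch` is a hypothetical name, and it uses `glob` and `os.path.isdir` instead of a gcsfs handle.

```python
import glob
import os
import tempfile


def list_directories_sketch(path):
    """Split the matches of a glob into (directories, non_directories)."""
    matches = sorted(glob.glob(path))
    directories = [m for m in matches if os.path.isdir(m)]
    nondirectories = [m for m in matches if not os.path.isdir(m)]
    return directories, nondirectories


# Build a tiny tree: one directory and one regular file
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "raw"))
with open(os.path.join(root, "README.txt"), "w") as fd:
    fd.write("notes")

dirs, nondirs = list_directories_sketch(os.path.join(root, "*"))
```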
gcs_list_files(path, gcs=None, include=None)
List files specified by a glob (path).
Args:
path (str): Glob of files
gcs (object): gcsfs instance obtained from get_gcs_handle. If not specified, the 'gcs' object attribute is used
include (def): Function that is called for every file to determine whether to include it or not
Returns:
tuple: List of included and excluded files. The first value is the list of included files and the second is the list of excluded files
Source code in enrichsdk/core/mixins.py
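The role of the include callback can be sketched without a cloud connection. The function below is hypothetical and works on a plain list of paths. Treating `include=None` as "include everything" is an assumption, since the documentation does not say what the default does.

```python
def filter_files_sketch(files, include=None):
    """Partition files into (included, excluded) using an optional callback."""
    included, excluded = [], []
    for name in files:
        # Assumption: with no callback, every file is included
        if include is None or include(name):
            included.append(name)
        else:
            excluded.append(name)
    return included, excluded


listing = ["bucket/data/a.csv", "bucket/data/b.json", "bucket/data/c.csv"]
included, excluded = filter_files_sketch(
    listing, include=lambda name: name.endswith(".csv")
)
```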
gcs_open_file(path, gcs=None)
Open a GCS file and return a file descriptor.
Args:
path (str): Path of the file that must be opened
gcs (object): gcsfs instance obtained from get_gcs_handle. If not specified, the 'gcs' object attribute is used
Returns:
object: File descriptor
Source code in enrichsdk/core/mixins.py
gcs_read_file(path, gcs=None)
Read the contents of a GCS object.
Args:
path (str): Path of the file that must be opened
gcs (object): gcsfs instance obtained from get_gcs_handle. If not specified, the 'gcs' object attribute is used
Returns:
str: content of the path
Source code in enrichsdk/core/mixins.py
gcs_write_file(path, content, gcs=None)
Write the contents to a GCS object.
Args:
path (str): Path of the file that must be opened
content (bytes): Content of the file
gcs (object): gcsfs instance obtained from get_gcs_handle. If not specified, the 'gcs' object attribute is used
Returns:
metadata (dict): Metadata of the output file
Source code in enrichsdk/core/mixins.py
gcs_write_frame(path, df, gcs=None)
Write a dataframe to a GCS object.
Args:
path (str): Path of the file that must be opened
df (dataframe): Pandas dataframe
gcs (object): gcsfs instance obtained from get_gcs_handle. If not specified, the 'gcs' object attribute is used
Returns:
metadata (dict): Metadata of the output file
Source code in enrichsdk/core/mixins.py
get_gcs_handle(cred=None)
metadata_normalizer(detail)
Modify the GCP Metadata to be consistent with S3 metadata
Source code in enrichsdk/core/mixins.py
PandasMixin
Bases: object
Helper functions that load/store files
pandas_read_file(name, path, params=None, test=False)
Read a pandas file.
Args:
name (str): Label for the object. We use it to get the params
path (str): Path of the local file that must be read
params (dict): If explicitly passed, the function will not look up the pandas_params dictionary
test (bool): Whether this is a test run. This gets a limited number of rows
Returns:
obj: A Pandas dataframe
Source code in enrichsdk/core/mixins.py
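The parameter lookup described above (explicit params take precedence over the pandas_params dictionary, and a test run limits the rows) might look like the sketch below. Everything here is an assumption made for illustration: the function name, the shape of `pandas_params` (label mapped to keyword arguments), the use of the pandas `nrows` reader option for test runs, and the `test_rows` value.

```python
def resolve_read_params_sketch(name, pandas_params, params=None,
                               test=False, test_rows=100):
    """Pick the reader keyword arguments for the object labelled `name`."""
    if params is None:
        # Fall back to the per-label entry; an empty dict if none exists
        params = pandas_params.get(name, {})
    # Copy so that neither the caller's dict nor pandas_params is mutated
    resolved = dict(params)
    if test:
        # Assumption: a test run caps the rows read via `nrows`
        resolved["nrows"] = test_rows
    return resolved


pandas_params = {"orders": {"sep": "|"}}
from_lookup = resolve_read_params_sketch("orders", pandas_params)
explicit_test = resolve_read_params_sketch(
    "orders", pandas_params, params={"sep": ","}, test=True
)
```

The resulting dictionary could then be passed as keyword arguments to a pandas reader such as `pandas.read_csv`.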
pandas_read_s3file(name, path, params=None, test=False)
Read a pandas file that is stored in S3.
Args:
name (str): Label for the object. We use it to get the params
path (str): Path of the S3 file that must be read, including the bucket name
params (dict): If explicitly passed, the function will not look up the pandas_params dictionary
test (bool): Whether this is a test run. This gets a limited number of rows
Returns:
obj: A Pandas dataframe
Source code in enrichsdk/core/mixins.py
ParallelMixin
Bases: object
Execute a function in parallel
pexec_multiple(dfs, func, cores=1)
Take a list of dataframes and run func on each in parallel. Returns either a combined dataframe or a list of computed outputs.
Args:
dfs (obj): List of Pandas dataframes
func (def): Function that must be called
cores (int): Number of cores to run on
Returns:
object: List of func outputs or a combined dataframe
Source code in enrichsdk/core/mixins.py
pexec_single(df, func, partitions=10, cores=1)
Take a single dataframe, split it into partitions, and run func on each. Returns either a combined dataframe or a list of computed outputs.
Args:
df (obj): Pandas dataframe
func (def): Function that must be called
partitions (int): Number of dataframe partitions
cores (int): Number of cores to run on
Returns:
object: List of func outputs or a combined dataframe
Source code in enrichsdk/core/mixins.py
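The split-and-run pattern above can be sketched with the standard library alone. This is not the enrichsdk implementation: `pexec_single_sketch` is a hypothetical name, a plain list stands in for the dataframe, a thread pool stands in for whatever worker mechanism the library uses, and only the "list of func outputs" return form is shown.

```python
from concurrent.futures import ThreadPoolExecutor


def pexec_single_sketch(rows, func, partitions=10, cores=1):
    """Split rows into near-equal chunks and apply func to each chunk."""
    # Never create more partitions than there are rows
    partitions = max(1, min(partitions, len(rows) or 1))
    size, extra = divmod(len(rows), partitions)

    chunks, start = [], 0
    for i in range(partitions):
        # The first `extra` chunks take one additional row each
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end

    # map() returns results in the same order as the chunks
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(func, chunks))


totals = pexec_single_sketch(list(range(10)), sum, partitions=3, cores=2)
```

Because the outputs come back in chunk order, they can be recombined (for example, summed or concatenated) without tracking which worker produced which result.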
S3Mixin
Bases: CloudMixin
AWS S3 functions usable by any module
get_s3_handle(cred=None, client_kwargs=None)
s3_cache_files(files, localdir, s3=None)
Cache files from S3 into a local directory.
Args:
files (list): List of paths
localdir (str): Path to cache the file
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
Returns:
list: List of updated files
Source code in enrichsdk/core/mixins.py
s3_init_fshandle(cred=None, client_kwargs=None)
Open an S3 connection. Takes credentials as an explicit argument, or picks them up from the 'aws_cred' attribute of self.
Args:
cred (object): Optional Credentials (a dictionary with 'secret_key' and 'access_key' parameters)
Returns:
object: S3FS handle
Source code in enrichsdk/core/mixins.py
s3_list_directories(path, s3=None)
List the directories and non-directories matched by a glob (path).
Args:
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
path (str): Glob of files
Returns:
tuple: List of directories and non-directories. The first value is the list of directories and the second is the list of non-directories
Source code in enrichsdk/core/mixins.py
s3_list_files(path, s3=None, include=None)
List files specified by a glob (path).
Args:
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
path (str): Glob of files
include (method): Function that is called for every file to determine whether to include it or not
Returns:
tuple: List of included and excluded files. The first value is the list of included files and the second is the list of excluded files
Source code in enrichsdk/core/mixins.py
s3_open_file(path, s3=None)
Open an S3 file and return a file descriptor.
Args:
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
path (str): Path of the file that must be opened
Returns:
object: File descriptor
Source code in enrichsdk/core/mixins.py
s3_read_file(path, s3=None)
Read the contents of an S3 object.
Args:
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
path (str): Path of the file that must be opened
Returns:
str: content of the path
Source code in enrichsdk/core/mixins.py
s3_write_file(path, content, s3=None)
Write the contents to an S3 object.
Args:
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
path (str): Path of the file that must be opened
content (bytes): Content of the file
Returns:
metadata (dict): Dictionary having file metadata
Source code in enrichsdk/core/mixins.py
s3_write_frame(path, df, s3=None)
Write a dataframe to an S3 object.
Args:
s3 (object): s3fs instance obtained from get_s3_handle. If not specified, the 's3' object attribute is used
path (str): Path of the file that must be opened
df (dataframe): Dataframe to write
Returns:
metadata (dict): Dictionary having file metadata