SIIA Público

Book title: IEEE-ACM International Symposium on Cluster Cloud and Grid Computing
Chapter title: Managing Big Data Analytics Workflows with a Database System

UNAM authors:
JAVIER GARCIA GARCIA;
External authors:

Language:
English
Year of publication:
2016
Keywords:

Cluster computing; Data handling; Database systems; Distributed computer systems; Grid computing; Information management; Metadata; Translation (languages); Computing model; Data analytics; Data cleaning; Data preprocessing; Denormalization; Practical issues; Time-consuming tasks; workflow; Big data


Abstract:

A big data analytics workflow is long and complex, with many programs, tools and scripts interacting together. In general, in modern organizations there is a significant amount of big data analytics processing performed outside a database system, which creates many issues to manage and process big data analytics workflows. In general, data preprocessing is the most time-consuming task in a big data analytics workflow. In this work, we defend the idea of preprocessing, computing models and scoring data sets inside a database system. In addition, we discuss recommendations and experiences to improve big data analytics workflows by pushing data preprocessing (i.e. data cleaning, aggregation and column transformation) into a database system. We present a discussion of practical issues and common solutions when transforming and preparing data sets to improve big data analytics workflows. As a case study validation, based on experience from real-life big data analytics projects, we compare pros and cons between running big data analytics workflows inside and outside the database system. We highlight which tasks in a big data analytics workflow are easier to manage and faster when processed by the database system, compared to external processing. © 2016 IEEE.


UNAM entities cited: