
Data & Analytics Glossary

Data Lake Concepts

  • Data Lake: A centralized repository that stores raw data in its native format, structured and unstructured, at any scale.
  • Raw Zone: The area in a data lake where data is ingested in its original format, without any transformation.
  • Cleansed Zone: Contains data that has been cleaned and structured to a usable form.
  • Curated Zone: Contains refined, business-ready data used for analytics and reporting.
  • Schema-on-Read: A data processing approach where the data schema is applied only when the data is read, rather than when it's written.
  • Data Ingestion: The process of importing data into a data lake from various sources.
  • Data Catalog: A metadata repository that helps users find and understand data assets.
  • Data Lakehouse: A hybrid architecture that combines elements of data lakes and data warehouses, enabling both structured querying and large-scale data processing.
  • Object Storage: Storage architecture that manages data as objects, used in data lakes (e.g., Amazon S3, Azure Blob Storage).
  • Medallion Architecture: A data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). Medallion architectures are sometimes also referred to as "multi-hop" architectures. A minimal sketch of the Bronze/Silver/Gold flow follows this list.
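
The medallion flow above can be illustrated with a short PySpark sketch. This is an illustrative sketch, not a canonical implementation: it assumes a Spark session with the Delta Lake package available, and all paths, column names, and the aggregation are hypothetical. Note how the Bronze step also demonstrates schema-on-read.

    from pyspark.sql import SparkSession, functions as F

    # Hypothetical session; assumes the Delta Lake package is installed.
    spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

    # Bronze: ingest raw JSON as-is. The schema is inferred when the data
    # is read (schema-on-read), not enforced when it is written.
    bronze = spark.read.json("s3://example-lake/raw/orders/")
    bronze.write.format("delta").mode("append").save("s3://example-lake/bronze/orders")

    # Silver: cleanse -- drop rows missing a key, normalize the timestamp type.
    silver = (spark.read.format("delta").load("s3://example-lake/bronze/orders")
              .dropna(subset=["order_id"])
              .withColumn("order_ts", F.to_timestamp("order_ts")))
    silver.write.format("delta").mode("overwrite").save("s3://example-lake/silver/orders")

    # Gold: curate a business-ready aggregate for reporting.
    gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
    gold.write.format("delta").mode("overwrite").save("s3://example-lake/gold/customer_value")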

Analytical Product Concepts

  • Business Intelligence (BI): Technologies and practices for collecting, integrating, analyzing, and presenting business data.
  • Dashboards: Visual interfaces displaying key metrics and trends to support decision-making.
  • KPIs (Key Performance Indicators): Quantifiable metrics used to evaluate the success of an organization or specific activities.
  • Semantic Layer: A centralized layer that defines and manages business metrics consistently across tools and teams.
  • Self-Service Analytics: Tools and platforms that enable end users to explore and analyze data without needing technical expertise.
  • Embedded Analytics: Integration of data analysis and visualization capabilities directly into business applications.
  • Data Storytelling: The practice of using data visualizations and narrative to convey insights effectively.
  • Descriptive Analytics: Statistical analysis of historical data to identify patterns and relationships.
  • Predictive Analytics: The use of statistical techniques and modeling to forecast future outcomes by analyzing current and historical data patterns.
  • Prescriptive Analytics: Analysis that goes beyond describing what happened or predicting what might happen, recommending specific actions to achieve a desired outcome or optimize performance. A toy sketch contrasting all three follows this list.
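
The three analytics types above can be contrasted in a few lines of Python. The monthly revenue figures are hypothetical, and the "prescriptive" rule at the end is deliberately simplistic.

    import numpy as np
    import pandas as pd

    # Hypothetical monthly revenue history.
    sales = pd.DataFrame({
        "month": np.arange(1, 13),
        "revenue": [120, 135, 128, 150, 160, 155, 170, 182, 178, 190, 205, 210],
    })

    # Descriptive: summarize what already happened.
    print("mean monthly revenue:", sales["revenue"].mean())

    # Predictive: fit a linear trend and forecast the next month.
    slope, intercept = np.polyfit(sales["month"], sales["revenue"], deg=1)
    forecast = slope * 13 + intercept
    print("forecast for month 13:", round(forecast, 1))

    # Prescriptive (toy rule): recommend an action based on the forecast.
    if forecast > sales["revenue"].iloc[-1]:
        print("recommendation: increase inventory ahead of month 13")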

Tools & Technologies

  • ETL (Extract, Transform, Load): A process to extract data from sources, transform it into a usable format, and load it into a destination (e.g., a data warehouse); the first sketch after this list walks through these three steps.
  • ELT (Extract, Load, Transform): A variation where data is loaded before being transformed, often used with cloud-based data lakes.
  • Data Pipeline: A series of steps to move and process data from source to destination.
  • Delta Lake: An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.
  • Apache Spark: A distributed data processing engine often used with data lakes for analytics and ETL.
  • Data Warehouse: A structured, optimized storage system for querying and analysis (e.g., Synapse, BigQuery).
  • Data Mesh: A decentralized data architecture where domain-oriented teams own and manage their own data products.
  • Lakehouse Platform: Technologies like Databricks that blend data lake flexibility with warehouse performance.
  • Data Partition: The division of a large dataset into smaller, manageable subsets called partitions, typically to speed up queries and parallel processing.
  • File Level Encryption: A data protection method that encrypts individual files or folders rather than an entire storage device or partition.
  • Data Encryption: The process of transforming readable data (plaintext) into an unreadable format (ciphertext), making it readable only to authorized parties with the correct decryption key; the second sketch after this list shows this at the file level.
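
A minimal ETL sketch in Python using pandas, assuming a hypothetical orders.csv source; the partitioned Parquet write at the end also illustrates the Data Partition entry above (pandas delegates the partitioned write to pyarrow).

    import pandas as pd

    # Extract: read from a hypothetical CSV source.
    raw = pd.read_csv("orders.csv")

    # Transform: cleanse types and derive a partition column.
    raw["order_ts"] = pd.to_datetime(raw["order_ts"])
    raw["order_date"] = raw["order_ts"].dt.date
    clean = raw.dropna(subset=["order_id"])

    # Load: write Parquet partitioned by date. Each distinct order_date
    # becomes its own subdirectory (a data partition), so queries that
    # filter on date can skip irrelevant files.
    clean.to_parquet("warehouse/orders/", partition_cols=["order_date"])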
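And a minimal file-level encryption sketch using the Python cryptography package's Fernet recipe (symmetric encryption). The file name is hypothetical, and in practice the key would live in a key vault rather than in the script.

    from cryptography.fernet import Fernet

    # Generate a symmetric key (in practice, fetch it from a key vault).
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt one file's contents: plaintext in, ciphertext out.
    with open("report.csv", "rb") as f:
        ciphertext = fernet.encrypt(f.read())
    with open("report.csv.enc", "wb") as f:
        f.write(ciphertext)

    # Only holders of the key can recover the plaintext.
    plaintext = fernet.decrypt(ciphertext)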

Governance & Quality

  • Data Governance: Policies and processes for ensuring data availability, usability, integrity, and security.
  • Data Lineage: The history of data flow from source to final destination.
  • Data Quality: The condition of data based on factors like accuracy, completeness, reliability, and timeliness.
  • Data Stewardship: The management and oversight of an organization's data assets to ensure high quality and compliance.
  • Data Security:
    • Row Level Security (RLS): Restricts data visibility to individual users or groups based on their roles, effectively controlling access to specific rows within a table or model.
    • Field Level Security (FLS): Controls user access to individual fields on objects, determining whether a user can see a field's value.
  • Access Control: Mechanisms to restrict who can view or manipulate data based on roles or permissions; a toy sketch of RBAC and row-level security follows this list.
    • Access Control List (ACL): A security mechanism that controls access to data and resources by specifying which users or systems are granted or denied access.
    • Role Based Access Control (RBAC): A model for authorizing end-user access to systems, applications, and data based on a user's predefined role.
    • Attribute Based Access Control (ABAC): An authorization model that determines access to resources based on attributes associated with the subject, the object, and the environment.
    • Policy Based Access Control (PBAC): A method of managing user access to systems and resources through a set of policies rather than hard-coded rules or static permissions. It determines user privileges by combining their organizational role with defined policies, allowing for dynamic and flexible access control.
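
As a toy illustration of RBAC and row-level security (not any particular product's API), the sketch below maps predefined roles to permissions and filters rows by a user attribute; all names and values are hypothetical.

    # Hypothetical role -> permission mapping (RBAC).
    ROLE_PERMISSIONS = {
        "analyst": {"read"},
        "engineer": {"read", "write"},
        "admin": {"read", "write", "grant"},
    }

    def is_allowed(role: str, action: str) -> bool:
        """RBAC check: is the action among the role's predefined permissions?"""
        return action in ROLE_PERMISSIONS.get(role, set())

    # Row-level security (toy version): each user sees only the rows
    # matching their region.
    def apply_rls(rows: list[dict], user_region: str) -> list[dict]:
        return [r for r in rows if r["region"] == user_region]

    rows = [{"region": "EU", "amount": 10}, {"region": "US", "amount": 20}]
    print(is_allowed("analyst", "write"))  # False
    print(apply_rls(rows, "EU"))           # only the EU row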

Data Products & Management

  • Data Product: A curated dataset or analytic asset designed to serve a particular purpose or business function.
  • Data-as-a-Product: Treating datasets as products with defined owners, SLAs, and user-centric design; a toy metadata contract follows this list.
  • Metadata: Data about data, including its origin, format, usage, and relationships.
  • Master Data: The core data that is essential to operations in a business, such as customer, product, or supplier data.
  • Reference Data: Static data used to categorize or classify other data (e.g., country codes, units of measure).
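
To make data-as-a-product concrete, here is a toy metadata contract sketched as a Python dataclass. The fields (owner, SLA, schema, tags) are illustrative, not a standard.

    from dataclasses import dataclass, field

    @dataclass
    class DataProduct:
        """Toy metadata record for a data product contract."""
        name: str
        owner: str                # accountable team or data steward
        sla_hours: int            # freshness guarantee for consumers
        schema: dict[str, str]    # column -> type (metadata about the data)
        tags: list[str] = field(default_factory=list)

    orders = DataProduct(
        name="gold.customer_value",
        owner="sales-analytics",
        sla_hours=24,
        schema={"customer_id": "string", "lifetime_value": "decimal(18,2)"},
        tags=["curated", "no-pii"],
    )
    print(orders.owner, orders.sla_hours)
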
Note: ChatGPT and Google AI were used in preparing this glossary.
