Glossaries Archive | Databricks

Glossary

What is a transaction? In the context of databases and data storage systems, a transaction is any operation that is treated as a single unit of work, which either completes fully or does not complete at all, and leaves the storage system in a consiste{...}
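The all-or-nothing behavior described above can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The `accounts` table, the `transfer` and `balances` helpers, and the overdraft rule are assumptions made up for this example; they are not taken from the glossary entry.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single unit of work (hypothetical helper)."""
    try:
        # Using the connection as a context manager commits on success and
        # rolls back every statement in the block if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                # Assumed business rule: overdrafts abort the whole transfer.
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

# Illustrative in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A transfer that would overdraw the source account leaves both rows exactly as they were, because the two `UPDATE` statements are rolled back together.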

AdaGrad

Gradient descent is the most commonly used optimization method deployed in machine learning and deep learning algorithms. It’s used to train a machine learning model. Types of Gradient Descent There are three primary types of gradient descent used in{...}
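As a rough illustration of the entry's topic, the sketch below applies the AdaGrad variant of gradient descent to a one-dimensional problem. AdaGrad divides each step by the square root of the accumulated squared gradients. The function name `adagrad_minimize`, the quadratic objective, and the hyperparameter values are assumptions chosen for this example.

```python
import math

def adagrad_minimize(grad, x0, lr=1.0, steps=200, eps=1e-8):
    """Minimize a 1-D function given its gradient, using AdaGrad-style updates."""
    x = x0
    grad_sq_sum = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        grad_sq_sum += g * g
        # The effective step size shrinks as squared gradients accumulate.
        x -= lr * g / (math.sqrt(grad_sq_sum) + eps)
    return x

# Assumed objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adagrad_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Plain gradient descent would use a fixed `lr` for every step; the accumulated denominator is what distinguishes AdaGrad in this sketch.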

AI Agents

What Are AI Agents? Summary: Understand what makes AI agents different from traditional AI systems, including how they perceive, decide and act autonomously. Explore the evolution of AI agents from early rule-based programs in the 1960s to today’s advanc{...}

Alternative Data

What is Alternative Data? Alternative data is information gathered from non-traditional sources that others are not using. Analysis of alternative data can provide insights beyond that which an indus{...}

Anomaly Detection

Anomaly Detection is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations. Such “anomalous” behavior typically translates to some kind of a problem like{...}

Apache Hive

What is Apache Hive? Apache Hive is open-source data warehouse software designed to read, write, and manage large datasets extracted from the Apache Hadoop Distributed File System (HDFS), one aspect of a larger Hadoop Ecosystem. With extensive Apach{...}

Apache Kudu

What is Apache Kudu? Apache Kudu is a free and open source columnar storage system developed for Apache Hadoop. It is an engine intended for structured data that supports low-latency, millisecond-scale random access to individual rows toget{...}

Apache Kylin

What is Apache Kylin? Apache Kylin is a distributed open source online analytical processing (OLAP) engine for interactive analytics on big data. Apache Kylin has been designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop/Spar{...}

Apache Spark

What Is Apache Spark? Apache Spark is an open source analytics engine used for big data workloads. It can handle both batch and real-time analytics and data processing workloads. Apache Spark started in 2009 as a research project at the{...}

Apache Spark as a Service

What is Apache Spark as a Service? Apache Spark is an open source cluster computing framework for fast real-time large-scale data processing. Since its inception in 2009 at UC Berkeley’s AMPLab, Spark has seen major growth. It is currently rated{...}

Artificial General Intelligence: Understanding the Next Frontier of AI

Artificial general intelligence (AGI) refers to a hypothetical form of artificial intelligence (AI) capable of performing the full range of human-level intellectual tasks. More specifically, artificial general intelligence refers to systems with broa{...}

Artificial Neural Network

What is an Artificial Neural Network? An artificial neural network (ANN) is a computing system patterned after the operation of neurons in the human brain. How Do Artificial Neural Networks Work? Artificial Neural Networks can be best viewed as weigh{...}

Automation Bias

What is Automation Bias? Automation bias is an over-reliance on automated aids and decision support systems. As the availability of automated decision aids is increasing, additions to critical decision-making contexts such as intensive care units, or {...}

Bayesian Neural Network

What Are Bayesian Neural Networks? Bayesian Neural Networks (BNNs) extend standard networks with posterior inference in order to control over-fitting. From a broader perspective, the Bayesian approach uses the statistical methodology so {...}

Big Data Analytics

What Is Big Data Analytics? Big data analytics is the often complex process of examining large and varied data sets ("big data") generated by sources such as eCommerce, mobile devices, social media and the Internet of Things (IoT). It involves integra{...}

Bioinformatics

Bioinformatics is a field of study that uses computation to extract knowledge from large collections of biological data. Bioinformatics refers to the use of IT in biotechnology for storing, retrieving, organizing and analyzing biological data. An out{...}

Business intelligence tools overview

Business intelligence (BI) tools are a critical category of software applications designed to collect, process, analyze and present business data in meaningful ways. At their core, these tools transform raw data into actionable insights that drive st{...}

Business Intelligence vs. Business Analytics: What's the Difference?

Business intelligence (BI) is a set of technologies, processes and strategies designed to generate actionable insights from business data. Business intelligence systems gather and store raw business operations data, which is analyzed to transform it {...}

Catalyst Optimizer

At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasi quotes) in a novel way to build an extensible query optimizer. Catalyst is based on functional program{...}

Complex Event Processing

What is Complex Event Processing [CEP]? Complex event processing [CEP], also known as event, stream or event stream processing, is the use of technology for querying data before storing it within a database or, in some cases, without it ever being stor{...}

Compound AI Systems

What Are Compound AI Systems? Compound AI systems, as defined by the Berkeley AI Research (BAIR) blog, are systems that tackle AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers or {...}

Continuous Applications

A continuous application is an end-to-end application that reacts to data in real time. In particular, developers would like to use a single programming interface to support the facets of continuous applications that are currently handled in separate{...}

Convolutional Layer

In deep learning, a convolutional neural network (CNN or ConvNet) is a class of deep neural networks typically used to recognize patterns present in images, but they are also used for spatial data analysis, computer vision, natural language {...}

Data Analysis Platform

What is a Data Analysis Platform? A data analytics platform is an ecosystem of services and technologies that needs to perform analysis on voluminous, complex and dynamic data that allows you to retrieve, combine, interact with, explore, and visuali{...}

Data Automation

As the amount of data, data sources and data types grow, organizations increasingly require tools and strategies to help them transform that data and derive business insights. Processing raw, messy data into clean, quality data is a critical step bef{...}

Data Catalog

What is a data catalog? A data catalog is a centralized inventory and management system that serves as the ultimate “treasure map” for your organization’s data assets. It provides a comprehensive, searchable repository of metadata that enables data pr{...}

Data Governance

What is Data Governance? Data governance is the oversight to ensure data brings value and supports the business strategy. Data governance is more than just a tool or a process. It aligns data-related requirements to the business strategy using a fram{...}

Data Ingestion

Data ingestion is the first step in the data engineering lifecycle. It involves gathering data from diverse sources such as databases, SaaS applications, file sources, APIs and IoT devices into a centralized repository like a data lake, data warehous{...}

Data Integration

What is data integration? Data integration is the process of combining data from multiple systems into a unified, reliable view. It brings together information from databases, applications, event streams, files, APIs and third-party platforms so organ{...}

Data Lakehouse

What is a Data Lakehouse? A data lakehouse is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelli{...}

Data Lineage

What is data lineage? Data lineage is the process of recording, tracking and visualizing data and AI over time, from origin to consumption. Effective data lineage provides data teams with an end-to-end view of how data is transformed and flows across {...}

Data Literacy

What is Data Literacy? Data literacy is the ability to read, work with, analyze and communicate data effectively. It’s understanding what data means, how it’s created and how to use it so you can ask the right questions, interpret data correctly and m{...}

Data Management

What is data management? Let’s start out with a data management definition. Data management is the practice of organizing, processing, storing, securing and analyzing an organization’s data throughout its lifecycle, including managing all of the organi{...}

Data Marketplace

What is a data marketplace or data market? Data marketplaces, or data markets, are online stores that enable data sharing and collaboration. They connect data providers and data consumers, offering participants the opportunity to buy and sell data an{...}

Data Mart

What is a data mart? A data mart is a curated database including a set of tables that are designed to serve the specific needs of a single data team, community, or line of business, like the marketing or engineering department. It is normally smaller {...}

Data Mesh

Data is critical to enterprises, serving as the raw material for innovation and advancement. Its importance has grown as organizations become more data- and decision-centric, creating major challenges for organizations trying to keep up. Legacy data {...}

Data Orchestration

What is data orchestration? Orchestration is the coordination and management of multiple computer systems, applications and/or services, stringing together multiple tasks in order to execute a larger workflow or process. More specifically, orchestrati{...}

Data Processing

What Is Data Processing? Data processing refers to the end-to-end transformation of raw data into meaningful, actionable insights. Organizations rely on these systems to process structured and unstructured data in real time (or at scale) to make timel{...}

Data Sharing

What is data sharing? Data sharing is the ability to make the same data available to one or many consumers. The ever-growing amount of data has become a strategic asset for any company. Sharing data — within business units as well as consuming data f{...}

Data Streaming

What is Data Streaming? Data streaming is the continuous collection, processing and analysis of data as it is generated, allowing organizations to act on information in real time. Over the last several years, the need for real-time data has grown expo{...}

Data Transformation

What is data transformation? Data transformation is the process of taking raw data that has been extracted from data sources and turning it into usable datasets. Data pipelines often include multiple data transformations, changing messy information in{...}

Data Vault

What is a data vault? A data vault is a data modeling design pattern used to build a data warehouse for enterprise-scale analytics. The data vault has three types of entities: hubs, links, and satellites. Hubs represent core business concepts, links re{...}

Data Virtualization: Unified Real-Time Access Across Multiple Data Sources

What is Data Virtualization? Data virtualization is a data integration method that enables organizations to create unified views of information from multiple data sources without physically moving or copying the data. As a core data virtualization tec{...}

Data Warehouse

What is a data warehouse? A data warehouse is a data management system that stores current and historical data from multiple sources in a business friendly manner for easier insights and reporting. Data warehouses are typically used for business inte{...}

Database Schema: A Comprehensive Guide to Structure, Design, and Implementation

Introduction: Understanding Database Schemas in Modern Data Management. A database schema acts as a blueprint for how a database is organized and structured. It defines how database tables are laid out, what fields they contain and how those tables rel{...}

Databricks Runtime

Databricks Runtime is the set of software artifacts that run on the clusters of machines managed by Databricks. It includes Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of {...}

DataFrames

What is a DataFrame? A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a fl{...}

Dataset

What is a Dataset? A dataset is a structured collection of data organized and stored together for analysis or processing. The data within a dataset is typically related in some way and taken from a single source or intended for a single project. For {...}

Deep Learning

What is Deep Learning? Deep Learning is a subset of machine learning concerned with large amounts of data with algorithms that have been inspired by the structure and function of the human brain, which is why deep learning models are often referred t{...}

Demand Forecasting

What is demand forecasting? Demand forecasting is the process of projecting consumer demand (equating to future revenue). Specifically, it is projecting the assortment of products shoppers will buy using quantitative and qualitative data. Retailers a{...}

Dense Tensor

Dense tensors store values in a contiguous sequential block of memory where all values are represented. Tensors or multi-dimensional arrays are used in a diverse set of multi-dimensional data analysis applications. There are a number of software prod{...}
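To make the idea of a contiguous block where every value is represented concrete, here is a minimal sketch of a dense tensor backed by one flat buffer with row-major strides. The class name `DenseTensor` and its methods are hypothetical and do not correspond to any library mentioned in the entry.

```python
from array import array

class DenseTensor:
    """Illustrative dense tensor: every element lives in one flat buffer."""

    def __init__(self, shape, fill=0.0):
        self.shape = tuple(shape)
        size = 1
        for dim in self.shape:
            size *= dim
        # array('d') stores C doubles back to back in a single block of memory.
        self.data = array("d", [fill]) * size
        # Row-major strides: the last axis varies fastest.
        strides = []
        stride = 1
        for dim in reversed(self.shape):
            strides.append(stride)
            stride *= dim
        self.strides = strides[::-1]

    def _offset(self, index):
        """Translate a multi-dimensional index into a position in the flat buffer."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, dim, stride in zip(index, self.shape, self.strides):
            if not 0 <= i < dim:
                raise IndexError("index out of range")
            offset += i * stride
        return offset

    def __getitem__(self, index):
        return self.data[self._offset(index)]

    def __setitem__(self, index, value):
        self.data[self._offset(index)] = value

t = DenseTensor((2, 3, 4))
t[1, 2, 3] = 7.0
```

Storage here is always 2 * 3 * 4 = 24 values regardless of how many are non-zero, which is the defining trade-off of a dense layout.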

Digital Twin

What is a Digital Twin? The classical definition of a digital twin is: "A digital twin is a virtual model designed to accurately reflect a physical object." – IBM. For a discrete or continuous manufacturing process, a digital twin gathers sys{...}

DNA Sequence

What is a DNA Sequence? DNA sequencing is the process of determining the exact sequence of nucleotides of DNA (deoxyribonucleic acid). Sequencing DNA determines the order of the four chemical building blocks - adenine, guanine, cytosine, and thymine als{...}

Enterprise Data Warehouse (EDW)

What is an enterprise data warehouse (EDW)? An enterprise data warehouse (EDW) is a centralized, structured repository designed to consolidate and manage organizational data. The core benefit of an EDW is that it provides a governed environment where {...}

Extract Transform Load (ETL)

What is ETL? As the amount of data, data sources, and data types at organizations grow, the importance of making use of that data in analytics, data science and machine learning initiatives to derive business insights grows as well. The need to prior{...}

Fine-tuning

Understanding fine-tuning When training artificial intelligence (AI) and machine learning (ML) models for a specific purpose, data scientists and engineers have found it easier and less expensive to modify existing pretrained foundation large languag{...}

Generative AI

Generative AI is changing the way humans create, work and communicate. Databricks explains how generative AI works and where it’s heading next. {...}

Genomics

Genomics is an area within genetics that concerns the sequencing and analysis of an organism's genome. Its main task is to determine the entire sequence of DNA or the composition of the atoms that make up the DNA and the chemical bonds between the DN{...}

Hadoop Cluster

What Is a Hadoop Cluster? Apache Hadoop is an open source, Java-based, software framework and parallel data processing engine. It enables big data analytics processing tasks to be broken down into smaller tasks that can be performed in parallel by us{...}

Hadoop Distributed File System (HDFS)

HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop applications. This open source framework works by rapidly transferring data between nodes. It's often used by companies that need to handle and store big data. HDFS{...}

Hadoop Ecosystem

What is the Hadoop Ecosystem? The Apache Hadoop ecosystem refers to the various components of the Apache Hadoop software library; it includes open source projects as well as a complete range of complementary tools. Some of the most well-known tools of the{...}

Hash Buckets

In computing, a hash table [hash map] is a data structure that provides virtually direct access to objects based on a key [a unique String or Integer]. A hash table uses a hash function to compute an index into an array of buckets or slots, from whic{...}
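The mechanism described above, where a hash function maps a key to an index in an array of buckets, can be sketched in a few lines. The class `BucketedHashTable`, the fixed bucket count, and the use of per-bucket lists to handle collisions are assumptions for illustration only.

```python
class BucketedHashTable:
    """Illustrative hash table with a fixed number of buckets (no resizing)."""

    def __init__(self, num_buckets=8):
        # Each bucket is a list of (key, value) pairs that hashed to its index.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The hash function's output is reduced to a valid bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only the one bucket selected by the hash is scanned.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = BucketedHashTable(num_buckets=8)
table.put(1, "a")
table.put(9, "b")  # 9 % 8 == 1, so this key shares a bucket with key 1
```

Small integers hash to themselves in CPython, which is why keys 1 and 9 land in the same bucket here; a lookup still returns the right value because the bucket is searched by key.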

Hive Date Function

What is a Hive Date Function? Hive provides many built-in functions to help us in the processing and querying of data. Some of the functionalities provided by these functions include string manipulation, date manipulation, type conversion, conditiona{...}

Hosted Spark

What is Hosted Spark? Apache Spark is a fast and general cluster computing system for Big Data built around speed, ease of use, and advanced analytics that was originally built in 2009 at UC Berkeley. It provides high-level APIs in Scala, Java, Pytho{...}

Jupyter Notebook

What is a Jupyter Notebook? A Jupyter Notebook is an open source web application that allows data scientists to create and share documents that include live code, equations, and other multimedia resources. What are Jupyter Notebooks used for? Jupyter no{...}

Keras Model

What is a Keras Model? Keras is a high-level library for deep learning, built on top of Theano and TensorFlow. It is written in Python and provides a clean and convenient way to create a range of deep learning models. Keras has become one of the{...}

Lakehouse for Retail

What is Lakehouse for Retail? Lakehouse for Retail is Databricks’ first industry-specific Lakehouse. It helps retailers get up and running quickly through solution accelerators, data sharing capabilities, and a partner ecosystem. Lakehouse for Retail{...}

Lambda Architecture

What is Lambda Architecture? Lambda architecture is a way of processing massive quantities of data (i.e. "Big Data") that provides access to batch-processing and stream-processing methods with a hybrid approach. Lambda architecture is used to solve th{...}

Large Language Models (LLMs)

What are large language models (LLMs)? Language models are a type of generative AI (GenAI) that use natural language processing (NLP) to understand and generate human language. Large language models (LLMs) are the most powerful of these. LLMs are trai{...}

LLMOps

What Is LLMOps? Large Language Model Ops (LLMOps) encompasses the practices, techniques and tools used for the operational management of large language models in production environments. The latest advances in LLMs, underscored by releases such as Open{...}

Machine Learning Library (MLlib)

Apache Spark’s Machine Learning Library (MLlib) is designed for simplicity, scalability, and easy integration with other tools. With the scalability, language compatibility, and speed of Spark, data scientists can focus on their data problems and mod{...}

Machine Learning Models

What is a machine learning model? A machine learning model is a program that can find patterns or make decisions from a previously unseen dataset. For example, in natural language processing, machine learning models can parse and correctly recognize t{...}

Managed Spark

What is Managed Spark? A managed Spark service lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. By using such automation you will be able to quickly create clusters on-demand, mana{...}

MapReduce

What is MapReduce? MapReduce is a Java-based, distributed execution framework within the Apache Hadoop Ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implement: 1) Map and 2) Reduce{...}
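The two steps named above can be illustrated with a single-process word count. This is only a sketch of the programming model in Python, not the Hadoop Java API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the grouping step that a real framework performs between Map and Reduce is simulated with a dictionary.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # In a distributed run each document could be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["spark and hadoop", "Hadoop and hive"])
```

Because each `map_phase` call depends only on its own record and each `reduce_phase` call only on its own key, both steps can be parallelized without the developer writing any coordination code.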

Materialized Views

What is a materialized view? A materialized view is a database object that stores the results of a query as a physical table. Unlike regular database views, which are virtual and derive their data from the underlying tables, materialized views contai{...}

Medallion Architecture

What is a medallion architecture? A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through {...}

ML Pipelines

Running machine learning algorithms typically involves a sequence of tasks, including pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and c{...}

MLOps

What is MLOps? MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them. MLOps{...}

Model Risk Management

Model risk management refers to the supervision of risks from the potential adverse consequences of decisions based on incorrect or misused models. The aim of model risk management is to employ techniques and practices that will identify, measure and{...}

Neural Network

What is a Neural Network? A neural network is a computing model whose layered structure resembles the networked structure of neurons in the brain. It features interconnected processing elements called neurons that work together to produce an output f{...}

Open Banking

What is Open Banking? Open banking is a secure way to provide access to consumers' financial data, all contingent on customer consent.² Driven by regulatory, technology, and competitive dynamics, Open Banking calls for the democratization of customer{...}

Overall Equipment Effectiveness

What is Overall Equipment Effectiveness? Overall Equipment Effectiveness (OEE) is a measure of how well a manufacturing operation is utilized (facilities, time and material) compared to its full potential, during the periods when it is scheduled to ru{...}

pandas DataFrame

When it comes to data science, it's no exaggeration to say that you can transform the way your business works by using it to its full potential with pandas DataFrame. To do that, you'll need the right data structures. These will help you be as effici{...}

Parquet

{...}

Personalized Finance

What is Personalized Finance? Financial products and services are becoming increasingly commoditized and consumers are becoming more discerning as the media and retail industries have increased their penchant for personalized experiences. To remain c{...}

Polars vs pandas: Choosing the Right Python DataFrame Library for Your Data Workflow

Introduction: Understanding DataFrame Library Options. DataFrames are two-dimensional data structures, usually tables, similar to spreadsheets, that allow you to store and manipulate tabular data in rows of observations and columns of variables, as wel{...}

Predictive Maintenance

What is predictive maintenance? Predictive Maintenance, in a nutshell, is all about figuring out when an asset should be maintained, and what specific maintenance activities need to be performed, based on an asset’s actual condition or state, rather {...}

Prompt Engineering

Prompt engineering is an emerging field at the forefront of artificial intelligence (AI) development that focuses on the critical processes of crafting effective inputs for generative AI (GenAI) models. As AI systems become increasingly sophisticated{...}

PyCharm

PyCharm is an integrated development environment (IDE) used in computer programming, created for the Python programming language. When using PyCharm on Databricks, by default PyCharm creates a Python Virtual Environment, but you can configure to cre{...}

PySpark

What is PySpark? Apache Spark is written in the Scala programming language. PySpark was released to support the collaboration of Apache Spark and Python; it is a Python API for Spark. In addition, PySpark helps you interface with R{...}

Real-Time Analytics

What Is Real-Time Analytics? Real-time analytics refers to the practice of collecting and analyzing streaming data as it is generated, with minimal latency between the generation of data and the analysis of that data. Real-time analytics is often use{...}

Real-Time Retail

What is real-time data for Retail? Real-time retail is real-time access to data. Moving from batch-oriented access, analysis and compute will allow data to be “always on,” therefore driving accurate, timely decisions and business intelligence. Real-t{...}

Resilient Distributed Dataset (RDD)

The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with a low-level API that{...}

Retrieval Augmented Generation

Summary: Learn how retrieval augmented generation (RAG) works by combining large language models (LLMs) with real-time, external data for more accurate and relevant outputs. See how RAG solves specific problems, such as reducing hallucinations and deliv{...}

Semantic Layer

{...}

Serverless Computing

Serverless computing is the latest evolution of the compute infrastructure. Organizations used to need physical servers to run web applications. Then the rise of cloud computing enabled them to create virtual servers — although they still had to take{...}

Snowflake Schema

What is a snowflake schema? A snowflake schema is a multi-dimensional data model that is an extension of a star schema, where dimension tables are broken down into subdimensions. Snowflake schemas are commonly used for business intelligence and{...}

Spark API

If you are working with Spark, you will come across three APIs: DataFrames, Datasets, and RDDs. What are Resilient Distributed Datasets? RDDs, or Resilient Distributed Datasets, are a collection of records with distributed computing, which are fault {...}

Spark Applications

Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; res{...}

Spark Elasticsearch

What is Spark Elasticsearch? Spark Elasticsearch is a NoSQL, distributed database that stores, retrieves, and manages document-oriented and semi-structured data. It is a GitHub open source, RESTful search engine built on top of Apache Lucene and rele{...}

Spark SQL

Many data scientists, analysts, and general business intelligence users rely on interactive SQL queries for exploring data. Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can al{...}

Spark Streaming

Apache Spark Streaming is the previous generation of Apache Spark’s streaming engine. There are no longer updates to Spark Streaming and it’s a legacy project. There is a newer and easier to use streaming engine in Apache Spark called Structured Stre{...}

Spark Tuning

What is Spark Performance Tuning? Spark Performance Tuning refers to the process of adjusting settings for the memory, cores, and instances used by the system. This process helps Spark perform well and also prevents {...}

Sparklyr

What is Sparklyr? Sparklyr is an open-source package that provides an interface between R and Apache Spark. You can now leverage Spark’s capabilities in a modern R environment, due to Spark’s ability to interact with distributed data with little late{...}

SparkR

SparkR is a tool for running R on Spark. It follows the same principles as all of Spark’s other language bindings. To use SparkR, we simply import it into our environment and run our code. It’s all very similar to the Python API except that it follow{...}

Sparse Tensor

NumPy is a widely used Python library for manipulating multi-dimensional arrays. The organization and use of this library is a primary requirement for developing the pytensor library. Sptensor is a class that represents the sparse tensor. A spar{...}
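As a rough illustration of the sparse idea, the sketch below stores only non-zero entries, keyed by their coordinates, and treats every other position as zero. The class name `SparseTensor` and its methods are hypothetical; this is not the Sptensor class or the pytensor API named in the entry.

```python
class SparseTensor:
    """Illustrative coordinate-format sparse tensor: only non-zeros are stored."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self.values = {}  # maps an index tuple to its non-zero value

    def _check(self, index):
        if len(index) != len(self.shape) or any(
            not 0 <= i < dim for i, dim in zip(index, self.shape)
        ):
            raise IndexError("index does not fit the tensor shape")

    def __setitem__(self, index, value):
        self._check(index)
        if value == 0:
            # Writing a zero removes the entry instead of storing it.
            self.values.pop(index, None)
        else:
            self.values[index] = value

    def __getitem__(self, index):
        self._check(index)
        return self.values.get(index, 0)

    def nnz(self):
        """Number of stored (non-zero) entries."""
        return len(self.values)

s = SparseTensor((1000, 1000, 1000))
s[1, 2, 3] = 5.0
```

A dense layout for this shape would need a billion slots, while the sketch holds a single dictionary entry.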

Star Schema

{...}

Streaming Analytics

How Does Stream Analytics Work? Streaming analytics, also known as event stream processing, is the analysis of huge pools of current and “in-motion” data through the use of continuous queries, called event streams. These streams are triggered by a s{...}

Structured Streaming

Structured Streaming is a high-level API for stream processing that became production-ready in Spark 2.2. Structured Streaming allows you to take the same operations that you perform in batch mode using Spark’s structured APIs, and run them in a stre{...}

Supply Chain Management

What is supply chain management? Supply chain management is the process of planning, implementing and controlling operations of the supply chain with the goal of efficiently and effectively producing and delivering products and services to the end cu{...}

TensorFlow

In November of 2015, Google released its open-source framework for machine learning and named it TensorFlow. It supports deep-learning, neural networks, and general numerical computations on CPUs, GPUs, and clusters of GPUs. One of the biggest advant{...}

Tensorflow Estimator API

What is the Tensorflow Estimator API? Estimators represent a complete model but are also intuitive enough for less experienced users. The Estimator API provides methods to train the model, to judge the model’s accuracy, and to generate predictions. TensorFlow pro{...}

Tungsten

What is the Tungsten Project? Tungsten is the codename for the umbrella project to make changes to Apache Spark’s execution engine that focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance cl{...}

Understanding the PostgreSQL Database: Features and Advantages Explained

Introduction to PostgreSQL. A PostgreSQL database is an open-source relational database management system that stores, organizes and retrieves structured data. This relational database enforces relationships between data tables, validates data as it en{...}

Unified AI Framework

Unified Artificial Intelligence or UAI was announced by Facebook during F8 this year. This brings together two specific deep learning frameworks that Facebook created and open sourced - PyTorch focused on research assuming access to large-scale compute r{...}

Unified Data Analytics

Unified Data Analytics is a new category of solutions that unify data processing with AI technologies, making AI much more achievable for enterprise organizations and enabling them to accelerate their AI initiatives. Unified Data Analytics makes it e{...}

Unified Data Analytics Platform

Databricks' Unified Data Analytics Platform helps organizations accelerate innovation by unifying data science with engineering and business. With Databricks as your Unified Data Analytics Platform, you can quickly prepare and clean data at mass{...}

Unified Data Warehouse

What is a Unified Data Warehouse? A unified database, also known as an enterprise data warehouse, holds all the business information of an organization and makes it accessible all across the company. Most companies today have their data managed in iso{...}

Vector Database

What is a vector database? A vector database is a specialized database designed to store and manage data as high-dimensional vectors. The term comes from vectors, which are mathematical representations of features or attributes contained in data. In {...}

What is a business intelligence platform?

A business intelligence (BI) platform is a comprehensive technology solution that helps organizations gather, understand and visualize their data to make informed business decisions. These platforms serve as the technological backbone of a company’s {...}

What is a Data Pipeline?

A data pipeline is a method in which raw data is ingested from various data sources, transformed, and then moved to a destination—such as a data lake or data warehouse—for analysis. It consists of a series of steps that are carried out in a specific {...}

What Is a Feature Platform for Machine Learning?

Up until two years ago, only giant technology companies had the resources and expertise to build products that fully depended on machine learning systems. Think Google powering ad auctions, TikTok recommending content, and Uber dynamically adjusting {...}

What Is a Feature Store?

Updated: May 15, 2025. About the authors: Mike Del Balso, CEO & Co-Founder of Tecton; Willem Pienaar, Creator of Feast. Data teams are starting to realize that operational machine learning requires solving data problems that extend far beyond the creati{...}

What Is a Relational Database (RDBMS)? Key Features and Uses

What Is a Relational Database? A relational database is a type of database that stores and provides access to data in tables that can be linked to each other through shared columns and rows, called relations, with unique identifiers (keys) that show the dif{...}

What Is Agentic AI?

Understanding Autonomous AI Systems and Their Real-World Applications. Introduction to Agentic AI: Agentic AI refers to intelligent platforms that can autonomously plan, decide and act to achieve goals with minimal human intervention, rather than respond{...}

What is AI Agent Evaluation?

AI agent evaluation is the discipline of measuring how effectively an autonomous AI system performs tasks, guides its own decisions, interacts with tools, reasons over multiple steps and produces safe, reliable outcomes. As organizations extend AI ag{...}

What Is AI Governance? A Clear Guide to Responsible AI

What Is AI Governance? AI governance is the set of frameworks, policies, and processes organizations use to ensure artificial intelligence systems are developed, deployed, and operated responsibly throughout their lifecycle. The term refers to any ove{...}

What Is an AI Model?

An AI model is a computer program or algorithm that has been trained on data to recognize patterns, make predictions, and automate decisions without human intervention. AI models use algorithms — step-by-step rules based on arithmetic, repetition, an{...}

What is Augmented Analytics?

Augmented analytics represents the evolution of business intelligence (BI) through the integration of artificial intelligence (AI) and machine learning (ML) into the data analysis workflow. Rather than replacing human analysts, augmented analytics en{...}

What is Business Intelligence?

Business intelligence (BI) is a set of strategies, technologies and processes that collect, manage and analyze business data to transform it into actionable insights for better decision-making. BI systems transform raw data into meaningful informatio{...}

What is Change Data Capture?

What is Change Data Capture? Change Data Capture (CDC) is a data integration technique that identifies and records row-level changes made to a dataset, such as inserts, updates, and deletes. Instead of repeatedly extracting entire tables, CDC captures{...}

What Is Computer Vision?

Computer vision is a field of study within computer science that focuses on enabling machines to analyze and understand visual information as closely as possible to the way humans do through the power of sight. At its core, computer vision is about g{...}

What Is Data Architecture?

Data architecture is defined as a framework of concepts, standards, policies, models and rules used to manage data within an organization. Data architectures are blueprints for organizing enterprise data processes and flows, with the goal of ensuring{...}

What Is Data Classification?

Data classification is the process of organizing data into clearly defined categories based on its sensitivity, value and risk to the organization. These categories — often expressed as levels such as public, internal, confidential or restricted — es{...}

What is Data Collection?

Data collection is the systematic gathering and measuring of information from different sources that will later be used for decision-making, insights, and to power data-driven systems. Data collection is the first stage in the data lifecycle. It repre{...}

What is Data Engineering?

Data engineering is the practice of designing, building and maintaining systems that collect, store, transform and deliver data for analysis, reporting, machine learning and decision-making. It’s about making sure the data actually shows up, on time,{...}

What is Data Flow?

Data flow describes the movement of data through a system’s architecture, from one process or component to another. It describes how data is input, processed, stored and output within a computer system, application or network. Data flow has a direct {...}

What Is Data Intelligence?

Data intelligence is the process of using artificial intelligence (AI) systems to learn, understand and reason on an organization’s data, enabling the creation of custom AI applications and democratizing access to data across the enterprise. How does{...}

What is Data Migration?

Data migration is the process of selecting, preparing, extracting and transferring data from one system to another—including storage systems, databases, applications or computing environments. The process involves transforming and validating data to {...}

What Is Data Mining?

Introduction to Data Mining: Data mining is the process of discovering meaningful patterns, relationships and insights from large volumes of data. It draws on techniques from statistics, machine learning and data management to surface signals that are {...}

What Is Data Modeling?

Data modeling is a key process in designing and organizing data structures to support efficient storage, retrieval and analysis of information. It is the architectural foundation for any data warehousing system, and effective data modeling can help o{...}

What is Data Modernization?

Data modernization is the comprehensive transformation of an organization’s data infrastructure, practices and tools to enable agility, innovation and data-driven decision making. It is not a single technology upgrade or a one-time project. Instead, {...}

What is Data Observability?

Data observability is the practice of continuously monitoring the health, quality, reliability and performance of data systems, from ingestion pipelines to storage layers to downstream analytics, so organizations can detec{...}

What Is Data Quality?

Data quality measures how well data meets an organization's standards for accuracy, completeness, consistency, validity, timeliness and uniqueness. High-quality data is fit for its intended purpose, whether for analytics, AI, reporting or operational{...}

What is Data Security?

Data security is a set of practices and technologies that protect digital data from unauthorized access, theft, corruption, poisoning or accidental loss while preserving its confidentiality, integrity and availability across the data lifecycle. Data {...}

What Is Data Storytelling?

{...}

What is Data Visualization?

Data visualization is the process of converting raw data into visual formats that make patterns and relationships easier to interpret. Translating raw data into formats like charts, plots, or maps brings abstract information into a spatial structure {...}

What is Directed Acyclic Graph (DAG)?

A directed acyclic graph, commonly known as a DAG, is a foundational concept in data engineering, analytics and AI. It provides a structured way to represent tasks, dependencies and flows of information. Whether you are building a data pipeline, orch{...}

What is Extract, Load, Transform? (ELT)

ELT, short for extract, load, transform, is a modern data integration approach designed for cloud-native analytics platforms. In an ELT pipeline, data is first extracted from source systems, then loaded directly into a central data repository and fin{...}

What is Feature Engineering?

Feature engineering is the process of transforming raw data into relevant features for use by machine learning models. It involves selecting and creating input variables (features) that help ML algorithms learn patterns more effectively and make accu{...}

What Is Hadoop?

Apache Hadoop is an open source, Java-based software platform that manages data processing and storage for big data applications. The platform works by distributing Hadoop big data and analytics jobs across nodes in a computing cluster, breaking them{...}

What is Machine Learning vs Deep Learning?

Understand foundational distinctions and where each fits within AI. Understanding the AI, ML and DL Hierarchy: In the broader world of artificial intelligence (AI), the concepts of machine learning and deep learning are often confused. AI is the broad f{...}

What is OLAP? Understanding Online Analytical Processing for Business Intelligence

OLAP is a way to analyze data across multiple dimensions quickly and interactively. Online analytical processing structures information so users can explore trends and investigate performance questions without writing new queries for each step. By st{...}

What is Online Transaction Processing (OLTP)?

OLTP, or Online Transaction Processing, is a type of data processing that can efficiently handle large numbers of short, fast transactions with low latency. At its core, OLTP is designed to store and retrieve data quickly. It focuses on day-to-day es{...}

What Is Operational Machine Learning?

Author: Kevin Stumpf, Co-founder and CTO. In 2015, when we started rolling out Uber’s Machine Learning Platform, Michelangelo, we noticed an interesting pattern: 80% of the ML models launched on the platform powered operational machine learning use cas{...}

What is Predictive Analytics?

Predictive analytics is a form of advanced analytics that uses historical and real-time data, statistical modeling, data mining, and machine learning to identify patterns and forecast future outcomes and trends. Organizations use predictive analytics{...}

What Is the Model Context Protocol (MCP)? A Practical Guide to AI Integration

Introduction: Understanding the Model Context Protocol. The Model Context Protocol (MCP) is an open standard that enables AI applications to connect seamlessly with external data sources, tools, and systems. Think of the Model Context Protocol as a USB{...}