[Jan-2025] Databricks Databricks-Certified-Data-Engineer-Associate Exam Basic Questions With Answers [Q10-Q31]


NEW QUESTION # 10
A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to a data analytics dashboard for a retail use case. The job has a Databricks SQL query that returns the number of store-level records where sales is equal to zero. The data engineer wants their entire team to be notified via a messaging webhook whenever this value is greater than 0.
Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of stores with $0 in sales is greater than zero?

  • A. They can set up an Alert without notifications.
  • B. They can set up an Alert with one-time notifications.
  • C. They can set up an Alert with a new email alert destination.
  • D. They can set up an Alert with a new webhook alert destination.
  • E. They can set up an Alert with a custom template.

Answer: D


NEW QUESTION # 11
A data engineer is working with two tables. Each of these tables is displayed below in its entirety.

The data engineer runs the following query to join these tables together:

Which of the following will be returned by the above query?

  • A. Option E
  • B. Option C
  • C. Option D
  • D. Option B
  • E. Option A

Answer: B


NEW QUESTION # 12
Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?

  • A. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.
  • B. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
  • C. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
  • D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
  • E. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.

Answer: D

Explanation:
The CREATE STREAMING LIVE TABLE syntax is used when you want to create Delta Live Tables (DLT) tables that are designed for processing data incrementally. This is typically used when your data pipeline involves streaming or incremental data updates, and you want the table to stay up to date as new data arrives.
It allows you to define tables that can handle data changes incrementally without the need for full table refreshes.


NEW QUESTION # 13
In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

  • A. When another task needs to fail before the new task begins
  • B. When another task has the same dependency libraries as the new task
  • C. When another task needs to be replaced by the new task
  • D. When another task needs to successfully complete before the new task begins
  • E. When another task needs to use as little compute resources as possible

Answer: D

Explanation:
A data engineer can create a multi-task job in Databricks that consists of multiple tasks that run in a specific order. Each task can have one or more dependencies, which are other tasks that must run before the current task. The Depends On field of a new Databricks Job Task allows the data engineer to specify the dependencies of the task. The data engineer should select a task in the Depends On field when they want the new task to run only after the selected task has successfully completed. This can help the data engineer to create a logical sequence of tasks that depend on each other's outputs or results. For example, a data engineer can create a multi-task job that consists of the following tasks:
Task A: Ingest data from a source using Auto Loader
Task B: Transform the data using Spark SQL
Task C: Write the data to a Delta Lake table
Task D: Analyze the data using Spark ML
Task E: Visualize the data using Databricks SQL
In this case, the data engineer can set the dependencies of each task as follows:
Task A: No dependencies
Task B: Depends on Task A
Task C: Depends on Task B
Task D: Depends on Task C
Task E: Depends on Task D
This way, the data engineer can ensure that each task runs only after the previous task has successfully completed, and the data flows smoothly from ingestion to visualization.
The other options are incorrect because they do not describe valid scenarios for selecting a task in the Depends On field. The Depends On field does not affect the following aspects of a task:
Whether the task needs to be replaced by another task
Whether the task needs to fail before another task begins
Whether the task has the same dependency libraries as another task
Whether the task needs to use as little compute resources as possible
Reference: Create a multi-task job, Run tasks conditionally in a Databricks job, Databricks Jobs.
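
The dependency chain described above can be sketched in plain Python. This is only an illustration of the ordering that the Depends On field expresses, not the Databricks Jobs API; the task names are hypothetical labels for Tasks A through E.

```python
from graphlib import TopologicalSorter

# Hypothetical task names mirroring Tasks A-E above. Each key maps to the
# set of tasks it depends on (the role played by the Depends On field).
depends_on = {
    "A_ingest": set(),
    "B_transform": {"A_ingest"},
    "C_write": {"B_transform"},
    "D_analyze": {"C_write"},
    "E_visualize": {"D_analyze"},
}


def run_order(dependencies):
    """Return an execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(depends_on)
print(order)
```

Because each task depends on exactly one predecessor, only one order is possible: ingestion first and visualization last.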


NEW QUESTION # 14
A data engineer is attempting to drop a Spark SQL table my_table and runs the following command:
DROP TABLE IF EXISTS my_table;
After running this command, the engineer notices that the data files and metadata files have been deleted from the file system.
Which of the following describes why all of these files were deleted?

  • A. The table was managed
  • B. The table was external
  • C. The table's data was smaller than 10 GB
  • D. The table's data was larger than 10 GB
  • E. The table did not have a location

Answer: A

Explanation:
The reason why all of the data files and metadata files were deleted from the file system after dropping the table is that the table was managed. A managed table is a table that is created and managed by Spark SQL. It stores both the data and the metadata in the default location specified by the spark.sql.warehouse.dir configuration property. When a managed table is dropped, both the data and the metadata are deleted from the file system.
Option B is not correct, as an external table stores its data in a user-specified location and keeps only the metadata in the Spark SQL catalog. When an external table is dropped, only the metadata is deleted from the catalog, and the data files are preserved in the file system.
Options C and D are not correct, as the size of the table's data does not affect the behavior of dropping the table. Whether the table's data is smaller or larger than 10 GB, the data files are deleted if the table is managed and preserved if the table is external.
Option E is not correct, as a table must have a location to store the data. If the location is not specified by the user, it will use the default location for managed tables. Therefore, a table without a location is a managed table, and dropping it will delete both the data and the metadata.
References:
* Managing Tables
* [Databricks Data Engineer Professional Exam Guide]


NEW QUESTION # 15
A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?

  • A. Temporary view
  • B. View
  • C. Delta Table
  • D. Database
  • E. Spark SQL Table

Answer: A

Explanation:
A temporary view is a relational object that is scoped to the SparkSession that created it and is not registered in the metastore. It does not copy or store any physical data; it only saves the query that defines the view. Because its lifetime is tied to that SparkSession, it does not persist across sessions or applications. A temporary view is useful for accessing the same data multiple times within the same notebook or session without incurring additional storage costs. The other options either store physical data (C, E), persist across sessions (B), or are containers rather than relational objects that return rows (D). Reference: Databricks Documentation - Temporary View, Databricks Community - How do temp views actually work?, Databricks Community - What's the difference between a Global view and a Temp view?, Big Data Programmers - Temporary View in Databricks.
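
As a minimal sketch of the scenario in this question, the function below joins two tables and registers the result as a temporary view. The function name and its parameters are hypothetical; spark is assumed to be an active SparkSession, as provided in a Databricks notebook.

```python
def register_joined_view(spark, left_table, right_table, key, view_name):
    """Join two tables and expose the result as a session-scoped temporary view.

    No data is copied: the view only stores the query plan, and it disappears
    when the SparkSession ends.
    """
    joined = spark.table(left_table).join(spark.table(right_table), on=key)
    joined.createOrReplaceTempView(view_name)
    return joined
```

A later cell in the same session could then query the view by name, for example with spark.sql("SELECT * FROM " + view_name).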


NEW QUESTION # 16
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

  • A. trigger(once="5 seconds")
  • B. trigger(continuous="5 seconds")
  • C. trigger()
  • D. trigger(processingTime="5 seconds")
  • E. trigger("5 seconds")

Answer: D

Explanation:
The processingTime option specifies a time-based trigger interval for fixed-interval micro-batches. The query therefore starts a micro-batch every 5 seconds, provided the previous batch has finished. This option suits near-real-time workloads that need a consistent processing frequency. The other options do not achieve this: once expects the boolean True rather than an interval and processes a single batch (A); continuous selects the experimental continuous processing mode, where the value is a checkpoint interval rather than a micro-batch interval (B); trigger() with no arguments is not valid in PySpark (C); and the interval cannot be passed as a positional argument (E). References: Databricks Documentation - Configure Structured Streaming trigger intervals, Databricks Documentation - Trigger.
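
The question's original code block is not reproduced here, so the following is only a minimal sketch of where the trigger call sits in a PySpark streaming write. The function name, the target table, and the checkpoint path are hypothetical; stream_df is assumed to be a streaming DataFrame such as one returned by spark.readStream.table(...).

```python
def start_five_second_stream(stream_df, target_table, checkpoint_path):
    """Start a streaming write that runs one micro-batch every 5 seconds."""
    return (
        stream_df.writeStream
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="5 seconds")  # fixed-interval micro-batches
        .toTable(target_table)
    )
```

The trigger arguments are keyword-only in PySpark, which is why the interval has to be passed as processingTime="5 seconds" rather than as a bare string.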


NEW QUESTION # 17
Which of the following describes the storage organization of a Delta table?

  • A. Delta tables are stored in a collection of files that contain data, history, metadata, and other attributes.
  • B. Delta tables are stored in a collection of files that contain only the data stored within the table.
  • C. Delta tables are stored in a single file that contains only the data stored within the table.
  • D. Delta tables store their data in a single file and all metadata in a collection of files in a separate location.
  • E. Delta tables are stored in a single file that contains data, history, metadata, and other attributes.

Answer: A

Explanation:
Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks lakehouse. Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake stores its data and metadata in a collection of files in a directory on a cloud storage system, such as AWS S3 or Azure Data Lake Storage. Each Delta table has a transaction log that records the history of operations performed on the table, such as insert, update, delete, and merge. The transaction log also stores the schema and partitioning information of the table. The transaction log enables Delta Lake to provide ACID guarantees, time travel, schema enforcement, and other features. References:
* What is Delta Lake? | Databricks on AWS
* Quickstart - Delta Lake Documentation
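
To make the "collection of files" concrete, the sketch below walks a Delta table directory and separates Parquet data files from the files under _delta_log, the transaction log directory. The function name is hypothetical, and the sketch assumes the table directory is reachable through the local file system.

```python
import os


def summarize_delta_directory(table_path):
    """Split the files under a Delta table directory into data files and log files."""
    data_files, log_files = [], []
    for root, _dirs, files in os.walk(table_path):
        # Anything below _delta_log belongs to the transaction log.
        in_log = "_delta_log" in os.path.relpath(root, table_path).split(os.sep)
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), table_path)
            if in_log:
                log_files.append(rel)
            elif name.endswith(".parquet"):
                data_files.append(rel)
    return sorted(data_files), sorted(log_files)
```

On a real table, the log files are JSON commit files with zero-padded version numbers, plus periodic checkpoint files, which together hold the history and metadata that the answer refers to.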


NEW QUESTION # 18
A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.
Which of the following describes why Auto Loader inferred all of the columns to be of the string type?

  • A. JSON data is a text-based format
  • B. Auto Loader only works with string data
  • C. All of the fields had at least one null value
  • D. Auto Loader cannot infer the schema of ingested data
  • E. There was a type mismatch between the specific schema and the inferred schema

Answer: A

Explanation:
JSON data is a text-based format that represents data as a collection of name-value pairs. By default, when Auto Loader infers the schema of JSON data, it treats all columns as strings. This is because JSON data can have varying data types for the same column across different files or records, and Auto Loader does not attempt to reconcile these differences. For example, a column named "age" may have integer values in some files, but string values in others. To avoid data loss or errors, Auto Loader infers the column as a string type. However, Auto Loader also provides an option to infer more precise column types based on the sample data. This option is called cloudFiles.inferColumnTypes and it can be set to true or false. When set to true, Auto Loader tries to infer the exact data types of the columns, such as integers, floats, booleans, or nested structures. When set to false, Auto Loader infers all columns as strings. The default value of this option is false. Reference: Configure schema inference and evolution in Auto Loader, Schema inference with auto loader (non-DLT and DLT), Using and Abusing Auto Loader's Inferred Schema, Explicit path to data or a defined schema required for Auto loader.
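
The option mentioned above can be sketched as follows. The function name and the two paths are hypothetical; spark is assumed to be an active SparkSession on Databricks, where the cloudFiles source is available.

```python
def read_json_with_inferred_types(spark, source_path, schema_path):
    """Read JSON files with Auto Loader and ask it to infer column types.

    Without cloudFiles.inferColumnTypes, every JSON column is inferred as a string.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_path)  # where the inferred schema is tracked
        .option("cloudFiles.inferColumnTypes", "true")
        .load(source_path)
    )
```

Schema hints (cloudFiles.schemaHints) are the other route mentioned in the question for pinning the types of specific columns.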


NEW QUESTION # 19
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

  • A. trigger(once="5 seconds")
  • B. trigger(continuous="5 seconds")
  • C. trigger()
  • D. trigger(processingTime="5 seconds")
  • E. trigger("5 seconds")

Answer: D

Explanation:
The processingTime option specifies a time-based trigger interval for fixed-interval micro-batches. The query therefore starts a micro-batch every 5 seconds, provided the previous batch has finished. This option suits near-real-time workloads that need a consistent processing frequency. The other options do not achieve this: once expects the boolean True rather than an interval and processes a single batch (A); continuous selects the experimental continuous processing mode, where the value is a checkpoint interval rather than a micro-batch interval (B); trigger() with no arguments is not valid in PySpark (C); and the interval cannot be passed as a positional argument (E). Reference: Databricks Documentation - Configure Structured Streaming trigger intervals, Databricks Documentation - Trigger.


NEW QUESTION # 20
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
What is the expected behavior when a batch of data containing data that violates these constraints is processed?

  • A. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
  • B. Records that violate the expectation cause the job to fail.
  • C. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
  • D. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
  • E. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

Answer: B

Explanation:
The expected behavior when a batch of data containing data that violates the expectation is processed is that the job will fail. This is because the expectation clause has the ON VIOLATION FAIL UPDATE option, which means that if any record in the batch does not meet the expectation, the entire batch will be rejected and the job will fail. This option is useful for enforcing strict data quality rules and preventing invalid data from entering the target dataset.
Option A is not correct, as it describes the default behavior when no ON VIOLATION clause is given: invalid records are written to the target dataset and the violation is counted in the event log.
Option C is not correct, as expectations do not add a field to the target dataset to flag invalid records.
Option D is not correct, as it describes ON VIOLATION DROP ROW, which drops invalid records from the target dataset and records the number dropped in the event log.
Option E is not correct, as expectations have no built-in quarantine action; loading invalid records into a quarantine table requires separate pipeline logic.
Reference:
Delta Live Tables Expectations
[Databricks Data Engineer Professional Exam Guide]
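
The three documented outcomes of an expectation (keep and log, drop the row, fail the update) can be imitated in plain Python. This is a toy model for study purposes, not the Delta Live Tables implementation; the function, the exception class, and the mode names "warn", "drop", and "fail" are all hypothetical.

```python
from datetime import date


class ExpectationFailed(Exception):
    """Raised to mimic an update that stops on a violated expectation."""


def apply_expectation(rows, predicate, on_violation="warn"):
    """Apply one expectation to a batch and return (kept_rows, violation_count).

    "warn" keeps invalid rows (the behaviour with no ON VIOLATION clause),
    "drop" mimics ON VIOLATION DROP ROW, "fail" mimics ON VIOLATION FAIL UPDATE.
    """
    if on_violation not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown action: {on_violation}")
    violations = [row for row in rows if not predicate(row)]
    if violations and on_violation == "fail":
        raise ExpectationFailed(f"{len(violations)} row(s) violated the expectation")
    if on_violation == "drop":
        kept = [row for row in rows if predicate(row)]
    else:
        kept = list(rows)
    return kept, len(violations)


# One valid and one invalid record for the valid_timestamp constraint.
batch = [{"timestamp": date(2021, 5, 1)}, {"timestamp": date(2019, 3, 1)}]


def valid_timestamp(row):
    return row["timestamp"] > date(2020, 1, 1)
```

With this batch, "warn" keeps both rows, "drop" keeps only the 2021 row, and "fail" raises before anything is written, which corresponds to answer B.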


NEW QUESTION # 21
In order for Structured Streaming to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing, which of the following two approaches is used by Spark to record the offset range of the data being processed in each trigger?

  • A. Write-ahead Logs and Idempotent Sinks
  • B. Structured Streaming cannot record the offset range of the data being processed in each trigger.
  • C. Replayable Sources and Idempotent Sinks
  • D. Checkpointing and Write-ahead Logs
  • E. Checkpointing and Idempotent Sinks

Answer: D

Explanation:
Structured Streaming uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. This ensures that the engine can reliably track the exact progress of the processing and handle any kind of failure by restarting and/or reprocessing. Checkpointing is the mechanism of saving the state of a streaming query to fault-tolerant storage (such as HDFS) so that it can be recovered after a failure. Write-ahead logs are files that record the offset range of the data being processed in each trigger and are written to the checkpoint location before the processing starts. These logs are used to recover the query state and resume processing from the last processed offset range in case of a failure. Reference: Structured Streaming Programming Guide, Fault Tolerance Semantics
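
The write-ahead idea can be illustrated with a small file-based toy: the offset range is logged before processing, and a commit marker is written only afterwards. This is a simplified sketch, not Spark's implementation; the function names are hypothetical, and the offsets and commits subdirectory names are borrowed from the layout of a Structured Streaming checkpoint directory.

```python
import json
import os
import tempfile


def run_trigger(checkpoint_dir, batch_id, start, end, process):
    """Log the planned offset range first, then process it, then mark it committed."""
    offsets_dir = os.path.join(checkpoint_dir, "offsets")
    commits_dir = os.path.join(checkpoint_dir, "commits")
    os.makedirs(offsets_dir, exist_ok=True)
    os.makedirs(commits_dir, exist_ok=True)
    with open(os.path.join(offsets_dir, str(batch_id)), "w") as f:
        json.dump({"start": start, "end": end}, f)  # write-ahead log entry
    process(start, end)  # may fail part-way through
    open(os.path.join(commits_dir, str(batch_id)), "w").close()


def pending_batches(checkpoint_dir):
    """Return {batch_id: offset_range} for batches that were logged but never committed."""
    offsets_dir = os.path.join(checkpoint_dir, "offsets")
    commits_dir = os.path.join(checkpoint_dir, "commits")
    logged = set(os.listdir(offsets_dir)) if os.path.isdir(offsets_dir) else set()
    committed = set(os.listdir(commits_dir)) if os.path.isdir(commits_dir) else set()
    pending = {}
    for name in sorted(logged - committed, key=int):
        with open(os.path.join(offsets_dir, name)) as f:
            pending[int(name)] = json.load(f)
    return pending


def demo():
    """Run one successful batch and one that crashes, then report what must be replayed."""
    def crash(start, end):
        raise RuntimeError("simulated failure mid-batch")

    with tempfile.TemporaryDirectory() as checkpoint:
        run_trigger(checkpoint, 0, 0, 100, lambda start, end: None)
        try:
            run_trigger(checkpoint, 1, 100, 250, crash)
        except RuntimeError:
            pass
        return pending_batches(checkpoint)
```

After the simulated crash, the logged but uncommitted range for batch 1 is still on disk, so a restart knows exactly which offsets to reprocess.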


NEW QUESTION # 22
Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

  • A.
  • B.
  • C.
  • D.
  • E.

Answer: C

Explanation:
The best practice is to use "complete" as the output mode instead of "append" when working with aggregated tables. Because the Gold layer holds the final aggregated tables, the only option that uses the complete output mode is option C.


NEW QUESTION # 23
A data engineer needs to apply custom logic to string column city in table stores for a specific use case. In order to apply this custom logic at scale, the data engineer wants to create a SQL user-defined function (UDF).
Which of the following code blocks creates this SQL UDF?

  • A.
  • B.
  • C.
  • D.
  • E.

Answer: A

Explanation:
https://www.databricks.com/blog/2021/10/20/introducing-sql-user-defined-functions.html


NEW QUESTION # 24
A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells.
Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?

  • A. It is not possible to use SQL in a Python notebook
  • B. They can simply write SQL syntax in the cell
  • C. They can add %sql to the first line of the cell
  • D. They can change the default language of the notebook to SQL
  • E. They can attach the cell to a SQL endpoint rather than a Databricks cluster

Answer: C

Explanation:
In Databricks, you can use different languages within the same notebook by using magic commands. Magic commands are special commands that start with a percentage sign (%) and change how a cell is interpreted. To use SQL within a cell of a Python notebook, add %sql to the first line of the cell. Databricks then interprets the rest of the cell as SQL and runs it against the current database, which can be changed with the USE statement. The result of the query is displayed as a table, which can also be visualized as a chart. In a Python notebook, the result of the most recent %sql cell is also available to later Python cells as a Spark DataFrame in the implicit variable _sqldf. Option A is incorrect, as it is possible to use SQL in a Python notebook using magic commands. Option B is incorrect, as simply writing SQL syntax in the cell will result in a syntax error, because the cell will still be interpreted as Python code. Option D is incorrect, as changing the default language of the notebook to SQL affects every cell that lacks a magic command, not just one. Option E is incorrect, as the language is chosen per cell with a magic command, whereas compute is attached to the notebook as a whole rather than to a single cell. Reference: Use SQL in Notebooks - Knowledge Base - Noteable, [SQL magic commands - Databricks], [Databricks SQL Guide - Databricks]


NEW QUESTION # 25
A data engineer needs to create a table in Databricks using data from a CSV file at location /path/to/csv.
They run the following command:

Which of the following lines of code fills in the above blank to successfully complete the task?

  • A. USING CSV
  • B. None of these lines of code are needed to successfully complete the task
  • C. USING DELTA
  • D. FROM CSV
  • E. FROM "path/to/csv"

Answer: A

Explanation:
In Spark SQL, the CREATE TABLE statement declares the format of the source data with a USING clause, so USING CSV tells Databricks to read the files at /path/to/csv as CSV. USING DELTA is not correct because the files at that location are CSV files rather than a Delta table, and leaving the clause out is not correct because Databricks then defaults to the Delta format. FROM CSV and FROM "path/to/csv" are not correct because FROM on its own is not a CREATE TABLE clause for declaring a data source.


NEW QUESTION # 26
A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.
Which approach can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

  • A. They can set up the dashboard's SQL endpoint to be serverless.
  • B. They can turn on the Auto Stop feature for the SQL endpoint.
  • C. They can reduce the cluster size of the SQL endpoint.
  • D. They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.

Answer: B

Explanation:
To minimize the total running time of the SQL endpoint used in the refresh schedule of a dashboard in Databricks, the most effective approach is to utilize the Auto Stop feature. This feature allows the SQL endpoint to automatically stop after a period of inactivity, ensuring that it only runs when necessary, such as during the dashboard refresh or when actively queried. This minimizes resource usage and associated costs by ensuring the SQL endpoint is not running idle outside of these operations.
Reference:
Databricks documentation on SQL endpoints: SQL Endpoints in Databricks


NEW QUESTION # 27
Which of the following data workloads will utilize a Gold table as its source?

  • A. A job that aggregates uncleaned data to create standard summary statistics
  • B. A job that cleans data by removing malformatted records
  • C. A job that queries aggregated data designed to feed into a dashboard
  • D. A job that enriches data by parsing its timestamps into a human-readable format
  • E. A job that ingests raw data from a streaming source into the Lakehouse

Answer: C


NEW QUESTION # 28
Which of the following describes the relationship between Bronze tables and raw data?

  • A. Bronze tables contain raw data with a schema applied.
  • B. Bronze tables contain more truthful data than raw data.
  • C. Bronze tables contain aggregates while raw data is unaggregated.
  • D. Bronze tables contain less data than raw data files.
  • E. Bronze tables contain a less refined view of data than raw data.

Answer: A

Explanation:
Bronze tables are the first layer of a medallion architecture, which is a data design pattern used to organize data in a lakehouse. Bronze tables contain raw data ingested from various sources, such as RDBMS data, JSON files, and IoT data. The table structures in this layer correspond to the source system table structures "as-is", along with any additional metadata columns that capture the load date/time, process ID, and so on. The only transformation applied to the raw data in this layer is to apply a schema, which defines the column names and data types of the table. The schema can be inferred from the data source or specified explicitly. Applying a schema to the raw data enables the use of SQL and other structured query languages to access and analyze the data. Therefore, option A is the correct answer. Reference: What is a Medallion Architecture?, Raw Data Ingestion into Delta Lake Bronze tables using Azure Synapse Mapping Data Flow, Apache Spark + Delta Lake concepts, Delta Lake Architecture & Azure Databricks Workspace.


NEW QUESTION # 29
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

  • A. trigger(once="5 seconds")
  • B. trigger(continuous="5 seconds")
  • C. trigger()
  • D. trigger(processingTime="5 seconds")
  • E. trigger("5 seconds")

Answer: D


NEW QUESTION # 30
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
What is the expected behavior when a batch of data containing data that violates these constraints is processed?

  • A. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
  • B. Records that violate the expectation cause the job to fail.
  • C. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
  • D. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
  • E. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

Answer: B

Explanation:
The expected behavior when a batch of data containing data that violates the expectation is processed is that the job will fail. This is because the expectation clause has the ON VIOLATION FAIL UPDATE option, which means that if any record in the batch does not meet the expectation, the entire batch will be rejected and the job will fail. This option is useful for enforcing strict data quality rules and preventing invalid data from entering the target dataset.
Option A is not correct, as it describes the default behavior when no ON VIOLATION clause is given: invalid records are written to the target dataset and the violation is counted in the event log.
Option C is not correct, as expectations do not add a field to the target dataset to flag invalid records.
Option D is not correct, as it describes ON VIOLATION DROP ROW, which drops invalid records from the target dataset and records the number dropped in the event log.
Option E is not correct, as expectations have no built-in quarantine action; loading invalid records into a quarantine table requires separate pipeline logic.
References:
* Delta Live Tables Expectations
* [Databricks Data Engineer Professional Exam Guide]


NEW QUESTION # 31
......
