Single choice: A student chains several DataFrame transformations and sees no work happen until calling count(). Which Spark idea explains this behavior?

A. Replication
B. Schema evolution
C. Lazy evaluation
D. Write concern
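
The deferred-execution behavior described in this question can be sketched without Spark. The toy class below is a hypothetical stand-in for a DataFrame, not Spark's API: `LazyPipeline`, `is_even`, and the sample range are assumptions made for illustration. It only records transformations and does no work until the `count()` action is called.

```python
class LazyPipeline:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def filter(self, predicate):
        # Transformation: extend the recorded plan, touch no data.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def map(self, fn):
        # Transformation: extend the recorded plan, touch no data.
        return LazyPipeline(self._rows, self._steps + (("map", fn),))

    def count(self):
        # Action: only now is the recorded plan executed.
        data = iter(self._rows)
        for kind, fn in self._steps:
            data = filter(fn, data) if kind == "filter" else map(fn, data)
        return sum(1 for _ in data)


calls = []  # records every row the predicate actually inspects


def is_even(x):
    calls.append(x)
    return x % 2 == 0


pipeline = LazyPipeline(range(10)).filter(is_even).map(lambda x: x * 10)
work_before_action = len(calls)  # nothing has been evaluated yet
result = pipeline.count()        # triggers the whole chain
work_after_action = len(calls)
```

In this sketch `work_before_action` stays at zero because chaining only builds a plan; the predicate runs once per input row when `count()` is called.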

Related questions

Single choice: An API source changes one complaint category label and begins sending some coordinates as strings instead of numbers. The pipeline uses incremental date windows and aims to keep outputs idempotent. Which response best matches the course philosophy?

A. Land the raw data, detect the schema or category issue during validation, stop trusted downstream outputs if critical rules fail, and rerun the same window after fixing the transformation logic
B. Advance the watermark and accept the corrupted rows because the change is minor
C. Edit the raw Bronze files in place so the source appears unchanged
D. Ignore the issue until the final report is due

Single choice: A MapReduce-style workflow extracts raw records, cleans them, joins a lookup table, and aggregates results. The join step fails due to a transient network issue. Which architecture most directly minimizes recovery cost without sacrificing correctness?

A. Store no intermediate outputs and rerun the whole pipeline from the source every time
B. Persist validated intermediate outputs between major stages so the orchestrator can retry from the failed boundary
C. Append partial join output into the final aggregate and reconcile later by hand
D. Convert the workflow into one notebook cell so fewer steps can fail
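
The idea of retrying from a failed stage boundary can be illustrated with a minimal sketch. Everything here is hypothetical: `run_stage`, `run_pipeline`, the JSON checkpoint files, and the sample rows are assumptions for illustration, not a real orchestrator's API. Each stage persists its output only after it succeeds, so a rerun skips the stages that already completed.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, fn, upstream, checkpoint_dir, log):
    """Run one stage unless a persisted output for it already exists."""
    path = Path(checkpoint_dir) / f"{name}.json"
    if path.exists():
        log.append(f"skip {name}")
        return json.loads(path.read_text())
    result = fn(upstream)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(result))
    tmp.replace(path)  # publish only after the stage fully succeeded
    log.append(f"ran {name}")
    return result


def run_pipeline(checkpoint_dir, join_fn, log):
    raw = run_stage("extract", lambda _: [{"id": 1, "v": " a "}, {"id": 2, "v": "b"}],
                    None, checkpoint_dir, log)
    clean = run_stage("clean", lambda rows: [{**r, "v": r["v"].strip()} for r in rows],
                      raw, checkpoint_dir, log)
    joined = run_stage("join", join_fn, clean, checkpoint_dir, log)
    return run_stage("aggregate", lambda rows: {"rows": len(rows)},
                     joined, checkpoint_dir, log)


attempts = {"join": 0}


def flaky_join(rows):
    # Simulated transient failure: the first call raises, later calls succeed.
    attempts["join"] += 1
    if attempts["join"] == 1:
        raise ConnectionError("transient network issue")
    lookup = {1: "x", 2: "y"}
    return [{**r, "label": lookup[r["id"]]} for r in rows]


log = []
with tempfile.TemporaryDirectory() as d:
    try:
        run_pipeline(d, flaky_join, log)
    except ConnectionError:
        log.append("join failed")
    result = run_pipeline(d, flaky_join, log)  # retry resumes at the join
```

On the second run the log shows extract and clean being skipped, so only the join and the aggregate are recomputed.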

Single choice: A product team serves customer profiles from MongoDB, wants strong read performance for repeated dashboard filters, and needs to survive node failure without immediate manual intervention. Which design is most consistent with the course material?

A. Use one giant CSV with daily batch exports only
B. Use a single server with no indexes so writes stay simple
C. Model records as documents with indexes for common queries and run them in a replica set so a new primary can be elected after failure
D. Shard randomly before deciding what the query patterns are

Single choice: A team stores one month of taxi data as a single unpartitioned CSV, but most business queries ask for one day and only three columns. They are considering moving to a Silver Parquet layout partitioned by pickup date. Why is this change powerful in the course architecture?

A. It matters only because Parquet files can be opened in spreadsheets
B. It guarantees no future schema drift
C. It replaces the need for any compute engine
D. It combines partition pruning with columnar reads, reducing unnecessary I/O for the most common analytical filters

Single choice: A financial service during a network partition would rather reject some balance checks than risk returning stale money information. At the same time, it logs all requests for later replay and debugging. Which statement best interprets this design?

A. It favors availability and removes the need for audit logs
B. It proves partition tolerance can be disabled in practice
C. It favors consistency over availability during the partition, while separate logs support later analysis rather than replacing the trade-off
D. It shows CAP applies only to batch systems, not user-facing ones

Single choice: A Spark pipeline reads partitioned Parquet, filters one week of data, joins a dimension table, and then performs a large groupBy. The Spark UI later shows one task much slower than all others. Which diagnosis is most likely and most useful?

A. The SparkSession was probably created twice
B. Parquet disables all query optimization after a join
C. The filter caused the data to become row oriented
D. Skew likely sent disproportionate data for one key or partition into a costly shuffle stage
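
One way to check whether a single key dominates a shuffle is to compare per-key row counts against a typical key. The sketch below uses plain Python rather than Spark; `skew_report`, the threshold of 5x, and the zone names are all hypothetical values chosen for illustration.

```python
from collections import Counter


def skew_report(keys, threshold=5.0):
    """Return the (upper) median rows per key and any keys far above it."""
    counts = Counter(keys)
    sizes = sorted(counts.values())
    median = sizes[len(sizes) // 2]
    # A key is flagged as "hot" when it holds more than threshold x the median.
    hot = {k: n for k, n in counts.items() if n > threshold * median}
    return median, hot


# Hypothetical grouping keys: one zone carries almost all of the rows.
keys = ["zone_a"] * 1000 + ["zone_b"] * 10 + ["zone_c"] * 12 + ["zone_d"] * 9
median, hot = skew_report(keys)
```

A key flagged this way would land in one shuffle partition and could explain a single straggler task.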

Single choice: A city pipeline ingests daily complaint data from an API into Bronze JSON, normalizes it to Silver Parquet, and publishes Gold aggregates. The team wants reruns of the same date window to be safe, fast to recover after a failure, and easy to audit later. Which design best satisfies all three goals together?

A. Append every rerun into the Gold folder and keep audit notes only in email
B. Use deterministic date-window outputs, safe temporary writes, quality gates before Gold, and a run log recording the source window and code version
C. Skip Bronze storage so fewer layers need documentation
D. Store only the final dashboard result and rebuild everything manually when needed
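
A rerun-safe publish step for one date window might look like the sketch below. It is a minimal illustration under stated assumptions: `publish_window`, the `window=<date>.json` naming, the non-negative-count rule, and the sample window and code version are hypothetical, not part of any course codebase.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_window(out_dir, window, rows, code_version, run_log):
    """Write one date window's output so reruns overwrite instead of append."""
    # Quality gate: refuse to publish if a critical rule fails.
    if any(r.get("count", -1) < 0 for r in rows):
        raise ValueError(f"quality gate failed for window {window}")
    final = Path(out_dir) / f"window={window}.json"  # deterministic path
    # Safe temporary write: the final file is never left half-written.
    fd, tmp = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(rows, f)
    os.replace(tmp, final)  # atomic swap within the same directory
    # Run log: record what was processed and by which code version.
    run_log.append({"window": window, "code_version": code_version,
                    "rows": len(rows)})
    return final


run_log = []
with tempfile.TemporaryDirectory() as d:
    rows = [{"category": "noise", "count": 3}]
    publish_window(d, "2024-01-01", rows, "abc123", run_log)
    publish_window(d, "2024-01-01", rows, "abc123", run_log)  # rerun, same window
    files = sorted(os.listdir(d))
    content = json.loads((Path(d) / files[0]).read_text())
```

Running the same window twice leaves exactly one output file with the same content, while the run log keeps an entry for each attempt.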

Single choice: A distributed analytics job runs on data stored across HDFS. One DataNode fails during the run, and the scheduler must minimize wasted work and network movement. Which explanation best combines the relevant course concepts?

A. Replication lets another block copy be read, while data locality and scheduling try to keep replacement work near the surviving data
B. The NameNode stores all real data, so the failed DataNode does not matter
C. CAP theorem guarantees the job finishes with no retries
D. The system switches to schema-on-read to avoid the failure