-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-55285][SQL][PYTHON] Fix the initialization of PythonArrowInput
#54068
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
JIRA Issue Information=== Improvement SPARK-55285 === This comment was automatically generated by GitHub Actions |
PythonArrowInputPythonArrowInput
| throw SparkException.internalError( | ||
| s"Unsupported Arrow compression codec: $other. Supported values: none, zstd, lz4") | ||
| } | ||
| protected val unloader = new VectorUnloader(root, true, codec, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unloader is only used in BasicPythonArrowInput, move it to the subclass
|
A simple example to illustrate the initialization order issue: scala> trait A {
val a: Int
val b = a + 1
}
// defined trait A
scala> class B extends A { override val a: Int = 2 }
// defined class B
scala> val b = new B
val b: B = B@1492c9d
scala> b.b
val res0: Int = 1
scala> trait A {
def a: Int
val b = a + 1
}
// defined trait A
scala>
scala> class B extends A { override val a: Int = 2 }
// defined class B
scala> val b = new B
val b: B = B@3d741969
scala> b.b
val res1: Int = 1
scala>
scala> trait A {
def a: Int
def b = a + 1
}
// defined trait A
scala> class B extends A { override val a: Int = 2 }
// defined class B
scala> val b = new B
val b: B = B@5fca8642
scala> b.b
val res2: Int = 3
scala> trait A {
def a: Int
lazy val b = a + 1
}
// defined trait A
scala> class B extends A { override val a: Int = 2 }
// defined class B
scala> val b = new B
val b: B = B@2536edc3
scala> b.b
val res3: Int = 3see https://docs.scala-lang.org/tutorials/FAQ/initialization-order.html for more details |
|
|
||
| private val sqlConf = SQLConf.get | ||
|
|
||
| // Use lazy val to initialize the fields before these are accessed in [[PythonArrowInput]]'s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we didn't apply such fix in all python runners
PythonArrowInputPythonArrowInput
What changes were proposed in this pull request?
Delay the initialization of
PythonArrowInputWhy are the changes needed?
1, the initialization of
PythonArrowInputis too earlywe have such code around the python plans to resolve
arrowSchemainitialization with nullschema/timeZoneId.If I revert such code then some Python UDF tests fails due to the
Cannot invoke "org.apache.spark.sql.types.StructType.map(scala.Function1)" because "schema" is null, see https://github.com/zhengruifeng/spark/actions/runs/21478957166/job/61870333678Current fix is kind of tricky and doesn't cover all python nodes (e.g, we don't do it for
ArrowPythonUDTFRunnerand the behvaior might be undefined), and I suspect we still have similar issues on othervals in some cases, andoverride vals from subclasses are not always respected (Anytreated as null, integers treated as zero, etc) .To resolve it, I think we can change
schema,timeZoneId,errorOnDuplicatedFieldNames,largeVarTypestodefsince they are only used once (to get the arrow schema), and makeallocator/rootlazy2, in case of mixin of
ArrowPythonRunnerandGroupedPythonArrowInputlikespark/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowWindowPythonEvaluatorFactory.scala
Lines 371 to 381 in a03bedb
the
runnerhere actually inherits theallocator/rootfromArrowPythonWithNamedArgumentRunner->RowInputArrowPythonRunner->BasicPythonArrowInput->PythonArrowInput, but theallocatorandrootare not used in thisrunner, that's because the theGroupedPythonArrowInputhas override thedef newWritermethod, and theGroupedPythonArrowInputwill create anotherallocator/rootin its call ofcreateAndStartArrowWriter.In this case, we also need to make
allocatorandrootlazy to avoid unnecessary resource allocation.Does this PR introduce any user-facing change?
undefined behavior / failures due to weird initialization order -> clearly respect subclass's override
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
No