Sean Owen (srowen) @databricks, Austin, TX. Primary author of zxing and the Oryx project. Apache Spark PMC / committer.

sryza/aas 1402

Code to accompany Advanced Analytics with Spark from O'Reilly Media

databricks/spark-sklearn 1038

(Deprecated) Scikit-learn integration package for Apache Spark

databricks/spark-xml 306

XML data source for Spark SQL and DataFrames

srowen/cdsw-simple-serving 14

Modeling Lifecycle with ACME Occupancy Detection and Cloudera

srowen/oryx 14

Oryx 2 (incubating): Lambda architecture on Spark for real-time large scale machine learning

srowen/zxing-bsplus 11

Enhanced version of Barcode Scanner

srowen/spark 6

Mirror of Apache Spark

srowen/quatrains-rnn 3

Simple example applying Keras, TensorFlow to Nostradamus's prophecies with Cloudera Data Science Workbench

srowen/autoenc-neurons 0

Simple example of Keras, TensorFlow and autoencoders applied to text

srowen/rnn-fft 0

Experiment with RNNs to generate audio in the frequency domain

pull request comment: apache/spark

[SPARK-32574][CORE] Race condition in FsHistoryProvider listing iteration

How does this error arise, BTW? When did you encounter it, and what was the stack trace? I'm not clear on how this comes up in practice. It sounds good to synchronize, but this also serializes all reads now, which has its own implications.

yanxiaole

comment created time in 4 hours

Pull request review comment: apache/spark

[SPARK-32559][SQL] Fix the trim logic in UTF8String.toInt/toLong that didn't handle non-ASCII characters correctly

 select cast(' 1' as bigint); select cast(' 1' as float); select cast(' 1 ' as DOUBLE); select cast('1.0 ' as DEC);+select cast('1中文' as tinyint);

This is purely to educate me, but are those characters considered whitespace?

WangGuangxin

comment created time in 5 hours
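
To make the whitespace question above concrete, here is a tiny, purely illustrative Scala check (not the Spark code under review; it only shows Java's own notion of whitespace). The distinction between a CJK ideograph such as '中' and the ideographic space U+3000 is the kind of thing the trim logic has to get right.

```scala
// Illustrative only: which characters does Java consider whitespace?
object WhitespaceCheck {
  def main(args: Array[String]): Unit = {
    println(Character.isWhitespace('中'))      // false: a CJK ideograph is not whitespace
    println(Character.isWhitespace('\u3000')) // true: U+3000, the ideographic space
    println(Character.isWhitespace(' '))      // true: plain ASCII space
  }
}
```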

pull request comment: apache/spark

[SPARK-32574][CORE] Race condition in FsHistoryProvider listing iteration

Yeah, but I'm not seeing any of these operations that cause a delete. If they aren't somehow mutually excluded with a lock too, then this can't fix the problem, no?

yanxiaole

comment created time in 6 hours

pull request comment: apache/spark

[SPARK-32574][CORE] Race condition in FsHistoryProvider listing iteration

I'm not sure what the problem is here. Is it this data structure getting modified, or something else underneath it in the FS? If the latter, then this synchronization might not be a fix.

yanxiaole

comment created time in a day
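
As a rough sketch of the concern in these comments (illustrative only, not the FsHistoryProvider code): synchronizing the read path helps only if every path that mutates the listing takes the same lock; otherwise an iteration can still see the listing change underneath it.

```scala
import scala.collection.mutable

// Sketch of a listing shared between a reader thread and a cleaner thread.
object ListingRaceSketch {
  private val listing = mutable.LinkedHashMap[String, Long]()
  private val lock = new Object

  def addApp(id: String, endTime: Long): Unit = lock.synchronized {
    listing(id) = endTime
  }

  // Reader: iterating under the lock is only safe because every mutation below
  // takes the same lock; this is also what serializes all reads.
  def listApps(): Seq[String] = lock.synchronized {
    listing.keys.toSeq
  }

  // Writer: if this removal ran without the lock, the iteration above could still
  // observe concurrent modification, so locking only the reader would not be a fix.
  def expireApp(id: String): Unit = lock.synchronized {
    listing.remove(id)
  }
}
```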

pull request comment: apache/spark

[SPARK-32526][SQL]Pass DataSourceV2AnalysisBaseSuite and TreeNodeSuite related test cases in scala 2.13

Looking good. I think you're welcome to keep going and we can 'checkpoint' and commit some fixes whenever there is a logical and large enough group of them to commit. That could be now too, up to you.

LuciferYang

comment created time in a day

pull request comment: apache/spark

[WIP] more bm test hardening tweaking

Please fill out the description

agrawaldevesh

comment created time in a day

pull request comment: apache/spark

[SPARK-32319][PYSPARK] Disallow the use of unused imports

No -1 here. I don't object to the current change.

Fokko

comment created time in 2 days

pull request comment: apache/spark

[SPARK-32319][PYSPARK] Disallow the use of unused imports

No, I don't feel that way. If others would like to merge, go ahead. I would merge the narrower change myself, but do not object to the broader one.

Fokko

comment created time in 2 days

pull request comment: apache/spark

[SPARK-32319][PYSPARK] Disallow the use of unused imports

I did. If most of the changes are exceptions, that makes me wonder how much this rule will just trigger false positives. I am not sure how worthwhile it is to enforce this if it mostly turns up false positives.

Fokko

comment created time in 2 days

pull request comment: apache/spark

[SPARK-32319][PYSPARK] Disallow the use of unused imports

My last comment was: why do we need to add the rule and then a ton of exclusions? Just remove the unused imports. That's a much narrower change.

Fokko

comment created time in 2 days

Pull request review comment: apache/spark

[SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms

 public void putFloats(int rowId, int count, byte[] src, int srcIndex) {     }   } +  @Override+  public void putFloatsLittleEndian(int rowId, int count, byte[] src, int srcIndex) {+    if (!bigEndianPlatform) {

Just leave it as-is for consistency, yeah. OK I see we have lots of "as little endian" variants so maybe this is the best approach for symmetry.

I don't really know enough about this part to validate the change, but it seems plausible.

tinhto-000

comment created time in 2 days

pull request comment: apache/spark

[SPARK-32554][K8S][DOCS] Update the k8s document according to the current development status

Let me ask this: what is the problem with marking it non-experimental, compared to people perhaps questioning using it in its current state? I don't know that its state is going to change much soon anyway. I'd just unmark it, I think.

maropu

comment created time in 2 days

Pull request review comment: apache/spark

[SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms

 public void putFloats(int rowId, int count, byte[] src, int srcIndex) {     }   } +  @Override+  public void putFloatsLittleEndian(int rowId, int count, byte[] src, int srcIndex) {+    if (!bigEndianPlatform) {

Really doesn't matter, but you could flip the comparison here to avoid the negation

tinhto-000

comment created time in 2 days

pull request comment: apache/spark

[SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms

Seems reasonable. Is there any case where we would need putFloats separately from putFloatsLittleEndian? Like, would you ever legitimately not care about the endianness of the serialization? (Maybe a dumb question.)

tinhto-000

comment created time in 2 days

PR closed saurfang/spark-sas7bdat

[WIP] Minimal change to compile for scala 2.12 / Spark 3

This just adjusts Spark and Scala versions for Spark 3 + Scala 2.12. It compiles and most tests pass, except the one concerning SQLContext implicits.

This removes the cross-compile, but I think it's likely possible to maintain cross-compatibility with Scala 2.11 and Spark 2.4, so I could change that.

Just a WIP for now.

+4 -5

1 comment

1 changed file

srowen

pr closed time in 3 days

pull request comment: saurfang/spark-sas7bdat

[WIP] Minimal change to compile for scala 2.12 / Spark 3

Closed in favor of the much better #59

srowen

comment created time in 3 days

pull request comment: apache/spark

[SPARK-32526][SQL]Pass DataSourceV2AnalysisBaseSuite and TreeNodeSuite related test cases in scala 2.13

This change itself looks fine. If you think the rest of the changes in sql / catalyst are similar, I'd make them in one go. If there are difficult or logically distinct other fixes, that can be another PR. They can all be attached to one JIRA

LuciferYang

comment created time in 3 days

pull request comment: apache/spark

[SPARK-32554][K8S][DOCS] Update the k8s document according to the current development status

I don't know either. I'd wonder: what actionable difference would this make, to declare it still partly experimental? Is there any work on it coming in the foreseeable future? Is it dangerous to use in production? Will an API possibly change? If we're basically not warning people away from using it as-is, with the usual understanding that stuff can have minor bugs, then maybe there's no value in making this ill-defined statement at all, and we avoid confusion.

maropu

comment created time in 3 days

Pull request review comment: apache/spark

[SPARK-32546][SQL][3.0] Get table names directly from Hive tables

 private[hive] class HiveClientImpl(       dbName: String,       pattern: String,       tableType: CatalogTableType): Seq[String] = withHiveState {+    val hiveTableType = toHiveTableType(tableType)     try {       // Try with Hive API getTablesByType first, it's supported from Hive 2.3+.-      shim.getTablesByType(client, dbName, pattern, toHiveTableType(tableType))+      shim.getTablesByType(client, dbName, pattern, hiveTableType)     } catch {       case _: UnsupportedOperationException =>         // Fallback to filter logic if getTablesByType not supported.         val tableNames = client.getTablesByPattern(dbName, pattern).asScala-        val tables = getTablesByName(dbName, tableNames).filter(_.tableType == tableType)

It's fine if we 'break' Scala 2.13 a bit in master while we're still getting it working, if it makes it easier not to reason about it. If you carry the .toSeq / .toMap-style changes back into 3.0, though, that's fine; they still work.

MaxGekk

comment created time in 3 days

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

Sure, it's trivial, but: https://github.com/saurfang/spark-sas7bdat/pull/58

Tagar

comment created time in 4 days

PR opened saurfang/spark-sas7bdat

[WIP] Minimal change to compile for scala 2.12 / Spark 3

This just adjusts Spark and Scala versions for Spark 3 + Scala 2.12. It compiles and most tests pass, except the one concerning SQLContext implicits.

This removes the cross-compile, but I think it's likely possible to maintain cross-compatibility with Scala 2.11 and Spark 2.4, so I could change that.

Just a WIP for now.

+4 -5

0 comment

1 changed file

pr created time in 4 days

create branch: srowen/spark-sas7bdat

branch: Spark3

created branch time in 4 days

fork srowen/spark-sas7bdat

Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL

http://spark-packages.org/package/saurfang/spark-sas7bdat

fork in 4 days

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

Could be, though I might expect a different error, not 'class not found'. It's entirely possible my assembly JAR isn't quite right. For instance, I notice the assembly from the build would include the Scala libs (including scala-reflect), and I don't know, that may cause some problem. But the Scala version matches Spark 3's.

Tagar

comment created time in 4 days
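
If the bundled Scala libraries do turn out to be the culprit, one option is to keep them out of the assembly entirely. A hedged build.sbt sketch, assuming the project builds its fat JAR with sbt-assembly 0.14.x (the includeScala option is sbt-assembly's; nothing here is taken from this repo's actual build):

```scala
// build.sbt fragment (sketch): don't bundle scala-library / scala-reflect in the assembly,
// so the job uses the Scala runtime that ships with Spark itself.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

// Spark should also stay out of the fat JAR; it is already on the cluster.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"
```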

Pull request review comment: apache/spark

[SPARK-30654] Bootstrap4 docs upgrade

         <link rel="stylesheet" href="css/bootstrap.min.css">         <style>             body {+	        {% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "migration-guide.html" %}

I built the docs to take a look, and wondered: why special-case these? The /ml and /sql pages look like they have too little padding at the top.

clarkead

comment created time in 4 days

push event: srowen/spark

Michael Munday

commit sha 4a0427cbc1b557a3e08135756bf089d882a0994f

[SPARK-32485][SQL][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite ### What changes were proposed in this pull request? PR #26548 means that RecordBinaryComparator now uses big endian byte order for long comparisons. However, this means that some of the constants in the regression tests no longer map to the same values in the comparison that they used to. For example, one of the tests does a comparison between Long.MIN_VALUE and 1 in order to trigger an overflow condition that existed in the past (i.e. Long.MIN_VALUE - 1). These constants correspond to the values 0x80..00 and 0x00..01. However on a little-endian machine the bytes in these values are now swapped before they are compared. This means that we will now be comparing 0x00..80 with 0x01..00. 0x00..80 - 0x01..00 does not overflow therefore missing the original purpose of the test. To fix this the constants are now explicitly written out in big endian byte order to match the byte order used in the comparison. This also fixes the tests on big endian machines (which would otherwise get a different comparison result to the little-endian machines). ### Why are the changes needed? The regression tests no longer serve their initial purposes and also fail on big-endian systems. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests run on big-endian system (s390x). Closes #29259 from mundaym/fix-endian. Authored-by: Michael Munday <mike.munday@ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

push time in 4 days

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

Oh yeah, ignore that. In the Maven-published version, it says parso is a dependency, so apps will correctly pick it up. For this test with a JAR file only, I needed to create the assembly JAR with dependencies bundled. The second error is what I'm now looking at.

Tagar

comment created time in 4 days

push event: srowen/spark

Huaxin Gao

commit sha 40e6a5bbb0dbedae4a270f830aafd4cb310a8fe2

[SPARK-32449][ML][PYSPARK] Add summary to MultilayerPerceptronClassificationModel ### What changes were proposed in this pull request? Add training summary to MultilayerPerceptronClassificationModel... ### Why are the changes needed? so that user can get the training process status, such as loss value of each iteration and total iteration number. ### Does this PR introduce _any_ user-facing change? Yes MultilayerPerceptronClassificationModel.summary MultilayerPerceptronClassificationModel.evaluate ### How was this patch tested? new tests Closes #29250 from huaxingao/mlp_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

Max Gekk

commit sha d897825d2d0430cb52ae9ac0f6fd742582041682

[SPARK-32346][SQL] Support filters pushdown in Avro datasource ### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in Avro datasource V1 and V2. 1. Added new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to Avro datasource. It is on by default. 2. Renamed `CSVFilters` to `OrderedFilters`. 3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2) 4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`. ### Why are the changes needed? The changes improve performance on synthetic benchmarks up to **2** times on JDK 11: ``` OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ w/o filters 9614 9669 54 0.1 9614.1 1.0X pushdown disabled 10077 10141 66 0.1 10077.2 1.0X w/ filters 4681 4713 29 0.2 4681.5 2.1X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite` - Re-running `AvroReadBenchmark` using Amazon EC2: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge (spot instance) | | AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) | | Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`| and `./dev/run-benchmarks`: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #29145 from MaxGekk/avro-filters-pushdown. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>

view details

Dongjoon Hyun

commit sha 9dc02378518b4b0b8a069684f575ed40813fa417

[SPARK-32476][CORE] ResourceAllocator.availableAddrs should be deterministic ### What changes were proposed in this pull request? This PR aims to make `ResourceAllocator.availableAddrs` deterministic. ### Why are the changes needed? Currently, this function returns indeterministically due to the underlying `HashMap`. So, the test case itself is creating a list `[0, 1, 2]` initially, but ends up with comparing `[2, 1, 0]`. Not only this happens in the 3.0.0, but also this causes UT failures on Scala 2.13 environment. ### Does this PR introduce _any_ user-facing change? Yes, but this fixes the in-deterministic behavior. ### How was this patch tested? - Scala 2.12: This should pass the UT with the modified test case. - Scala 2.13: This can be tested like the following (at least `JsonProtocolSuite`) ``` $ dev/change-scala-version.sh 2.13 $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.JsonProtocolSuite ``` **BEFORE** ``` *** 2 TESTS FAILED *** ``` **AFTER** ``` All tests passed. ``` Closes #29281 from dongjoon-hyun/SPARK-32476. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

Thomas Graves

commit sha e926d419d305c9400f6f2426ca3e8d04a9180005

[SPARK-30322][DOCS] Add stage level scheduling docs ### What changes were proposed in this pull request? Document the stage level scheduling feature. ### Why are the changes needed? Document the stage level scheduling feature. ### Does this PR introduce _any_ user-facing change? Documentation. ### How was this patch tested? n/a docs only Closes #29292 from tgravescs/SPARK-30322. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>

view details

Wenchen Fan

commit sha a025a89f4ef3a05d7e70c02f03a9826bb97eceac

[SPARK-32332][SQL] Support columnar exchanges ### What changes were proposed in this pull request? This PR adds abstract classes for shuffle and broadcast, so that users can provide their columnar implementations. This PR updates several places to use the abstract exchange classes, and also update `AdaptiveSparkPlanExec` so that the columnar rules can see exchange nodes. This is an alternative of https://github.com/apache/spark/pull/29134 . Close https://github.com/apache/spark/pull/29134 ### Why are the changes needed? To allow columnar exchanges. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #29262 from cloud-fan/columnar. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Thomas Graves <tgraves@apache.org>

view details

Holden Karau

commit sha 50911df08eb7a27494dc83bcec3d09701c2babfe

[SPARK-32397][BUILD] Allow specifying of time for build to keep time consistent between modules ### What changes were proposed in this pull request? Upgrade codehaus maven build helper to allow people to specify a time during the build to avoid snapshot artifacts with different version strings. ### Why are the changes needed? During builds of snapshots the maven may assign different versions to different artifacts based on the time each individual sub-module starts building. The timestamp is used as part of the version string when run `maven deploy` on a snapshot build. This results in different sub-modules having different version strings. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual build while specifying the current time, ensured the time is consistent in the sub components. Open question: Ideally I'd like to backport this as well since it's sort of a bug fix and while it does change a dependency version it's not one that is propagated. I'd like to hear folks thoughts about this. Closes #29274 from holdenk/SPARK-32397-snapshot-artifact-timestamp-differences. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: DB Tsai <d_tsai@apple.com>

view details

Dongjoon Hyun

commit sha 163867435a6af1e9a37521e34ea41b07168f4730

[SPARK-32487][CORE] Remove j.w.r.NotFoundException from `import` in [Stages|OneApplication]Resource ### What changes were proposed in this pull request? This PR aims to remove `java.ws.rs.NotFoundException` from two problematic `import` statements. All the other use cases are correct. ### Why are the changes needed? In `StagesResource` and `OneApplicationResource`, there exist two `NotFoundException`s. - javax.ws.rs.NotFoundException - org.apache.spark.status.api.v1.NotFoundException To use `org.apache.spark.status.api.v1.NotFoundException` correctly, we should not import `java.ws.rs.NotFoundException`. This causes UT failures in Scala 2.13 environment. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Scala 2.12: Pass the GitHub Action or Jenkins. - Scala 2.13: Do the following manually. ``` $ dev/change-scala-version.sh 2.13 $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.HistoryServerSuite ``` **BEFORE** ``` *** 4 TESTS FAILED *** ``` **AFTER** ``` *** 1 TEST FAILED *** ``` Closes #29293 from dongjoon-hyun/SPARK-32487. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

Dongjoon Hyun

commit sha 08a66f8fd0df38280dfd54bb79aa8a8ae1272fc9

[SPARK-32248][BUILD] Recover Java 11 build in Github Actions ### What changes were proposed in this pull request? This PR aims to recover Java 11 build in `GitHub Action`. ### Why are the changes needed? This test coverage is removed before. Now, it's time to recover it. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action. Closes #29295 from dongjoon-hyun/SPARK-32248. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

HyukjinKwon

commit sha 89d9b7cc64f01de9b3e88352d6a1979852873a77

[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode ### What changes were proposed in this pull request? This PR proposes: 1. To introduce `InheritableThread` class, that works identically with `threading.Thread` but it can inherit the inheritable attributes of a JVM thread such as `InheritableThreadLocal`. This was a problem from the pinned thread mode, see also https://github.com/apache/spark/pull/24898. Now it works as below: ```python import pyspark spark.sparkContext.setLocalProperty("a", "hi") def print_prop(): print(spark.sparkContext.getLocalProperty("a")) pyspark.InheritableThread(target=print_prop).start() ``` ``` hi ``` 2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks the thread and does not close the connection from Python to JVM. In `InheritableThread`, it manually closes the connections when PVM garbage collection happens. So, JVM threads finish safely. I manually verified by profiling but there's also another easy way to verify: ```bash PYSPARK_PIN_THREAD=true ./bin/pyspark ``` ```python >>> from threading import Thread >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>]) >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>]) ``` This issue is fixed now. 3. Because now we have a fix for the issue here, it also proposes to deprecate `collectWithJobGroup` which was a temporary workaround added to avoid this leak issue. ### Why are the changes needed? To support pinned thread mode properly without a resource leak, and a proper inheritable local properties. ### Does this PR introduce _any_ user-facing change? Yes, it adds an API `InheritableThread` class for pinned thread mode. ### How was this patch tested? Manually tested as described above, and unit test was added as well. Closes #28968 from HyukjinKwon/SPARK-32010. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

zhengruifeng

commit sha 81b0785fb2d9a2d45d4366a58a3c30fe478c299a

[SPARK-32455][ML] LogisticRegressionModel prediction optimization ### What changes were proposed in this pull request? for binary `LogisticRegressionModel`: 1, keep variables `_threshold` and `_rawThreshold` instead of computing them on each instance; 2, in `raw2probabilityInPlace`, make use of the characteristic that the sum of probability is 1.0; ### Why are the changes needed? for better performance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuite and performace test in REPL Closes #29255 from zhengruifeng/pred_opt. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>

view details

Max Gekk

commit sha 99a855575c3a5554443a27385caf49661cc7f139

[SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources ### What changes were proposed in this pull request? When `spark.sql.caseSensitive` is `false` (by default), check that there are not duplicate column names on the same level (top level or nested levels) in reading from in-built datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception: ``` org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: ``` ### Why are the changes needed? To make handling of duplicate nested columns is similar to handling of duplicate top-level columns i. e. output the same error when `spark.sql.caseSensitive` is `false`: ```Scala org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase` ``` Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758. ### Does this PR introduce _any_ user-facing change? Yes. For the example from SPARK-32431: ORC: ```scala java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329) at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) ... Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3 at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61) at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323) ``` JSON: ```scala +------------+ |StructColumn| +------------+ | [,,]| +------------+ ``` Parquet: ```scala +------------+ |StructColumn| +------------+ | [0,, 1]| +------------+ ``` Avro: ```scala +------------+ |StructColumn| +------------+ | [,,]| +------------+ ``` After the changes, Parquet, ORC, JSON and Avro output the same error: ```scala Found duplicate column(s) in the data schema: `camelcase`; org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`; at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51) at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67) ``` ### How was this patch tested? Run modified test suites: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite" $ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*" ``` and added new UT to `SchemaUtilsSuite`. Closes #29234 from MaxGekk/nested-case-insensitive-column. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

HyukjinKwon

commit sha e1d73210341a314601a953e6ac483112660874e6

[SPARK-32478][R][SQL] Error message to show the schema mismatch in gapply with Arrow vectorization ### What changes were proposed in this pull request? This PR proposes to: 1. Fix the error message when the output schema is misbatched with R DataFrame from the given function. For example, ```R df <- createDataFrame(list(list(a=1L, b="2"))) count(gapply(df, "a", function(key, group) { group }, structType("a int, b int"))) ``` **Before:** ``` Error in handleErrors(returnStatus, conn) : ... java.lang.UnsupportedOperationException ... ``` **After:** ``` Error in handleErrors(returnStatus, conn) : ... java.lang.AssertionError: assertion failed: Invalid schema from gapply: expected IntegerType, IntegerType, got IntegerType, StringType ... ``` 2. Update documentation about the schema matching for `gapply` and `dapply`. ### Why are the changes needed? To show which schema is not matched, and let users know what's going on. ### Does this PR introduce _any_ user-facing change? Yes, error message is updated as above, and documentation is updated. ### How was this patch tested? Manually tested and unitttests were added. Closes #29283 from HyukjinKwon/r-vectorized-error. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Kent Yao

commit sha 510a1656e650246a708d3866c8a400b7a1b9f962

[SPARK-32412][SQL] Unify error handling for spark thrift server operations ### What changes were proposed in this pull request? Log error/warn message only once at the server-side for both sync and async modes ### Why are the changes needed? In https://github.com/apache/spark/commit/b151194299f5ba15e0d9d8d7d2980fd164fe0822 we make the error logging for SparkExecuteStatementOperation with `runInBackground=true` not duplicated, but the operations with runInBackground=false and other metadata operation still will be log twice which happened in the operation's `runInternal` method and ThriftCLIService. In this PR, I propose to reflect the logic to get a unified error handling approach. ### Does this PR introduce _any_ user-facing change? Yes, when spark.sql.hive.thriftServer.async=false and people call sync APIs the error message will be logged only once at server-side. ### How was this patch tested? locally verified the result in target/unit-test.log add unit tests. Closes #29204 from yaooqinn/SPARK-32412. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Takeshi Yamamuro

commit sha 30e3042dc5febf49123483184e6282fefde8ebc0

[SPARK-32488][SQL] Use @parser::members and @lexer::members to avoid generating unused code ### What changes were proposed in this pull request? This PR aims to update `SqlBse.g4` for avoiding generating unused code. Currently, ANTLR generates unused methods and variables; `isValidDecimal` and `isHint` are only used in the generated lexer. This PR changed the code to use `parser::members` and `lexer::members` to avoid it. ### Why are the changes needed? To reduce unnecessary code. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29296 from maropu/UpdateSqlBase. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

HyukjinKwon

commit sha 1f7fe5415e88cc289b44e366cd4e74290784db5f

[SPARK-32491][INFRA] Do not install SparkR in test-only mode in testing script ### What changes were proposed in this pull request? This PR proposes to skip SparkR installation that is to run R linters (see SPARK-8505) in the test-only mode at `dev/run-tests.py` script. As of SPARK-32292, the test-only mode in `dev/run-tests.py` was introduced, for example: ``` dev/run-tests.py --modules sql,core ``` which only runs the relevant tests and does not run other tests such as linters. Therefore, we don't need to install SparkR when `--modules` are specified. ### Why are the changes needed? GitHub Actions build is currently failed as below: ``` ERROR: this R is version 3.4.4, package 'SparkR' requires R >= 3.5 [error] running /home/runner/work/spark/spark/R/install-dev.sh ; received return code 1 ##[error]Process completed with exit code 10. ``` For some reasons, looks GitHub Actions started to have R 3.4.4 installed by default; however, R 3.4 was dropped as of SPARK-32073. When SparkR tests are not needed, GitHub Actions still builds SparkR with a low R version and it causes the test failure. This PR partially fixes it by avoid the installation of SparkR. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GitHub Actions tests should run to confirm this fix is correct. Closes #29300 from HyukjinKwon/install-r. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

HyukjinKwon

commit sha e0c8bd07af6ea2873c77ae6428b3ab4ee68e8e32

[SPARK-32493][INFRA] Manually install R instead of using setup-r in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to manually install R instead of using `setup-r` which seems broken. Currently, GitHub Actions uses its default R 3.4.4 installed, which we dropped as of SPARK-32073. While I am here, I am also upgrading R version to 4.0. Jenkins will test the old version and GitHub Actions tests the new version. AppVeyor uses R 4.0 but it does not check CRAN which is important when we make a release. ### Why are the changes needed? To recover GitHub Actions build. ### Does this PR introduce _any_ user-facing change? No, dev-only ### How was this patch tested? Manually tested at https://github.com/HyukjinKwon/spark/pull/15 Closes #29302 from HyukjinKwon/SPARK-32493. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

HyukjinKwon

commit sha 12f443cd99a91689dc5b44b6794205289ef2d998

[SPARK-32496][INFRA] Include GitHub Action file as the changes in testing ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26556 excluded `.github/workflows/master.yml`. So tests are skipped if the GitHub Actions configuration file is changed. As of SPARK-32245, we now run the regular tests via the testing script. We should include it to test to make sure GitHub Actions build does not break due to some changes such as Python versions. ### Why are the changes needed? For better test coverage in GitHub Actions build. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GitHub Actions in this PR will test. Closes #29305 from HyukjinKwon/SPARK-32496. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Warren Zhu

commit sha 743772095273b464f845efefb3eb59284b06b9be

[SPARK-32227] Fix regression bug in load-spark-env.cmd with Spark 3.0.0 ### What changes were proposed in this pull request? Fix regression bug in load-spark-env.cmd with Spark 3.0.0 ### Why are the changes needed? cmd doesn't support set env twice. So set `SPARK_ENV_CMD=%SPARK_CONF_DIR%\%SPARK_ENV_CMD%` doesn't take effect, which caused regression. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested. 1. Create a spark-env.cmd under conf folder. Inside this, `echo spark-env.cmd` 2. Run old load-spark-env.cmd, nothing printed in the output 2. Run fixed load-spark-env.cmd, `spark-env.cmd` showed in the output. Closes #29044 from warrenzhu25/32227. Lead-authored-by: Warren Zhu <zhonzh@microsoft.com> Co-authored-by: Warren Zhu <warren.zhu25@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

HyukjinKwon

commit sha 32f4ef005fd590e0e7c319b43a459cb3828bba5a

[SPARK-32497][INFRA] Installs qpdf package for CRAN check in GitHub Actions ### What changes were proposed in this pull request? CRAN check fails due to the size of the generated PDF docs as below: ``` ... WARNING ‘qpdf’ is needed for checks on size reduction of PDFs ... Status: 1 WARNING, 1 NOTE See ‘/home/runner/work/spark/spark/R/SparkR.Rcheck/00check.log’ for details. ``` This PR proposes to install `qpdf` in GitHub Actions. Note that I cannot reproduce in my local with the same R version so I am not documenting it for now. Also, while I am here, I piggyback to install SparkR when the module includes `sparkr`. it is rather a followup of SPARK-32491. ### Why are the changes needed? To fix SparkR CRAN check failure. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? GitHub Actions will test it out. Closes #29306 from HyukjinKwon/SPARK-32497. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Dongjoon Hyun

commit sha 7cf3b54a2a7528e815841015af50e08ce4515cb9

[SPARK-32489][CORE] Pass `core` module UTs in Scala 2.13 ### What changes were proposed in this pull request? So far, we fixed many stuffs in `core` module. This PR fixes the remaining UT failures in Scala 2.13. - `OneApplicationResource.environmentInfo` will return a deterministic result for `sparkProperties`, `hadoopProperties`, `systemProperties`, and `classpathEntries`. - `SubmitRestProtocolSuite` has Scala 2.13 answer in addition to the existing Scala 2.12 answer, and uses the expected answer based on the Scala runtime version. ### Why are the changes needed? To support Scala 2.13. ### Does this PR introduce _any_ user-facing change? Yes, `environmentInfo` is changed, but this fixes the indeterministic behavior. ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Do the following. ``` $ dev/change-scala-version.sh 2.13 $ build/mvn test -pl core --am -Pscala-2.13 ``` **BEFORE** ``` Tests: succeeded 2612, failed 3, canceled 1, ignored 8, pending 0 *** 3 TESTS FAILED *** ``` **AFTER** ``` Tests: succeeded 2615, failed 0, canceled 1, ignored 8, pending 0 All tests passed. ``` Closes #29298 from dongjoon-hyun/SPARK-32489. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

push time in 4 days

pull request comment: apache/spark

[SPARK-32539][INFRA] Disallow `FileSystem.get(Configuration conf)` in style check by default

Seems reasonable to guard against. I know we have fixed this a few times over the years.

gengliangwang

comment created time in 4 days
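
For anyone wondering why this is worth a style rule, here is a small illustrative sketch (not Spark code; the s3a path is hypothetical and assumes the matching connector is on the classpath): FileSystem.get(Configuration) resolves against fs.defaultFS, so it can silently hand back the wrong filesystem for a path on a different scheme, whereas Path.getFileSystem(conf) resolves against the path itself.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FsLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val path = new Path("s3a://some-bucket/data") // hypothetical path, for illustration only

    val defaultFs = FileSystem.get(conf)     // filesystem of fs.defaultFS, often HDFS or local
    val pathFs    = path.getFileSystem(conf) // filesystem that actually matches the path's scheme

    println(defaultFs.getUri) // e.g. hdfs://... or file:///
    println(pathFs.getUri)    // s3a://some-bucket
  }
}
```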

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

Darn, OK, maybe this doesn't Just Work. That's a weird error, though: it doesn't say what class isn't found. And it compiles, so that's surprising. Hm, not sure I have other ideas, sorry. But the changes to compile are at least not hard!

Tagar

comment created time in 4 days

Pull request review comment: apache/spark

[SPARK-32485][SQL][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

 public void testBinaryComparatorForNullColumns() throws Exception {   public void testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue() throws Exception {     int numFields = 1; +    // Place the following bytes (hex) into UnsafeRows for the comparison:+    //+    //   index | 00 01 02 03 04 05 06 07+    //   ------+------------------------+    //   row1  | 00 00 00 00 00 00 00 0b+    //   row2  | 00 00 00 00 80 00 00 0a+    //+    // The byte layout needs to be identical on all platforms regardless of+    // of endianness. To achieve this the bytes in each value are reversed+    // on little-endian platforms.+    long row1Data = 11L;+    long row2Data = 11L + Integer.MAX_VALUE;+    if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {

That is perhaps the only remaining issue. We should fix the test that this sets up of course, but, which was the intended test? I'm also not 100% sure, but @mundaym had a reasonable argument at https://github.com/apache/spark/pull/29259#issuecomment-664480885

mundaym

comment created time in 4 days

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

Oops, right, I needed to produce the assembly JAR. Try this:

https://drive.google.com/file/d/1N_i-I34rFzboJ7J_aIs2rirGKjjIawDv/view?usp=sharing

Tagar

comment created time in 4 days

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

OK I'm curious to know if this Just Works: https://drive.google.com/file/d/1VnF0gnG88ClyXWoHMlXFgvHcVUY7Yo-j/view?usp=sharing

Tagar

comment created time in 4 days

issue comment: saurfang/spark-sas7bdat

Spark 3 / Scala 2.12 compatibility

I got it compiling with minor changes, and seemingly just one test failure, related to the SQLContext implicit (which is not really to be used in Spark 3). I haven't tested it but getting a compiled JAR is easy enough. I could open a PR (or branch) as an example if anyone wants to try it or run with it.

Tagar

comment created time in 4 days

pull request comment: apache/spark

[SPARK-30654] Bootstrap4 docs upgrade

Looks pretty reasonable to go for at this point. Is it still up to date, and worth me checking the docs build now?

clarkead

comment created time in 5 days

PR closed apache/spark

[SPARK-32526][SQL]Let sql/catalyst module compile for Scala 2.13 SQL

What changes were proposed in this pull request?

The purpose of this PR is to resolve SPARK-32526 which should allow Spark to compile sql/catalyst module with Scala 2.13 profile.

The main change of this PR is adding .toSeq calls where mutable collections are returned as Seq, similar to #28791.

Why are the changes needed?

We need to support a Scala 2.13 build.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. (2.13 was not tested; there are still about 100 failing test cases with the 2.13 profile in the sql/catalyst module.)

+4 -4

6 comments

2 changed files

LuciferYang

pr closed time in 5 days
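
For context on why these mechanical .toSeq additions are needed at all (a minimal sketch, not code from the PR): in Scala 2.13, scala.Seq is an alias for immutable.Seq, so a mutable buffer no longer conforms to a Seq return type the way it did in 2.12.

```scala
import scala.collection.mutable.ArrayBuffer

object ToSeqSketch {
  def collectNames(): Seq[String] = {
    val buf = ArrayBuffer("a", "b", "c")
    // Under Scala 2.12, returning `buf` directly compiles because scala.Seq is collection.Seq.
    // Under 2.13, scala.Seq means immutable.Seq, so the explicit conversion is required.
    buf.toSeq
  }

  def main(args: Array[String]): Unit = println(collectNames())
}
```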

pull request comment: apache/spark

[SPARK-32526][SQL]Let sql/catalyst module compile for Scala 2.13

Yeah, certainly we need to make these fixes to newly-arrived code over time, but I'd just do them when fixing a chunk of tests.

LuciferYang

comment created time in 5 days

Pull request review comment: apache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

 public void testBinaryComparatorForNullColumns() throws Exception {   public void testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue() throws Exception {     int numFields = 1; +    long row1Data = 11L;+    long row2Data = 11L + Integer.MAX_VALUE;++    // BinaryComparator compares longs in big-endian byte order.+    if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {+      row1Data = Long.reverseBytes(row1Data);+      row2Data = Long.reverseBytes(row2Data);+    }+     UnsafeRow row1 = new UnsafeRow(numFields);     byte[] data1 = new byte[100];     row1.pointTo(data1, computeSizeInBytes(numFields * 8));-    row1.setLong(0, 11);+    row1.setLong(0, row1Data);      UnsafeRow row2 = new UnsafeRow(numFields);     byte[] data2 = new byte[100];     row2.pointTo(data2, computeSizeInBytes(numFields * 8));-    row2.setLong(0, 11L + Integer.MAX_VALUE);+    row2.setLong(0, row2Data);      insertRow(row1);     insertRow(row2); -    Assert.assertTrue(compare(0, 1) > 0);+    Assert.assertTrue(compare(0, 1) < 0);

That's right. There is no endianness issue in the logic being tested. There is endianness in how this test constructs the bytes to compare, because it writes a long into memory. I doubt this test's purpose holds both ways. It "works", but either little- or big-endian machines aren't running the test as intended. It feels like there's one intended test to run here, not either of two.

mundaym

comment created time in 5 days
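
To illustrate the point in this thread (a sketch only, not the test code): writing a long into memory is endianness-dependent, and reversing the bytes on little-endian platforms is what pins the stored layout down to one fixed big-endian byte sequence.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object EndianSketch {
  // The bytes of v as they would land in memory on this platform.
  def nativeBytes(v: Long): Array[Byte] =
    ByteBuffer.allocate(8).order(ByteOrder.nativeOrder()).putLong(v).array()

  def hex(bytes: Array[Byte]): String = bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit = {
    val v = 11L + Integer.MAX_VALUE
    // On a little-endian machine, storing reverseBytes(v) leaves the same bytes in memory
    // as storing v on a big-endian machine, which is what the test fix relies on.
    val pinned =
      if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) java.lang.Long.reverseBytes(v) else v
    println(hex(nativeBytes(v)))      // platform-dependent layout
    println(hex(nativeBytes(pinned))) // always the big-endian layout of v
  }
}
```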

pull request comment: apache/spark

[WIP][SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub Actions

If there's not a clean way to do it, and this isn't essential but nice-to-have, then option 1 seems the most viable. It would be opt-in and only for people who want to take the time to set this up, so I'm not sure it would cover much. But it's probably more useful than doing it only after the fact, or exposing a token.

HyukjinKwon

comment created time in 5 days

Pull request review comment: apache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

 public void testBinaryComparatorForNullColumns() throws Exception {   public void testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue() throws Exception {     int numFields = 1; +    long row1Data = 11L;+    long row2Data = 11L + Integer.MAX_VALUE;++    // BinaryComparator compares longs in big-endian byte order.+    if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {+      row1Data = Long.reverseBytes(row1Data);+      row2Data = Long.reverseBytes(row2Data);+    }+     UnsafeRow row1 = new UnsafeRow(numFields);     byte[] data1 = new byte[100];     row1.pointTo(data1, computeSizeInBytes(numFields * 8));-    row1.setLong(0, 11);+    row1.setLong(0, row1Data);      UnsafeRow row2 = new UnsafeRow(numFields);     byte[] data2 = new byte[100];     row2.pointTo(data2, computeSizeInBytes(numFields * 8));-    row2.setLong(0, 11L + Integer.MAX_VALUE);+    row2.setLong(0, row2Data);      insertRow(row1);     insertRow(row2); -    Assert.assertTrue(compare(0, 1) > 0);+    Assert.assertTrue(compare(0, 1) < 0);

That would also work here. I think it's a little less ideal, because then you are running a different test on little- vs big-endian (and those are the correct answers to the two different tests). I think only one of those tests was the intended one AFAICT, and I buy the argument that the proposed change restores it as well as addresses endianness.

mundaym

comment created time in 5 days

push event: apache/spark-website

Sean Owen

commit sha b66b88ec9f378e720efd9417557fd0a1118364b4

Add note about CVEs for archived versions (I think the content is self-explanatory) Author: Sean Owen <srowen@gmail.com> Closes #282 from srowen/CVEarchived.

view details

push time in 6 days

PR closed apache/spark-website

Add note about CVEs for archived versions

(I think the content is self-explanatory)

+8 -2

0 comment

2 changed files

srowen

pr closed time in 6 days

PR opened apache/spark-website

Add note about CVEs for archived versions

(I think the content is self-explanatory)

+8 -2

0 comment

2 changed files

pr created time in 6 days

create branch: srowen/spark-website

branch: CVEarchived

created branch time in 6 days

pull request comment: apache/spark-website

Update committer page

I'm on Jekyll 4.1.0, and ideally we'd use the latest one consistently. But yeah, I usually revert any spurious changes and don't deal with it.

mridulm

comment created time in 6 days

push event: srowen/spark-website

Xiao Li

commit sha b5e249e6cc86dca728fc0ed7c8b86949eb2c92f9

Fix the release note of Spark 3.0.0 This PR is to fix the release note of Spark 3.0.0. Update the md file to reflect the changes made in HTML. Author: Xiao Li <gatorsmile@gmail.com> Closes #274 from gatorsmile/updateReleaseMessage3.0.

view details

Sean Owen

commit sha 08327a24e163dbd674ce83c29bd485f5e90c2ea5

CVE-2020-9480 details (#275)

view details

Xiao Li

commit sha 2c5679f415c3605726e68c0a2b8c204c91131d0c

Update the artifactId in the Download Page #276 The existing artifactId is not correct. We need to update it from 2.11 to 2.12

view details

Jungtaek Lim (HeartSaVioR)

commit sha 00a4de3e0b7e2545dd3b67c4c43bd96dc8db83a0

Add Jungtaek Lim to committers.md Author: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Closes #277 from HeartSaVioR/add-committer-heartsavior.

view details

Huaxin Gao

commit sha 18d7e2103f9713adc09d69b65ebd4a48107c88f0

Add Huaxin Gao to committers.md Author: Huaxin Gao <huaxing@us.ibm.com> Closes #278 from huaxingao/asf-site.

view details

Dilip Biswal

commit sha 78726aef37378a9ec803846f9d15dc1e22e415b3

Add Dilip Biswal to committers.md Add "Dilip Biswal" to committer list. Author: Dilip Biswal <dkbiswal@gmail.com> Closes #279 from dilipbiswal/dkb_asf_site.

view details

push time in 6 days

pull request comment: apache/spark

[SPARK-32310][ML][PySpark] ML params default value parity in feature and tuning

I think you can merge when ready

huaxingao

comment created time in 6 days

pull request comment: apache/spark

[WIP][SPARK-21708][BUILD] Migrate build to sbt 1.x

Jenkins test this please

gemelen

comment created time in 6 days

pull request comment: apache/spark

[RFC][SPARK-32495] Update jackson versions to a maintained release, to fix various security vulnerabilities.

Yeah, the problem is that there are behavior changes. You can see one in the UT failures: it outputs an explicit null for fields that are null. I think it can be disabled, but I'm not sure what else we'd have to fix, and where, to maintain behavior. I recall it wasn't trivial when I looked at it a while ago.

ScrapCodes

comment created time in 6 days

pull request comment: apache/spark

[SPARK-28818][SQL][2.4] Respect source column nullability in the arrays created by `freqItems()`

OK, I mean, I'd also consider @maropu the author of this backport. Sure cherry-pick takes care of that but it also kind of loses some info about who did the conflict resolution, etc. Either way seems reasonable.

maropu

comment created time in 7 days

pull request comment: apache/spark

[SPARK-28818][SQL][2.4] Respect source column nullability in the arrays created by `freqItems()`

What are you referring to here, @dongjoon-hyun? The requester and author do not seem to be available to do this work. The JIRA remains credited to the original author. The original PR is linked here.

maropu

comment created time in 7 days

pull request comment: apache/spark

[WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre and Hadoop to 3.2.1

Is this duplicated by https://github.com/apache/spark/pull/29325 ? Yes, this can only happen for Hadoop 3.2.1+, so it would at best be in the Hadoop 3.2 profile.

viirya

comment created time in 7 days

pull request comment: apache/spark

[WIP][SPARK-32502][BUILD] Upgrade Guava to 27.0-jre

Is this duplicated by https://github.com/apache/spark/pull/29326 ? Yes, this can only happen for Hadoop 3.2.1+

viirya

comment created time in 7 days

Pull request review comment: apache/spark

[WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation

 PySpark Documentation ===================== +PySpark is an interface for Apache Spark in Python language. It not only offers for you+to write an application in the Python APIs but also provides PySpark shell so you can+interactively analyze your data in a distributed environment. PySpark supports most+of Spark features such as Spark SQL, DataFrmae, Streaming, MLlib

most of Spark's features

HyukjinKwon

comment created time in 8 days

Pull request review comment: apache/spark

[WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation

 PySpark Documentation ===================== +PySpark is an interface for Apache Spark in Python language. It not only offers for you

in the Python language, or just "in Python"

HyukjinKwon

comment created time in 8 days

Pull request review comment: apache/spark

[WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation

 PySpark Documentation ===================== +PySpark is an interface for Apache Spark in Python language. It not only offers for you+to write an application in the Python APIs but also provides PySpark shell so you can

Maybe "It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing ..."

HyukjinKwon

comment created time in 8 days

Pull request review comment: apache/spark

[WIP][SPARK-32507][DOCS][PYTHON] Add main page for PySpark documentation

 PySpark Documentation ===================== +PySpark is an interface for Apache Spark in Python language. It not only offers for you+to write an application in the Python APIs but also provides PySpark shell so you can+interactively analyze your data in a distributed environment. PySpark supports most+of Spark features such as Spark SQL, DataFrmae, Streaming, MLlib+(Machine Learning) and Spark Core.++.. image:: ../../../docs/img/pyspark-components.png+  :alt: PySpark Compoenents++**Spark SQL and DataFrame**++Spark SQL is a Spark module for structured data processing. It provides+a programming abstraction called DataFrame and can also act as distributed+SQL query engine.++**Streaming**++Running on top of Spark, the streaming feature in Apache Spark enables powerful+interactive and analytical applications across both streaming and historical data,+while inheriting Spark’s ease of use and fault tolerance characteristics.++**MLlib**++Built on top of Spark, MLlib is a scalable machine learning library that provides+a uniform set of high-level APIs that help users create and tune practical machine+learning pipelines.++**Spark Core**++Spark Core is the underlying general execution engine for the Spark platform that all+other functionality is built on top of. It provides an RDD (Resilient Disributed Data)

Data -> Dataset

HyukjinKwon

comment created time in 8 days

Pull request review comment: apache/spark

[SPARK-31448] [PYTHON] Fix storage level used in cache() in dataframe.py

 def cache(self):         .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.         """         self.is_cached = True-        self._jdf.cache()+        self.persist(StorageLevel.MEMORY_AND_DISK)

OK, right. That could be a simpler fix. But then I'm still not quite following why there is a behavior difference, per https://github.com/apache/spark/pull/29242#discussion_r462086489. I'm trying to understand whether this difference is expected or not, per earlier comments.

abhishekd0907

comment created time in 9 days

issue comment: databricks/spark-xml

NoSuchMethodError when calling from_xml from PySpark

Ah OK, that makes a lot more sense then

fcvr1010

comment created time in 9 days

issue comment: databricks/spark-xml

NoSuchMethodError when calling from_xml from PySpark

Hm, that is weird. The method was added in January, so it should be in the 0.9.0 version. No chance you have multiple copies of the library on your classpath? How are you calling it from PySpark, as in https://github.com/databricks/spark-xml#pyspark-notes ?

fcvr1010

comment created time in 9 days
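
One quick way to check the multiple-copies theory (a generic sketch; the resource name below is just an example of a class from the package, not necessarily the one involved in the error): ask the classloader for every location that can serve that class.

```scala
// Illustrative: list every classpath location that provides a given class,
// which makes duplicate copies of a library easy to spot.
object FindDuplicateJars {
  def main(args: Array[String]): Unit = {
    val resource = "com/databricks/spark/xml/XmlReader.class" // example class from spark-xml
    val urls = Thread.currentThread().getContextClassLoader.getResources(resource)
    while (urls.hasMoreElements) {
      println(urls.nextElement()) // more than one line printed means more than one copy on the classpath
    }
  }
}
```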

pull request comment: apache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

I think we can move forward on this if we just change the comments, to show the desired bytes that are being written. Functionally this seems OK.

mundaym

comment created time in 9 days

Pull request review comment: apache/spark

[SPARK-32503][CORE][TESTS] Add Scala 2.13 `core` and related module test in GitHub Action

 jobs:         mkdir -p ~/.m2         ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Djava.version=11 install         rm -rf ~/.m2/repository/org/apache/spark++  scala213:+    name: Scala 2.13+    runs-on: ubuntu-latest+    steps:+    - name: Checkout Spark repository+      uses: actions/checkout@v2+    - name: Cache Maven local repository+      uses: actions/cache@v2+      with:+        path: ~/.m2/repository+        key: scala213-maven-${{ hashFiles('**/pom.xml') }}+        restore-keys: |+          scala213-maven-+    - name: Install Java 11+      uses: actions/setup-java@v1+      with:+        java-version: 11+    - name: Test with Maven+      run: |+        export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g -Dorg.slf4j.simpleLogger.defaultLogLevel=WARN"+        export MAVEN_CLI_OPTS="--no-transfer-progress"+        mkdir -p ~/.m2+        dev/change-scala-version.sh 2.13+        build/mvn test -pl core --am -Pscala-2.13

I mean, how about just adding an end-to-end Scala 2.13 test? It will fail beyond core for now, but that would help track the progress.

dongjoon-hyun

comment created time in 10 days

Pull request review comment: apache/spark

[SPARK-31448] [PYTHON] Fix storage level used in cache() in dataframe.py

 def cache(self):         .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.         """         self.is_cached = True-        self._jdf.cache()+        self.persist(StorageLevel.MEMORY_AND_DISK)

Yes, but that's the desired level, no? I understand there is a subtle difference between the meaning of that in Scala and Python, but would that be the difference you are trying to fix here?

abhishekd0907

comment created time in 10 days

pull request commentapache/spark

[SPARK-32490][BUILD] Upgrade netty-all to 4.1.51.Final

Jenkins test this please

LuciferYang

comment created time in 10 days

pull request commentapache/spark

[WIP][SPARK-21708][BUILD] Migrate build to sbt 1.x

+1 this would be great to fix. I tried upgrading a long time ago and hit problems I didn't know how to solve. Hope you can!

gemelen

comment created time in 10 days

pull request commentapache/spark

[SPARK-32490][BUILD] Upgrade netty-all to 4.1.51.Final

Sounds fine, if it doesn't introduce behavior changes or anything.

LuciferYang

comment created time in 10 days

Pull request review commentapache/spark

[SPARK-31448] [PYTHON] Fix storage level used in cache() in dataframe.py

 def cache(self):         .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.         """         self.is_cached = True-        self._jdf.cache()+        self.persist(StorageLevel.MEMORY_AND_DISK)

Hm, yeah that is strange. What if PySpark cache() just calls persist() in PySpark, rather than cache() in Scala?

abhishekd0907

comment created time in 10 days

push eventapache/spark

Huaxin Gao

commit sha 40e6a5bbb0dbedae4a270f830aafd4cb310a8fe2

[SPARK-32449][ML][PYSPARK] Add summary to MultilayerPerceptronClassificationModel ### What changes were proposed in this pull request? Add training summary to MultilayerPerceptronClassificationModel... ### Why are the changes needed? so that user can get the training process status, such as loss value of each iteration and total iteration number. ### Does this PR introduce _any_ user-facing change? Yes MultilayerPerceptronClassificationModel.summary MultilayerPerceptronClassificationModel.evaluate ### How was this patch tested? new tests Closes #29250 from huaxingao/mlp_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

push time in 11 days

PR closed apache/spark

[SPARK-32449][ML][PySpark] Add summary to MultilayerPerceptronClassificationModel DOCS ML PYTHON

What changes were proposed in this pull request?

Add training summary to MultilayerPerceptronClassificationModel...

Why are the changes needed?

so that user can get the training process status, such as loss value of each iteration and total iteration number.

Does this PR introduce any user-facing change?

Yes MultilayerPerceptronClassificationModel.summary MultilayerPerceptronClassificationModel.evaluate

How was this patch tested?

new tests
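A hedged usage sketch of the API described above: model.summary is what this change adds, while the specific fields read here (totalIterations, objectiveHistory) are assumed by analogy with other Spark ML training summaries.

```
# Illustrative only: train a small MLP and inspect its training summary.
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 0.0])),
    (1.0, Vectors.dense([0.0, 1.0])),
    (1.0, Vectors.dense([1.0, 0.0])),
    (0.0, Vectors.dense([1.0, 1.0]))], ["label", "features"])

mlp = MultilayerPerceptronClassifier(layers=[2, 5, 2], maxIter=50, seed=42)
model = mlp.fit(df)

summary = model.summary                  # added by this change
print(summary.totalIterations)           # assumed field name
print(summary.objectiveHistory[:5])      # assumed field name
```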

+222 -14

4 comments

7 changed files

huaxingao

pr closed time in 11 days

pull request commentapache/spark

[SPARK-32449][ML][PySpark] Add summary to MultilayerPerceptronClassificationModel

Merged to master

huaxingao

comment created time in 11 days

push eventsrowen/spark

Dongjoon Hyun

commit sha 4f79b9fffd592a2f24e5e759d726232134bda409

[SPARK-32447][CORE] Use python3 by default in pyspark and find-spark-home scripts ### What changes were proposed in this pull request? This PR aims to use `python3` instead of `python` inside `bin/pyspark`, `bin/find-spark-home` and `bin/find-spark-home.cmd` script. ``` $ git diff master --stat bin/find-spark-home | 4 ++-- bin/find-spark-home.cmd | 4 ++-- bin/pyspark | 4 ++-- ``` ### Why are the changes needed? According to [PEP 394](https://www.python.org/dev/peps/pep-0394/), we have four different cases for `python` while `python3` will be there always. ``` - Distributors may choose to set the behavior of the python command as follows: python2, python3, not provide python command, allow python to be configurable by an end user or a system administrator. ``` Moreover, these scripts already depend on `find_spark_home.py` which is using `#!/usr/bin/env python3`. ``` FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py" ``` ### Does this PR introduce _any_ user-facing change? No. Apache Spark 3.1 already drops Python 2.7 via SPARK-32138 . ### How was this patch tested? Pass the Jenkins or GitHub Action. Closes #29246 from dongjoon-hyun/SPARK-FIND-SPARK-HOME. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

Liang-Chi Hsieh

commit sha 70ac594bb3aa74a97f942796115ec86261ef7828

[SPARK-32450][PYTHON] Upgrade pycodestyle to v2.6.0 ### What changes were proposed in this pull request? This patch upgrades pycodestyle from v2.4.0 to v2.6.0. The changes at each release: 2.5.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id3 2.6.0a1: https://pycodestyle.pycqa.org/en/latest/developer.html#a1-2020-04-23 2.6.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id2 Changes: Dropped Python 2.6 and 3.3 support, added Python 3.7 and 3.8 support... ### Why are the changes needed? Including bug fixes and newer Python version support. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Ran `dev/lint-python` locally. Closes #29249 from viirya/upgrade-pycodestyle. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Dongjoon Hyun

commit sha 8153f56286b02f3d290340289d033e81b505fe4a

[SPARK-32451][R] Support Apache Arrow 1.0.0 ### What changes were proposed in this pull request? Currently, `GitHub Action` is broken due to `SparkR UT failure` from new Apache Arrow 1.0.0. ![Screen Shot 2020-07-26 at 5 12 08 PM](https://user-images.githubusercontent.com/9700541/88492923-3409f080-cf63-11ea-8fea-6051298c2dd0.png) This PR aims to update R code according to Apache Arrow 1.0.0 recommendation to pass R unit tests. An alternative is pinning Apache Arrow version at 0.17.1 and I also created a PR to compare with this. - https://github.com/apache/spark/pull/29251 ### Why are the changes needed? - Apache Spark 3.1 supports Apache Arrow 0.15.1+. - Apache Arrow released 1.0.0 a few days ago and this causes GitHub Action SparkR test failures due to warnings. - https://github.com/apache/spark/commits/master ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [x] Pass the Jenkins (https://github.com/apache/spark/pull/29252#issuecomment-664067492) - [x] Pass the GitHub (https://github.com/apache/spark/runs/912656867) Closes #29252 from dongjoon-hyun/SPARK-ARROW. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

Dongjoon Hyun

commit sha 13c64c298016eb3882ed20a6f6c60f1ea3988b3b

[SPARK-32448][K8S][TESTS] Use single version for exec-maven-plugin/scalatest-maven-plugin ### What changes were proposed in this pull request? Two different versions are used for the same artifacts, `exec-maven-plugin` and `scalatest-maven-plugin`. This PR aims to use the same versions for `exec-maven-plugin` and `scalatest-maven-plugin`. In addition, this PR removes `scala-maven-plugin.version` from `K8s` integration suite because it's unused. ### Why are the changes needed? This will prevent the mistake which upgrades only one place and forgets the others. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the Jenkins K8S IT. Closes #29248 from dongjoon-hyun/SPARK-32448. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

Cheng Su

commit sha 01cf8a4ce8a986364526588954ffc85da6c9e390

[SPARK-32383][SQL] Preserve hash join (BHJ and SHJ) stream side ordering ### What changes were proposed in this pull request? Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve children output ordering information (inherit from `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in complex queries involved multiple joins. Example: ``` withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") { val df1 = spark.range(100).select($"id".as("k1")) val df2 = spark.range(100).select($"id".as("k2")) val df3 = spark.range(3).select($"id".as("k3")) val df4 = spark.range(100).select($"id".as("k4")) val plan = df1.join(df2, $"k1" === $"k2") .join(df3, $"k1" === $"k3") .join(df4, $"k1" === $"k4") .queryExecution .executedPlan } ``` Current physical plan (extra sort on `k1` before top sort merge join): ``` *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0 : +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128] : : : +- *(1) Project [id#218L AS k1#220L] : : : +- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134] : : +- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141] : +- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) ``` Ideal physical plan (no extra sort on `k1` before top sort merge join): ``` *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127] : : : +- *(1) Project [id#218L AS k1#220L] : : : +- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133] : : +- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140] : +- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) ``` ### Why are the changes needed? To avoid unnecessary sort in query, and it has most impact when users read sorted bucketed table. Though the unnecessary sort is operating on already sorted data, it would have obvious negative impact on IO and query run time if the data is large and external sorting happens. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `JoinSuite`. Closes #29181 from c21/ordering. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

HyukjinKwon

commit sha bfa5d57bbd9b190724c584c43075c49c2a7e507c

[SPARK-32452][R][SQL] Bump up the minimum Arrow version as 1.0.0 in SparkR ### What changes were proposed in this pull request? This PR proposes to set the minimum Arrow version as 1.0.0 to minimise the maintenance overhead and keep the minimal version up to date. Other required changes to support 1.0.0 were already made in SPARK-32451. ### Why are the changes needed? R side, people rather aggressively encourage people to use the latest version, and SparkR vectorization is very experimental that was added from Spark 3.0. Also, we're technically not testing old Arrow versions in SparkR for now. ### Does this PR introduce _any_ user-facing change? Yes, users wouldn't be able to use SparkR with old Arrow. ### How was this patch tested? GitHub Actions and AppVeyor are already testing them. Closes #29253 from HyukjinKwon/SPARK-32452. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

SaurabhChawla

commit sha 99f33ec30f04bb0f7b09c3c2abfc5d5b6af50599

[SPARK-32234][FOLLOWUP][SQL] Update the description of utility method ### What changes were proposed in this pull request? As the part of this PR https://github.com/apache/spark/pull/29045 added the helper method. This PR is the FOLLOWUP PR to update the description of helper method. ### Why are the changes needed? For better readability and understanding of the code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Since its only change of updating the description , So ran the Spark shell Closes #29232 from SaurabhChawla100/SPARK-32234-Desc. Authored-by: SaurabhChawla <s.saurabhtim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

HyukjinKwon

commit sha 6ab29b37cf1270b139bf93b821e7a6c2f6e27c6f

[SPARK-32179][SPARK-32188][PYTHON][DOCS] Replace and redesign the documentation base ### What changes were proposed in this pull request? This PR proposes to redesign the PySpark documentation. I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html. Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html. In more details, this PR proposes: 1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark. 2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow. 3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively. One disadvantage of this approach is that you should list up APIs or classes; however, I think this isn't a big issue in PySpark since we're being conservative on adding APIs. I also intentionally listed classes only instead of functions in ML and MLlib to make it relatively easier to manage. ### Why are the changes needed? Often I hear the complaints, from the users, that current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html compared other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/). It would be nicer if we can make it more organised instead of just listing all classes, methods and attributes to make it easier to navigate. Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it. ### Does this PR introduce _any_ user-facing change? Yes, PySpark API documentation will be redesigned. ### How was this patch tested? Manually tested, and the demo site was made to show. Closes #29188 from HyukjinKwon/SPARK-32179. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

HyukjinKwon

commit sha a82aee044127825ffefa0ed09b0ae5b987b9dd21

[SPARK-32435][PYTHON] Remove heapq3 port from Python 3 ### What changes were proposed in this pull request? This PR removes the manual port of `heapq3.py` introduced from SPARK-3073. The main reason of this was to support Python 2.6 and 2.7 because Python 2's `heapq.merge()` doesn't not support `key` and `reverse`. See - https://docs.python.org/2/library/heapq.html#heapq.merge in Python 2 - https://docs.python.org/3.8/library/heapq.html#heapq.merge in Python 3 Since we dropped the Python 2 at SPARK-32138, we can remove this away. ### Why are the changes needed? To remove unnecessary codes. Also, we can leverage bug fixes made in Python 3.x at `heapq`. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Existing tests should cover. I locally ran and verified: ```bash ./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_shuffle" ./python/run-tests --python-executable=python3 --testname="pyspark.shuffle ExternalSorter" ./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_rdd RDDTests.test_external_group_by_key" ``` Closes #29229 from HyukjinKwon/SPARK-32435. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Warren Zhu

commit sha 998086c9a179692b2687bc9a104dbbb35f5a44e2

[SPARK-30794][CORE] Stage Level scheduling: Add ability to set off heap memory ### What changes were proposed in this pull request? Support set off heap memory in `ExecutorResourceRequests` ### Why are the changes needed? Support stage level scheduling ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT in `ResourceProfileSuite` and `DAGSchedulerSuite` Closes #28972 from warrenzhu25/30794. Authored-by: Warren Zhu <zhonzh@microsoft.com> Signed-off-by: Thomas Graves <tgraves@apache.org>

view details

HyukjinKwon

commit sha ea58e5282304a99f1c12489f722d6b3fb18c4085

[SPARK-32434][CORE][FOLLOW-UP] Fix load-spark-env.cmd to be able to run in Windows properly ### What changes were proposed in this pull request? This PR is basically a followup of SPARK-26132 and SPARK-32434. You can't define an environment variable within an-if to use it within the block. See also https://superuser.com/questions/78496/variables-in-batch-file-not-being-set-when-inside-if ### Why are the changes needed? For Windows users to use Spark and fix the build in AppVeyor. ### Does this PR introduce _any_ user-facing change? No, it's only in unreleased branches. ### How was this patch tested? Manually tested on a local Windows machine, and AppVeyor build at https://github.com/HyukjinKwon/spark/pull/13. See https://ci.appveyor.com/project/HyukjinKwon/spark/builds/34316409 Closes #29254 from HyukjinKwon/SPARK-32434. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Cheng Su

commit sha 548b7db345667fe2c7f98a8af2b0870c56804f10

[SPARK-32420][SQL] Add handling for unique key in non-codegen hash join ### What changes were proposed in this pull request? `HashRelation` has two separate code paths for unique key look up and non-unique key look up E.g. in its subclass [`UnsafeHashedRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177), unique key look up is more efficient as it does not have e.g. extra `Iterator[UnsafeRow].hasNext()/next()` overhead per row. `BroadcastHashJoinExec` has handled unique key vs non-unique key separately in [code-gen path](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321). But the non-codegen path for broadcast hash join and shuffled hash join do not separate it yet, so adding the support here. ### Why are the changes needed? Shuffled hash join and non-codegen broadcast hash join still rely on this code path for execution. So this PR will help save CPU for executing this two type of join. Adding codegen for shuffled hash join would be a different topic and I will add it in https://issues.apache.org/jira/browse/SPARK-32421 . Ran the same query as [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153-L167), with enabling and disabling this feature. Verified 20% wall clock time improvement (switch control and test group order as well to verify the improvement to not be the noise). ``` Running benchmark: shuffle hash join Running case: shuffle hash join unique key SHJ off Stopped after 5 iterations, 4039 ms Running case: shuffle hash join unique key SHJ on Stopped after 5 iterations, 2898 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz shuffle hash join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ shuffle hash join unique key SHJ off 707 808 81 5.9 168.6 1.0X shuffle hash join unique key SHJ on 547 580 50 7.7 130.4 1.3X ``` ``` Running benchmark: shuffle hash join Running case: shuffle hash join unique key SHJ on Stopped after 5 iterations, 3333 ms Running case: shuffle hash join unique key SHJ off Stopped after 5 iterations, 4268 ms Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz shuffle hash join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ shuffle hash join unique key SHJ on 565 667 60 7.4 134.8 1.0X shuffle hash join unique key SHJ off 774 854 85 5.4 184.4 0.7X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * Added test in `OuterJoinSuite` to cover left outer and right outer join. * Added test in `ExistenceJoinSuite` to cover left semi join, and existence join. 
* [Existing `joinSuite` already covered inner join.](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala#L182) * [Existing `ExistenceJoinSuite` already covered left anti join, and existence join.](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala#L228) Closes #29216 from c21/unique-key. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

Kent Yao

commit sha d315ebf3a739a05a68d0f0ab319920765bf65b0f

[SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens ### What changes were proposed in this pull request? When using `Seconds.toMicros` API to convert epoch seconds to microseconds, ```scala /** * Equivalent to * {link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}. * param duration the duration * return the converted duration, * or {code Long.MIN_VALUE} if conversion would negatively * overflow, or {code Long.MAX_VALUE} if it would positively overflow. */ ``` This PR change it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)` ### Why are the changes needed? fix silent data change between 3.x and 2.x ``` ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722  bin/spark-sql -S -e "select to_timestamp('300000', 'y');" +294247-01-10 12:00:54.775807 ``` ``` kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S -e "select to_timestamp('300000', 'y');" 284550-10-19 15:58:1010.448384 ``` ### Does this PR introduce _any_ user-facing change? Yes, we will raise `ArithmeticException` instead of giving the wrong answer if overflow. ### How was this patch tested? add unit test Closes #29220 from yaooqinn/SPARK-32424. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

HyukjinKwon

commit sha c1140661bf2cc174a380237f0f494ca56d4768ed

[SPARK-32443][CORE] Use POSIX-compatible `command -v` in testCommandAvailable ### What changes were proposed in this pull request? This PR aims to use `command -v` in non-Window operating systems instead of executing the given command. ### Why are the changes needed? 1. `command` is POSIX-compatible - **POSIX.1-2017**: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/command.html 2. `command` is faster and safer than the direct execution - `command` doesn't invoke another process. ```scala scala> sys.process.Process("ls").run().exitValue() LICENSE NOTICE bin doc lib man res1: Int = 0 ``` 3. The existing way behaves inconsistently. - `rm` cannot be checked. **AS-IS** ```scala scala> sys.process.Process("rm").run().exitValue() usage: rm [-f | -i] [-dPRrvW] file ... unlink file res0: Int = 64 ``` **TO-BE** ``` Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262). Type in expressions for evaluation. Or try :help. scala> sys.process.Process(Seq("sh", "-c", s"command -v ls")).run().exitValue() /bin/ls val res1: Int = 0 ``` 4. The existing logic is already broken in Scala 2.13 environment because it hangs like the following. ```scala $ bin/scala Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262). Type in expressions for evaluation. Or try :help. scala> sys.process.Process("cat").run().exitValue() // hang here. ``` ### Does this PR introduce _any_ user-facing change? No. Although this is inside `main` source directory, this is used for testing purpose. ``` $ git grep testCommandAvailable | grep -v 'def testCommandAvailable' core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("wc")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable(envCommand)) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(!TestUtils.testCommandAvailable("some_nonexistent_command")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat")) core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable(envCommand)) sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala: private lazy val isPythonAvailable: Boolean = TestUtils.testCommandAvailable(pythonExec) sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala: if (TestUtils.testCommandAvailable(pythonExec)) { sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("python")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("echo | sed")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash")) ``` ### How was this patch tested? - **Scala 2.12**: Pass the Jenkins with the existing tests and one modified test. - **Scala 2.13**: Do the following manually. It should pass instead of `hang`. ``` $ dev/change-scala-version.sh 2.13 $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.PipedRDDSuite ... Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #29241 from dongjoon-hyun/SPARK-32443. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

zhengruifeng

commit sha f7542d3b61faa29eee3a496935413653c714b6b2

[SPARK-32457][ML] logParam thresholds in DT/GBT/FM/LR/MLP ### What changes were proposed in this pull request? logParam `thresholds` in DT/GBT/FM/LR/MLP ### Why are the changes needed? param `thresholds` is logged in NB/RF, but not in other ProbabilisticClassifier ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #29257 from zhengruifeng/instr.logParams_add_thresholds. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>

view details

GuoPhilipse

commit sha 8de43338be879f0cfeebca328dbbcfd1e5bd70da

[SPARK-31753][SQL][DOCS] Add missing keywords in the SQL docs ### What changes were proposed in this pull request? update sql-ref docs, the following key words will be added in this PR. CASE/ELSE WHEN/THEN MAP KEYS TERMINATED BY NULL DEFINED AS LINES TERMINATED BY ESCAPED BY COLLECTION ITEMS TERMINATED BY PIVOT LATERAL VIEW OUTER? ROW FORMAT SERDE ROW FORMAT DELIMITED FIELDS TERMINATED BY IGNORE NULLS FIRST LAST ### Why are the changes needed? let more users know the sql key words usage ### Does this PR introduce _any_ user-facing change? ![image](https://user-images.githubusercontent.com/46367746/88148830-c6dc1f80-cc31-11ea-81ea-13bc9dc34550.png) ![image](https://user-images.githubusercontent.com/46367746/88148968-fb4fdb80-cc31-11ea-8649-e8297cf5813e.png) ![image](https://user-images.githubusercontent.com/46367746/88149000-073b9d80-cc32-11ea-9aa4-f914ecd72663.png) ![image](https://user-images.githubusercontent.com/46367746/88149021-0f93d880-cc32-11ea-86ed-7db8672b5aac.png) ### How was this patch tested? No Closes #29056 from GuoPhilipse/add-missing-keywords. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

view details

Frank Yin

commit sha 8323c8eb566cdb49ccaf19537897cc6dcf74d174

[SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans ### What changes were proposed in this pull request? This PR is intended to solve schema pruning not working with window functions, as described in SPARK-32059. It also solved schema pruning not working with `Sort`. It also generalizes with `Project->Filter->[any node can be pruned]`. ### Why are the changes needed? This is needed because of performance issues with nested structures with querying using window functions as well as sorting. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Introduced two tests: 1) optimizer planning level 2) end-to-end tests with SQL queries. Closes #28898 from frankyin-factual/master. Authored-by: Frank Yin <frank@factual.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

view details

Shantanu

commit sha 77f2ca6cced1c723d1c2e6082a1534f6436c6d2a

[MINOR][PYTHON] Fix spacing in error message ### What changes were proposed in this pull request? Fixes spacing in an error message ### Why are the changes needed? Makes error messages easier to read ### Does this PR introduce _any_ user-facing change? Yes, it changes the error message ### How was this patch tested? This patch doesn't affect any logic, so existing tests should cover it Closes #29264 from hauntsaninja/patch-1. Authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

Tianshi Zhu

commit sha 44a5258ac28f843dc7385868b9a86db77eeff266

[SPARK-31525][SQL] Return an empty list for df.head() when df is empty ### What changes were proposed in this pull request? return an empty list instead of None when calling `df.head()` ### Why are the changes needed? `df.head()` and `df.head(1)` are inconsistent when df is empty. ### Does this PR introduce _any_ user-facing change? Yes. If a user relies on `df.head()` to return None, things like `if df.head() is None:` will be broken. ### How was this patch tested? Closes #29214 from tianshizz/SPARK-31525. Authored-by: Tianshi Zhu <zhutianshirea@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

view details

xuewei.linxuewei

commit sha 12b9787a7f07718eaef1215331034a17302ae93d

[SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize ### What changes were proposed in this pull request? Normally, a Null aware anti join will be planed into BroadcastNestedLoopJoin which is very time consuming, for instance, in TPCH Query 16. ``` select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#45' and p_type not like 'MEDIUM POLISHED%' and p_size in (49, 14, 23, 45, 19, 3, 36, 9) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size ``` In above query, will planed into LeftAnti condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey)) Inside BroadcastNestedLoopJoinExec will perform O(M\*N), BUT if there is only single column in NAAJ, we can always change buildSide into a HashSet, and streamedSide just need to lookup in the HashSet, then the calculation will be optimized into O(M). But this optimize is only targeting on null aware anti join with single column case, because multi-column support is much more complicated, we might be able to support multi-column in future. After apply this patch, the TPCH Query 16 performance decrease from 41mins to 30s The semantic of null-aware anti join is: ![image](https://user-images.githubusercontent.com/17242071/88077041-66a39a00-cbad-11ea-8fb6-c235c4d219b4.png) ### Why are the changes needed? TPCH is a common benchmark for distributed compute engine, all other 21 Query works fine on Spark, except for Query 16, apply this patch will make Spark more competitive among all these popular engine. BTW, this patch has restricted rules and only apply on NAAJ Single Column case, which is safe enough. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 1. SQLQueryTestSuite with NOT IN keyword SQL, add CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin on and off 2. added case in org.apache.spark.sql.JoinSuite. 3. added case in org.apache.spark.sql.SubquerySuite. 3. Compare performance before and after applying this patch against TPCH Query 16. 4. config combination against e2e test with following ``` Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "false", "spark.sql.codegen.wholeStage" -> "false" ), Map( "sspark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "false", "spark.sql.codegen.wholeStage" -> "true" ), Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "true", "spark.sql.codegen.wholeStage" -> "false" ), Map( "spark.sql.optimizeNullAwareAntiJoin" -> "true", "spark.sql.adaptive.enabled" -> "true", "spark.sql.codegen.wholeStage" -> "true" ) ``` Closes #29104 from leanken/leanken-SPARK-32290. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

view details

push time in 11 days

pull request commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

Maybe just leave it as-is and write a comment about what the resulting byte order is, for readers.

mundaym

comment created time in 11 days

pull request commentdatabricks/spark-xml

XSD -> schema tool with a test

Yes, you can just build it with SBT and try it. I should just make a 0.10 release soon.

srowen

comment created time in 11 days

Pull request review commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

 public void testBinaryComparatorForNullColumns() throws Exception {   public void testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue() throws Exception {     int numFields = 1; +    long row1Data = 11L;+    long row2Data = 11L + Integer.MAX_VALUE;++    // BinaryComparator compares longs in big-endian byte order.+    if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {+      row1Data = Long.reverseBytes(row1Data);+      row2Data = Long.reverseBytes(row2Data);+    }+     UnsafeRow row1 = new UnsafeRow(numFields);     byte[] data1 = new byte[100];     row1.pointTo(data1, computeSizeInBytes(numFields * 8));-    row1.setLong(0, 11);+    row1.setLong(0, row1Data);      UnsafeRow row2 = new UnsafeRow(numFields);     byte[] data2 = new byte[100];     row2.pointTo(data2, computeSizeInBytes(numFields * 8));-    row2.setLong(0, 11L + Integer.MAX_VALUE);+    row2.setLong(0, row2Data);      insertRow(row1);     insertRow(row2); -    Assert.assertTrue(compare(0, 1) > 0);+    Assert.assertTrue(compare(0, 1) < 0);   }    @Test   public void testBinaryComparatorWhenSubtractionCanOverflowLongValue() throws Exception {     int numFields = 1; +    long row1Data = Long.MIN_VALUE;+    long row2Data = 1;++    // BinaryComparator compares longs in big-endian byte order.+    if (ByteOrder.nativeOrder().equals(ByteOrder.LITTLE_ENDIAN)) {

If you mean that this is kind of a different issue -- yes, it should be a new JIRA. I'd summarize this as: the bytes that this test sets up and asserts about are different on big-endian, so it creates the wrong test.

mundaym

comment created time in 12 days

Pull request review commentapache/spark

[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors.

 object SparkContext extends Logging {     }   } +  /**+   * Called to ensure that SparkContext is created or accessed only on the Driver.+   *+   * Throws an exception if a SparkContext is about to be created in executors.+   */+  private def assertOnDriver(): Unit = {+    if (TaskContext.get != null) {

Yep if that's the logic - that one might previously have harmlessly created a SparkContext that does not work and now it fails explicitly - then I'd say just revert it. It's just trying to fail-fast, but maybe that's not a good idea. If it affects mlflow, I'd guess it affects N other applications.

ueshin

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

 public void testBinaryComparatorForNullColumns() throws Exception {   public void testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue() throws Exception {     int numFields = 1; +    long row1Data = 11L;+    long row2Data = 11L + Integer.MAX_VALUE;++    // BinaryComparator compares longs in big-endian byte order.

I don't think overflow was the issue per se; signed vs unsigned bytes were, for sure, in the original issue. But not so much here in this test case.

mundaym

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

 public void testBinaryComparatorForNullColumns() throws Exception {   public void testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue() throws Exception {     int numFields = 1; +    long row1Data = 11L;+    long row2Data = 11L + Integer.MAX_VALUE;++    // BinaryComparator compares longs in big-endian byte order.

I don't even think that's quite true. The comparison isn't endian-sensitive at all, as it is byte-by-byte. But the point here is to write bytes in a certain order for the test, for sure.

mundaym

comment created time in 13 days

pull request commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

@viirya but this test asserts a particular ordering -- for test purposes, sure. The ordering won't be consistent, so the test fails depending on the architecture. Because this is a special case, asserting the ordering, the endianness has to be fixed.

mundaym

comment created time in 13 days

pull request commentapache/spark

[SPARK-32457][ML] logParam thresholds in DT/GBT/FM/LR/MLP

Generally it's bug fixes that are back-ported to maintenance branches, and even there the question is risk vs benefit. Doc backports happen too, because they're very low risk and because the latest release (thus latest docs) are often from a maintenance branch. For small improvements, you could argue it's very low risk, but they usually don't have much upside, so I wouldn't unless there's a good reason. There's always some odd chance the behavior change might impact something. This is pretty safe, just don't see a compelling reason to backport it, unless you are arguing it's a tiny bug fix.

zhengruifeng

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors.

 object SparkContext extends Logging {     }   } +  /**+   * Called to ensure that SparkContext is created or accessed only on the Driver.+   *+   * Throws an exception if a SparkContext is about to be created in executors.+   */+  private def assertOnDriver(): Unit = {+    if (TaskContext.get != null) {

I see, but would that ever succeed, using the SparkContext that was created? I'm trying to work out what library would do this and not fail. Is it, perhaps, some boilerplate initialization code that gets executed on driver and executor, and it makes a SparkContext that is never actually used on the executor, but now it fails fast?

I get it, if that's the use case. A release note and/or hidden config to disable it might be an OK workaround.

Alternatively, if the argument is that this is not uncommon, maybe we just don't do this at all and revert it.

ueshin

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-32160][CORE][PYSPARK] Disallow to create SparkContext in executors.

 object SparkContext extends Logging {     }   } +  /**+   * Called to ensure that SparkContext is created or accessed only on the Driver.+   *+   * Throws an exception if a SparkContext is about to be created in executors.+   */+  private def assertOnDriver(): Unit = {+    if (TaskContext.get != null) {

Does this actually affect any legitimate use case that would otherwise work? This should be more of a fail-fast for things that will already fail.
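To illustrate the fail-fast idea on the Python side (a sketch, not this PR's exact code): TaskContext.get() returns None on the driver and a real context inside a task, so a check along these lines can reject SparkContext access on executors.

```
# Hedged sketch of a driver-only assertion; not the exact code in this PR.
from pyspark import SparkContext, TaskContext

def _assert_on_driver():
    # TaskContext.get() is None on the driver and non-None inside a running task.
    if TaskContext.get() is not None:
        raise RuntimeError("SparkContext should only be created and accessed on the driver.")

sc = SparkContext.getOrCreate()
_assert_on_driver()   # passes on the driver
# Inside a task the check would raise; here we just show that a context exists there:
print(sc.parallelize([0]).map(lambda _: TaskContext.get() is not None).collect())  # [True]
```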

ueshin

comment created time in 13 days

pull request commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

I see your point: the original intent was likely to write the bytes as if they were the big-endian representation of a long. OK. Sure, agreed, just writing bytes is probably clearer still. Or at least write the intended bytes in the comment for the reader.

Up to you how you want to attack it - if there are other similar failures, we can fix them together in one PR to group them logically. If there are several distinct issues, you can tackle them separately if desired, or together could be OK too. I suspect they're all related to this general part of the code.

mundaym

comment created time in 13 days

pull request commentapache/spark

[SPARK-32443][CORE] Use POSIX-compatible `command -v` in testCommandAvailable

command is a shell built-in, right? Maybe that's the issue, but then I don't know of a fix other than launching a new shell to run this - but then, does that still work?
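Since command is a shell built-in, the check has to go through a shell; a quick illustration of that behavior (shown in Python here, purely for brevity):

```
# `command -v` only resolves the name via the shell; it does not execute the command.
import subprocess

def command_available(cmd: str) -> bool:
    return subprocess.call(
        ["sh", "-c", "command -v " + cmd],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

print(command_available("ls"))                    # True on most systems
print(command_available("no_such_command_xyz"))   # False
```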

dongjoon-hyun

comment created time in 13 days

pull request commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

The change you cite doesn't change how things are compared - just fixes the intended logic. Endian-ness isn't really at issue here in the comparison logic, because this is byte-by-byte comparison, so I'd remove comments to that effect.

What's at issue here is what the test writes. Writing 11 to memory directly on a little-endian machine produces different bytes than on a big-endian one. That results in a different comparison, because the test case ends up expressing a different test.
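A quick way to see the effect (illustrated in Python with struct, purely to show the byte layouts):

```
# The same two longs yield different byte sequences under little- vs big-endian
# layout, and a lexicographic byte comparison orders them differently.
import struct

a, b = 11, 11 + (2**31 - 1)                                # 11 and 11 + Integer.MAX_VALUE

a_le, b_le = struct.pack('<q', a), struct.pack('<q', b)    # little-endian layout
a_be, b_be = struct.pack('>q', a), struct.pack('>q', b)    # big-endian layout

print(a_le > b_le)   # True: byte-wise, the little-endian bytes of 11 sort after the larger value
print(a_be > b_be)   # False: big-endian bytes follow the numeric order
```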

I think the change is therefore reasonable, but would instead swap the bytes when the arch is big-endian, and leave the current assertion. Just a minor semantic difference that preserves the existing test, but, it's pretty fine either way.

Is that the only current big-endian test failure?

mundaym

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-32455][ML] LogisticRegressionModel prediction optimization

 class LogisticRegressionModel private[spark] (    private lazy val _intercept = interceptVector.toArray.head +  private lazy val _interceptVector = interceptVector.toDense++  private var _threshold = Double.NaN++  private var _rawThreshold = Double.NaN++  {

You probably also don't need the blank lines between vars

zhengruifeng

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-32455][ML] LogisticRegressionModel prediction optimization

 class LogisticRegressionModel private[spark] (    private lazy val _intercept = interceptVector.toArray.head +  private lazy val _interceptVector = interceptVector.toDense++  private var _threshold = Double.NaN++  private var _rawThreshold = Double.NaN++  {+    updateThreshold()+  }++  private def updateThreshold(): Unit = {+    if (!isMultinomial) {

Does anything need to happen in the multinomial case?

zhengruifeng

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-32455][ML] LogisticRegressionModel prediction optimization

 class LogisticRegressionModel private[spark] (    private lazy val _intercept = interceptVector.toArray.head +  private lazy val _interceptVector = interceptVector.toDense++  private var _threshold = Double.NaN++  private var _rawThreshold = Double.NaN++  {

Why the block here?

zhengruifeng

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-32455][ML] LogisticRegressionModel prediction optimization

 class LogisticRegressionModel private[spark] (    private lazy val _intercept = interceptVector.toArray.head

Maybe this can use _interceptVector now rather than .toArray again? or at least just read off its first element directly

zhengruifeng

comment created time in 13 days

pull request commentapache/spark

[SPARK-29918][SQL][FOLLOWUP][TEST] Fix endianness issues in tests in RecordBinaryComparatorSuite

Are you saying the test fails on big-endian machines, as-is?

mundaym

comment created time in 13 days

pull request commentapache/spark

[SPARK-32128][SQL]import SQLConf.PARTITION_OVERWRITE_VERIFY_PATH config

Jenkins retest this please

wankunde

comment created time in 13 days

pull request commentapache/spark

[SPARK-32128][SQL]import SQLConf.PARTITION_OVERWRITE_VERIFY_PATH config

Looks like this was killed by either a timeout, or a worker restarting

wankunde

comment created time in 13 days

Pull request review commentapache/spark

[SPARK-31448] [PYTHON] Fix storage level used in cache() in dataframe.py

 def cache(self):         .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.         """         self.is_cached = True-        self._jdf.cache()+        self.persist(StorageLevel.MEMORY_AND_DISK)

I see, Python StorageLevels are defined differently. Erm, @ScrapCodes or say @cloud-fan, would you have an opinion on this? It does seem plausible, but I don't know if there's a hidden reason for this difference.
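One quick way to see the difference being discussed is to compare what the JVM reports after cache() versus an explicit persist() from Python (a sketch; it only prints the levels, it doesn't assert what they should be):

```
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(10)
df1.cache()                                   # goes through the JVM-side cache()
print(df1.storageLevel)

df2 = spark.range(10)
df2.persist(StorageLevel.MEMORY_AND_DISK)     # uses Python's StorageLevel constant
print(df2.storageLevel)
```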

abhishekd0907

comment created time in 14 days

Pull request review commentapache/spark

[SPARK-32443][CORE] Use POSIX-compatible `command -v` in testCommandAvailable

 class PythonUDFSuite extends QueryTest with SharedSparkSession {    val scalaTestUDF = TestScalaUDF(name = "scalaUDF")   val pythonTestUDF = TestPythonUDF(name = "pyUDF")+  assume(pythonExec.equals("python3"))

If this is the issue, assert that it's either python or python3? The Python 3 binary could be just python.

dongjoon-hyun

comment created time in 14 days

push eventsrowen/spark

Itsuki Toyota

commit sha 86ead044e3789b3291a38ec2142cbb343d1290c1

[SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample cons… …istently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However if the job is under cluster mode, the job prints the metrics on the executor's stdout. It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout. All of the metrics should output its result on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <titsuki@cpan.org> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

Dongjoon Hyun

commit sha 7e0c5b3b539f208ef795d6848c34b52db2076a5c

[SPARK-32442][CORE][TESTS] Fix TaskSetManagerSuite by hiding `o.a.s.FakeSchedulerBackend` ### What changes were proposed in this pull request? There exists two `FakeSchedulerBackend` classes. ``` $ git grep "class FakeSchedulerBackend" core/src/test/scala/org/apache/spark/HeartbeatReceiverSuite.scala:private class FakeSchedulerBackend( core/src/test/scala/org/apache/spark/scheduler/TaskSchedulerImplSuite.scala:class FakeSchedulerBackend extends SchedulerBackend { ``` This PR aims to hide the following at `TaskSetManagerSuite`. ```scala import org.apache.spark.{FakeSchedulerBackend => _, _} ``` ### Why are the changes needed? Although `TaskSetManagerSuite` is inside `org.apache.spark.scheduler` package, `import org.apache.spark._` makes Scala 2.13 confused and causes 4 UT failures. ``` $ dev/change-scala-version.sh 2.13 $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.TaskSetManagerSuite ... Tests: succeeded 48, failed 4, canceled 0, ignored 0, pending 0 *** 4 TESTS FAILED *** ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - **Scala 2.12**: Pass the Jenkins or GitHub Action - **Scala 2.13**: Pass the following manually. ``` $ dev/change-scala-version.sh 2.13 $ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.TaskSetManagerSuite ... Tests: succeeded 52, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #29240 from dongjoon-hyun/SPARK-32442. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

view details

push time in 14 days

issue commentscala/bug

Constructors handles private[this] var

I agree - @dongjoon-hyun we could rewrite that code to simply not try to treat this as a member and save a few bytes. If the fundamental issue is that Scala is correctly recognizing that this isn't needed as a member, and that's why it's not actually serialized or something, then maybe this isn't a Scala issue per se. You all would know better whether this might generalize into other non-trivial cases.

wangyum

comment created time in 14 days

Pull request review commentapache/spark

[SPARK-31448] [PYTHON] Fix storage level used in cache() in dataframe.py

 def cache(self):         .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.         """         self.is_cached = True-        self._jdf.cache()+        self.persist(StorageLevel.MEMORY_AND_DISK)

Hm, this should already be the default on the Scala side. I am not clear why this changes behavior. I understand the idea, but I'm also not sure of all the implications here, as so much has changed since SPARK-2014.

abhishekd0907

comment created time in 14 days

pull request commentapache/spark

[SPARK-32428] [EXAMPLES] Make BinaryClassificationMetricsExample cons…

Merged to master/3.0/2.4

titsuki

comment created time in 14 days

push eventapache/spark

Itsuki Toyota

commit sha 62671af4160306f8f007fef0628b2a77da9b2824

[SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample cons… …istently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However if the job is under cluster mode, the job prints the metrics on the executor's stdout. It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout. All of the metrics should output its result on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <titsuki@cpan.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 86ead044e3789b3291a38ec2142cbb343d1290c1) Signed-off-by: Sean Owen <srowen@gmail.com>

view details

push time in 14 days

push eventapache/spark

Itsuki Toyota

commit sha d71be73355033931a1b999f0ca4a05b95303187d

[SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample cons… …istently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However if the job is under cluster mode, the job prints the metrics on the executor's stdout. It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout. All of the metrics should output its result on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <titsuki@cpan.org> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit 86ead044e3789b3291a38ec2142cbb343d1290c1) Signed-off-by: Sean Owen <srowen@gmail.com>

view details

push time in 14 days

push eventapache/spark

Itsuki Toyota

commit sha 86ead044e3789b3291a38ec2142cbb343d1290c1

[SPARK-32428][EXAMPLES] Make BinaryClassificationMetricsExample cons… …istently print the metrics on driver's stdout ### What changes were proposed in this pull request? Call collect on RDD before calling foreach so that it sends the result to the driver node and print it on this node's stdout. ### Why are the changes needed? Some RDDs in this example (e.g., precision, recall) call println without calling collect. If the job is under local mode, it sends the data to the driver node and prints the metrics on the driver's stdout. However if the job is under cluster mode, the job prints the metrics on the executor's stdout. It seems inconsistent compared to the other metrics nothing to do with RDD (e.g., auPRC, auROC) since these metrics always output the result on the driver's stdout. All of the metrics should output its result on the driver's stdout. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This is example code. It doesn't have any tests. Closes #29222 from titsuki/SPARK-32428. Authored-by: Itsuki Toyota <titsuki@cpan.org> Signed-off-by: Sean Owen <srowen@gmail.com>

view details

push time in 14 days
