Wes McKinney (wesm)
@ursa-labs / @rstudio | Nashville, TN | http://wesmckinney.com
Director of https://ursalabs.org/. Creator of Python pandas. Co-creator of Apache Arrow. @apache Member and Apache Parquet PMC.

weld-project/weld 2419

High-performance runtime for data analytics applications

scikit-learn-contrib/sklearn-pandas 2212

Pandas integration with sklearn

ibis-project/ibis 1443

A pandas-like deferred expression system, with first-class SQL support

cloudera/impyla 597

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)

fastavro/fastavro 380

Fast Avro for Python

BIDS/datarray 78

Prototyping numpy arrays with named axes for data management.

cloudera/hs2client 71

C++ native client for Impala and Hive, with Python / pandas bindings

kwgoodman/roly 16

A comparison of various moving window median algorithms

cloudera/thrift_sasl 14

Thrift SASL module that implements TSaslClientTransport

ellisonbg/py4science 9

Using Python for scientific computing

started hasura/graphql-engine

started time in 4 hours

push event apache/arrow

Mark Rushakoff

commit sha ac86123a3f013ba1eeac2b66c2ccd00810c67871

ARROW-9970: [Go] fix checkptr failure in sum methods

When running the unit tests for github.com/influxdata/flux with the race detector enabled on Go 1.14 or newer, checkptr detects failures like "fatal error: checkptr: pointer arithmetic computed bad pointer value" which are fixed by this change. I did not have sufficient time to identify how the Flux tests were different from the Arrow tests -- I did look at the Arrow tests and I did not identify a difference -- but applying this change eliminated the checkptr failures in the Flux test suite. One such stack trace from running `PKG_CONFIG=$PWD/pkg-config.sh $GOPATH/bin/go1.15.1 test -race ./stdlib/universe/` against influxdata/flux@5e15be646a25767a802ee4dacd4e262c3211808d is reproduced in full in the PR description below.

Closes #8112 from mark-rushakoff/fix/go-checkptr

Authored-by: Mark Rushakoff <mark.rushakoff@influxdata.com>
Signed-off-by: Wes McKinney <wesm@apache.org>

push time in a day

PR closed apache/arrow

ARROW-9970: [Go] fix checkptr failure in sum methods

When running the unit tests for github.com/influxdata/flux with the race detector enabled on Go 1.14 or newer, checkptr detects failures like "fatal error: checkptr: pointer arithmetic computed bad pointer value" which are fixed by this change.

I did not have sufficient time to identify how the Flux tests were different from the Arrow tests -- I did look at the Arrow tests and I did not identify a difference -- but applying this change eliminated the checkptr failures in the Flux test suite.

One such stack trace from running PKG_CONFIG=$PWD/pkg-config.sh $GOPATH/bin/go1.15.1 test -race ./stdlib/universe/ against influxdata/flux@5e15be646a25767a802ee4dacd4e262c3211808d:

fatal error: checkptr: pointer arithmetic computed bad pointer value

goroutine 3327 [running]:
runtime.throw(0x5e7c5f2, 0x37)
	/Users/mr/sdk/go1.15.1/src/runtime/panic.go:1116 +0x72 fp=0xc000fb1c40 sp=0xc000fb1c10 pc=0x403a292
runtime.checkptrArithmetic(0x1, 0x0, 0x0, 0x0)
	/Users/mr/sdk/go1.15.1/src/runtime/checkptr.go:26 +0xd7 fp=0xc000fb1c70 sp=0xc000fb1c40 pc=0x4009af7
github.com/apache/arrow/go/arrow/math.sum_float64_avx2(0xc0040ac940, 0x0)
	/Users/mr/go/pkg/mod/github.com/apache/arrow/go/arrow@v0.0.0-20200817174511-886d87bdea78/math/float64_avx2_amd64.go:36 +0x7b fp=0xc000fb1cc0 sp=0xc000fb1c70 pc=0x48566bb
github.com/apache/arrow/go/arrow/math.Float64Funcs.Sum(...)
	/Users/mr/go/pkg/mod/github.com/apache/arrow/go/arrow@v0.0.0-20200817174511-886d87bdea78/math/float64.go:38
github.com/influxdata/flux/stdlib/universe.(*MeanAgg).DoFloat(0xc004083610, 0xc0040ac940)
	/Users/mr/go/src/github.com/influxdata/flux/stdlib/universe/mean.go:144 +0x464 fp=0xc000fb1d68 sp=0xc000fb1cc0 pc=0x48e4104
github.com/influxdata/flux/execute/executetest.AggFuncTestHelper(0xc000583800, 0x70cbf60, 0xc004083600, 0xc0040ac900, 0x5bb80e0, 0x7066010)
	/Users/mr/go/src/github.com/influxdata/flux/execute/executetest/aggregate.go:24 +0x157 fp=0xc000fb1e28 sp=0xc000fb1d68 pc=0x486c537
github.com/influxdata/flux/stdlib/universe_test.TestMean_Process.func6(0xc000583800)
	/Users/mr/go/src/github.com/influxdata/flux/stdlib/universe/mean_test.go:85 +0x119 fp=0xc000fb1ed0 sp=0xc000fb1e28 pc=0x57fff99
testing.tRunner(0xc000583800, 0xc0040a0810)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1108 +0x203 fp=0xc000fb1fd0 sp=0xc000fb1ed0 pc=0x41b48c3
runtime.goexit()
	/Users/mr/sdk/go1.15.1/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000fb1fd8 sp=0xc000fb1fd0 pc=0x40742c1
created by testing.(*T).Run
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1159 +0x797

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000802780, 0x5e2b9d2, 0x10, 0x6c5f718, 0x1)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1160 +0x7cf
testing.runTests.func1(0xc000802780)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1430 +0xa7
testing.tRunner(0xc000802780, 0xc000083cc8)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1108 +0x203
testing.runTests(0xc000c421a0, 0x814ae20, 0x9d, 0x9d, 0xbfccbe7a6d9bfac8, 0x8e116025f4, 0x8159c00, 0x0)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1428 +0x5ab
testing.(*M).Run(0xc000f16a00, 0x0)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1338 +0x4ec
main.main()
	_testmain.go:401 +0x237

goroutine 6 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0000a5590)
	/Users/mr/go/pkg/mod/go.opencensus.io@v0.22.3/stats/view/worker.go:154 +0x1e6
created by go.opencensus.io/stats/view.init.0
	/Users/mr/go/pkg/mod/go.opencensus.io@v0.22.3/stats/view/worker.go:32 +0x9a

goroutine 3326 [chan receive]:
testing.(*T).Run(0xc000583680, 0x5e1885d, 0x4, 0xc0040a0810, 0xc00c196000)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1160 +0x7cf
github.com/influxdata/flux/stdlib/universe_test.TestMean_Process(0xc000583680)
	/Users/mr/go/src/github.com/influxdata/flux/stdlib/universe/mean_test.go:81 +0x2bb
testing.tRunner(0xc000583680, 0x6c5f718)
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1108 +0x203
created by testing.(*T).Run
	/Users/mr/sdk/go1.15.1/src/testing/testing.go:1159 +0x797
FAIL	github.com/influxdata/flux/stdlib/universe	16.737s
FAIL
+14 -14

6 comments

7 changed files

mark-rushakoff

pr closed time in a day

Pull request review event

push event wesm/arrow

Jorge C. Leitao

commit sha 5e7be0712e871521c56fcc820ba3a5e428a4b091

ARROW-9678: [Rust] [DataFusion] Improve projection push down to remove unused columns This PR makes the projection optimizer remove any projection or aggregation that is not used down the plan, thus improving speed and convenience. This is worked on top of #7879 and only the last commit is specific to this PR. Closes #7919 from jorgecarleitao/projection Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

Jorge C. Leitao

commit sha f98de241cb04deab3b9f8281638f79b9b57a736a

ARROW-9815: [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

@andygrove and @alamb, I have no formal training in thread and mutex management, so I am not certain about this proposal or the following explanation: my understanding is that because the result of

```
ctx_state
    .lock()
    .expect("failed to lock mutex")
    .scalar_functions
    .lock()
    .expect("failed to lock mutex")
    .get(name)
```

is of temporary lifetime, using this in `match` blocks any access to `scalar_functions` until we leave the match, which deadlocks when we recursively call the function. Here I just cloned `.scalar_functions` so that we allow the lock to be released. I may also be dead wrong on every word that I wrote above. This does work, but if you could validate my reasoning above, I would appreciate it very much!

Note that we are also doing the same for `.datasources` in this file, which I suspect will also deadlock when we have a plan with two sources. I did not touch that as I do not know the idiom/pattern to address this (locking within recursions). An alternative solution is to not make `PhysicalPlanner::create_physical_plan` recursive, and instead have it call a recursive function (with all the current logic of `create_physical_plan`) that takes references to `datasources` and `scalar_functions`, so that they can be used recursively (and we do not have to lock on every recursion).

Closes #8018 from jorgecarleitao/fix_deadlock

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

arw2019

commit sha 085b44d916cd1266911c05850a2369f30dd1fd65

ARROW-9490: [Python][C++] Bug in pa.array when input mixes int8 with float Closes #8017 from arw2019/ARROW-9490 Authored-by: arw2019 <andrew.r.wieteska@gmail.com> Signed-off-by: Wes McKinney <wesm@apache.org>

Jorge C. Leitao

commit sha 0a698c06d1e49976133cb64681d627faf8967511

ARROW-9831: [Rust][DataFusion] Fixed compilation error @andygrove , the commit that we just merged to master was not aligned with master, causing the code to fail to compile :( This fixes the compilation, but there are still failing tests. I am working on them... Closes #8026 from jorgecarleitao/fix_test Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

Andy Grove

commit sha 2e8fcd418229c8dcd86cd60952d8fef692ddc742

ARROW-9762: [Rust] [DataFusion] ExecutionContext::sql now returns DataFrame I need this change so that I can have Ballista use the DataFusion DataFrame trait and start testing the extension points for the physical planner. Closes #8027 from andygrove/ARROW-9762 Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

Neal Richardson

commit sha 85f4324af4648c2969274d1c6f32e87ab1bf861b

ARROW-9819: [C++] Bump mimalloc to 1.6.4 1.6.4 includes a fix for https://github.com/microsoft/mimalloc/issues/277, which we need to enable it in the R Windows CRAN packages. Closes #8019 from nealrichardson/mimalloc-1.6.4 Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

Jorge C. Leitao

commit sha 735c870cbaad5088413c2b198b66c5500bab88d1

ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

This commit makes all type coercion happen on the physical plane instead of the logical plane and fixes the supertype function. This means field names no longer change due to coercion rules, gives better control of how coercion supports physical calculations, and more. This commit also makes it clearer how we enforce type checking during planning: the logical plan now knows how to derive its schema directly from binary expressions, even before coercion is applied. The rationale for this change is that coercions are simplifications of a physical computation (it is easier to sum two numbers of the same type at the hardware level).

This partially solves ARROW-9809 (for binary expressions, not for UDFs), an issue in which the physical schema could be modified by coercion rules, causing the RecordBatch's schema to be different from the logical batch. This also addresses some inconsistencies in how we coerced certain types for binary operators, causing such inconsistencies to error during planning instead of execution.

This also introduces a significant number of tests of the overall consistency of binary operators: it is now explicit what types they expect and how coercion happens for each operator. It also adds tests to different parts of the physical execution, to ensure schema consistency for binary operators, including negative tests (when it should error). This also makes `like` and `nlike` generally available, and adds some tests for them. This closes ARROW-4957.

@andygrove and @alamb, I am really sorry for this long commit, but I was unable to split this into smaller parts with passing tests. There was a strong coupling between `get_supertype` and the physical expressions that made it hard to work through.

Closes #8024 from jorgecarleitao/fix_types

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

Andy Grove

commit sha 657b3d3bedda821f43066342859cb4145684d413

ARROW-9833: [Rust] [DataFusion] TableProvider.scan now returns ExecutionPlan `TableProvider.scan()` now returns `ExecutionPlan` instead of `Vec<Partition>`. This is a step towards removing the `Partition` trait and passing the `partition_id` to `ExecutionPlan.execute()` instead. I also made some minor improvements to the unit tests to reduce duplicate code and use `?` instead of `unwrap()`. Closes #8028 from andygrove/ARROW-9833 Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

Andy Grove

commit sha d1d85db24d81fe8c74a0af39242642c9091567af

ARROW-9464: [Rust] [DataFusion] Remove Partition trait

This follows on from https://github.com/apache/arrow/pull/8028.

- Removes the `Partition` trait, which was really redundant.
- `ExecutionPlan.execute()` now takes a partition index.
- Introduced a `Partitioning` enum so that execution plans can describe their partitioning scheme and number of partitions. Currently, it just has `UnknownPartitioning` but later we can add others, such as `HashPartitioning`.
- Removed `DatasourceExec` since it is no longer needed now that `TableProvider.scan` returns an `ExecutionPlan` directly.

This is a step towards extracting the threading model out of operators like `MergeExec` and having the operators be able to work with different threading models.

Closes #8029 from andygrove/execute-partition

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

liyafan82

commit sha 3fb1356ed2e4de7b00decbba081369019b9598a7

ARROW-9554: [Java] FixedWidthInPlaceVectorSorter sometimes produces wrong result See https://issues.apache.org/jira/browse/ARROW-9554 Closes #7837 from liyafan82/fly_0727_st Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: liyafan82 <fan_li_ya@foxmail.com>

Jim Klucar

commit sha 5e192008df5e213b68833dd7b90dc6992f9625f7

ARROW-9840: [Python] fs documentation out of date with code (FileStats -> FileInfo) Documentation for the python filesystem interface was not updated when the code changed FileStats to FileInfo. The code still has a line saying FileStats = FileInfo "for backward compatibility" that may not be correct because as this shows, FileInfo doesn't have the `get_target_stats` method. Closes #8038 from klucar/python_doc_fs_update Authored-by: Jim Klucar <klucar@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
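For context, a minimal sketch of the renamed API in current pyarrow (the directory path is made up for illustration):

```python
# Minimal sketch of the renamed filesystem API (FileStats -> FileInfo);
# the directory path here is hypothetical.
from pyarrow import fs

local = fs.LocalFileSystem()
# get_file_info returns FileInfo objects; there is no get_target_stats on FileInfo
infos = local.get_file_info(fs.FileSelector("/tmp/some-dir", recursive=False))
for info in infos:
    print(info.path, info.type, info.size)
```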

Romain Francois

commit sha 55defbf7a14896dc1a166f04f62f576179b4a1d5

ARROW-9405: [R] Switch to cpp11 Closes #7819 from romainfrancois/cpp11 Lead-authored-by: Romain Francois <romain@rstudio.com> Co-authored-by: Romain François <romain@rstudio.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

Jorge C. Leitao

commit sha 094392471815bd36fdfd0e738456bcde29d8aa5d

ARROW-9835: [Rust][DataFusion] Removed FunctionMeta and FunctionType Closes #8030 from jorgecarleitao/no_meta Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

alamb

commit sha f0bda5f229445e08a5467a371bf5fb973c4f7a23

ARROW-9815: [Rust][DataFusion] Add a trait for looking up scalar functions by name Inspired by the conversation on https://github.com/apache/arrow/pull/8018/files, I have been bothered by the use of Arc/Mutex and the resulting code complication in ExecutionContext and LogicalPlanning. The more I read the code, the more I am convinced that `ExecutionContextState` needs to be mutable *before* planning, but once planning has started the state is all read only (and any relevant state is copied / cloned into the `ExecutionPlan`) so the only thing Mutex/Arc are doing is making lurking bugs like ARROW-9815 more likely and requiring copies. This PR proposes a modest change to add a trait for looking up scalar functions by name, and thus removes the direct use of Box / HashMaps / etc in the TypeCoercion optimizer pass. I have several other changes in mind to avoid the need for Box / Mutex entirely in ExecutionContext but I want to keep the individual PRs small. Closes #8031 from alamb/alamb/execution_context_clean Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove73@gmail.com>

Neville Dipale

commit sha 7b2307f8aac5eb99f75038d778d6233c1b353787

ARROW-9841: [Rust] Update checked-in fbs files Built and manually fixed errors on required fields Closes #8039 from nevi-me/rust-ipc-fbs-update Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Andy Grove <andygrove73@gmail.com>

Yibo Cai

commit sha 6d02508f3caaaaf39fe726a3f08ccbcf1b5d5118

ARROW-9699: [C++][Compute] Optimize mode kernel for small integer types For small integers(bool, int8, uint8), instead of general hash table, using a value indexed array improves performance about 2x ~ 6x. Closes #7963 from cyb70289/mode-opt Authored-by: Yibo Cai <yibo.cai@arm.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
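To illustrate the idea behind the optimization (this is not the Arrow C++ kernel itself): for a domain as small as uint8, a dense count array indexed by value replaces the hash table. A NumPy sketch:

```python
# Illustrative sketch of the value-indexed counting trick for uint8 input;
# not the Arrow C++ kernel, just the underlying idea.
import numpy as np

def mode_uint8(values: np.ndarray) -> int:
    counts = np.bincount(values, minlength=256)  # one counter per possible value
    return int(np.argmax(counts))                # most frequent value

print(mode_uint8(np.array([3, 3, 7, 250, 3], dtype=np.uint8)))  # -> 3
```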

Frank Du

commit sha 7ed91f70b6389c6afd7ab9aad5c7c4efc76eea66

ARROW-9702: [C++] Register bpacking SIMD to runtime path. 1. Add AVX2 version for unpack32. 2. Add dynamic dispatch facility. 3. Pick the SIMD path at runtime by the CPU feature. Signed-off-by: Frank Du <frank.du@intel.com> Closes #7940 from jianxind/runtime_bpacking Lead-authored-by: Frank Du <frank.du@intel.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

Neal Richardson

commit sha f8b285b00acc3cb4f87e3e2b6306f0c2a5056d4f

ARROW-8001: [R][Dataset] Bindings for dataset writing

* r/R/dataset.R is broken out into smaller files. I did this in the first commit, isolated from the behavior changes, so if you do a diff without the first commit, it's easier to see what has changed
* Normalize paths in `write_dataset()` as was done in `open_dataset()` in ARROW-9743
* Add bindings to create `InMemoryDataset` and use those in `write_dataset()` to enable you to write a `data.frame`, `RecordBatch`, or `Table`
* Allow writing a subset of columns, and gather information from a previous `select()` call to do that by default. Renaming columns is not supported

Closes #8041 from nealrichardson/r-dataset-writing2

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

Neal Richardson

commit sha 74e64d01b4aa333733391628786041ce86c6192d

ARROW-9855: [R] Fix bad merge/Rcpp conflict Also adds a trailing slash to a URL, the lack of which technically causes a redirect and that led CRAN to pull the latest `arrow` submission for manual inspection 🙄 Closes #8053 from nealrichardson/fix-master-r Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

Antoine Pitrou

commit sha 4c4193d2f1c39e4a8b441aef1eb69daa334163f0

ARROW-9813: [C++] Disable semantic interposition By default, gcc enables "semantic interposition" which allows overriding a symbol using LD_PRELOAD tricks (for example). Disabling it allows faster calling conventions when calling global functions internally, and can also help inlining. Basically, doing this shouldn't cause any harm, and could in some circumstances improve performance. A quick look at Arrow shared library sizes suggests code size is only minimally reduced. Closes #8048 from pitrou/ARROW-9813-gcc-semantic-interposition Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

push time in 2 days

push event apache/arrow

Patrick Pai

commit sha 8563b422bdbaf4fdc48d9691b1f8433ad65c8734

PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec This patch makes arrow's Lz4Codec compatible with the Hadoop Lz4Codec by prepending 8 bytes (two `uint32_t`s representing expected decompressed size in bytes and compressed size in bytes) to arrow's lz4 decompression/compression. Closes #7789 from patrickpai/PARQUET-1878 Lead-authored-by: Patrick Pai <patrick.m.pai@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Wes McKinney <wesm+git@apache.org>

push time in 2 days

PR closed apache/arrow

PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

This patch makes arrow's Lz4Codec compatible with the Hadoop Lz4Codec by prepending 8 bytes (two uint32_ts representing expected decompressed size in bytes and compressed size in bytes) to arrow's lz4 decompression/compression.
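To make the framing concrete, here is a small sketch of the 8-byte prefix described above; the helper names are made up, and big-endian byte order is an assumption here:

```python
# Sketch of the Hadoop-style prefix described in the PR: expected decompressed
# size and compressed size, each as a 4-byte integer, followed by the LZ4 block.
# Helper names are hypothetical; big-endian byte order is an assumption.
import struct

def add_hadoop_prefix(lz4_block: bytes, decompressed_size: int) -> bytes:
    return struct.pack(">II", decompressed_size, len(lz4_block)) + lz4_block

def strip_hadoop_prefix(framed: bytes) -> bytes:
    decompressed_size, compressed_size = struct.unpack(">II", framed[:8])
    return framed[8:8 + compressed_size]
```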

+296 -96

32 comments

16 changed files

patrickpai

pr closed time in 2 days

Pull request review comment apache/arrow

PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

 namespace arrow {
 
 struct Compression {
   /// \brief Compression algorithm
-  enum type { UNCOMPRESSED, SNAPPY, GZIP, BROTLI, ZSTD, LZ4, LZ4_FRAME, LZO, BZ2 };
+  enum type {

Note, in general we won't be especially timid about making ABI-breaking changes to enums, so we recommend that third party projects be defensive about this

patrickpai

comment created time in 2 days

Pull request review event
Pull request review event

pull request comment apache/arrow-testing

PARQUET-1878: Add legacy lz4-compressed file

It's OK

pitrou

comment created time in 2 days

push event apache/parquet-testing

Antoine Pitrou

commit sha d914f9d289488c7db1759d7a88a4a1b8f062c7dd

PARQUET-1878: Fix legacy LZ4-compressed file (#14)

A Snappy-compressed file was mistakenly uploaded. Replace it with a "raw" LZ4-compressed file (generated by Arrow 0.17.0). This is the LZ4 block format without the custom Hadoop header. The full metadata dump of the replacement file is reproduced in the PR description below.

push time in 2 days

PR merged apache/parquet-testing

PARQUET-1878: Fix legacy LZ4-compressed file

A Snappy-compressed file was mistakenly uploaded. Replace it with a "raw" LZ4-compressed file (generated by Arrow 0.17.0). This is the LZ4 block format without the custom Hadoop header.

Version: 1.0
Created By: parquet-cpp version 1.5.1-SNAPSHOT
Total rows: 4
Number of RowGroups: 1
Number of Real Columns: 3
Number of Columns: 3
Number of Selected Columns: 3
Column 0: c0 (INT64)
Column 1: c1 (BYTE_ARRAY)
Column 2: v11 (DOUBLE)
--- Row Group: 0 ---
--- Total Bytes: 281 ---
--- Rows: 4 ---
Column 0
  Values: 4, Null Values: 0, Distinct Values: 0
  Max: 1593604801, Min: 1593604800
  Compression: LZ4_RAW, Encodings: PLAIN_DICTIONARY PLAIN RLE
  Uncompressed Size: 100, Compressed Size: 103
Column 1
  Values: 4, Null Values: 0, Distinct Values: 0
  Max: def, Min: abc
  Compression: LZ4_RAW, Encodings: PLAIN_DICTIONARY PLAIN RLE
  Uncompressed Size: 68, Compressed Size: 70
Column 2
  Values: 4, Null Values: 0, Distinct Values: 0
  Max: 42.125, Min: 7.7
  Compression: LZ4_RAW, Encodings: PLAIN_DICTIONARY PLAIN RLE
  Uncompressed Size: 109, Compressed Size: 108
--- Values ---
c0                            |c1                            |v11                           |
1593604800                    |abc                           |42.000000                     |
1593604800                    |def                           |7.700000                      |
1593604801                    |abc                           |42.125000                     |
1593604801                    |def                           |7.700000                      |
+0 -0

1 comment

1 changed file

pitrou

pr closed time in 2 days

pull request comment apache/parquet-testing

PARQUET-1878: Fix legacy LZ4-compressed file

+1

pitrou

comment created time in 2 days

pull request comment apache/parquet-testing

PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

That is a mistake then

patrickpai

comment created time in 2 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 class ArrowColumnWriterV2 {
               std::shared_ptr<Array> values_array =
                   result.leaf_array->Slice(range.start, range.Size());
 
-              return column_writer->WriteArrow(result.def_levels, result.rep_levels,
-                                               result.def_rep_level_count, *values_array,
-                                               ctx);
+              PARQUET_CATCH_AND_RETURN(column_writer->WriteArrow(
+                  result.def_levels, result.rep_levels, result.def_rep_level_count,
+                  *values_array, ctx, level_builder->Nested(), result.leaf_is_nullable));

Since WriteArrow returns Status, should we adopt the convention that APIs must either return Status or throw an exception, but not both? (FWIW I regret that we chose to allow exceptions in the Parquet C++ project back in 2016)

emkornfield

comment created time in 3 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<     return values_to_write;   } +  void MaybeCalculateValidityBits(const int16_t* def_levels, int64_t batch_size,+                                  int64_t* out_values_to_write,+                                  int64_t* out_spaced_values_to_write,+                                  int64_t* null_count) {+    if (bits_buffer_ == nullptr) {+      if (!level_info_.HasNullableValues()) {+        *out_values_to_write = batch_size;+        *out_spaced_values_to_write = batch_size;+        *null_count = 0;+      } else {+        for (int x = 0; x < batch_size; x++) {+          *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;+          *out_spaced_values_to_write +=+              def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;+        }+        *null_count = *out_values_to_write - *out_spaced_values_to_write;+      }+      return;+    }+    // Shrink to fit possible causes another allocation, and would only be necessary+    // on the last batch.+    int64_t new_bitmap_size = BitUtil::BytesForBits(batch_size);+    if (new_bitmap_size != bits_buffer_->size()) {+      PARQUET_THROW_NOT_OK(+          bits_buffer_->Resize(new_bitmap_size, /*shrink_to_fit=*/false));+      bits_buffer_->ZeroPadding();+    }+    internal::ValidityBitmapInputOutput io;+    io.valid_bits = bits_buffer_->mutable_data();+    io.values_read_upper_bound = batch_size;+    internal::DefLevelsToBitmap(def_levels, batch_size, level_info_, &io);+    *out_values_to_write = io.values_read - io.null_count;+    *out_spaced_values_to_write = io.values_read;+    *null_count = io.null_count;+  }++  std::shared_ptr<Array> MaybeUpdateArray(std::shared_ptr<Array> array,+                                          int64_t new_null_count) {+    if (bits_buffer_ == nullptr) {+      return array;+    }+    std::vector<std::shared_ptr<Buffer>> buffers = array->data()->buffers;+    buffers[0] = bits_buffer_;+    DCHECK(array->num_fields() == 0);+    return arrow::MakeArray(std::make_shared<ArrayData>(+        array->type(), array->length(), std::move(buffers), new_null_count));

Might be useful someday to have a helper function to make an array copy with a particular buffer replaced, I seem to recall a JIRA issue about this

emkornfield

comment created time in 3 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<     return values_to_write;   } +  void MaybeCalculateValidityBits(const int16_t* def_levels, int64_t batch_size,+                                  int64_t* out_values_to_write,+                                  int64_t* out_spaced_values_to_write,+                                  int64_t* null_count) {+    if (bits_buffer_ == nullptr) {+      if (!level_info_.HasNullableValues()) {+        *out_values_to_write = batch_size;+        *out_spaced_values_to_write = batch_size;+        *null_count = 0;+      } else {+        for (int x = 0; x < batch_size; x++) {+          *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;+          *out_spaced_values_to_write +=+              def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;+        }+        *null_count = *out_values_to_write - *out_spaced_values_to_write;+      }+      return;+    }+    // Shrink to fit possible causes another allocation, and would only be necessary+    // on the last batch.+    int64_t new_bitmap_size = BitUtil::BytesForBits(batch_size);+    if (new_bitmap_size != bits_buffer_->size()) {+      PARQUET_THROW_NOT_OK(+          bits_buffer_->Resize(new_bitmap_size, /*shrink_to_fit=*/false));+      bits_buffer_->ZeroPadding();+    }+    internal::ValidityBitmapInputOutput io;+    io.valid_bits = bits_buffer_->mutable_data();+    io.values_read_upper_bound = batch_size;+    internal::DefLevelsToBitmap(def_levels, batch_size, level_info_, &io);+    *out_values_to_write = io.values_read - io.null_count;+    *out_spaced_values_to_write = io.values_read;+    *null_count = io.null_count;+  }++  std::shared_ptr<Array> MaybeUpdateArray(std::shared_ptr<Array> array,

MaybeReplaceValidity?

emkornfield

comment created time in 3 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 class MultipathLevelBuilderImpl : public MultipathLevelBuilder {
                      std::move(write_leaf_callback));
   }
 
+  bool Nested() const override { return !data_->child_data.empty(); }

IsNested?

emkornfield

comment created time in 3 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 TEST(ArrowReadWrite, SingleColumnNullableStruct) {       3); } +TEST(ArrowReadWrite, DisagreeingValidityBitmap) {}++TEST(ArrowReadWrite, NestedRequiredField) {+  auto int_field = ::arrow::field("int_array", ::arrow::int32(), /*nullable=*/false);+  auto int_array = ::arrow::ArrayFromJSON(int_field->type(), "[0, 1, 2, 3, 4, 5, 7, 8]");+  auto struct_field =+      ::arrow::field("root", ::arrow::struct_({int_field}), /*nullable=*/true);+  std::shared_ptr<Buffer> validity_bitmap;+  ASSERT_OK_AND_ASSIGN(validity_bitmap, ::arrow::AllocateBitmap(8));+  validity_bitmap->mutable_data()[0] = 0xCC;++  auto struct_data = std::make_shared<ArrayData>(+      struct_field->type(), /*length=*/8,+      std::vector<std::shared_ptr<Buffer>>{validity_bitmap},+      std::vector<std::shared_ptr<ArrayData>>{int_array->data()});+  CheckSimpleRoundtrip(+      ::arrow::Table::Make(+          ::arrow::schema({struct_field}),+          {std::make_shared<::arrow::ChunkedArray>(::arrow::MakeArray(struct_data))}),+      /*row_group_size=*/8);+}++TEST(ArrowReadWrite, NestedNullableField) {+  auto int_field = ::arrow::field("int_array", ::arrow::int32());+  auto int_array = ::arrow::ArrayFromJSON(int_field->type(), "[0, null, 2, null, 4, 5, null, 8]");+  auto struct_field =+      ::arrow::field("root", ::arrow::struct_({int_field}), /*nullable=*/true);+  std::shared_ptr<Buffer> validity_bitmap;+  ASSERT_OK_AND_ASSIGN(validity_bitmap, ::arrow::AllocateBitmap(8));+  validity_bitmap->mutable_data()[0] = 0xCC;++  auto struct_data = std::make_shared<ArrayData>(+      struct_field->type(), /*length=*/8,+      std::vector<std::shared_ptr<Buffer>>{validity_bitmap},+      std::vector<std::shared_ptr<ArrayData>>{int_array->data()});

You can use ArrayData::Make for nicer syntax (don't have to write out std::vector<std::shared_ptr<Buffer>>)

emkornfield

comment created time in 3 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<    Status WriteArrow(const int16_t* def_levels, const int16_t* rep_levels,                     int64_t num_levels, const ::arrow::Array& array,-                    ArrowWriteContext* ctx) override {+                    ArrowWriteContext* ctx, bool nested, bool array_nullable) override {+    bool leaf_is_not_nullable = !level_info_.HasNullableValues();+    // Leaf nulls are canonical when there is only a single null element and it is at the+    // leaf.+    bool leaf_nulls_are_canonical =+        (level_info_.def_level == level_info_.repeated_ancestor_def_level + 1) &&+        array_nullable;+    bool maybe_has_nulls = nested && !(leaf_is_not_nullable || leaf_nulls_are_canonical);

The fact that maybe_has_nulls is false whenever nested is false seems odd

emkornfield

comment created time in 3 days

Pull request review comment apache/arrow

ARROW-9603: [C++] Fix parquet write

 TEST(ArrowReadWrite, SingleColumnNullableStruct) {       3); } +TEST(ArrowReadWrite, DisagreeingValidityBitmap) {}++TEST(ArrowReadWrite, NestedRequiredField) {+  auto int_field = ::arrow::field("int_array", ::arrow::int32(), /*nullable=*/false);+  auto int_array = ::arrow::ArrayFromJSON(int_field->type(), "[0, 1, 2, 3, 4, 5, 7, 8]");+  auto struct_field =+      ::arrow::field("root", ::arrow::struct_({int_field}), /*nullable=*/true);+  std::shared_ptr<Buffer> validity_bitmap;+  ASSERT_OK_AND_ASSIGN(validity_bitmap, ::arrow::AllocateBitmap(8));+  validity_bitmap->mutable_data()[0] = 0xCC;++  auto struct_data = std::make_shared<ArrayData>(+      struct_field->type(), /*length=*/8,+      std::vector<std::shared_ptr<Buffer>>{validity_bitmap},+      std::vector<std::shared_ptr<ArrayData>>{int_array->data()});+  CheckSimpleRoundtrip(+      ::arrow::Table::Make(+          ::arrow::schema({struct_field}),+          {std::make_shared<::arrow::ChunkedArray>(::arrow::MakeArray(struct_data))}),+      /*row_group_size=*/8);

Seems like there might be a helper function opportunity if this pattern is repeated in other test functions

emkornfield

comment created time in 3 days

Pull request review event
Pull request review event

pull request comment apache/arrow

ARROW-9603: [C++] Fix parquet write

@xhochy might be the only one. I can do my best to provide some comments

emkornfield

comment created time in 3 days

pull request comment apache/arrow

ARROW-9761: [C/C++] Add experimental C stream interface

Another thing that occurred to me is whether we want to enable batch-level metadata (which would be implementation-defined). This is supported in Flight for example

https://github.com/apache/arrow/blob/master/format/Flight.proto#L316

pitrou

comment created time in 3 days

pull request comment apache/arrow

ARROW-8601: [Go][Flight] Implementation of Flight RPC server and client

Agree it would be good for this to be reviewed / merged. Can you rebase and get a more complete CI build?

zeroshade

comment created time in 3 days

pull request comment apache/arrow

ARROW-9603: [C++] Fix parquet write

I'm not sure I have enough mental context to review this PR carefully

emkornfield

comment created time in 3 days

issue comment ibis-project/ibis-dask

Monorepo for Ibis?

I guess I don't see a problem with having a lot of committers in a single repository as long as everyone knows what their role is (e.g. in Arrow we have > 50 committers and have never had issues). Robust change validation and shared CI infrastructure is worth a lot.

wesm

comment created time in 3 days

issue comment ibis-project/ibis-dask

Monorepo for Ibis?

You can achieve one-directional integration testing for changes in the core repo only, but bidirectional integration testing isn't possible on GitHub without using a monorepo. It may not become an issue, but I just wanted to emphasize that having multiple Python packages doesn't require multiple git repositories (which have more downsides than upsides IMHO).

wesm

comment created time in 3 days

issue opened ibis-project/ibis-dask

Monorepo for Ibis?

I'm curious about the many-repo approach versus a mono-repo approach (similar to what Vaex does https://github.com/vaexio/vaex/tree/master/packages). It seems like having many repositories will make CI/integration testing much more difficult (PRs in the core that break the dependent packages won't be detected -- they could be detected in a monorepo)

created time in 3 days

issue closed apache/arrow

How to transform the Arrow data column to array of array efficiently?

Hi all: in my case, I'm asking whether there is any efficient way to transform an Arrow-format table into a format such as Array[Array[Double]] in Java or Scala?

closed time in 3 days

Zarca

issue comment apache/arrow

How to transform the Arrow data column to array of array efficiently?

This would be best discussed further on user@arrow.apache.org

Zarca

comment created time in 3 days

started kpamnany/partr

started time in 6 days

started clibs/coro

started time in 6 days

Pull request review comment apache/arrow

ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

 cdef class Scanner(_Weakrefable):
     def from_fragment(Fragment fragment not None, Schema schema=None,
                       bint use_threads=True, MemoryPool memory_pool=None,
                       list columns=None, Expression filter=None,
-                      int batch_size=32*2**10):
+                      int batch_size=2**20):

use common variable

bkietz

comment created time in 7 days

Pull request review comment apache/arrow

ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

 class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
     /// @{
     std::unordered_set<std::string> dict_columns;
     /// @}
+
+    /// Parallelize conversion across columns. This option is ignored if a scan is already
+    /// parallelized across input files.

Add note that this is experimental. Once we fix our nested parallelism problem this shouldn't be needed

bkietz

comment created time in 7 days

Pull request review comment apache/arrow

ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

 cdef class ParquetReadOptions(_Weakrefable):
     dictionary_columns : list of string, default None
         Names of columns which should be dictionary encoded as
         they are read.
+    enable_parallel_column_conversion : bool, default False
+        Whether single files may be read in parallel (ignored when reading
+        multiple files to avoid thread contention).

Add note that this is experimental. Once we fix our nested parallelism problem this shouldn't be needed

bkietz

comment created time in 7 days

Pull request review comment apache/arrow

ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

 cdef shared_ptr[CScanContext] _build_scan_context(bint use_threads=True,
 
 cdef void _populate_builder(const shared_ptr[CScannerBuilder]& ptr,
                             list columns=None, Expression filter=None,
-                            int batch_size=32*2**10) except *:
+                            int batch_size=2**20) except *:

Can this be put in a global variable (DEFAULT_BATCH_SIZE) rather than hard-coded?

bkietz

comment created time in 7 days
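A tiny sketch of the suggested change, with `_populate_builder` reduced to a plain-Python stand-in for the Cython helper in the diff above:

```python
# Sketch of the reviewer's suggestion: one shared default instead of repeating
# the literal. _populate_builder here is a simplified stand-in, not the real helper.
DEFAULT_BATCH_SIZE = 2**20

def _populate_builder(columns=None, filter=None, batch_size=DEFAULT_BATCH_SIZE):
    ...
```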

Pull request review event
Pull request review event

started snowflakedb/dplyr-snowflakedb

started time in 7 days

Pull request review event

pull request comment apache/arrow

PARQUET-1878: [C++] lz4 codec is not compatible with Hadoop Lz4Codec

Seems like someone else may need to pick this up and finish it.

patrickpai

comment created time in 7 days

PR closed apache/arrow

[CI] Create a windows docker image for testing C++ [WIP]
+525 -19

2 comments

16 changed files

kszucs

pr closed time in 7 days

pull request comment apache/arrow

[CI] Create a windows docker image for testing C++ [WIP]

I'll close this until it can be picked up again in the future

kszucs

comment created time in 7 days

pull request comment apache/arrow

ARROW-10024: [C++][Parquet] Create nested reading benchmarks

Looks like there is something wrong with the Ursabot result reporting, but you can see the results here:

https://ci.ursalabs.org/#/builders/73/builds/103/steps/3/logs/result

pitrou

comment created time in 8 days

pull request comment apache/arrow

ARROW-9956: [C++] [Gandiva] Implementation of binary_string function in gandiva

We have hex encoding / decoding functions in arrow/util/string.h, is it possible for you to use them?

naman1996

comment created time in 8 days

started fishtown-analytics/dbt

started time in 9 days

pull request comment apache/arrow

ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

In terms of benchmarking, it also strikes me that one issue is that it may be faster (especially on machines with a lot of cores -- e.g. 16/20 core servers) to read a 2-file (or even n-file where n is some number less than the number of cores on the machine) dataset by reading the files one at a time rather than using the datasets API. How many files do you have to have before the performance issue goes away? This is something that would be good to quantify in a collection of benchmarks

bkietz

comment created time in 9 days
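As an illustration of the comparison suggested in the comment above, a rough timing sketch against the Python API (the file names and harness are made up; this is not a rigorous benchmark):

```python
# Rough timing sketch comparing per-file reads against a dataset scan for a
# small number of files; the file list is hypothetical.
import time
import pyarrow.dataset as ds
import pyarrow.parquet as pq

files = ["part-0.parquet", "part-1.parquet"]  # a hypothetical 2-file dataset

t0 = time.perf_counter()
tables = [pq.read_table(f) for f in files]               # one file at a time
per_file = time.perf_counter() - t0

t0 = time.perf_counter()
table = ds.dataset(files, format="parquet").to_table()   # via the datasets API
via_dataset = time.perf_counter() - t0

print(f"per-file: {per_file:.3f}s  dataset API: {via_dataset:.3f}s")
```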

pull request comment apache/arrow

ARROW-9848: [Rust] Implement 0.15 IPC alignment

In C++ we are more hardened against metadata integrity issues from all the IPC fuzzing, e.g.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L747

According to the Flatbuffers spec, this field should be length-0, not null.

nevi-me

comment created time in 11 days

started orchest/orchest

started time in 15 days

Pull request review event

Pull request review comment apache/arrow

ARROW-9078: [C++] Parquet read / write extension type with nested storage type

 class ARROW_EXPORT ExtensionType : public DataType {
   /// \return the serialized representation
   virtual std::string Serialize() const = 0;
 
+  static std::shared_ptr<Array> WrapArray(const std::shared_ptr<DataType>& ext_type,
+                                          const std::shared_ptr<Array>& storage);
+  static std::shared_ptr<ChunkedArray> WrapArray(
+      const std::shared_ptr<DataType>& ext_type,
+      const std::shared_ptr<ChunkedArray>& storage);

Seems OK to me. May want doxygen comments

pitrou

comment created time in 16 days
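The Python-level counterpart of wrapping storage in an extension type (a rough analogue of the C++ WrapArray helper in the diff above, not the same API; UuidType is a made-up example type):

```python
# Rough Python-level analogue of wrapping a storage array in an extension type;
# UuidType is a hypothetical example type, not part of pyarrow.
import pyarrow as pa

class UuidType(pa.PyExtensionType):
    def __init__(self):
        super().__init__(pa.binary(16))  # storage type

    def __reduce__(self):  # lets the type be reconstructed after serialization
        return UuidType, ()

storage = pa.array([b"0" * 16, b"1" * 16], type=pa.binary(16))
ext_array = pa.ExtensionArray.from_storage(UuidType(), storage)
print(ext_array.type, ext_array.storage)
```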

started KDAB/hotspot

started time in 16 days

issue closed apache/arrow

ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

It seems that the ParquetWriter doesn't behave as expected when I am passing a pyarrow schema that comes out of a pyarrow table. Approaching a problem in two ways, I notice unexpected behavior.

If I construct a pyarrow schema from the datatypes, so I get a schema that has no metadata attached:

print(pyarrow_schema)
---
sample_column1: string
sample_column2: date32[day]
sample_column3: float

Binding that to the ParquetWriter and a pyarrow table and writing it out:

pqwriter = pq.ParquetWriter(out_io, schema=pyarrow_schema, compression='snappy')
df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter.write_table(table=df)

I get the expected result: a queryable, well-formed Parquet file. I'm adding an external schema on top of the file to query it through Redshift Spectrum.

However, if I create the schema, bind it to the table, and then bind the table schema to the ParquetWriter, the result is a bad Parquet file.

df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter = pq.ParquetWriter(out_io, schema=df.schema, compression='snappy')
pqwriter.write_table(table=df)

What I notice is that the schema coming from the pyarrow table comes with attached metadata, but removing the metadata does not seem to solve the issue.

df = pa.Table.from_pandas(df, schema=pyarrow_schema)
pqwriter = pq.ParquetWriter(out_io, schema=df.schema.remove_metadata(), compression='snappy')
pqwriter.write_table(table=df)

Should I report a bug?
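For reference, the two paths the report describes boil down to roughly the following (a sketch with made-up data, not the reporter's exact code):

```python
# Condensed sketch of the two code paths described in the issue; the DataFrame
# contents are made up for illustration.
import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "sample_column1": ["a", "b"],
    "sample_column2": [pd.Timestamp("2020-01-01").date(), pd.Timestamp("2020-01-02").date()],
    "sample_column3": [1.5, 2.5],
})
pyarrow_schema = pa.schema([
    ("sample_column1", pa.string()),
    ("sample_column2", pa.date32()),
    ("sample_column3", pa.float32()),
])
table = pa.Table.from_pandas(df, schema=pyarrow_schema)

# Path 1 (reported to work): the hand-built schema goes to the writer.
out1 = io.BytesIO()
writer = pq.ParquetWriter(out1, schema=pyarrow_schema, compression="snappy")
writer.write_table(table)
writer.close()

# Path 2 (reported to misbehave downstream): the writer takes the table's
# schema, which carries the pandas metadata attached by from_pandas.
out2 = io.BytesIO()
writer = pq.ParquetWriter(out2, schema=table.schema, compression="snappy")
writer.write_table(table)
writer.close()
```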

closed time in 16 days

losze1cj

issue comment apache/arrow

ParquetWriter creates bad files when passed pyarrow schema from pyarrow table?

Ping. Closing in the meantime

losze1cj

comment created time in 16 days

issue closed apache/arrow

Is there any detailed documentation for deploying plasma on kubernetes?

Is there any detailed documentation for deploying plasma on kubernetes?

closed time in 16 days

Mervey0324

issue comment apache/arrow

Is there any detailed documentation for deploying plasma on kubernetes?

I don’t think so. We are discussing removing Plasma from the project because it is unmaintained. You can discuss it on dev@arrow.apache.org

Mervey0324

comment created time in 16 days

issue closed apache/arrow

Plasma contains

I ask if there is an object in plasma using client.contains(objectID), and it returns false; then I try to save the object and it says that it is already there. I know this happens because another client has added the object there. What I do not understand is why it says that the object is not contained in plasma. I even print the plasma store using list, and it is not there either.

How can I check whether an object is stored in plasma without attempting to store it again? Thank you

closed time in 16 days

pereverges

issue comment apache/arrow

Plasma contains

You can ask questions on user@arrow.apache.org. Thanks

pereverges

comment created time in 16 days

issue closed apache/arrow

Add dplyr group_by, summarise and mutate support in function open_dataset R arrow package

Hi,

The open_dataset() function in the R arrow package already includes support for the dplyr filter, select and rename functions. However, it would be a huge improvement if it could also include other functions such as group_by, summarise and mutate before calling collect(). Is there any idea or project going on to do so? Would it be possible to include those features?

Many thanks for this excellent job.

closed time in 17 days

PalGal2

started apple/foundationdb

started time in 22 days

pull request comment apache/arrow

[C++][Python] Refactor python to arrow conversions based on a reusable conversion API [WIP]

Are there some related JIRAs that will be fixed by this?

kszucs

comment created time in 23 days

started pingcap/tidb

started time in a month

pull request comment apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

Thanks. +1 from me

pitrou

comment created time in a month

PR closed apache/arrow

ARROW-7878: [C++][Compute] Draft LogicalPlan classes

Draft of the LogicalPlan basic classes for the query engine. One should read the classes in this order for review:

  • ExprType/Expr
  • LogicalPlanBuilder/LogicalPlan
  • Catalog
+2978 -3

14 comments

43 changed files

fsaintjacques

pr closed time in a month

pull request comment apache/arrow

ARROW-7878: [C++][Compute] Draft LogicalPlan classes

Will close for now until there is bandwidth to make a push on this project.

fsaintjacques

comment created time in a month

Pull request review event

Pull request review comment apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

  #include <cstdint> #include <memory>-#include <unordered_map> #include <utility> #include <vector> -#include "arrow/memory_pool.h"+#include "arrow/result.h" #include "arrow/status.h"+#include "arrow/type_fwd.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h"  namespace arrow {+namespace ipc { -class Array;-class DataType;-class Field;-class RecordBatch;+class FieldPosition {+ public:+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}

Seems reasonable to me

pitrou

comment created time in a month

Pull request review comment apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

  #include <cstdint> #include <memory>-#include <unordered_map> #include <utility> #include <vector> -#include "arrow/memory_pool.h"+#include "arrow/result.h" #include "arrow/status.h"+#include "arrow/type_fwd.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h"  namespace arrow {+namespace ipc {++class FieldPosition {+ public:+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}++  FieldPosition child(int index) const { return {this, index}; }++  std::vector<int> path() const {+    std::vector<int> path(depth_);+    const FieldPosition* cur = this;+    for (int i = depth_ - 1; i >= 0; --i) {+      path[i] = cur->index_;+      cur = cur->parent_;+    }+    return path;+  }++ protected:+  FieldPosition(const FieldPosition* parent, int index)+      : parent_(parent), index_(index), depth_(parent->depth_ + 1) {}++  const FieldPosition* parent_;+  int index_;+  int depth_;+}; -class Array;-class DataType;-class Field;-class RecordBatch;+/// \brief Map fields in a schema to dictionary ids+///+/// The mapping is structural, i.e. the field path (as a vector of indices)+/// is associated to the dictionary id.+class ARROW_EXPORT DictionaryFieldMapper {+ public:+  DictionaryFieldMapper();+  explicit DictionaryFieldMapper(const Schema& schema);+  ~DictionaryFieldMapper(); -namespace ipc {+  Status AddSchemaFields(const Schema& schema);+  Status AddField(int64_t id, std::vector<int> field_path); -/// \brief Memoization data structure for assigning id numbers to-/// dictionaries and tracking their current state through possible-/// deltas in an IPC stream+  Result<int64_t> GetFieldId(std::vector<int> field_path) const;++  int num_fields() const;++ private:+  struct Impl;+  std::unique_ptr<Impl> impl_;+};++using DictionaryVector = std::vector<std::pair<int64_t, std::shared_ptr<Array>>>;++/// \brief Memoization data structure for reading dictionaries from IPC streams+///+/// This structure tracks the following associations:+/// - field position (structural) -> dictionary id+/// - dictionary id -> value type+/// - dictionary id -> dictionary (value) data+///+/// Together, they allow resolving dictionary data when reading an IPC stream,+/// using metadata recorded in the schema message and data recorded in the+/// dictionary batch messages (see ResolveDictionaries).+///+/// This structure isn't useful for writing an IPC stream, where only+/// DictionaryFieldMapper is necessary. class ARROW_EXPORT DictionaryMemo {  public:-  using DictionaryVector = std::vector<std::pair<int64_t, std::shared_ptr<Array>>>;-   DictionaryMemo();-  DictionaryMemo(DictionaryMemo&&) = default;-  DictionaryMemo& operator=(DictionaryMemo&&) = default;+  ~DictionaryMemo();++  DictionaryFieldMapper& fields();

Is this necessary? It might be clearer to have DictionaryFieldMapper* mutable_fields()

pitrou

comment created time in a month

Pull request review comment apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

 #include "arrow/status.h" #include "arrow/type.h" #include "arrow/util/checked_cast.h"+#include "arrow/util/logging.h"++namespace std {+template <>+struct hash<arrow::FieldPath> {+  size_t operator()(const arrow::FieldPath& path) const { return path.hash(); }+};

This might be better placed in a common header, but this is ok for now

pitrou

comment created time in a month

Pull request review comment apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

 namespace flatbuf = org::apache::arrow::flatbuf;
 
 namespace ipc {
 
+class DictionaryFieldMapper;
 class DictionaryMemo;

Perhaps we should move these both to ipc::internal?

pitrou

comment created time in a month

Pull request review comment apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

  #include <cstdint> #include <memory>-#include <unordered_map> #include <utility> #include <vector> -#include "arrow/memory_pool.h"+#include "arrow/result.h" #include "arrow/status.h"+#include "arrow/type_fwd.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h"  namespace arrow {+namespace ipc {++class FieldPosition {+ public:+  FieldPosition() : parent_(NULLPTR), index_(-1), depth_(0) {}++  FieldPosition child(int index) const { return {this, index}; }++  std::vector<int> path() const {+    std::vector<int> path(depth_);+    const FieldPosition* cur = this;+    for (int i = depth_ - 1; i >= 0; --i) {+      path[i] = cur->index_;+      cur = cur->parent_;+    }+    return path;+  }++ protected:+  FieldPosition(const FieldPosition* parent, int index)+      : parent_(parent), index_(index), depth_(parent->depth_ + 1) {}++  const FieldPosition* parent_;+  int index_;+  int depth_;+}; -class Array;-class DataType;-class Field;-class RecordBatch;+/// \brief Map fields in a schema to dictionary ids+///+/// The mapping is structural, i.e. the field path (as a vector of indices)+/// is associated to the dictionary id.

It would be helpful to state that a dictionary id can be associated with multiple field paths

pitrou

comment created time in a month

Pull request review comment: apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

 Status SchemaToString(const Schema& schema, std::string* out) {
-  // TODO(wesm): Do we care about better memory efficiency here?
   ipc::DictionaryMemo unused_dict_memo;

This can be removed now?

pitrou

comment created time in a month

PullRequestReviewEvent

pull request comment: cwida/duckdb

Initial Arrow support

Need to open a follow-up issue to fix the timestamp problem?

hannesmuehleisen

comment created time in a month

Pull request review comment: cwida/duckdb

Initial Arrow support

	py::object fetch_arrow_table() {
		if (!result) {
			throw runtime_error("result closed");
		}

		auto pyarrow_lib_module = py::module::import("pyarrow").attr("lib");

		auto batch_import_func = pyarrow_lib_module.attr("RecordBatch").attr("_import_from_c");
		auto from_batches_func = pyarrow_lib_module.attr("Table").attr("from_batches");
		auto schema_import_func = pyarrow_lib_module.attr("Schema").attr("_import_from_c");
		ArrowSchema schema;
		result->ToArrowSchema(&schema);
		auto schema_obj = schema_import_func((uint64_t)&schema);

		py::list batches;
		while (true) {
			auto data_chunk = result->Fetch();
			if (data_chunk->size() == 0) {
				break;
			}
			ArrowArray data;
			data_chunk->ToArrowArray(&data);
			ArrowSchema schema;
			result->ToArrowSchema(&schema);
			batches.append(batch_import_func((uint64_t)&data, (uint64_t)&schema));
		}
		return from_batches_func(batches, schema_obj);
	}

As soon as pyarrow has built-in support for the C interface stream/iterator (hopefully in the next release!), this can basically be dropped

hannesmuehleisen

comment created time in a month

Pull request review comment: cwida/duckdb

Initial Arrow support

struct DuckDBArrowSchemaHolder {
	// unused in root
	ArrowSchema schema;
	// unused in children
	unique_ptr<ArrowSchema *[]> children; // just space for the *pointers* to children, not the children themselves
};

static void release_duckdb_arrow_schema(ArrowSchema *schema) {
	if (!schema || !schema->release) {
		return;
	}
	schema->release = nullptr;
	auto holder = (DuckDBArrowSchemaHolder *)schema->private_data;
	delete holder;
}

void QueryResult::ToArrowSchema(ArrowSchema *out_schema) {
	assert(out_schema);

	auto root_holder = new DuckDBArrowSchemaHolder();

	root_holder->children = unique_ptr<ArrowSchema *[]>(new ArrowSchema *[column_count()]);
	out_schema->private_data = root_holder;
	out_schema->release = release_duckdb_arrow_schema;

	out_schema->children = root_holder->children.get();

	out_schema->format = "+s"; // struct apparently

More precisely, you could say that the C interface uses a struct to communicate an Arrow record batch (to avoid having an unwieldy dichotomy between an Array and a RecordBatch).

https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst#record-batches
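For reference, a minimal sketch of that convention (not DuckDB's code; the helper name is made up): the exported top-level ArrowSchema uses the struct format "+s" and its children are the batch's columns.

#include <cstdint>

#include "arrow/c/abi.h"  // ArrowSchema / ArrowArray definitions

// Toy sketch: the C data interface has no dedicated "record batch" struct,
// so a batch is described as a struct-typed ArrowSchema whose children are
// the columns.
static void ReleaseNoop(ArrowSchema* schema) { schema->release = nullptr; }

void FillToyBatchSchema(ArrowSchema* out, ArrowSchema** columns, int64_t n_columns) {
  *out = ArrowSchema{};
  out->format = "+s";           // struct of children == one record batch
  out->name = "";
  out->n_children = n_columns;  // one child schema per column
  out->children = columns;      // children owned by the caller in this sketch
  out->release = ReleaseNoop;   // real code must release children/private_data here
}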

hannesmuehleisen

comment created time in a month

Pull request review comment: cwida/duckdb

Initial Arrow support

// Excerpt from the arrow_scan table function in this PR (bind-time type
// mapping and scan-time timestamp conversion):
		} else if (format == "tsn:") {
			return_types.push_back(LogicalType::TIMESTAMP);
		} else {
			throw NotImplementedException("Unsupported Arrow type %s", format);
		}
		// ...
		} // TODO timestamps in duckdb are subject to change
		case LogicalTypeId::TIMESTAMP: {
			auto src_ptr = (uint64_t *)array.buffers[1] + data.chunk_offset;
			auto tgt_ptr = (timestamp_t *)FlatVector::GetData(output.data[col_idx]);

			for (idx_t row = 0; row < output.size(); row++) {
				auto source_idx = data.chunk_offset + row;

				auto ms = src_ptr[source_idx] / 1000000; // nanoseconds

This needs to use the time unit from the schema -- won't it be incorrect for units other than nanoseconds?
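For illustration, a hypothetical helper (not DuckDB's actual conversion path) that honors the unit encoded in the Arrow format string; per the C data interface, "tss:", "tsm:", "tsu:" and "tsn:" (optionally followed by a timezone) mean seconds, milliseconds, microseconds and nanoseconds:

#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: normalize an Arrow timestamp value to nanoseconds
// based on the unit prefix of the schema's format string.
int64_t ArrowTimestampToNanos(int64_t value, const std::string& format) {
  if (format.rfind("tss:", 0) == 0) return value * 1000000000LL;  // seconds
  if (format.rfind("tsm:", 0) == 0) return value * 1000000LL;     // milliseconds
  if (format.rfind("tsu:", 0) == 0) return value * 1000LL;        // microseconds
  if (format.rfind("tsn:", 0) == 0) return value;                 // nanoseconds
  throw std::runtime_error("not an Arrow timestamp format: " + format);
}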

hannesmuehleisen

comment created time in a month

Pull request review comment: cwida/duckdb

Initial Arrow support

// Excerpt from the arrow_scan table function in this PR (VARCHAR handling):
		case LogicalTypeId::VARCHAR: {
			auto offsets = (uint32_t *)array.buffers[1] + array.offset + data.chunk_offset;
			auto cdata = (char *)array.buffers[2];

			for (idx_t row_idx = 0; row_idx < output.size(); row_idx++) {
				if (FlatVector::Nullmask(output.data[col_idx])[row_idx]) {
					continue;
				}
				auto cptr = cdata + offsets[row_idx];
				auto str_len = offsets[row_idx + 1] - offsets[row_idx];

				auto utf_type = Utf8Proc::Analyze(cptr, str_len);
				switch (utf_type) {
				case UnicodeType::ASCII:
					FlatVector::GetData<string_t>(output.data[col_idx])[row_idx] =
					    StringVector::AddString(output.data[col_idx], cptr, str_len);
					break;
				case UnicodeType::UNICODE:
					// this regrettably copies to normalize

Why is that?

hannesmuehleisen

comment created time in a month

Pull request review comment: cwida/duckdb

Initial Arrow support

// Excerpt from the arrow_scan table function in this PR (bind-time type mapping):
		auto format = string(schema.format);
		if (format == "b") {
			return_types.push_back(LogicalType::BOOLEAN);
		} else if (format == "c") {
			return_types.push_back(LogicalType::TINYINT);
		} else if (format == "s") {
			return_types.push_back(LogicalType::SMALLINT);
		} else if (format == "i") {
			return_types.push_back(LogicalType::INTEGER);
		} else if (format == "l") {
			return_types.push_back(LogicalType::BIGINT);
		} else if (format == "f") {
			return_types.push_back(LogicalType::FLOAT);
		} else if (format == "g") {
			return_types.push_back(LogicalType::DOUBLE);
		} else if (format == "d:38,0") { // decimal128
			return_types.push_back(LogicalType::HUGEINT);
		} else if (format == "u") {
			return_types.push_back(LogicalType::VARCHAR);

Are your VARCHAR values required to be UTF-8? Curious how Arrow BINARY might be handled.

hannesmuehleisen

comment created time in a month

Pull request review comment: cwida/duckdb

Initial Arrow support

// Excerpt from the arrow_scan table function in this PR (null mask handling):
		if (array.null_count != 0 && array.buffers[0]) {
			auto &nullmask = FlatVector::Nullmask(output.data[col_idx]);

			auto bit_offset = data.chunk_offset + array.offset;
			auto n_bitmask_bytes = (output.size() + 8 - 1) / 8;

			if (bit_offset % 8 == 0) {
				// just memcpy nullmask
				memcpy(&nullmask, (uint8_t *)array.buffers[0] + bit_offset / 8, n_bitmask_bytes);
			} else {
				// need to re-align nullmask :/
				bitset<STANDARD_VECTOR_SIZE + 8> temp_nullmask;
				memcpy(&temp_nullmask, (uint8_t *)array.buffers[0] + bit_offset / 8, n_bitmask_bytes + 1);

				temp_nullmask >>= (bit_offset % 8); // why this has to be a right shift is a mystery to me

In case it's helpful: https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#validity-bitmaps
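For context, a generic sketch (not the PR's code) of how a single validity bit is read: Arrow validity bitmaps are bit-packed in least-significant-bit order, which is why a right shift appears when re-aligning them.

#include <cstdint>

// Generic sketch: bit i of an Arrow validity bitmap lives in byte i / 8 at
// bit position i % 8 (LSB numbering). A set bit means the value is valid.
inline bool IsValid(const uint8_t* validity_bitmap, int64_t array_offset, int64_t i) {
  if (validity_bitmap == nullptr) {
    return true;  // no bitmap means the array has no nulls
  }
  const int64_t bit = array_offset + i;
  return (validity_bitmap[bit >> 3] >> (bit & 7)) & 1;
}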

hannesmuehleisen

comment created time in a month

PullRequestReviewEvent

Pull request review comment: apache/arrow

ARROW-9852: [C++] Validate dictionaries fully on IPC read

 Status ReadDictionary(const Buffer& metadata, DictionaryMemo* dictionary_memo,
     return Status::Invalid("Dictionary record batch must only contain one field");
   }
   auto dictionary = batch->column(0);
-  // Validate the dictionary for safe delta concatenation
-  RETURN_NOT_OK(dictionary->Validate());
+  // Validate the dictionary for safe delta concatenation.  Full validation
+  // is necessary for certain types (for example nested dictionaries).
+  // XXX: instead only validate when a concatenation is actually done?
+  RETURN_NOT_OK(dictionary->ValidateFull());

Could this be deferred until the actual time that the delta must be applied? That seems better, but I'm OK with this if a stopgap is needed.
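A minimal sketch of the deferred alternative, assuming a hypothetical helper and only public Arrow APIs (this is not the reader's actual code path): full validation runs only when a delta has to be concatenated onto an existing dictionary.

#include <memory>

#include "arrow/array.h"
#include "arrow/array/concatenate.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Hypothetical helper: validate fully only when a delta concatenation happens.
arrow::Result<std::shared_ptr<arrow::Array>> CombineDictionary(
    const std::shared_ptr<arrow::Array>& existing,
    const std::shared_ptr<arrow::Array>& incoming, bool is_delta) {
  if (!is_delta || existing == nullptr) {
    return incoming;  // no concatenation, so skip the expensive check here
  }
  ARROW_RETURN_NOT_OK(existing->ValidateFull());
  ARROW_RETURN_NOT_OK(incoming->ValidateFull());
  return arrow::Concatenate({existing, incoming});
}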

pitrou

comment created time in a month

PullRequestReviewEvent

push event: wesm/wesm

Wes McKinney

commit sha 62ad889f9cfb258a3c504387f9cfb9680d10f915

Update README.md


push time in a month

started foursquare/quiver

started time in a month

started bufbuild/buf

started time in a month

pull request comment: apache/arrow

ARROW-9266: [Python][Packaging] enable C++ S3FS in macOS wheels

@github-actions crossbow submit wheel-osx-*

corleyma

comment created time in a month

pull request comment: apache/arrow

ARROW-9266: [Python][Packaging] enable C++ S3FS in macOS wheels

We experimented a bit with conda-press a while back, but it yielded poor results for us (wheels that were much larger than our current ones). I expect we are going to be fighting to keep our wheels at an acceptable size for the foreseeable future; at some point we need to try to break up the Python project into multiple interdependent wheels to enable modular installations.

corleyma

comment created time in a month

pull request comment: apache/arrow

ARROW-9660: [C++] Revamp dictionary association in IPC

I'm sorry about the delay on this; I will try to complete the code review today.

pitrou

comment created time in a month

started UWQuickstep/hustle

started time in a month

Pull request review comment: wesm/dataframe-protocol

Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion

# Excerpt from the draft protocol in this PR:

class Categorical(DataType):
    """
    A categorical value is an ordinal (integer) value that references a
    sequence of category values of an arbitrary data type
    """

    def __init__(self, index_type: IntegerType, category_type: DataType,
                 ordered: bool = False):

There are lots of reasons to have access to the complete schema, from preallocating space for data (in memory or on disk) to building query plans (including kernel selection) before data arrives. Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.

That said, if you have an application where the index or category type is simply not known, you could use an Any type to indicate that the actual type must be observed in the data itself.

wesm

comment created time in a month

PullRequestReviewEvent

issue comment: cwida/duckdb

Support Arrow for table scans and result sets

Got it. Arrow is swizzling-free by design, so there are pros and cons.

fsaintjacques

comment created time in a month
