Wes McKinney (wesm), @ursa-labs / @rstudio, Nashville, TN, http://wesmckinney.com. Director of https://ursalabs.org/. Creator of Python pandas. Co-creator of Apache Arrow. @apache member and Apache Parquet PMC member.

weld-project/weld 2147

High-performance runtime for data analytics applications

scikit-learn-contrib/sklearn-pandas 2065

Pandas integration with sklearn

ibis-project/ibis 1332

A pandas-like deferred expression system, with first-class SQL support

cloudera/impyla 559

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)

fastavro/fastavro 337

Fast Avro for Python

BIDS/datarray 77

Prototyping numpy arrays with named axes for data management.

cloudera/hs2client 70

C++ native client for Impala and Hive, with Python / pandas bindings

kwgoodman/roly 16

A comparison of various moving window median algorithms

cloudera/thrift_sasl 10

Thrift SASL module that implements TSaslClientTransport

ellisonbg/py4science 8

Using Python for scientific computing

issue comment: apache/arrow

CRAN package installation issues on Linux

We have GitHub issues turned on here strictly to ease onboarding of new contributors who are unfamiliar with how the project community operates. The primary communication channels are the developer mailing list and JIRA. It would be great if you opened a JIRA account so we can include you in discussions there (and so you can edit the issues to add links, etc. -- I'll do that now).

eddelbuettel

comment created time in 12 hours

issue comment: wesm/feather

In R, Feather cannot read a dataframe back if a column header is missing or empty

Is this issue present in the arrow::read_feather function? If so, could you open a JIRA issue?

guojingyu

comment created time in 13 hours
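A minimal, hypothetical reproduction sketch from the Python side of the same library (the report itself concerns R's read_feather; the file name and data here are invented):

```python
import pandas as pd
import pyarrow.feather as feather

# A frame whose first column header is the empty string
df = pd.DataFrame({"": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
feather.write_feather(df, "repro.feather")

# Python reads the file back; the report is that R's read_feather() cannot
print(feather.read_feather("repro.feather").columns)
```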

pull request comment: apache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

@kou @shiro615 this patch needs some GLib fixes to check schema equality without comparing the metadata; would you be able to take a look (I can try too)?

wesm

comment created time in 13 hours
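For context, the Python bindings already expose the kind of metadata-insensitive schema comparison the GLib side needs here; a small illustrative sketch (field name and metadata value invented):

```python
import pyarrow as pa

s1 = pa.schema([pa.field("x", pa.int32())])
s2 = pa.schema([pa.field("x", pa.int32(),
                         metadata={b"PARQUET:field_id": b"1"})])

print(s1.equals(s2, check_metadata=True))   # False: field metadata differs
print(s1.equals(s2, check_metadata=False))  # True: structure alone matches
```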

push event: wesm/arrow

Wes McKinney

commit sha 902471c18be3bc40c1bc416dfb377e0676ee1cb5

export internal::MultipleChunkIterator

view details

push time in 13 hours

issue comment: apache/arrow

CRAN package installation issues on Linux

Thanks @eddelbuettel, we'll get this sorted out -- if you haven't opened a JIRA account yet, it would be helpful so that we can move further discussion to the issues there.

eddelbuettel

comment created time in 13 hours

pull request comment: apache/arrow

ARROW-7863: [C++][Python][CI] Ensure running HDFS related tests

Also, to be honest, I would be OK with removing support for libhdfs3 altogether, since it is abandonware.

kou

comment created time in 13 hours

Pull request review comment: apache/arrow

ARROW-7863: [C++][Python][CI] Ensure running HDFS related tests

 PANDAS=latest DASK=latest TURBODBC=latest HDFS=2.9.2
+LIBHDFS3=2.3

libhdfs3 2.3 is not supported at all. Version 2.2.31 is the last one we support, but I'm not sure whether that still works with the current set of conda-forge dependencies / compilers. We might propose reverting the feedstock to the last "good" version that uses the libhdfs C ABI:

https://github.com/conda-forge/libhdfs3-feedstock/commit/97fd9320efd5cbac0145db60c9b867582e9a748a#diff-ae18fe4e01e61a9862ceaf165003213a

kou

comment created time in 14 hours
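For reference, the libhdfs-vs-libhdfs3 choice discussed above surfaces in the pyarrow 0.16-era filesystem API roughly as follows (host and port are placeholders):

```python
import pyarrow as pa

# driver="libhdfs" loads the JNI-based library that implements the stable
# libhdfs C ABI; driver="libhdfs3" selects the library discussed above
fs = pa.hdfs.connect(host="namenode", port=8020, driver="libhdfs")
print(fs.ls("/"))
```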

pull request comment: apache/arrow

[Go] Add Reset method to various components and clean up comments.

Can you please open a JIRA issue for the benefit of our changelog?

richardartoul

comment created time in 14 hours

issue comment: apache/arrow

Unable to build pyarrow 0.16 using Python 3.8

Can you please open a JIRA issue? If you need pyarrow.orc, I suggest you use the conda packages, since ORC is not enabled in the wheels. It might be more productive to work on fixing ORC in the manylinux* builds: https://issues.apache.org/jira/browse/ARROW-7811

SaloSentinelOne

comment created time in 14 hours
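A minimal sketch of the conda-based route suggested above (the file path is hypothetical):

```python
# After e.g. `conda install -c conda-forge pyarrow`; ORC support is not
# enabled in the pip wheels, so `import pyarrow.orc` fails there
import pyarrow.orc as orc

table = orc.ORCFile("example.orc").read()  # returns a pyarrow.Table
print(table.schema)
```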

push event: wesm/arrow

Neal Richardson

commit sha 0e12a5a1b7de3969bf3eaf5610b934ce4b63bae3

ARROW-7859: [R] Minor patches for CRAN submission 0.16.0.2

Closes #6426 from nealrichardson/0.16.0.2-patch and squashes the following commits:
04d7e725b <Neal Richardson> News
01d2ffda0 <Neal Richardson> 0.16.0.2 followup

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Antoine Pitrou

commit sha 54119157acd5f7257843bf8f78e26e1fc690b901

ARROW-7330: [C++] Migrate Arrow Cuda to Result<T>

Closes #6418 from pitrou/ARROW-7330-cuda-result and squashes the following commits:
86b5f7fd2 <Sutou Kouhei> Follow API change
5eb866f07 <Antoine Pitrou> ARROW-7330: Migrate Arrow Cuda to Result<T>

Lead-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Markus Westerlind

commit sha f41b863b4fe47153ca106952d6a0859fd9e383ae

ARROW-7775: [Rust] fix: Don't let safe code arbitrarily transmute readers and writers

This code was quite clearly unsafe (which seems to be known, judging from the comment about it). As the functions appeared to be callable indirectly from other public functions, I changed these to just panic instead, to keep the API.

Closes #6256 from Marwes/unsafe_rust and squashes the following commits:
97e022a57 <Markus Westerlind> s/typ/type/g
9fe6ddfd9 <Markus Westerlind> Remove unnecessary Send/Sync impls for RecordBatch
d484e17e3 <Markus Westerlind> Better error message for typed conversion
66ce5b109 <Markus Westerlind> fix: Don't let safe code arbitrarily transmute readers and writers

Authored-by: Markus Westerlind <markus.westerlind@distilnetworks.com>
Signed-off-by: Chao Sun <sunchao@apache.org>

view details

Projjal Chanda

commit sha 4ffa286dc98647ed384b27c66074f4e03a0d2253

ARROW-7615: [CI][Gandiva] Ensure gandiva_jni library has only a whitelisted set of shared dependencies

Closes #6423 from projjal/travis_debug and squashes the following commits:
fefe0d192 <Projjal Chanda> set macos_target if already set
76b6ef784 <Projjal Chanda> added a print statement
d03c3e3d4 <Projjal Chanda> check based on environment variable
4f10a74ab <Projjal Chanda> Check shared dependencies against whitelisted set

Authored-by: Projjal Chanda <iam@pchanda.com>
Signed-off-by: Praveen <praveen@dremio.com>

view details

Wes McKinney

commit sha 08890cb225a9e768c6cdbca5e4d219330d16989a

Start working on properly preserving and deserializing field_id in C++. Some field_id round trips working
Schema tests passing, some refactoring to simplify schema serialization code
Add KeyValueMetadata::Merge method, Arrow schema access for ParquetFile
Add Field::WithMergedMetadata
Add Parquet field_id metadata to Arrow schemas. Better diffing in AssertTablesEqual
Ignore unequal child field metadata in ChunkedArray::Equals, fix Python Parquet unit tests
Add Python unit test to verify that field_id's are correctly filtered through to the Arrow schema in Python
decruft

view details

Wes McKinney

commit sha 86ecd7e1787dd94cda408a764b7c8b33c9498ba9

Code review comments

view details

Wes McKinney

commit sha 174df22e44a40a82b72c5496dc96eacaf8a534cb

Fix dataset Parquet unit tests

view details

Wes McKinney

commit sha e51ddb44926c20c399ccdfbc5382b13bfe10a4b1

Correct inconsistent comments about null field_id's

view details

Wes McKinney

commit sha d82dd2e7e7881de69a8b0b4dcb255260557130e4

export internal::MultipleChunkIterator

view details

push time in 14 hours

push event: wesm/arrow

Wes McKinney

commit sha 0c66f40e1ab9c15bdb1bd96701219803d557ca51

Correct inconsistent comments about null field_id's

view details

push time in 3 days

Pull request review comment: apache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 class PARQUET_EXPORT Node {

   const std::shared_ptr<const LogicalType>& logical_type() const { return logical_type_; }

-  int id() const { return id_; }
+  /// \brief The field_id value for the serialized SchemaElement. If the
+  /// field_id is less than 0 (e.g. -1), it will not be set when serialized to
+  /// Thrift

I fixed these comments. There's follow-up investigation to do per PARQUET-1798.

wesm

comment created time in 3 days

push event: wesm/arrow

Antoine Pitrou

commit sha 220c43789fc0e2b6b623c30a18e5464ffa9f1e82

ARROW-7754: [C++] Make Result<> faster

The main culprit for slowness was the underlying variant<> implementation. The new implementation has comparable speed to returning Status + passing a result-out pointer parameter.

Closes #6356 from pitrou/ARROW-7754-result-perf and squashes the following commits:
d3fc5456f <Antoine Pitrou> Remove commented out code + add a tests
649541699 <Antoine Pitrou> ARROW-7754: Make Result<> faster

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Antoine Pitrou

commit sha b9fbc21bfe12050daac2e6f4d49a35ecd4040282

PARQUET-1770: [C++][CI] Add fuzz target for reading Parquet files

This fuzz target goes through the Parquet Arrow file reader.

Closes #6405 from pitrou/ARROW-6273-parquet-fuzz and squashes the following commits:
d71778a0d <Antoine Pitrou> Restore dynamic_cast
5a0fae38c <Antoine Pitrou> Improve data generation with more varied columns
2c08e4875 <Antoine Pitrou> PARQUET-1770: Add fuzz target for reading Parquet files

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

David Li

commit sha 431fdbfbc0fb4bb16c4a89298404ad16784db2a0

ARROW-7722: [FlightRPC][Java] disable flaky Flight auth test

I was unable to reproduce this one locally or with docker, so let's disable it, since it keeps failing on master.

Closes #6392 from lidavidm/arrow-7722 and squashes the following commits:
c7bfbac81 <David Li> ARROW-7722: disable flaky Flight auth test

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Antoine Pitrou

commit sha 9f0c70c8337b1a8c75df5cc9410b54e97351bfc0

ARROW-2447: [C++] Device and MemoryManager API

Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The layer exposes two interfaces:

* the `Device` interface exposes information about a particular memory-holding device
* the `MemoryManager` allows allocating, copying, reading or writing memory located on a particular device

The `Buffer` API is modified so that calling `data()` fails on non-CPU buffers. A separate `address()` method returns the buffer address as an integer, and is allowed on any buffer.

The API provides convenience functions to view or copy a buffer from one device to the other. For example, an on-GPU buffer can be copied to the CPU, and in some situations a zero-copy CPU view can also be created (depending on the GPU capabilities and how the GPU memory was allocated).

An example use in the PR is IPC. On the write side, a new `SerializeRecordBatch` overload takes a `MemoryManager` argument and is able to serialize data to any kind of memory (CPU, GPU). On the read side, `ReadRecordBatch` now works on any kind of input buffer, and returns record batches backed by either CPU or GPU memory.

It introduces a slight complexity in the CUDA namespace, since there are both `CudaContext` and `CudaMemoryManager` classes. We could solve this by merging the two concepts (but doing so may break compatibility for existing users of CUDA).

Closes #6295 from pitrou/ARROW-2447-device-api-memory-manager and squashes the following commits:
c665f6177 <Antoine Pitrou> ARROW-2447: Device and MemoryManager API

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Antoine Pitrou

commit sha 3bc01ec94eb2e310b28402a35196e1e8c5c9aec8

ARROW-7815: [C++] Improve input validation

This should fix the following issues:

https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20260
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20282
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20307
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20324
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20330
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20575

Closes #6396 from pitrou/ARROW-7815-fuzz-issues and squashes the following commits:
0efa846e1 <Antoine Pitrou> Update testing submodule
fd8e865a4 <Antoine Pitrou> ARROW-7815: Improve input validation

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Antoine Pitrou

commit sha 412145b8a736a5dce4f17b03d7f72efb45053176

ARROW-5757: [Python] Remove Python 2.7 support

Part of the changes were done using [pyupgrade](https://github.com/asottile/pyupgrade).

Closes #6410 from pitrou/ARROW-5757-py2-goodbye and squashes the following commits:
f0f9f513b <Antoine Pitrou> Address review comments
561ac965c <Antoine Pitrou> ARROW-5757: Remove Python 2.7 support

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Projjal Chanda

commit sha d43717590665b912e0f4b3e4ac38127d19f2f6e0

ARROW-7793: [Java] Release accounted-for reservation memory to parent in case of leak

Closes #6401 from projjal/allcator_leak and squashes the following commits:
c12261412 <Projjal Chanda> setting the test to ignore
4abb4e45a <Projjal Chanda> Turn off debug
1573fd563 <Projjal Chanda> Added unit test
0a9f743d3 <Projjal Chanda> Release accounted-for reservation memory to parent in case of leak

Authored-by: Projjal Chanda <iam@pchanda.com>
Signed-off-by: Pindikura Ravindra <ravindra@dremio.com>

view details

David Li

commit sha 7dc9f9b04267d439cdcbc578fff833ef2922767b

ARROW-6875: [FlightRPC] implement criteria for ListFlights

I'm now working on the integration tests. I got to the end and realized it would be an enormous PR, so instead I'm splitting out the various fixes I made. None of the languages fully supported passing filter criteria in ListFlights.

Closes #6390 from lidavidm/arrow-6875 and squashes the following commits:
eb1942a7c <David Li> ARROW-6875: implement criteria for ListFlights

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

mubai

commit sha 5099450969cd1c680bb3443b894ba119b0ec7a2c

ARROW-7848: [C++][Python][Doc] Add MapType API doc

Add doc for MapType.

Closes #6413 from chaokunyang/fix_map_type_doc and squashes the following commits:
ca401547b <mubai> add MapType doc for cpp
4a955be07 <mubai> add doc for map type

Authored-by: mubai <chaokun.yck@antfin.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Krisztián Szűcs

commit sha 66185d6751fb8e2523c1260334be1fc878208207

ARROW-7834: [Release] Post-release task for updating the documentation

Generate prose and API docs and update arrow-site's asf-site branch as a post-release task.

Closes #6407 from kszucs/release-docs and squashes the following commits:
bfb9c546a <Krisztián Szűcs> fix review comments
b34729f07 <Krisztián Szűcs> post release script for updating docs

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Krisztián Szűcs

commit sha 217c0d333b5af4e3e469a1872185a58020eef114

ARROW-7849: [Packaging][Python] Remove the remaining py27 crossbow wheel tasks from the nightlies

Follow-up for #6410.

Closes #6415 from kszucs/nightly-crossbow-tasks and squashes the following commits:
6a875144d <Krisztián Szűcs> Remove the remaining py27 crossbow wheel tasks from the nightlies

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Krisztián Szűcs

commit sha 8446000454dab78ebb7530b61892b49dd3fa6906

ARROW-1581: [Packaging] Tooling to make nightly wheels available for install

Generates a GitHub page for a specific crossbow branch (typically gh-pages). It creates two directories:

- [nightly](https://ursalabs.org/crossbow/nightly/): contains links for the nightly artifacts grouped by day and task, and also creates a "latest" directory for the most recent nightly artifacts
- [pypi/pyarrow](https://ursalabs.org/crossbow/pypi/pyarrow/): compatible with the [simple PyPI repository specification](https://www.python.org/dev/peps/pep-0503/), so it lets users install PyArrow with

```bash
pip install -v --extra-index-url https://ursalabs.org/crossbow/pypi/ pyarrow==0.15.0.dev718
```

Gotchas:
- API rate limit: the script uses the GitHub API extensively, so it can consume the rate limit pretty quickly
- The GitHub API doesn't provide a checksum for the assets, so we cannot include them in the links

Follow-up work: run this on a daily basis; the simplest would be to execute it after the nightly report via [ursabot](https://github.com/ursa-labs/ursabot/blob/master/projects/arrow/arrow/crossbow.py#L81).

Closes #6366 from kszucs/crossbow-wheel-pages and squashes the following commits:
b9d53f723 <Krisztián Szűcs> address review comments
53f44229e <Krisztián Szűcs> re-enable fetch
978519245 <Krisztián Szűcs> use separate github token for pushing and querying the api
b30b2c34a <Krisztián Szűcs> flake8
16dd3c90b <Krisztián Szűcs> use bullets
a601266c4 <Krisztián Szűcs> spare a couple of queries
a42d3bfe9 <Krisztián Szűcs> simple pypi repo
98ca5952e <Krisztián Szűcs> github page generator

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Antoine Pitrou

commit sha b8c1f6c6faadf34ca9e7456709cb8a6ee1a7ad38

ARROW-7846: [Python][Dev] Remove dependencies on six

We're Python 3-only now, except in the PR merge script.

Closes #6414 from pitrou/ARROW-7846-remove-six and squashes the following commits:
13ab24e90 <Antoine Pitrou> ARROW-7846: Remove dependencies on six

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

François Saint-Jacques

commit sha 7bcdbb3c1808392aea9d58d2b29b67b73949d7b4

ARROW-7844: [R] Converter_List is not thread-safe

This would trigger issues such as this:

```
Parquet file reading/writing: .........S..............Warning: stack imbalance in '.Call', 145 then 144
Warning: stack imbalance in '{', 141 then 140
Warning: stack imbalance in '{', 134 then 133
1..

══ Skipped ═════════════════════════════════════════════════════════════════════
1. write_parquet() handles various compression_level= specs (@test-parquet.R#66) - Reason: Arrow C++ not built with support for gzip

══ Failed ══════════════════════════════════════════════════════════════════════
── 1. Failure: Lists are preserved when writing/reading from Parquet (@test-parq
`object` not equivalent to `expected`.
Component “int”: Component 3: Numeric: lengths (4, 2) differ
```

Closes #6419 from fsaintjacques/ARROW-7844-r-flakyness and squashes the following commits:
99b0a0b8a <François Saint-Jacques> ARROW-7844: Converter_List is not thread-safe

Authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Neal Richardson

commit sha 1a5cb5617bdf93f9d437c53db0e5cfbf4070e988

ARROW-7833: [R] Make install_arrow() actually install arrow

`install_arrow()` now takes a few arguments, which let you:

* Install the latest CRAN version (no arguments)
* Install the latest nightly build (`install_arrow(nightly = TRUE)`)
* Install with Linux C++ binaries by default (they are off by default in the current CRAN release), and conveniently change that setting without messing with env vars
* Ignore system-installed arrow packages by default (to ensure that R and C++ versions match)

It will also attempt to reload the package after installation, if `pkgload` is available.

The other important change in this patch is to make `LIBARROW_DOWNLOAD` off by default, in order to appease CRAN. This unfortunately makes Linux installation only "just work" if you have set an env var.

Other improvements in this patch:

* Rename the env var `LIBARROW_BINARY_DISTRO` to `LIBARROW_BINARY` (brevity, and it also takes boolean values to enable or disable binary downloading)
* Fix the default value setting of that variable
* Update installation guide and README accordingly
* Remove README.Rmd and just keep the static README.md. There's no value for us to have an R Markdown readme and have to worry about keeping it in sync; we have vignettes and help pages for examples.

Closes #6406 from nealrichardson/install-arrow-binary and squashes the following commits:
767668c0a <Neal Richardson> Script fixes
4a1550ed7 <Neal Richardson> LIBARROW_BINARY on should entail download_ok
471050053 <Neal Richardson> Update docs for new configure reality
660d0e7cb <Neal Richardson> LIBARROW_DOWNLOAD is false by default now
3e02b72e6 <Neal Richardson> Reload the package if loaded already
c731542d6 <Neal Richardson> Fix for the fix
42dc32763 <Neal Richardson> Fix test setup
e48ecbbf1 <Neal Richardson> Docs
e92f4b7da <Neal Richardson> Make these tests always run
87951bd7e <Neal Richardson> Update readme and add message
e8861163b <Neal Richardson> Delete README.Rmd (keep static README.md)
bcd0237cb <Neal Richardson> Update docs
a459b49ac <Neal Richardson> Switch var name to LIBARROW_BINARY
705d28a49 <Neal Richardson> Fix default value of LIBARROW_BINARY_DISTRO
7ff136510 <Neal Richardson> Change install_arrow() to actually install

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Antoine Pitrou

commit sha d4e3898211207da4d8ff940a54d51338041d285a

ARROW-6165: [Integration] Run integration tests on multiple cores

On a local test with C++, Java, Flight and Gold files enabled, the runtime goes down from 1m56 (serial) to 35s (parallel).

Closes #6416 from pitrou/ARROW-6165-parallel-integration and squashes the following commits:
3ac06e53b <Antoine Pitrou> ARROW-6165: Run integration tests on multiple cores

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

François Saint-Jacques

commit sha 9392531e239f278746f6b04e39833597e2d2e416

ARROW-7781: [C++] Improve message when referencing a missing field

- The failure now indicates which field name is causing the issue.
- Add the `CanReferenceFieldsByNames` method to Schema, plus relevant tests.

Closes #6421 from fsaintjacques/ARROW-7781-missing-filter-field-message and squashes the following commits:
461d66395 <François Saint-Jacques> ARROW-7781: Improve message when referencing a missing field

Authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Sutou Kouhei

commit sha 6e143846b5fe17a7e91b70d70f91114113c8301a

ARROW-7841: [C++] Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

This is a regression from e12d2851fe60a3628d900f432950dc321db65ff9.

Closes #6424 from kou/cpp-hdfs-fix-path and squashes the following commits:
6d5f651e3 <Sutou Kouhei> ARROW-7841: Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

François Saint-Jacques

commit sha 52255a1ba4a53a0ce1a9eb75012cf7b334884149

ARROW-7819: [C++][Gandiva] Add DumpIR to Filter/Projector object

The following patch exposes the generated IR as a method of the objects for further inspection. This is a breaking change for the internal method `FinalizeModule`, which no longer takes the dump_ir and optimize flags; it receives `optimize` from Configuration now.

- Refactored Engine: notably, removed dead code, organized initialization in a single function, and simplified LLVMGenerator.
- Dumping IR no longer writes to stdout; instead it is returned as a string from the `DumpIR` method.
- Refactored Types, fixing some badly typed methods.
- Added the optimize field to the `Configuration` class.
- Simplified some unit tests.

But more importantly, we can now inspect dynamically:

```python
>>> filter = gandiva.make_filter(table.schema, condition)
>>> print(filter.ir)
; ModuleID = 'codegen'
source_filename = "codegen"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@llvm.global_ctors = appending global [0 x { i32, void ()*, i8* }] zeroinitializer
@_ZN5arrow7BitUtilL8kBitmaskE = internal unnamed_addr constant [8 x i8] c"\01\02\04\08\10 @\80", align 1

; Function Attrs: norecurse nounwind
define i32 @expr_0_0(i64* nocapture readonly %args, i64* nocapture readonly %arg_addr_offsets, i64* nocapture readnone %local_bitmaps, i16* nocapture readnone %selection_vector, i64 %context_ptr, i64 %nrecords) local_unnamed_addr #0 {
entry:
  %0 = bitcast i64* %args to i8**
  %cond_mem56 = load i8*, i8** %0, align 8
  %1 = getelementptr i64, i64* %arg_addr_offsets, i64 3
  %2 = load i64, i64* %1, align 8
  %a_mem_addr = getelementptr i64, i64* %args, i64 3
  %3 = bitcast i64* %a_mem_addr to double**
  %a_mem7 = load double*, double** %3, align 8
  %scevgep = getelementptr double, double* %a_mem7, i64 %2
  br label %loop

loop:                                             ; preds = %loop, %entry
  %loop_var = phi i64 [ 0, %entry ], [ %"loop_var+1", %loop ]
  %scevgep8 = getelementptr double, double* %scevgep, i64 %loop_var
  %a = load double, double* %scevgep8, align 8
  %4 = fcmp olt double %a, 1.000000e+03
  %5 = sext i1 %4 to i8
```

Closes #6417 from fsaintjacques/ARROW-7819-gandiva-dump-ir-tool and squashes the following commits:
c8d274f98 <François Saint-Jacques> Address comments
0bcebc8b0 <François Saint-Jacques> ARROW-7819: Add DumpIR to Filter/Projector object

Authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Wes McKinney

commit sha abc9f8b1240e5fd2882d9db04ce9afedc7c5db11

Start working on properly preserving and deserializing field_id in C++. Some field_id round trips working
Schema tests passing, some refactoring to simplify schema serialization code
Add KeyValueMetadata::Merge method, Arrow schema access for ParquetFile
Add Field::WithMergedMetadata
Add Parquet field_id metadata to Arrow schemas. Better diffing in AssertTablesEqual
Ignore unequal child field metadata in ChunkedArray::Equals, fix Python Parquet unit tests
Add Python unit test to verify that field_id's are correctly filtered through to the Arrow schema in Python
decruft

view details

push time in 3 days

pull request comment: apache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

I looked a little more at the automatic field_id assignment, and I'm not sure that what we're doing is correct, but we should fix it in a dedicated PR. This PR leaves the existing behavior unchanged:

https://issues.apache.org/jira/browse/PARQUET-1798

wesm

comment created time in 3 days

push event: apache/arrow

François Saint-Jacques

commit sha 52255a1ba4a53a0ce1a9eb75012cf7b334884149

ARROW-7819: [C++][Gandiva] Add DumpIR to Filter/Projector object

The following patch exposes the generated IR as a method of the objects for further inspection. This is a breaking change for the internal method `FinalizeModule`, which no longer takes the dump_ir and optimize flags; it receives `optimize` from Configuration now.

- Refactored Engine: notably, removed dead code, organized initialization in a single function, and simplified LLVMGenerator.
- Dumping IR no longer writes to stdout; instead it is returned as a string from the `DumpIR` method.
- Refactored Types, fixing some badly typed methods.
- Added the optimize field to the `Configuration` class.
- Simplified some unit tests.

But more importantly, we can now inspect dynamically:

```python
>>> filter = gandiva.make_filter(table.schema, condition)
>>> print(filter.ir)
; ModuleID = 'codegen'
source_filename = "codegen"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

@llvm.global_ctors = appending global [0 x { i32, void ()*, i8* }] zeroinitializer
@_ZN5arrow7BitUtilL8kBitmaskE = internal unnamed_addr constant [8 x i8] c"\01\02\04\08\10 @\80", align 1

; Function Attrs: norecurse nounwind
define i32 @expr_0_0(i64* nocapture readonly %args, i64* nocapture readonly %arg_addr_offsets, i64* nocapture readnone %local_bitmaps, i16* nocapture readnone %selection_vector, i64 %context_ptr, i64 %nrecords) local_unnamed_addr #0 {
entry:
  %0 = bitcast i64* %args to i8**
  %cond_mem56 = load i8*, i8** %0, align 8
  %1 = getelementptr i64, i64* %arg_addr_offsets, i64 3
  %2 = load i64, i64* %1, align 8
  %a_mem_addr = getelementptr i64, i64* %args, i64 3
  %3 = bitcast i64* %a_mem_addr to double**
  %a_mem7 = load double*, double** %3, align 8
  %scevgep = getelementptr double, double* %a_mem7, i64 %2
  br label %loop

loop:                                             ; preds = %loop, %entry
  %loop_var = phi i64 [ 0, %entry ], [ %"loop_var+1", %loop ]
  %scevgep8 = getelementptr double, double* %scevgep, i64 %loop_var
  %a = load double, double* %scevgep8, align 8
  %4 = fcmp olt double %a, 1.000000e+03
  %5 = sext i1 %4 to i8
```

Closes #6417 from fsaintjacques/ARROW-7819-gandiva-dump-ir-tool and squashes the following commits:
c8d274f98 <François Saint-Jacques> Address comments
0bcebc8b0 <François Saint-Jacques> ARROW-7819: Add DumpIR to Filter/Projector object

Authored-by: François Saint-Jacques <fsaintjacques@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 3 days

PR closed: apache/arrow

ARROW-7819: [C++][Gandiva] Add DumpIR to Filter/Projector object

The following patch exposes the generated IR as a method of the objects for further inspection. This is a breaking change for the internal method FinalizeModule, which no longer takes the dump_ir and optimize flags; it receives optimize from Configuration now.

  • Refactored Engine: notably, removed dead code, organized initialization in a single function, and simplified LLVMGenerator.
  • Dumping IR no longer writes to stdout; instead it is returned as a string from the DumpIR method.
  • Refactored Types, fixing some badly typed methods.
  • Added the optimize field to the Configuration class.
  • Simplified some unit tests.

But more importantly, we can now inspect dynamically:

>>> filter = gandiva.make_filter(table.schema, condition)                                                                                    
>>> print(filter.ir)
; ModuleID = 'codegen'                                                
source_filename = "codegen"                    
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"           
target triple = "x86_64-unknown-linux-gnu"                 
                                                                      
@llvm.global_ctors = appending global [0 x { i32, void ()*, i8* }] zeroinitializer
@_ZN5arrow7BitUtilL8kBitmaskE = internal unnamed_addr constant [8 x i8] c"\01\02\04\08\10 @\80", align 1
                                                                                                                                             
; Function Attrs: norecurse nounwind                      
define i32 @expr_0_0(i64* nocapture readonly %args, i64* nocapture readonly %arg_addr_offsets, i64* nocapture readnone %local_bitmaps, i16* nocapture readnone %selection_vector, i64 %context_ptr, i64 %nrecords) local_unnamed_addr #0 {
entry:                                                                
  %0 = bitcast i64* %args to i8**                                                                                                            
  %cond_mem56 = load i8*, i8** %0, align 8                            
  %1 = getelementptr i64, i64* %arg_addr_offsets, i64 3            
  %2 = load i64, i64* %1, align 8                                                                                                            
  %a_mem_addr = getelementptr i64, i64* %args, i64 3                  
  %3 = bitcast i64* %a_mem_addr to double**            
  %a_mem7 = load double*, double** %3, align 8                        
  %scevgep = getelementptr double, double* %a_mem7, i64 %2   
  br label %loop                                            
                                                                                                                                             
loop:                                             ; preds = %loop, %entry
  %loop_var = phi i64 [ 0, %entry ], [ %"loop_var+1", %loop ]                                                                                
  %scevgep8 = getelementptr double, double* %scevgep, i64 %loop_var
  %a = load double, double* %scevgep8, align 8      
  %4 = fcmp olt double %a, 1.000000e+03                   
  %5 = sext i1 %4 to i8                             
+286 -285

1 comment

16 changed files

fsaintjacques

pr closed time in 3 days

Pull request review comment: apache/arrow

ARROW-7819: [C++][Gandiva] Add DumpIR to Filter/Projector object

 class TestLLVMGenerator : public ::testing::Test {

 // Verify that a valid pc function exists for every function in the registry.
 TEST_F(TestLLVMGenerator, VerifyPCFunctions) {
   std::unique_ptr<LLVMGenerator> generator;
-  auto status = LLVMGenerator::Make(TestConfiguration(), &generator);
-  EXPECT_TRUE(status.ok()) << status.message();
+  ASSERT_OK(LLVMGenerator::Make(TestConfiguration(), &generator));

   llvm::Module* module = generator->module();
   for (auto& iter : registry_) {
-    llvm::Function* fn = module->getFunction(iter.pc_name());
-    EXPECT_NE(fn, nullptr) << "function " << iter.pc_name()
-                           << " missing in precompiled module\n";
+    EXPECT_NE(module->getFunction(iter.pc_name()), nullptr);

He might mean that the problem may be more evident from looking at test log files :) I think it's OK for now; if it becomes an issue, we can improve it.

fsaintjacques

comment created time in 3 days

push event: apache/arrow

Sutou Kouhei

commit sha 6e143846b5fe17a7e91b70d70f91114113c8301a

ARROW-7841: [C++] Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

This is a regression from e12d2851fe60a3628d900f432950dc321db65ff9.

Closes #6424 from kou/cpp-hdfs-fix-path and squashes the following commits:
6d5f651e3 <Sutou Kouhei> ARROW-7841: Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 3 days

PR closed: apache/arrow

ARROW-7841: [C++] Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

This is a regression from e12d2851fe60a3628d900f432950dc321db65ff9.

+13 -1

5 comments

1 changed file

kou

pr closed time in 3 days

pull request comment: apache/arrow

ARROW-7841: [C++] Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

@kou @kszucs is there a weakness in our testing that caused this regression to slip through? Anything we can do to prevent it from happening again?

kou

comment created time in 3 days

pull request comment: apache/arrow

ARROW-7462: [C++] Add CpuInfo detection for Arm64 Architecture

@guyuqi thanks. I actually did buy a 48-core ThunderX desktop, but I've been having a hard time getting Debian installed and set up. Once I can do that, I'll set up the build agent so it can be used for testing and benchmarking.

guyuqi

comment created time in 3 days

Pull request review comment: apache/arrow

ARROW-7833: [R] Make install_arrow() actually install arrow

 dst_dir <- paste0("libarrow/arrow-", VERSION)

 arrow_repo <- "https://dl.bintray.com/ursalabs/arrow-r/libarrow/"
 apache_src_url <- paste0(
-  "https://www.apache.org/dyn/closer.cgi/arrow/arrow-", VERSION,
+  "https://archive.apache.org/dist/arrow/arrow-", VERSION,

ASF release policies advise against this -- maybe let's open a JIRA about making the mirror resolution more robust?

nealrichardson

comment created time in 3 days

pull request comment: apache/arrow

ARROW-7669: [CI] Turn optimizations off in AppVeyor MinGW builds

Hm, the mingw64 failure is reproducible; it seems to be hanging while linking libarrow.dll. Is it possible that, because of the debug settings, it's running into an OOM condition that causes a hang of some kind (@kou any ideas)?

[ 35%] Linking CXX shared library ../../release/libarrow.dll
cd /C/projects/arrow/cpp/build/src/arrow && C:/msys64/usr/bin/ccache.exe /C/msys64/mingw64/bin/cmake.exe -E remove -f CMakeFiles/arrow_shared.dir/objects.a
cd /C/projects/arrow/cpp/build/src/arrow && C:/msys64/usr/bin/ccache.exe /C/msys64/mingw64/bin/ar.exe cr CMakeFiles/arrow_shared.dir/objects.a @CMakeFiles/arrow_shared.dir/objects1.rsp
cd /C/projects/arrow/cpp/build/src/arrow && C:/msys64/usr/bin/ccache.exe /C/msys64/mingw64/bin/g++.exe   -Wno-noexcept-type -O3 -DNDEBUG -O0 -fdiagnostics-color=always  -Wall -msse4.2  -O0 -Wl,--version-script=C:/projects/arrow/cpp/src/arrow/symbols.map -shared -o ../../release/libarrow.dll -Wl,--out-implib,../../release/libarrow.dll.a -Wl,--major-image-version,100,--minor-image-version,0 -Wl,--whole-archive CMakeFiles/arrow_shared.dir/objects.a -Wl,--no-whole-archive  /C/OpenSSL-Win64/lib/libeay32.lib /C/OpenSSL-Win64/lib/ssleay32.lib /C/msys64/mingw64/lib/libbrotlienc-static.a /C/msys64/mingw64/lib/libbrotlidec-static.a /C/msys64/mingw64/lib/libbrotlicommon-static.a -lWs2_32 /C/OpenSSL-Win64/lib/libeay32.lib /C/OpenSSL-Win64/lib/ssleay32.lib /C/msys64/mingw64/lib/libbrotlienc-static.a /C/msys64/mingw64/lib/libbrotlidec-static.a /C/msys64/mingw64/lib/libbrotlicommon-static.a /C/msys64/mingw64/lib/libbz2.dll.a /C/msys64/mingw64/lib/liblz4.dll.a /C/msys64/mingw64/lib/libsnappy.a /C/msys64/mingw64/lib/libz.dll.a /C/msys64/mingw64/lib/libzstd.dll.a /C/msys64/mingw64/lib/libboost_filesystem-mt.dll.a /C/msys64/mingw64/lib/libboost_system-mt.dll.a /C/OpenSSL-Win64/lib/libeay32.lib -lkernel32 -luser32 -lgdi32 -lwinspool -lshell32 -lole32 -loleaut32 -luuid -lcomdlg32 -ladvapi32 
pitrou

comment created time in 4 days

pull request comment: ClickHouse/ClickHouse

Parquet list reading fix

Per comments on the Arrow JIRA: could someone clarify what is meant by "This issue was caused by lack of list type support in the used version of arrow lib"? We do support reading ListArray as long as the child type is either a list or a primitive type (but not struct). The main JIRA for this is

https://issues.apache.org/jira/browse/ARROW-1644

It's being worked on right now, so having your feedback on this would be helpful.

maxulan

comment created time in 4 days
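A short sketch of the distinction drawn above, assuming the pyarrow Parquet API (the file name is invented): lists with primitive children round-trip today, while list-of-struct is the ARROW-1644 gap.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Supported: a list column whose child type is a primitive
table = pa.table({"a": pa.array([[1, 2], [3]], type=pa.list_(pa.int32()))})
pq.write_table(table, "lists.parquet")
print(pq.read_table("lists.parquet").column("a"))

# Not yet supported: list columns with struct children, i.e.
# pa.list_(pa.struct([("x", pa.int32())])) -- that is what ARROW-1644 tracks
```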

Pull request review comment: ClickHouse/ClickHouse

Parquet list reading fix

 [submodule "contrib/icu"]
 	path = contrib/icu
 	url = https://github.com/unicode-org/icu.git
+[submodule "contrib/flatbuffers"]
+	path = contrib/flatbuffers
+	url = https://github.com/google/flatbuffers.git

Note that Flatbuffers is no longer a build dependency of Apache Arrow in C++ (we have vendored the necessary files)

maxulan

comment created time in 4 days

pull request comment: apache/arrow

ARROW-7841: [C++] Use ${HADOOP_HOME}/lib/native/ to find libhdfs.so again

I just created a 0.16.1 patch release version in JIRA (and added this issue to it), since it seems likely that we will accumulate some regressions over the next couple of weeks.

kou

comment created time in 4 days

push event: wesm/arrow

Krisztián Szűcs

commit sha fb2a9f36204d71092bc08ee919cc859f6797b807

[Release] Update CHANGELOG.md for 0.16.0

view details

Krisztián Szűcs

commit sha 2f395e924b7007778a31e78310acc0f667b04e91

[Release] Update .deb/.rpm changelogs for 0.16.0

view details

Krisztián Szűcs

commit sha 66d9941ffdd3da8a14b5fd00e82eb1e2385e0a53

[Release] Update versions for 0.16.0

view details

Krisztián Szűcs

commit sha 729a7689fd87572e6a14ad36f19cd579a8b8d9c5

[maven-release-plugin] prepare release apache-arrow-0.16.0

view details

Krisztián Szűcs

commit sha 7b065b36623b0022bbda497b00b68266397479fc

[maven-release-plugin] prepare for next development iteration

view details

Krisztián Szűcs

commit sha 3468f881b13d3a6657b5ede2a1ad1f1d4bce2f16

[Release] Update versions for 1.0.0-SNAPSHOT

view details

Krisztián Szűcs

commit sha 7dab2f85732b0da0a871c1de4f32f13abb4bc90a

[Release] Update .deb package names for 1.0.0

view details

Antoine Pitrou

commit sha b5a9839b39b4601725006b85bf5de648f33a1204

ARROW-7712: [CI] [Crossbow] Delete fuzzit jobs

We will focus on OSS-Fuzz, now that we have been accepted there. Also add an OSS-Fuzz badge in the README.

Closes #6328 from pitrou/ARROW-7712-delete-fuzzit-jobs and squashes the following commits:
6c6d2c462 <Antoine Pitrou> Remove remaining fuzzit-related configuration
a6abaa9ba <Antoine Pitrou> ARROW-7712: Delete fuzzit jobs

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

liyafan82

commit sha 195643f30dba33ea3335af31581faa2fb7c0a5fb

ARROW-6738: [Java] Fix problems with current union comparison logic

There are some problems with the current union comparison logic. For example:

1. For type checks, we should not require fields to be equal. It is possible that two vectors' value ranges are equal but their fields are different.
2. We should not compare the number of sub-vectors, as it is possible that two union vectors have different numbers of sub-vectors but equal values in the range.

Closes #5544 from liyafan82/fly_0930_share and squashes the following commits:
d6ef3d220 <liyafan82> Refine test case
c008289ec <liyafan82> Resolve test failure after rebasing
c5153931e <liyafan82> Rule out the change for union type comparison
bab74028f <liyafan82> Compare fields for all vectors except union vectors
5b2225e3c <liyafan82> Fix the bug with decimal vector
4d8b570df <liyafan82> Fix problems with current union comparison logic

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

David Li

commit sha 9e8590d7df083bc7a3cd59b1a22805c1f3e86169

ARROW-7734: [C++] check status details for nullptr in equality

When checking statuses for equality, check to make sure that both have status detail objects before proceeding to compare those objects.

Closes #6332 from lidavidm/arrow-7734 and squashes the following commits:
9084bcec7 <David Li> ARROW-7734: check status details for nullptr in equality

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Antoine Pitrou

commit sha 350c7bde14fda070753d2e9b06ba86e5ffb4578d

ARROW-6724: [C++] Allow simpler BufferOutputStream creation

Make the initial capacity argument optional.

Closes #6327 from pitrou/ARROW-6724-bufferoutstream-simpler-ctor and squashes the following commits:
b563f4601 <Antoine Pitrou> ARROW-6724: Allow simpler BufferOutputStream creation

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

tianchen

commit sha 819b18dbcb103a4db9c293879a02ced69309b06f

ARROW-6871: [Java] Enhance TransferPair related parameters check and tests

Related to [ARROW-6871](https://issues.apache.org/jira/browse/ARROW-6871). TransferPair-related parameter checks in different classes have potential problems:

i. splitAndTransfer has no indices check in classes like VarCharVector
ii. the splitAndTransfer indices check in classes like UnionVector is not correct (Preconditions.checkArgument(startIndex + length <= valueCount)); the params should be checked separately
iii. more unit tests should be added to cover corner cases

Closes #5645 from tianchen92/ARROW-6871 and squashes the following commits:
f3b897ddd <tianchen> fix style
0d3c7eab3 <tianchen> add benchmark
01f9a48f2 <tianchen> revert changes in copyFrom
a22d58abb <tianchen> ARROW-6871: Enhance TransferPair related parameters check and tests

Authored-by: tianchen <niki.lj@alibaba-inc.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

liyafan82

commit sha 273e1ce9aecc0e14ad7af4c8954eb3141344aa51

ARROW-7301: [Java] Sql type DATE should correspond to DateDayVector

According to SQL convention, sql type DATE should correspond to a format of YYYY-MM-DD, without components for hour/minute/second/millis. Therefore, JDBC type DATE should correspond to DateDayVector, with a type width of 4 instead of 8.

Closes #5944 from liyafan82/fly_1203_date and squashes the following commits:
a6de37756 <liyafan82> Remove division in time conversion
be731925d <liyafan82> Resolve comments
eea8b7911 <liyafan82> Sql type DATE should correspond to DateDayVector

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Sutou Kouhei

commit sha e4ff2763fb5b71a6e3d29da9f30425626860259e

ARROW-7736: [Release] Retry binary download on transient error

It makes binary verification more robust.

Closes #6335 from kou/release-verification-binaries-retry and squashes the following commits:
38944a9c7 <Sutou Kouhei> Retry binary download on transient error

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Krisztián Szűcs

commit sha c477183c35ed827833ef969bb72160bf786c1942

ARROW-7466: [CI][Java] Fix gandiva-jar-osx nightly build failure

Closes #6331 from kszucs/ARROW-7466 and squashes the following commits:
d50b8c8975 <Krisztián Szűcs> don't install already installed python3
d956297c74 <Krisztián Szűcs> re-enable steps
421142dba8 <Krisztián Szűcs> start all arguments at the same column...
2131836b6f <Krisztián Szűcs> travis multi line string
b50865e70c <Krisztián Szűcs> use travis_build_dir
6f1beb6851 <Krisztián Szűcs> debug paths
7b054ecabf <Krisztián Szűcs> queue path
6929f3d2c7 <Krisztián Szűcs> fix crossbow path
2a2d7c378b <Krisztián Szűcs> deploy using crossbow

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>

view details

Andy Grove

commit sha fee1209ac642f32d37aff4846f9d220c6f306ed6

ARROW-7684: [Rust] Example Flight client and server for DataFusion

This PR adds DataFusion examples for a Flight client and server where the client can send a SQL query to the server and then receive the results. I have manually tested with a Java client as well to confirm that it works.

Closes #6308 from andygrove/datafusion-flight-example and squashes the following commits:
788feef8c <Andy Grove> code cleanup
9c473389a <Neville Dipale> Complete flight client's record batch reader
1337b9890 <Andy Grove> parse recordbatch
459bef386 <Andy Grove> client parses schema from ipc batches
31c894bbc <Andy Grove> update release test script
efe05aef5 <Andy Grove> update release test script
5ecea83b1 <Andy Grove> formatting
8b419dab3 <Andy Grove> update release test script
03d2c8437 <Andy Grove> client streams results
0a39a513d <Andy Grove> client can stream batches
e72c60545 <Andy Grove> add starting point for flight-client example
ab28da82e <Andy Grove> get schema from query plan instead of from first batch
0901a3f88 <Neville Dipale> Merge branch 'datafusion-flight-example' of https://github.com/andygrove/arrow into datafusion-flight-example
ad2e3b066 <Neville Dipale> send schema before batches
996f2a4a2 <Andy Grove> Use PARQUET_TEST_DATA env var
260f9cac1 <Neville Dipale> fix license violation
516b66dc8 <Neville Dipale> add helpers to convert record batch to flight data proto message
6beb4eafb <Andy Grove> WIP example Flight server for DataFusion

Lead-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>

view details

Krisztián Szűcs

commit sha baaf702aad7ea76e3816ca44e8f624143cc103d8

ARROW-7729: [Python][CI] Pin pandas version to 0.25 in the dask integration test

Additionally, test against dask's latest release, not just the master revision.

Closes #6326 from kszucs/dask-pandas-pin and squashes the following commits:
b5cb40eec <Krisztián Szűcs> pin pandas depending on dask's version
083221ce9 <Krisztián Szűcs> pin pandas in the dask integration test

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Krisztián Szűcs

commit sha 0ed81a8fb2a630538b980ed61f304d843fc11dc5

ARROW-7735: [Release][Python] Use pip to install dependencies for wheel verification

The wheel verification script fails for Python 3.5. At the same time, the wheel works properly in Python 3.5 docker images without a conda environment. Conda-forge doesn't maintain packages for Python 3.5 anymore, and something must have mixed up the numpy versions. This change fixed the wheel verification locally for me.

Closes #6339 from kszucs/wheel-verification and squashes the following commits:
3e9694958 <Krisztián Szűcs> remove pytest verbose flags
026e5fb34 <Krisztián Szűcs> use pip to install dependencies for wheel verification

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Antoine Pitrou

commit sha 3f0c41a23d14a3f3d519fc9d3031e2411bceea3b

ARROW-7726: [CI] [C++] Use boost binaries on Windows GHA build

The binaries are installed using Chocolatey, which takes a bit of time (it's a 2+GB install...), but less so than recompiling Boost from scratch during the CMake build.

[skip appveyor]

Closes #6325 from pitrou/ARROW-7726-download-boost-gha and squashes the following commits:
e877622b9 <Antoine Pitrou> Revert "Try a more flexible way of finding Boost"
eb5db8fc1 <Antoine Pitrou> Try a more flexible way of finding Boost
d57064960 <Antoine Pitrou> ARROW-7726: Use boost binaries on Windows GHA build

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

liyafan82

commit sha 5928a9e412a2e9fcc3bb8937bca77813bfbb9f05

ARROW-7073: [Java] Support concatenating vector values in batch

We need a way to copy vector values in batch. Currently we have the copyFrom and copyFromSafe APIs; however, they are not enough, as copying values individually is not performant.

Closes #5916 from liyafan82/fly_1125_veccat and squashes the following commits:
94b407c85 <liyafan82> Support dense union vector
ee49dc678 <liyafan82> Add tests with null values
ad33e234c <liyafan82> Rewrite tests with vector populator for result verification
c89211abc <liyafan82> Rewrite tests with vector populator and provide static utility
7c13ede14 <liyafan82> Support concating vectors values in batch

Authored-by: liyafan82 <fan_li_ya@foxmail.com>
Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

push time in 4 days

push event: apache/arrow

Antoine Pitrou

commit sha d4e3898211207da4d8ff940a54d51338041d285a

ARROW-6165: [Integration] Run integration tests on multiple cores

On a local test with C++, Java, Flight and Gold files enabled, the runtime goes down from 1m56 (serial) to 35s (parallel).

Closes #6416 from pitrou/ARROW-6165-parallel-integration and squashes the following commits:
3ac06e53b <Antoine Pitrou> ARROW-6165: Run integration tests on multiple cores

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 4 days

PR closed: apache/arrow

ARROW-6165: [Integration] Run integration tests on multiple cores

On a local test with C++, Java, Flight and Gold files enabled, the runtime goes down from 1m56 (serial) to 35s (parallel).

+308 -154

3 comments

10 changed files

pitrou

pr closed time in 4 days

Pull request review comment: apache/arrow

ARROW-6165: [Integration] Run integration tests on multiple cores

 def guid():
 )

+class _Printer:
+    """
+    A print()-providing object that can override the stream output on
+    a per-thread basis.
+    """
+
+    def __init__(self):
+        self._tls = threading.local()
+
+    def _get_stdout(self):
+        try:
+            return self._tls.stdout
+        except AttributeError:
+            self._tls.stdout = sys.stdout
+            self._tls.corked = False
+            return self._tls.stdout
+
+    def print(self, *args, **kwargs):
+        """
+        A variant of print() that writes to a thread-local stream.
+        """
+        print(*args, file=self._get_stdout(), **kwargs)
+
+    @property
+    def stdout(self):
+        """
+        A thread-local stdout wrapper that may be temporarily buffered
+        using `cork()`.
+        """
+        return self._get_stdout()
+
+    @contextlib.contextmanager
+    def cork(self):
+        """
+        Temporarily buffer this thread's stream and write out its contents
+        at the end of the context manager.  Useful to avoid interleaved
+        output when multiple threads output progress information.
+        """
+        outer_stdout = self._get_stdout()
+        assert not self._tls.corked, "reentrant call"
+        inner_stdout = self._tls.stdout = io.StringIO()
+        self._tls.corked = True
+        try:
+            yield
+        finally:
+            self._tls.stdout = outer_stdout
+            self._tls.corked = False
+            outer_stdout.write(inner_stdout.getvalue())
+            outer_stdout.flush()

This is nice

pitrou

comment created time in 4 days

pull request comment: apache/arrow

ARROW-7846: [Python][Dev] Remove dependencies on six

@pitrou I think it's okay to leave it as-is until it needs to be worked on; some systems still default to Python 2.7, but we can remove it later and modernize the script.

pitrou

comment created time in 4 days

pull request comment: apache/arrow

ARROW-7462: [C++] Add CpuInfo detection for Arm64 Architecture

@guyuqi we are starting to experiment with Buildkite as a means of hosting ARM workers to validate PRs. If we get this working and you'd like to contribute workers to the cluster as Buildkite agents (it's very easy to set up the agent), that would be super helpful -- especially machines more powerful than the Jetson TX2 or Rockpro64 we currently have available; ThunderX or ThunderX2 servers would be better.

guyuqi

comment created time in 4 days

push event: ursa-labs/dev-tools

Wes McKinney

commit sha ea157c3ad89bd6375bab7b26854f485cba20de5a

Add debian buildkite setup script

Pass buildkite token to debian script

view details

push time in 5 days

pull request comment: apache/arrow-site

[Website] Update documentations for 0.16.0

I restored the Doxygen docs to their own directory.

kszucs

comment created time in 5 days

push event: apache/arrow-site

Krisztián Szűcs

commit sha b7e433b6f16eaecde7a16a70d85f9c01a5acab4a

[Website] Update documentations for 0.16.0 (#45)

* [Website] Update documentations for 0.16.0
* Add Doxygen docs for 0.16.0 at docs/cpp-doxygen

Co-authored-by: Wes McKinney <wesm@users.noreply.github.com>

view details

push time in 5 days

PR merged: apache/arrow-site

[Website] Update documentations for 0.16.0

Used https://github.com/apache/arrow/pull/6407 to generate this. Doxygen was moved to docs/cpp-doxygen.

+1011216 -1006742

7 comments

11317 changed files

kszucs

pr closed time in 5 days

push event: kszucs/arrow-site

Neal Richardson

commit sha 7e7c31c1eefa4872946f7a9fe27de7fd5d687ea9

Updating built site (build 2365ba238360c022fff9d32f21984d957696d94b)

view details

Krisztián Szűcs

commit sha 244017c563ca1e2137326adaad2165a477804995

[Website] Update documentations for 0.16.0

view details

Wes McKinney

commit sha e019afff0270bc323a541d355e2715b1f81ddbcc

Add Doxygen docs for 0.16.0 at docs/cpp-doxygen

view details

push time in 5 days

push event: kszucs/arrow-site

Wes McKinney

commit sha be165e5d09d8a9ea9eef186287a13f552756a7d2

Add Doxygen docs for 0.16.0 at docs/cpp-doxygen

view details

push time in 5 days

push event: ursa-labs/dev-tools

Wes McKinney

commit sha a31f8f565d4035fa68c00ee5402ce1126c2ae21c

Pass buildkite token to debian script

view details

push time in 5 days

create branch: ursa-labs/dev-tools

branch: buildkite-utils

created branch time in 5 days

pull request comment: apache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

A note to self: the metadata key should be b'PARQUET:field_id', not b'PARQUET::field_id'.

wesm

comment created time in 5 days
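Based on the key spelled out above, a hedged sketch of the round trip that ARROW-7080 is meant to enable from Python (names and values are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field("x", pa.int32(),
                             metadata={b"PARQUET:field_id": b"42"})])
table = pa.Table.from_arrays([pa.array([1, 2, 3], pa.int32())], schema=schema)
pq.write_table(table, "ids.parquet")

# With the patch, the field_id should survive the round trip
print(pq.read_table("ids.parquet").schema.field("x").metadata)
```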

pull request comment: apache/arrow

Add Reset method to buffer and cleanup comments on component.

The rebase here seems not to have gone well.

richardartoul

comment created time in 5 days

push event: apache/arrow

Antoine Pitrou

commit sha 412145b8a736a5dce4f17b03d7f72efb45053176

ARROW-5757: [Python] Remove Python 2.7 support

Part of the changes were done using [pyupgrade](https://github.com/asottile/pyupgrade).

Closes #6410 from pitrou/ARROW-5757-py2-goodbye and squashes the following commits:
f0f9f513b <Antoine Pitrou> Address review comments
561ac965c <Antoine Pitrou> ARROW-5757: Remove Python 2.7 support

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 5 days

PR closed apache/arrow

Reviewers
ARROW-5757: [Python] Remove Python 2.7 support

Part of the changes were done using pyupgrade.

+426 -1123

7 comments

98 changed files

pitrou

pr closed time in 5 days

pull request commentapache/arrow

ARROW-5757: [Python] Remove Python 2.7 support

Going to go ahead and merge this

pitrou

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-5757: [Python] Remove Python 2.7 support

 def _isfilestore(self):      @implements(FileSystem.isdir)     def isdir(self, path):-        return super(HadoopFileSystem, self).isdir(path)+        return super().isdir(path)

Oh wow I'm so happy about this.

pitrou

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-5757: [Python] Remove Python 2.7 support

 struct PythonType {   enum type {     BOOL,     INT,-    PY2INT,+    PY2INT,  // Kept for compatibility

There's probably no problem with removing it because we don't offer ABI stability

pitrou

comment created time in 5 days

push eventapache/arrow

Antoine Pitrou

commit sha 3bc01ec94eb2e310b28402a35196e1e8c5c9aec8

ARROW-7815: [C++] Improve input validation This should fix the following issues: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20260 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20282 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20307 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20324 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20330 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20575 Closes #6396 from pitrou/ARROW-7815-fuzz-issues and squashes the following commits: 0efa846e1 <Antoine Pitrou> Update testing submodule fd8e865a4 <Antoine Pitrou> ARROW-7815: Improve input validation Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 5 days

PR closed apache/arrow

ARROW-7815: [C++] Improve input validation

This should fix the following issues: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20260 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20282 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20307 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20324 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20330 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20575

+224 -49

1 comment

12 changed files

pitrou

pr closed time in 5 days

push eventapache/arrow

Antoine Pitrou

commit sha 9f0c70c8337b1a8c75df5cc9410b54e97351bfc0

ARROW-2447: [C++] Device and MemoryManager API Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The layer exposes two interfaces: * the `Device` interface exposes information about a particular memory-holding device * the `MemoryManager` allows allocating, copying, reading or writing memory located on a particular device The `Buffer` API is modified so that calling `data()` fails on non-CPU buffers. A separate `address()` method returns the buffer address as an integer, and is allowed on any buffer. The API provides convenience functions to view or copy a buffer from one device to the other. For example, an on-GPU buffer can be copied to the CPU, and in some situations a zero-copy CPU view can also be created (depending on the GPU capabilities and how the GPU memory was allocated). An example use in the PR is IPC. On the write side, a new `SerializeRecordBatch` overload takes a `MemoryManager` argument and is able to serialize data to any kind of memory (CPU, GPU). On the read side, `ReadRecordBatch` now works on any kind of input buffer, and returns record batches backed by either CPU or GPU memory. It introduces a slight complexity in the CUDA namespace, since there are both `CudaContext` and `CudaMemoryManager` classes. We could solve this by merging the two concepts (but doing so may break compatibility for existing users of CUDA). Closes #6295 from pitrou/ARROW-2447-device-api-memory-manager and squashes the following commits: c665f6177 <Antoine Pitrou> ARROW-2447: Device and MemoryManager API Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 5 days

PR closed apache/arrow

ARROW-2447: [C++] Device and MemoryManager API

Add an abstraction layer to allow safe handling of buffers residing on different devices (the CPU, a GPU...). The layer exposes two interfaces:

  • the Device interface exposes information about a particular memory-holding device
  • the MemoryManager allows allocating, copying, reading or writing memory located on a particular device

The Buffer API is modified so that calling data() fails on non-CPU buffers. A separate address() method returns the buffer address as an integer, and is allowed on any buffer.

The API provides convenience functions to view or copy a buffer from one device to another. For example, an on-GPU buffer can be copied to the CPU, and in some situations a zero-copy CPU view can also be created (depending on the GPU capabilities and how the GPU memory was allocated).

An example use in the PR is IPC. On the write side, a new SerializeRecordBatch overload takes a MemoryManager argument and is able to serialize data to any kind of memory (CPU or GPU). On the read side, ReadRecordBatch now works on any kind of input buffer, and returns record batches backed by either CPU or GPU memory.

It introduces a slight complexity in the CUDA namespace, since there are both CudaContext and CudaMemoryManager classes. We could solve this by merging the two concepts (but doing so may break compatibility for existing users of CUDA).
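
To make the new surface concrete, here is a hedged usage sketch (not code from the diff; the names Buffer::is_cpu/address, MemoryManager::CopyBuffer and default_cpu_memory_manager are taken from the PR):

```cpp
// Hedged sketch: handle a buffer whose residency (CPU or device) is unknown.
#include "arrow/api.h"
#include "arrow/device.h"
#include "arrow/result.h"

arrow::Status ReadAnywhere(const std::shared_ptr<arrow::Buffer>& buf) {
  if (buf->is_cpu()) {
    const uint8_t* p = buf->data();  // data() is only valid for CPU buffers
    (void)p;
  }
  uintptr_t addr = buf->address();  // address() is valid for any device
  (void)addr;
  // Materialize a CPU-accessible copy regardless of where the data lives.
  ARROW_ASSIGN_OR_RAISE(
      auto cpu_buf,
      arrow::MemoryManager::CopyBuffer(buf, arrow::default_cpu_memory_manager()));
  return arrow::Status::OK();
}
```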

+2245 -424

17 comments

35 changed files

pitrou

pr closed time in 5 days

Pull request review commentapache/arrow

ARROW-2447: [C++] Device and MemoryManager API

 class ARROW_EXPORT Buffer {   explicit operator util::bytes_view() const { return util::bytes_view(data_, size_); }    /// \brief Return a pointer to the buffer's data-  const uint8_t* data() const { return data_; }+  ///+  /// The buffer has to be a CPU buffer (`is_cpu()` is true).+  /// Otherwise, an assertion may be thrown or a null pointer may be returned.+  ///+  /// To get the buffer's data address regardless of its device, call `address()`.+  const uint8_t* data() const {+#ifndef NDEBUG+    CheckCPU();+#endif+    return ARROW_PREDICT_TRUE(is_cpu_) ? data_ : NULLPTR;

OK, a fair point (my CUDA is a bit rusty but I do recall this now)

pitrou

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-2447: [C++] Device and MemoryManager API

 #include "arrow/buffer.h" #include "arrow/buffer_builder.h"+#include "arrow/device.h" #include "arrow/memory_pool.h" #include "arrow/status.h" #include "arrow/testing/gtest_util.h"+#include "arrow/util/checked_cast.h"  namespace arrow { +using internal::checked_cast;+using internal::checked_pointer_cast;++static const char kMyDeviceTypeName[] = "arrowtest::MyDevice";++static const int kMyDeviceAllowCopy = 1;+static const int kMyDeviceAllowView = 2;+static const int kMyDeviceDisallowCopyView = 3;++class MyDevice : public Device {+ public:+  explicit MyDevice(int value) : Device(), value_(value) {}++  const char* type_name() const override { return kMyDeviceTypeName; }++  std::string ToString() const override {+    switch (value_) {+      case kMyDeviceAllowCopy:+        return "MyDevice[noview]";+      case kMyDeviceAllowView:+        return "MyDevice[nocopy]";+      default:+        return "MyDevice[nocopy][noview]";+    }+  }++  bool Equals(const Device& other) const override {+    if (other.type_name() != kMyDeviceTypeName) {+      return false;+    }+    return checked_cast<const MyDevice&>(other).value_ == value_;+  }++  std::shared_ptr<MemoryManager> default_memory_manager() override;++  int value() const { return value_; }++  bool allow_copy() const { return value_ == kMyDeviceAllowCopy; }++  bool allow_view() const { return value_ == kMyDeviceAllowView; }++ protected:+  int value_;+};++class MyMemoryManager : public MemoryManager {+ public:+  explicit MyMemoryManager(std::shared_ptr<Device> device) : MemoryManager(device) {}++  bool allow_copy() const {+    return checked_cast<const MyDevice&>(*device()).allow_copy();+  }++  bool allow_view() const {+    return checked_cast<const MyDevice&>(*device()).allow_view();+  }++  Result<std::shared_ptr<io::RandomAccessFile>> GetBufferReader(+      std::shared_ptr<Buffer> buf) override {+    return Status::NotImplemented("");+  }++  Result<std::shared_ptr<Buffer>> AllocateBuffer(int64_t size) override {+    return Status::NotImplemented("");+  }++  Result<std::shared_ptr<io::OutputStream>> GetBufferWriter(+      std::shared_ptr<Buffer> buf) override {+    return Status::NotImplemented("");+  }++ protected:+  Result<std::shared_ptr<Buffer>> CopyBufferFrom(+      const std::shared_ptr<Buffer>& buf,+      const std::shared_ptr<MemoryManager>& from) override;+  Result<std::shared_ptr<Buffer>> CopyBufferTo(+      const std::shared_ptr<Buffer>& buf,+      const std::shared_ptr<MemoryManager>& to) override;+  Result<std::shared_ptr<Buffer>> ViewBufferFrom(+      const std::shared_ptr<Buffer>& buf,+      const std::shared_ptr<MemoryManager>& from) override;+  Result<std::shared_ptr<Buffer>> ViewBufferTo(+      const std::shared_ptr<Buffer>& buf,+      const std::shared_ptr<MemoryManager>& to) override;+};++class MyBuffer : public Buffer {+ public:+  MyBuffer(std::shared_ptr<MemoryManager> mm, const std::shared_ptr<Buffer>& parent)+      : Buffer(parent->data(), parent->size()) {+    parent_ = parent;+    SetMemoryManager(mm);+  }+};++std::shared_ptr<MemoryManager> MyDevice::default_memory_manager() {+  return std::make_shared<MyMemoryManager>(shared_from_this());+}++Result<std::shared_ptr<Buffer>> MyMemoryManager::CopyBufferFrom(+    const std::shared_ptr<Buffer>& buf, const std::shared_ptr<MemoryManager>& from) {+  if (!allow_copy()) {+    return nullptr;+  }+  if (from->is_cpu()) {+    // CPU to MyDevice:+    // 1. CPU to CPU+    ARROW_ASSIGN_OR_RAISE(auto dest,+                          MemoryManager::CopyBuffer(buf, default_cpu_memory_manager()));+    // 2. Wrap CPU buffer result+    return std::make_shared<MyBuffer>(shared_from_this(), dest);+  }+  return nullptr;+}++Result<std::shared_ptr<Buffer>> MyMemoryManager::CopyBufferTo(+    const std::shared_ptr<Buffer>& buf, const std::shared_ptr<MemoryManager>& to) {+  if (!allow_copy()) {+    return nullptr;+  }+  if (to->is_cpu() && buf->parent()) {+    // MyDevice to CPU+    return MemoryManager::CopyBuffer(buf->parent(), to);+  }+  return nullptr;+}++Result<std::shared_ptr<Buffer>> MyMemoryManager::ViewBufferFrom(+    const std::shared_ptr<Buffer>& buf, const std::shared_ptr<MemoryManager>& from) {+  if (!allow_view()) {+    return nullptr;+  }+  if (from->is_cpu()) {+    // CPU on MyDevice: wrap CPU buffer+    return std::make_shared<MyBuffer>(shared_from_this(), buf);+  }+  return nullptr;+}++Result<std::shared_ptr<Buffer>> MyMemoryManager::ViewBufferTo(+    const std::shared_ptr<Buffer>& buf, const std::shared_ptr<MemoryManager>& to) {+  if (!allow_view()) {+    return nullptr;+  }+  if (to->is_cpu() && buf->parent()) {+    // MyDevice on CPU: unwrap buffer+    return buf->parent();+  }+  return nullptr;+}++// Like AssertBufferEqual, but doesn't call Buffer::data()+void AssertMyBufferEqual(const Buffer& buffer, util::string_view expected) {+  ASSERT_EQ(util::string_view(buffer), expected);

It's curious that this works

pitrou

comment created time in 5 days

pull request commentapache/arrow-site

[Website] Update documentations for 0.16.0

There are broken links in the release blog post. Does anything else need to be done here aside from restoring the Doxygen docs?

kszucs

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7815: [C++] Improve input validation

 class ARROW_EXPORT NullType : public DataType {   std::string ToString() const override;    DataTypeLayout layout() const override {-    return {{DataTypeLayout::kAlwaysNullBuffer}, false};+    return DataTypeLayout({DataTypeLayout::AlwaysNull()});

I think that is the case

pitrou

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 class ARROW_EXPORT ChunkedArray {   ARROW_DISALLOW_COPY_AND_ASSIGN(ChunkedArray); }; +namespace internal {++/// \brief EXPERIMENTAL: Utility for incremental iteration over contiguous+/// pieces of potentially differently-chunked ChunkedArray objects+class MultipleChunkIterator {+ public:+  MultipleChunkIterator(const ChunkedArray& left, const ChunkedArray& right)+      : left_(left),+        right_(right),+        pos_(0),+        length_(left.length()),+        chunk_idx_left_(0),+        chunk_idx_right_(0),+        chunk_pos_left_(0),+        chunk_pos_right_(0) {}++  bool Next(std::shared_ptr<Array>* next_left, std::shared_ptr<Array>* next_right);++  int64_t position() const { return pos_; }++ private:+  const ChunkedArray& left_;+  const ChunkedArray& right_;++  // The amount of the entire ChunkedArray consumed+  int64_t pos_;++  // Length of the chunked array(s)+  int64_t length_;++  // Current left chunk+  int chunk_idx_left_;++  // Current right chunk+  int chunk_idx_right_;++  // Offset into the current left chunk+  int64_t chunk_pos_left_;++  // Offset into the current right chunk+  int64_t chunk_pos_right_;+};++// Execute the passed function

Will do
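
For context, a hypothetical usage sketch of the iterator shown in the diff above (names are taken from the header; nothing here appears in the PR verbatim):

```cpp
// Walk two differently-chunked ChunkedArrays in lockstep over their
// contiguous overlapping pieces.
#include <memory>
#include "arrow/table.h"

void VisitOverlaps(const arrow::ChunkedArray& left,
                   const arrow::ChunkedArray& right) {
  arrow::internal::MultipleChunkIterator it(left, right);
  std::shared_ptr<arrow::Array> next_left, next_right;
  while (it.Next(&next_left, &next_right)) {
    // next_left and next_right are equal-length slices covering the same
    // logical range, ending at it.position().
  }
}
```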

wesm

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 namespace arrow { class ARROW_EXPORT KeyValueMetadata {  public:   KeyValueMetadata();-  KeyValueMetadata(const std::vector<std::string>& keys,-                   const std::vector<std::string>& values);+  KeyValueMetadata(std::vector<std::string> keys, std::vector<std::string> values);   explicit KeyValueMetadata(const std::unordered_map<std::string, std::string>& map);   virtual ~KeyValueMetadata() = default; +  /// \brief Convenience constructor+  static std::shared_ptr<KeyValueMetadata> Make(std::vector<std::string> keys,

Good point. I'll remove this

wesm

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 class PARQUET_EXPORT Node {    const std::shared_ptr<const LogicalType>& logical_type() const { return logical_type_; } -  int id() const { return id_; }+  /// \brief The field_id value for the serialized SchemaElement. If the+  /// field_id is less than 0 (e.g. -1), it will not be set when serialized to+  /// Thrift

The next_id() default assignments only occur during serialization of the entire SchemaDescriptor (I think -- will double-check), but if you have field_id == -1 and call Node::ToParquet it will happily serialize. I'll add some comments to clarify
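
A hedged sketch of the behavior described here, patterned on this PR's CheckNodeRoundtrip test helper (the include path for the Thrift-generated format::SchemaElement is an assumption):

```cpp
// Hedged sketch: a node with field_id == -1 still serializes; the optional
// Thrift field_id member is simply left unset rather than defaulted.
#include "parquet/schema.h"
#include "parquet/thrift_internal.h"  // assumed path for format::SchemaElement

void SerializeNode(const parquet::schema::Node& node) {
  parquet::format::SchemaElement element;
  node.ToParquet(&element);  // no error even when field_id was never set
}
```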

wesm

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 namespace schema {  static inline SchemaElement NewPrimitive(const std::string& name,                                          FieldRepetitionType::type repetition,-                                         Type::type type, int id = 0) {+                                         Type::type type, int field_id = -1) {   SchemaElement result;   result.__set_name(name);   result.__set_repetition_type(repetition);   result.__set_type(static_cast<format::Type::type>(type));-+  if (field_id >= 0) {+    result.__set_field_id(field_id);+  }   return result; }  static inline SchemaElement NewGroup(const std::string& name,                                      FieldRepetitionType::type repetition,-                                     int num_children, int id = 0) {+                                     int num_children, int field_id = -1) {   SchemaElement result;   result.__set_name(name);   result.__set_repetition_type(repetition);   result.__set_num_children(num_children); +  if (field_id >= 0) {+    result.__set_field_id(field_id);+  }+   return result; } +template <typename NodeType>+static void CheckNodeRoundtrip(const Node& node) {+  format::SchemaElement serialized;+  node.ToParquet(&serialized);+  std::unique_ptr<Node> recovered = NodeType::FromParquet(&serialized);+  ASSERT_TRUE(node.Equals(recovered.get()))+      << "Recovered node not equivalent to original node constructed "+      << "with logical type " << node.logical_type()->ToString() << " got "+      << recovered->logical_type()->ToString();+  ASSERT_EQ(recovered->field_id(), node.field_id());

I did that later; I'll remove this superfluous check

wesm

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 static std::string Print(const NodePtr& node) { TEST(TestSchemaPrinter, Examples) {   // Test schema 1   NodeVector fields;-  fields.push_back(Int32("a", Repetition::REQUIRED));+  fields.push_back(Int32("a", Repetition::REQUIRED, 1));    // 3-level list encoding-  NodePtr item1 = Int64("item1");-  NodePtr item2 = Boolean("item2", Repetition::REQUIRED);+  NodePtr item1 = Int64("item1", Repetition::OPTIONAL, 4);+  NodePtr item2 = Boolean("item2", Repetition::REQUIRED, 5);   NodePtr list(-      GroupNode::Make("b", Repetition::REPEATED, {item1, item2}, ConvertedType::LIST));-  NodePtr bag(GroupNode::Make("bag", Repetition::OPTIONAL, {list}));+      GroupNode::Make("b", Repetition::REPEATED, {item1, item2}, ConvertedType::LIST, 3));+  NodePtr bag(+      GroupNode::Make("bag", Repetition::OPTIONAL, {list}, /*logical_type=*/nullptr, 2));   fields.push_back(bag);    fields.push_back(PrimitiveNode::Make("c", Repetition::REQUIRED, Type::INT32,-                                       ConvertedType::DECIMAL, -1, 3, 2));+                                       ConvertedType::DECIMAL, -1, 3, 2, 6));    fields.push_back(PrimitiveNode::Make("d", Repetition::REQUIRED,-                                       DecimalLogicalType::Make(10, 5), Type::INT64, -1));+                                       DecimalLogicalType::Make(10, 5), Type::INT64,+                                       /*length=*/-1, 7)); -  NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, fields);+  NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, fields,+                                   /*logical_type=*/nullptr, 0);    std::string result = Print(schema);-  std::string expected = R"(message schema {-  required int32 a;-  optional group bag {-    repeated group b (List) {-      optional int64 item1;-      required boolean item2;+  std::string expected = R"(repeated group field_id=0 schema {

Right, that's how they're auto-assigned by the library, but another library may set different ids

wesm

comment created time in 5 days

Pull request review commentapache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata. Assorted additional changes

 def _read_table(*args, **kwargs):     return pq.read_table(*args, **kwargs)  +def assert_tables_equal(left, right):

We may; however, this is a detail of this module, because the roundtrip to Parquet now introduces extra Field metadata, so result.equals(expected) will fail unless the right metadata is set in expected

wesm

comment created time in 5 days

pull request commentapache/arrow

PARQUET-1770: [C++][CI] Add fuzz target for reading Parquet files

+1 (sorry for the late review)

pitrou

comment created time in 5 days

pull request commentapache/arrow

PARQUET-1770: [C++][CI] Add fuzz target for reading Parquet files

@pitrou that's right -- you can disable dictionary encoding on a per-column basis, though
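
For example, a minimal sketch with the existing WriterProperties builder ("my_column" is a placeholder name):

```cpp
// Minimal sketch: disable dictionary encoding for a single column while
// leaving it enabled (the default) for all other columns.
#include <memory>
#include "parquet/properties.h"

std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  builder.disable_dictionary("my_column");
  return builder.build();
}
```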

pitrou

comment created time in 5 days

startedpsf/black

started time in 5 days

issue commentwesm/feather

Allow saving to compressed files

I suggest following https://issues.apache.org/jira/browse/ARROW-5510 or discussing more on the dev@arrow.apache.org mailing list. I agree that having compression would be a big help, but there are very few individuals who have volunteered time to work on this

ostrokach

comment created time in 5 days

startedhannesmuehleisen/miniparquet

started time in 5 days

pull request commentapache/arrow

ARROW-2447: [C++] Device and MemoryManager API

The performance change seems acceptable. I assume this is ready for a final review and merge?

pitrou

comment created time in 5 days

push eventapache/arrow-site

Neal Richardson

commit sha 2365ba238360c022fff9d32f21984d957696d94b

ARROW-7580: [Website] 0.16 release post (#41) * Create outline for 0.16 release post * Add release notes Add highlights for C++, Python, community discussions. * Add dataset changes * Add Rust notes * Add Ruby and C GLib notes * Add syntax highlighting * update Rust notes * Add some notes, simplify some of the C++ section for readability * Revise/extend 0.16 blog post * Add some updates for the specification and Java * Added a java bug fix bullet point. * add python 2.7 note * Fix typo * Rename file * Edits for publication * More cleanup * Update publication date Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: François Saint-Jacques <fsaintjacques@gmail.com> Co-authored-by: Wakahisa <nevilledips@gmail.com> Co-authored-by: Yosuke Shiro <yosuke.shiro615@gmail.com> Co-authored-by: Wes McKinney <wesm@users.noreply.github.com> Co-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Kenta Murata <mrkn@users.noreply.github.com>

view details

push time in 5 days

push eventwesm/arrow

Wes McKinney

commit sha 94e53c4adbbe851396f08c9deee5b5a42150ce75

Start working on properly preserving and deserializing field_id in C++. Some field_id round trips working Schema tests passing, some refactoring to simplify schema serialization code Add KeyValueMetadata::Merge method, Arrow schema access for ParquetFile Add Field::WithMergedMetadata Add Parquet field_id metadata to Arrow schemas. Better diffing in AssertTablesEqual Ignore unequal child field metadata in ChunkedArray::Equals, fix Python Parquet unit tests Add Python unit test to verify that field_id's are correctly filtered through to the Arrow schema in Python decruft

view details

push time in 6 days

PR opened apache/arrow

ARROW-7080: [C++][Parquet] Read and write "field_id" attribute in Parquet files, propagate to Arrow field metadata

The field_id is used for schema evolution and other things. It is surfaced in Python in the Field.metadata as b'PARQUET:field_id' (a short C++ sketch follows the list below)

  • ChunkedArray::Equals would fail if a child field had unequal metadata; now it does not check the metadata
  • Improved diffing output in AssertTablesEqual in testing/gtest_util.h (may need some more tests around this)
  • Added a generic binary ChunkedArray iterator (see internal::MultipleChunkIterator) and a helpful applicator, internal::ApplyToChunkOverlaps. I retrofitted ChunkedArray::Equals to use this (needed to improve the diffing output in AssertTablesEqual)
  • Add KeyValueMetadata::Merge method
  • Add KeyValueMetadata::Make convenience ctor
  • Add Field::WithMergedMetadata method that calls KeyValueMetadata::Merge
  • Print metadata in Field::ToString
  • Add parquet.ParquetFile.schema_arrow property to return the effective Arrow schema
  • Print field_ids in parquet::SchemaPrinter
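
And the C++ sketch referenced above, with signatures patterned on this PR's schema tests (the trailing argument on the node factories is the new field_id; -1 means unset):

```cpp
// Hedged sketch: build Parquet schema nodes with explicit field_ids and
// read them back via the field_id() accessor added in this PR.
#include <iostream>
#include "parquet/schema.h"

using parquet::Repetition;
using parquet::Type;
using parquet::schema::GroupNode;
using parquet::schema::NodePtr;
using parquet::schema::PrimitiveNode;

int main() {
  NodePtr a = PrimitiveNode::Make("a", Repetition::REQUIRED, Type::INT32,
                                  parquet::ConvertedType::NONE, /*length=*/-1,
                                  /*precision=*/-1, /*scale=*/-1, /*field_id=*/1);
  NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, {a},
                                   /*logical_type=*/nullptr, /*field_id=*/0);
  std::cout << a->field_id() << std::endl;  // 1
  return 0;
}
```
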
+695 -432

0 comments

24 changed files

pr created time in 6 days

push eventwesm/arrow

Wes McKinney

commit sha 2ebdd5e4ca1b3230e7c59c9deeeb23349944a5a7

Add Python unit test to verify that field_id's are correctly filtered through to the Arrow schema in Python

view details

push time in 6 days

push eventwesm/arrow

Paddy Horan

commit sha e6eb6bd5ff75820cb31486279b96a2f8688bb65d

ARROW-7624: [Rust] Soundness issues via `Buffer` methods Closes #6397 from paddyhoran/from-raw-parts-unsafe and squashes the following commits: 963c177d7 <Paddy Horan> Fixes calls to unsafe functions d02fa8bbc <Paddy Horan> Fixed typo. cffcc0b80 <Paddy Horan> Makes `typed_data` unsafe. c6ad47c0d <Paddy Horan> Updates `from_raw_parts` to be unsafe Authored-by: Paddy Horan <paddyhoran@hotmail.com> Signed-off-by: Paddy Horan <paddyhoran@hotmail.com>

view details

Wes McKinney

commit sha 026f401b5ae36fb3b87b0e614d7876b6d1e2042c

Start working on properly preserving and deserializing field_id in C++

view details

Wes McKinney

commit sha ed31221732ba08bf62fd9050ae34519ccfd05cf2

Some field_id round trips working

view details

Wes McKinney

commit sha 0574cea904c8baf8c687018c892498be8a933f09

Schema tests passing, some refactoring to simplify schema serialization code

view details

Wes McKinney

commit sha aca8d033a321ea02647b49df11726137939b5da7

Add KeyValueMetadata::Merge method, Arrow schema access for ParquetFile

view details

Wes McKinney

commit sha 6e3b6a5d1119008f63c2c36dc520120c2def7d43

Add Field::WithMergedMetadata

view details

Wes McKinney

commit sha 661c737983e610799e31fcbf6595aa19f2460f3d

Add Parquet field_id metadata to Arrow schemas. Better diffing in AssertTablesEqual

view details

Wes McKinney

commit sha c73b87d5a424cc087a21cdb62035c1ac7f19b3b7

Ignore unequal child field metadata in ChunkedArray::Equals, fix Python Parquet unit tests

view details

push time in 6 days

pull request commentapache/arrow-site

[Website] Update documentations for 0.16.0

https://issues.apache.org/jira/browse/ARROW-7835

kszucs

comment created time in 6 days

pull request commentapache/arrow-site

[Website] Update documentations for 0.16.0

I think we should link to the Doxygen from the Sphinx project (you can put external links in the TOC tree). We can do that in a follow-up change to the docs

kszucs

comment created time in 6 days

pull request commentapache/arrow-site

[Website] Update documentations for 0.16.0

@kszucs I am not sure we should delete the Doxygen; what is in Sphinx is not comprehensive

kszucs

comment created time in 6 days

push eventwesm/arrow

Joris Van den Bossche

commit sha b39283534f63a3a986625f9988f42251215553e5

ARROW-7591: [Python] Fix DictionaryArray.to_numpy() to return decoded numpy array Closes #6212 from jorisvandenbossche/ARROW-7591 and squashes the following commits: 230b7d2b0 <Joris Van den Bossche> ARROW-7591: Fix DictionaryArray.to_numpy() to return decoded numpy array Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

David Li

commit sha 0a718d376088ad231fb56eacef42f865423ebfef

ARROW-7551: [FlightRPC][C++] Flight test on macOS fails due to Homebrew gRPC This test is consistently flaky on MacOS, but I can't reproduce it locally. Closes #6207 from lidavidm/arrow-7551 and squashes the following commits: d1961f076 <David Li> ARROW-7551: don't use buggy gRPC on MacOS for now Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Neal Richardson

commit sha 85eef1c71d33af9f3e8bce3dab215ad09408b3a9

ARROW-7532: [CI] Unskip brew test after Homebrew fixes it upstream Closes #6196 from nealrichardson/brew-test and squashes the following commits: 9b5361dfe <Neal Richardson> Don't fail on brew doctor c32d20b30 <Neal Richardson> Add in brew config && brew doctor, in case of future issues d1bfd66d5 <Neal Richardson> Reinstate brew test after upstream fix Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Joris Van den Bossche

commit sha f743c1b6f3c7ca05321351166dc9563d8b12befb

ARROW-7432: [Python] Add higher level open_dataset function Follow-up on https://github.com/apache/arrow/pull/5237 adding a higher-level API for datasets Closes #6022 from jorisvandenbossche/dataset-python and squashes the following commits: 745c21842 <Joris Van den Bossche> rename keyword to partitioning + refactor tests + more coverage 8e032824f <Joris Van den Bossche> update for big renaming + doc updates 9c9593804 <Joris Van den Bossche> Use FileSystem.from_uri ac0d83daf <Joris Van den Bossche> split into source / dataset functions 866f72c8a <Joris Van den Bossche> Add single partitioning() function from kszucs + tests 7481fb6d9 <Joris Van den Bossche> fix import for python 2 d59595d7e <Joris Van den Bossche> add partition scheme creation functions 260b7373a <Joris Van den Bossche> add support for Pathlib 5e00c87ee <Joris Van den Bossche> fix with new partition discovery option 757fe8019 <Joris Van den Bossche> Add higher level open_dataset function Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Neal Richardson

commit sha 0f30b70e873785042cd27d029fcd6b7a75e54821

ARROW-7092: [R] Add vignette for dplyr and datasets The vignette could probably use some polishing, and it's not awesome that it's not a dynamic document (all of the code chunks and output are static, pregenerated) but I'm not sure of a better way since we can't include the example data in the package. Feedback greatly appreciated. Given the push for the 0.16 release, I'll probably merge this as soon as CI is green because I have another issue (https://issues.apache.org/jira/browse/ARROW-7581) for documentation polishing for the release and I don't want this to block that. I can incorporate suggestions and edits folks have in that followup PR. Closes #6222 from nealrichardson/dataset-vignette and squashes the following commits: cca95f053 <Neal Richardson> Add notes on extension points for datasets e3ba51daf <Neal Richardson> First pass at (static) dataset vignette; also add some print methods and various schema methods Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Fokko Driesprong

commit sha be4b245f35c23a7e42978c74b739d52a5cc6c58e

ARROW-7567: [Java] Bump Checkstyle from 6.19 to 8.19 Bump checkstyle to the latest version, and fix the new violations based on new rules that have been introduced. Closes #6186 from Fokko/fd-bump-checkstyle and squashes the following commits: b5e77d485 <Fokko Driesprong> Bump from 8.18 to 8.19 41b6b4ab1 <Fokko Driesprong> Merge branch 'master' of https://github.com/apache/arrow into fd-bump-checkstyle 593cfb6d3 <Fokko Driesprong> ARROW-7567: Bump Checkstyle from 6.19 to 8.18 Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Micah Kornfield

commit sha e3b68dd4fa7c98d807fb7ad21af83148c3e751c5

ARROW-7567: [Java] Fix races in checkstyle upgdae Closes #6223 from emkornfield/verify_checkstyle and squashes the following commits: ee8cbcc72 <Micah Kornfield> fix formatting 06ba7f403 <Micah Kornfield> fix formatting Authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Sutou Kouhei

commit sha d829d838a37da09838190f47e946dc982847e689

ARROW-7603: [Packaging][RPM] Add workaround for LLVM on CentOS 8 CentOS 8.0.1905 provides LLVM 7 but CentOS 8.1.1911 provides only LLVM 8. We should support LLVM 8 instead of using CentOS 8.0.1905. So this is just a workaround. We should remove this workaround later. Closes #6224 from kou/packaging-rpm-centos-8-llvm-7 and squashes the following commits: c3bb44d8f <Sutou Kouhei> Add workaround for LLVM on CentOS 8 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Sutou Kouhei

commit sha 45ad34603a065071673e6484e037a2374c53e43b

ARROW-7612: [Packaging][Python] Fix artifacts path for Conda on Windows See the following for real path: https://dev.azure.com/ursa-labs/2a3e076a-0cff-409a-87ab-3f3adb390ea7/_apis/build/builds/4988/logs/13 2020-01-17T09:56:43.7023218Z # If you want to upload package(s) to anaconda.org later, type: 2020-01-17T09:56:43.7023307Z 2020-01-17T09:56:43.7023432Z anaconda upload D:\bld\win-64\arrow-cpp-0.15.0.dev606-py37h3b6a26a_0.tar.bz2 2020-01-17T09:56:43.7023526Z anaconda upload D:\bld\win-64\parquet-cpp-1.5.1-0.tar.bz2 2020-01-17T09:56:43.7023659Z anaconda upload D:\bld\win-64\pyarrow-0.15.0.dev606-py37h803c963_0.tar.bz2 Closes #6226 from kou/packaging-conda-windows-fix-artifacts-path and squashes the following commits: 367b938bc <Sutou Kouhei> Fix artifacts path for Conda on Windows Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Wes McKinney

commit sha ebb3f6f42ed6cfae5262a723858d5bbbe684a8e6

ARROW-6899: [Python] Decode dictionary-encoded List children to dense when converting to pandas Previously we errored out when encountering a dictionary-encoded child Closes #6199 from wesm/ARROW-6899 and squashes the following commits: 6c8efeece <Wes McKinney> Decode dictionary-encoded List children to dense when converting to pandas Authored-by: Wes McKinney <wesm+git@apache.org> Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Sutou Kouhei

commit sha c5d8fae191141a79b2b1c9ee362df9018d5b1b27

ARROW-7611: [Packaging][Python] Fix artifacts patterns for wheel See the followings for real artifact names: * https://github.com/ursa-labs/crossbow/releases/tag/nightly-2020-01-17-0-travis-wheel-osx-cp38 * https://github.com/ursa-labs/crossbow/releases/tag/nightly-2020-01-17-0-appveyor-wheel-win-cp38 Closes #6225 from kou/packaging-wheel-fix-artifacts-pattern and squashes the following commits: 11544aded <Sutou Kouhei> Fix artifacts pattern for wheel Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Antoine Pitrou

commit sha cbea3275c52d59c3312f43f936a9d07dfbdc2e37

ARROW-7577: [CI][C++] Check OSS-Fuzz build in Github Actions This doesn't run the fuzzers, but checks they build correctly. Closes #6229 from pitrou/ARROW-7577-gha-fuzz-check and squashes the following commits: 7bb85a30c <Antoine Pitrou> ARROW-7577: Check OSS-Fuzz build in Github Actions Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Krisztián Szűcs

commit sha ec15d2f731e168dfbe01880b104e05bebf78a82c

ARROW-7519: [Python] Build wheels, conda packages with dataset support Closes #6169 from kszucs/ARROW-7519 and squashes the following commits: 0690e4b20 <Krisztián Szűcs> don't rely on order 97e421ab5 <Krisztián Szűcs> Enable dataset for conda windows builds e2ddf4fd6 <Krisztián Szűcs> wheel f9640c68b <Krisztián Szűcs> conda Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Antoine Pitrou

commit sha 92931939ac2c8c89c278c2413608c3f1d2663f91

ARROW-7621: [Doc] Fix doc build Closes #6231 from pitrou/ARROW-7621-fix-doc-build and squashes the following commits: 5034e29cb <Antoine Pitrou> ARROW-7621: Fix doc build Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: François Saint-Jacques <fsaintjacques@gmail.com>

view details

Antoine Pitrou

commit sha a6307cc8a0127a4852b7d3d2c20940a094eed0d6

ARROW-7618: [C++] Fix crashes or undefined behaviour on corrupt IPC input Should fix those issues: - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20116 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20186 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20199 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20217 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20221 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20241 - https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20246 Closes #6230 from pitrou/ARROW-7618-ipc-fuzz-fixes and squashes the following commits: f7293b782 <Antoine Pitrou> ARROW-7618: Fix crashes or undefined behaviour on corrupt IPC input Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Antoine Pitrou

commit sha 3dc281397f42dd26bf77e45c7ba4ebeb1eb9aba8

ARROW-7623: [C++] Update generated flatbuffers code Also use scoped enums, which are much nicer. Closes #6233 from pitrou/ARROW-7623-update-fbs-generated-code and squashes the following commits: 9942d2c96 <Antoine Pitrou> ARROW-7623: Update generated flatbuffers code Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

Projjal Chanda

commit sha 17220bb622fbf88c42c914219e17258bcaf6a1cb

ARROW-7589: [C++][Gandiva] Calling castVarchar from java sometimes results in segmentation fault for input length 0 Closes #6210 from projjal/fix_castvarchar and squashes the following commits: 243b23696 <Projjal Chanda> Revert "ARROW-7215: Implement castVARCHAR(numeric_type) functions in Gandiva" 554cce794 <Projjal Chanda> Revert "Fixed castVarchar failure" 3671e1e25 <Projjal Chanda> Revert "updated unit test" 7fc843326 <Projjal Chanda> updated unit test 0436f85ea <Projjal Chanda> Fixed castVarchar failure Authored-by: Projjal Chanda <iam@pchanda.com> Signed-off-by: Praveen <praveen@dremio.com>

view details

Sutou Kouhei

commit sha d3f4763edb3f2f85982d7d48bbe0c41408910947

ARROW-7626: [Parquet][GLib] Add support for version macros Closes #6236 from kou/glib-parquet-version-h and squashes the following commits: 7294bf6c0 <Sutou Kouhei> Add support for version macros Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Yosuke Shiro <yosuke.shiro615@gmail.com>

view details

Joris Van den Bossche

commit sha ace72c2afa6b7608bca9ba858fdd10b23e7f2dbf

ARROW-7614: [Python] Limit size of data in test_parquet.py::test_set_data_page_size Closes #6240 from jorisvandenbossche/ARROW-7614 and squashes the following commits: a3ebf60ae <Joris Van den Bossche> ARROW-7614: Limit size of data in test_parquet.py::test_set_data_page_size Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Romain Francois

commit sha 3c452dc2b12aa2221b9bb319ed1e2613f53c9a31

ARROW-7093: [R] Support creating ScalarExpressions for more data types Closes #6201 from romainfrancois/ARROW-7093/ScalarExpressions and squashes the following commits: b94d0c6dd <Romain Francois> ScalarExpression$create() asserts that x is size 1 or NULL 97d8c86fd <Romain Francois> ScalarExpression$create(factor) 1debcaa74 <Romain Francois> ScalarExpression$create(<hms::hms>) 1a2768b12 <Romain Francois> bit64::integer64 656308a01 <Romain Francois> NullScalar dd2df0b7d <Romain Francois> scalar timestamp d360ae122 <Romain Francois> ScalarExpression == Date f700064e8 <Romain Francois> ScalarExpression$create(<Date>) 9a089a414 <Romain Francois> string scalar expression Authored-by: Romain Francois <romain@rstudio.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

push time in 6 days

push eventwesm/dev-toolchain

Wes McKinney

commit sha 1bd3daee1a0b21f8484a7d957de03c640ea8c113

Add notes syncing function

view details

push time in 6 days

push eventnealrichardson/arrow-site

Wes McKinney

commit sha 145345d77d06fb055b5e382cbf2ee6070d50f8f1

Update publication date

view details

push time in 6 days

issue openedconda-forge/pyarrow-feedstock

Enable S3

arrow-cpp is now built with ARROW_S3=ON, so we can enable S3 support here by adding a flag to build.sh/bld.bat.

see also https://issues.apache.org/jira/browse/ARROW-7827

created time in 6 days

pull request commentapache/arrow

ARROW-7605: [C++] Bundle private jemalloc symbols into static library libarrow.a

I'm going to clean this up as soon as I can and try to include any BUNDLED dependencies in libarrow.a

wesm

comment created time in 6 days

pull request commentapache/arrow

ARROW-7645: [Packaging][deb][RPM] Fix arm64 packaging build

Yes, let's discuss on the mailing list. It looks like we may be able to use Buildkite on ASF repos now (due to some permissions changes in GitHub's APIs), so having some dedicated ARM hardware available on BK (or GitHub Actions self-hosted, if that won't work) would work well. It is almost trivial for third parties to attach workers to a Buildkite instance

kou

comment created time in 6 days

startedgoogle/marl

started time in 6 days

MemberEvent

pull request commentapache/arrow-site

ARROW-7580: [Website] 0.16 release post

I rebased and gave it a skim. I'll leave this open for another half hour or so in case anyone else wants to give it a skim and fix typos or reword things

nealrichardson

comment created time in 6 days

push eventnealrichardson/arrow-site

JackBoosY

commit sha 4411bebc1ebcb7f75181bf57183933df247a3cdf

ARROW-7644: [Doc][Install] Add vcpkg installation instructions arrow is available as a port in vcpkg, a C++ library manager that simplifies installation for arrow and other project dependencies. Documenting the install process here will help users get started by providing a single set of commands to build arrow, ready to be included in their projects. We also test whether our library ports build in various configurations (dynamic, static) on various platforms (OSX, Linux, Windows: x86, x64, UWP, ARM) to keep a wide coverage for users. I'm a maintainer for vcpkg, and [here is what the port script looks like](https://github.com/microsoft/vcpkg/blob/master/ports/arrow/portfile.cmake). We try to keep the library maintained as close as possible to the original library. Closes #42 from JackBoosY/vcpkg-instructions and squashes the following commits: de0a3142 <Sutou Kouhei> Generalize "Other Binary Installer" section 1b5287bf <JackBoosY> Add vcpkg installation instructions Lead-authored-by: JackBoosY <yuzaiyang@beyondsoft.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Sutou Kouhei

commit sha f450d0d891c693e26821f98c9b02d356fab4fd07

ARROW-7790: [Website] Update how to install Linux packages Closes #43 from kou/install-linux and squashes the following commits: 778e9144 <Sutou Kouhei> Update how to install Linux packages Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Krisztián Szűcs

commit sha d622ec2e4955066c274c33f1a98d8a8bd3bf1a88

[Website] Add release note for 0.16.0 (#44)

view details

Neal Richardson

commit sha 24971021fa03c706089e1879942c897d60e85f56

Create outline for 0.16 release post

view details

Antoine Pitrou

commit sha e14e25819a0727beb15f0ad8c6567dce98db6c8f

Add release notes Add highlights for C++, Python, community discussions.

view details

François Saint-Jacques

commit sha 4e842103c985d3fe3dd0accdc2a7a9fa7db7a021

Add dataset changes

view details

Neville Dipale

commit sha d24bda012a8ab33a3831f079eaabc702d0551e8f

Add Rust notes

view details

Yosuke Shiro

commit sha 1bca6c48303b0922d71fbc29a6a4be39e63b51d7

Add Ruby and C GLib notes

view details

Yosuke Shiro

commit sha 545705908799a3eca974faab8b04322f9d73c273

Add syntax highlighting

view details

Neville Dipale

commit sha 50afe2b8468f468e31c2590177a7086be9f24bcf

update Rust notes

view details

Wes McKinney

commit sha 04617088e2ce616a0b640e0c8e286d40780f68f3

Add some notes, simplify some of the C++ section for readability

view details

Neal Richardson

commit sha df713e1426a5d6b0f70037449720253b5261a6e3

Revise/extend 0.16 blog post

view details

emkornfield

commit sha d9d0c7692e2a33d99f050e41e4a87590c7eb57ba

Add some updates for the specification and Java

view details

emkornfield

commit sha d6509d046c731a7f63530806f2f4029cd6192bb2

Added a java bug fix bullet point.

view details

Joris Van den Bossche

commit sha 7aec92b0ca47a77a45d04a5410c40cb0a31d496a

add python 2.7 note

view details

Kenta Murata

commit sha 196e95aa8c3e4eebbdb7a5444cbc31eab5e5a0d2

Fix typo

view details

Wes McKinney

commit sha 10ac57b0cc75ff8d9dd30ed3270c77385217c0ce

Rename file

view details

Wes McKinney

commit sha 4c606ef3c062b868658b8d8e424f670d71d79b7b

Edits for publication

view details

push time in 6 days

pull request commentapache/arrow-site

ARROW-7580: [Website] 0.16 release post

I think ample time has passed... I'm giving it a read through and will merge after that

nealrichardson

comment created time in 6 days

pull request commentapache/arrow

PARQUET-1788: Remove UBSan error when rep/def levels are null

Roger. I guess we might set up a UBSan target with clang "bleeding"

emkornfield

comment created time in 7 days

pull request commentapache/arrow

PARQUET-1788: Remove UBSan error when rep/def levels are null

Is this UBSan issue reflected in any unit test?

emkornfield

comment created time in 7 days

push eventapache/arrow

Micah Kornfield

commit sha 4ef8436555c5a20673fb943dc595956e8648d296

PARQUET-1788: Remove UBSan error when rep/def levels are null Closes #6378 from emkornfield/PARQUET_1788 and squashes the following commits: 498c62728 <Micah Kornfield> PARQUET-1788: Remove UBSan error when rep/def levels are null Authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>

view details

push time in 7 days

issue commentwesm/feather

Should feather::read_feather respect options(stringsAsFactors = FALSE)?

Can you open an issue on the Arrow JIRA issue tracker?

khughitt

comment created time in 7 days

issue commentgetpelican/feedgenerator

Release without .tox directory?

thanks!

wesm

comment created time in 8 days

issue openedgetpelican/feedgenerator

Release without .tox directory?

I was digging into why the feedgenerator tarball is 4.1MB and saw that there were some issues and PRs about this a few years ago, but there has not been a release to PyPI since then. Any chance of a new release?

https://pypi.org/project/feedgenerator/#files

created time in 9 days

issue closedwesm/pydata-book

Not able to download or copy this data file

Not able to download or copy this data file

closed time in 10 days

prafuldash

issue commentwesm/pydata-book

Not able to download or copy this data file

https://github.com/wesm/pydata-book/archive/2nd-edition.zip

prafuldash

comment created time in 10 days

startedmichaelnowotny/cocos

started time in 11 days

pull request commentapache/arrow

PARQUET-1780: [C++] Set ColumnMetadata.encoding_stats field

Definitely will need some unit tests for this to verify correctness

omega-gamage

comment created time in 11 days

issue closedapache/arrow

Extending the API

I have some questions from the documentation on extending Apache Arrow for python and other languages.

  1. The first question: if I create a custom DType for pandas, do I have to register the dtype in pyarrow for the read/write functions to work?

  2. Is there a framework for extending the storage format to support custom query operations in Arrow, similar to the PostGIS functions? See: https://postgis.net/

closed time in 11 days

achapkowski