Christopher Lennan clennan @idealo Berlin, Germany

idealo/imagededup 3264

😎 Finding duplicate images made easy!

idealo/image-quality-assessment 1199

Convolutional Neural Networks to predict the aesthetic and technical quality of images.

idealo/imageatm 193

Image classification for everyone.

idealo/cnn-exposed 164

🕵️‍♂️ Interpreting Convolutional Neural Network (CNN) Results.

idealo/nvidia-docker-keras 52

Workflow that shows how to train neural networks on EC2 instances with GPU support and compares training times to CPUs

clennan/image-quality-assessment 1

Convolutional Neural Networks to predict the aesthetic and technical quality of images.

idealo/wheelwright 1

🎡 Automated build repo for Python wheels (based on spaCy's wheelwright repo)

issue opened vaexio/vaex

[FEATURE-REQUEST] scalable eigenvector computation

When calculating the PCA in https://github.com/vaexio/vaex/blob/5578f8a307f980651bf3390d10c8a9e0acf4315f/packages/vaex-ml/vaex/ml/transformations.py#L94, you first use the vaex-native scalable covariance calculation (https://github.com/vaexio/vaex/blob/923911f47e7324335dfd84bc58a14d4cd6eb7ee6/packages/vaex-core/vaex/dataframe.py#L1093).

However, when computing the eigenvalues, you fall back to NumPy's default implementation:

eigen_values, eigen_vectors = np.linalg.eigh(C)

This has memory limitations and crashes for me on a large matrix. Would it be possible to compute eigh in a vaex-optimized, native way?
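
For illustration only: when just the leading principal components are needed, a full `eigh` decomposition can sometimes be avoided. The sketch below uses orthogonal (subspace) iteration in plain NumPy, which only needs products of `C` with a thin block of vectors. The helper name `top_k_eigh`, the iteration count and the synthetic matrix are assumptions made for this example; none of it is part of vaex.

```python
import numpy as np

def top_k_eigh(C, k, n_iter=50, seed=0):
    """Approximate the k largest eigenpairs of a symmetric PSD matrix C.

    Hypothetical helper (not a vaex API) based on orthogonal iteration.
    """
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    # Start from a random orthonormal n x k block.
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(n_iter):
        # Only the product C @ Q is needed in each step.
        Q, _ = np.linalg.qr(C @ Q)
    # Rayleigh-Ritz step: solve the small k x k eigenproblem exactly.
    small = Q.T @ C @ Q
    values, vectors = np.linalg.eigh(small)
    order = np.argsort(values)[::-1]  # largest eigenvalue first
    return values[order], Q @ vectors[:, order]

# Synthetic symmetric matrix with known eigenvalues 10, 5, 2 and 47 times 0.1.
rng = np.random.default_rng(42)
U, _ = np.linalg.qr(rng.standard_normal((50, 50)))
d = np.concatenate(([10.0, 5.0, 2.0], np.full(47, 0.1)))
C = (U * d) @ U.T

values, vectors = top_k_eigh(C, k=3)
```

How fast this converges depends on the gap between the k-th and (k+1)-th eigenvalue, so it is a trade-off rather than a drop-in replacement for `np.linalg.eigh`.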

created time in 3 hours

issue closed vaexio/vaex

How to sum row-wise?

Hi,

For pandas' df.sum(axis=0), I can use df.sum(df.column_names) in vaex, and it returns the sum over all rows for each column.

How can I do the same for pandas' df.sum(axis=1), i.e. the sum over all columns for each row?
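
To make the two reductions in the question concrete, here is a small sketch with a plain NumPy array (not a vaex dataframe). NumPy follows the same axis convention as pandas, and the sample values are made up.

```python
import numpy as np

# Two rows, three columns.
data = np.array([[1, 2, 3],
                 [4, 5, 6]])

column_totals = data.sum(axis=0)  # collapse the rows: one value per column
row_totals = data.sum(axis=1)     # collapse the columns: one value per row

print(column_totals.tolist())  # [5, 7, 9]
print(row_totals.tolist())     # [6, 15]
```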

Thanks

closed time in 4 hours

mudasirraza

issue comment vaexio/vaex

How to sum row-wise?

Fantastic! Thanks @maartenbreddels. It worked and is way faster (~6.5 minutes) than the 'apply' workaround (~164 minutes) on a machine with 2 vCPUs and 8 GB RAM, for a dataframe with 10,000,000 rows and 100 columns.

mudasirraza

comment created time in 4 hours

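issue comment vaexio/vaex

How to sum row-wise?

Hmm, this seems to be a problem with how the Python AST is handled. Here is a workaround:

import vaex
import numpy as np

# build a test dataframe: 100 columns (c0..c99) with 20 million rows each
v = {f'c{k}': np.arange(20_000_000) for k in range(100)}
df = vaex.from_dict(v)


@vaex.register_function()
def sum_row_wise(*args):
    # copy the first column so the inputs are never modified
    out = args[0].copy()
    for other in args[1:]:
        # accumulate in place to avoid allocating a new array per column
        np.add(other, out, out=out)
    return out

# a single function call replaces the deeply nested c0 + c1 + ... expression
df['z'] = df.func.sum_row_wise(*[df[k] for k in df.get_column_names()])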
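
The body of the registered function is plain NumPy, so it can be sanity-checked on small arrays without vaex. A minimal sketch, assuming the arguments arrive as equally shaped NumPy arrays; the four test columns are made up for the check.

```python
import numpy as np

def sum_row_wise(*args):
    # Same accumulation as in the workaround, without the vaex decorator.
    out = args[0].copy()
    for other in args[1:]:
        np.add(other, out, out=out)  # in-place add into the single output buffer
    return out

# Four columns: arange(5) scaled by 1, 2, 3 and 4.
columns = [np.arange(5) * (k + 1) for k in range(4)]
first_column_before = columns[0].copy()

result = sum_row_wise(*columns)
expected = np.stack(columns, axis=1).sum(axis=1)

print(result.tolist())  # [0, 10, 20, 30, 40]
```

Only one output array is allocated regardless of the number of columns, which is the point of accumulating with `out=out`.
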
mudasirraza

comment created time in 7 hours

issue comment vaexio/vaex

How to sum row-wise?

df.sum('+'.join(df.get_column_names()))

Unfortunately no, it returns the overall sum - a single value.
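
A plausible reading of this result, consistent with what is reported above: the joined string is an ordinary expression such as `c0+c1+c2`, which is evaluated per row, and the surrounding `sum` then aggregates those per-row values into one number. The sketch below mimics the two steps with NumPy arrays; the column names and values are made up.

```python
import numpy as np

column_names = ["c0", "c1", "c2"]  # made-up column names
expression = "+".join(column_names)
print(expression)  # c0+c1+c2

# Made-up data: c0 = [1, 2, 3], c1 = [2, 4, 6], c2 = [3, 6, 9].
columns = {name: np.array([1, 2, 3]) * (i + 1) for i, name in enumerate(column_names)}

# Step 1: the expression itself is the row-wise sum.
row_wise = columns["c0"] + columns["c1"] + columns["c2"]
# Step 2: aggregating that expression collapses all rows into a single value.
overall = int(row_wise.sum())

print(row_wise.tolist())  # [6, 12, 18]
print(overall)            # 36
```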

mudasirraza

comment created time in 14 hours

issue comment vaexio/vaex

Slow join on CSV files

When I join the first 6 files, vaex runs faster without conversion (CSV to HDF5). These files have no duplicates in the join key. But when I join the first 7 files (the last one has duplicates in the join key), it takes hours to complete and sometimes ends with the message "Killed".

To reproduce this issue, generate the data using the code already provided and join 7 or more files (the last 3 of the 9 files have duplicates in the join key).
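
As general background, and not a confirmed diagnosis of this report: in a relational join, every duplicated key multiplies the number of matching rows, so a few duplicates can inflate the result considerably once several joins are chained. The helper below is hypothetical and only counts the rows an inner join would produce; the key lists are made up.

```python
from collections import Counter

def inner_join_row_count(left_keys, right_keys):
    """Number of rows an inner join on these keys would produce (hypothetical helper)."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    return sum(count * right[key] for key, count in left.items())

unique_rows = inner_join_row_count([1, 2, 3], [1, 2, 3])
duplicate_rows = inner_join_row_count([1, 1, 2, 3], [1, 1, 1, 2])

print(unique_rows)     # 3
print(duplicate_rows)  # 7
```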

lijose

comment created time in 16 hours

issue comment vaexio/vaex

How to sum row-wise?

Hi, maybe this will work:

df.sum('+'.join(df.get_column_names()))

mudasirraza

comment created time in a day

issue comment vaexio/vaex

How to sum row-wise?

What did you do to trigger that? But are you saying that the apply works, while my solution gives a problem? If so, that would be interesting..

To be precise, I ran the following code in the Python console:

import vaex
df = vaex.open('100x100.csv.hdf5')
ds = sum(df[k] for k in df.get_column_names())
ds  # MemoryError

I see the following in the log:

s_push: parser stack overflow raise KeyError("Unknown variables or column: %r" % (variable,)) KeyError: "Unknown variables or column: '((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((0 + col0) + col1) + col2) + col3) + col4) + col5) + col6) + col7) + col8) + col9) + col10) + col11) + col12) + col13) + col14) + col15) + col16) + col17) + col18) + col19) + col20) + col21) + col22) + col23) + col24) + col25) + col26) + col27) + col28) + col29) + col30) + col31) + col32) + col33) + col34) + col35) + col36) + col37) + col38) + col39) + col40) + col41) + col42) + col43) + col44) + col45) + col46) + col47) + col48) + col49) + col50) + col51) + col52) + col53) + col54) + col55) + col56) + col57) + col58) + col59) + col60) + col61) + col62) + col63) + col64) + col65) + col66) + col67) + col68) + col69) + col70) + col71) + col72) + col73) + col74) + col75) + col76) + col77) + col78) + col79) + col80) + col81) + col82) + col83) + col84) + col85) + col86) + col87) + col88) + col89) + col90) + col91) + col92) + col93)'" .. MemoryError
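
The nesting visible in this log follows from how the built-in `sum` works: it starts at 0 and wraps the running result once per column, so the parenthesis depth grows linearly with the number of columns. The "s_push: parser stack overflow" text appears to come from the older CPython parser, which gives up on very deeply nested input. The sketch below reproduces only the shape of the expression string with hypothetical helpers (`left_nested`, `balanced`, `max_depth`); it does not use vaex, and the balanced variant is an illustration rather than the fix adopted in the thread.

```python
from functools import reduce

def left_nested(names):
    """Mimic the string built by sum(df[k] for k in names): (((0 + a) + b) + c)."""
    return reduce(lambda acc, name: f"({acc} + {name})", names, "0")

def balanced(names):
    """Pairwise (tree) reduction: the nesting depth grows like log2(n)."""
    if len(names) == 1:
        return names[0]
    mid = len(names) // 2
    return f"({balanced(names[:mid])} + {balanced(names[mid:])})"

def max_depth(expr):
    """Deepest level of parenthesis nesting in an expression string."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

names = [f"col{i}" for i in range(100)]
print(max_depth(left_nested(names)))  # 100
print(max_depth(balanced(names)))     # 7
```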

And for apply:

import vaex
df = vaex.open('100x100.csv.hdf5')

def sum_row(*args):
    # add the columns one by one; avoid shadowing the built-in sum
    total = 0
    for a in args:
        total = total + a
    return total

ds = df.apply(sum_row, arguments=df.column_names)
ds.evaluate()  # returns the array

NumPy version: 1.19.4
Vaex version: 3.0.0

mudasirraza

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

 def apply(self, f):
         :param f: A function to be applied on the Expression values

yeah

maartenbreddels

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

 def apply(self, f, arguments=None, dtype=None, delay=False, vectorize=False):
         :param f: The function to be applied
         :param arguments: List of arguments to be passed on to the function f.
+        :param bool multiprocessing: Use multiple processes to avoid the GIL (Global interpreter lock)

yeah, vectorize at least

maartenbreddels

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

    "source": [
     "df.scale.mul(2)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# The escape hatch: apply\n",
+    "In case a calculation cannot be expressed in a Vaex expression, the last resort method is to use apply. This can be useful if the function you want to apply is written in pure Python, and difficult or impossible to vectorize.\n",
+    "\n",
+    "We think apply should only be used as a last resort, because it needs to use multiprocessing (which spawns new processes) to avoid the Python Global Interpreter Lock (GIL) to make use of multiple cores. This comes at a cost of having to transfer the data between the main and child processes.\n",
+    "\n",
+    "In case you really want, or need to use it, here is how you use it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:23.880238Z",
+     "start_time": "2020-11-27T18:59:23.282738Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<table>\n",
+       "<thead>\n",
+       "<tr><th>#                            </th><th style=\"text-align: right;\">  x</th><th>is_prime  </th></tr>\n",
+       "</thead>\n",
+       "<tbody>\n",
+       "<tr><td><i style='opacity: 0.6'>0</i></td><td style=\"text-align: right;\">  0</td><td>False     </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>1</i></td><td style=\"text-align: right;\">  1</td><td>False     </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>2</i></td><td style=\"text-align: right;\">  2</td><td>True      </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>3</i></td><td style=\"text-align: right;\">  3</td><td>True      </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>4</i></td><td style=\"text-align: right;\">  4</td><td>False     </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>5</i></td><td style=\"text-align: right;\">  5</td><td>True      </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>6</i></td><td style=\"text-align: right;\">  6</td><td>False     </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>7</i></td><td style=\"text-align: right;\">  7</td><td>True      </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>8</i></td><td style=\"text-align: right;\">  8</td><td>False     </td></tr>\n",
+       "<tr><td><i style='opacity: 0.6'>9</i></td><td style=\"text-align: right;\">  9</td><td>False     </td></tr>\n",
+       "</tbody>\n",
+       "</table>"
+      ],
+      "text/plain": [
+       "  #    x  is_prime\n",
+       "  0    0  False\n",
+       "  1    1  False\n",
+       "  2    2  True\n",
+       "  3    3  True\n",
+       "  4    4  False\n",
+       "  5    5  True\n",
+       "  6    6  False\n",
+       "  7    7  True\n",
+       "  8    8  False\n",
+       "  9    9  False"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import vaex\n",
+    "\n",
+    "def slow_is_prime(x):\n",
+    "    return x > 1 and all((x % i) != 0 for i in range(2, x))\n",
+    "\n",
+    "df = vaex.from_arrays(x=vaex.vrange(0, 100_000, dtype='i4'))\n",
+    "# you need to explicitly tell which arguments you need\n",
+    "df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])\n",
+    "df.head(10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:25.178658Z",
+     "start_time": "2020-11-27T18:59:23.881404Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "There are 9592 prime number between 0 and 100000\n"
+     ]
+    }
+   ],
+   "source": [
+    "prime_count = df.is_prime.sum()\n",
+    "print(f'There are {prime_count} prime number between 0 and {len(df)}')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:25.181779Z",
+     "start_time": "2020-11-27T18:59:25.179695Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# both of these are equivalent\n",
+    "df['is_prime'] = df.apply(slow_is_prime, arguments=[df.x])\n",
+    "# but this form only works for a single argument\n",
+    "df['is_prime'] = df.x.apply(slow_is_prime)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When not to use apply\n",

this never formats correctly with nbsphinx

maartenbreddels

comment created time in a day

issue comment vaexio/vaex

How to sum row-wise?

Hi Mudasir.

Unfortunately, that solution didn't work for me (Memory Error)

What did you do to trigger that?

Probably because that apply is really slow, since it will not be done parallel/vectorized.

Yes, although that will change with #1080, I would not recommend it. But are you saying that the apply works, while my solution gives a problem? If so, that would be interesting...

Which version of numpy do you use?

mudasirraza

comment created time in a day

issue comment vaexio/vaex

[BUG-REPORT] export_hdf5 fails after type change from string to datetime

Great, that file helped me reproduce it! Hope to be able to fix it soon.

joeybellerose

comment created time in a day

issue comment vaexio/vaex

[BUG-REPORT] Vaex cannot open more than 816 parquet files on s3

Being in the same region did not matter; I managed to reproduce it. It seems that, for some reason, the underlying library curl has some issues. Caching the region resolution and the filesystem object (which is a good idea anyway) avoids the issue. I'll let you know when #1085 is released. Thanks for reporting, I would never have found this out otherwise.

aburkov

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

 def apply(self, f):
         :param f: A function to be applied on the Expression values

Should we document all kwargs here?

maartenbreddels

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

 def apply(self, f, arguments=None, dtype=None, delay=False, vectorize=False):
         :param f: The function to be applied
         :param arguments: List of arguments to be passed on to the function f.
+        :param bool multiprocessing: Use multiple processes to avoid the GIL (Global interpreter lock)

Should we take this opportunity to document the rest of the kwargs here?

maartenbreddels

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When not to use apply\n",
    "## When not to use `apply`\n",
maartenbreddels

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When not to use apply\n",
+    "When your function can easily be vectorized, you should not use apply. When you use Vaex' expression system, we know what you do, we see the expression, and can manipulate it. An apply function is like a black box, we cannot do anything with it, like JIT-ting for instance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:25.210441Z",
+     "start_time": "2020-11-27T18:59:25.182779Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "df = vaex.from_arrays(x=vaex.vrange(0, 10_000_000, dtype='f4'))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:25.216243Z",
+     "start_time": "2020-11-27T18:59:25.211135Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# ideal case\n",
+    "df['y'] = df.x**2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:27.620809Z",
+     "start_time": "2020-11-27T18:59:25.217020Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "29.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "df.y.sum()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:27.623180Z",
+     "start_time": "2020-11-27T18:59:27.621623Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# will transfer the data to child processes, and execute the ** in Python it for each element\n",
+    "df['y_slow'] = df.x.apply(lambda x: x**2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:30.575048Z",
+     "start_time": "2020-11-27T18:59:27.623860Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "353 ms ± 40 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "df.y_slow.sum()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:30.577718Z",
+     "start_time": "2020-11-27T18:59:30.576036Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# bad idea: it will transfer the data to the child process, where it will be executed in vectorized form \n",
+    "df['y_slow_vectorized'] = df.x.apply(lambda x: x**2, vectorize=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:37.285060Z",
+     "start_time": "2020-11-27T18:59:30.578516Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "82.8 ms ± 525 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%timeit\n",
+    "df.y_slow_vectorized.sum()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2020-11-27T18:59:37.287763Z",
+     "start_time": "2020-11-27T18:59:37.286036Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# bad idea: same performance as just dy['y'], but we lose the information about what you have done\n",
    "# bad idea: same performance as just dy['y'], but we lose the information about what was done\n",
maartenbreddels

comment created time in a day
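
The timing cells in the notebook diff above report 29.6 ms for the plain expression, 353 ms for the element-wise `apply` and 82.8 ms for `apply` with `vectorize=True`. As a rough illustration of where that difference comes from, the sketch below compares a vectorized operation with a per-element Python call on a plain NumPy array. It does not use vaex, involves no multiprocessing, and makes no claim about actual timings.

```python
import numpy as np

x = np.arange(0, 1_000, dtype="f4")

# Vectorized: a single call operates on the whole array in compiled code.
y_vectorized = x ** 2

# Element-wise: the Python function is called once per value, which is
# roughly the work a non-vectorized apply has to do for every row.
def square(v):
    return v ** 2

y_elementwise = np.array([square(v) for v in x], dtype="f4")

print(bool(np.array_equal(y_vectorized, y_elementwise)))  # True
```

Both paths give the same values; only the number of Python-level calls differs (one versus one per element).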

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

+   "source": [
+    "prime_count = df.is_prime.sum()\n",
+    "print(f'There are {prime_count} prime number between 0 and {len(df)}')"
    "print(f'There are {prime_count} prime numbers between 0 and {len(df)}')"
maartenbreddels

comment created time in a day

Pull request review comment vaexio/vaex

feat: apply with multiprocessing

+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When not to use apply\n",
+    "When your function can easily be vectorized, you should not use apply. When you use Vaex' expression system, we know what you do, we see the expression, and can manipulate it. An apply function is like a black box, we cannot do anything with it, like JIT-ting for instance."
    "You should not use `apply` when your function can be vectorized. When you use Vaex' expression system, we know what you do, we see the expression, and can manipulate it in order to achieve optimal performance. An `apply` function is like a black box, we cannot do anything with it, like JIT-ting for instance."
maartenbreddels

comment created time in a day

Pull request review commentvaexio/vaex

feat: apply with multiprocessing

+  {+   "cell_type": "markdown",+   "metadata": {},+   "source": [+    "## When not to use apply\n",+    "When your function can easily be vectorized, you should not use apply. When you use Vaex' expression system, we know what you do, we see the expression, and can manipulate it. An apply function is like a black box, we cannot do anything with it, like JIT-ting for instance."+   ]+  },+  {+   "cell_type": "code",+   "execution_count": 4,+   "metadata": {+    "ExecuteTime": {+     "end_time": "2020-11-27T18:59:25.210441Z",+     "start_time": "2020-11-27T18:59:25.182779Z"+    }+   },+   "outputs": [],+   "source": [+    "df = vaex.from_arrays(x=vaex.vrange(0, 10_000_000, dtype='f4'))"+   ]+  },+  {+   "cell_type": "code",+   "execution_count": 5,+   "metadata": {+    "ExecuteTime": {+     "end_time": "2020-11-27T18:59:25.216243Z",+     "start_time": "2020-11-27T18:59:25.211135Z"+    }+   },+   "outputs": [],+   "source": [+    "# ideal case\n",+    "df['y'] = df.x**2"+   ]+  },+  {+   "cell_type": "code",+   "execution_count": 6,+   "metadata": {+    "ExecuteTime": {+     "end_time": "2020-11-27T18:59:27.620809Z",+     "start_time": "2020-11-27T18:59:25.217020Z"+    }+   },+   "outputs": [+    {+     "name": "stdout",+     "output_type": "stream",+     "text": [+      "29.6 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"+     ]+    }+   ],+   "source": [+    "%%timeit\n",+    "df.y.sum()"+   ]+  },+  {+   "cell_type": "code",+   "execution_count": 7,+   "metadata": {+    "ExecuteTime": {+     "end_time": "2020-11-27T18:59:27.623180Z",+     "start_time": "2020-11-27T18:59:27.621623Z"+    }+   },+   "outputs": [],+   "source": [+    "# will transfer the data to child processes, and execute the ** in Python it for each element\n",
    "# will transfer the data to child processes, and execute the ** operation in Python for each element\n",
maartenbreddels

comment created time in a day

Pull request review commentvaexio/vaex

feat: apply with multiprocessing

+    "# you need to explicitly tell which arguments you need\n",
    "# you need to explicitly specify which arguments you need\n",
maartenbreddels

comment created time in a day

Pull request review commentvaexio/vaex

feat: apply with multiprocessing

+    "In case you really want, or need to use it, here is how you use it:"
    "Here is an example which uses the `apply` method:"
maartenbreddels

comment created time in a day

Pull request review commentvaexio/vaex

feat: apply with multiprocessing

+    "In case a calculation cannot be expressed in a Vaex expression, the last resort method is to use apply. This can be useful if the function you want to apply is written in pure Python, and difficult or impossible to vectorize.\n",
    "In case a calculation cannot be expressed as a Vaex expression, one can use the `apply` method as a last resort. This can be useful if the function you want to apply is written in pure Python or comes from a third-party library, and is difficult or impossible to vectorize.\n",
maartenbreddels

comment created time in a day

push eventvaexio/vaex

Maarten A. Breddels

commit sha fa056a1b64896080b50afa0d83bb14034cb18cd4

fix: cache filesystem object to avoid #1084

For some reason, if we request many files, and request the region many times, curl (underlying arrow) fails to resolve the DNS name. In any case, it makes sense to cache regions and filesystem objects.

view details

push time in a day

PR opened vaexio/vaex

fix: cache filesystem object to avoid #1084

For some reason, if we request many files, and request the region many times, curl (underlying arrow) fails to resolve the DNS name. In any case, it makes sense to cache regions and filesystem objects.

Fixes #1084

+39 -8

0 comments

2 changed files

pr created time in a day

create branchvaexio/vaex

branch : fix_cache_fs_object

created branch time in a day

issue commentvaexio/vaex

[BUG-REPORT] export_hdf5 fails after type change from string to datetime

Hi Maarten,

The first row that has a problem is 1025, but I am unable to replicate the error manually when using those text inputs. I also tried switching between arrow and hdf5 to see if that was the problem. I got a different error, but still could not write the file.

Here is a link to a sample of the data - https://gist.github.com/joeybellerose/356ffb6ee7e2d8393cab466bff8130cc

Downgrading to Vaex 3.0 solved the problem, although the reason I went to 4.0.0a4 was to solve some of the join issues I was running into.

Thanks,

Joey

joeybellerose

comment created time in a day

issue commentvaexio/vaex

How to sum row-wise?

Hi Maarten,

Thanks for the response. Unfortunately, that solution didn't work for me (Memory Error). I have a very big dataframe (hdf5) with up to 20 million rows and 100 columns.

However, I tried the following workaround; it worked, but it takes a very long time, probably because apply is really slow: it is not parallelized or vectorized.


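def sum_row(*args):
    # Add up the values of all columns for a single row.
    total = 0  # named total to avoid shadowing the built-in sum()
    for a in args:
        total = total + a
    return total

# apply calls sum_row in Python for every row
ds = df.apply(sum_row, arguments=df.column_names)
ds.evaluate()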
mudasirraza

comment created time in a day
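
The workaround in the comment above calls a Python function once per row. One common alternative is to add the columns together so that a single combined expression is evaluated instead. Whether this matches the solution eventually suggested in the thread is not shown in this feed. The sketch below demonstrates only the folding step, using a hypothetical `Expr` class as a stand-in for a lazy column expression; `Expr` and `row_sum` are names invented for this example and are not Vaex APIs.

```python
import operator
from functools import reduce


class Expr:
    """Hypothetical stand-in for a lazy column expression (not a Vaex class)."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Adding two expressions builds a new expression; nothing is computed yet.
        return Expr(f"({self.text} + {other.text})")


def row_sum(columns):
    # Fold the columns with +, producing one combined object.
    return reduce(operator.add, columns)


total = row_sum([Expr("a"), Expr("b"), Expr("c")])
print(total.text)
```

Because `row_sum` only relies on `+`, it works on anything that supports addition. With a Vaex DataFrame, folding over `[df[name] for name in df.column_names]` should yield one expression that Vaex can evaluate in chunks, assuming every column is numeric.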

more