
Arrow Record Distributed Result Batches #544

Merged: 27 commits into snowflakedb:master, Feb 10, 2022

Conversation

@ddl-giuliocapolino (Contributor) commented Feb 8, 2022

Description

For performance and ease-of-use purposes, users may need to access data in arrow.Record format directly from queries, rather than receiving it as sql.Rows and having to convert it manually. This PR covers the implementation and tests for getting these records via a new API endpoint called GetResultBatches. We use ResultBatch objects that can independently fetch chunks of data in arrow.Record format. GetResultBatches returns these objects in the correct order, allowing the user to compose the full data in Arrow format, in parallel or not. Below is a sample client request to fetch the data:

// request distributed result batches via the context
rbCtx := sf.WithDistributedResultBatches(ctx)
d := sf.SnowflakeDriver{}
db, err := d.Open(dsn)
if err != nil {
	return nil, err
}
snowflakeDb := db.(sf.SnowflakeConnection)

// run the query with the batch-enabled context
rows, err := snowflakeDb.QueryContext(rbCtx, query, []driver.NamedValue{})
if err != nil {
	return nil, err
}

// retrieve the ordered ResultBatch objects instead of materialized rows
batches, err := rows.(sf.SnowflakeRows).GetBatches()
if err != nil {
	return nil, err
}

For a client consuming data in arrow.Record format, this can yield more than a 10x speedup in retrieving usable data, and it gives the client the ability to manage memory consumption independently.
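As a rough illustration of the parallel composition described above, the batches can be fanned out across goroutines. The following is a minimal sketch, not part of the PR: it assumes the batches slice from the snippet above, that Fetch returns the batch's records as (*[]array.Record, error) (the getter-style signature the review below converges on), and imports of sync and the Arrow array package.

// Sketch: compose the full result set in order while fetching batches in parallel.
var wg sync.WaitGroup
records := make([][]array.Record, len(batches)) // index i preserves batch order
for i, b := range batches {
	wg.Add(1)
	go func(i int, b *sf.ResultBatch) {
		defer wg.Done()
		recs, err := b.Fetch() // each batch downloads its own chunk independently
		if err != nil {
			return // real code would collect and surface the error
		}
		records[i] = *recs
	}(i, b)
}
wg.Wait()
// records now holds the complete query result in Arrow format, in order.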

Checklist

  • Code compiles correctly
  • Run make fmt to fix inconsistent formats
  • Run make lint to get lint errors and fix all of them
  • Created tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

ddl-giuliocapolino and others added 23 commits December 8, 2021 10:34

  • …w_record_channel
  • WithArrowRecordsChan - channel in context for Arrow Records
  • …_converter
  • Converter for Snowflake arrow.Record type
  • …ork-repo
  • ddl: update go.mod and publish fork
  • …ributed-batches
  • withDistributedBatches refactor
  • …-dominodatalab-module
  • Reverting dominodatalab module
@sfc-gh-jbahk (Contributor) left a comment:

This looks great; thank you for your work. I had a couple questions that I'd love to hear your answers on before we proceed.

rows.go Outdated
@@ -27,6 +27,11 @@ var (
maxChunkDownloaderErrorCounter = 5
)

// SnowflakeRows provides an API for methods exposed to the clients
type SnowflakeRows interface {
Contributor:

While this makes sense, I think we can consolidate this into SnowflakeResult since this object can and should be able to return both a query ID and its status as well. Asking the user to convert this into separate interfaces seems cumbersome for both sides in terms of usability & maintenance.

Contributor Author:

Sounds good. It felt a bit strange to have SnowflakeResult contain an endpoint related to data, since Result usually refers to an exec query that does not return any data, but I agree it's better to expose only one interface. Related to that, would you prefer that the implementation of GetBatches() ([]*ResultBatch, error) by snowflakeResult returns an error, or returns an empty array (and nil)?

Contributor Author:

I pushed it out with the former (returns an error), but happy to change it to an empty array if you'd rather have that!

Contributor:

Completely understandable - however, this SnowflakeResult object was meant to be detached from the snowflakeResult (related to Exec) and to denote a literal result. Perhaps a confusing misnomer on our end.

I think the return value seems fine as is for now. I'll let you know if it needs changing or change it myself.

chunk_test.go Outdated
@@ -387,3 +388,44 @@ func TestWithStreamDownloader(t *testing.T) {
}
})
}

func TestWithDistributedResultBatches(t *testing.T) {
Contributor:

This test succinctly demonstrates the API pathway, but is there another test that can be added to demonstrate that these batches are indeed fetched in a distributed manner? 100 rows is not a lot and would probably be fetched within one chunk. This test doesn't demonstrate to me that these batches can be called independently.

More importantly, with these changes, do you see your memory issues improved? Solved?

Contributor Author:

Sounds good - I will have a first PR iteration with the logic changes, and a second focusing only on this test, so that we can iterate on the former first! In terms of memory, it's fully up to the user to handle the workers as they wish - in our case we optimized the client side for parallel downloading while releasing the records as soon as they've been read and written, and we have not seen any memory issues (queried all the way up to 10GB with very little memory pressure but really good performance).

Contributor Author:

^ changed it to 3000 rows, which is approximately 6 ResultBatch objects (ran it a few times and, on average, it's 6 chunks of data)

Contributor:

The test seems better, but is there a way to explicitly illustrate these batches being fetched in different worker nodes? Something like multiple goroutines, perhaps? Maybe there are some other ways to demonstrate this, but at face value this test doesn't tell me much.

Contributor Author:

Just updated the test with clear download workers - it now also checks that the metadata received (rowcount) is correct :)
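The worker pattern that test exercises might look roughly like the following helper (hypothetical, not the actual chunk_test.go code; it assumes a batches slice from GetBatches, the record-returning Fetch signature settled on below, and imports of sync and the Arrow array package):

// countRowsConcurrently fetches every batch in its own goroutine, the way
// independent download workers would, and tallies rows so the caller can
// compare the total against the query's rowcount metadata.
func countRowsConcurrently(batches []*ResultBatch) (int64, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		total    int64
		firstErr error
	)
	for _, b := range batches {
		wg.Add(1)
		go func(b *ResultBatch) { // one download worker per batch
			defer wg.Done()
			recs, err := b.Fetch() // assumed signature: (*[]array.Record, error)
			if err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
				return
			}
			var n int64
			for _, rec := range *recs {
				n += rec.NumRows()
				rec.Release() // release each record as soon as it has been counted
			}
			mu.Lock()
			total += n
			mu.Unlock()
		}(b)
	}
	wg.Wait()
	return total, firstErr
}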

}

// Fetch returns an array of records representing a chunk in the query
func (rb *ResultBatch) Fetch() error {
Contributor:

I'm curious as to the design here: shouldn't this return the actual array.Record itself (along with an error), instead of storing the record as an accessible object within the ResultBatch struct?

Contributor Author:

My initial rationale was to avoid a double memory burden, but since the object contains []*array.Record, it won't hurt to have it twice: the user can Fetch it but also re-use it in the future if necessary (double benefit). Changing it now.

Contributor:

Does the object need to contain the record? Can it just be a product of the Fetch() function and nothing else? A close analogy would be returning a row object after a query, but you'd need to query the rows again in order to receive the same rows again; the driver won't do that for you.

Contributor Author:

In this case we do need the Rec on the object because we are leveraging the downloadChunkHelper, which only returns an error. It's analogous to how rows are downloaded in chunks within the downloadChunkHelper and then applied to the scd.Chunks object, without explicitly being returned. I don't think we can omit that without a major refactor of how the download workflow happens, unfortunately. However, we can make Rec non-public (rec) if you'd rather not have it exposed by the object - we thought it'd be nice for a user to have a pointer to it within the object itself in case it is needed (it's a cheap extra functionality), but it's up to you if you'd rather have it lowercase!

Contributor:

Thanks for the quick turnaround. I think the rec object might best be left private so that users can't control it, with Fetch acting as a getter (as you've done).

It looks overall very good to me - I'm going to go over these changes today in a design review and come back with final feedback.

Contributor Author:

Sounds good. I will change it to private later this evening, along with the clearer test. Consider it done for your design review!
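The shape this thread converges on - records kept in an unexported field, with Fetch acting as the getter - might look roughly like this sketch (field and helper names are assumed, not the merged code):

// ResultBatch keeps its records private so users cannot control them directly.
type ResultBatch struct {
	rec *[]array.Record           // populated by the chunk download helper
	idx int                       // position of this batch's chunk in the result set
	scd *snowflakeChunkDownloader // used to download the chunk on demand
}

// Fetch downloads the batch's chunk on first use and acts as a getter afterwards.
func (rb *ResultBatch) Fetch() (*[]array.Record, error) {
	if rb.rec != nil {
		return rb.rec, nil // already downloaded; just return the records
	}
	// The download helper stores the records on rb.rec and only returns an error,
	// which is why the struct carries the rec field at all (hypothetical method name).
	if err := rb.downloadChunkHelper(); err != nil {
		return nil, err
	}
	return rb.rec, nil
}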

@sfc-gh-jbahk (Contributor) left a comment:

We had a design review yesterday and there were a couple of points --

  • the ultimate point we wanted to reach with distributed fetching was being able to serialize these ResultBatches so that workers across different machines can pick them up independently

  • the fact that we require the chunkdownloader as a member in the struct hinders this and requires some refactoring, which we (Snowflake) will get to

  • this has a very good use case in and of itself, but we envision having to reserve this API/pathway, so we'd like to merge it with a couple of nomenclature tweaks

@@ -34,6 +35,7 @@ type chunkDownloader interface {
getRowType() []execResponseRowType
setNextChunkDownloader(downloader chunkDownloader)
getNextChunkDownloader() chunkDownloader
getResultBatches() []*ResultBatch
Contributor:

getArrowBatches() []*ArrowBatch

chunk_test.go Outdated
@@ -387,3 +389,78 @@ func TestWithStreamDownloader(t *testing.T) {
}
})
}

func TestWithDistributedResultBatches(t *testing.T) {
Contributor:

TestWithArrowBatches()

@@ -630,3 +683,43 @@ func copyChunkStream(body io.Reader, rows chan<- []*string) error {
}
return nil
}

// ResultBatch object represents a chunk of data, or subset of rows, retrievable in array.Record format
type ResultBatch struct {
Contributor:

ArrowBatch

@@ -245,6 +259,33 @@ func getChunk(
return newRetryHTTP(ctx, scd.sc.rest.Client, http.NewRequest, u, headers, timeout).execute()
}

func (scd *snowflakeChunkDownloader) startDistributedBatches() error {
Contributor:

startArrowBatches()

result.go Outdated
type SnowflakeResult interface {
GetQueryID() string
GetStatus() queryStatus
GetBatches() ([]*ResultBatch, error)
Contributor:

GetArrowBatches()

util.go Outdated
@@ -23,6 +23,7 @@ const (
fileStreamFile contextKey = "STREAMING_PUT_FILE"
fileTransferOptions contextKey = "FILE_TRANSFER_OPTIONS"
enableHigherPrecision contextKey = "ENABLE_HIGHER_PRECISION"
distributedResultBatches contextKey = "DISTRIBUTED_RESULT_BATCH"
Contributor:

arrowBatches

util.go Outdated
@@ -83,6 +84,12 @@ func WithHigherPrecision(ctx context.Context) context.Context {
return context.WithValue(ctx, enableHigherPrecision, true)
}

// WithDistributedResultBatches returns a context that allows users to retrieve
// array.Record download workers upon querying
func WithDistributedResultBatches(ctx context.Context) context.Context {
Contributor:

WithArrowBatches()
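Taken together, the rename suggestions above yield roughly this public surface (a sketch assembled from the review snippets, not the final merged code):

// ArrowBatch replaces ResultBatch; the batch mechanics are unchanged.
type ArrowBatch struct{ /* unexported rec, chunk downloader, etc. */ }

// SnowflakeResult with the renamed batch accessor.
type SnowflakeResult interface {
	GetQueryID() string
	GetStatus() queryStatus
	GetArrowBatches() ([]*ArrowBatch, error) // formerly GetBatches
}

// WithArrowBatches replaces WithDistributedResultBatches.
func WithArrowBatches(ctx context.Context) context.Context {
	return context.WithValue(ctx, arrowBatches, true)
}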

@sfc-gh-jbahk (Contributor):

Local tests pass successfully. Merging.

@sfc-gh-jbahk merged commit 1bd84af into snowflakedb:master Feb 10, 2022
@ddl-giuliocapolino (Contributor Author):

Thanks @sfc-gh-jbahk !

agam pushed a commit to sigmacomputing/gosnowflake that referenced this pull request Feb 15, 2022
* arrow record channel in context impl + unit test

* iteration on comments 1

* changed mutex map struct to sync.map

* final iteration on comments

* added converter for snowflake records

* added converter tests

* releasing record as soon as possible

* fix value has non-zero nanoseconds error

* iteration on comments

* withDistributedBatches refactor

* reverting withArrowRecordChan changes

* adjusting go.mod for forking

* first iteration on comments

* final iteration on comments

* reverting module change

* revert exposure of SnowflakeRows, QueryContext, and change signature of Fetch

* improved chunk_test to query more than 1 chunk

* improved chunk_test to show download workers, added metadata, and made Rec private

* nomenclature tweaks
@stoffeastrom:

Hi 👋

I was playing around with the latest release where this is included, but I can't seem to access the arrow batches? This seems to be completely internal. The example from @ddl-giuliocapolino seems outdated.

rbCtx := sf.WithDistributedResultBatches(ctx)
d := sf.SnowflakeDriver{}
db, err := d.Open(dsn)
if err != nil {
	return nil, err
}
snowflakeDb := db.(sf.SnowflakeConnection)

rows, err := snowflakeDb.QueryContext(rbCtx, query, []driver.NamedValue{})
if err != nil {
	return nil, err
}

batches, err := rows.(sf.SnowflakeRows).GetBatches()
if err != nil {
	return nil, err
}

This seems to suggest it could be accessed through SnowflakeConnection, but this interface only has

type SnowflakeConnection interface {
	GetQueryStatus(ctx context.Context, queryID string) (*SnowflakeQueryStatus, error)
}

but I might be misinterpreting things :D

agam added a commit to sigmacomputing/gosnowflake that referenced this pull request Feb 22, 2022
* Refactor & Lint (snowflakedb#525)

* move function calls

* fix lint

* SNOW-530866: Bumped up GoLang connector PATCH version from 1.6.5 to 1.6.6 (snowflakedb#529)

* SNOW-534077 Multi-Statement Missing Result IDs (snowflakedb#534)

* SNOW-440362 Implement UUIDv4 (snowflakedb#541)

* implement uuid

* unexport rfc

* SNOW-535399 Fix GetQueryStatus Bug (snowflakedb#539)

* change errorcode to string

* fix emtpy query status

* fix null pointer dereference

* SNOW-521578 PUT Memory Enhancements (snowflakedb#527)

* modify block size for azure

* fix incorrect source

* open file v read

* move chunk out of loop

* Arrow Record Result Batches (snowflakedb#544)

* arrow record channel in context impl + unit test

* iteration on comments 1

* changed mutex map struct to sync.map

* final iteration on comments

* added converter for snowflake records

* added converter tests

* releasing record as soon as possible

* fix value has non-zero nanoseconds error

* iteration on comments

* withDistributedBatches refactor

* reverting withArrowRecordChan changes

* adjusting go.mod for forking

* first iteration on comments

* final iteration on comments

* reverting module change

* revert exposure of SnowflakeRows, QueryContext, and change signature of Fetch

* improved chunk_test to query more than 1 chunk

* improved chunk_test to show download workers, added metadata, and made Rec private

* nomenclature tweaks

* Formatting & Lint (snowflakedb#545)

* Fix Select 1 Bug (snowflakedb#511)

Make port optional

Co-authored-by: sfc-gh-jbahk <[email protected]>

* SNOW-526255 Fix Time Scale for Arrow (snowflakedb#547)

* fix time scale issue

* add timezone in ltz

* remove high precision context

* Allow Client to specify ClientTimeout in the DSN (snowflakedb#484)

* add failing test

* add fix

* SNOW-544029: Bumped up GoLang connector PATCH version from 1.6.6 to 1.6.7 (snowflakedb#548) (snowflakedb#549)

* Fixes after syncing with upstream.

* Modify instructions to reflect the process I followed this time.

Co-authored-by: sfc-gh-jbahk <[email protected]>
Co-authored-by: Kiran Dama <[email protected]>
Co-authored-by: ddl-giuliocapolino <[email protected]>
Co-authored-by: yiksanchan <[email protected]>
Co-authored-by: Greg Owen <[email protected]>
Co-authored-by: Agam Brahma <[email protected]>
@stoffeastrom:

@sfc-gh-jbahk Pls see my comment above. Can we get an example of how to use arrow batches directly?

@sfc-gh-jbahk (Contributor):

Hi @stoffeastrom, this is an internal feature and we have not publicly documented it for this reason.

I also don't actively develop the go driver anymore - please tag @sfc-gh-cshi for future inquiries.
