SNOW-1896153: Non deterministic errors with concurrent context aware queries in v1.12.1 #1292
hi - thank you for letting us know about this issue and for the example. Will look into it.
I'm trying to reproduce the issue the following way:
```sql
create stage mystage;
create table testtable (contents VARIANT);
```
```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"os"
	"sync"
	"time"

	_ "github.com/snowflakedb/gosnowflake" // Snowflake driver
)

func UploadFilesAndCopy(db *sql.DB, stageName, tableName string, files []string) error {
	// Create a context that will be used for all operations
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// A channel that will hold the file paths we want to upload
	fileCh := make(chan string)

	// We'll use a WaitGroup to ensure all goroutines finish
	var wg sync.WaitGroup

	// Start a fixed number of worker goroutines to process the file uploads
	workerCount := 5
	for i := 0; i < workerCount; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for file := range fileCh {
				putQuery := fmt.Sprintf(
					"PUT /* UploadFilesAndCopy */ file:///go/test/%s @%s AUTO_COMPRESS=TRUE OVERWRITE=TRUE",
					file, stageName,
				)
				if _, err := db.ExecContext(ctx, putQuery); err != nil {
					log.Printf("failed to PUT file %q to stage %q: %v\n", file, stageName, err)
				}
			}
		}()
	}

	// Send file names into the channel for workers to pick up
	go func() {
		defer close(fileCh) // Close once we're done sending
		for _, file := range files {
			fileCh <- file
		}
	}()

	// Wait for all workers to finish uploading
	wg.Wait()

	// Now that all files are in the stage, run the COPY INTO command
	copyQuery := fmt.Sprintf(`COPY INTO %s FROM @%s /* UploadFilesAndCopy */ FILE_FORMAT = (TYPE = PARQUET)`, tableName, stageName)
	if _, err := db.ExecContext(ctx, copyQuery); err != nil {
		return fmt.Errorf("failed to COPY INTO %s: %w", tableName, err)
	}
	return nil
}

func Cleanup(db *sql.DB) error {
	ctx := context.Background()
	queries := []string{
		"RM /* Cleanup() */ @GO1292.PUBLIC.MYSTAGE",
		"TRUNCATE TABLE /* Cleanup() */ GO1292.PUBLIC.TESTTABLE",
	}
	for _, query := range queries {
		log.Printf("==> Cleanup: %s\n", query)
		if _, err := db.ExecContext(ctx, query); err != nil {
			log.Printf("==> Cleanup: failed to execute query due to error: %s\n", err)
			return err
		}
	}
	return nil
}

func main() {
	dsn := os.Getenv("TEST_DSN")
	if dsn == "" {
		log.Fatalf("==> Please set TEST_DSN envvar to an actual full DSN to connect with gosnowflake.\n")
	}
	db, err := sql.Open("snowflake", dsn)
	if err != nil {
		log.Fatalf("failed to connect. %v, err: %v", dsn, err)
	}
	defer db.Close()

	ctx := context.Background()
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatalf("failed to connect, err: %v", err)
	}
	defer conn.Close()

	filesToUpload := []string{"userdata1.parquet", "userdata2.parquet", "userdata3.parquet", "userdata4.parquet", "userdata5.parquet"}
	log.Printf("==> UploadFilesAndCopy start\n")
	if err := UploadFilesAndCopy(db, "GO1292.PUBLIC.MYSTAGE", "GO1292.PUBLIC.TESTTABLE", filesToUpload); err != nil {
		log.Printf("==> UploadFilesAndCopy failed: %s\n", err)
	}
	if err := Cleanup(db); err != nil {
		log.Printf("==> Cleanup failed: %s\n", err)
	}
}
```
As a next step, would it be possible to please
Thank you in advance!
Generally representative. The biggest differences I can spot are:
As far as I know there isn't anything data-specific. We've seen this error when we're only uploading 1 file, and we've seen it when we're uploading many. We could tell it was something with the version because only changing the dependency from
thank you so much for the added details! a wild idea maybe, but if you could perhaps provide me with a numerical session_id in which you know the issue definitely happened, i can try to look it up and see what queries were running and, hopefully, what happened with them. session_id is something like If not, it's not a problem, i'll keep trying.
I'll try to reproduce it in our nightly CI and give you something to work with tomorrow if I can reproduce it. It happens non-deterministically, so I'll have to just keep running the battery until I get it to happen.
@sfc-gh-dszmolka we saw it last night in CI for session |
thank you @niger-prequel, let me see what I can find
found this session yesterday - it was thankfully unique, so I found the relevant Snowflake account, deployment, everything.
Would you be able to share a queryId which resulted in this error message you originally reported as an error?
In addition and based on your earlier comment, ran more repros, which I now put into https://github.com/sfc-gh-dszmolka/repro-for-gosnowflake-1292 instead of quoting here. I'm now trying to log the sessionId too.
```go
// Start a fixed number of worker goroutines to process the file uploads
workerCount := 5
for i := 0; i < workerCount; i++ {
	wg.Add(1)
	go func() {
		..
	}()
}
```
after the temp stage + temp table is created in SessionX, and the first file is
A queryId which has this error message could be helpful. Also, if you're perhaps able to create a minimal reproduction which leads to the error for you and which I could execute on my end - that could eliminate all the possible differences and would be greatly helpful. If you're up for it, I can add you as a collaborator on the above repo, or of course any other means of sharing code is absolutely okay.
@sfc-gh-dszmolka I apologize, I assumed earlier that any error in our nightly CI was the error noted in this issue. That was not true; I've waited for our CI to surface this exact issue, and it did this morning. The error was
thank you so much @niger-prequel, will look up that sessionId in the coming days and see what happened inside it.
This is super weird. Seeing the 20 queries issued in session Since no queryId was provided which query failed with the Would it be possible for you to share verbose (DEBUG level) gosnowflake logs of the issue happening, in a more private setting, where you don't need to expose them for the whole world to see? I'm thinking of something like creating an official case with us and getting it routed to me; I can then work with you further on the case, where you can privately share logs - if that's allowed by your policy. If that's not possible, would you be able to provide a runnable reproducer (not a snippet) which I can run on my end in loops, in the hope it fails for me too? This would perhaps be the most helpful of all the options. Thank you in advance!
@sfc-gh-dszmolka I am setting the log level to debug for gosnowflake and I'll monitor for the next time this error appears. What is lucky is that I can get this to fail in our nightlies every so often so that environment isn't particularly sensitive. So I can probably share the debug logs here directly if that is simpler.
What's hard about this is that our platform is a bit complex. There are a number of different pipelines executing concurrently on that same instance, and how those pipelines are configured is dynamic, based on state in our system and the query engine. We support over a dozen databases and their drivers, so the code also isn't unique to the Snowflake driver. There is a lot of conditional logic depending on what driver we're using. There are also a number of layers of abstraction to facilitate dynamic pipeline generation between the
Understood, thank you! When sharing logs, our recommendation is to make sure nothing sensitive is shared over the public internet, so please either sanitize them, or upload them to a private GH repository where you can grant my user access, or really any other way of sharing a file which doesn't end up shared with everybody.
@sfc-gh-dszmolka I've attached debug logs. I redacted anything we'd consider sensitive, but again, there isn't any production data in this Snowflake account, so we should be fine. The debug logs do seem to reinforce my hunch that this has to do with context cancellation. To reiterate, we managed to solve this in prod for ourselves by generating a new child context before every use of

```go
func() error {
	newCtx, _ := context.WithCancel(ctx) // forces new child context to isolate from concurrent queries
	_, err := db.ExecContext(newCtx, query)
	if err != nil {
		return err
	}
	return nil
}
```

using
Driver version: 1.12.1 and 1.12.0
OS/architecture: Debian Bullseye x86
Go version: go1.23.3
Server version: 9.1.0

We are loading data into Snowflake via the golang driver. Our strategy is to use `PUT ...` SQL commands to upload parquet files to an internal stage; then we use a `COPY INTO ...` statement to publish the data. We leverage the `database/sql` golang abstraction. So generally our code will look like

We're experiencing issues with both 1.12.0 and 1.12.1. On 1.12.0, if context cancellation occurs, other running queries will fail with:

On 1.12.1 this goes away and we get the correct "context canceled" error message. However, we start experiencing non-deterministic errors where the `PUT` commands will sometimes return the error `error: 000605: Identified SQL statement is not currently executing`. No context has been canceled in this case.

Going through the release notes and looking at the PRs for what changed, it seems like #1248 may have introduced some kind of data race into the driver. We were able to get these errors to stop happening across our fleet by not sharing a context between the goroutines and creating a new child context for each spawned worker.

I expect the `PUT` queries to work concurrently as they did on 1.12.0, and the cancel-context error message to reflect the behavior of 1.12.1.

Not right now, this happens in production environments where it's against our policy to collect these logs.