
Files written by this library are unreadable by clickhouse #313

Open
knl opened this issue Mar 13, 2023 · 2 comments

Comments

knl commented Mar 13, 2023

When I write a simple CSV file (initially I tried Parquet, with the same result), the file written by this library ends up not being readable by ClickHouse. ClickHouse uses the libhdfs3 library, so I assume the issue might be there as well.

What I tried so far:

  • the attached program writing directly to HDFS: cannot be read by ClickHouse
  • the attached program writing to a local file, then uploading with gohdfs put (see the sketch after this list): cannot be read by ClickHouse
  • the attached program writing to a local file, then uploading with hdfs dfs -put (Hadoop tools): can be read by ClickHouse
  • the attached program writing directly to HDFS, then hdfs dfs -get followed by hdfs dfs -put: can be read by ClickHouse, nothing missing
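To make the second bullet concrete, this is roughly what the local-write-then-upload variant looks like (a minimal sketch, assuming the same cluster configuration and target path as the program below; CopyToRemote is used here as roughly what gohdfs put does):

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	// Write the same 100 rows to a local temporary file first.
	local, err := os.CreateTemp("", "flat-*.csv")
	if err != nil {
		log.Fatalln("Can't create local file", err)
	}
	for i := 0; i < 100; i++ {
		fmt.Fprintf(local, "%d,%d,%f\n", int32(20+i%5), int64(i), float32(50.0))
	}
	if err := local.Close(); err != nil {
		log.Fatalln("Issue closing local file", err)
	}

	// Then upload it through this library (roughly the "gohdfs put" case).
	client, err := hdfs.New("")
	if err != nil {
		log.Fatalln("Can't create hdfs client", err)
	}
	_ = client.Remove("/random/yet/existing/path/flat.csv")
	if err := client.CopyToRemote(local.Name(), "/random/yet/existing/path/flat.csv"); err != nil {
		log.Fatalln("Upload failed", err)
	}
	log.Println("Uploaded", local.Name())
}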

This is the program that writes directly to HDFS:

package main

import (
	"fmt"
	"log"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	client, err := hdfs.New("")
	if err != nil {
		log.Println("Can't create hdfs client", err)
		return
	}

	// Remove any previous copy, then create the file on HDFS.
	_ = client.Remove("/random/yet/existing/path/flat.csv")
	fw, err := client.Create("/random/yet/existing/path/flat.csv")
	if err != nil {
		log.Println("Can't create writer", err)
		return
	}

	// Write 100 simple CSV rows.
	num := 100
	for i := 0; i < num; i++ {
		if _, err = fmt.Fprintf(fw, "%d,%d,%f\n", int32(20+i%5), int64(i), float32(50.0)); err != nil {
			log.Println("Write error", err)
		}
	}
	log.Println("Write Finished")

	if err = fw.Close(); err != nil {
		log.Println("Issue closing file", err)
	}
	log.Println("Wrote ", num, "rows")
}

This is the response from running clickhouse-local:

server.internal :) select * from hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV')

SELECT *
FROM hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV')

Query id: e33d9bf7-41b0-4025-a5fc-8dc6ebb65c0f


0 rows in set. Elapsed: 60.292 sec.

Received exception:
Code: 210. DB::Exception: Fail to read from HDFS: hdfs://nameservice1, file path: /random/yet/existing/path/flat.csv.
Error: HdfsIOException: InputStreamImpl: cannot read file: /random/yet/existing/path/flat.csv, from position 0, size: 1048576.
Caused by: HdfsIOException: InputStreamImpl: all nodes have been tried and no valid replica can be read for Block: [block pool ID: BP-2134387385-192.168.12.6-1648216715170 block ID 1367614010_294283603].: Cannot extract table structure from CSV format file. You can specify the structure manually. (NETWORK_ERROR)

I'm using Hadoop 2.7.3.2.6.5.0-292.

The attached program doesn't produce any error.
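A quick way to see whether the written bytes are at least readable through this library (as opposed to libhdfs3) is a read-back check; a minimal sketch, assuming the same path as above:

package main

import (
	"log"

	"github.com/colinmarc/hdfs/v2"
)

func main() {
	client, err := hdfs.New("")
	if err != nil {
		log.Fatalln("Can't create hdfs client", err)
	}

	// Read the whole file back through this library and report its size.
	data, err := client.ReadFile("/random/yet/existing/path/flat.csv")
	if err != nil {
		log.Fatalln("Read error", err)
	}
	log.Println("Read", len(data), "bytes back")
}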

@colinmarc (Owner)

Are you sure the root cause isn't the last error in the chain?

Cannot extract table structure from CSV format file. You can specify the structure manually.

Otherwise I have no clue what the problem might be.
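(If I recall the ClickHouse syntax correctly, the structure can be passed as a third argument to the table function, e.g. hdfs('hdfs://nameservice1/random/yet/existing/path/flat.csv', 'CSV', 'c1 Int32, c2 Int64, c3 Float32'), where the column names are just placeholders; that would at least show whether schema inference is the part that fails.)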

@colinmarc (Owner)

If you can get ahold of the datanode/namenode logs for the failed block read, that would help.
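If it helps, something like hdfs fsck /random/yet/existing/path/flat.csv -files -blocks -locations, plus grepping the datanode logs for the block ID in the exception, should show whether the namenode thinks that block has healthy replicas.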
