
\t and \n in data break ingest of the data #96

Open
chomartek opened this issue Dec 14, 2023 · 1 comment

Comments

@chomartek

When saving data from a Delta table to StarRocks with starrocks-connector-for-apache-spark, the write fails with the error `Value count does not match column count. Expect 6, but got 0. Column delimiter: 9, Row delimiter: 10.` whenever the data in the Delta table contains `\t` or `\n`.
When those characters are filtered out, the import works:

(df.withColumn("problem_column", F.regexp_replace(F.col("problem_column"), F.lit("\t"), F.lit("")))
    .write.format("starrocks").option(...)
    .mode("append")
    .save())
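Both offending characters can be stripped in one pass with a character class. A minimal plain-Python check of that pattern with the standard `re` module (the `strip_delimiters` helper is illustrative, not connector code; inside Spark you would pass the same pattern to `F.regexp_replace`):

```python
import re

# Character class matching both default Stream Load CSV delimiters:
# tab (column separator, ASCII 9) and newline (row separator, ASCII 10).
DELIMITERS = re.compile(r"[\t\n]")

def strip_delimiters(value: str) -> str:
    """Remove embedded tab/newline characters so CSV ingest is not broken."""
    return DELIMITERS.sub("", value)

print(strip_delimiters("a\tb\nc"))  # abc
```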
@Xuxiaotuan

This is because the csv format is used by default. In the code, "\t" is used as the column separator; similarly, rowDelimiter = "\n" is the row separator.

@Override
public Write build() {
    RowStringConverter converter;
    if ("csv".equalsIgnoreCase(config.getFormat())) {
        converter = new CsvRowStringConverter(info.schema(), config.getColumnSeparator());
    } else if ("json".equalsIgnoreCase(config.getFormat())) {
        converter = new JSONRowStringConverter(info.schema());
    } else {
        throw new RuntimeException("UnSupport format " + config.getFormat());
    }
    return new StarRocksWriteImpl(info, config, converter);
}

In the code, "\t" is used as the separator.

@Override
public String fromRow(Row row) {
    if (row.schema() == null) {
        throw new RuntimeException("Can't convert Row without schema");
    }
    String[] data = new String[row.length()];
    for (int i = 0; i < row.length(); i++) {
        if (!row.isNullAt(i)) {
            StructField field = row.schema().fields()[i];
            data[i] = convert(field.dataType(), row.get(i)).toString();
        }
    }

    StringBuilder sb = new StringBuilder();
    for (int idx = 0; idx < data.length; idx++) {
        Object val = data[idx];
        sb.append(null == val ? "\\N" : val);
        if (idx < data.length - 1) {
            sb.append(separator);
        }
    }
    return sb.toString();
}
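The failure mode is easy to reproduce outside the connector: if a field value itself contains the separator, the server splits the row into the wrong number of columns. A minimal Python sketch of the same join-then-split logic (`to_csv_row` is a hypothetical helper mimicking `CsvRowStringConverter.fromRow`, not connector code):

```python
COLUMN_SEPARATOR = "\t"  # connector default, ASCII 9

def to_csv_row(fields):
    """Join values with the separator, encoding None as \\N,
    mirroring the fromRow logic quoted above."""
    return COLUMN_SEPARATOR.join("\\N" if f is None else str(f) for f in fields)

clean = to_csv_row(["a", "b", "c"])
dirty = to_csv_row(["a", "b\tx", "c"])  # one value contains the separator

# The server splits on the same separator to recover the columns:
print(len(clean.split(COLUMN_SEPARATOR)))  # 3 -> matches the column count
print(len(dirty.split(COLUMN_SEPARATOR)))  # 4 -> "Value count does not match column count"
```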

Solutions:

  1. Customize the row and column separators.
  2. Switch the write format to json.
private static final String KEY_PROPS_FORMAT = PROPS_PREFIX + "format";
private static final String KEY_PROPS_ROW_DELIMITER = PROPS_PREFIX + "row_delimiter";
private static final String KEY_PROPS_COLUMN_SEPARATOR = PROPS_PREFIX + "column_separator";
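From the Spark side, either solution is applied through write options built from the keys above. A sketch under the assumption that `PROPS_PREFIX` resolves to `starrocks.write.properties.` and that Stream Load accepts hex-escaped delimiters such as `\x01` (verify both against the docs for your connector version):

```python
# Option names are assumptions derived from the constants quoted above.
json_options = {
    "starrocks.write.properties.format": "json",  # solution 2: no delimiters at all
}

csv_options = {
    "starrocks.write.properties.format": "csv",  # solution 1: keep csv but use
    "starrocks.write.properties.column_separator": "\\x01",  # separators that
    "starrocks.write.properties.row_delimiter": "\\x02",     # never occur in data
}

# Usage (not runnable here without a Spark session and a StarRocks cluster):
# df.write.format("starrocks").options(**json_options).mode("append").save()
```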
