
\t and \n in data break ingest of the data #96

Open
chomartek opened this issue Dec 14, 2023 · 1 comment

Comments

@chomartek

When saving data from a Delta table to StarRocks with starrocks-connector-for-apache-spark, the write fails with the error `Value count does not match column count. Expect 6, but got 0. Column delimiter: 9, Row delimiter: 10.` whenever the data in the Delta table contains `\t` or `\n`.
When those characters are filtered out, the import works:

(df.withColumn("problem_column", F.regexp_replace(F.col("problem_column"), F.lit("\t"), F.lit("")))
    .write.format("starrocks").option(...)
    .mode("append")
    .save())
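Both offending characters can be stripped in one pass with a character class. A minimal plain-Python check of that pattern with the standard `re` module (the `strip_delimiters` helper is illustrative, not connector code; inside Spark you would pass the same pattern to `F.regexp_replace`):

```python
import re

# Character class matching both default Stream Load CSV delimiters:
# tab (column separator, ASCII 9) and newline (row separator, ASCII 10).
DELIMITERS = re.compile(r"[\t\n]")

def strip_delimiters(value: str) -> str:
    """Remove embedded tab/newline characters so CSV ingest is not broken."""
    return DELIMITERS.sub("", value)

print(strip_delimiters("a\tb\nc"))  # abc
```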
@Xuxiaotuan

This is because the csv format is used by default. In the code, "\t" is used as the column separator; similarly, rowDelimiter = "\n" is the row separator.

@Override
public Write build() {
    RowStringConverter converter;
    if ("csv".equalsIgnoreCase(config.getFormat())) {
        converter = new CsvRowStringConverter(info.schema(), config.getColumnSeparator());
    } else if ("json".equalsIgnoreCase(config.getFormat())) {
        converter = new JSONRowStringConverter(info.schema());
    } else {
        throw new RuntimeException("UnSupport format " + config.getFormat());
    }
    return new StarRocksWriteImpl(info, config, converter);
}

In the code, "\t" is used as the separator.

@Override
public String fromRow(Row row) {
    if (row.schema() == null) {
        throw new RuntimeException("Can't convert Row without schema");
    }
    String[] data = new String[row.length()];
    for (int i = 0; i < row.length(); i++) {
        if (!row.isNullAt(i)) {
            StructField field = row.schema().fields()[i];
            data[i] = convert(field.dataType(), row.get(i)).toString();
        }
    }

    StringBuilder sb = new StringBuilder();
    for (int idx = 0; idx < data.length; idx++) {
        Object val = data[idx];
        sb.append(null == val ? "\\N" : val);
        if (idx < data.length - 1) {
            sb.append(separator);
        }
    }
    return sb.toString();
}
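The failure mode is easy to reproduce outside the connector: if a field value itself contains the separator, the server splits the row into the wrong number of columns. A minimal Python sketch of the same join-then-split logic (`to_csv_row` is a hypothetical helper mimicking `CsvRowStringConverter.fromRow`, not connector code):

```python
COLUMN_SEPARATOR = "\t"  # connector default, ASCII 9

def to_csv_row(fields):
    """Join values with the separator, encoding None as \\N,
    mirroring the fromRow logic quoted above."""
    return COLUMN_SEPARATOR.join("\\N" if f is None else str(f) for f in fields)

clean = to_csv_row(["a", "b", "c"])
dirty = to_csv_row(["a", "b\tx", "c"])  # one value contains the separator

# The server splits on the same separator to recover the columns:
print(len(clean.split(COLUMN_SEPARATOR)))  # 3 -> matches the column count
print(len(dirty.split(COLUMN_SEPARATOR)))  # 4 -> "Value count does not match column count"
```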

Solutions:

  1. Customize the row and column separators.
  2. Switch the write format to json.
private static final String KEY_PROPS_FORMAT = PROPS_PREFIX + "format";
private static final String KEY_PROPS_ROW_DELIMITER = PROPS_PREFIX + "row_delimiter";
private static final String KEY_PROPS_COLUMN_SEPARATOR = PROPS_PREFIX + "column_separator";
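From the Spark side, either solution is applied through write options built from the keys above. A sketch under the assumption that `PROPS_PREFIX` resolves to `starrocks.write.properties.` and that Stream Load accepts hex-escaped delimiters such as `\x01` (verify both against the docs for your connector version):

```python
# Option names are assumptions derived from the constants quoted above.
json_options = {
    "starrocks.write.properties.format": "json",  # solution 2: no delimiters at all
}

csv_options = {
    "starrocks.write.properties.format": "csv",  # solution 1: keep csv but use
    "starrocks.write.properties.column_separator": "\\x01",  # separators that
    "starrocks.write.properties.row_delimiter": "\\x02",     # never occur in data
}

# Usage (not runnable here without a Spark session and a StarRocks cluster):
# df.write.format("starrocks").options(**json_options).mode("append").save()
```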
