
Unicode character database access, case conversion, classification and script detection via Codepoint wrapper; generated lookup tables, surrogate handling, CharSequence and Appendable extensions.
A Kotlin multiplatform library providing limited Unicode Character Database functionality.
Note: This project serves as a temporary solution until KT-23251 Extend Unicode support in Kotlin common and KT-24908 CodePoint inline class are implemented in the Kotlin standard library.
Kodepoint provides Unicode character property access across multiple platforms:
Character class
Add the following dependency to your project:
dependencies {
    implementation("me.zolotov.kodepoint:kodepoint:$version")
}

import me.zolotov.kodepoint.*
// Create a Codepoint
val codepoint = Codepoint(0x1F600) // 😀
val isLetter = codepoint.isLetter()
val upperCase = codepoint.toUpperCase()
// String conversion
val string = codepoint.asString()
// Work with CharSequence
val text = "Hello 👋 World"
text.forEachCodepoint { cp ->
println("U+${cp.codepoint.toString(16).uppercase()}: ${cp.asString()}")
}
// Append to StringBuilder
val sb = StringBuilder()
sb.appendCodePoint(Codepoint(0x1F44D)) // 👍

For detailed technical documentation on data storage, lookup tables, and platform-specific implementations, see ARCHITECTURE.md.
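Iterating a CharSequence by codepoint, as in the usage example above, depends on decoding UTF-16 surrogate pairs. The following is a minimal sketch of that decoding using only the Kotlin standard library. The names `forEachCodePointSketch` and `codePointsOf` are hypothetical and are not part of Kodepoint; the library's own `forEachCodepoint` may be implemented differently.

```kotlin
// Hypothetical helper, not the library's implementation: walks a CharSequence
// and reports each Unicode code point as an Int.
fun CharSequence.forEachCodePointSketch(action: (Int) -> Unit) {
    var i = 0
    while (i < length) {
        val high = this[i]
        if (high.isHighSurrogate() && i + 1 < length && this[i + 1].isLowSurrogate()) {
            val low = this[i + 1]
            // Combine the two 10-bit halves and add the supplementary-plane offset.
            action(0x10000 + ((high.code - 0xD800) shl 10) + (low.code - 0xDC00))
            i += 2
        } else {
            // BMP character, or an unpaired surrogate passed through unchanged.
            action(high.code)
            i += 1
        }
    }
}

// Collects the code points of a text into a list.
fun codePointsOf(text: CharSequence): List<Int> {
    val result = mutableListOf<Int>()
    text.forEachCodePointSketch { result.add(it) }
    return result
}

fun main() {
    // "\uD83D\uDC4B" is the surrogate pair for the waving hand emoji.
    for (cp in codePointsOf("Hi \uD83D\uDC4B")) {
        println("U+" + cp.toString(16).uppercase().padStart(4, '0'))
    }
}
```

A supplementary character occupies two `Char` values in a Kotlin string, so the loop advances by two positions whenever it finds a valid high/low pair.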
Core library providing Unicode character functionality through @JvmInline value class Codepoint.
CharSequence and Appendable
Contains generated Unicode property lookup tables.
lib for non-JVM platforms
Contains UnicodeScript enum accessible from both unicode and lib modules.
| Platform | Targets |
|---|---|
| JVM | jvm |
| JavaScript | js, wasmJs |
| iOS | iosArm64, iosSimulatorArm64, iosX64 |
| macOS | macosArm64, macosX64 |
| tvOS | tvosArm64, tvosSimulatorArm64, tvosX64 |
| watchOS | watchosArm64, watchosSimulatorArm64, watchosX64 |
| Linux | linuxArm64, linuxX64 |
| Windows | mingwX64 |
Note: Only JVM and WasmJS targets are actively tested. Other targets compile and should work correctly, but have not been thoroughly validated.
This library aims to provide consistent Unicode behavior across all platforms.
On non-JVM platforms, it uses generated Unicode Character Database lookup tables.
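As a rough illustration of what a table-driven property lookup can look like, the sketch below stores a property as sorted, non-overlapping codepoint ranges and answers queries with a binary search. This is an assumption made for illustration only: the range data is a tiny hand-written sample, `inRanges` is a hypothetical name, and the layout the library actually generates is the one described in ARCHITECTURE.md.

```kotlin
// Illustrative data only: a few ASCII and Latin-1 letter ranges, NOT the generated tables.
private val rangeStarts = intArrayOf(0x41, 0x61, 0xC0, 0xD8, 0xF8)
private val rangeEnds = intArrayOf(0x5A, 0x7A, 0xD6, 0xF6, 0xFF)

// Binary search over sorted, non-overlapping inclusive ranges.
fun inRanges(codePoint: Int): Boolean {
    var lo = 0
    var hi = rangeStarts.size - 1
    while (lo <= hi) {
        val mid = (lo + hi) ushr 1
        when {
            codePoint < rangeStarts[mid] -> hi = mid - 1
            codePoint > rangeEnds[mid] -> lo = mid + 1
            else -> return true
        }
    }
    return false
}

fun main() {
    println(inRanges('A'.code))  // inside the first range
    println(inRanges(0xD7))      // multiplication sign, falls between two ranges
}
```

Range tables of this kind stay compact because most Unicode properties cover long contiguous runs of codepoints.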
The implementation is validated against JVM's java.lang.Character methods.
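The kind of validation described above can be pictured as an exhaustive comparison over the full codepoint range. The sketch below is not the project's test code: `firstMismatch` and `asciiOnlyIsDigit` are hypothetical names, and the ASCII-only predicate merely stands in for an implementation under test. It runs on the JVM because it uses `java.lang.Character` as the reference.

```kotlin
// Returns the first code point where the two predicates disagree,
// or null if they agree on all 1,114,112 code points.
fun firstMismatch(reference: (Int) -> Boolean, candidate: (Int) -> Boolean): Int? {
    for (cp in 0..0x10FFFF) {
        if (reference(cp) != candidate(cp)) return cp
    }
    return null
}

// Hypothetical stand-in for an implementation under test: deliberately incomplete.
fun asciiOnlyIsDigit(cp: Int): Boolean = cp in '0'.code..'9'.code

fun main() {
    // A predicate always agrees with itself.
    println(firstMismatch({ Character.isDigit(it) }, { Character.isDigit(it) }))
    // The ASCII-only stand-in diverges at the first non-ASCII decimal digit.
    val mismatch = firstMismatch({ Character.isDigit(it) }, ::asciiOnlyIsDigit)
    println(mismatch?.let { "U+" + it.toString(16).uppercase().padStart(4, '0') })
}
```

Because the codepoint space is small, checking every value is cheap and leaves no sampling gaps.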
The following functions produce identical results to JVM's Character class for all 1,114,112 Unicode codepoints:
isLetter(), isDigit(), isLetterOrDigit()
isUpperCase(), isLowerCase()
toLowerCase(), toUpperCase()
isSpaceChar()
isIdeographic()
isIdentifierIgnorable()
isISOControl()
isJavaIdentifierStart(), isJavaIdentifierPart() - generated directly from JVM's Character class during build

Some functions have intentional differences from JVM behavior to maintain Unicode standard compliance:
This library uses Unicode's White_Space property, which differs from Java's Character.isWhitespace():
| Codepoint | Character | Unicode White_Space | Java isWhitespace |
|---|---|---|---|
| U+001C | File Separator | false | true |
| U+001D | Group Separator | false | true |
| U+001E | Record Separator | false | true |
| U+001F | Unit Separator | false | true |
| U+0085 | Next Line (NEL) | true | false |
| U+00A0 | No-Break Space | true | false |
| U+2007 | Figure Space | true | false |
| U+202F | Narrow No-Break Space | true | false |
Java excludes non-breaking spaces from isWhitespace() and includes control characters that Unicode does not classify as whitespace.
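The differences in the table can be reproduced on the JVM with a short comparison. In this sketch, `isUnicodeWhiteSpace` is a hypothetical hand-written predicate that lists the codepoints carrying the White_Space property in the Unicode Character Database; it is not the library's function. The exact set of differences printed can vary with the Unicode version shipped by the JDK in use.

```kotlin
// Code points with the Unicode White_Space property (PropList.txt), written out by hand.
fun isUnicodeWhiteSpace(cp: Int): Boolean = when (cp) {
    in 0x0009..0x000D, 0x0020, 0x0085, 0x00A0, 0x1680,
    in 0x2000..0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000 -> true
    else -> false
}

// All code points on which the two definitions disagree.
fun whitespaceDifferences(): List<Int> =
    (0..0x10FFFF).filter { isUnicodeWhiteSpace(it) != Character.isWhitespace(it) }

fun main() {
    for (cp in whitespaceDifferences()) {
        val hex = cp.toString(16).uppercase().padStart(4, '0')
        println("U+$hex  White_Space=${isUnicodeWhiteSpace(cp)}  isWhitespace=${Character.isWhitespace(cp)}")
    }
}
```

Code that relies on `Character.isWhitespace()` to trim or split text may therefore behave differently when moved to a White_Space based check, most visibly around U+00A0.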
JVM includes U+2E2F (VERTICAL TILDE) for backward compatibility, but this character is not in Unicode's ID_Start or ID_Continue properties. This library follows the Unicode standard.
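The JVM side of this difference can be inspected directly. The sketch below assumes that the behavior described above is the one exposed by `Character.isUnicodeIdentifierStart` and `Character.isUnicodeIdentifierPart`; the text does not name the methods, so treat that mapping as an assumption. `describeIdentifierStatus` is a hypothetical helper.

```kotlin
// Assumption: the JVM behavior in question is what
// Character.isUnicodeIdentifierStart/Part report for the code point.
fun describeIdentifierStatus(codePoint: Int): String {
    val name = Character.getName(codePoint)
    val start = Character.isUnicodeIdentifierStart(codePoint)
    val part = Character.isUnicodeIdentifierPart(codePoint)
    return "$name: identifierStart=$start, identifierPart=$part"
}

fun main() {
    println(describeIdentifierStatus(0x2E2F))
}
```

Running this on a given JDK shows how that runtime classifies the character, which can then be compared with the result from this library.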
# Build all modules
./gradlew build
# Run tests
./gradlew allTests

# Full benchmarks (5 warmups, 5 iterations)
./gradlew :benchmarks:benchmark
# Quick benchmarks (2 warmups, 3 iterations)
./gradlew :benchmarks:quickBenchmark

CodepointsBenchmark – Measures performance across different character sets:
JvmComparisonBenchmark – Direct comparison with java.lang.Character:
isLetter, isDigit, toLowerCase, toUpperCase
isWhitespace, isJavaIdentifierStart, isJavaIdentifierPart
Results are written to benchmarks/build/reports/benchmarks/.