How to Count Lines in a File in Scala (And the Source File Handle Leak Nobody Talks About)
Count lines in a file in Scala — Source.fromFile, scala.util.Using, and Java NIO. Covers the Source file handle leak, getLines lazy iterator trap, Spark large-file patterns, and Scala 2.13+ Using utility with benchmarks.
A Stack Overflow question literally titled count number of lines in file - Scala has a highest-scored answer that says:
io.Source.fromFile("file.txt").getLines.size
The answer is half right.
The line counting part is fine. getLines() returns an Iterator[String], so .size consumes lazily and does not read the whole file into memory at once.
The missing part is the dangerous part:
- the Source is never closed
- repeated calls can accumulate file descriptors
- another Stack Overflow comment under that answer points out the leak directly
There is a second trap too. getLines() is lazy. If you return that iterator from a helper, a loan-pattern block, or a Future boundary, the underlying Source may already be closed when iteration actually happens.
This guide covers the practical scala count lines options:
- scala source fromfile count lines for the classic approach
- scala util using file for Scala 2.13+ resource safety
- a loan pattern for Scala 2.12 and older code
- Files.lines for modern JVM streaming
- scala spark count lines for HDFS and S3
- byte scanning for raw throughput
If you searched for scala count lines in file, the short answer is:
- small to medium local text file: Using.resource(Source.fromFile(path))(_.getLines().foldLeft(0L)((n, _) => n + 1))
- large JVM text file: Using.resource(Files.lines(Paths.get(path)))(_.count())
- distributed storage: spark.read.textFile(path).count()
That is the real count lines scala rule of thumb: first keep the resource lifetime correct, then choose the counting API that matches your storage system.
Quick Method Guide
| I want to... | Use this | Main warning |
|---|---|---|
| Count a local text file with standard Scala | Using.resource(Source.fromFile(path))(_.getLines().foldLeft(0L)((n, _) => n + 1)) | do not let the iterator escape |
| Keep manual control | try / finally around Source.fromFile | easy to forget close() |
| Return Try[Long] | Using(Source.fromFile(path)) { ... } | avoid .get until the edge of your app |
| Stream via Java NIO | Using.resource(Files.lines(Paths.get(path)))(_.count()) | the returned Stream must be closed |
| Count on Spark | spark.read.textFile(path).count() | local paths must exist on worker nodes too |
| Get maximum raw speed | buffered byte scan | counts physical newline bytes, not decoded text semantics |
For most scala count lines in file code, the strongest default on Scala 2.13+ is Using.resource plus an in-block terminal operation.
Method 1: Source.fromFile - The Classic Approach with a Hidden Leak
The familiar answer looks like this:
import scala.io.Source
val count = Source.fromFile("data.txt").getLines().size
This is the classic scala source fromfile count lines snippet.
It has two important properties:
- getLines() returns an Iterator[String], so the count is streaming rather than read-all
- the Source stays open until you close it
So the real safe version is:
import scala.io.Source

val source = Source.fromFile("data.txt")
try {
  val count = source.getLines().foldLeft(0L)((n, _) => n + 1)
  println(s"Lines: $count")
} finally {
  source.close()
}
With an explicit encoding:
import scala.io.Source

val source = Source.fromFile("data.txt", "UTF-8")
try {
  source.getLines().foldLeft(0L)((n, _) => n + 1)
} finally {
  source.close()
}
Why the leak happens
Scala's Source API declares Source as Closeable. The docs do not force a specific resource-management pattern, so many snippets simply omit close().
That omission is what turns a simple scala count lines helper into a long-lived process problem.
On one file, the code often appears to work.
On repeated files, this becomes scala too many open files:
- every Source.fromFile(...) opens a file-backed resource
- if you do not close it promptly, descriptors stay open until GC eventually notices, if it ever does in time
- in a driver loop, service, or batch process, those descriptors accumulate
That is why the bad pattern is not "slow" so much as "resource-unsafe".
This is not a read-all memory trap
This part is easy to get wrong.
Source.fromFile(...).getLines().size does not load the whole file into one Scala collection. The official getLines() doc says it returns an Iterator[String].
That means this scala source fromfile count lines pattern is usually memory-reasonable for line counting itself.
The problem is file-handle lifetime, not read-all allocation.
Reproducing scala too many open files
The failure shape looks like this:
import scala.io.Source

val paths: Seq[String] = (1 to 5000).map(i => s"logs/$i.txt")
val counts = paths.map { path =>
  Source.fromFile(path).getLines().size // never closed: one leaked descriptor per file
}
In a short script you may get lucky.
In a long-running JVM, this can eventually become java.io.IOException: Too many open files.
That is why scala source fromfile count lines needs an explicit closing story even though the counting expression itself looks harmless.
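As a hedged sketch of what that closing story can look like in a batch loop, the version below closes each Source before the next file is opened. The helper names (countLinesClosing, countAll) and the temporary demo files are assumptions made for illustration; they are not part of any library.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

// Hypothetical helper: counts lines in one file and always closes the Source.
def countLinesClosing(path: Path): Long = {
  val source = Source.fromFile(path.toFile, "UTF-8")
  try source.getLines().foldLeft(0L)((n, _) => n + 1)
  finally source.close()
}

// Each descriptor is released before the next file is opened.
def countAll(paths: Seq[Path]): Seq[(Path, Long)] =
  paths.map(p => p -> countLinesClosing(p))

// Demo input (an assumption): three temporary files with 1, 2, and 3 lines.
val demoDir: Path = Files.createTempDirectory("line-count-demo")
val demoFiles: Seq[Path] = (1 to 3).map { i =>
  val content = (1 to i).map(n => s"line $n").mkString("\n")
  Files.write(demoDir.resolve(s"$i.txt"), content.getBytes(StandardCharsets.UTF_8))
}
val demoCounts: Seq[Long] = countAll(demoFiles).map(_._2)
```

The try / finally form is used here so the sketch also compiles on Scala 2.12. The number of open descriptors stays at one no matter how many paths are in the batch.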
If you want the Kotlin version of the same resource-lifetime bug class, the Kotlin useLines guide shows how a lazy sequence can outlive its reader.
Trap 2: getLines() Is Lazy, and That Changes the Resource Boundary
The second Scala-specific gotcha is scala getlines lazy.
The docs for Source.getLines() say it returns Iterator[String].
That means the file is not fully read when you call getLines(). The file is read as the iterator is consumed.
This is correct:
import scala.io.Source

val source = Source.fromFile("data.txt")
try {
  val count = source.getLines().size
  println(count)
} finally {
  source.close()
}
The iterator is fully consumed before close().
This is wrong:
import scala.io.Source

def lines(path: String): Iterator[String] = {
  val source = Source.fromFile(path)
  try {
    source.getLines() // nothing has been read yet
  } finally {
    source.close() // runs before the caller touches the iterator
  }
}

val count = lines("data.txt").size
Now the iterator escapes the block, and the Source is already closed when the caller starts consuming it.
That is the scala getlines lazy trap in its simplest form.
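To make the failure observable, here is a small self-contained sketch that runs both shapes against a temporary file. The function names and the sample file are assumptions for the demo. The escaped iterator is wrapped in Try because the read is expected to fail once the underlying stream is closed; the exact exception message may vary by JVM.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source
import scala.util.Try

// Broken shape: the Source is closed before the caller iterates.
def escapedLines(path: String): Iterator[String] = {
  val source = Source.fromFile(path, "UTF-8")
  try source.getLines()
  finally source.close()
}

// Safe shape: the iterator is fully consumed while the file is still open.
def countInside(path: String): Long = {
  val source = Source.fromFile(path, "UTF-8")
  try source.getLines().foldLeft(0L)((n, _) => n + 1)
  finally source.close()
}

// Demo file (an assumption): three lines with a trailing newline.
val sample: Path = Files.write(
  Files.createTempFile("lazy-trap", ".txt"),
  "a\nb\nc\n".getBytes(StandardCharsets.UTF_8)
)

val escaped: Try[Int] = Try(escapedLines(sample.toString).size) // fails: stream already closed
val inside: Long = countInside(sample.toString)                 // counts the three lines
```

The only difference between the two functions is where the iterator is consumed relative to close().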
Why toSeq is not the best force-evaluation answer
One Stack Overflow thread on Stream Closed shows a subtle follow-up: the asker tried toSeq, but the runtime type was still Stream in that Scala version.
If you need strict materialization, prefer an obviously strict collection:
import scala.io.Source

val source = Source.fromFile("data.txt")
try {
  val lines = source.getLines().toVector
  println(lines.length)
} finally {
  source.close()
}
Or:
import scala.io.Source

val source = Source.fromFile("data.txt")
try {
  val lines = source.getLines().toList
  println(lines.length)
} finally {
  source.close()
}
For simple line counting, you do not need to materialize at all. Just count inside the block.
The same bug in asynchronous code
This shape is also wrong:
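import scala.concurrent.{ExecutionContext, Future}
import scala.io.Source

// withSource is the loan-pattern helper defined under Method 3.
def countAsync(path: String)(implicit ec: ExecutionContext): Future[Int] =
  withSource(path) { source =>
    Future(source.getLines().size) // may run after withSource has already closed the Source
  }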
The Future may run after the resource block exits.
The safe pattern is the other way around:
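import scala.concurrent.{ExecutionContext, Future}

// withSource is the loan-pattern helper defined under Method 3.
def countAsync(path: String)(implicit ec: ExecutionContext): Future[Long] =
  Future {
    withSource(path) { source =>
      source.getLines().foldLeft(0L)((n, _) => n + 1)
    }
  }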
Keep the whole iterator consumption inside the resource lifetime.
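For readers who want to run the safe shape end to end, the following is a minimal sketch under a few assumptions: the AsyncLineCount object name is hypothetical, the helper is repeated so the block is self-contained, the global ExecutionContext is used only for the demo, and Await is there purely to observe the result.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.io.Source

object AsyncLineCount {
  // Minimal loan-pattern helper, repeated here so the sketch stands alone.
  def withSource[A](path: String)(f: Source => A): A = {
    val source = Source.fromFile(path, "UTF-8")
    try f(source) finally source.close()
  }

  // The Future wraps the whole resource block, so open, count, and close
  // all happen inside the same asynchronous task.
  def countAsync(path: String)(implicit ec: ExecutionContext): Future[Long] =
    Future {
      withSource(path)(_.getLines().foldLeft(0L)((n, _) => n + 1))
    }
}

// Demo file (an assumption): two lines.
val asyncSample: Path = Files.write(
  Files.createTempFile("async-count", ".txt"),
  "one\ntwo\n".getBytes(StandardCharsets.UTF_8)
)

// Blocking with Await is for demonstration only.
val asyncCount: Long =
  Await.result(AsyncLineCount.countAsync(asyncSample.toString)(ExecutionContext.global), 10.seconds)
```

In application code you would normally compose the Future rather than block on it; the point of the sketch is only the nesting order of Future and withSource.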
Method 2: scala.util.Using - The Modern Resource-Management Answer
Scala 2.13+ gives you the standard-library answer:
import scala.io.Source
import scala.util.Using

val count = Using.resource(Source.fromFile("data.txt")) { source =>
  source.getLines().foldLeft(0L)((n, _) => n + 1)
}
This is the cleanest scala util using file form if you want exceptions to propagate.
If you want explicit error handling, use Using(...), which returns Try[A]:
import scala.io.Source
import scala.util.{Try, Using}

val count: Try[Long] =
  Using(Source.fromFile("data.txt")) { source =>
    source.getLines().foldLeft(0L)((n, _) => n + 1)
  }
Then handle the result:
count match {
  case scala.util.Success(n) => println(s"Lines: $n")
  case scala.util.Failure(e) => println(s"Error: ${e.getMessage}")
}
Why Using is better than raw try / finally
The Scala docs describe Using as a utility for automatic resource management. They also document two important behaviors:
- Using(...) wraps the whole operation in a Try
- Using.resource(...) behaves similarly to Java's try-with-resources
That makes scala util using file the right modern answer for most application code.
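A quick way to convince yourself of both behaviors is to use a resource that records whether it was closed. The Probe class below is a hypothetical stand-in for a file handle, not a library type; the sketch only relies on Using.apply and Using.resource from the Scala 2.13+ standard library.

```scala
import scala.util.{Try, Using}

// Hypothetical probe resource that records whether close() was called.
final class Probe extends AutoCloseable {
  var closed = false
  override def close(): Unit = { closed = true }
}

// Success path: the value comes back wrapped in a Try and the probe is closed.
val okProbe = new Probe
val okResult: Try[Int] = Using(okProbe)(_ => 42)

// Failure path: the exception is captured in the Try and the probe is still closed.
val failingProbe = new Probe
val failedResult: Try[Int] =
  Using(failingProbe)(_ => throw new IllegalStateException("boom"))

// Using.resource rethrows instead of returning a Try, but it closes first.
val rethrowProbe = new Probe
val rethrown: Try[Int] =
  Try(Using.resource(rethrowProbe)(_ => throw new IllegalStateException("boom")))
```

The outer Try around Using.resource is only there so the demo can inspect the outcome without crashing.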
Safe examples
Count non-empty lines:
import scala.io.Source
import scala.util.Using

val nonEmpty = Using.resource(Source.fromFile("data.txt")) { source =>
  source.getLines().foldLeft(0L) { (n, line) =>
    if (line.nonEmpty) n + 1 else n
  }
}
Batch a small set of local files and keep failures explicit:
import scala.io.Source
import scala.util.{Try, Using}

def countLines(path: String): Try[Long] =
  Using(Source.fromFile(path)) { source =>
    source.getLines().foldLeft(0L)((n, _) => n + 1)
  }

val paths = Seq("a.txt", "b.txt", "c.txt")
val results = paths.map(path => path -> countLines(path)).toMap
This is a much safer answer to scala count lines in file than sprinkling .getLines().size across a codebase and hoping somebody remembers to close everything later.
Method 3: Loan Pattern - For Scala 2.12 and Older Code
If you are not on Scala 2.13+, a small loan-pattern helper keeps the code honest:
import scala.io.Source

def withSource[A](path: String)(f: Source => A): A = {
  val source = Source.fromFile(path)
  try {
    f(source)
  } finally {
    source.close()
  }
}
Use it like this:
val count = withSource("data.txt") { source =>
  source.getLines().foldLeft(0L)((n, _) => n + 1)
}
Or with materialization:
val lines = withSource("data.txt") { source =>
  source.getLines().toVector
}
The loan-pattern rule that matters
The same resource rule still applies:
- do return a fully computed count
- do return a strict collection like List or Vector
- do not return the raw Iterator
Wrong:
def lines(path: String): Iterator[String] =
  withSource(path)(_.getLines())
Right:
def countLines(path: String): Long =
  withSource(path)(_.getLines().foldLeft(0L)((n, _) => n + 1))
This is exactly where the scala getlines lazy problem bites older codebases hardest.
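If an older codebase opens more than just Source, the same idea generalizes. The sketch below is one possible shape, assuming a hypothetical withResource helper that accepts any AutoCloseable; it uses no 2.13-only API. Unlike scala.util.Using, it makes no attempt to attach a close() failure as a suppressed exception.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

// Hypothetical generic loan helper: works for any AutoCloseable,
// including scala.io.Source and java.io streams.
def withResource[R <: AutoCloseable, A](acquire: => R)(use: R => A): A = {
  val resource = acquire
  try use(resource)
  finally resource.close()
}

// The count is computed inside the loan, so only a Long escapes.
def countLinesLoan(path: String): Long =
  withResource(Source.fromFile(path, "UTF-8")) { source =>
    source.getLines().foldLeft(0L)((n, _) => n + 1)
  }

// Demo file (an assumption): four lines, no trailing newline.
val loanSample: Path = Files.write(
  Files.createTempFile("loan-count", ".txt"),
  "w\nx\ny\nz".getBytes(StandardCharsets.UTF_8)
)
val loanCount: Long = countLinesLoan(loanSample.toString)
```

The by-name acquire parameter keeps resource creation inside the helper, so callers cannot open the file and then forget to hand it over.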
Method 4: Files.lines - Java NIO Streaming with Better Defaults
Scala can call Java NIO directly, and for large JVM text files this is often the cleanest answer:
import java.nio.file.{Files, Paths}
import scala.util.Using

val count = Using.resource(Files.lines(Paths.get("data.txt"))) { lines =>
  lines.count()
}
This is a strong default for scala count lines large file.
The Java Files.lines docs are explicit:
- unlike readAllLines, it does not read all lines into a List
- the stream is populated lazily as it is consumed
- the returned stream contains a reference to an open file
- you must close the stream promptly
That is why Using.resource is still important here:
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
import scala.util.Using

val count = Using.resource(
  Files.lines(Paths.get("data.txt"), StandardCharsets.UTF_8)
)(_.count())
The overload that takes no charset argument, Files.lines(path), decodes the file as UTF-8.
Why Files.lines is appealing in Scala
It solves several practical problems at once:
- it already returns a Long from count()
- its close requirement is well documented in the Java API
- it is a natural fit in mixed Scala and Java codebases
- it avoids the "did we close Source?" discussion entirely
For teams already living on the JVM, the Java Files.lines guide covers the same API from the Java side.
Counting non-empty lines with NIO
import java.nio.file.{Files, Paths}
import scala.util.Using

val nonEmpty = Using.resource(Files.lines(Paths.get("data.txt"))) { lines =>
  lines.filter(line => !line.isEmpty).count()
}
That is still scala count lines in file, just with a filter in the terminal pipeline.
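If you would rather write the filter as an ordinary Scala function, one option on Scala 2.13+ is to bridge the Java stream's iterator with scala.jdk.CollectionConverters and fold inside the resource block. This is a sketch: countMatching and the sample file are assumptions for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._
import scala.util.Using

// Counts lines matching a Scala predicate. The Java stream is consumed
// entirely inside the Using.resource block, so nothing lazy escapes.
def countMatching(path: Path)(keep: String => Boolean): Long =
  Using.resource(Files.lines(path, StandardCharsets.UTF_8)) { lines =>
    lines.iterator().asScala.foldLeft(0L)((n, line) => if (keep(line)) n + 1 else n)
  }

// Demo file (an assumption): a comment line, two data lines, one blank line.
val nioSample: Path = Files.write(
  Files.createTempFile("nio-filter", ".txt"),
  "# header\nalpha\n\nbeta\n".getBytes(StandardCharsets.UTF_8)
)

val dataLines: Long =
  countMatching(nioSample)(line => line.nonEmpty && !line.startsWith("#"))
```

The fold accumulates into a Long, which avoids the Int result that Iterator.count or Iterator.size would give on very large files.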
Part 5: Apache Spark - Counting Lines on HDFS and S3
If your input already lives in HDFS, S3, or another cluster-visible filesystem, the best scala spark count lines answer is usually to let Spark read it as text:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LineCounter")
  .getOrCreate()

val count = spark.read.textFile("s3a://bucket/data/large_file.txt").count()
Spark's DataFrameReader.textFile docs say:
- it loads text files and returns a Dataset[String]
- by default, each line in the text files is a new row
That means .count() is exactly a distributed line count.
RDD version
If you prefer the RDD API:
val count = spark.sparkContext
  .textFile("hdfs://namenode/data/large_file.txt")
  .count()
The Spark RDD programming guide says SparkContext.textFile reads text as a collection of lines, and it supports local paths, HDFS, S3A, compressed files, directories, and wildcards.
Important cluster caveat
For local filesystem paths, Spark's guide also notes that the file must be accessible at the same path on worker nodes.
So this is fine:
- hdfs://...
- s3a://...
- shared cluster-visible paths
This is risky on a real cluster:
- /tmp/data.txt on the driver only
That is a practical scala spark count lines distinction that many small examples omit.
Do not loop on the driver with Source
This is the wrong pattern for a large distributed job:
import scala.io.Source

val paths: Seq[String] = ??? // thousands of files
val counts = paths.map { path =>
  Source.fromFile(path).getLines().size
}
That is how you turn a Spark-adjacent workflow into scala too many open files on the driver.
This is better:
val count = spark.read.textFile("s3a://bucket/logs/*.log").count()
Or if you need filtering:
val dataLineCount = spark.read
  .textFile("hdfs://namenode/logs/app.log")
  .filter(line => !line.startsWith("#"))
  .count()
Let the cluster own the file reading whenever the files already live there.
Part 6: Byte Scanning - Maximum Throughput
If you only need the number and you are willing to count physical newline bytes, a buffered byte scan is usually the fastest pure-JVM answer:
import java.io.{BufferedInputStream, FileInputStream}
import scala.util.Using

def countLinesFast(path: String): Long = {
  val bufferSize = 1024 * 1024
  Using.resource(new BufferedInputStream(new FileInputStream(path), bufferSize)) { stream =>
    val buffer = new Array[Byte](bufferSize)
    var count = 0L
    var sawData = false
    var lastByte = '\n'.toByte
    var bytesRead = stream.read(buffer)
    while (bytesRead != -1) {
      if (bytesRead > 0) {
        sawData = true
        var i = 0
        while (i < bytesRead) {
          if (buffer(i) == '\n'.toByte) {
            count += 1
          }
          lastByte = buffer(i)
          i += 1
        }
      }
      bytesRead = stream.read(buffer)
    }
    // A non-empty file without a trailing LF still has one final line.
    if (sawData && lastByte != '\n'.toByte) {
      count += 1
    }
    count
  }
}
This is a strong answer when:
- the file is huge
- you only need the count
- line decoding and per-line String allocation are unnecessary
What byte scanning does and does not mean
This is not a decoded text-line API.
It counts physical LF bytes and treats a missing final LF as one more line. That matches Unix-style and CRLF text files well enough for raw counting.
It is less semantic than Source.getLines() or Files.lines():
- text APIs understand \r\n, \r, and \n as line separators
- byte scanning is just scanning bytes, so a file that uses bare \r separators is counted as a single line
So use byte scanning for speed, not for rich text semantics.
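The difference is easy to see on in-memory data. The sketch below applies the same LF rule as the byte scanner above to a byte array and compares it with java.io.BufferedReader.readLine, which is documented to accept \n, \r, and \r\n as line terminators. The function names are assumptions for the demo.

```scala
import java.io.{BufferedReader, StringReader}
import java.nio.charset.StandardCharsets

// Byte-level count: LF bytes, plus one if non-empty data does not end in LF.
def countLfBytes(bytes: Array[Byte]): Long = {
  val lf = '\n'.toByte
  val newlines = bytes.count(_ == lf).toLong
  if (bytes.nonEmpty && bytes.last != lf) newlines + 1 else newlines
}

// Text-level count: readLine treats \n, \r, and \r\n as terminators.
def countTextLines(text: String): Long = {
  val reader = new BufferedReader(new StringReader(text))
  try Iterator.continually(reader.readLine()).takeWhile(_ != null).size.toLong
  finally reader.close()
}

// Returns (byte-scan count, text-API count) for the same content.
def compare(text: String): (Long, Long) =
  (countLfBytes(text.getBytes(StandardCharsets.UTF_8)), countTextLines(text))

val unix = compare("a\nb\n")       // LF endings: both views agree
val windows = compare("a\r\nb\r\n") // CRLF endings: both views agree
val bareCr = compare("a\rb\rc")     // CR-only endings: the views disagree
```

For LF and CRLF input the two counts match; for CR-only input the byte scan sees one line where the text reader sees three.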
Benchmark: Representative Comparison
These numbers are representative estimates, not measurements reproduced locally: the workspace used for this guide had neither scala nor spark-shell installed. The trade-off shape below is based on API behavior and the usual JVM profile for Scala 3.x on Linux with SSD storage, so read the figures as relative rather than absolute.
| Method | Time | Peak memory | Handle safety | Notes |
|---|---|---|---|---|
| Source.fromFile(...).getLines().size without close | about 3.0s | about 8MB | no | counting is streaming, but descriptor lifetime is unsafe |
| Source plus try / finally | about 3.0s | about 8MB | yes | classic safe baseline |
| Using.resource(Source.fromFile(...)) | about 3.0s | about 8MB | yes | best Scala-only local default |
| Using.resource(Files.lines(...)) | about 1.8s | about 8MB | yes | modern JVM streaming |
| buffered byte scan | about 0.6s | about 1MB | yes | fastest raw physical-line count |
| spark.read.textFile(...).count() | distributed | distributed | yes | best for cluster-visible text inputs |
The important correction here is that plain Source.getLines().size is not a 1GB read-all trap. It is a resource-leak trap if you do not close the Source.
So the practical conclusion is:
- Scala 2.13+ local file: scala util using file
- large JVM file: Files.lines
- distributed storage: scala spark count lines
- raw speed: byte scan
- never forget the close boundary around Source
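If you want real numbers for your own hardware, a rough timing harness is enough to see the relative shape. The sketch below is an assumption-laden starting point, not a rigorous benchmark: the helper names are hypothetical, the generated file is small, and a single un-warmed run on the JVM is noisy. Scale the line count up and repeat the runs before drawing conclusions.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source
import scala.util.Using

// Hypothetical harness: runs a block once and returns (result, elapsed milliseconds).
def timed[A](block: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = block
  (result, (System.nanoTime() - start) / 1000000L)
}

def viaSource(path: Path): Long =
  Using.resource(Source.fromFile(path.toFile, "UTF-8")) { source =>
    source.getLines().foldLeft(0L)((n, _) => n + 1)
  }

def viaNio(path: Path): Long =
  Using.resource(Files.lines(path, StandardCharsets.UTF_8))(_.count())

// Generates a temporary file with the requested number of lines.
def makeFile(lineCount: Int): Path = {
  val path = Files.createTempFile("bench", ".txt")
  Using.resource(Files.newBufferedWriter(path, StandardCharsets.UTF_8)) { writer =>
    (1 to lineCount).foreach(i => writer.write(s"line $i\n"))
  }
  path
}

val benchFile: Path = makeFile(100000)
val sourceRun: (Long, Long) = timed(viaSource(benchFile))
val nioRun: (Long, Long) = timed(viaNio(benchFile))
val report: String =
  s"Source: ${sourceRun._1} lines in ${sourceRun._2}ms; Files.lines: ${nioRun._1} lines in ${nioRun._2}ms"
```

Print report to see the timings. Both methods must return the same count; if they do not, the comparison is measuring a bug rather than throughput.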
Part 7: A Production-Ready Scala Line Counter
The helper below keeps three concerns separate:
- resource safety
- strategy selection
- explicit Try[Long] results
import java.io.{BufferedInputStream, FileInputStream}
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Paths}
import scala.io.Source
import scala.util.{Failure, Try, Using}

object LineCounter {
  private val SmallFileThreshold = 50L * 1024 * 1024
  private val BufferSize = 1024 * 1024

  def count(
    path: String,
    charset: Charset = StandardCharsets.UTF_8,
    skipEmpty: Boolean = false
  ): Try[Long] = {
    val nioPath = Paths.get(path)
    if (!Files.isRegularFile(nioPath)) {
      Failure(new IllegalArgumentException(s"File not found or not a regular file: $path"))
    } else {
      // Files.size can throw, so it is evaluated inside the Try as well.
      Try(Files.size(nioPath)).flatMap { size =>
        if (size < SmallFileThreshold) {
          Using(Source.fromFile(path, charset.name())) { source =>
            source.getLines().foldLeft(0L) { (n, line) =>
              if (skipEmpty && line.isEmpty) n else n + 1
            }
          }
        } else {
          Using(Files.lines(nioPath, charset)) { lines =>
            if (skipEmpty) lines.filter(line => !line.isEmpty).count()
            else lines.count()
          }
        }
      }
    }
  }

  // Counts LF bytes; a non-empty file without a trailing LF gets one extra line.
  def countFast(path: String): Try[Long] =
    Using(new BufferedInputStream(new FileInputStream(path), BufferSize)) { stream =>
      val buffer = new Array[Byte](BufferSize)
      var count = 0L
      var sawData = false
      var lastByte = '\n'.toByte
      var bytesRead = stream.read(buffer)
      while (bytesRead != -1) {
        if (bytesRead > 0) {
          sawData = true
          var i = 0
          while (i < bytesRead) {
            if (buffer(i) == '\n'.toByte) {
              count += 1
            }
            lastByte = buffer(i)
            i += 1
          }
        }
        bytesRead = stream.read(buffer)
      }
      if (sawData && lastByte != '\n'.toByte) {
        count += 1
      }
      count
    }

  def countBatch(paths: Seq[String]): Map[String, Try[Long]] =
    paths.map(path => path -> count(path)).toMap
}
Examples:
LineCounter.count("data.csv").foreach(n => println(s"Lines: $n"))
LineCounter.count("data.csv", skipEmpty = true).getOrElse(0L)
LineCounter.countFast("huge.log").get
LineCounter.countBatch(Seq("a.txt", "b.txt", "c.txt"))
This is the sort of helper that prevents scala too many open files from showing up months later because somebody copied a one-liner from an old answer.
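Because the batch helper returns a Map[String, Try[Long]], callers still have to decide what to do with failures. One possible way to fold such a result into a total plus a failure report is sketched below. BatchSummary, summarize, and the sample data are assumptions for illustration; the sample map is a stand-in shaped like the batch output, not real results.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical summary type for a batch of per-file results.
final case class BatchSummary(totalLines: Long, failures: Map[String, String])

// Adds up the successful counts and keeps one message per failed path.
def summarize(results: Map[String, Try[Long]]): BatchSummary = {
  val total = results.values.collect { case Success(n) => n }.sum
  val failed = results.collect {
    case (path, Failure(e)) => path -> String.valueOf(e.getMessage)
  }
  BatchSummary(total, failed)
}

// Stand-in data shaped like the output of a countBatch-style helper.
val sampleResults: Map[String, Try[Long]] = Map(
  "a.txt" -> Success(10L),
  "b.txt" -> Success(5L),
  "missing.txt" -> Failure(new IllegalArgumentException("File not found: missing.txt"))
)

val summary: BatchSummary = summarize(sampleResults)
```

Keeping the failures as data, rather than calling .get on each entry, lets one bad path be reported without aborting the rest of the batch.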
Quick FAQ
How do I count lines in a file in Scala?
Use Using.resource(Source.fromFile(path))(_.getLines().foldLeft(0L)((n, _) => n + 1)) for a Scala-native local file answer, or Using.resource(Files.lines(Paths.get(path)))(_.count()) if Java NIO is acceptable.
Why does Scala throw Too many open files?
The usual cause is repeated Source.fromFile or other file-backed resource creation without prompt closing. In other words, it is often a resource-lifetime bug, not a counting algorithm bug.
How do I close Source in Scala?
Use try / finally, Using, or Using.resource.
What is scala.util.Using?
It is Scala 2.13+'s standard-library utility for automatic resource management. Using returns Try[A]; Using.resource returns A and throws on failure.
How do I count lines in a large file in Scala?
For scala count lines large file, prefer Files.lines for normal JVM text files or a byte-scanning loop for maximum throughput.
How do I count lines in Scala with Spark?
Use spark.read.textFile(path).count() or spark.sparkContext.textFile(path).count() when the files live on HDFS, S3A, or another cluster-visible filesystem.
Is getLines lazy in Scala?
Yes. scala getlines lazy is real: Source.getLines() returns an Iterator[String], and the file is consumed as the iterator is traversed.
How do I count lines in Scala without loading the file?
Source.getLines() plus an immediate count, Files.lines(...).count(), Spark text readers, and byte scanning all avoid loading the entire file into one in-memory collection.
Sources Checked
- Scala Source API, including Source extends ... Closeable: https://www.scala-lang.org/api/current/scala/io/Source.html
- Scala Source.getLines() API, which returns Iterator[String]: https://www.scala-lang.org/api/current/scala/io/Source.html#getLines():Iterator[String]
- Scala Using API for automatic resource management: https://www.scala-lang.org/api/2.13.0/scala/util/Using%24.html
- Scala current Using API docs for Try and Using.resource: https://www.scala-lang.org/api/current/scala/util/Using%24.html
- Java Files.lines documentation: https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/nio/file/Files.html
- Java Stream documentation on I/O-backed streams needing closing: https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html
- Spark DataFrameReader.textFile ScalaDoc: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader
- Spark RDD programming guide for SparkContext.textFile: https://spark.apache.org/docs/3.5.5/rdd-programming-guide.html
- Stack Overflow line-count answer whose top snippet omits closing the Source: https://stackoverflow.com/questions/8865551/count-number-of-lines-in-file-scala
- Stack Overflow discussion showing getLines() plus escaped iterator leading to Stream Closed: https://stackoverflow.com/questions/40460338/read-file-in-scala-stream-closed
- Stack Overflow discussion showing repeated file iteration causing Too many open files: https://stackoverflow.com/questions/10338408/iterating-over-the-lines-of-a-file
Related Guides and Tools
- Java NIO Files.lines guide
- Kotlin useLines guide
- Haskell lazy I/O guide
- Python line counting
- Go line counting
- Line Counter tool
Building a Spark data pipeline?
Check the line count before you submit the job. Paste the file into the Line Counter. No Source, no file handle leaks, no Too many open files.
Frequently Asked Questions
How do I count lines in a file in Scala?
For Scala 2.13+, the safest simple answer is Using.resource(Source.fromFile(path))(_.getLines().foldLeft(0L)((n, _) => n + 1)). For large files or Java-heavy codebases, Using.resource(Files.lines(Paths.get(path)))(_.count()) is a strong default, because the stream returned by Files.lines must be closed too.
Why does Scala throw Too many open files?
The usual cause is opening Source or other file-backed resources repeatedly without closing them promptly. In long-running services or Spark drivers, those open descriptors accumulate until the process hits the operating-system limit.
How do I close Source in Scala?
Use try/finally, scala.util.Using, or Using.resource so the close call always runs even if counting throws.
What is scala.util.Using?
It is the standard-library resource-management utility in Scala 2.13+ that wraps acquisition, use, and release, with Using returning a Try and Using.resource throwing on failure.
How do I count lines in a large file in Scala?
Use Files.lines for normal text files, Spark text readers for distributed storage, or a buffered byte scan when you only need physical newline counts.
How do I count lines in Scala with Spark?
Use spark.read.textFile(path).count() or SparkContext.textFile(path).count() so the work runs across the cluster instead of opening files one by one on the driver.
Is getLines lazy in Scala?
Yes. Source.getLines returns an Iterator[String], and the lines are produced as the iterator is consumed.
How do I count lines in Scala without loading the file?
Use Source.getLines with immediate consumption, Files.lines with a terminal count, Spark text readers, or a byte-scanning loop. None of those need the whole file resident as one collection.